Back References

Back references are the mechanism you use to access captured subgroups while the regex engine is executing. When I say "while the regex engine is executing," you can think of this as the regex engine's runtime. Thus, a later part of the pattern can refer to whatever a subgroup in an earlier part of the pattern has already matched.

For example, in Chapter 1, I discussed the pattern \b(\w+) \1\b to match repeated words. Here, when you use the \1, you're asking the regex engine to refer back to itself and insert whatever had matched the (\w+) part of the pattern. Why the (\w+) part? Because that is the capturing group with the index of 1. Remember that capturing groups are numbered by counting their opening parentheses from left to right, starting with the index of 1.
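As a minimal sketch of the repeated-word pattern in action, the following example collects every word that appears twice in a row. The class name RepeatedWords, the helper findRepeats, and the sample sentence are assumptions made for illustration; they are not taken from Chapter 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the names here are illustrative only
public class RepeatedWords {
    // \1 refers back to whatever group 1, the (\w+) part, matched
    private static final Pattern REPEATED =
        Pattern.compile("\\b(\\w+) \\1\\b");

    // Collects each word that appears twice in a row in the text
    static List<String> findRepeats(String text) {
        List<String> repeats = new ArrayList<>();
        Matcher matcher = REPEATED.matcher(text);
        while (matcher.find()) {
            // group(1) is the word that the back reference repeated
            repeats.add(matcher.group(1));
        }
        return repeats;
    }

    public static void main(String[] args) {
        //prints [is, the]
        System.out.println(
            findRepeats("This is is a test of the the pattern"));
    }
}
```

Notice that the "is" inside "This" is not reported: the leading \b requires the captured word to start at a word boundary.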

For a given pattern with subgroups, Java offers three mechanisms for referring to the corresponding group matches. The first, and most object-oriented, mechanism is to use the various Matcher object methods. These include the Matcher.group, Matcher.start, Matcher.end, and Matcher.replaceAll methods, as discussed in Chapter 2. However, this mechanism doesn't allow the regex pattern to refer back to itself during the regex runtime.
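To make the first mechanism concrete, here is a small sketch that walks the groups of a match by using Matcher.group, Matcher.start, and Matcher.end. The class name GroupAccess, the helper describeGroups, and the sample pattern and candidate string are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the names here are illustrative only
public class GroupAccess {
    // Describes every group of the first match as index:text@start-end
    static String describeGroups(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        List<String> parts = new ArrayList<>();
        if (matcher.find()) {
            // group 0 is the entire match; groups 1..groupCount()
            // are the capturing groups
            for (int i = 0; i <= matcher.groupCount(); i++) {
                parts.add(i + ":" + matcher.group(i)
                    + "@" + matcher.start(i) + "-" + matcher.end(i));
            }
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        //prints 0:W3C@0-3, 1:W@0-1, 2:3@1-2, 3:C@2-3
        System.out.println(describeGroups("(\\w)(\\d)(\\w)", "W3C"));
    }
}
```

All of this happens after the match has been found, which is why these methods can't help the pattern refer back to itself.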

The second approach uses the \n nomenclature, where \n refers to the nth capturing group, if it exists. This deserves a bit of explanation. Specifically, what does "if it exists" mean? For an answer, consider the simple regex pattern (\w)(\d)(\w) as applied to W3C. You could refer to group 1 by using \1, group 2 by using \2, and group 3 by using \3. Group 0, the entire match, has no such form: in a Java pattern, \0 introduces an octal escape rather than a back reference. Likewise, \5 would be meaningless because there's no group 5 in the pattern, as would a reference to \33.

Well, yes and no regarding group 33. Although the regex engine knows that there is no group 5, the latter example, group 33, is open to interpretation. That is, the regex engine could, and does, decide that you meant group 3, followed by character 3. Thus, examining the back reference \33 from the preceding example would yield C3: C followed by the character 3. As shown previously, this mechanism does allow the regex pattern to refer back to itself during runtime.
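The following sketch illustrates that interpretation by appending \33 to the three-group pattern. The class name AmbiguousReference, the helper matchesWhole, and the candidate strings are assumptions made for illustration. Because the pattern has only three groups, \33 is read as group 3 followed by the literal character 3.

```java
import java.util.regex.Pattern;

// Hypothetical example class; the names here are illustrative only
public class AmbiguousReference {
    // Three capturing groups, then \33. With no group 33 available,
    // this is read as a back reference to group 3 followed by a literal 3
    private static final Pattern PATTERN =
        Pattern.compile("(\\w)(\\d)(\\w)\\33");

    // Returns true if the entire candidate matches the pattern
    static boolean matchesWhole(String candidate) {
        return PATTERN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        //W3C, then a repeat of group 3 (C), then the character 3
        //prints true
        System.out.println(matchesWhole("W3CC3"));
        //the trailing literal 3 is missing
        //prints false
        System.out.println(matchesWhole("W3CC"));
    }
}
```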

Finally, there is a third way to refer to captured groups. Three replacement methods on the Matcher object, appendReplacement, replaceAll, and replaceFirst, as well as the String methods replaceFirst and replaceAll, also allow access to the captured groups by using the $n nomenclature, in which n represents the index of the group in question. Like the \n nomenclature discussed earlier in this section, use of $n will prompt the regex engine to take the most liberal interpretation of the reference possible: it reads the longest run of digits that still names an existing group and treats any remaining digits as literal characters.

Thus, for the pattern (\w)(\d)(\w), using $33 will prompt the regex engine to assume you meant group 3 followed by the character 3. This is demonstrated in Listing 3-4.

Listing 3-4: Working with Back References

import java.util.regex.*;

public class ReplaceExample{
    public static void main(String[] args){
        //define the pattern: three capturing groups
        String regex = "(\\w)(\\d)(\\w)";

        //compile the pattern
        Pattern pattern = Pattern.compile(regex);

        //define the candidate string
        String candidate = "W3C";

        //extract a matcher for the candidate string
        Matcher matcher = pattern.matcher(candidate);

        //return a new string that has replaced
        //every matching part of the candidate string
        //with whatever was found in the third group,
        //followed by the digit three
        String tmp = matcher.replaceAll("$33");
        //prints REPLACEMENT: C3
        System.out.println("REPLACEMENT: " + tmp);
        //notice that the original candidate string
        //is unchanged, as expected. After all, Strings
        //are immutable objects in Java.
        //prints ORIGINAL: W3C
        System.out.println("ORIGINAL: " + candidate);
    }
}
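
The same $n nomenclature works in the other replacement methods mentioned earlier. The sketch below shows String.replaceAll and Matcher.appendReplacement; the class name DollarReferences, the helper names, and the candidate strings are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the names here are illustrative only
public class DollarReferences {
    private static final String REGEX = "(\\w)(\\d)(\\w)";

    // Uses String.replaceAll to emit the three groups in reverse order
    static String reverseGroups(String candidate) {
        return candidate.replaceAll(REGEX, "$3$2$1");
    }

    // Uses appendReplacement to wrap group 2 of each match in brackets
    static String bracketDigits(String candidate) {
        Matcher matcher = Pattern.compile(REGEX).matcher(candidate);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(result, "$1[$2]$3");
        }
        //copy whatever follows the last match
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        //prints C3W
        System.out.println(reverseGroups("W3C"));
        //prints W[3]C and X[9]Y
        System.out.println(bracketDigits("W3C and X9Y"));
    }
}
```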

It's important to be careful when you're working with back references. You could be asking the regex engine to do things you had no idea you were asking for, and that, in turn, could cost you in terms of efficiency and/or correctness.

One final word of warning: Referring to a group that doesn't exist, whether through a Matcher method such as group(n) or through $n in a replacement string, will cause an IndexOutOfBoundsException to be thrown. Make sure your groups exist before you refer to them.
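
As a rough sketch of that warning, the following example shows a replacement string that refers to a missing group, along with one way to guard against the problem by checking Matcher.groupCount first. The class name MissingGroup and both helper methods are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the names here are illustrative only
public class MissingGroup {
    // Returns true if the replacement refers to a group that doesn't exist
    static boolean replacementThrows(String regex, String candidate,
                                     String replacement) {
        try {
            Pattern.compile(regex).matcher(candidate).replaceAll(replacement);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    // Returns group n of the first match, or null if the pattern has
    // no group n or the candidate doesn't match at all
    static String safeGroup(String regex, String candidate, int n) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (n < 0 || n > matcher.groupCount() || !matcher.find()) {
            return null;
        }
        return matcher.group(n);
    }

    public static void main(String[] args) {
        String regex = "(\\w)(\\d)(\\w)";
        //there is no group 5, so $5 fails
        //prints true
        System.out.println(replacementThrows(regex, "W3C", "$5"));
        //prints C
        System.out.println(safeGroup(regex, "W3C", 3));
        //prints null
        System.out.println(safeGroup(regex, "W3C", 5));
    }
}
```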