Team LiB
Previous Section Next Section

Greedy Qualifiers

You might have noticed that the third subgroup of the pattern(\w)(\d\d)(\w+), namely(\w+), actually provided the regex engine with some discretion. Given the candidate X99SuperJava, the subexpression (w+) only had to match one or more word characters to meet its obligation, yet it chose to match them all. That is, it chose to match the candidate SuperJava when it could have just matched S, Su, Sup, and so on. After all, these also meet the requirement of being one or more word characters. Why did the engine decide to match as much as possible?

It did so because it is, in a word, greedy. The nature of the regex engine is to match as much as it possibly can, so long as that match doesn't interfere with another matching subexpression somewhere else in the pattern. Thus, it matches the entire candidate SuperJava. Table 3-1 presents the Java regex greedy qualifiers for your reference.

Table 3-1: Greedy Qualifiers

Regex

Description

?

The preceding is repeated once or not at all.

*

The preceding is repeated zero or more times.

+

The preceding is repeated one or more times.

{n}

The preceding is repeated exactly n times.

{n,}

The preceding is repeated at least n times.

{n,m}

The preceding is repeated at least n times, but no more than m times. This includes m repetitions.

Greedy matching has an interesting behavioral pattern that I think of as greedy-generous. The leftmost part of the pattern, the first (\w), attempts to match as much as possible. When it can't match anymore, the next part of the pattern is evaluated. This is the greedy part.

If that second part of the pattern fails to find any matches, and then the first matching pattern starts to slowly release characters that it has already collected, thus providing the second part of the pattern with more opportunities to match. This is the generous part of the behavior.

At this point, one of two things can happen. The first alternative is that the latter group can finally achieve a match, in which case the first group stops releasing characters. The second alternative is that the latter group can fail to match, even with the characters made available to it from the earlier group. If this happens, then the group that was releasing characters, in effect, collects those released characters again, and the regex engine goes on. If a later subexpression again fails to match, then the process is repeated.

So, what does all of this mean to you? Well, quite a bit, really. Consider the regex pattern (\w+)(\d\d)(\w+), which is almost identical to the pattern (\w)(\d\d)(\w+) except that first pattern uses + after the first \w, thus forming the group (\w+). That little + can make a huge difference in terms of efficiency, and one that you might not notice in casual use, because the result of applying the new pattern against the candidate X99SuperJava doesn't change.

Listing 3-5 examines what actually happens when the pattern is evaluated.

Listing 3-5: Greedy Qualifier Example
Start example
import java.util.regex.*;

public class GreedyExample{
    public static void main(String args[]){
        //define the pattern
        String regex = "(\\w+)(\\d\\d)(\\w+)";

        //compile the pattern
        Pattern pattern = Pattern.compile(regex);

        //define the candidate string
        String candidate = "X99SuperJava";

        //extract a matcher for the candidate string
        Matcher matcher = pattern.matcher(candidate);

        matcher.find();

        //extract the matching groups
        System.out.println(matcher.group(1));//returns 'X'
        System.out.println(matcher.group(2));//returns '99'
        System.out.println(matcher.group(3));//returns SuperJava
    }
}

End example

When group(1) runs, (\w+) examines every character in the candidate String X99SuperJava. That is, X is explicitly considered, passes inspection, and is put into the matching bag for this group. Because this pattern is greedy and has + after \w, it continues. Next, 9 is explicitly considered and passes inspection. Then, the next 9 is considered. This continues until the entire String X99SuperJava is consumed.

After group(1) is satisfied, group(2), namely (\d\d), gets an opportunity. Because group(2) is unable to match anything at all, group(1) releases the a character at the end of X99SuperJava. The a character is considered by group(2), found not be a digit, and considered not to be sufficient. Thus, group(1) releases the v character. group(2) inspects it, finds it lacking, and rejects it. Thus, group(1) releases the a character immediately before the v character in X99SuperJava. This continues until group(1) has released every character except X. Finally, the release of the two 9 characters allow group(2) to match. Now group(1) has X and group(2) has 99.

Finally, group(3) gets an opportunity to run. It starts examining the candidate String X99SuperJava at the point immediately following the second 9 character. And because it's greedy, it matches the entire String SuperJava.

Thus, the two patterns (\w)(\d\d)(\w+) and (\w+)(\d\d)(\w+) produce exactly the same result when applied to the String X99SuperJava but at vastly different efficiency costs. Although this may be insignificant when you're dealing with a small String, it's very significant when you're parsing a directory full of files; it could mean the difference between your application working and it running out of memory.


Team LiB
Previous Section Next Section