Noncapturing Subgroups

There may be times when you need to define a group, but you don't want that group to be captured—you simply want to treat it like a single logical entity. The major advantage of using these noncapturing groups is that they're less memory intensive because they don't require the regex engine to keep track of the matching parts.

Consider the pattern (\w)(\d\d)(\w+). If you don't need access to the text matched by the trailing (\w+), you can optimize a bit by making that group noncapturing.

To mark a group as noncapturing, you simply follow the opening parenthesis of that group with the characters ?:. That is, you can write the expression as (\w)(\d\d)(?:\w+). Notice that the only difference between the original expression, (\w)(\d\d)(\w+), and the new expression, (\w)(\d\d)(?:\w+), is the ?: that immediately follows the opening parenthesis of the last group.
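As a rough sketch of the difference, the following compares the two expressions using Matcher.groupCount(). The class name GroupCountSketch, the helper countGroups, and the candidate string are assumptions made up for illustration; they aren't part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountSketch {
    // Returns the number of capturing groups the pattern defines.
    // groupCount() depends only on the pattern, not on the input.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String capturing = "(\\w)(\\d\\d)(\\w+)";
        String nonCapturing = "(\\w)(\\d\\d)(?:\\w+)";
        // hypothetical candidate string, chosen only for this sketch
        String candidate = "x99SuperJava";

        System.out.println("CAPTURING COUNT:" + countGroups(capturing));
        System.out.println("NONCAPTURING COUNT:" + countGroups(nonCapturing));

        // the first two groups are still available in the new expression
        Matcher matcher = Pattern.compile(nonCapturing).matcher(candidate);
        if (matcher.matches()) {
            System.out.println("GROUP 1:" + matcher.group(1));
            System.out.println("GROUP 2:" + matcher.group(2));
        }
    }
}
```

Both expressions match the same candidate strings; only the number of groups the engine records differs (three versus two).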

The most common use of noncapturing groups is for the sake of logical separation. For example, say you need to find out what kind of morning a person is having. You'll accept good morning, bad morning, terrible morning, great morning, and so on. For the sake of clarity, you write the expression as (good|bad|terrible|great) morning. That is, you want to treat the various kinds of mornings as a single logical unit.

However, say you don't need to capture the type of morning, because you're not going to be using it for anything—you just want to know it's there. You modify your expression to (?:good|bad|terrible|great) morning. Specifically, you insert ?: just inside the group definition, after the opening parenthesis of the group. This gives you the ability to treat the various kinds of mornings as a single logical unit, but it doesn't waste memory capturing the description.
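A minimal sketch of the two morning expressions follows. The class name MorningSketch, the helper mentionsMorning, and the candidate sentence are hypothetical and serve only to illustrate the point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MorningSketch {
    static final Pattern CAPTURING =
        Pattern.compile("(good|bad|terrible|great) morning");
    static final Pattern NON_CAPTURING =
        Pattern.compile("(?:good|bad|terrible|great) morning");

    // Reports whether the text mentions one of the accepted mornings,
    // without recording which kind it was.
    static boolean mentionsMorning(String text) {
        return NON_CAPTURING.matcher(text).find();
    }

    public static void main(String[] args) {
        // hypothetical candidate string
        String candidate = "I am having a terrible morning";

        Matcher capturing = CAPTURING.matcher(candidate);
        if (capturing.find()) {
            // the kind of morning is available as group 1
            System.out.println("KIND:" + capturing.group(1));
        }

        Matcher nonCapturing = NON_CAPTURING.matcher(candidate);
        if (nonCapturing.find()) {
            // only the overall match is available; there is no group 1
            System.out.println("MATCH:" + nonCapturing.group());
            System.out.println("GROUP COUNT:" + nonCapturing.groupCount());
        }
    }
}
```

If all you need is a yes-or-no answer, a helper such as mentionsMorning is enough, and the noncapturing form avoids recording a description you never read.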

Note

To make a group noncapturing, insert ?: inside the opening parenthesis of the group.

An added consideration in working with noncapturing groups is that they aren't counted, as far as group indexing is concerned. This makes perfect sense, as you are, in effect, telling the regex engine that you aren't interested in these groups. Thus, why should the regex engine track them or provide a mechanism that allows you to refer to them?

So for the pattern (?:\w)(\d), group(0) is the text matched by the entire pattern, \w\d, and group(1) is the text matched by (\d). Notice that (?:\w) is not group(1), as it normally would be, because (?:\w) is a noncapturing group; it begins with ?:. Listing 3-3 demonstrates the use of a simple noncapturing subgroup.
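Before turning to the full listing, here is a brief sketch of that indexing for the pattern (?:\w)(\d). The class name GroupIndexSketch, the helper groupOne, and the candidate string a1 are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexSketch {
    static final Pattern PATTERN = Pattern.compile("(?:\\w)(\\d)");

    // Returns the text captured by group 1, or null if nothing matches.
    static String groupOne(String candidate) {
        Matcher matcher = PATTERN.matcher(candidate);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // hypothetical candidate string
        Matcher matcher = PATTERN.matcher("a1");
        if (matcher.find()) {
            // group(0) is the whole match; group(1) is the digit,
            // because (?:\w) is skipped when groups are numbered
            System.out.println("GROUP 0:" + matcher.group(0));
            System.out.println("GROUP 1:" + matcher.group(1));
            System.out.println("GROUP COUNT:" + matcher.groupCount());
        }
    }
}
```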

Listing 3-3: Working with Noncapturing Subgroups

import java.util.regex.*;

public class NonCapturingGroupExample{
    public static void main(String args[]){
        //define the pattern
        String regex = "hello|hi|greetings|(?:good morning)";

        //define the candidate strings
        String candidate1 = "Tommy say hi to you";
        String candidate2 = "Tommy say good morning to you";
        //compile the pattern
        Pattern pattern = Pattern.compile(regex);

        //create a matcher for the first candidate string
        Matcher matcher = pattern.matcher(candidate1);
        //show the number of groups
        System.out.println("GROUP COUNT:" + matcher.groupCount());

        if (matcher.find()) {
            System.out.println("GOT 1:" + candidate1);
        }

        //reuse the matcher, and check the second candidate string
        matcher.reset(candidate2);

        //show the number of groups
        System.out.println("GROUP COUNT:" + matcher.groupCount());

        if (matcher.find()) {
            System.out.println("GOT 2:" + candidate2);
        }
    }
}

The output of this example is shown in Output 3-1.

Output 3-1: Output of NonCapturingGroupExample

GROUP COUNT:0
GOT 1:Tommy say hi to you
GROUP COUNT:0
GOT 2:Tommy say good morning to you

If you had used a capturing group, then the group count would have been 1. Although this may seem like a fairly innocuous issue, keeping track of group indexes becomes considerably more complex as the number of capturing groups grows.
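For comparison, here is a sketch of what the capturing variant of the pattern in Listing 3-3 might look like. The class name CapturingVariantSketch and the helper capturedGreeting are hypothetical; the pattern differs from the listing only in dropping the ?:.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturingVariantSketch {
    // same alternatives as Listing 3-3, but the last group now captures
    static final Pattern PATTERN =
        Pattern.compile("hello|hi|greetings|(good morning)");

    // Returns what group 1 captured for the first match, which is null
    // when a different alternative matched or when nothing matched at all.
    static String capturedGreeting(String candidate) {
        Matcher matcher = PATTERN.matcher(candidate);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String candidate1 = "Tommy say hi to you";
        String candidate2 = "Tommy say good morning to you";

        System.out.println("GROUP COUNT:"
            + PATTERN.matcher(candidate1).groupCount());
        // "hi" matches, so group 1 did not participate and is null
        System.out.println("GROUP 1:" + capturedGreeting(candidate1));
        // "good morning" matches, so group 1 holds that text
        System.out.println("GROUP 1:" + capturedGreeting(candidate2));
    }
}
```

Note that a capturing group that takes no part in a match yields null rather than an empty string, which is one more detail to track as groups accumulate.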