Understanding Lookarounds

There are times in programming, as in life, when you'd like to know what to expect before making a more serious effort. For example, you might want to know that your favorite restaurant is open before you go there to eat. How would you accomplish that? You would, of course, phone ahead. The same idea is used in regex lookarounds.

There are four flavors of lookarounds: positive lookaheads, negative lookaheads, positive lookbehinds, and negative lookbehinds. The following sections explain each in detail.

Note

Lookarounds are noncapturing, and they never consume text. Thus, verifying that a certain character exists further down the candidate string doesn't mean that the character in question has been used up by the regex pattern. Lookarounds don't match characters; they match positions.

Positive Lookaheads

Positive lookaheads allow your regex to "peek ahead" and make sure that a pattern does, in fact, match the text immediately ahead of the current position before the rest of the match is attempted. They don't consume that text, however; they just confirm that it's there. They are, basically, a way to tell the regex engine "Don't bother matching from this position if the lookahead doesn't hold." You form them by opening the group with the characters (?=. For example, the lookahead

(?=\d\d)

confirms that the next two characters in the candidate string are digits. However, it doesn't consume those two digits. Combined with other regex patterns, positive lookaheads can be a very powerful weapon in your regex arsenal.
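To see that a lookahead matches a position rather than characters, consider the following minimal sketch. The class name LookaheadPositionDemo, the helper describeFirstMatch, and the candidate string ab12 are all assumptions made for illustration; they don't come from the listings in this chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadPositionDemo {

    // Hypothetical helper: describes the first match as "start-end:[text]",
    // or returns "no match" if the pattern isn't found.
    public static String describeFirstMatch(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (!matcher.find()) {
            return "no match";
        }
        return matcher.start() + "-" + matcher.end()
            + ":[" + matcher.group() + "]";
    }

    public static void main(String[] args) {
        String candidate = "ab12";

        // The lookahead alone succeeds at index 2, just before "12",
        // but the match is empty because nothing is consumed.
        System.out.println(describeFirstMatch("(?=\\d\\d)", candidate));

        // Followed by \d, the pattern consumes only the "1"; the "2"
        // was inspected by the lookahead but never consumed.
        System.out.println(describeFirstMatch("(?=\\d\\d)\\d", candidate));
    }
}
```

The first call reports an empty match that starts and ends at index 2, which is what "matching a position" means in practice.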

Say you want to match IP addresses, but only if they begin with 255. Also, if they do begin with 255, you want to capture the entire address. With lookaheads, this issue is easily solved, as demonstrated in Listing 3-8. Of course, this example assumes a great deal about the friendly nature of the data. Even so, it does nicely illustrate the usage of lookaheads, so all is forgiven. Table 3-2 deconstructs the regex pattern (?=^255).*.

Table 3-2: The Pattern (?=^255).*
Regex	Description
(?=	A positive lookahead consisting of
^	The beginning of line character followed by
2	The character 2 followed by
5	The character 5 followed by
5	The character 5 followed by
)	Close the lookahead group
.	Any character
*	Repeated zero or more times
* In English: Capture the entire IP address if it starts with 255.

Listing 3-8: Simple Positive Lookahead Example

import java.util.regex.*;

public class PositiveLookaheadExample{
    public static void main(String args[]){
        //define the pattern
        String regex = "(?=^255).*";

        //compile the pattern
        Pattern pattern = Pattern.compile(regex);

        //define the candidate string
        String candidate = "255.0.0.1";

        //extract a matcher for the candidate string
        Matcher matcher = pattern.matcher(candidate);

        String ip ="not found";

        //if the candidate starts with 255, then the ip
        //will be populated with the correct information.
        if (matcher.find())
            ip=matcher.group();

        String msg ="ip: " + ip;

        System.out.println(msg);
    }
}

In Listing 3-8, the regex engine first confirms that the candidate string starts with 255 before attempting to execute the rest of the pattern. If the candidate String doesn't do so, then the rest of the pattern can't possibly match and no resources are wasted in attempting to do so.

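Notice that a noncapturing group, (?:^255), isn't a substitute for the lookahead (?=^255). The noncapturing group consumes the characters 255, even though it doesn't capture them, so any part of the pattern that follows it, such as a capturing group, sees only the .0.0.1 that remains. The lookahead consumes nothing, so the rest of the pattern still starts at the first 2.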
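The difference between consuming and merely inspecting can be sketched by placing a capturing group after each construct. In this sketch, the class name LookaheadVersusNoncapturingDemo, the helper firstGroup, and the added capturing group (.*) are assumptions introduced for illustration; Listing 3-8 itself uses no capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadVersusNoncapturingDemo {

    // Hypothetical helper: returns capturing group 1 of the first match,
    // or null if the pattern isn't found.
    public static String firstGroup(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String candidate = "255.0.0.1";

        // The lookahead consumes nothing, so (.*) begins at the first 2
        // and captures the whole address.
        System.out.println(firstGroup("(?=^255)(.*)", candidate));

        // The noncapturing group consumes 255, so (.*) begins after it
        // and captures only what remains.
        System.out.println(firstGroup("(?:^255)(.*)", candidate));
    }
}
```

With the lookahead, group 1 holds 255.0.0.1; with the noncapturing group, it holds .0.0.1.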

Negative Lookaheads

Negative lookaheads, like positive lookaheads, allow your regex to "peek ahead." However, they allow the engine to confirm that a pattern does not match the text immediately ahead of the current position. Like all lookaheads, they don't consume text; they just confirm the truth of its absence. They're formed by opening the group with the characters (?!. For example:

(?!\d\d)

confirms that the current position in the candidate string isn't immediately followed by two digits in a row. Like every lookahead, it consumes nothing.
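As a rough illustration, the following sketch lists the positions at which an x is not immediately followed by two digits. The class name NegativeLookaheadPositionDemo, the helper matchStarts, and the candidate string are hypothetical and chosen only to make the positions easy to count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegativeLookaheadPositionDemo {

    // Hypothetical helper: collects the start index of every match.
    public static List<Integer> matchStarts(String regex, String candidate) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three x characters, at indexes 0, 4, and 8.
        String candidate = "x12 x1a xab";

        // The x at index 0 is followed by "12", so the negative lookahead
        // rejects it; the other two are accepted.
        System.out.println(matchStarts("x(?!\\d\\d)", candidate));
    }
}
```

Only the x at index 0 is rejected, because it is the only one followed by two digits; a single digit, as in x1a, isn't enough to make the lookahead fail.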

Say you're parsing text and you want to find references to John and extract both the first name and the last name, unless that reference happens to be John Smith. With negative lookaheads, this sort of exercise becomes very easy. Listing 3-9 demonstrates the code for doing so. Table 3-3 deconstructs the regex pattern used.

Table 3-3: The Pattern John (?!Smith)[A-Z]\w+
Regex	Description
J	The character J followed by
o	The character o followed by
h	The character h followed by
n	The character n followed by
<space>	A space, followed by
(?!	A position in which you'll find anything but
S	The character S followed by
m	The character m followed by
i	The character i followed by
t	The character t followed by
h	The character h followed by
)	Close the lookahead group, followed by
[A-Z]	Any uppercase character followed by
\w	A word character
+	Repeated one or more times
* In English: Find and capture occurrences of John followed by some capitalized word, unless that word is Smith.

Listing 3-9: Simple Negative Lookahead Example

import java.util.regex.*;
public class NegativeLookaheadExample{
    public static void main(String args[])
    throws Exception
    {
        //define the pattern
        String regex = "John (?!Smith)[A-Z]\\w+";

        //compile the pattern
        Pattern pattern = Pattern.compile(regex);

        String candidate = "I think that John Smith ";
        candidate +="is a fictional character. His real name ";
        candidate +="might be John Jackson, John Westling, ";
        candidate +="or John Holmes for all we know.";

        //extract a matcher for the candidate string
        Matcher matcher = pattern.matcher(candidate);

        String tmp=null;

        //extract the matching group. Notice that it's
        //the default group, since lookarounds are
        //noncapturing
        while (matcher.find()){
            tmp=matcher.group();
            System.out.println("MATCH:" + tmp);
        }
    }
}

In Listing 3-9, the regex engine scans the candidate and considers successful matches to be those that consist of John followed by some capitalized word, unless that capitalized word is Smith. Again, it's important to notice that the lookahead consumes nothing: the surname it inspects is still available to [A-Z]\w+, so the entire name ends up in the match. For the candidate shown, the listing reports John Jackson, John Westling, and John Holmes, and skips John Smith.

Positive Lookbehinds

So far, you've explored the ability to look to the right of the current position in the candidate string to "peek ahead" and see what the future has in store for your pattern. Similarly, there are times when it's useful to be able to look to the left of the current position being considered to see what the past had to say about a particular pattern. That is the purpose of lookbehinds.

Like lookaheads, lookbehinds come in two flavors. Positive lookbehinds confirm the existence of a pattern to the left of the current position, and negative lookbehinds confirm the absence of a pattern to the left of the current position. You form positive lookbehinds by opening the group with (?<=. Thus, to confirm that two digits precede the current position, you might use the following positive lookbehind:

(?<=\d\d).*

This confirms that the current position is preceded by two digits in a row. It doesn't consume those two digits; however, it acts as if it did, because they fall outside the match. This happens because the regex engine has already moved past them. That is, the engine has, by definition, already tried to match at those earlier positions and failed to do so. If it hadn't failed, it would have started the match there instead of moving on.

Consider the candidate 42 is the answer. When the regex engine compares this candidate String against the pattern (?<=\d\d).*, it starts by examining the first character, which is 4. Because two digits don't precede 4, it's rejected. Next, the engine compares the 2 character. Because 2 is also not preceded by two digits, it is discarded. Next, the regex engine examines the space character following 2 in the candidate string 42 is the answer. Because that space character is, in fact, preceded by two digits, namely 4 and 2, the regex engine happily starts to match. Of course, because the remaining part of the pattern is .*, every remaining character is matched. Thus, the space following 42 and everything thereafter is captured. But 4 and 2 aren't captured, because the regex engine already passed them.

Because the regex engine is already past the 4 and the 2 characters, it won't match them. This is an important and subtle distinction. Lookbehinds, like all lookarounds, are noncapturing. However, in this case, they appear to act as if they've already captured the 4 and the 2 characters. That is, the characters 4 and 2 are excluded from the capture set. However, that's because they've already been parsed, not because they've been captured. It's important to be able to see through this illusion.
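The walk-through above can be checked with a short sketch. The class name LookbehindPositionDemo and the helper describeFirstMatch are hypothetical; the pattern and the candidate are the ones discussed in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindPositionDemo {

    // Hypothetical helper: describes the first match as "start:[text]",
    // or returns "no match" if the pattern isn't found.
    public static String describeFirstMatch(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (!matcher.find()) {
            return "no match";
        }
        return matcher.start() + ":[" + matcher.group() + "]";
    }

    public static void main(String[] args) {
        String candidate = "42 is the answer";

        // The lookbehind fails at indexes 0 and 1, then succeeds at
        // index 2, where the two preceding characters are 4 and 2.
        System.out.println(describeFirstMatch("(?<=\\d\\d).*", candidate));
    }
}
```

The match starts at index 2 and begins with the space, which shows that the 4 and the 2 were inspected by the lookbehind but are not part of the match.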

Listing 3-10 demonstrates some code for using positive lookbehinds. The goal is to parse a document's content and extract any URLs used. Table 3-4 deconstructs the regex pattern used.

Table 3-4: The Pattern (?<=http://)\S+
Regex	Description
(?<=	Open a positive lookbehind group consisting of
h	The character h followed by
t	The character t followed by
t	The character t followed by
p	The character p followed by
:	The character : followed by
/	The character / followed by
/	The character / followed by
)	Close the lookbehind group, followed by
\S	A nonspace character
+	Repeated one or more times
* In English: Match a URL if that URL is preceded by http://.

Listing 3-10: Simple Positive Lookbehind Example

import java.util.regex.*;
public class PositiveLookBehindExample{
    public static void main(String args[])
    throws Exception
    {

        //define the pattern
        String regex = "(?<=http://)\\S+";

        //compile the pattern
        Pattern pattern = Pattern.compile(regex);

        String candidate = "The Apress website can be found at ";
        candidate +="http://www.apress.com. There, ";
        candidate +="you can find information about some of the ";
        candidate +="best books in the industry, including the ";
        candidate +="bestselling Sun Certified Java Developer ";
        candidate +="Exam with J2SE (";
        candidate +="http://www.apress.com/book/bookDisplay.";
        candidate +="html?bID=39) as well as others.";


        //extract a matcher for the candidate string
        Matcher matcher = pattern.matcher(candidate);

        //if the url was found, print it out here.
        while (matcher.find()){
            String msg =":"+ matcher.group()+":";
            System.out.println(msg);
        }
    }
}

Negative Lookbehinds

Negative lookbehinds confirm the absence of a pattern to the left of the current position. They're a way of telling the regex engine, "I'm interested in what follows, so long as it isn't preceded by such and such." You form negative lookbehinds by opening the group with (?<!.

Negative lookbehinds aren't as intuitive as the other lookarounds, so it's worthwhile to explore how they actually work. For example, consider the following negative lookbehind:

(?<!\d\d).*

The preceding pattern seems to request that the match not be preceded by two digits in a row. However, when you actually test it against the string 42 is the answer, it matches the entire candidate. What's going on here?

The problem is that the first character in the candidate 42 is the answer is 4. So the engine asks itself if the 4 character is preceded by two digits. Because the answer is no, the lookbehind succeeds at the very first position, and the rest of the pattern matches from there into group(0). Remember, .* is a greedy quantifier, so it matches as much as possible, which in this case is the entire candidate string.
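A small sketch makes this behavior concrete and shows one way to get the result the pattern seems to promise: tie the lookbehind to specific text, so that the engine can't simply match from position 0. The class name NegativeLookbehindDemo, the helper firstMatch, the variant pattern (?<!\d\d) is.*, and the second candidate string are all assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegativeLookbehindDemo {

    // Hypothetical helper: returns the text of the first match,
    // or null if the pattern isn't found.
    public static String firstMatch(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Nothing precedes position 0, so the lookbehind succeeds there
        // and .* takes the entire candidate.
        System.out.println(firstMatch("(?<!\\d\\d).*", "42 is the answer"));

        // Requiring " is" right after the lookbehind pins the check to the
        // one position where " is" occurs. Two digits precede it: no match.
        System.out.println(firstMatch("(?<!\\d\\d) is.*", "42 is the answer"));

        // Here " is" is preceded by letters, so the match succeeds.
        System.out.println(firstMatch("(?<!\\d\\d) is.*", "It is the answer"));
    }
}
```

The first call returns the whole candidate, the second returns null, and the third returns the text starting at the space before is.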