4.5 More About Greediness and Backtracking

Many concerns (and benefits) of greediness are shared by both an NFA and a DFA. (A DFA doesn't support laziness, which is why we've concentrated on greediness up to this point.) I'd like to look at some ramifications of greediness for both, but with examples explained in terms of an NFA. The lessons apply to a DFA just as well, but not for the same reasons. A DFA is greedy, period, and there's not much more to say after that. It's very easy to use, but pretty boring to talk about. An NFA, however, is interesting because of the creative outlet its regex-directed nature provides. Besides lazy quantifiers, there are a variety of extra features an NFA can support, including lookaround, conditionals, backreferences, and atomic grouping. And on top of these, an NFA affords the regex author direct control over how a match is carried out, which can be a benefit when used properly, but it does create some efficiency-related pitfalls (discussed in Chapter 6.)

Despite these differences, the match results are often similar. For the next few sections, I'll talk of both engine types, but describe effects in terms of the regexdir ected NFA. By the end of this chapter, you'll have a firm grasp of just when the results might differ, as well as exactly why.

4.5.1 Problems of Greediness

As we saw with the last example, .* always marches to the end of the line.^[7] This is because .*just thinks of itself and grabs what it can, only later giving up something if it is required to achieve an overall match.

^[7] With a tool or mode where a dot can match a newline, .* applied to strings that contain multiline data matches through all the logical lines to the end of the whole string.

Sometimes this can be a real pain. Consider a regex to match text wrapped in double quotes. At first, you might want to write ".*", but knowing what we know about .*, guess where it matches in:

     The name "McDonald's" is said "makudonarudo" in Japanese

Actually, since we understand the mechanics of matching, we don't need to guess, because we know. Once the initial quote matches, .*is free to match, and immediately does so all the way to the end of the string. It backs off (or, perhaps more appropriately, is backed off by the regex engine) only as much as is needed until the final quote can match. In the end, it matches

     The name "McDonald's" is said "makudonarudo" in Japanese

which is obviously not the double-quoted string that was intended. This is one reason why I caution against the overuse of .*, as it can often lead to surprising results if you don't pay careful attention to greediness.

So, how can we have it match "McDonald's" only? The key is to realize that we don't want "anything" between the quotes, but rather "anything except a quote." If we use [^"]* rather than .*, it won't overshoot the closing quote.

The regex engine's basic approach with "[^"]*" is exactly the same as before. Once the initial double quote matches, [^"]* gets a shot at matching as much as it can. In this case, that's up to the double quote after McDonald's, at which point it finally stops because [^"] can't match the quote. At that point, control moves to the closing ". It happily matches, resulting in overall success:

     The name "McDonald's" is said "makudonarudo" in Japanese

Actually, there's could be one unexpected change, and that's because in most flavors, [^"] can match a newline, while dot doesn't. If you want to keep the regex from crossing lines, use [^"\n].

4.5.2 Multi-Character "Quotes"

In the first chapter, I talked a bit about matching HTML tags, such as the sequence very that renders the "very" in bold if the browser can do so. Attempting to match a ··· sequence seems similar to matching a quoted string, except the "quotes" in this case are the multi-character sequences  and . Like the quoted string example, multiple sets of "quotes" cause problems if we use .*:

     ···<B>Billions</B> and <B>Zillions</B> of suns···

With .*, the greedy .* causes the match in progress to zip to the end of the line, backtracking only far enough to allow the  to match, matching the last  on the line instead of the one corresponding to the opening  at the start of the match.

Unfortunately, since the closing delimiter is more than one character, we can't solve the problem with a negated class as we did with double-quoted strings. We can't expect something like [^]* to work. A character class represents only one character and not the full  sequence that we want. Don't let the apparent structure of [^] fool you. It is just a class to match one character —any one except <, >, /, and B. It is the same as, say [^/<>B], and certainly doesn't work as an "anything not " construct. (With lookahead, you can insist that  not match at a particular point; we'll see this in action in the next section.)

4.5.3 Using Lazy Quantifiers

These problems arise because the standard quantifiers are greedy. Some NFAs support lazy quantifiers (see Section 3.4.5.8), with *? being the lazy version of *. With that in mind, let's apply .*? to:

     ···<B>Billions</B> and <B>Zillions</B> of suns···

After the initial  has matched, .*? immediately decides that since it doesn't require any matches, it lazily doesn't bother trying to perform any. So, it immediately passes control to the following <:

at '···Billions···' matching .*?

The < doesn't match at that point, so control returns back to .*? where it still has its untried option to attempt a match (to attempt multiple matches, actually). It begrudgingly does so, with the dot matching the underlined B in ···Billions···.Again, the *? has the option to match more, or to stop. It's lazy, so it first tries stopping. The subsequent < still fails, so .*? has to again exercise its untried match option. After eight cycles,.*? eventually matches Billions, at which point the subsequent < (and the whole  subexpression) is finally able to match:

     ···<B>Billions</B> and <B>Zillions</B> of suns···

So, as we've seen, the greediness of star and friends can be a real boon at times, while troublesome at others. Having non-greedy, lazy versions is wonderful, as they allow you to do things that are otherwise very difficult (or even impossible). Still, I've often seen inexperienced programmers use lazy quantifiers in inappropriate situations. In fact, what we've just done may not be appropriate. Consider applying .*? to:

     ···<B>Billions</B> and <B>Zillions</B> of suns···

It matches as shown, and while I suppose it depends on the exact needs of the situation, I would think that in this case that match is not desired. However, there's nothing about .*? to stop it from marching right past the Zillion's  to its .

This is an excellent example of why a lazy quantifier is often not a good replacement for a negated class. In the ".*" example, using [^"] as a replacement for the dot specifically disallows it from marching past a delimiter—a quality we wish our current regex had.

However, if negative lookahead (see Section 3.4.3.6) is supported, you can use it to create something comparable to a negated class. Alone, (?!) is a test that is successful if  is not at the current location in the string. Those are the locations that we want the dot of .*? to match, so changing that dot to ((?!).)creates a regex that matches where we want it, but doesn't match where we don't. Assembled all together, the whole thing can become quite confusing, so I'll show it here in a free-spacing mode (see Section 3.3.3.2) with comments:


      <B>                      #  Match the opening <B>

           (                   #  Now, only as many of the following as needed . . .

            (?! <B> )          #          If not <B> . . .

            .                  #  . . . any character is okay

          )*?                  #

      </B>                     #    . . . until the closing delimiter can match

With one adjustment to the lookahead, we can put the quantifier back to a normal greedy one, which may be less confusing to some:


       <B>                     #    Match the opening <B>

       (                       #    Now, only as many of the following as needed . . .

           (?! < /? B> )       #            If not <B>, and not </B> . . .

           .                   #             . . . any character is okay 

         )*                    #

       </B>                    #    . . . until the closing delimiter can match.

Now, the lookahead prohibits the main body to match beyond  as well as , which eliminates the problem we tried to solve with laziness, so the laziness can be removed. This expression can still be improved; we'll see it again during the discussion on efficiency in Chapter 6 (see Section 6.6.7).

4.5.4 Greediness and Laziness Always Favor a Match

Recall the price display example from Chapter 2 (see Section 2.3.2). We'll examine this example in detail at a number of points during this chapter, so I'll recap the basic issue: due to floating-point representation problems, values that should have been "1.625" or "3.00" were sometimes coming out like "1.62500000002828" and "3.00000000028822". To fix this, I used

     $price =~ s/(\.\d\d[1-9]?)\d*/$1/;

to lop off all but the first two or three decimal digits from the value stored in the variable $price. The \.\d\d matches the first two decimal digits regardless, while the [1-9]? matches the third digit only if it is non-zero.

I then noted:

Anything matched so far is what we want to keep, so we wrap it in parentheses to capture to $1. We can then use $1 in the replacement string. If this is the only thing that matches, we replace exactly what was matched with itself—not very useful. However, we go on to match other items outside the $1 parentheses. They don't find their way to the replacement string, so the effect is that they're removed. In this case, the "to be removed" text is any extra digits, the \d* at the end of the regex.

So far so good, but let's consider what happens when the contents of the variable $price is already well formed. When it is 27.625, the (\.\d\d[1-9]?) part matches the entire decimal part. Since the trailing \d* doesn't match anything, the substitution replaces the '.625' with '.625' — an effective no-op.

This is the desired result, but wouldn't it be just a bit more efficient to do the replacement only when it would have some real effect (that is, do the replacement only when \d* actually matches something) ? Well, we know how to write "at least one digit"! Simply replace \d*," with \d+:

     $price =~ s/(\.\d\d[1-9]?)\d+/$1/

With crazy numbers like "1.62500000002828", it still works as before, but with something such as "9.43", the trailing \d+ isn't able to match, so rightly, no substitution occurs. So, this is a great modification, yes? No! What happens with a threedigit decimal value like 27.625? We want this value to be left alone, but that's not what happens. Stop for a moment to work through the match of 27.625 yourself, with particular attention to how the '5' interacts with the regex.

In hindsight, the problem is really fairly simple. Picking up in the action once (\.\d\d[1-9]?)\d+ has matched 27.625, we find that \d+ can't match. That's no problem for the overall match, though, since as far as the regex is concerned, the match of '5' by [1-9] was optional and there is still a saved state to try. This state allows [1-9]? to match nothing, leaving the 5 to fulfill the must-match-one requirement of \d+. Thus, we get the match, but not the right match: .625 is replaced by .62, and the value becomes incorrect.

What if [1-9]? were lazy instead? We'd get the same match, but without the intervening "match the 5 but then give it back" steps, since the lazy [1-9]?? first skips the match attempt. So, laziness is not a solution to this problem.

4.5.5 The Essence of Greediness, Laziness, and Backtracking

The lesson of the preceding section is that it makes no difference whether there are greedy or lazy components to a regex; an overall match takes precedence over an overall non-match. This includes taking from what had been greedy (or giving to what had been lazy) if that's what is required to achieve a match, because when a "local failure" is hit, the engine keeps going back to the saved states (retracing steps to the piles of bread crumbs), trying the untested paths. Whether greedily or lazily, every possible path is tested before the engine admits failure.

The order that the paths are tested is different between greedy and lazy quantifiers (after all, that's the whole point of having the two!), but in the end, if no match is to be found, it's known only after testing every possible path.

If, on the other hand, there exists just one plausible match, both a regex with a greedy quantifier and one with a lazy quantifier find that match, although the series of paths they take to get there may be wildly different. In these cases, selecting greedy or lazy doesn't influence what is matched, but merely how long or short a path the engine takes to get there (which is an efficiency issue, the subject of Chapter 6).

Finally, if there is more than one plausible match, understanding greediness, laziness, and backtracking allows you to know which is selected. The ".*"? example has three plausible matches:

We know that .*, with the greedy star, selects the longest one, and that ".*?", with the lazy star, selects the shortest.

4.5.6 Possessive Quantifiers and Atomic Grouping

The '.625' example below shows important insights about NFA matching as we know it, and how with that particular example our naïve intents were thwarted. Some flavors do provide tools to help us here, but before looking at them, it's absolutely essential to fully understand the preceding section, "The Essence of Greediness, Laziness, and Backtracking." Be sure to review it if you have any doubts.

So, continuing with the '.625' example and recalling what we really want to happen, we know that if the matching can successfully get to the marked position in (\.\d\d[1-9]?)\d+, we never want it to go back. That is, we want [1-9] to match if possible, but if it does, we don't want that match to be given up. Saying it more forcefully, we would rather have the entire match attempt fail, if need be, before giving up something matched by the [1-9]. (As you'll recall, the problem before when this regex was applied to '.625' was that it indeed didn't fail, but instead went back to try the remaining skip-me alternative.)

Well, what if we could somehow eliminate that skip-me alternative (eliminate the state that ? saves before it makes the attempt to match [1-9])? If there was no state to go back to, a match of [1-9] wouldn't be given up. That's what we want! Ah, but if there was no skip-me state to go back to, what would happen if we applied the regex to '.5000'? The [1-9] couldn't match, and in this case, we do want it to go back and skip the [1-9] so that the subsequent \d+ can match digits to be removed.

It sounds like we have two conflicting desires, but thinking about it, what we really want is to eliminate the skip-me alternative only if the match-me alternative succeeds. That is, if [1-9] is indeed able to match, we'd like to get rid of the skipme saved state so that it is never given up. This is possible, with regex flavors that support (?>···) atomic grouping (see Section 3.4.5.3), or possessive quantifiers like [1-9]?+ (see Section 3.4.5.8). We'll look at atomic grouping first.

4.5.6.1 Atomic grouping with `(?>···)`

In essence, matching within (?>···) carries on normally, but if and when matching is able to exit the construct (that is, get past its closing parenthesis), all states that had been saved while within it are thrown away. In practice, this means that once the atomic grouping has been exited, whatever text was matched within it is now one unchangeable unit, to be kept or given back only as a whole. All saved states representing untried options within the parentheses are eliminated, so backtracking can never undo any of the decisions made within (at least not once they're "locked in" when the construct is exited).

So, let's consider (\.\d\d(?>[1-9]?))\d+. Quantifiers work normally within atomic grouping, so if [1-9]is not able to match, the regex returns to the skip-me saved state the ? had left. That allows matching to leave the atomic grouping and continue on to the \d+. In this case, there are no saved states to flush when control leaves the atomic grouping (that is, there are no saved states remaining that had been created within it).

However, when [1-9] is able to match, matching can exit the atomic grouping, but this time, the skip-me state is still there. Since it had been created within the atomic grouping we're now exiting, it is thrown away. This would happen when matching against both '.625', and, say, '.625000'. In the latter case, having eliminated the state turns out not to matter, since the \d+ has the '.625000' to match, after which that regex is done. With '.625' alone, the inability of \d+to match has the regex engine wanting to backtrack, but it can't since that skip-me alternative was thrown away. The lack of any state to backtrack to results in the overall match attempt failing, and '.625' is left undisturbed as we wish.

4.5.6.1.1 The essence of atomic grouping

The section "The Essence of Greediness, Laziness, and Backtracking," starting in Section 4.5.5, makes the important point that neither greediness nor laziness influence which paths can be checked, but merely the order in which they are checked. If no match is found, whether by a greedy or a lazy ordering, in the end, every possible path will have been checked.

Atomic grouping, on the other hand, is fundamentally different because it actually eliminates possible paths. Eliminating states can have a number of different consequences, depending on the situation:

No Effect If a match is reached before one of the eliminated states would have been called upon, there is no effect on the match. We saw this a moment ago with the '.625000' example. A match was found before the eliminated state would have come into play.
Prohibit Match The elimination of states can mean that a match that would have otherwise been possible now becomes impossible. We saw this with the '.625' example.
Different Match In some cases, it's possible to get a different match due to the elimination of states.
Faster Failure It's possible for the elimination of states to do nothing more than allow the regex engine, when no match is to be found, report that fact more quickly. This is discussed right after the quiz.

Here's a little quiz: what does the construct (?>.*?) do? What kind of things do you expect it can match? click here to check your answer.

Some states may remain. When the engine exits atomic grouping during a match, only states that had been created while inside the atomic grouping are eliminated. States that might have been there before still remain after, so the entire text matched by the atomic subexpression may be unmatched, as a whole, if backtracking later reverts to one of those previous states.

Faster failures with atomic grouping. Consider ^\w+: applied to 'Subject'. We can see, just by looking at it, that it will fail because the text doesn't have a colon in it, but the regex engine won't reach that conclusion until it actually goes through the motions of checking.

So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states—one "skip me" state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

at 'Subject' matching ^\w+:

at which point the : fails again, this time trying to match 't'. This backtrack-testfail cycle happens all the way back to the oldest state:

at 'Subject' matching ^\w+:

After the attempt from the final state fails, overall failure can finally be announced.

All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can't match after the last letter, it certainly can't match one of the letters the + is forced to give up!

So, knowing that none of the states left by \w+ once it's finished, could possibly lead to a match, we can save the regex engine the trouble of checking them: \^(?>\w+): By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of \w+ by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won't have mattered, but if there's not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly. (An advanced implementation may be able to apply this optimization for you automatically see Section 6.4.6.7.)

As we'll see in the Chapter 6 (see Section 6.6.6), this technique shows a very valuable use of atomic grouping, and I suspect it will become the most common use as well.

4.5.7 Possessive Quantifiers, `?+, *+, ++,` and `{m,n}+`

Possessive quantifiers are much like greedy quantifiers, but they never give up a partial amount of what they've been able to match. Once a plus, for example, finishes its run, it has created quite a few saved states, as we saw with the ^\w+example. A possessive plus simply throws those states away (or, more likely, doesn't bother creating them in the first place).

As you might guess, possessive quantifiers are closely related to atomic grouping. Something possessive like \w++ appears to match in the same way as (?>\w+); one is just a notational convenience for the other.^[8] With possessive quantifiers, ^(?>\w+): can be rewritten as \w++:, and (\.\d\d(?>[1-9]?))\d+ can be rewritten as (\.\d\d[1-9]?+)\d+.

^[8] A smart implementation may be able to make the possessive version a bit more efficient than its atomic-grouping counterpart (see Section 6.4.6.7).

Be sure to understand the difference between (?>M)+ and (?>M+). The first one throws away unused states created by M, which is not very useful since M doesn't create any states. The second one throws away unused states created by M+, which certainly can be useful.

When phrased as a comparison between (?>M)+ and (?>M+), it's perhaps clear that the second one is the one comparable to M++, but when converting something more complex like (\\"|[^"])*+ from possessive quantifiers to atomic grouping, it's tempting to just add '?>' to the parentheses that are already there: (?>\\"|[^"])*+. The new expression might happen to achieve your goal, but be clear that is not comparable to the original possessive-quantifier version; it's as if changing M++ to (?>M)+. Rather, to be comparable, remove the possessive plus, and then wrap what remains in atomic grouping: (?>(\\"|[^"])*).

4.5.8 The Backtracking of Lookaround

It might not be apparent at first, but lookaround (introduced in Chapter 2 see Section 2.3.5) is closely related to atomic grouping and possessive quantifiers. There are four types of lookaround: positive and negative flavors of lookahead and lookbehind. They simply test whether their subexpression can and can't match starting at the current location (lookahead), or ending at the current location (lookbehind).

Looking a bit deeper, how does lookaround work in our NFA world of saved states and backtracking? As a subexpression within one of the lookaround constructs is being tested, it's as if it's in its own little world. It saves states as needed, and backtracks as necessary. If the entire subexpression is able to match successfully, what happens? With positive lookaround, the construct, as a whole, is considered a success, and with negative lookaround, it's considered a failure. In either case, since the only concern is whether there's a match (and we just found out that, yes, there's a match), the "little world" of the match attempt, including any saved states that might have been left over from that attempt, is thrown away.

What about when the subexpression within the lookaround can't match? Since it's being applied in its "own little world," only states created within the current lookar ound construct are available. That is, if the regex finds that it needs to backtrack further, beyond where the lookaround construct started, it's found that the current subexpression can not match. For positive lookahead, this means failure, while for negative lookahead, it means success. In either case, there are no saved states left over (had there been, the subexpression match would not have finished), so there's no "little world" left to throw away.

So, we've seen that in all cases, once the lookaround construct has finished, there are no saved states left over from its application. Any states that might have been left over, such as in the case of successful positive lookahead, are thrown away.

Well, where else have we seen states being thrown away? With atomic grouping and possessive quantifiers, of course.

4.5.8.1 Mimicking atomic grouping with positive lookahead

It's perhaps mostly academic for flavors that support atomic grouping, but can be quite useful for those that don't: if you have positive lookahead, and if it supports capturing parentheses within the lookahead (most flavors do, but Tcl's lookahead, for example, does not), you can mimic atomic grouping and possessive quanti- fiers. (?> regex ) can be mimicked with (?=( regex ))\1. For example, compare ^(?>\w+): with ^(?=(\w+))\1:.

The lookahead version has \w+ greedily match as much as it can, capturing an entire word. Because it's within lookahead, the intermediate states are thrown away when it's finished (just as if, incidentally, it had been within atomic grouping). Unlike atomic grouping, the matched word is not included as part of the match (that's the whole point of lookahead), but the word does remain captured. That's a key point because it means that when \1 is applied, it's actually being applied to the very text that filled it, and it's certain to succeed. This extra step of applying \1 is simply to move the regex past the matched word.

This technique is a bit less efficient than real atomic grouping because of the extra time required to rematch the text via \1. But, since states are thrown away, it fails more quickly than a raw \w+: when the : can't match.

4.5.9 Is Alternation Greedy?

How alternation works is an important point because it can work in fundamentally different ways with different regex engines. When alternation is reached, any number of the alternatives might be able to match at that point, but which will? Put another way, if more than one can match, which will? If it's always the one that matches the most text, one might say that alternation is greedy. If it's always the shortest amount of text, one might say it's lazy? Which (if either) is it?

Let's look at the Traditional NFA engine used in Perl, Java packages, .NET languages, and many others (see Section 4.1.3). When faced with alternation, each alternative is checked in the right-to-left order given in the expression. With the example regex of ^(Subject|Date):•, when the Subject|Date alternation is reached, the first alternative, Subject, is attempted. If it matches, the rest of the regex (the subsequent :• ) is given a chance. If it turns out that it can't match, and if other alternatives remain (in this case, Date ), the regex engine backtracks to try them. This is just another case of the regex engine backtracking to a point where untried options are still available. This continues until an overall match is achieved, or until all options (in this case, all alternatives) are exhausted.

So, with that common Traditional NFA engine, what text is actually matched by tour|to|tournament when applied to the string 'three•tournaments•won' ? All the alternatives are attempted (and fail) during attempts starting at each character position until the transmission starts the attempt at 'three•tournaments•won'. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.

So, we see that alternation is neither greedy nor lazy, but ordered, at least for a Traditional NFA. This is more powerful than greedy alternation because it allows more control over just how a match is attempted — it allows the regex author to express "try this, then that, and finally try that, until you get a match."

Not all flavors have ordered alternation. DFAs and POSIX NFAs do have greedy alternation, always matching with the alternative that matches the most text (tournament in this case). But, if you're using Perl, a .NET language, virtually any Java regex package, or any other system with a Traditional NFA engine (list see Section 4.1.3), your alternation is ordered.

4.5.10 Taking Advantage of Ordered Alternation

Let's revisit the (\.\d\d[1-9]?)\d* example from Section 4.5.4. If we realize that \.\d\d[1-9]?, in effect, says "allow either \.\d\d or \.\d\d[1-9] ", we can rewrite the entire expression as (\.\d\d|\.\d\d[1-9])\d*. (There is no compelling reason to make this change — it's merely a handy example.) Is this really the same as the original? If alternation is truly greedy, then it is, but the two are quite different with ordered alternation.

Let's consider it as ordered for the moment. The first alternative is selected and tested, and if it matches, control passes to the \d* that follows the alternation. If there are digits remaining, the \d* matches them, including any initial non-zero digit that was the root of the original example's problem (if you'll recall the original problem, that's a digit we want to match only within the parentheses, not by the \d* after the parentheses). Also, realize that if the first alternative can't match, the second alternative will certainly not be able to, as it begins with a copy of the entire first alternative. If the first alternative doesn't match, though, the regex engine nevertheless expends the effort for the futile attempt of the second.

Interestingly, if we swap the alternatives and use (\.\d\d[1-9]|\.\d\d)\d*, we do effectively get a replica of the original greedy (\.\d\d[1-9]?)\d*. The alternation has meaning in this case because if the first alternative fails due to the trailing [1-9], the second alternative still stands a chance. It's still ordered alternation, but now we've selected the order to result in a greedy-type match.

When first distributing the [1-9]? to two alternatives, in placing the shorter one first, we fashioned a non-greedy ? of sorts. It ends up being meaningless in this particular example because there is nothing that could ever allow the second alternative to match if the first fails. I see this kind of faux-alternation often, and it is invariably a mistake. In one book I've read, a*((ab)*|b*) is used as an example in explaining something about regex parentheses. It's a pointless example because the first alternative, (ab)*, can never fail, so any other alternatives are utterly meaningless. You could add


 a*((ab)*|b*|.*|partridge•in•a•pear•tree|[a-z])

and it wouldn't change the meaning a bit. The moral is that with ordered alternation, when more than one alternative can potentially match the same text, care must be taken when selecting the order of the alternatives.

4.5.10.1 Ordered alternation pitfalls

Ordered alternation can be put to your advantage by allowing you to craft just the match you want, but it can also lead to unexpected pitfalls for the unaware. Consider matching a January date of the form 'Jan 31'. We need something more sophisticated than, say, Jan•[0123][0-9], as that allows "dates" such as 'Jan•00', 'Jan•39', and disallows, 'Jan•7'.

One way to match the date part is to attack it in sections. To match from the first through the ninth, using 0?[1-9] allows a leading zero. Adding [12][0-9] allows for the tenth through the 29^th, and 3[01] rounds it out. Putting it all together, we get Jan•(0?[1-9]|[12][0-9]|3[01]).

Where do you think this matches in 'Jan 31 is Dad's birthday'? We want it to match 'Jan 31', of course, but ordered alternation actually matches only 'Jan 3'. Surprised? During the match of the first alternative, 0?[1-9], the leading 0? fails, but the alternative matches because the subsequent [1-9] has no trouble matching the 3. Since that's the end of the expression, the match is complete.

When the order of the alternatives is adjusted so that the alternative that can potentially match a shorter amount of text is placed last, the problem goes away. This works: Jan•([12][0-9]|3[01]|0?[1-9]).

Another approach is Jan•(31|[123]0|[012]?[1-9]). Like the first solution, this requires careful arrangement of the alternatives to avoid the problem. Yet, a third approach is Jan•(0[1-9]|[12][0-9]?|3[01]?|[4-9]), which works properly regardless of the ordering. Comparing and contrasting these three expressions can prove quite interesting (an exercise I'll leave for your free time, although the sidebar below should be helpful).

A Few Ways to Slice and Dice a Date

A few approaches to the date-matching problem posed in Section 4.5.10.1. The calendar associated with each regex shows what can be matched by each alternative color-coded within the regex.
figs/mre2_side.02.jpg

< Free Open Study >