6.6 Unrolling the Loop

Regardless of what native optimizations a system may support, perhaps the most important gains are to be had by understanding the basics of how the engine works, and writing expressions that help lead the engine to a match. So, now that we've reviewed the basics in excruciating detail, let's step up to the big leagues with a technique I call "unrolling the loop." It's effective for speeding up certain common expressions. Using it, for example, to transform the neverending match from near the start of this chapter (see Section 6.1.4 ) results in an expression that actually finishes a non-match in our lifetime, and as a bonus is faster with a match as well.

The "loop" in the name is the implicit loop imparted by the star in an expression that fits a ( this | that |···)* pattern. Indeed, our earlier "(\\.|[^"\\]+)*" never ending match fits this pattern. Considering that it takes approximately forever to report a non-match, it's a good example to try to speed up!

There are two competing roads one can take to arrive at this technique:

We can examine which parts of (\\.|[^"\\]+)* actually succeed during a variety of sample matches, leaving a trail of used subexpressions in its wake. We can then reconstruct an efficient expression based upon the patterns we see emerge. The (perhaps far-fetched) mental image I have is that of a big ball, representing a (···)* regex, being rolled over some text. The parts inside (···) that are actually used then stick to the text they match, leaving a trail of subexpressions behind like a dirty ball rolling across the carpet.
Another approach takes a higher-level look at the construct we want to match. We'll make an informed assumption about the likely target strings, allowing us to take advantage of what we believe will be the common situation. Using this point of view, we can construct an efficient expression.

Either way, the resulting expressions are identical. I'll begin from the "unrolling" point of view, and then converge on the same result from the higher-level view.

To keep the examples as uncluttered and as widely usable as possible, I'll use (···) for all parentheses. If (?:···) non-capturing parentheses are supported, their use imparts a further efficiency benefit. Later, we'll also look at using atomic grouping (see Section 3.4.5.3) and possessive quantifiers (see Section 3.4.5.10).

6.6.1 Method 1: Building a Regex From Past Experiences

In analyzing "(\\.|[^"\\]+)*", it's instructive to look at some matching strings to see exactly which subexpressions are used during the overall match. For example, with '"hi"', the expression effectively used is just "[^"\\]+". This illustrates that the overall match used the initial ", one application of the alternative [^"\\]+, and the closing ". With

     "he said \"hi there\" and left"

it is "[^"\\]+ \\.[^"\\]+\\.[^"\\]+". In this example, as well as in Table 6-2, I've marked the expressions to make the patterns apparent. It would be nice if we could construct a specific regex for each particular input string. That's not possible, but we can still identify common patterns to construct a more efficient, yet still general, regular expression.

Table 2. Unrolling-the-Loop Example Cases
Target String Effective Expression

"hi there"

"just one \" here"

"some \"quoted\" things"

"with \"a\" and \"b\"."

"[^"\\]+"

"[^"\\]+ \\.[^"\\]+"

"[^"\\]+ \\. [^"\\]+ \\" [^"\\]+"

"[^"\\]+ \\. [^"\\]+ \\. [^"\\]+ \\. [^"\\]+ \\. [^"\\]+"

"\"ok\"\n"

"empty \"\" quote"

" \\. [^"\\]+ \\. \\. "

"[^"\\]+ \\. \\. [^"\\]+"

For the moment, let's concentrate on the first four examples in Table 6-2. I've underlined the portions that refer to "an escaped item, followed by further normal characters." This is the key point: in each case, the expression between the quotes begins with [^"\\]+ and is followed by some number of \\.[^"\\]+ sequences. Rephrasing this as a regular expression, we get [^"\\]+( \\.[^"\\]+)*. This is a specific example of a general pattern that can be used for constructing many useful expressions.

6.6.1.1 Constructing a general "unrolling-the-loop" pattern

In matching the double-quoted string, the quote itself and the escape are "special" — the quote because it can end the string, and the escape because it means that whatever follows won't end the string. Everything else, [^"\\], is "normal." Looking at how these were combined to create [^"\\]+( \\.[^"\\]+)*, we can see that it fits the general pattern normal+( special normal +)*.

Adding in the opening and closing quote, we get "[^"\\]+( \\. [^"\\]+)*". Unfortunately, this won't match the last two examples in Table 6-2. The problem, essentially, is that our current expression's two [^"\\]+ require a normal character at the start of the string and after any special character. As the examples show, that's not always appropriate—the string might start or end with an escaped item, or there might be two escaped items in a row.

We could try changing the two pluses to stars: "[^"\\] * ( \\. [^"\\]* )*". Does this have the desired effect? More importantly, does it have any undesirable effects?

As far as desirable effects, it is easy to see that all the examples now match. In fact, even a string such as "\"\"\"" now matches. This is good. However, we can't make such a major change without being quite sure there are no undesirable effects. Could anything other than a legal double-quoted string match? Can a legal double-quoted string not match? What about efficiency?

Let's look at "[^"\\]* ( \\. [^"\\]* )*" carefully. The leading "[^"\\]* is applied only once and doesn't seem dangerous: it matches the required opening quote and any normal characters that might follow. No danger there. The subsequent (\\. [^"\\]*)* is wrapped by (···)*, so is allowed to match zero times. That means that removing it should still leave a valid expression. Doing so, we get "[^"\\]*", which is certainly fine — it represents the common situation where there are no escaped items.

On the other hand, if ( \\. [^"\\]*)* matches once, we have an effective "[^"\\]* \\.[^"\\]*". Even if the trailing [^"\\]* matches nothing (making it an effective "[^"\\]* \\. " ), there are no problems. Continuing the analysis in a similar way (if I can remember my high school algebra, it's "by induction"), we find that there are, indeed, no problems with the proposed changes.

So, that leaves us with the final expression to match a double-quoted string with escaped double quotes inside:

"[^"\\]*( \\. [^"\\]* )*"

6.6.2 The Real "Unrolling-the-Loop" Pattern

Putting it all together, then, our expression to match a double-quoted string with escaped-items is "[^"\\]*(\\.[^"\\]*)*". This matches exactly the same strings as our alternation version, and it fails on the same strings that the alternation version fails on. But, this unrolled version has the added benefit of finishing in our lifetime because it is much more efficient and avoids the neverending-match problem.

The general pattern for unrolling the loop is:

opening normal* ( special normal* ) * closing

6.6.2.1 Avoiding the neverending match

Three extremely important points prevent "[^"\\]*( \\. [^"\\]*)*" from becoming a neverending match:

6.6.2.1.1 The start of special and normal must never intersect

The special and normal subexpressions must be written such that they can never match at the same point. With our ongoing example, where normal is [^"\\] and special is \\., it's clear that they can never begin a match at the same character since the latter one requires a leading backslash, while the former one explicitly disallows a leading backslash.

On the other hand, \\. and [^"] can both match starting at '"Hello\n"', so they are inappropriate as special or normal. If there is a way they can match starting at the same location, it's not clear which should be used at such a point, and the non-determinism creates a neverending match. The 'm a k u d o n a r u d o' example illustrates this graphically (see Section 6.1.4.1). A failing match (or any kind of match attempt with POSIX NFA engines) has to test all these possibilities and permutations. That's too bad, since the whole reason to re-engineer in the first place was to avoid this.

If we ensure that special and normal can never match at the same point, special acts to checkpoint the nondeterminism that would arise when multiple applications of normal could, by different iterations of the (···)* loop, match the same text. If we ensure that special and normal can never match at the same point, there is exactly one "sequence" of specials and normals in which a particular target string matches. Testing this one sequence is much faster than testing a hundred million of them, and thus a neverending match is avoided.

6.6.2.1.2 Special must not match nothingness

The second important point is that special must always match at least one character if it matches anything at all. If it were able to match without consuming characters, adjacent normal characters would be able to be matched by different iterations of ( special normal *)*, bringing us right back to the basic (···*)* problem.

For example, choosing a special of (\\.)* violates this point. In trying to match the ill-fated "[^"\\]*( (\\.)* [^"\\]*)*" against '"Tubby' (which fails), the engine must try every permutation of how multiple [^"\\]* might match 'Tubby' before concluding that the match is a failure. Since special can match nothingness, it doesn't act as the checkpoint it purports to be.

6.6.2.1.3 Special must be atomic

Text matched by one application of special must not be able to be matched by multiple applications of special. Consider matching a string of optional Pascal {···} comments and spaces. A regex to match the comment part is \{[^}]*\}, so the whole (neverending) expression becomes (\{[^}]*\}| • +)*. With this regex, you might consider special and normal to be:

special normal
•+ \{[^}]*\}

Plugging this into the normal*( special normal*)* pattern we've developed, we get: (\{[^}]*\})*( •+(\{[^}]*\})* )*. Now, let's look at a string:

{comment}••• {another}••

A sequence of multiple spaces could be matched by a single •+, by many •+ (each matching a single space), or by various combinations of •+ matching differing numbers of spaces. This is directly analogous to our 'm a k u d o n a r u d o' problem.

The root of the problem is that special is able to match a smaller amount of text within a larger amount that it could also match, and is able to do so multiple times thanks to (···)*. The nondeterminism opens up the "many ways to match the same text" can of worms.

If there is an overall match, it is likely that only the all-at-once •+ will happen just once, but if no match is possible (such as might happen if this is used as a subexpr ession of a larger regex that could possibly fail), the engine must work through each permutation of the effective (•+)* to each series of multiple spaces. That takes time, but without any hope for a match. Since special is supposed to act as the checkpoint, there is nothing to check its nondeterminism in this situation.

The solution is to ensure that special can match only a fixed number of spaces. Since it must match at least one, but could match more, we simply choose • and let multiple applications of special match multiple spaces via the enclosing (···)*.

This example is useful for discussion, but in real-life use, it's probably more efficient to swap the normal and special expressions to come up with

•*( \{[^}]*\} •* )*

because I would suspect that a Pascal program has more spaces than comments, and it's more efficient to have normal be the most common case.

6.6.2.2 General things to look out for

Once you internalize these rules (which might take several readings and some practical experience), you can generalize them into guidelines to help identify regular expressions susceptible to a neverending match. Having multiple levels of quantifiers, such as (···*)*, is an important warning sign, but many such expressions are perfectly valid. Examples include:

(Re: •*)*, to match any number of 'Re:' sequences (such as might be used to clean up a 'Subject: •Re: •Re: •Re: •hey' subject line).
(•*\$[0-9]+)*, to match dollar amounts, possibly space-separated.
(.*\n)+, to match one or more lines. (Actually, if dot can match a newline, and if there is anything following this subexpression that could cause it to fail, this would become a quintessential neverending match.)

These are okay because each has something to checkpoint the match, keeping a lid on the "many ways to match the same text" can of worms. In the first, it's Re:, in the second it's \$, and in the third (when dot doesn't match newline), it's \n.

6.6.3 Method 2: A Top-Down View

Recall that I said that there were two paths to the same "unrolling the loop" expression. In this second path, we start by matching only what's most common in the target, then adding what's needed to handle the rare cases. Let's consider what the neverending ( \\. |[^"\\]+)* attempts to accomplish and where it will likely be used. Normally, I would think, a quoted string would have more regular characters than escaped items, so [^"\\]+ does the bulk of the work. The \\. is needed only to take care of the occasional escaped item. Using alternation to allow either makes a useful regex, but it's too bad that we need to compromise the efficiency of the whole match for the sake of a few (or more commonly, no) escaped characters.

If we think that [^"\\]+ will normally match most of the body of the string, we know that once it finishes we can expect either the closing quote or an escaped item. If we have an escape, we want to allow one more character (whatever it might be), and then match more of the bulk with another [^"\\]+. Every time [^"\\]+ ends, we are in the same position we were before: expecting either the closing quote or another escape.

Expressing this naturally as a single expression, we arrive at the same expression we had early in Method 1: "[^"\\]+( \\. [^"\\]+)*". Each time the matching reaches the point marked by , we know that we're expecting either a backslash or a closing quote. If the backslash can match, we take it, the character that follows, and more text until the next "expecting a quote or backslash" point.

As in the previous method, we need to allow for when the initial non-quote segment, or inter-quote segments, are empty. We can do this by changing the two pluses to stars, which results in the same expression as we ended up with in Section 6.6.2.

6.6.4 Method 3: An Internet Hostname

I promised two methods to arrive at the unrolling-the-loop technique, but I'd like to present something that can be considered a third. It struck me while working with a regex to match a hostname such as www.yahoo.com. A hostname is essentially dot-separated lists of subdomain names, and exactly what's allowed for one subdomain name is fairly complex to match (see Section 5.3.3), so to keep this example less cluttered, we'll just use [a-z]+ to match a subdomain.

If a subdomain is [a-z]+ and we want a dot-separated list of them, we need to match one subdomain first. After that, further subdomains require a leading period. Expressing this literally, we get: [a-z]+( \.[a-z]+)*. Now, if I add an underline and some gray, [a-z]+( \. [a-z]+)*, it sure looks like it almost fits a very familiar pattern, doesn't it!

To illustrate the similarity, let's try to map this to our double-quoted string example. If we consider a string to be sequences of our normal [^\\"], separated by special \\., all within '"···"', we can plug them into our unrolling-the-loop pattern to form "[^\\"]+( \\. [^\\"]+)*", which is exactly what we had at one point while discussing Method 1. This means that conceptually, we can take the view we used with a hostname—stuff separated by separators—and apply it to doublequoted strings, to give us "sequences of non-escaped stuff separated by escaped items." This might not seem intuitive, but it yields an interesting path to what we've already seen.

The similarity is interesting, but so are the differences. With Method 1, we went on to change the regex to allow empty spans of normal before and after each special, but we don't want to do that here because a subdomain part cannot be empty. So, even though this example isn't exactly the same as the previous ones, it's in the same class, showing that the unrolling technique is powerful and flexible.

There are two differences between this and the subdomain example:

Domain names don't have delimiters at their start and end.
The normal part of a subdomain can never be empty (meaning two periods are not allowed in a row, and can neither start nor end the match). With a double-quoted string, there is no requirement that there be any normal parts at all, even though they are likely, given our assumptions about the data. That's why we were able to change the [^\\"]+ to [^\\"]*. We can't do that with the subdomain example because special represents a separator, which is required.

6.6.5 Observations

Recapping the double-quoted string example, I see many benefits to our expression, "[^"\\]*(\\.[^"\\]*)*", and few pitfalls.

Pitfalls:

Readability The biggest pitfall is that the original "([^"\\]|\\.)*" is probably easier to understand at first glance. We've traded a bit of readability for efficiency.
Maintainability Maintaining "[^"\\]*(\\. [^"\\]*)*" might be more difficult, since the two copies of [^"\\] must be kept identical across any changes. We've traded a bit of maintainability for efficiency.

Benefits:

Speed The new regex doesn't buckle under when no match is possible, or when used with a POSIX NFA. By carefully crafting the expression to allow only one way for any particular span of text to be matched, the engine quickly comes to the conclusion that non-matching text indeed does not match.
More speed The regex "flows" well, a subject taken up in "The Freeflowing Regex" (see Section 6.7). In my benchmarks with a Traditional NFA, the unrolled version is consistently faster than the old alternation version. This is true even for successful matches, where the old version did not suffer the lockup problem.

6.6.6 Using Atomic Grouping and Possessive Quantifiers

The problem with our original neverending match regex, "(\\.|[^"\\]+)*", is that it bogs down when there is no match. When there is a match, though, it's quite fast. It's quick to find the match because the [^"\\]+ component is what matches most of the target string (the normal in the previous discussion). Because [···]+ is usually optimized for speed (see Section 6.4.6.2), and because this one component handles most of the characters, the overhead of the alternation and the outer (···)* quantifier is greatly reduced.

So, the problem with "(\\.|[^"\\]+)*", is that it bogs down on a non-match, backtracking over and over to what we know will always be unfruitful states. We know they're unfruitful because they're just testing different permutations of the same thing. (If abc doesn't match 'foo', neither will abc or abc (or abc, abc, or abc, for that matter). So, if we could throw those states away, this regex would report the non-match quickly.

There are two ways to actually throw away (or otherwise ignore) states: atomic grouping (see Section 3.4.5.3) and possessive quantifiers (see Section 3.4.5.10). At the time of this writing, only Sun's regex package for Java supports possessive quantifiers, but I believe they'll gain popularity soon, so I'll cover them here.

Before I get into the elimination of the backtracking, I'd like to swap the order of the alternatives from "(\\.|[^"\\]+)*" to "([^"\\]+|\\.)*", as this places the component matching "normal" text first. As has been noted a few times in the last several chapters, when two or more alternatives can potentially match at the same location, care must be taken when selecting their order, as that order can influence what exactly is matched. But if, as in this case, all alternatives are mutually exclusive (none can match at a point where another can match), the order doesn't matter from a correctness point of view, so the order can be chosen for clarity or efficiency.

6.6.6.1 Making a neverending match safe with possessive quantifiers

Our neverending regex "([^"\\]+|\\.)*" has two quantifiers. We can make one possessive, the other possessive, or both possessive. Does it matter? Well, most of the backtracking troubles were due to the states left by the [···]+, so making that possessive is my first thought. Doing so yields a regex that's pretty fast, even when there's no match. However, making the outer (···)* possessive throws away all the states from inside the parentheses, which includes both those of [···]+ and of the alternation, so if I had to pick one, I'd pick that one.

But I don't have to pick one because I can make both possessive. Which of the three situations is fastest probably depends a lot on how optimized possessive quantifiers are. Currently, they are supported only by Sun's Java regex package, so my testing has been limited, but I've run all three combinations through tests with it, and found examples where one combination or the other is faster. I would expect the situation where both are possessive could be the fastest, so these results tend to make me believe that Sun hasn't yet optimized them to their fullest.

6.6.6.2 Making a neverending match safe with atomic grouping

Looking to add atomic grouping to "([^"\\]+|\\.)*", it's tempting to replace the normal parentheses with atomic ones: "(?>[^"\\]+|\\.)*". It's important to realize that (?>···|···)* is very different from the possessive (···|···)*+ in the previous section when it comes to the states that are thrown away.

The possessive (···|···)*+ leaves no states when it's done. On the other hand, the atomic grouping in (?>···|···)* merely eliminates any states left by each alternative, and by the alternation itself. The star is outside the atomic grouping, so is unaffected by it and still leaves all its "can try skipping this match" states. That means that the individual matches can still be undone via backtracking. We want to eliminate the outer quantifier's states as well, so we need an outer set of atomic grouping. That's why (?>(···|···)*) is needed to mimic the possessive (···|···)*+. (···|···)*+ and (?>···|···)* are both certainly helpful in solving the neverending match, but which states are thrown away, and when, are different. (For more on the difference between the two, see Section 4.5.8.)

6.6.7 Short Unrolling Examples

Now that we've got the basic idea of unrolling under our belt, let's look at some examples from earlier in the book, and see how unrolling applies to them.

6.6.7.1 Unrolling "multi-character" quotes

In Chapter 4 in Section 4.5.4, we saw this example:


     <B>                    # Match the opening <B>

     (                      # Now, only as many of the following as needed . . .

         (?! </? B> )       #    If not <B>, and not </B> . . .

         .                  #       . . . any character is okay

     )*                     #

     </B>                   # . . . until the closing delimiter can match.

With a normal of [^<] and a special of (?!</?B>)<, here's the unrolled version:


     <B>                   # Match the opening <B>

         (?> [^<]*)        # Now match any "normal" . . .

         (?>               # Any amount of . . .

             (?! < /? B> ) #     if not at <B> or </B>,

             <             #     match one "special"

             [^<]*         #     and then any amount of "normal"

         )*                #

     </B>                  # And finally the closing </B>

The use of atomic grouping is not required, but does make the expression faster when there's only a partial match.

6.6.7.2 Unrolling the continuation-line example

The continuation-line example from the start of the previous chapter (see Section 5.2.1) left off with ^\w+=([^\n\\]|\\.)*. Well, that certainly looks ripe for unrolling:


      ^ \w+ =                  # leading field name and '='

      # Now read (and capture) the value . . .

      (

          (?> [^\n\\]* )       # "normal"*

          (?> \\. [^\n\\]* )*  # ( "special" "normal"* )*

      )

As with earlier examples of unrolling, the atomic grouping is not required for this to work, but helps to allow the engine to announce a failure more quickly.

6.6.7.3 Unrolling the CSV regex

Chapter 5 has a long discussion of CSV processing, which finally worked its way to this snippet, from Section 5.4.2.1:

(?:^|,)

(?:  # Now, match either a double-quoted field (inside, paired double quotes 

       are allowed) . . .

          "  # (double-quoted field's opening quote)

          (  (?:  [^"] | "" )*  )

          "  # (double-quoted field's closing quote)

  |

    # . . . or, some non-quote/non-comma text . . .

         (  [^ ", ]*  )

)

The text then went on to suggest adding \G to the front, just to be sure that the bump-along didn't get us in trouble as it had throughout the example, and some other efficiency suggestions. Now that we know about unrolling, let's see where in this example we can apply it.

Well, the part to match a Microsoft CSV string, (?:[^ "]|"")*, certainly looks inviting. In fact, the way it's presented already has our normal and special picked out for us: [^"] and "". Here's how it looks with that part unrolled, plugged back into the original Perl snippet to process each field:


      while ($line =~ m{

                  \G(?:^|,) 

                  (?:

                      # Either a double-quoted field (with "" for each ")···

                      " # field's opening quote

                       ( (?> [^"]* ) (?> "" [^"]* )* )

                      " # field's closing quote

                  # ..or···

                  |

                     # ··· some non-quote/non-comma text....

                     ( [^",]* )

                  )

            }gx)

      { 

          if (defined $2) {

             $field = $2;

          } else {

             $field = $1;

             $field =~ s/""/"/g;

          }

          print "[$field]"; # print the field, for debugging

          Can work with $field now . . .

      }

As with the other examples, the atomic grouping is not required, but may help with efficiency.

6.6.8 Unrolling C Comments

I'd like to give an example of unrolling the loop with a somewhat more complex target. In the C language, comments begin with /*, end with */, and can span across lines, but can't be nested. (C++, Java, and C# also allow this type of comment.) An expression to match such a comment might be useful in a variety of situations, such as in constructing a filter to remove them. It was when working on this problem that I first came up with my unrolling technique, and the technique has since become a staple in my regex arsenal.

6.6.8.1 To unroll or to not unroll . . .

I originally developed the regex that is the subject of this section back in the early 1990s. Prior to that, matching C comments with a regular expression was consider ed difficult at best, if not impossible, so when I developed something that worked, it became the standard way to match C comments. But, when Perl introduced lazy quantifiers, a much simpler approach became evident: a dot-matchesall application of /\*.*?\*/.

Had lazy quantifiers been around when I first developed the unrolling technique, I might not have bothered to do so, for the need wouldn't have been so apparent. Yet, such a solution was still valuable because with that first version of Perl supporting lazy quantifiers, the unrolled version is faster than the lazy-quantifier version by a significant amount (in the variety of tests I've done, anywhere from about 50% faster, to 3.6x faster).

Yet, with today's Perl and its different mix of optimizations, those numbers go the other way, with the lazy-quantifier version running anywhere from about 50% faster to 5.5x faster. So, with modern versions of Perl, I'd just use /\*.*?\*/ to match C comments and be done with it.

Does this mean that the unrolling-the-loop technique is no longer useful for matching C comments? Well, if an engine doesn't support lazy quantifiers, the ability to use the unrolling technique certainly becomes appealing. And not all regex engines have the same mix of optimizations: the unrolling technique is faster with every other language I've tested—in my tests, up to 60 times faster! The unrolling technique is definitely useful, so the remainder of this example explores how to apply it to matching C comments.

Since there are no escapes to be recognized within a C comment the way \" must be recognized within a double-quoted string, one might think that this should make things simpler, but actually, it's much more complex. This is because */, the "ending quote," is more than one character long. The simple /\*[^*]*\*/ might look good, but that doesn't match /**some comment here**/ because it has a '*' within. It should be matched, so we need a new approach.

6.6.8.1.1 Avoiding regex headaches

You might find that /\*[^*]*\*/ is a bit difficult to read, even with the subtle easy-on-the-eyes spacing I've used in typesetting this book. It is unfortunate for our eyes that one of the comment's delimiting characters, '*', is also a regex metacharacter. The resulting backslashes are enough to give me a headache. To make things more readable during this example, we'll consider /x ··· x/, rather than /* ··· */, to be our target comment. This superficial cosmetic change allows /\*[^*]*\*/ to be written as the more readable /x [^x]* x/. As we work through the example and the expression becomes more complex, our eyes will thank us for the reprieve.

6.6.8.2 A direct approach

In Chapter 5 (see Section 5.2.6), I gave a standard formula for matching delimited text:

Match the opening delimiter
Match the main text: really "match anything that is not the ending delimiter"
Match the ending delimiter

Our pseudo comments, with /x and x/ as our opening and closing delimiters, appear to fit into this pattern. Our difficulties begin when we try to match "anything that is not the ending delimiter." When the ending delimiter is a single character, we can use a negated character class to match all characters except that delimiter. A character class can't be used for multi-character subexpressions, but if you have negative lookahead, you can use something like (?:(?!x/).)*. This is essentially ( anything not x/)*.

Using that, we get /x(?:(?!x/).)*x/ to match comments. It works perfectly well, but it can be quite slow (in some of my tests, hundreds of times slower than what we'll develop later in this section). This approach can be useful, but it's of little use in this particular case because any flavor that supports lookahead almost certainly supports lazy quantifiers, so if efficiency is not an issue, you can just use /x.*?x/ and be done with it.

So, continuing with the direct, three-step approach, is there another way to match until the first x/? Two ideas might come to mind. One method is to consider x to be the start of the ending delimiter. That means we'd match anything not x, and allow an x if it is followed by something other than a slash. This makes the "anything that is not the ending delimiter" one of:

Anything that is not x: [^x]
x, so long as not followed by a slash: x[^/]

This yields ([^x]|x[^/])* to match the main text, and /x([^x]|x[^/])*x/ to match the entire pseudo comment. As we'll see, this doesn't work.

Another approach is to consider a slash as the ending delimiter, but only if preceded by x. This makes the "anything not the ending delimiter" one of:

Anything that is not a slash: [^/]
A slash, so long as not preceded by x: [^x]/

This yields ([^/]|[^x]/)* to match the main text, and /x([^/]|[^x]/)*x/ to match the whole comment.

Unfortunately, it also doesn't work.

For /x([^x]|x[^/])*x/, consider '/xx•foo•xx/' — after matching 'foo• ', the first closing x is matched by x[^/], which is fine. But then, x[^/] matches x x /, which is the x that should be ending the comment. This opens the door for the next iteration's [^x] to match the slash, thereby errantly matching past the closing x/.

As for /x([^/]|[^x]/)*x/, it can't match '/x/•foo•/x/' (the whole of which is a comment and should be matched). In other cases, it can march past the end of a comment that has a slash immediately after its end (in a way similar to the other method). In such a case, the backtracking involved is perhaps a bit confusing, so it should be instructive to understand why /x([^/]|[^x]/)*x/ matches

     years = days /x divide x//365; /x assume non-leap year x/

as it does (an investigation I'll leave for your free time).

6.6.8.3 Making it work

Let's try to fix these regexes. With the first one, where x[^/] inadvertently matches the comment-ending ···xx/, consider /x([^x]|x+[^/])*x/. The added plus will, we think, have x+[^/] match a row of x's ending with something other than a slash. Indeed it will, but due to backtracking, that "something other than a slash" can still be x. At first, the greedy x+ matches that extra x as we want, but backtracking will reclaim an x if needed to secure an overall match. So, it still matches too much of:



/xx A xx/ foo() /xx B xx/

The solution comes back to something I've said before: say what you mean. If we want "some x, if not followed by a slash" to imply that the non-slash also doesn't include an x, we should write exactly that: x+[^/x]. As we want, this stops it from eating '···xxx/', the final x of a row of x that ends the comment. In fact, it has the added effect of not matching any comment-ending x, so it leaves us at '···xxx/' to match the ending delimiter. Since the ending delimiter part had been expecting just the one x, it won't match until we insert x+/ to allow this final case.

Translating Between English and Regex

In Section 6.6.8.1.1, when discussing two ways one might consider the C comment "anything that is not the ending delimiter," I presented one idea as

" x, so long as not followed by a slash: x[^/] "

and another as:

" a slash, so long as not preceded by x: [^x]/ "

In doing so, I was being informal—the English descriptions are actually quite different from the regexes. Do you see how?

To see the difference, consider the first case with the string 'regex'—it certainly has an x not followed by a slash, but it would not be matched by match x[^/]. The character class requires a character to match, and although that character can't be a slash, it still must be something, and there's nothing after the x in 'regex'. The second situation is analogous. As it turns out, what I need at that point in the discussion are those specific expressions, so it's the English that is in error.

If you have lookahead, "x, so long as not followed by a slash" is simply x(?!/). If you don't, you might try to get by with x([^/]|$). It still matches a character after the x, but can also match at the end of the line. If you have lookbehind, "slash, so long as not preceded by x" becomes (?<!x)/. If you don't have it, you have to make due with (^|[^x])/.

We won't use any of these while working with C comments, but it's good to understand the issue.

This leaves us with: /x([^x]|x+[^/x])*x+/ to match our pseudo comments.

Phew! Somewhat confusing, isn't it? Real comments (with * instead of x) require /\*([^*]|\*+[^/*])*\*+/ which is even more confusing. It's not easy to read; just remember to keep your wits about you as you carefully parse complex expressions in your mind.

6.6.8.4 Unrolling the C loop

For efficiency's sake, let's look at unrolling this regex. Table 6-3 in the next section shows the expressions we can plug in to our unrolling-the-loop pattern.

Like the subdomain example, the normal* is not actually free to match nothingness. With subdomains, it was because the normal part was not allowed to be empty. In this case, it's due to how we handle the two-character ending delimiter. We ensure that any normal sequence ends with the first character of the ending delimiter, allowing special to pick up the ball only if the following character does not complete the ending.

Table 3. Unrolling-the-Loop Components for C Comments
opening normal* ( special normal* )* closing
Item What We Want Regex
opening start of comment
/x

normal* comment text up to, and including, one or more 'x' [^x]*x+
special something other than the ending slash (and also not 'x') [^/x]
closing trailing slash /

So, plugging these in to the general unrolling pattern, we get:

/x[^x]*x+( [^/x][^x]*x+ )*/.

Notice the spot marked with ? The regex engine might work to that spot in two ways (just like the expression in Section 6.6.3). The first is by progressing to it after the leading /x[^x]*x+. The second is by looping due to the (···)*. Via either path, once we're at that spot we know we've matched x and are at a pivotal point, possibly on the brink of the comment's end. If the next character is a slash, we're done. If it's anything else (but an x, of course), we know the x was a false alarm and we're back to matching normal stuff, again waiting for the next x. Once we find it, we're right back on the brink of excitement at the marked spot.

6.6.8.4.1 Return to reality

/x[^x]*x+([^/x][^x]*x+)*/ is not quite ready to be used. First, of course, comments are /* ··· */ and not /x ··· x/. This is easily fixed by substituting each x with \* (or, within character classes, each x with *) :

/\*[^*]*\*+([^/*][^*]*\*+)*/

A use-related issue is that comments often span across lines. If the text being matched contains the entire multiline comment, this expression should work. With a strictly line-oriented tool such as egrep, though, there is no way to apply a regex to the full comment. With most utilities mentioned in this book, you can, and this expression might be useful for, say, removing comments.

In practical use, a larger problem arises. This regex understands C comments, but does not understand other important aspects of C syntax. For example, it can falsely match where there is no comment:


    const char *cstart = "/*", *cend = "*/";

We'll develop this example further, right in the next section.

< Free Open Study >