6.7 The Freeflowing Regex

We just spent some time constructing a regex to match a C comment, but left off with the problem of how to stop comment-like items within strings from being matched. Using Perl, we might mistakenly try to remove comments with:


    $prog =~ s{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}{}g; # remove C comments (and more!)

Text in the variable $prog that is matched by our regex is removed (that is, replaced by nothing). The problem with this is that there's nothing to stop a match from starting within a string, as in this C snippet:


    char *CommentStart = "/*";  /* start of comment */

    char *CommentEnd   = "*/";  /* end of comment */

Here, the underlined portions are what the regex finds, but the bold portions are what we wish to be found. When the engine is searching for a match, it tries to match the expression at each point in the target. Since it is successful only from where a comment begins (or where it looks like one begins), it doesn't match at most locations, so the transmission bump-along bumps us right into the doublequoted string, whose contents look like the start of a comment. It would be nice if we could tell the regex engine that when it hits a double-quoted string, it should zip right on past it. Well, we can.

6.7.1 A Helping Hand to Guide the Match

Consider:

$COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; # regex to match a comment

$DOUBLE = qr{"(?:\\.|[^"\\])*"};               #regex to match double-quoted string

$text =~ s/$DOUBLE|$COMMENT//g;

There are two new things here. One is that this time the regex operand, $DOUBLE|$COMMENT, is made up of two variables, each of which is constructed with Perl's special qr/···/ regex-style "double-quoted string" operator. As discussed at length in Chapter 3 (see Section 3.3), one must be careful when using strings that are meant to be interpreted as regular expressions. Perl alleviates this problem by providing the qr/···/ operator, which treats its operand as a regular expression, but doesn't actually apply it. Rather, it returns a "regex object" value that can later be used to build up a larger regular expression. It's extremely convenient, as we saw briefly in Chapter 2 (see Section 2.3.6.7). Like m/···/ and s/···/···/, you can pick delimiters to suit your needs (see Section 2.3.6.4), as we've done here using braces.

The other new thing here is the matching of double-quoted strings via the $DOUBLE portion. When the transmission has brought us to a position where the $DOUBLE part can match, it will do so, thereby bypassing the whole string in one fell swoop. It's possible to have both alternatives because they're entirely unambiguous with respect to each other. When applied to a string, as you read from the beginning, any point you that the text is:

Matchable by the comment part, thereby skipping immediately to the end of the comment, or...
Matchable by the double-quoted string part, thereby skipping immediately to the end of the string, or...
Not matchable by either, causing the attempt to fail. This means that the normal bump-along will skip only the one, uninteresting character.

This way, the regex will never be started from within a string or comment, the key to its success. Well, actually, right now it isn't helpful yet, since it removes the strings as well as the comments, but a slight change puts us back on track.

Consider:


$COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; # regex to match a comment 

$DOUBLE = qr{"(?:\\.|[^"\\])*"};               # Regex to match double-quoted string

$text =~ s/

($DOUBLE)|$COMMENT/

$1/g;

The only differences are that we've:

Added the parentheses to fill $1 if the match is via the string alternative. If the match is via the comment alternative, $1 is left empty.
Made the replacement value that same $1. The effect is that if a double-quoted string is matched, the replacement is that same double-quoted string — the string is not removed and the substitute becomes an effective no-op (but has the side effect of getting us past the string, which is the reason to add it in the first place). On the other hand, if the comment alternative is the one that matches, the $1 is empty, so the comment is replaced by nothingness just as we want. ^[9]

^[9] In Perl, if $1 is not filled during the match, it's given a special "no value" value "undef". When used in the replacement value, undef is treated as an empty string, so it works as we want. But, if you have Perl warnings turned on (as every good programmer should), the use of an undef value in this way causes a warning to be printed. To avoid this, you can use the 'no warnings;' pragma before the regular expression is used, or use this special Perl form of the substitute operator:
```
    $text =~ s/($DOUBLE )|$COMMENT/

defined($1) ? $1 : ""

/ge;
```

Finally, we need to take care of single-quoted C constants such as '\t' and the like. This is easy—we simply add another alternative inside the parentheses. If we would like to removeC++ /Java/C# style // comments too, that's as simple as adding //[^\n]* as a fourth alternative, outside the parentheses:

$COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/};  # regex to match a comment

$COMMENT2 = qr{//[^\n]*};                       # regex to match a C++ // comment

$DOUBLE = qr{"(?:\\.|[^"\\])*"};                # regex to match double-quoted string

$SINGLE = qr{'(?:\\.|[^'\\])*'};                # regex to match single-quoted string



$text =~ s/($DOUBLE|$SINGLE)|$COMMENT|$COMMENT2/$1/g;

The basic premise is quite slick: when the engine checks the text, it quickly grabs (and if appropriate, removes) these special constructs. On my system, this Perl snippet took about 16.4 seconds to remove all the comments from a 16-megabyte, 500,000-line test file. This is fast, but we'll speed it up considerably.

6.7.2 A Well-Guided Regex is a Fast Regex

With just a little hand holding, we can help direct the flow of the regex engine's attention to match much faster. Let's consider the long spans of normal C code between the comments and strings. For each such character, the regex engine has to try each of the four alternatives to see whether it's something that should be gobbled up, and only if all four fail does it bump-along to bypass the character as uninteresting. This is a lot of work that we really don't need to do.

We know, for example, that for any of the alternatives to have a chance at matching, the lead character must be a slash, a single quote, or a double quote. One of these doesn't guarantee a match, but not being one does guarantee a non-match. So, rather than letting the engine figure this out the slow and painful way, let's just tell it directly by adding [^'"/] as an alternative. In fact, any number of such characters in a row can be scooped right up, so let's use [^'"/]+ instead. If you remember the neverending match, you might feel worried about the added plus. Indeed, it could be of great concern if it were within some kind of (···)* loop, but in this stand-alone case it's quite fine (there's nothing that follows that could force it to backtrack at all). So, adding:


$OTHER = qr{[^"'/]}; # Stuff that couldn't possibly begin one of the other alternatives

   .

   .

   .

$text =~ s/($DOUBLE|$SINGLE|$OTHER+)|$COMMENT|$COMMENT2/$1/g;

For reasons that will become apparent after a bit, I've put the plus quantifier after $OTHER, rather than part of the contents of $OTHER.

So, I retry my benchmarks, and wow, this one change cuts the time by over 75%! We've crafted the regex to remove most of the overhead of having to try all the alternatives so often. There are still a few cases where none of the alternatives can match (such as at 'c / 3.14'), and at such times, we'll have to be content with the bump-along to get us by.

However, we're not done yet—we can still help the engine flow to a faster match:

In most cases, the most popular alternative will be $OTHER+, so let's put that first inside the parentheses. This isn't an issue for a POSIX NFA engine because it must always check all alternatives anyway, but for a Traditional NFA, which stops once a match has been found, why make it check for relatively rare matches before checking the one we believe will match most often?
After one of the quoted items matches, it will likely be followed by some $OTHER before another string or a comment is found. If we add $OTHER* after each item, we tell the engine that it can immediately flow right into matching $OTHER without bothering with the /g looping. This is similar to the unrollingthe- loop technique. In fact, unrolling the loop gains much of its speed from the way it leads the regex engine to a match, using our global knowledge to create the local optimizations that feed the engine just what it needs to work quickly.
Note that it is very important that this $OTHER, added after each string-matching subexpression, be quantified with star, while the previous $OTHER (the one we moved to the head of the alternation) be quantified by plus. If it's not clear, consider what could happen if the appended $OTHER had plus and there were, say, two double-quoted strings right in a row. Also, if the leading $OTHER used star, it would always match!

These changes yield

($OTHER+|$DOUBLE$OTHER*|$SINGLE$OTHER*)|$COMMENT|$COMMENT2

as the regex, and further cuts the time by an additional five percent.

Let's step back and think about these last two changes. If we go to the trouble of scooping up $OTHER* after each quoted string, there are only two situations in which the original $OTHER+ (which we moved to be the first alternative) can match: 1> at the very start of the whole s/···/···/g, before any of the quoted strings get a chance to match, and 2) after any comment. You might be tempted to think "Hey, to take care of point #2, let's just add $OTHER* after the comments as well!" This would be nice, except everything we want to keep must be inside that first set of parentheses—putting it after the comments would throw out the baby code with the comment bathwater.

So, if the original $OTHER+ is useful primarily only after a comment, do we really want to put it first? I guess that depends on the data—if there are more comments than quoted strings, then yes, placing it first makes sense. Otherwise, I'd place it later. As it turns out with my test data, placing it first yields better results. Placing it later takes away about half the gains we achieved in the last step.

6.7.3 Wrapup

We're not quite done yet. Don't forget, each of the quoted-string subexpressions is ripe for unrolling — heck, we spent a long section of this chapter on that very topic. So, as a final change, let's replace the two string subexpressions with:


    $DOUBLE = qr{"[^"\\]*(?:\\.[^"\\]*)*"};

    $SINGLE = qr{'[^'\\]*(?:\\.[^'\\]*)*'};

This change yields yet another 15 percent gain. Just a few changes has sped things up from 16.4 seconds to 2.3 seconds—a speedup of over 7x.

This last change also shows how convenient a technique it can be to use variables to build up a regular expression. Individual components, such as $DOUBLE, can be considered in relative isolation, and can be changed without having to wade into the full expression. There are still some overall issues (the counting of capturing parentheses, among others) that must be kept in mind, but it's a wonderful technique.

One of the features that makes it so convenient in this case is Perl's qr/···/ operator, which acts like a regex-related type of "string." Other languages don't have this exact functionality, but many languages do have string types that are amenable to building regular expressions. See "Strings as Regular Expressions" starting on Section 3.3.1 .

You'll particularly appreciate the building up of regular expressions this way when you see the raw regex. Here it is, broken across lines to fit the page:


    ([^"\'/]+|"[^"\\]*(?:\\.[^"\\]*)*"[^"\'/]*|'[^'\\]*

    (?:\\.[^'\\]*)*'[^"\'/]*)|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*

< Free Open Study >