< Free Open Study > |
6.7 The Freeflowing RegexWe just spent some time constructing a regex to match a C comment, but left off with the problem of how to stop comment-like items within strings from being matched. Using Perl, we might mistakenly try to remove comments with: $prog =~ s{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}{}g; # remove C comments (and more!) Text in the variable $prog that is matched by our regex is removed (that is, replaced by nothing). The problem with this is that there's nothing to stop a match from starting within a string, as in this C snippet: char *CommentStart = "/*"; /* start of comment */ char *CommentEnd = "*/"; /* end of comment */ Here, the underlined portions are what the regex finds, but the bold portions are what we wish to be found. When the engine is searching for a match, it tries to match the expression at each point in the target. Since it is successful only from where a comment begins (or where it looks like one begins), it doesn't match at most locations, so the transmission bump-along bumps us right into the doublequoted string, whose contents look like the start of a comment. It would be nice if we could tell the regex engine that when it hits a double-quoted string, it should zip right on past it. Well, we can. 6.7.1 A Helping Hand to Guide the MatchConsider: $COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; # regex to match a comment $DOUBLE = qr{"(?:\\.|[^"\\])*"}; #regex to match double-quoted string $text =~ s/$DOUBLE|$COMMENT//g; There are two new things here. One is that this time the regex operand, $DOUBLE|$COMMENT, is made up of two variables, each of which is constructed with Perl's special qr/···/ regex-style "double-quoted string" operator. As discussed at length in Chapter 3 (see Section 3.3), one must be careful when using strings that are meant to be interpreted as regular expressions. Perl alleviates this problem by providing the qr/···/ operator, which treats its operand as a regular expression, but doesn't actually apply it. Rather, it returns a "regex object" value that can later be used to build up a larger regular expression. It's extremely convenient, as we saw briefly in Chapter 2 (see Section 2.3.6.7). Like m/···/ and s/···/···/, you can pick delimiters to suit your needs (see Section 2.3.6.4), as we've done here using braces. The other new thing here is the matching of double-quoted strings via the $DOUBLE portion. When the transmission has brought us to a position where the $DOUBLE part can match, it will do so, thereby bypassing the whole string in one fell swoop. It's possible to have both alternatives because they're entirely unambiguous with respect to each other. When applied to a string, as you read from the beginning, any point you that the text is:
This way, the regex will never be started from within a string or comment, the key to its success. Well, actually, right now it isn't helpful yet, since it removes the strings as well as the comments, but a slight change puts us back on track. Consider: $COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; # regex to match a comment $DOUBLE = qr{"(?:\\.|[^"\\])*"}; # Regex to match double-quoted string $text =~ s/ ($DOUBLE)|$COMMENT/ $1/g; The only differences are that we've:
Finally, we need to take care of single-quoted C constants such as '\t' and the like. This is easy—we simply add another alternative inside the parentheses. If we would like to removeC++ /Java/C# style // comments too, that's as simple as adding //[^\n]* as a fourth alternative, outside the parentheses: $COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; # regex to match a comment $COMMENT2 = qr{//[^\n]*}; # regex to match a C++ // comment $DOUBLE = qr{"(?:\\.|[^"\\])*"}; # regex to match double-quoted string $SINGLE = qr{'(?:\\.|[^'\\])*'}; # regex to match single-quoted string $text =~ s/($DOUBLE|$SINGLE)|$COMMENT|$COMMENT2/$1/g; The basic premise is quite slick: when the engine checks the text, it quickly grabs (and if appropriate, removes) these special constructs. On my system, this Perl snippet took about 16.4 seconds to remove all the comments from a 16-megabyte, 500,000-line test file. This is fast, but we'll speed it up considerably. 6.7.2 A Well-Guided Regex is a Fast RegexWith just a little hand holding, we can help direct the flow of the regex engine's attention to match much faster. Let's consider the long spans of normal C code between the comments and strings. For each such character, the regex engine has to try each of the four alternatives to see whether it's something that should be gobbled up, and only if all four fail does it bump-along to bypass the character as uninteresting. This is a lot of work that we really don't need to do. We know, for example, that for any of the alternatives to have a chance at matching, the lead character must be a slash, a single quote, or a double quote. One of these doesn't guarantee a match, but not being one does guarantee a non-match. So, rather than letting the engine figure this out the slow and painful way, let's just tell it directly by adding [^'"/] as an alternative. In fact, any number of such characters in a row can be scooped right up, so let's use [^'"/]+ instead. If you remember the neverending match, you might feel worried about the added plus. Indeed, it could be of great concern if it were within some kind of (···)* loop, but in this stand-alone case it's quite fine (there's nothing that follows that could force it to backtrack at all). So, adding: $OTHER = qr{[^"'/]}; # Stuff that couldn't possibly begin one of the other alternatives . . . $text =~ s/($DOUBLE|$SINGLE|$OTHER+)|$COMMENT|$COMMENT2/$1/g; For reasons that will become apparent after a bit, I've put the plus quantifier after $OTHER, rather than part of the contents of $OTHER. So, I retry my benchmarks, and wow, this one change cuts the time by over 75%! We've crafted the regex to remove most of the overhead of having to try all the alternatives so often. There are still a few cases where none of the alternatives can match (such as at 'c / 3.14'), and at such times, we'll have to be content with the bump-along to get us by. However, we're not done yet—we can still help the engine flow to a faster match:
These changes yield ($OTHER+|$DOUBLE$OTHER*|$SINGLE$OTHER*)|$COMMENT|$COMMENT2 as the regex, and further cuts the time by an additional five percent. Let's step back and think about these last two changes. If we go to the trouble of scooping up $OTHER* after each quoted string, there are only two situations in which the original $OTHER+ (which we moved to be the first alternative) can match: 1> at the very start of the whole s/···/···/g, before any of the quoted strings get a chance to match, and 2) after any comment. You might be tempted to think "Hey, to take care of point #2, let's just add $OTHER* after the comments as well!" This would be nice, except everything we want to keep must be inside that first set of parentheses—putting it after the comments would throw out the baby code with the comment bathwater. So, if the original $OTHER+ is useful primarily only after a comment, do we really want to put it first? I guess that depends on the data—if there are more comments than quoted strings, then yes, placing it first makes sense. Otherwise, I'd place it later. As it turns out with my test data, placing it first yields better results. Placing it later takes away about half the gains we achieved in the last step. 6.7.3 WrapupWe're not quite done yet. Don't forget, each of the quoted-string subexpressions is ripe for unrolling — heck, we spent a long section of this chapter on that very topic. So, as a final change, let's replace the two string subexpressions with: $DOUBLE = qr{"[^"\\]*(?:\\.[^"\\]*)*"}; $SINGLE = qr{'[^'\\]*(?:\\.[^'\\]*)*'}; This change yields yet another 15 percent gain. Just a few changes has sped things up from 16.4 seconds to 2.3 seconds—a speedup of over 7x. This last change also shows how convenient a technique it can be to use variables to build up a regular expression. Individual components, such as $DOUBLE, can be considered in relative isolation, and can be changed without having to wade into the full expression. There are still some overall issues (the counting of capturing parentheses, among others) that must be kept in mind, but it's a wonderful technique. One of the features that makes it so convenient in this case is Perl's qr/···/ operator, which acts like a regex-related type of "string." Other languages don't have this exact functionality, but many languages do have string types that are amenable to building regular expressions. See "Strings as Regular Expressions" starting on Section 3.3.1 . You'll particularly appreciate the building up of regular expressions this way when you see the raw regex. Here it is, broken across lines to fit the page: ([^"\'/]+|"[^"\\]*(?:\\.[^"\\]*)*"[^"\'/]*|'[^'\\]* (?:\\.[^'\\]*)*'[^"\'/]*)|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]* |
< Free Open Study > |