6.5 Techniques for Faster Expressions
The previous sections list the kinds of optimizations that I've seen implemented in
Traditional NFA engines. No one program has them all, and whichever ones your
favorite program happens to have now, they're certain to change sometime in the
future. But, just understanding the kinds of optimizations that can be done gives
you an edge in writing more efficient expressions. Combined with the understanding
of how a Traditional NFA engine works, this knowledge can be applied in
three powerful ways:
Write to the optimizations: Compose expressions such that known optimizations
(or ones that might be added in the future) can kick in. For example, using
xx* instead of x+ can allow a variety of optimizations to more readily kick
in, such as the check of a required character or string (see Section 6.4.4.2),
or initial-character discrimination (see Section 6.4.5.4).
Mimic the optimizations: There are situations where you know your program
doesn't have a particular optimization, but by mimicking the optimization
yourself, you can potentially see a huge savings. As an example that we'll
expand on soon, consider adding (?=t) to the start of this|that, to somewhat
mimic the initial-character discrimination (see Section 6.4.5.4) in systems
that don't already determine from the regex that any match must begin with 't'.
Lead the engine to a match: Use your knowledge of how a Traditional NFA
engine works to lead the engine to a match more quickly. Consider the
this|that example. Each alternative begins with th; if the first alternative
can't match its th, the second alternative's th certainly can't match either,
so the attempt to do so is wasted. To avert that, you can use th(?:is|at)
instead. That way, the th is tested only once, and the relatively expensive
alternation is avoided until it's actually needed. And as a bonus, the
leading raw-text th of th(?:is|at) is exposed, potentially allowing a number
of other optimizations to kick in.
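To make the three approaches concrete, here is a small Perl sketch; the
variable and the sample string are hypothetical stand-ins:

my $text = "this and that";
$text =~ m/xx*/;                 # write to the optimizations (rather than m/x+/)
$text =~ m/(?=t)(?:this|that)/;  # mimic initial-character discrimination
$text =~ m/th(?:is|at)/;         # lead the engine to a match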
It's important to realize that efficiency and optimizations can sometimes be touchy.
There are a number of issues to keep in mind as you read through the rest of this
section:
Making a change that would seem to be certainly helpful can, in some
situations, slow things down because you've just untweaked some other
optimization that you didn't know was being applied.

If you add something to mimic an optimization that you know doesn't exist,
it may well turn out that the work required to process what you added
actually takes more time than it saves.

If you add something to mimic an optimization that you know doesn't
currently exist, it may defeat or duplicate the real optimization if it's
later added when the tool is upgraded.

Along the same lines, contorting an expression to try to pique one kind of
optimization today may prohibit some future, more advantageous optimization
from kicking in when the tool is upgraded.

Contorting an expression for the sake of efficiency may make the expression
more difficult to understand and maintain.

The magnitude of the benefit (or harm) a particular change can have is
almost certainly strongly dependent on the data it's applied to. A change
that is beneficial with one set of data may actually be harmful with another
type of data.
Let me give a somewhat crazy example: you find (000|999)$ in a Perl script, and
decide to turn those capturing parentheses into non-capturing parentheses. This
should make things a bit faster, you think, since the overhead of capturing can
now be eliminated. But surprise, this small and seemingly beneficial change can
slow this regex down by several orders of magnitude (thousands and thousands of
times slower). What!? It turns out that a number of factors come together just right
in this example to cause the end of string/line anchor optimization (see Section 6.4.5.4) to be
turned off when non-capturing parentheses are used. I don't want to dissuade you
from using non-capturing parentheses with Perl—their use is beneficial in the vast
majority of cases—but in this particular case, it's a disaster.
So, testing and benchmarking with the kind of data you expect to use in practice
can help tell you how beneficial or harmful any change will be, but you've still got
to weigh all the issues for yourself. That being said, I'll touch on some techniques
that can be used toward squeezing the last bit of efficiency out of an engine.
6.5.1 Common Sense Techniques
Some of the most beneficial things you can do require only common sense.
6.5.1.1 Avoid recompiling
Compile or define the regular expression as few times as possible. With object-oriented
handling (see Section 3.2.2), you have the explicit control to do this. If, for example,
you want to apply a regex in a loop, create the regex object outside of the loop,
then use it repeatedly inside the loop.
With a procedural approach, as with GNU Emacs and Tcl, try to keep the number
of regular expressions used within a loop below the cached threshold of the tool
(see Section 6.4.4.1.2).
With an integrated approach like Perl, try not to use variable interpolation within a
regex inside a loop, because at a minimum, it causes the regex value to be reevaluated
at each iteration, even if you know the value never changes. (Perl does,
however, provide efficient ways around the problem; see Section 7.9.2.)
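In Perl, for example, qr// gives you that explicit control. A minimal
sketch, with made-up data and a made-up pattern:

my @lines = <STDIN>;                  # whatever data you're processing
my $regex = qr/\(0x[0-9a-fA-F]+\)/;   # compiled just once, outside the loop
foreach my $line (@lines) {
    print $line if $line =~ $regex;   # reused each iteration; no recompilation
}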
6.5.1.2 Use non-capturing parentheses
If you don't use the capturing aspect of capturing parentheses, use
non-capturing (?:···) parentheses (see Section 2.2.3.1). Besides the direct
savings of not having to capture,
there can be residual savings because it can make the state needed for backtracking
less complex, and hence faster. It can also open up additional optimizations,
such as needless-parentheses elimination (see Section 6.4.6.3).
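For example, in Perl (the pattern is an arbitrary illustration):

my $year = "1999";
$year =~ m/^(?:19|20)\d\d$/;   # non-capturing: no $1 bookkeeping
$year =~ m/^(19|20)\d\d$/;     # capturing: fills $1 even if it's never read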
6.5.1.3 Don't add superfluous parentheses
Use parentheses as you need them, but adding them otherwise can prohibit optimizations
from kicking in. Unless you need to know the last character matched by .*,
don't use (.)*. This may seem obvious, but after all, this is the "common
sense techniques" section.
6.5.1.4 Don't use superfluous character classes
This may seem to be overly obvious as well, but I've often seen expressions
like ^.*[:] from novice programmers. I'm not sure why one would ever use a
class with a single character in it: it incurs the processing overhead of a
class without gaining any multi-character matching benefits of a class. I
suppose that when the character is a metacharacter, such as [.] and [*],
it's probably because the author didn't know about escaping, as with \. and
\*. I see this most often with whitespace in a free-spacing mode (see
Section 3.3.3.2).
Somewhat related, users of Perl who read the first edition of this book may
sometimes write something like ^[Ff][Rr][Oo][Mm]: instead of a
case-insensitive use of ^from:. Old versions of Perl were very inefficient
with their case-insensitive matching, so I recommended the use of classes
like this in some cases. That recommendation has been lifted, as the
case-insensitive inefficiency has been fixed for some years now.
6.5.1.5 Use leading anchors
Except in the most rare cases, any regex that begins with .* should probably
have ^ or \A (see Section 3.4.3) added to the front. If such a regex can't
match when applied at the beginning of the string, it won't be able to match
any better when the bump-along applies it starting at the second character,
third character, and so on. Adding the anchor (either explicitly, or
auto-added via an optimization; see Section 6.4.5.4) allows the common
start-of-line anchor optimization to kick in, saving a lot of wasted effort.
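For example, with an illustrative Perl pattern:

my $line = "no colon here at all";
$line =~ m/.*:/;    # unanchored: relies on the engine noticing the leading .* (not all do)
$line =~ m/^.*:/;   # anchored: a failure is recognized after one overall attempt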
6.5.2 Expose Literal Text
Many of the native optimizations we've seen in this chapter hinge on the regex
engine's ability to recognize that there is some span of literal text that must be part
of any successful match. Some engines are better at figuring this out than others,
so here are some hand-optimization techniques that help "expose" literal text,
increasing the chances that an engine can recognize more of it, allowing the various
literal-text optimizations to kick in.
6.5.2.1 "Factor out" required components from quantifiers
Using xx* instead of x+ exposes 'x' as being required. The same logic
applies to the rewriting of -{5,7} as ------{0,2}.
6.5.2.2 "Factor out" required components from the front of alternation
Using th(?:is|at) rather than (?:this|that) exposes that th is required.
You can also "factor out" on the right side, when the common text follows
the differing text: (?:optim|standard)ization. As the next section
describes, these can be particularly important when what is being factored
out includes an anchor.
6.5.3 Expose Anchors
Some of the most fruitful internal regex optimizations are those that take advantage
of anchors (like ^, $, and \G) that tie the expression to one end of the
target string or another. Some engines are not as good as others at
understanding when such an optimization can take place, but there are
techniques you can use to help.
6.5.3.1 Expose ^ and \G at the front of expressions
^abc|^123 and ^(?:abc|123) are logically the same expression, but many more
regex engines can apply the Start of string/line anchor optimization (see
Section 6.4.5) with the first than the second. So, choosing to write it the
first way can make it much more efficient. PCRE (and tools that use it) is
efficient with either, but most other NFA tools are much more efficient with
the exposed version.
Another difference can be seen by comparing (^abc) and ^(abc). The former
doesn't have many redeeming qualities, as it both "hides" the anchor, and
causes the capturing parentheses to be entered before the anchor is even
checked, which can be inefficient with some systems. Some systems (PCRE,
Perl, the .NET languages) are efficient with either, but others (Ruby and
Sun's Java regex library) recognize the optimization only with the exposed
version.
Python doesn't seem to have the anchor optimization, so these techniques don't
currently matter for it. Of course, most optimizations in this chapter don't apply to
Tcl (see Section 6.4.4.1.2).
6.5.3.2 Expose $ at the end of expressions
This is conceptually very similar to the previous section, where abc$|123$
and (?:abc|123)$ are logically the same expression, but can be treated
differently by the optimizers. Currently, there is a difference only for
Perl, as only Perl currently has the End of string/line anchor optimization
(see Section 6.4.5.4). The optimization kicks in with (···|···)$ but not
with (···$|···$).
6.5.4 Lazy Versus Greedy: Be Specific
Usually, the choice between lazy and greedy quantifiers is dictated by the specific
needs of the regex. For example, ^.*: differs substantially from ^.*?: in
that the former matches until the final colon, while the latter matches
until the first. But, suppose that you knew your target data had exactly
one colon on it. In that case, the semantics of both are the same ("match
until the colon"), so it's probably smart to pick the one that will run
fastest.
It's not always obvious which is best, but as a rule of thumb when the
target strings are long: if you expect the colon to generally be near the
start of the string, using the lazy quantifier allows the engine to find the
colon sooner; use the greedy quantifier if you expect the colon to be toward
the end of the string. If the data is random, and you have no idea which
will be more likely, use a greedy quantifier, as they are generally
optimized a bit better than lazy quantifiers, especially when what follows
in the regex disallows the character-following-lazy-quantifier optimization
(see Section 6.4.6.5).
When the strings to be tested are short, it becomes even less clear. When it comes
down to it, either way is pretty fast, but if you need every last bit of speed, benchmark
against representative data.
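In Perl, for instance, a rough comparison might look like this; the test
strings are arbitrary stand-ins for your real data:

use Benchmark qw(cmpthese);
my $colon_early = ':' . ('x' x 10_000);   # colon near the start
my $colon_late  = ('x' x 10_000) . ':';   # colon near the end
cmpthese(-2, {
    lazy_early   => sub { $colon_early =~ m/^.*?:/ },
    greedy_early => sub { $colon_early =~ m/^.*:/  },
    lazy_late    => sub { $colon_late  =~ m/^.*?:/ },
    greedy_late  => sub { $colon_late  =~ m/^.*:/  },
});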
A somewhat related issue arises in situations where either a lazy quantifier
or a negated class can be used (such as ^.*?: versus ^[^:]*:); which should
be used? Again, this is dependent on the data and the language, but with
most engines, using a negated class is much more efficient than a lazy
quantifier. One exception is Perl, because it has that
character-following-lazy-quantifier optimization.
6.5.5 Split Into Multiple Regular Expressions
There are cases where it's much faster to apply many small regular expressions
instead of one large one. For a somewhat contrived example, if you wanted to
check a large string to see if it had any of the month names, it would probably be
much faster to use separate checks of January, February, March, etc., than
to use one January|February|March|···. With the latter, there's no literal
text known to be required for any match, so an embedded literal string check
optimization (see Section 6.4.5.5) is not possible. With the all-in-one
regex, the mechanics of testing each subexpression at each point in the text
can be quite slow.
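Here's a Perl sketch of the separate-checks idea (the month list is
abbreviated and the target string is hypothetical); since each check is pure
literal text, index() serves as the purest form of a literal-string search:

my $big_string = "Report for February follows";
my $has_month = 0;
foreach my $name (qw(January February March December)) {
    if (index($big_string, $name) >= 0) {   # plain substring search
        $has_month = 1;
        last;                               # stop at the first hit
    }
}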
Here's an interesting situation I ran into at about the same time that I was writing
this section. When working with a Perl data-handling module, I realized that I had
a bug with my client program that caused it to send bogus data that looked like
'HASH(0x80f60ac)' instead of the actual data. So, I thought I'd augment the module
to look for that kind of bogus data and report an error. The straightforward
regex for what I wanted is \b(?:SCALAR|ARRAY|···|HASH)\(0x[0-9a-fA-F]+\).
This was a situation where efficiency was extremely important. Would this be fast?
Perl has a debugging mode that can tell you about some of the optimizations it
uses with any particular regex (see Section 7.9.6), so I checked. I hoped to find that the precheck of required string optimization (see Section 6.4.4.2) would kick in, since an advanced
enough engine should be able to figure out that '(0x' is required in any match.
Knowing the data that I'd apply this to would almost never have '(0x', I knew that
such a pre-check would eliminate virtually every line. Unfortunately, Perl didn't
pick this out, so I was left with a regex that would entail a lot of alternation at
every character of every target string. That's slower than I wanted.
Since I was in the middle of researching and writing about optimizations, I
thought hard about how I could rewrite the regex to garner some of the better
optimizations. One thought I had was to rewrite it along the form of the
somewhat complex \(0x(?<=(?:SCALAR|···|HASH)\(0x)[0-9a-fA-F]+\). The
approach here is that once \(0x has matched, the lookbehind makes sure that
what came before is allowed, and then checks that what comes after is
expected as well. The whole reason to go through these regex gymnastics is
to get the regex to lead with the non-optional literal text \(0x, which
allows a lot of good optimizations to kick in. In particular, I'd expect the
pre-check of required string optimization to kick in, as well as the initial
character/substring discrimination optimization (see Section 6.4.5.4). I'm
sure that these would have made it very fast, but Perl doesn't allow
variable-length lookbehind (see Section 3.4.3.6), so I was back to square
one.
However, I realized that since Perl wasn't doing the pre-check for \(0x for
me, I could just do it myself:
if ($data =~ m/\(0x/
    and $data =~ m/(?:SCALAR|ARRAY|···|HASH)\(0x[0-9a-fA-F]+\)/)
{
    # warn about bogus data···
}
The check of \(0x eliminates virtually every line, so this leaves the check
of the relatively slow full regex for only when the likelihood of a match is
high. This created
a wonderful balance of efficiency (very high) and readability (very high).
6.5.6 Mimic Initial-Character Discrimination
If the initial-character discrimination optimization (see Section 6.4.5.4)
is not done by your implementation, you can mimic it yourself by adding
appropriate lookahead (see Section 3.4.3.6) to the start of the regex. The
lookahead can "pre-check" that you're at an appropriate starting character
before you let the rest of the regex match. For example, for
Jan|Feb|···|Dec, use (?=[JFMASOND])(?:Jan|Feb|···|Dec). The leading
[JFMASOND] represents the letters that can begin the month names in English.
This must be done with care, though, because the added overhead of the lookahead
may overshadow the savings. In this particular example, where the lookahead is
pre-checking for many alternatives that are likely to fail, it is beneficial for most
systems I've tested (Java, Perl, Python, Ruby, .NET languages, and PCRE), none of
which apparently are able to derive [JFMASOND] from Jan|Feb|···|Dec
themselves. (Tcl, of course, can do it perfectly; see Section 6.4.4.1.2.)
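Spelled out in Perl, with the alternatives the '···' elides written in full
(the sample string is hypothetical):

my $month = qr/(?=[JFMASOND])(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/;
my $log_line = "Jan 12 00:00:01 host message";
print "found a month\n" if $log_line =~ $month;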
A behind-the-scenes check of [JFMASOND] by an engine's native optimization
is certainly faster than the same check explicitly added by us to the regex
proper. Is there a way we can modify the regex so that the engine will check
it natively? Well, with many systems, you can, by using the horribly
contorted:
[JFMASOND](?:(?<=J)an|(?<=F)eb|···|(?<=D)ec)
I don't expect you to be able to understand that regex at first sight, but taking the
time to understand what it does, and how, is a good exercise. The simple class
leading the expression can be picked up by most systems' initial-character discrimination
optimization, thereby allowing the transmission itself to effectively
pre-check [JFMASOND]. If the target string has few matching characters, the
result can be substantially faster than the Jan|···|Dec original, or our
prepended-lookahead.
But, if the target string has many first-character matches, the extra overhead
of all the added lookbehind can actually make things slower. On top of this worry,
it's certainly a much less readable regular expression. But, the exercise is interesting
and instructive nevertheless.
6.5.6.1 Don't do this with Tcl
The previous example shows how hand tweaking has the potential to really make
things worse. The sidebar in Section 6.4.4.1.1 notes that regular expressions in Tcl are
mostly immune to the form of the expression, so for the most part attempts to
hand optimize are meaningless. Well, here's an example where it does matter.
Adding the explicit (?=[JFMASOND]) pre-check causes Tcl to slow down by a
factor of about 100 in my tests.
6.5.7 Use Atomic Grouping and Possessive Quantifiers
There are many cases when atomic grouping (see Section 3.4.5.3) and
possessive quantifiers (see Section 3.4.5.10) can greatly increase the match
speed, even though they don't change the kind of matches that are possible.
For example, if ^[^:]+: can't match the first time the colon is attempted,
it certainly can't match after backtracking back into the [^:]+, since any
character "given up" by that backtracking, by definition, can't match a
colon. The use of atomic grouping ^(?>[^:]+): or a possessive quantifier
^[^:]++: causes the states from the plus to be thrown away, or not created
in the first place. Since this leaves nothing for the engine to backtrack
to, it ensures that it doesn't backtrack unfruitfully. (The sidebar in
Section 6.4.6.8 suggests that this can be done automatically by a smart
enough engine.)
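In Perl (version 5.10 or later for the possessive form), the two flavors
look like this; the sample string is hypothetical:

my $field = "user: jfriedl";
$field =~ m/^(?>[^:]+):/;   # atomic grouping: the plus's states are discarded
$field =~ m/^[^:]++:/;      # possessive quantifier: the states are never kept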
However, I must stress that misusing either of these constructs can inadvertently
change what kind of matches are allowed, so great care must be taken. For example,
using them with ^.*:, as with ^(?>.*):, guarantees failure. The entire line
is matched by .*, and this includes any colon that the later : needs. The
atomic grouping removes the ability for the backtracking required to let :
match, so failure is guaranteed.
6.5.8 Lead the Engine to a Match
One concept that goes a long way toward more efficient NFA regular expressions
is pushing "control" issues as far back in the matching process as possible. One
example we've seen already is the use of th(?:is|at) instead of this|that.
With the latter, the alternation is a top-level control issue, but with the
former, the relatively expensive alternation is not considered until th has
been matched.
The next section, "Unrolling the Loop," is an advanced form of this, but there are a
few simple techniques I can mention here.
6.5.8.1 Put the most likely alternative first
Throughout the book, we've seen a number of situations where the order in
which alternatives are presented matters greatly (see Section 1.5.4.4, Section 4.5.10.1, Section 5.2.2.1, Section 5.4.2.1). In such situations,
the correctness of the match takes precedence over optimization, but otherwise,
if the order doesn't matter to the correctness, you can gain some efficiency
by placing the most-likely alternatives first.
For example, when building a regex to match a hostname (see Section 5.3.5)
and listing the final domain parts, some might find it appealing to list
them in alphabetical order, as with (?:aero|biz|com|coop|···). However, some
of those early in the list are new and not currently popular, so why waste
the time to check for them first when you know they will likely fail most of
the time? An arrangement with the more popular first, such as
(?:com|edu|org|net|···), is likely to lead to a match more quickly, more
often.
Of course, this matters only for a Traditional NFA engine, and then, only for when
there is a match. With a POSIX NFA, or with a failure, all alternatives must be
checked and so the ordering doesn't matter.
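One way to keep such an ordering maintainable in Perl is to build the
alternation from a most-likely-first list (the ordering below is an
assumption; measure it against your own data):

my @tlds = qw(com net org edu gov);         # most likely first
my $alternation = join '|', @tlds;
my $tld_regex = qr/\.(?:$alternation)\b/;   # e.g., matches ".com" in "www.example.com"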
6.5.8.2 Distribute into the end of alternation
Continuing with a convenient example, compare (?:com|edu|···|[a-z][a-z])\b
with com\b|edu\b|···\b|[a-z][a-z]\b. In the latter, the \b after the
alternation has been distributed onto the end of each alternative. The
possible benefit is that it may allow an alternative that matches, but whose
match would have been undone by the \b after the alternation, to fail a bit
more quickly, inside the alternation. This allows the failure to be
recognized before the overhead of exiting the alternation is incurred.
This is perhaps not the best example to show the value of this technique, since it
shows promise only for the specific situation when an alternative is likely to
match, but what comes right after is likely to fail. We'll see a better example of this
concept later in this chapter—look for the discussion of $OTHER* in Section 6.7.3.
This optimization can be dangerous. One very important concern in applying
this hand optimization is that you take care not to defeat more profitable
native optimizations. For example, if the "distributed" subexpression is
literal text, as with the distribution of the colon from (?:this|that): to
this:|that:, you're directly conflicting with some of the ideas in the
"Expose Literal Text" section (see Section 6.5.2). All things being equal, I
think that those optimizations would be much more fruitful, so be careful
not to defeat them in favor of this one.
A similar warning applies to distributing a regex-ending $ on systems that
benefit from an exposed end-anchor (see Section 6.5.3.1). On such systems,
(?:com|edu|···)$ is much faster than the distributed com$|edu$|···$. (Among
the many systems I tested, only Perl currently supports this.)