Zen and the Art of Efficient Expressions

This section deals presents some techniques you can use to optimize your regular expressions in J2SE. It's designed to help you deal with regular expressions that already exist, as well as establish a framework for writing new ones. The suggestions in this section, along with your own intuition and an active awareness of the nature of your data, will help you optimize your own regex.

Use Noncapturing Groups Where Possible

Capturing groups requires that the JVM keep track of them. This can be very helpful if you need to extract the groups or if you need back references to them later on in the expression. However, if you're using groups strictly for logical purposes, it's worthwhile to make them noncapturing, as this conserves memory use.

The example given earlier for the pattern (?:good|bad|terrible|great) morning groups the various kinds of mornings into a logical unit, but it has no need to capture them. Thus, the group is noncapturing and opens with ?: inside the opening parenthesis.

Precheck Your Candidate Strings

If you're looking for a specific string, you might save CPU cycles by first making sure that the string in question actually exists in your candidate. For example, say you want to parse a string that might contain an e-mail address, and if it does, you want to extract the domain of the e-mail address. It makes sense to first make sure that the candidate contains an @ symbol, and then begin the regex search. Thus, the following is an inexpensive way to check for the necessity of even compiling your pattern:

if (candidate.indexOf("@")) //..try to extract domain information.

Offer the Most Likely Alternative First

Say your regular expression is designed to validate the title of members of an all-physicians' health club. You expect 90 percent of your clients to have the title Dr, but some might not. Thus, you title-matching pattern should probably be something along the lines of .*\b(?:Dr|Mr|Mrs|Miss|Brother|Mister) .*.

This pattern increases the likelihood of a successful match happening sooner rather than later, thus reducing the number of processes the regex engine has to step through. For example, imagine that the candidate string is Please meet Dr Hana Saez. The way the pattern is currently written, the engine will match the D, the r, the space, and everything thereafter on the first pass. Thus, it will never attempt Mr, Mrs, and so on.

However, if the pattern had been .*\b(?:Mr|Mrs|Miss|Brother|Mister| Dr) .*, the engine would have stepped through the entire candidate string once for Mr, then again for Mrs, then again for Miss, then again for Brother, and so on until it finally matched Dr. The actual result would been the same, but the net path there would have been much more resource intensive.

Be as Specific as Possible

If you know that your regex pattern should only match numbers at a given point, then don't use \w to define that part of the match—use \d. This will allow the regex engine to narrow the scope of its search and filter more quickly. Or, if you know that your pattern must contain a given word, then use that very word in the pattern. This is the same principle that makes it easier for you to write code to specific requirements. The more focused and detailed the requirements, the easier your job becomes.

For example, say you know that your candidate string must start with a capital letter, with lowercase letters following the initial character. It's more efficient to use a pattern such as [A-Z][a-z]+ than it is use a pattern such as \w*. Both might work, but [A-Z][a-z]+ allows the engine to produce an accurate result faster than \w*, because it can refuse to consider digits and lowercase letters for the first character, and refuse to consider digits and uppercase letters for the latter parts.

The J2SE regex implementation is much faster with literal strings than it is with characters classes—it can literally zero-in on specific strings. Thus, if you're looking for a number between 10 and 19, it's more efficient to use 1\d than to use \d+, \d\d, \d{2}, or any such variation.

Specify the Position of Your Match

If you know that the candidate string can only occur after a beginning of line, or right before an end of line, or after punctuation, then say so in your regex. The pattern ^Beth will match faster and more efficiently than the pattern Beth, when you know that it must occur after a newline.

This type of optimization can be particularly powerful because it allows the engine to stop searching after examining only the first two characters of the candidate string. Look for opportunities to take advantage of this sort of thing in your regex.

Specify the Size of Your Match

If you're looking for a match that must be at least n characters long, then say so in your regex. Or, if you know that your match can't be more then m characters long, say that too.

For example, imagine that you're parsing a large file for references to first names. It's probably reasonable to assume that the names you want contain fewer than 20 characters. So, although you could use a pattern like \b[A-Z][a-z]+, it's probably better to use something like \b[A-Z][a-z]{1,20}, because the engine can abandon any searches that are longer than 20 characters. Or, if you know that your candidate string must be six or more characters, it's better to use \w\w\w\w\w\w+ than \w+.

Note

J2SE regex finds specific repetitions much more quickly than quantified ones. Thus, \w\w\w\w\w\w is much faster than \w{6}.

Limit the Scope of Your Alternatives

It's generally more efficient to offer small alternatives than to offer large ones, and it's better to offer them later than earlier. Thus, if you have the pattern Good Morning|Good Evening, you would be better served by the pattern Good (?:Morning|Evening). In the latter example, the regex engine doesn't have to make any decisions until after it has established that Good is part of the candidate string.

In the former example, the engine might be obligated to look twice, even if the candidate string doesn't contain the word Good. That is, even if the candidate string is Bad year and can't possibly match, the pattern Good Morning|Good Evening searches twice anyway: once for Good morning and then again for Good Evening.