Previous Section  < Free Open Study >  Next Section

5.2 A Few Short Examples

5.2.1 Continuing with Continuation Lines

With the continuation-line example from the previous chapter (see Section 4.6.2), we found that figs/boxdr.jpg^\w+=.*(\\\n.*)* figs/boxul.jpg applied with a Traditional NFA doesn't properly match both lines of:

     SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \

            missing.c msg.c node.c re.c version.c

The problem is that the first figs/boxdr.jpg.*figs/boxul.jpg matches past the backslash, pulling it out from under the figs/boxdr.jpg(\\\n.*)*figs/boxul.jpg that we want it to be matched by. Well, here's the first lesson of the chapter: if we don't want to match past the backslash, we should say that in the regex. We can do this by changing each dot to figs/boxdr.jpg[^\n\\]figs/boxul.jpg . (Notice how I've made sure to include \n in the negated class? You'll remember that one of the assumptions of the original regex was that dot didn't match a newline, and we don't want its replacement to match a newline either see Section 3.4.2.2.)

Making that change, we get:

figs/boxdr.jpg^\w+=[^\n\\]*(\\\n[^\n\\]*)*figs/boxul.jpg

This now works, properly matching continuation lines, but in solving one problem, we have perhaps introduced another: we've now disallowed backslashes other than those at the end of lines. This is a problem if the data to which it will be applied could possibly have other backslashes. We'll assume it could, so we need to accommodate the regex to handle them.

So far, our approaches have been along the lines of "match the line, then try to match a continuation line if there." Let's change that approach to one that I find often works in general: concentrate on what is really allowed to match at any particular point. As we match the line, we want either normal (non-backslash, non-newline) characters, or a backslash-anything combination. If we use figs/boxdr.jpg\\.figs/boxul.jpg for the backslash-anything combination, and apply it in a dot-matches-all mode, it also can match the backslash-newline combination.

So, the expression becomes figs/boxdr.jpg^\w+=([^\n\\]|\\.)* figs/boxul.jpg in a dot-matches-all mode. Due to the leading figs/boxdr.jpg^figs/boxul.jpg , an enhanced line anchor match mode (see Section 3.3.3.4) may be useful as well, depending on how this expression is used.

But, we're not quite done with this example yet—we'll pick it up again in the next chapter where we work on its efficiency (see Section 6.6.7).

5.2.2 Matching an IP Address

As another example that we'll take much further, let's match an IP (Internet Protocol) address: four numbers separated by periods, such as 1.2.3.4. Often, the numbers are padded to three digits, as in 001.002.003.004. If you want to check a string for one of these, you could use figs/boxdr.jpg[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*figs/boxul.jpg , but that's so vague that it even matches 'and then.....?'. Look at the regex: it doesn't even require any numbers—its only requirements are three periods (with nothing but digits, if anything, between).

To fix this regex, we first change the star to a plus, since we know that each number must have at least one digit. To ensure that the entire string is only the IP address, we wrap the regex with figs/boxdr.jpg^···$figs/boxul.jpg . This gives us:

figs/boxdr.jpg^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$figs/boxul.jpg

Using figs/boxdr.jpg\dfigs/boxul.jpg instead of figs/boxdr.jpg[0-9]figs/boxul.jpg , it becomes figs/boxdr.jpg^\d+\.\d+\.\d+\.\d+$figs/boxul.jpg , which you may find to be more easily readable, [1] but it still matches things that aren't IP addresses, like figs/boxdr.jpg1234.5678.9101112.131415figs/boxul.jpg . (IP addresses have each number in the range of 0-255.) As a start, you can enforce that each number be three digits long, with figs/boxdr.jpg^\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d$figs/boxul.jpg . but now we are too specific. We still need to allow one- and two-digit numbers (as in 1.234.5.67). If the flavor supports {min,max}, you can use figs/boxdr.jpg^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$figs/boxul.jpg . If not, you can always use figs/boxdr.jpg\d\d?\d?figs/boxul.jpg or figs/boxdr.jpg\d(\d\d?)?figs/boxul.jpg for each part. These allow one to three digits, each in a slightly different way.

[1] Or maybe not — it depends on what you are used to. In a complex regex, I find figs/boxdr.jpg\dfigs/boxul.jpg more readable than figs/boxdr.jpg[0-9]figs/boxul.jpg , but note that on some systems, the two might not be exactly the same. Systems that support Unicode, for example, may have their figs/boxdr.jpg\dfigs/boxul.jpg match non-ASCII digits as well (see Section 3.4.2.4).

Depending on your needs, you might be happy with some of the various degrees of vagueness in the expressions so far. If you really want to be strict, you have to worry that figs/boxdr.jpg\d{1,3}figs/boxul.jpg can match 999, which is above 255, and thus an invalid component of an IP address.

Several approaches would ensure that only numbers from 0 to 255 appear. One silly approach is figs/boxdr.jpg0|1|2|3|···253|254|255figs/boxul.jpg . Actually, this doesn't allow the zeropadding that is allowed, so you really need figs/boxdr.jpg0|00|000|1|01|001|···figs/boxul.jpg , whose length becomes even more ridiculous. For a DFA engine, it is ridiculous only in that it's so long and verbose — it still matches just as fast as any regex describing the same text. For an NFA, however, all the alternation kills efficiency.

A realistic approach concentrates on which digits are allowed in a number, and where. If a number is only one or two digits long, there is no worry as to whether the value is within range, so figs/boxdr.jpg\d|\d\dfigs/boxul.jpg takes care of it. There's also no worry about the value for a three-digit number beginning with a 0 or 1, since such a number is in the range 000-199 and is perfectly valid. This lets us add figs/boxdr.jpg[01]\d\dfigs/boxul.jpg , leaving us with figs/boxdr.jpg\d|\d\d|[01]\d\dfigs/boxul.jpg . You might recognize this as being similar to the time example in Chapter 1 (see Section 1.5.4.4), and date example of the previous chapter (see Section 4.6).

Continuing with our regular expression, a three-digit number beginning with a 2 is allowed if the number is 255 or less, so a second digit less than 5 means the number is valid. If the second digit is 5, the third must be less than 6. This can all be expressed as figs/boxdr.jpg2[0-4]\d|25[0-5]figs/boxul.jpg .

This may seem confusing at first, but the approach should make sense upon reflection. The result is figs/boxdr.jpg\d|\d\d|[01]\d\d|2[0-4]\d|25[0-5]figs/boxul.jpg . Actually, we can combine the first three alternatives to yield figs/boxdr.jpg [01]?\d\d?|2[0-4]\d|25[0-5]figs/boxul.jpg . Doing so is more efficient for an NFA, since any alternative that fails results in a backtrack. Note that using figs/boxdr.jpg\d\d? figs/boxul.jpg in the first alternative, rather than figs/boxdr.jpg\d?\dfigs/boxul.jpg , allows an NFA to fail just a bit more quickly when there is no digit at all. I'll leave the analysis to you—walking through a simple test case with both should illustrate the difference. We could do other things to make this part of the expression more efficient, but I'll leave that for the Chapter 6.

Now that we have a subexpression to match a single number from 0 through 255, we can wrap it in parentheses and insert it in place of each figs/boxdr.jpg\d{1,3}figs/boxul.jpg in the earlier regex. This gives us (broken across lines to fit the width of the page):

figs/boxdr.jpg^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$figs/boxul.jpg

Quite a mouthful! Was it worth the trouble? You have to decide for yourself based upon your own needs. It matches only syntactically correct IP addresses, but it can still match semantically incorrect ones, such as 0.0.0.0 (invalid because all the digits are zero). With lookahead (see Section 3.4.3.6), you can disallow that specific case by putting figs/boxdr.jpg(?!0+\.0+\.0+\.0+$)figs/boxul.jpg after figs/boxdr.jpg^figs/boxul.jpg , but at some point, you have to decide when being too specific causes the cost/benefit ratio to suffer from diminishing returns. Sometimes it's better to take some of the work out of the regex. For example, if you go back to figs/boxdr.jpg^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$figs/boxul.jpg and wrap each component in parentheses to stuff the numbers into the program's version of $1, $2, $3, and $4, you can then validate them by other programming constructs.

5.2.2.1 Know your context

It's important to realize that the two anchors, figs/boxdr.jpg^figs/boxul.jpg and figs/boxdr.jpg$figs/boxul.jpg , are required to make this regex work. Without them, it can match ip=72123.3.21.993, or for a Traditional NFA, even ip=123.3.21.223.

In that second case, the expression does not even fully match the final 223 that should have been allowed. Well, it is allowed, but there's nothing (such as a separating period, or the trailing anchor) to force that match. The final group's first alternative, figs/boxdr.jpg[01]?\d\d?figs/boxul.jpg , matched the first two digits, and without the trailing figs/boxdr.jpg$figs/boxul.jpg , that's the end of the regex. As with the date-matching problem in the previous chapter (see Section 4.5.10.1), we can arrange the order of the alternatives to achieve the desired result. In this case, we would put the alternatives matching three digits first, so any proper three-digit number is matched in full before the two-digit-okay alternative is given a chance. (DFAs and POSIX NFAs don't require the reordering, of course, since they choose the longest match, regardless.)

Rearranged or not, that first mistaken match is still a problem. "Ah!" you might think, "I can use word boundary anchors to solve this problem." Unfortunately, that's probably not enough, since such a regex could still match 1.2.3.4.5.6. To disallow embedded matches, you must ensure that the surrounding context has at least no alphanumerics or periods. If you have lookaround, you can wrap the regex in figs/boxdr.jpg(?<![\w.])···(?![\w.])figs/boxul.jpg to disallow matches that follow just after (or end just before) where figs/boxdr.jpg[\w.]figs/boxul.jpg can match. If you don't have lookaround, simply wrapping it in figs/boxdr.jpg(^|•)···(•|$)figs/boxul.jpg might be satisfactory for some situations.

5.2.3 Working with Filenames

Working with file and path names, like /usr/local/bin/perl on Unix, or perhaps something like \Program Files\Yahoo!\Messenger on Windows, can provide many good regular-expression examples. Since "using" is more interesting than "reading," I'll sprinkle in a few examples coded in Perl, Java, and VB.NET. If you're not interested in these particular languages, feel free to skip the code snippets—it's the regex concepts used in them that are important.

5.2.3.1 Removing the leading path from a filename

As a first example, let's remove the leading path from a filename, turning /usr/local/bin/gcc, for instance, into gcc. Stating problems in a way that makes solutions amenable is half of the battle. In this case, we want to remove anything up to (and including) the final slash (backslash for Windows pathnames). If there is no slash, it's fine as is, and nothing needs to be done. I've said a number of times that figs/boxdr.jpg.*figs/boxul.jpg is often overused, but its greediness is desired here. With figs/boxdr.jpg^.*/figs/boxul.jpg , the figs/boxdr.jpg.*figs/boxul.jpg consumes the whole line, but then backs off (that is, backtracks) to the last slash to achieve the match.

Here's code to do it in our three test languages, ensuring that a filename in the variable f has no leading path. First, for Unix filenames:

Language Code Snippet
Perl $f =~ s{ ^.*/ }{};
java.util.regex f = f.replaceFirst("^.*/", "");
VB.NET f = Regex.Replace(f, "^.*/", "")

The regular expression (or string to be interpreted as a regular expression) is underlined, and regex components are bold.

For comparison, here they are for Windows filenames:

Language Code Snippet
Perl $f =~ s/ ^.*\\ //;
java.util.regex f = f.replaceFirst("^.*\\\\", "");
VB.NET f = Regex.Replace(f, "^.*\\", "")

It's interesting to compare the differences needed for each language when going from one example to the other, particularly the quadruple backslashes needed in Java (see Section 3.2.5).

Please keep in mind this key point: always consider what will happen if there is no match. In this case, if there is no slash in the string, no substitution is done and the string is left unchanged. That's just what we want.

For efficiency's sake, it's important to remember how the regex engine goes about its work, if it is NFA-based. Let's consider what happens if we omit the leading caret (something that's easy to forget) and match against a string that doesn't happen to have a slash. As always, the regex engine starts the search at the beginning of the string. The figs/boxdr.jpg.*figs/boxul.jpg races to the end of the string, but must back off to find a match for the slash or backslash. It eventually backs off everything that figs/boxdr.jpg.*figs/boxul.jpg had gobbled up, yet there's still no match. So, the regex engine decides that there is no possible match when starting from the beginning of the string, but it's not done yet!

The transmission kicks in and retries the whole regex from the second character position. In fact, it needs (in theory) to go through the whole scan-and-backtrack routine for each possible starting position in the string. Filenames tend to be short, so it's probably not such a big deal in this case, but the principle applies to many situations. Were the string long, there's a potential for a lot of backtracking. (A DFA has no such problem, of course.)

In practice, a reasonably optimized transmission realizes that almost any regex starting with figs/boxdr.jpg.*figs/boxul.jpg that fails at the beginning of the string can never match when started from anywhere else, so it can shift gears and attempt the regex only the one time, at the start of the string (see Section 6.4.5.2). Still, it's smarter to write that into our regex in the first place, as we originally did.

5.2.3.2 Accessing the filename from a path

Another approach is to bypass the path and simply match the trailing filename part without the path. The final filename is everything at the end that's not a slash: figs/boxdr.jpg[^/]*$figs/boxul.jpg . This time, the anchor is not just an optimization; we really do need dollar at the end. We can now do something like this, shown with Perl:

$WholePath =~ m{ ([^/]*)$ }; # Check variable $WholePath with regex.

$FileName = $1; # Note text matched

You'll notice that I don't check to see whether the regex actually matches, because I know it will match every time. The only requirement of that expression is that the string has an end to match dollar, and even an empty string has an end. Thus, when I use $1 to reference the text matched within the parenthetical subexpression, I'm assured it will have some value (although that value will be empty when the filename ends with a slash).

Another comment on efficiency: with an NFA, figs/boxdr.jpg[^/]*$figs/boxul.jpg is very inefficient. Carefully run through how the NFA engine attempts the match and you see that it can involve a lot of backtracking. Even the short sample '/usr/local/bin/perl' backtracks over 40 times before finally matching. Consider the attempt that starts at ···figs/U25B4small.giflocal/···. Once figs/boxdr.jpg[^/]*figs/boxul.jpg matches through to the second l and fails on the slash, the figs/boxdr.jpg$figs/boxul.jpg is tried (and fails) for each l, a, c, o, l saved state. If that's not enough, most of it is repeated with the attempt that starts at ···lfigs/U25B4small.gifocal/···, and then again ···lofigs/U25B4small.gifcal/···, and so on.

It shouldn't concern us too much with this particular example, as filenames tend to be short. (And 40 backtracks is nothing — 40 million is when they really matter!) Again, it's important to be aware of the issues so the general lessons here can be applied to your specific needs.

This is a good time to point out that even in a book about regular expressions, regular expressions aren't always The Best Answer. For example, most programming languages provide non-regex routines for dealing with filenames. But, for the sake of discussion, I'll forge ahead.

5.2.3.3 Both leading path and filename

The next logical step is to pick apart a full path into both its leading path and filename component. There are many ways to do this, depending on what we want. Initially, you might want to use figs/boxdr.jpg^(.*)/(.*)$figs/boxul.jpg to fill $1 and $2 with the requisite parts. It looks like a nicely balanced regular expression, but knowing how greediness works, we are guaranteed that the first figs/boxdr.jpg.*figs/boxul.jpg does what we want, never leaving anything with a slash for $2. The only reason the first figs/boxdr.jpg.*figs/boxul.jpg leaves anything at all is due to the backtracking done in trying to match the slash that follows. This leaves only that "backtracked" part for the later figs/boxdr.jpg.*figs/boxul.jpg . Thus, $1 is the full leading path and $2 the trailing filename.

One thing to note: we are relying on the initial figs/boxdr.jpg(.*)/figs/boxul.jpg to ensure that the second figs/boxdr.jpg(.*)figs/boxul.jpg does not capture any slash. We understand greediness, so this is okay. Still I like to be specific when I can, so I'd rather use figs/boxdr.jpg[^/]*figs/boxul.jpg for the filename part. That gives us figs/boxdr.jpg^(.*)/([^/]*)$figs/boxul.jpg . Since it shows exactly what we want, it acts as documentation as well.

One big problem is that this regex requires at least one slash in the string, so if we try it on something like file.txt, there's no match, and thus no information. This can be a feature if we deal with it properly:

if ( $WholePath =~ m!^(.*)/([^/]*)$! ) {

# Have a match -- $1 and $2 are valid

$LeadingPath = $1;

$FileName = $2;

} else {

# No match, so there's no '/' in the filename

$LeadingPath = "."; # so "file.txt" looks like ". / file.txt" ("." is the current directory)

$FileName = $WholePath;

}

5.2.4 Matching Balanced Sets of Parentheses

Matching balanced sets of parentheses, brackets, and the like presents a special difficulty. Wanting to match balanced parentheses is quite common when parsing many kinds of configuration files, programs, and such. Imagine, for example, that you want to do some processing on all of a function's arguments when parsing a language like C. Function arguments are wrapped in parentheses following the function name, and may themselves contain parentheses resulting from nested function calls or math grouping. At first, ignoring that they may be nested, you might be tempted to use figs/boxdr.jpg\bfoo\( [^)]*\)figs/boxul.jpg , but it won't work.

In hallowed C tradition, I use foo as the example function name. The marked part of the expression is ostensibly meant to match the function's arguments. With examples such as foo(2,•4.0) and foo(somevar,•3.7) , it works as expected. Unfortunately, it also matches foo(bar(somevar),•3.7), which is not as we want. This calls for something a bit "smarter" than figs/boxdr.jpg[^)]*figs/boxul.jpg .

To match the parenthesized expression part, you might consider the following regular expressions, among others:

1. \(.*\) literal parentheses with anything in between
2. \([^)]*\) from an opening parenthesis to the next closing parenthesis
3. \([^()]*\) from an opening parenthesis to the next closing parenthesis, but no other opening parentheses allowed in between

Figure 5-1 illustrates where these match against a sample line of code.

Figure 1. Match locations of our sample regexes
figs/mre2_0501.jpg

We see that regex #1 matches too much,[2] and regex #2 matches too little. Regex #3 doesn't even match successfully. In isolation, #3 would match '(this)', but because it must come immediately after the foo, it fails. So, none of these work.

[2] The use of figs/boxdr.jpg.*figs/boxul.jpg should set off warning alarms. Always pay particular attention to decide whether dot is really what you want to apply star to. Sometimes that is exactly what you need, but figs/boxdr.jpg.*figs/boxul.jpg is often used inappropriately.

The real problem is that on the vast majority of systems, you simply can't match arbitrarily nested constructs with regular expressions. For a long time, this was universally true, but recently, both Perl and .NET offer constructs that make it possible. (see Section 7.8.1 and Section 9.6.2, respectively.) But, even without these special constructs, you can still build a regex to match things nested to a certain depth, but not to an arbitrary level of nesting. Just one level of nesting requires

figs/boxdr.jpg
\([^()]*(\([^()]*\)[^()]*)*\)
figs/boxul.jpg

so the thought of having to worry about further levels of nesting is frightening. But, here's a little Perl snippet that, given a $depth, creates a regex to match up to that many levels of parentheses beyond the first. It uses Perl's "string x count" operator, which replicates string by count times:

$regex = '\(' . '(?:[^()]|\(' x $depth . '[^()]*' .'\)) *' x $depth . '\)';

I'll leave the analysis for your free time.

5.2.5 Watching Out for Unwanted Matches

It's easy to forget what happens if the text is not formed just as you expect. Let's say you are writing a filter to convert a text file to HTML, and you want to replace a line of hyphens by <HR>, which represent a horizontal rule (a line across the page). If you used a s/-*/<HR>/ search-and-replace command, it would replace the sequences you wanted, but only when they're at the beginning of the line. Surprised? In fact, s/-*/<HR>/ adds <HR> to the beginning of every line, whether they begin with a sequence of hyphens or not!

Remember, anything that isn't required is always considered successful. The first time figs/boxdr.jpg-*figs/boxul.jpg is attempted at the start of the string, it matches any hyphens that might be there. However, if there aren't any, it is still happy to successfully match nothing. That's what star is all about.

Let's look at a similar example I once saw in a book by a respected author, in which he describes a regular expression to match a number, either integer or floating- point. As his expression is constructed, such a number has an optional leading minus sign, any number of digits, an optional decimal point, and any number of digits that follow. His regex is figs/boxdr.jpg-?[0-9]*\.?[0-9]*figs/boxul.jpg .

Indeed, this matches such examples as 1, -272.37, 129238843., .191919, and even something like -.0. This is all good, and as expected.

However, how do you think it matches in a string like 'this•has•no•number', 'nothing•here', or even an empty string? Look at the regex closely—everything is optional. If a number is there, and if it is at the beginning of the string, it is matched, but nothing is required. This regex can match all three non-number vexamples, matching the nothingness at the beginning of the string each time. In fact, it even matches nothingness at the beginning of an example like 'num•123', since that nothingness matches earlier than the number would.

So, it's important to say what you really mean. A floating-point number must have at least one digit in it, or it's not a number (!). To construct our regex, let's first assume there is at least one digit before the decimal point. (We'll remove this requirement later.) If so, we need to use plus for those digits: figs/boxdr.jpg-?[0-9]+figs/boxul.jpg .

Writing the subexpression to match an optional decimal point (and subsequent digits) hinges on the realization that any numbers after the decimal point are contingent upon there being a decimal point in the first place. If we use something naïve like figs/boxdr.jpg\.?[0-9]*figs/boxul.jpg , the figs/boxdr.jpg[0-9]*figs/boxul.jpg gets a chance to match regardless of the decimal point's presence.

The solution is, again, to say what we mean. A decimal point (and subsequent digits, if any) is optional: figs/boxdr.jpg (\.[0-9]*)?figs/boxul.jpg . Here, the question mark no longer quantifies (that is, governs or controls) only the decimal point, but instead the entire combination of the decimal point plus any following digits. Within that combination, the decimal point is required; if it is not there, figs/boxdr.jpg[0-9]*figs/boxul.jpg is not even reached.

Putting this all together, we have figs/boxdr.jpg-?[0-9]+(\.[0-9]*)?figs/boxul.jpg . This still doesn't allow something like '.007', since our regex requires at least one digit before the decimal point. If we change that part to allow zero digits, we have to change the other so it doesn't, since we can't allow all digits to be optional (the problem we are trying to correct in the first place).

The solution is to add an alternative that allows for the uncovered situation: figs/boxdr.jpg-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+ figs/boxul.jpg . This now also allows just a decimal point followed by (this time not optional) digits. Details, details. Did you notice that I allowed for the optional leading minus in the second alternative as well? That's easy to forget. Of course, you could instead bring the figs/boxdr.jpg-?figs/boxul.jpg out of the alternation, as in figs/boxdr.jpg -?( [0-9]+(\.[0-9]*)?|\.[0-9]+ ) figs/boxul.jpg .

Although this is an improvement on the original, it's still more than happy to match at ' 2003.04.12'. Knowing the context in which a regex is intended to be used is an important part of striking the balance between matching what you want, and not matching what you don't want. Our regex for floating-point numbers requires that it be constrained somehow by being part of a larger regex, such as being wrapped by figs/boxdr.jpg^···$figs/boxul.jpg , or perhaps figs/boxdr.jpgnum\s*=\s*···$figs/boxul.jpg .

5.2.6 Matching Delimited Text

Matching a double-quoted string and matching an IP address are just two examples of a whole class of matching problem that often arises: the desire to match text delimited (or perhaps separated) by some other text. Other examples include:

  • Matching a C comment, which is surrounded by '/*' and '*/'.

  • Matching an HTML tag, which is text wrapped by <···>, such as <CODE>.

  • Extracting items between HTML tags, such as the 'super exciting' of the HTML 'a <I>super exciting</I> offer!'

  • Matching a line in a .mailrc file. This file gives email aliases, where each line is in the form of

    alias shorthand full-address

    such as 'alias jeff jfriedl@regex.info'. (Here, the delimiters are the whitespace between each item, as well as the ends of the line.)

  • Matching a quoted string, but allowing it to contain quotes if they are escaped, as in 'a passport needs a "2\"x3\" likeness" of the holder.'

  • Parsing CSV (comma-separated values) files.

In general, the requirements for such tasks can be phrased along the lines of:

  1. Match the opening delimiter

  2. Match the main text

    (which is really "match anything that is not the closing delimiter")

  3. Match the closing delimiter

As I mentioned earlier, the "match anything not the closing delimiter" can become complicated when the closing delimiter is more than one character, or in situations where it can appear within the main text.

5.2.6.1 Allowing escaped quotes in double-quoted strings

Let's look at the 2\"x3\" example, where the closing delimiter is a quote, yet can appear within the main part if escaped. It's easy enough to match the opening and closing quotes; the trick is to match the main text without overshooting the closing quote. Thinking clearly about which items the main text allows, we know that if a character is not a double quote (in other words, if it's figs/boxdr.jpg[^"]figs/boxul.jpg ), it is certainly okay. However, if it is a double quote, it is okay if preceded by a backslash. Translating that literally, using lookbehind (see Section 3.4.3.6) for the "if preceded" part, it becomes figs/boxdr.jpg"([^"]|(?<=\\)")*"figs/boxul.jpg , which indeed properly matches our 2\"x3\" example.

This is a perfect example to show how unintended matches can sneak into a seemingly proper regex, because as much as it seems to be correct, it doesn't always work. We want it to match the marked part of this silly example:

Darth Symbol: "/-|-\\" or "[^-^]"

but it actually matches:

Darth Symbol: "/-|-\\" or "[^-^]"

This is because the final quote of the first string indeed has a backslash before it. That backslash is itself escaped, so it doesn't escape the quote that follows (which means the quote that follows does end the string). Our lookbehind doesn't recognize that the preceding backslash has been itself escaped, and considering that there may be any number of preceding '\\' sequences, it's a can of worms to try to solve this with lookbehind. The real problem is that a backslash that escapes a quote is not being recognized as an escaping backslash when we first process it, so let's try a different approach that tackles it from that angle.

Concentrating again at what kinds of things we want to match between the opening and closing delimiter, we know that something escaped is okay ( figs/boxdr.jpg\\.figs/boxul.jpg ), as well as anything else other than the closing quote ( figs/boxdr.jpg[^"]figs/boxul.jpg ). This yields figs/boxdr.jpg"(\\.|[^"])*"figs/boxul.jpg . Wonderful, we've solved the problem! Unfortunately, not yet. Unwanted matches can still creep in, such as with this example for which we expect no match because the closing quote has been forgotten:

"You need a 2\"3\" photo.

Why does it match? Recall the lessons from "Greediness and Laziness Always Favor a Match" (see Section 4.5.4). Even though our regex initially matches past that last quote, as we want, it still backtracks after it finds that there is no ending quote, to:

at '···2\"3figs/U25B4small.gif\"•···'matching figs/boxdr.jpg(\\.|figs/U25B4small.gif[^"])figs/boxul.jpg

From that point, the figs/boxdr.jpg[^"]figs/boxul.jpg matches the backslash, leaving us at what the regex can consider an ending quote.

An important lesson to take from this example is:

When backtracking can cause undesired matches in relation to alternation, it's likely a sign that any success is just a happenstance due to the ordering of the alternatives.

In fact, had our original regex had its alternatives reversed, it would match incorrectly in every string containing an escaped double quote. The problem is that one alternative can match something that is supposed to be handled by the other.

So, how can we fix it? Well, just as in the continuation-lines example in Section 5.1, we must make sure that there's no other way for that backslash to be matched, which means changing figs/boxdr.jpg[^"]figs/boxul.jpg , to figs/boxdr.jpg[^\\"]figs/boxul.jpg , . This recognizes that both a double quote and a backslash are "special" in this context, and must be handled accordingly. The result is figs/boxdr.jpg"(\\.|[^\\"])*"figs/boxul.jpg , which works just fine. (Although this regex now works, it can still be improved so that it is much more efficient for NFA engines; we'll see this example quite a bit in the next chapter see Section 6.1.)

This example shows a particularly important moral:

Always consider the "odd" cases in which you don't want a regex to match, such as with "bad" data.

Our fix is the right one, but it's interesting to note that if you have possessive quantifiers (see Section 3.4.5.10) or atomic grouping (see Section 3.4.5.3), this regex can be written as figs/boxdr.jpg"(\\.|[^"])*+"figs/boxul.jpg and figs/boxdr.jpg"(?>(\\.|[^"])*)"figs/boxul.jpg respectively. They don't really fix the problem so much as hide it, disallowing the engine from backtracking back to where the problem could show itself. Either way, they get the job done well.

Understanding how possessive quantifiers and atomic grouping help in this situation is extremely valuable, but I would still go ahead and make the previous fix anyway, as it is more descriptive to the reader. Actually, in this case, I would want to use possessive quantifiers or atomic grouping as well—not to solve the previous problem, but for efficiency, so that a failure fails more quickly.

5.2.7 Knowing Your Data and Making Assumptions

This is an opportune time to highlight a general point about constructing and using regular expressions that I've briefly mentioned a few times. It is important to be aware of the assumptions made about the kind of data with which, and situations in which, a regular expression will be used. Even something as simple as figs/boxdr.jpgafigs/boxul.jpg assumes that the target data is in the same character encoding (see Section 3.3.2) as the author intends. This is pretty much common sense, which is why I haven't been too picky about saying these things.

However, many assumptions that might seem obvious to one person are not necessarily obvious to another. For example, the solution in the previous section assumes that escaped newlines shouldn't be matched, or that it will be applied in a dot-matches-all mode (see Section 3.3.3). If we really want to ensure that dot can match a newline, we should write that by using figs/boxdr.jpg(?s:.)figs/boxul.jpg , if supported by the flavor.

Another assumption made in the previous section is the type of data to which the regex will be applied, as it makes no provisions for any other uses of double quotes in the data. If you apply it to source code from almost any programming language, for example, you'll find that it breaks because there can be double quotes within comments.

There is nothing wrong with making assumptions about your data, or how you intend a regex to be used. The problems, if any, usually lie in overly optimistic assumptions and in misunderstandings between the author's intentions and how the regex is eventually used. Documenting the assumptions can help.

5.2.8 Stripping Leading and Trailing Whitespace

Removing leading and trailing whitespace from a line is not a challenging problem, but it's one that seems to come up often. By far the best all-around solution is the simple use of two substitutions:

s/^\s+//;

s/\s+$//;

As a measure of efficiency, these use figs/boxdr.jpg+figs/boxul.jpg instead of figs/boxdr.jpg*figs/boxul.jpg , since there's no benefit to doing the substitution unless there is actually whitespace to remove.

For some reason, it seems to be popular to try to find a way to do it all in one expression, so I'll offer a few methods for comparison. I don't recommend them, but it's educational to understand why they work, and why they're not desirable.

s/\s*(.*?)\s*$/$1/s

This used to be given as a great example of lazy quantifiers, but not any more, because people now realize that it's so much slower than the simple approach. (In Perl, it's about 5x slower). The lack of speed is due to the need to check figs/boxdr.jpg\s*$figs/boxul.jpg before each application of the lazy-quantified dot. That requires a lot of backtracking.

s/^\s*( (?:.*\S)?)\s*$/$1/s

This one looks more complex than the previous example, but its matching is more straightforward, and is only twice as slow as the simple approach. After the initial figs/boxdr.jpg^\s*figs/boxul.jpg has bypassed any leading whitespace, the figs/boxdr.jpg.*figs/boxul.jpg in the middle matches all the way to the end of the text. The figs/boxdr.jpg\Sfigs/boxul.jpg that follows forces it to backtrack to the last non-whitespace in the text, thereby leaving the trailing whitespace matched by the final figs/boxdr.jpg\s*$figs/boxul.jpg , outside of the capturing parentheses.

The question mark is needed so that this expression works properly on a line that has only whitespace. Without it, it would fail to match, leaving the whitespace- filled line unchanged.

s/^\s+|\s+$//g

This is a commonly thought-up solution that, while not incorrect (none of these are incorrect), it has top-level alternation that removes many optimizations (covered in the next chapter) that might otherwise be possible.

The /g modifier is required to allow each alternative to match, to remove both leading and trailing whitespace. It seems a waste to use /g when we know we intend at most two matches, and each with a different subexpression. This is about 4x slower than the simple approach.

I've mentioned the relative speeds as I tested them, but in practice, the actual relative speeds are dependent upon the tool and the data. For example, if the target text is very, very long, but has relatively little whitespace on either end, the middle approach can be somewhat faster than the simple approach. Still, in my programs, I use the language's equivalent of

s/^\s+//;

s/\s+$//;

because it's almost always fastest, and is certainly the easiest to understand.

    Previous Section  < Free Open Study >  Next Section