5.4 Extended Examples

The next few examples illustrate some important techniques about regular expressions. The discussions are longer, and show more of the thought processes and mistaken paths that eventually lead to a working conclusion.

5.4.1 Keeping in Sync with Your Data

Let's look at a lengthy example that might seem a bit contrived, but which illustrates some excellent points on why it's important to keep in sync with what you're trying to match (and provides some methods to do so).

Let's say that your data is a series of five-digit US postal codes (ZIP codes) that are run together, and that you need to retrieve all that begin with, say, 44. Here is a sample line of data, with the codes we want to retrieve in bold:

     03824531449411615213441829503544272752010217443235

Building Up a Regex Through Variables in Java

String SubDomain = "(?i:[a-z0-9]|[a-z0-9][-a-z0-9]*[a-z0-9])"; String TopDomains = "(?x-i:com\\b \n" + " |edu\\b \n" + " |biz\\b \n" + " |in(?:t|fo)\\b \n" + " |mil\\b \n" + " |net\\b \n" + " |org\\b \n" + " |[a-z][a-z]\\b \n" + // country codes ") \n"; String Hostname = "(?:" + SubDomain + "\\.)+" + TopDomains; String NOT_IN = ";\"'<>()\\[\\]\\{\\}\\s\\x7F-\\xFF"; String NOT_END = ".,?"; String ANYWHERE = "[^" + NOT_IN + NOT_END + "]"; String EMBEDDED = "[" + NOT_END + "]"; String UrlPath = "/"+ANYWHERE + "*("+EMBEDDED+"+"+ANYWHERE+"+)*"; String Url = "(?x: \n"+ " \\b \n"+ " ## match the hostname part \n"+ " ( \n"+ " (?: ftp | http s? ): // [-\\w]+(\\.\\w[-\\w]*)+ \n"+ " | \n"+ " " + Hostname + " \n"+ " ) \n"+ " # allow optional port \n"+ " (?: \\d+ )? \n"+ " \n"+ " # rest of url is optional, and begins with / \n"+ " (?: " + UrlPath + ")? \n"+ ")"; // Now convert string we've built up into a real regex object Pattern UrlRegex = Pattern.compile(Url); // Now ready to apply to raw text to find urls . . . · · ·

As a starting point, consider that \d\d\d\d\d can be used repeatedly to match all the ZIP codes. In Perl, this is as simple as @zips = m/\d\d\d\d\d/g; to create a list with one ZIP code per element. (To keep these examples less cluttered, they assume the text to be matched is in Perl's default target variable $_ see Section 2.3.7.) With other languages, it's usually a simple matter to call the regex "find" method in a loop. I'd like to concentrate on the regular expression rather than that mechanics particular to each language, so will continue to use Perl to show the examples.

Back to \d\d\d\d\d. Here's a point whose importance will soon become apparent: the regex never fails until the entire list has been parsed—there are absolutely no bump-and-retries by the transmission. (I'm assuming we'll have only proper data, an assumption that is very situation specific.)

So, it should be apparent that changing \d\d\d\d\d to 44\d\d\d in an attempt to find only ZIP codes starting with 44 is silly — once a match attempt fails, the transmission bumps along one character, thus putting the match for the 44··· out of sync with the start of each ZIP code. Using 44\d\d\d incorrectly finds a match at '···5314494116···'.

You could, of course, put a caret or \A at the head of the regex, but they allow a target ZIP code to match only if it's the first in the string. We need to keep the regex engine in sync manually by writing our regex to pass over undesired ZIP codes as needed. The key here is that it must pass over full ZIP codes, not single characters as with the automatic bump-along.

5.4.1.1 Keeping the match in sync with expectations

The following are a few ways to pass over undesired ZIP codes. Inserting them before what we want ((44\d\d\d) ) achieves the desired effect. Non-capturing (?:···) parentheses are used to match undesired ZIP codes, effectively allowing us to pass them over on the way toward matching a desired ZIP code within the $1 capturing parentheses:


(?:[^4]\d\d\d\d|\d[^4]\d\d\d)*···

This brute-force method actively skips ZIP codes that start with something other than 44. (Well, it's probably better to use [1235-9] instead of [^4], but as I said earlier, I am assuming properly formatted data.) By the way, we can't use (?:[^4][^4]\d\d\d)*, as it does not match (and thus does not pass over) undesired ZIP codes like 43210.

(?:(?!44)\d\d\d\d\d)*···

This method, which uses negative lookahead, actively skips ZIP codes that do not start with 44. This English description sounds virtually identical to the previous one, but when rendered into a regular expression looks quite different. Compare the two descriptions and related expressions. In this case, a desired ZIP code (beginning with 44) causes (?!44) to fail, thus causing the skipping to stop.

(?:\d\d\d\d\d)*?···

This method uses a lazy quantifier to skip ZIP codes only when needed. We use it before a subexpression matching what we do want, so that if that subexpression fails, this one matches a ZIP. It's the laziness of (···)*? that allows this to happen. Because of the lazy quantifier, (?:\d\d\d\d\d) is not even attempted until whatever follows has failed. The star assures that it is repeatedly attempted until whatever follows finally does match, thus effectively skipping only what we want to skip.

Combining this last method with (44\d\d\d) gives us

     @zips = m/

(?:\d\d\d\d\d)*?(44\d\d\d)/g;

and picks out the desired '44xxx' codes, actively skipping undesired ones that intervene. (In this "@array = m/···/g" situation, Perl fills the array with what's matched by capturing parentheses during each match attempt see Section 7.5.3.3.) This regex can work repeatedly on the string because we know each match always leaves the "current match position" at the start of the next ZIP code, thereby priming the next match to start at the beginning of a ZIP code as the regex expects.

5.4.1.2 Maintaining sync after a non-match as well

Have we really ensured that the regex is always applied only at the start of a ZIP code? No! We have manually skipped intervening undesired ZIP codes, but once there are no more desired ones, the regex finally fails. As always, the bump-alongand- retry happens, thereby starting the match from a position within a ZIP code— something our approach relies on never happening.

Let's look at our sample data again:

038245314494116

15213441829503544272752010217443235

Here, the matched codes are bold (the third of which is undesired), the codes we actively skipped are underlined, and characters skipped via bump-along-and-retry are marked. After the match of 44272, no more target codes are able to be matched, so the subsequent attempt fails. Does the whole match attempt end? Of course not. The transmission bumps along to apply the regex at the next character, putting us out of sync with the real ZIP codes. After the fourth such bump-along, the regex skips 10217 as it matches 44323, reporting it falsely as a desired code.

Any of our three expressions work smoothly so long as they are applied at the start of a ZIP code, but the transmission's bump-along defeats them. This can be solved by ensuring that the transmission doesn't bump along, or that a bumpalong doesn't cause problems.

One way to ensure that the transmission doesn't bump along, at least for the first two methods, is to make (44\d\d\d\) greedily optional by appending ?. This plays off the knowledge that the prepended (?:(?!44)\d\d\d\d\d)*··· or (?:[^4]\d\d\d\d|\d[^4]\d\d\d)*··· finish only when at a desired code, or when there are no more codes (which is why it can't be used for the third, nongr eedy method.) Thus, (44\d\d\d)? matches the desired ZIP code if it's there, but doesn't force a backtrack if it's not.

There are some problems with this solution. One is that because we can now have a regex match even when we don't have a target ZIP code, the handling code must be a bit more complex. However, to its benefit, it is fast, since it doesn't involve much backtracking, nor any bump-alongs by the transmission.

5.4.1.3 Maintaining sync with `\G`

A more general approach is to simply prepend \G (see Section 3.4.3.3) to any of the three methods' expressions. Because we crafted each to explicitly end on a ZIP code boundary, we're assured that any subsequent match that has had no intervening bump-along begins on that same ZIP boundary. And if there has been a bumpalong, the leading \G fails immediately, because with most flavors, it's successful only when there's been no intervening bump-along. (This is not true for Ruby and other flavors whose \G means "start of the current match" instead of "end of the previous match" see Section 3.4.3.4.)

So, using the second expression, we end up with

     @zips = m/\G(?:(?!44)\d\d\d\d\d)*(44\d\d\d\d)/g;

without the need for any special after-match checking.

5.4.1.4 This example in perspective

I'll be the first to admit that this example is contrived, but nevertheless, it shows a number of valuable lessons about how to keep a regex in sync with the data. Still, were I really to need to do this in real life, I would probably not try to solve it completely with regular expressions. I would simply use \d\d\d\d\d to grab each ZIP code, then discard it if it doesn't begin with '44'. In Perl, this looks like:

     @zips = ( ); # Ensure the array is empty



     while (m/(\d\d\d\d\d)/) {

        $zip = $1;

        if (substr($zip, 0, 2) eq "44") {

            push @zips, $zip;

        }

     }

Also, see the sidebar in Section 3.4.3.4 for a particularly interesting use of \G, although one available at the time of this writing only in Perl.

5.4.2 Parsing CSV Files

As anyone who's ever tried to parse a CSV (Comma Separated Values) file can tell you, it can be a bit tricky. The biggest problem is that it seems every program that produces a CSV file has a different idea of just what the format should be. In this section, I'll start off with methods to parse the kind of CSV file that Microsoft Excel generates, and we'll move from there to look at some other format permutations.^[3]

^[3] The final code for processing the Microsoft style CSV files is presented in Chapter 6 (see Section 6.6.7.3) after the efficiency issues discussed in that chapter are taken into consideration.

Luckily, the Microsoft format is one of the simplest. The values, separated by commas, are either "raw" (just sitting there between the commas), or within double quotes (and within the double quotes, a double quote itself is represented by a pair of double quotes in a row).

Here's an example:

     Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K

This row represents seven fields:

     Ten•Thousand

     10000

     •2710•

     an empty field

     10,000

     It's•"10•Grand",•baby

     10K

So, to parse out the fields from a line, we need an expression to cover each of two field types. The non-quoted ones are easy—they contain anything except commas and quotes, so they are matched by [^",]+ .

A double-quoted field can contain commas, spaces, and in fact anything except a double quote. It can also contain the two quotes in a row that represent one quote in the final value. So, a double-quoted field is matched by any number of [^"]|"" between "···", which gives us "(?:[^"]|"")*". (Actually, for efficiency, I can use atomic grouping, (?>···) instead of (?:···), but I'll leave that discussion until the next chapter; see Section 6.5.7.)

Putting this all together results in [^",]+|"(?:[^"]|"")*" to match a single field. That might be getting a bit hard to read, so I'll rewrite it in a free-spacing mode (see Section 3.3.3.2):

     # Either some non-quote/non-comma text . . .

     [^",]+

     # . . . or . . .

      |

     # . . . a double-quoted field (inside, paired double quotes are allowed)

     " # field's opening quote

      (?: [^"] ; "" )*

     " # field's closing quote

Now, to use this in practice, we can apply it repeatedly to a string containing a CSV row, but if we want to actually do anything productive with the results of the match, we should know which alternative matched. If it's the double-quoted field, we need to remove the outer quotes and replace internal paired double quotes with one double quote to yield the original data.

I can think of two approaches to this. One is to just look at the text matched and see whether the first character is a double quote. If so, we know that we must strip the first and last characters (the double quotes) and replace any internal '""' by '"'. That's simple enough, but it's even simpler if we are clever with capturing parentheses. If we put capturing parentheses around each of the subexpressions that match actual field data, we can inspect them after the match to see which group has a value:

     # Either some non-quote/non-comma text . . .




     ( [^",]+ )

     # . . . or . . .

      |

     # . .. a double-quoted field (inside, paired double quotes are allowed)

     " # field's opening quote




       ( (?: [^"] | "" )* )

     " # field's closing quote

Now, if we see that the first group captured, we can just use the value as is. If the second group captured, we merely need to replace any '""' with '"' and we can use the value.

I'll show the example now in Perl, and a bit later (after we flush out some bugs) in Java and VB.NET. Here's the snippet in Perl, assuming our line is in $html and has had any newline removed from the end (we don't want the newline to be part of the last field!):

     while ($line =~ m{

                # Either some non-quote/non-comma text . . .

                ( [^",]+ )

                # . . . or . . .

                 |

                # . . . a double-quoted field ("" allowed inside)

                " # field's opening quote

                 ( (?: [^"] | "" )* )

                " # field's closing quote

             }gx)

     {

         if (defined $1) {

             $field = $1;

         } else {

             $field = $2;

             $field =~ s/""/"/g;

         }

         print "[$field]"; # print the field, for debugging
         Can work with $field now . . .

     }

Applying this against our test data, the output is:

     [Ten•Thousand][10000][•2710•][10,000][It's•"10•Grand",•baby][10K]

This looks mostly good, but unfortunately doesn't give us anything for that empty fourth field. If the program's "work with $field" is to save the field value to an array, once we're all done, we'd want access to the fifth element of the array to yield the fifth field ("10,000"). That won't work if we don't fill an element of the array with an empty value for each empty field.

The first idea that might come to mind for matching an empty field is to change [^",]+ to [^",]*. Well, that may seem obvious, but does it really work?

Let's test it. Here's the output:

     [Ten•Thousand][][10000][][•2710•][][][][10,000][][][It's•"10•Grand", . . .

Oops, we somehow got a bunch of extra fields! Well, in retrospect, we shouldn't be surprised. By using (···)* to match a field, we don't actually require anything to match. That works to our advantage when we have an empty field, but consider after having matched the first field, the next application of the regex starts at 'Ten•Thousand,10000···'. If nothing in the regex can match that raw comma (as is the case), yet an empty match is considered successful, the empty match will indeed be successful at that point. In fact, it could be successful an infinite number of times at that point if the regex engine doesn't realize, as modern ones do, that it's in such a situation and force a bump-along so that two zero-width matches don't happen in a row (see Section 3.4.3.4). That's why there's one empty match between each valid match (and although not shown, there's an empty match at the end).

5.4.2.1 Distrusting the bump-along

The problem here stems from us having relied on the transmission's bump-along to get us past the separating commas. To solve it, we need to take that control into our own hands. Two approaches come to mind:

We could try to match the commas ourselves. If we take this approach, we must be sure to match a comma as part of matching a regular field, using it to "pace ourselves" through the string.
We could check to be sure that each match start is consistent with locations that we know can start a field. A field starts either at the beginning of the line, or after a comma.

Perhaps even better, we can combine the two. Starting with the first approach (matching the commas ourselves), we can simply require a comma before each field except the first. Alternatively, we can require a comma after each field except the last. We can do this by prepending ^|, or appending $|, to our regex, with appropriate parentheses to control the scope. Let's try prepending, which gives us:


     (?:^|,)
     (?:

         # Either some non-quote/non-comma text....

         ( [^",]* )

       # ···  or···

       |

         # ··· a double-quoted field (inside, paired double quotes are allowed)

         " # field's opening quote

          ( (?: [^"] | "" )* )

         " # field's closing quote

     )

This really sounds like it should work, but plugging it into our test program, the result is disappointing:

     [Ten•Thousand][10000][•2710•][][][000][][•baby][10K]

Remember, we're expecting:

     [Ten•Thousand][10000][•2710•][][10,000][It's•"10•Grand",•baby][10K]

Why didn't this one work? It seems that the double-quoted fields didn't come out right, so the problem must be with the part that matches a double-quoted field, right? No, the problem is before that. Perhaps the moral from Section 4.5.10.1 can help: when more than one alternative can potentially match from the same point, care must be taken when selecting the order of the alternatives. Since the first alternative, [^",]*, requires nothing to be successful, the second alternative never gets a chance to match unless forced by something that must match later in the regex. Our regex doesn't have anything after these alternatives, so as it's written, the second alternative is never even reached. Doh!

Wow, you've really got to keep your wits about you. Okay, let's swap the alternatives and try again:

(?:^|,)

(?: # Now, match either a double-quoted field (inside, paired double quotes are allowed) . . .

         " # (double-quoted field's opening quote)

         ( (?: [^"] | "" )* )

         " # (double-quoted field's closing quote)

 |

    # . . . or, some non-quote/non-comma text . . .

         ( [^",]* )

)

Now, it works! Well, at least for our test data. Could it fail with other data? This section is named "Distrusting the bump-along," and while nothing takes the place of some thought backed up with good testing, we can use \G to ensure that each match begins exactly at the point that the previous match ended. We believe that should be true already because of how we've constructed and apply our regex. If we start out the regex with \G, we disallow any match after the engine's transmission is forced to bump along. We hope it will never matter, but doing so may make an error more apparent. Had we done this with our previously-failing regex that had given us

     [Ten•Thousand][10000][•2710•][][][000][][•baby][10K]

we would have gotten

     [Ten•Thousand][10000][•2710•][][]

instead. This perhaps would have made the error more apparent, had we missed it the first time.

CSV Processing in Java

Here's the CSV example with Sun's java.util.regex. This is designed to be clear and simple—a more efficient version is given in Chapter 8 (see Section 8.4.4.3).

import java.util.regex.*; · · · Pattern fieldRegex = Pattern.compile( "\\G(?:^|,) \n"+ "(?: \n"+ " # Either a double-quoted field ... \n"+ " \" # field's opening quote \n"+ " ( (?: [^\"]++ | \"\" )*+ ) \n"+ " \" # field's closing quote \n"+ " # ... or ... \n"+ " | \n"+ " # ... some non-quote/non-comma text ... \n"+ " ( [^\",]* ) \n"+ " ) \n", Pattern.COMMENTS); Pattern quotesRegex = Pattern.compile("\"\""); · · · // Given the string in 'line', find all the fields . . . Matcher m = fieldRegex.matcher(line); while (m.find()) { String field; if (m.group(1) != null) { field = quotesRegex.matcher(m.group(1)).replaceAll("\""); } else { field = m.group(2); } // We can now work with the field . . . System.out.println("[" + field + "]"); }

Another approach. The beginning of this section noted two approaches to ensuring we stay properly aligned with the fields. The other is to be sure that a match begins only where a field is allowed. On the surface, this is similar to prepending ^|,, except using lookbehind as with (?<=^|,).

Unfortunately, as the section in the previous chapter (see Section 3.4.3.6) explains, even if lookbehind is supported, variable-length lookbehind sometimes isn't, so this approach may not work. If the variable length is the issue, we could replace (?<=^|,) with (?:^|(?<=,)), but this seems overly complex considering that we already have the first approach working. Also, it reverts to relying on the transmission's bump-along to bypass the commas, so if we've made a mistake elsewhere, it could allow a match to begin at a location like '···"10,000"···'. All in all, it just seems safer to use the first approach.

However, we can use a twist on this approach—requiring a match to end before a comma (or before the end of the line). Adding (?=$|,) to the end of our regex adds yet another assurance that it won't match where we don't want it to. In practice, would I do add this? Well, frankly, I feel pretty comfortable with what we came up with before, so I'd probably not add it in this exact situation, but it's a good technique to have on hand when you need it.

5.4.2.2 One change for the sake of efficiency

Although I don't concentrate on efficiency until the next chapter, I'd like to make one efficiency-related change now, for systems that support atomic grouping (see Section 3.4.5.3). If supported, I'd change the part that matches the values of doublequoted fields from (?:[^"]|"")* to (?>[^"]+|"")*. The VB.NET example in the sidebar below shows this.

CSV Processing in VB.NET

Imports System.Text.RegularExpressions · · · Dim FieldRegex as Regex = New Regex( _ "(?:^|,) " & _ "(?: " & _ " (?# Either a doublequoted field ...) " & _ " "" (?# field's opening quote ) " & _ " ( (?> [^""]+ | """" )* ) " & _ " "" (?# field's closing quote ) " & _ " (?# ... or ...) " & _ " | " & _ " (?# ... some non-quote/non-comma text ...) " & _ " ( [^"",]*) " & _ " )", RegexOptions.IgnorePatternWhitespace) Dim QuotesRegex as Regex = New Regex(" "" "" ") 'A string with two double quotes · · · Dim FieldMatch as Match = FieldRegex.Match(Line) While FieldMatch.Success Dim Field as String If FieldMatch.Groups(1).Success Field = QuotesRegex.Replace(FieldMatch.Groups(1).Value, """") Else Field = FieldMatch.Groups(2).Value End If Console.WriteLine("[" & Field & "]") ' Can now work with 'Field'.··· FieldMatch = FieldMatch.NextMatch End While

If possessive quantifiers (see Section 3.4.5.8) are supported, as they are with Sun's Java regex package, they can be used instead. The sidebar with the Java CSV code shows this.

The reasoning behind these changes is discussed in the Chapter 6, and eventually we end up with a particularly efficient version, shown in Section 6.6.7.3.

5.4.2.3 Other CSV formats

Microsoft's CSV format is popular because it's Microsoft's CSV format, but it's not necessarily what other programs use. Here are some twists I've seen:

Using another character, such as ';' or a tab, as the separator (which makes one wonder why the format is still called "comma-separated values").
Allowing spaces after the separators, but not counting them as part of the value.
Using backslashes to escape quotes (e.g., using '\"' rather than '""' to include a quote within the value). This usually means that a backslash is allowed (and ignored) before any character.

These changes are easily accommodated. Do the first by replacing each comma in the regex with the desired separator; the second by adding \s* after the first separator, e.g., starting out with (?:^|,\s*).

For the third, we can use what we developed earlier (see Section 5.2.7), replacing [^"]+|"" with [^"\\]+|\\.. Of course, we'd also have to change the subsequent s/""/"/g to the more general s/\\(.)/$1/g, or our target language's equivalent.

< Free Open Study >