## 5.4 Extended Examples

The next few examples illustrate some important techniques about regular expressions. The discussions are longer, and show more of the thought processes and mistaken paths that eventually lead to a working conclusion.

### 5.4.1 Keeping in Sync with Your Data

Let's look at a lengthy example that might seem a bit contrived, but which illustrates some excellent points about why it's important to keep in sync with what you're trying to match (and provides some methods to do so). Let's say that your data is a series of five-digit US postal codes (ZIP codes) that are run together, and that you need to retrieve all that begin with, say, 44. Here is a sample line of data, with the codes we want to retrieve in bold:

03824531449411615213**44182**95035**44272**752010217443235
As a starting point, consider that `\d\d\d\d\d` can be used to match each successive ZIP code in the string.

Back to our sample data: applied repeatedly, it marches through the codes one at a time, staying in sync because each match ends exactly where the next code begins.

So, it should be apparent that changing `\d\d\d\d\d` to `44\d\d\d` is not the answer: nothing keeps `44\d\d\d` aligned with the start of a code, so it can just as happily match the '44941' spanning 53144 and 94116 as it can match the real 44182.

You could, of course, put a caret or `\A` at the front of the regex to pin a match to the start of the string, but that takes care of only the very first ZIP code. For the rest, we need some other way to keep each application of the regex aligned with the start of a code.
#### 5.4.1.1 Keeping the match in sync with expectations

The following are a few ways to pass over undesired ZIP codes. Inserting them before what we want (`44\d\d\d`) ensures that what precedes each match is a series of complete, undesired codes: one way is to skip, a full code at a time, any code whose first two digits aren't '44' by testing the digits with a character class; a cleaner variation uses negative lookahead, `(?:(?!44)\d\d\d\d\d)*`, to make the same test directly; and a third simply skips codes lazily with `(?:\d\d\d\d\d)*?`, relying on the rest of the regex to stop the skipping at the right place.

Combining this last method with `(44\d\d\d)` and a list-context application of m/…/g, we get

```perl
@zips = m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;
```
and picks out the desired '44xxx' codes, actively skipping undesired ones that intervene. (In this "@array = m/…/g" situation, Perl fills the array with what's matched by the capturing parentheses during each match attempt; see Section 7.5.3.3.) This regex can work repeatedly on the string because we know each match always leaves the "current match position" at the start of the next ZIP code, thereby priming the next match to start at the beginning of a ZIP code, as the regex expects.

#### 5.4.1.2 Maintaining sync after a non-match as well

Have we really ensured that the regex is always applied only at the start of a ZIP code? No! We have manually skipped intervening undesired ZIP codes, but once there are no more desired ones, the regex finally fails. As always, the bump-along-and-retry happens, thereby starting the match from a position within a ZIP code, something our approach relies on never happening. Let's look at our sample data again:

03824531449411615213**44182**95035**44272**752010217**44323**5

Here, the matched codes are bold (the third of which is undesired). The codes before each match were actively skipped by the regex itself, except for the characters '7520', which were skipped via bump-along-and-retry. After the match of 44272, no more target codes are able to be matched, so the subsequent attempt fails. Does the whole match attempt end? Of course not. The transmission bumps along to apply the regex at the next character, putting us out of sync with the real ZIP codes. After the fourth such bump-along, the regex skips 10217 as it matches 44323, reporting it falsely as a desired code.

Any of our three expressions work smoothly so long as they are applied at the start of a ZIP code, but the transmission's bump-along defeats them. This can be solved by ensuring either that the transmission doesn't bump along, or that a bump-along doesn't cause problems.
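To see the failure concretely, here is a small, self-contained Perl test of the lazy-star version against the sample data (a sketch; the variable name $ziplist is just for illustration). Note the bogus 44323 at the end of the result:

```perl
my $ziplist = "03824531449411615213441829503544272752010217443235";
my @zips    = $ziplist =~ m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;
print "@zips\n";   # prints "44182 44272 44323" -- the last one is bogus
```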
One way to ensure that the transmission doesn't bump along, at least for the first two methods, is to make the regex never fail: if the final `(44\d\d\d)` is made optional, every application matches, consuming a run of undesired codes (and, when there is one, the desired code that follows) rather than leaving them for the bump-along.
There are some problems with this solution. One is that because we can now have a regex match even when we don't have a target ZIP code, the handling code must be a bit more complex. However, to its benefit, it is fast, since it doesn't involve much backtracking, nor any bump-alongs by the transmission.
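As an illustration of the idea (a sketch only; the variable names and the choice of the lookahead-based skipper are assumptions, not taken from the text above), the handling code might look something like this:

```perl
my $ziplist = "03824531449411615213441829503544272752010217443235";
my @zips;
while ($ziplist =~ m/(?:(?!44)\d\d\d\d\d)*(44\d\d\d)?/g) {
    # $1 is undefined when the match consumed only undesired codes
    push @zips, $1 if defined $1;
}
print "@zips\n";   # prints "44182 44272"
```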
#### 5.4.1.3 Maintaining sync with \G

A more general approach is to simply prepend `\G` to the regex, so that each application is allowed to match only at the exact spot where the previous match ended, which we know is the start of a ZIP code. If the regex can't match there, `\G` keeps it from matching anywhere else, so we never report an out-of-sync code. So, using the second expression, we end up with

```perl
@zips = m/\G(?:(?!44)\d\d\d\d\d)*(44\d\d\d)/g;
```

without the need for any special after-match checking.
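Applied to the sample data (again a sketch, with $ziplist as an illustrative name), this reports only the two desired codes:

```perl
my $ziplist = "03824531449411615213441829503544272752010217443235";
my @zips    = $ziplist =~ m/\G(?:(?!44)\d\d\d\d\d)*(44\d\d\d)/g;
print "@zips\n";   # prints "44182 44272" -- and nothing bogus after them
```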
#### 5.4.1.4 This example in perspective

I'll be the first to admit that this example is contrived, but nevertheless, it shows a number of valuable lessons about how to keep a regex in sync with the data. Still, were I really to need to do this in real life, I would probably not try to solve it completely with regular expressions. I would simply use `\d\d\d\d\d` to grab each ZIP code in turn, and use non-regex code to check for the leading '44':
```perl
@zips = ( );   # Ensure the array is empty
while (m/(\d\d\d\d\d)/g) {   # grab the next five-digit code from $_
    $zip = $1;
    if (substr($zip, 0, 2) eq "44") {
        push @zips, $zip;
    }
}
```
Also, see the sidebar in Section 3.4.3.4 for a particularly interesting use of `\G`.
### 5.4.2 Parsing CSV Files

As anyone who's ever tried to parse a CSV (Comma Separated Values) file can tell you, it can be a bit tricky. The biggest problem is that it seems every program that produces a CSV file has a different idea of just what the format should be. In this section, I'll start off with methods to parse the kind of CSV file that Microsoft Excel generates, and we'll move from there to look at some other format permutations.[3]
Luckily, the Microsoft format is one of the simplest. The values, separated by commas, are either "raw" (just sitting there between the commas), or within double quotes (and within the double quotes, a double quote itself is represented by a pair of double quotes in a row). Here's an example:

```
Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K
```

This row represents seven fields:

1. Ten•Thousand
2. 10000
3. •2710•
4. an empty field
5. 10,000
6. It's•"10•Grand",•baby
7. 10K
So, to parse out the fields from a line, we need an expression to cover each of two field types. The non-quoted ones are easy: they contain anything except commas and quotes, so they are matched by `[^",]+`.

A double-quoted field can contain commas, spaces, and in fact anything except a double quote. It can also contain the two quotes in a row that represent one quote in the final value. So, a double-quoted field is matched by any number of `(?:[^"]|"")` between an opening and closing quote, i.e. `"(?:[^"]|"")*"`.

Putting this all together results in:
```
  # Either some non-quote/non-comma text . . .
  [^",]+
    # . . . or . . .
  |
    # . . . a double-quoted field (inside, paired double quotes are allowed)
  "                      # field's opening quote
  (?: [^"] | "" )*
  "                      # field's closing quote
```

Now, to use this in practice, we can apply it repeatedly to a string containing a CSV row, but if we want to actually do anything productive with the results of the match, we should know which alternative matched. If it's the double-quoted field, we need to remove the outer quotes and replace internal paired double quotes with one double quote to yield the original data.

I can think of two approaches to this. One is to just look at the text matched and see whether the first character is a double quote. If so, we know that we must strip the first and last characters (the double quotes) and replace any internal '""' by '"'. That's simple enough, but it's even simpler if we are clever with capturing parentheses. If we put capturing parentheses around each of the subexpressions that match actual field data, we can inspect them after the match to see which group has a value:

```
  # Either some non-quote/non-comma text . . .
  ( [^",]+ )
    # . . . or . . .
  |
    # . . . a double-quoted field (inside, paired double quotes are allowed)
  "                      # field's opening quote
  ( (?: [^"] | "" )* )
  "                      # field's closing quote
```

Now, if we see that the first group captured, we can just use the value as is. If the second group captured, we merely need to replace any '""' with '"' and we can use the value. I'll show the example now in Perl, and a bit later (after we flush out some bugs) in Java and VB.NET. Here's the snippet in Perl, assuming our line is in $line and has had any newline removed from the end (we don't want the newline to be part of the last field!):

```perl
while ($line =~ m{
        # Either some non-quote/non-comma text . . .
        ( [^",]+ )
        |  # . . . or . . .
           # . . . a double-quoted field ("" allowed inside)
        "                      # field's opening quote
        ( (?: [^"] | "" )* )
        "                      # field's closing quote
      }gx)
{
    if (defined $1) {
        $field = $1;
    } else {
        $field = $2;
        $field =~ s/""/"/g;
    }
    print "[$field]";   # print the field, for debugging
}
```

Applying this against our test data, the output is:

[Ten•Thousand][10000][•2710•][10,000][It's•"10•Grand",•baby][10K]

This looks mostly good, but unfortunately doesn't give us anything for that empty fourth field. If the program's "work with $field" is to save the field value to an array, once we're all done, we'd want access to the fifth element of the array to yield the fifth field ("10,000"). That won't work if we don't fill an element of the array with an empty value for each empty field. The first idea that might come to mind for matching an empty field is to change the plus of the first alternative's `[^",]+` to a star, `[^",]*`, so that it can match an empty field as well.

Let's test it. Here's the output:

[Ten•Thousand][][10000][][•2710•][][][][10,000][][][It's•"10•Grand", . . .

Oops, we somehow got a bunch of extra fields! Well, in retrospect, we shouldn't be surprised. By using `[^",]*`, we can now have a match that matches no text at all, and the transmission's bump-along happily hands it one position after another (at each comma, and just before each quoted field) at which to do exactly that.
#### 5.4.2.1 Distrusting the bump-along

The problem here stems from us having relied on the transmission's bump-along to get us past the separating commas. To solve it, we need to take that control into our own hands. Two approaches come to mind:

- Take responsibility for matching the separating commas ourselves, so there is nothing between fields for the transmission to skip.
- Ensure that a match can begin only where a field is actually allowed to begin (at the start of the line, or just after a comma).
Perhaps even better, we can combine the two. Starting with the first approach (matching the commas ourselves), we can simply require a comma before each field except the first. Alternatively, we can require a comma after each field except the last. We can do the first of these by prepending `(?:^|,)` to the regex.
This really sounds like it should work, but plugging it into our test program, the result is disappointing:

[Ten•Thousand][10000][•2710•][][][000][][•baby][10K]

Remember, we're expecting:

[Ten•Thousand][10000][•2710•][][10,000][It's•"10•Grand",•baby][10K]

Why didn't this one work? It seems that the double-quoted fields didn't come out right, so the problem must be with the part that matches a double-quoted field, right? No, the problem is before that. Perhaps the moral from Section 4.5.10.1 can help: when more than one alternative can potentially match from the same point, care must be taken when selecting the order of the alternatives. Since the first alternative, `[^",]*`, can now match nothing at all, it is the one that gets used just before each double-quoted field: it happily reports a bogus empty field, leaving the opening quote for the bump-along to skip, which drops us into the middle of the quoted field.
Wow, you've really got to keep your wits about you. Okay, let's swap the alternatives and try again:

```
  (?:^|,)
  (?:
      # Now, match either a double-quoted field (inside, paired double quotes are allowed) . . .
      "                      # (double-quoted field's opening quote)
      ( (?: [^"] | "" )* )
      "                      # (double-quoted field's closing quote)
    |  # . . . or, some non-quote/non-comma text . . .
      ( [^",]* )
  )
```

Now, it works! Well, at least for our test data.
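For reference, here is the whole loop with the swapped alternatives plugged in (a sketch that just pulls together the pieces shown above; the sample line is hard-coded for testing, and note that the quoted field is now $1 rather than $2):

```perl
my $line = q{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K};
my @fields;
while ($line =~ m{
        (?:^|,)
        (?: # a double-quoted field (paired double quotes allowed inside) . . .
            " ( (?: [^"] | "" )* ) "
          | # . . . or some non-quote/non-comma text, possibly empty
            ( [^",]* )
        )
    }gx)
{
    my $field;
    if (defined $1) {
        ($field = $1) =~ s/""/"/g;   # undo the doubled internal quotes
    } else {
        $field = $2;
    }
    push @fields, $field;
}
print map { "[$_]" } @fields;
# [Ten Thousand][10000][ 2710 ][][10,000][It's "10 Grand", baby][10K]
```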
Could it fail with other data? This section is named "Distrusting the bump-along," and while nothing takes the place of some thought backed up with good testing, we can use `\G` to add a measure of insurance: it allows a match to start only where the previous one ended, so if things go wrong the matching simply stops rather than silently resynchronizing somewhere unexpected. Had we used it with the earlier, wrongly ordered version, rather than

[Ten•Thousand][10000][•2710•][][][000][][•baby][10K]

we would have gotten

[Ten•Thousand][10000][•2710•][][]

instead. This perhaps would have made the error more apparent, had we missed it the first time.
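As a sketch of what that insurance looks like here, the only change to the loop above is the leading `\G` (everything else, including the field handling, stays the same):

```perl
while ($line =~ m{
        \G (?:^|,)                       # each attempt may start only where the last match ended
        (?: " ( (?: [^"] | "" )* ) "     # a double-quoted field . . .
          | ( [^",]* )                   # . . . or an unquoted, possibly empty field
        )
    }gx)
{
    # . . . same handling of $1 and $2 as before . . .
}
```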
Another approach. The beginning of this section noted two approaches to ensuring we stay properly aligned with the fields. The other is to be sure that a match begins only where a field is allowed. On the surface, this is similar to prepending `(?:^|,)`, except that here we only want to check that the comma (or the start of the line) is there, without actually consuming it, which calls for lookbehind.
Unfortunately, as Section 3.4.3.6 explains, even if lookbehind is supported, variable-length lookbehind sometimes isn't, so this approach may not work. If the variable length is the issue, we could replace it with something of fixed length, for example `(?:^|(?<=,))`, which pairs a start-of-line test with a single-character lookbehind for the comma.
However, we can use a twist on this approach: requiring a match to end before a comma (or before the end of the line). Adding the lookahead `(?=,|$)` at the end of the regex accomplishes this.
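For instance, appending the lookahead to the working regex might look like this (a sketch; packaging it as a qr// pattern is just one convenient way to write it):

```perl
my $csv_field = qr{
    (?:^|,)
    (?: " ( (?: [^"] | "" )* ) "     # a double-quoted field . . .
      | ( [^",]* )                   # . . . or an unquoted, possibly empty field
    )
    (?= , | $ )                      # . . . which must end at a comma or at the end of the line
}x;
```

The earlier while loop can then use m/$csv_field/g unchanged.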
#### 5.4.2.2 One change for the sake of efficiency

Although I don't concentrate on efficiency until the next chapter, I'd like to make one efficiency-related change now, for systems that support atomic grouping (see Section 3.4.5.3). If supported, I'd change the part that matches the values of double-quoted fields, `(?:[^"]|"")*`, to use atomic grouping, so that the engine doesn't have to remember the many backtracking states this subexpression otherwise generates.
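The exact replacement isn't spelled out above, but as an illustration of the idea (a sketch, not necessarily the intended rewrite), the quantified part can be wrapped in an atomic group so that, once the field's contents have been matched, none of that work is kept around for backtracking:

```perl
# Instead of  " ( (?: [^"] | "" )* ) "  . . .
my $quoted_field = qr{
    "                                # field's opening quote
    ( (?> (?: [^"] | "" )* ) )       # contents, with no saved backtracking states
    "                                # field's closing quote
}x;
```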
If possessive quantifiers (see Section 3.4.5.8) are supported, as they are with Sun's Java regex package, they can be used instead. The sidebar with the Java CSV code shows this. The reasoning behind these changes is discussed in Chapter 6, and eventually we end up with a particularly efficient version, shown in Section 6.6.7.3.

#### 5.4.2.3 Other CSV formats

Microsoft's CSV format is popular because it's Microsoft's CSV format, but it's not necessarily what other programs use. Here are some twists I've seen:

- Using a character other than a comma (such as a tab or a semicolon) to separate fields.
- Allowing whitespace after the separator, without that whitespace becoming part of the field's value.
- Using a backslash-escaped quote, \", instead of a pair of double quotes to represent a quote within a quoted field.
These changes are easily accommodated. Do the first by replacing each comma in the regex with the desired separator; the second by adding ` *` immediately after the `(?:^|,)` that matches the separator. For the third, we can use what we developed earlier (see Section 5.2.7), replacing `(?:[^"]|"")*` with `(?:[^\\"]|\\.)*` so that a backslash-escaped character, including \", is taken as part of the field's value.
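As a quick illustration of the first twist (a sketch; the semicolon-separated sample line is made up), the loop needs nothing more than the separator swapped in both places it appears:

```perl
my $line = q{Ten Thousand;10000; 2710 ;;"10,000";"It's ""10 Grand"", baby";10K};
while ($line =~ m{
        (?:^|;)                          # the separator is now a semicolon
        (?: " ( (?: [^"] | "" )* ) "     # a double-quoted field . . .
          | ( [^";]* )                   # . . . or an unquoted field (no semicolons)
        )
    }gx)
{
    my $field;
    if (defined $1) { ($field = $1) =~ s/""/"/g }
    else            { $field = $2 }
    print "[$field]";
}
```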