This section presents some simple techniques for writing your own regular expressions. I think of them as the push, the pull, and the composition. As in the Japanese martial art Judo, if your opponent is pushing against you, you pull him. If he's pulling away, you push. If those techniques don't work, you compose him into a pretzel.
Similarly, a regular expression problem will sometimes seem impervious to certain approaches, but very susceptible to others. The methods I describe in the following sections are only simple techniques for writing patterns. If you haven't already done so, you'll soon cultivate your own bag of regex tricks. You may even develop pet names for them.
One of the most successful ways to create regular expressions consists of taking an exact match and then slowly morphing it into a generic regular expression that matches the original. I think of this as the pull technique, because I'm slowly pulling the regular expression out of the exact match.
For example, imagine that you want to create a pattern to match four-digit numbers. Thus, 1234 would be a match, but 123 would not, and neither would 12345 or ABCD.
For the sake of this example, you'll need to introduce a single regular expression metacharacter, \d, that will match any digit ranging from 0 to 9.
Note |
A metacharacter stands for something other than itself: another, more complex character or a whole class of characters. For example, \n is a metacharacter describing the nonprintable newline character. |
Going back to the example, you know that
1234 matches 1234
This is, of course, obvious: Anything will match itself. However, you also know that \d matches any digit. By simple substitution, you can replace the last digit with \d. Thus, the pattern becomes
1234 matches 123\d
Here you replace the last digit, 4, with the equivalent metacharacter, \d. If you run this pattern through the handy RX.java program, you can see that it does, in fact, continue to match. So far, so good. Actually, it's better than good: Now you have a pattern that will match not only 1234, but also any four-digit number beginning with the digits 123. We're getting closer.
Note |
RX.java is a very short companion program for this book that you can obtain from the Downloads section of the Apress Web site (http://www.apress.com). You can use this program to execute regular expression patterns against a candidate string. |
Repeat the process on the third digit, replacing the 3 with the equivalent \d, so that 1234 should match 12\d\d. Things are looking up. Not only does this match 1234, but it also matches any four-digit number beginning with the digits 12.
You can see where this is going. Eventually, you'll create the pattern \d\d\d\d, which will match any four digits. This isn't the most succinct pattern, but it's sufficient to meet the stated need: It matches any four-digit number.
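As a minimal sketch of that progression in Java, the class below lists each intermediate pattern and checks it against the candidate string. The class name PullTechnique and its matches helper are hypothetical and aren't part of RX.java. The sketch assumes whole-string matching through java.util.regex.Pattern.matches, and each backslash is doubled because it appears inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the book's RX.java program.
class PullTechnique {
    // Each step replaces one more literal digit with \d, working right to left.
    static final String[] STEPS = {
        "1234",
        "123\\d",
        "12\\d\\d",
        "1\\d\\d\\d",
        "\\d\\d\\d\\d"
    };

    // Returns true only if the entire candidate matches the pattern.
    static boolean matches(String pattern, String candidate) {
        return Pattern.matches(pattern, candidate);
    }

    public static void main(String[] args) {
        // Every intermediate pattern should still match the original string.
        for (String step : STEPS) {
            System.out.println(step + " matches 1234: " + matches(step, "1234"));
        }
        // The final pattern should reject the nonmatching candidates.
        String finalPattern = STEPS[STEPS.length - 1];
        for (String candidate : new String[] {"123", "12345", "ABCD"}) {
            System.out.println(finalPattern + " matches " + candidate + ": "
                    + matches(finalPattern, candidate));
        }
    }
}
```

Checking the original string after every substitution is the useful habit here: if a step ever stops matching 1234, you know exactly which replacement broke the pattern.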
The point here is that you can, in principle, sometimes work backward from a specific match to create the pattern you need. Of course, this is just a technique, and it won't work for all situations. However, it's a good method to put into your regex bag of tricks.
Another technique that I've found to be helpful in writing regular expression patterns is the push technique. The push technique builds on previous work by either adding to it, subtracting from it, or modifying its scope until it's useful.
Instead of working with a specific matching token, as in the pull technique, this approach takes a preexisting regular expression that's similar to the one you need and modifies it until it does the required job. That is, the regular expression is pushed into another functionality, hence the name.
For example, say you want a regex pattern that matches five digits. Based on the previous example, you know that \d\d\d\d will match any four digits. Thus, the process of finding a pattern for a five-digit match is as easy as appending another \d to the previous pattern. The answer, of course, is the pattern \d\d\d\d\d.
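The push from four digits to five might be sketched in Java as follows. The class and constant names are hypothetical, chosen only for illustration, and the sketch again assumes whole-string matching with Pattern.matches rather than the behavior of RX.java.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PushTechnique {
    // The existing pattern you start from: any four digits.
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // Push the existing pattern into a new job by appending one more \d.
    static final String FIVE_DIGITS = FOUR_DIGITS + "\\d";

    // Returns true only if the entire candidate consists of five digits.
    static boolean isFiveDigits(String candidate) {
        return Pattern.matches(FIVE_DIGITS, candidate);
    }

    public static void main(String[] args) {
        System.out.println(isFiveDigits("12345")); // five digits
        System.out.println(isFiveDigits("1234"));  // one digit too few
        System.out.println(isFiveDigits("1234A")); // last character isn't a digit
    }
}
```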
As you progress through this chapter, you'll learn that these aren't the most elegant representations of the four-digit and five-digit matching patterns you could come up with, but they're perfectly legitimate solutions, and they're reasonably derived. That process of derivation is the important point to take away from this discussion.
The composition technique does exactly what its name implies: It puts together various patterns to form a new whole. That is, it's the composition of a new pattern by using other patterns. This is distinct from the push technique in that patterns aren't modified; rather, they're simply appended.
Assume that you need to create a pattern that will match United States zip codes, which consist of five digits, followed by a hyphen character, followed by four digits. Based on the work you've already done, this pattern is very easy to create. You know that five digits match \d\d\d\d\d, that a hyphen matches itself, and that four digits match \d\d\d\d. Composing these into a single pattern yields the pattern \d\d\d\d\d-\d\d\d\d.
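One way to picture the composition in Java is to keep each piece as its own constant and concatenate them unmodified. This is only a sketch under the stated requirement of five digits, a hyphen, and four digits. The class, constant, and method names are hypothetical, and whole-string matching with Pattern.matches is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class CompositionTechnique {
    // Building blocks taken from the earlier examples, left unmodified.
    static final String FIVE_DIGITS = "\\d\\d\\d\\d\\d";
    static final String HYPHEN = "-";
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // Composition: the pieces are simply appended in order.
    static final String ZIP_CODE = FIVE_DIGITS + HYPHEN + FOUR_DIGITS;

    // Returns true only if the entire candidate is five digits, a hyphen, four digits.
    static boolean isZipCode(String candidate) {
        return Pattern.matches(ZIP_CODE, candidate);
    }

    public static void main(String[] args) {
        System.out.println(isZipCode("12345-6789")); // meets the stated requirement
        System.out.println(isZipCode("12345"));      // five digits only
        System.out.println(isZipCode("12345 6789")); // space instead of a hyphen
    }
}
```

Keeping the pieces as separate constants makes the derivation visible in the code: each part can be tested alone before the composed pattern is.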
Again, this isn't the most elegant and concise representation for a zip code, and it isn't very permissive (what about five-digit zip codes? What if there are spaces between the hyphen and the digits? What if there is no hyphen, just a space?), but it does meet the stated requirement.
Note |
As with most software problems, you can often find the solution to a regex conundrum by clarifying the requirements. |