1.5 Expanding the Foundation

I hope the examples and explanations so far have helped to establish the basis for a solid understanding of regular expressions, but please realize that what I've provided so far lacks depth. There's so much more out there.

1.5.1 Linguistic Diversification

I mentioned a number of regular expression features that most versions of egrep support. There are other features, some of which are not supported by all versions, which I'll leave for later chapters.

Unfortunately, the regular expression language is no different from any other in that it has various dialects and accents. It seems each new program employing regular expressions devises its own "improvements." The state of the art continually moves forward, but changes over the years have resulted in a wide variety of regular expression "flavors." We'll see many examples in the following chapters.

1.5.2 The Goal of a Regular Expression

From the broadest top-down view, a regular expression either matches within a lump of text (with egrep, each line) or it doesn't. When crafting a regular expression, you must consider the ongoing tug-of-war between having your expression match the lines you want, yet still not matching lines you don't want.

Also, while egrep doesn't care where in the line the match occurs, this concern is important for many other regular-expression uses. If your text is something such as

...zip is 44272. If you write, send $4.95 to cover postage and...

and you merely want to find lines matching [0-9]+, you don't care which numbers are matched. However, if your intent is to do something with the number (such as save to a file, add, replace, and such — we will see examples of this kind of processing in the next chapter), you'll care very much exactly which numbers are matched.

1.5.3 A Few More Examples

As with any language, experience is a very good thing, so I'm including a few more examples of regular expressions to match some common constructs.

Half the battle when writing regular expressions is getting successful matches when and where you want them. The other half is to not match when and where you don't want. In practice, both are important, but for the moment, I would like to concentrate on the "getting successful matches" aspect. Even though I don't take these examples to their fullest depths, they still provide useful insight.

1.5.3.1 Variable names

Many programming languages have identifiers (variable names and such) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. They are matched by [a-zA-Z_][a-zA-Z_0-9]*. The first character class matches what the first character can be, the second (with its accompanying star) allows the rest of the identifier. If there is a limit on the length of an identifier, say 32 characters, you might replace the star with {0,31} if the { min,max } notation is supported. (This construct, the interval quantifier, was briefly mentioned in Section 1.4.9.1.)

1.5.3.2 A string within double quotes

A simple solution to matching a string within double quotes might be: "[^"]*"

The double quotes at either end are to match the opening and closing double quotes of the string. Between them, we can have anything... except another double quote! So, we use [^"] to match all characters except a double quote, and apply using a star to indicate we can have any number of such non double-quote characters.

A more useful (but more complex) definition of a double-quoted string allows double quotes within the string if they are escaped with a backslash, such as in "nail•the•2\"x4\"•plank". We'll see this example several times in future chapters while covering the many details of how a match is actually carried out.

1.5.3.3 Dollar amount (with optional cents)

One approach to matching a dollar amount is: \$[0-9]+(\.[0-9][0-9])?

From a top-level perspective, this is a simple regular expression with three parts: \$ and ···+ and (···)?, which might be loosely paraphrased as "a literal dollar sign, a bunch of one thing, and finally perhaps another thing." In this case, the "one thing" is a digit (with a bunch of them being a number), and "another thing" is the combination of a decimal point followed by two digits.

This example is a bit naïve for several reasons. For example, it considers dollar amounts like $1000, but not $1,000. It does allow for optional cents, but frankly, that's not really very useful when applied with egrep. egrep never cares exactly how much is matched, but merely whether there is a match. Allowing something optional at the end never changes whether there's an overall match to begin with.

But, if you need to find lines that contain just a price, and nothing else, you can wrap the expression with ^···$. In this case, the optional cents part becomes important since it might or might not come between the dollar amount and the end of the line, and allowing or disallowing it makes the difference in achieving an overall match.

One type of value our expression doesn't match is '$.49'. To solve this, you might be tempted to change the plus to a star, but that doesn't work. As to why, I'll leave it as a teaser until we look at this example again in Chapter 5 (see Section 5.2.5).

1.5.3.4 An HTTP/HTML URL

The format of web URLs can be complex, so constructing a regular expression to match any possible URL can be equally complex. However, relaxing your standards slightly can allow you to match most common URLs with a fairly simple expression. One common reason I might do this, for example, would be to search my email archive for a URL that I vaguely remember having received, but which I think I might recognize when I see it.

The general form of a common HTTP/HTML URL is along the lines of

http://hostname/path.html

although ending with .htm is common as well.

The rules about what can and can't be a hostname (computer name, such as www.yahoo.com) are complex, but for our needs we can realize that if it follows 'http://', it's probably a hostname, so we can make do with something simple, such as [-a-z0-9_.]+. The path part can be even more varied, so we'll use [-a-z0-9_:@&?=+,.!/~*'%$]* for that. Notice that these classes have the dash first, to ensure that it's taken as a literal character and included in the list, as opposed to part of a range (see Section 1.4.2).

Putting these all together, we might use as our first attempt something like:

% egrep -i '\<http://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*'%$]*\.html?\>' files

Again, since we've taken liberties and relaxed what we'll match, we could well match something such as 'http://..../foo.html', which is certainly not a valid URL. Do we care about this? It all depends on what you're trying to do. For my scan of my email archive, it doesn't really matter if I get a few false matches. Heck, I could probably get away with even something as simple as:

% egrep -i '\<http://[^ ]*\.html?\>' files...

As we'll learn when getting deeper into how to craft an expression, knowing the data you'll be searching is an important aspect of finding the balance between complexity and completeness. We'll visit this example again, in more detail, in the next chapter.

1.5.3.5 An HTML tag

With a tool like egrep, it doesn't seem particularly common or useful to simply match lines with HTML tags. But, exploring a regular expression that matches HTML tags exactly can be quite fruitful, especially when we delve into more advanced tools in the next chapter.

Looking at simple cases like '<TITLE>' and '<HR>', we might think to try <.*>. This simplistic approach is a frequent first thought, but it's certainly incorrect. Converting <.*> into English reads "match a '<', followed by as much of anything as can be matched, followed by '>'." Well, when phrased that way, it shouldn't be surprising that it can match more than just one tag, such as the marked portion of 'this <I>short</I> example'.

This might have been a bit surprising, but we're still in the first chapter, and our understanding at this point is only superficial. I have this example here to high-light that regular expressions are not a difficult subject, but they can be tricky if you don't truly understand them. Over the next few chapters, we'll look at all the details required to understand and solve this problem.

1.5.3.6 Time of day, such as "9:17 am" or "12:30 pm"

Matching a time can be taken to varying levels of strictness. Something such as

[0-9]?[0-9]:[0-9][0-9]•(am|pm)

picks up both 9:17•am and 12:30•pm, but also allows something nonsensical like 99:99•pm.

Looking at the hour, we realize that if it is a two-digit number, the first digit must be a one. But, 1?[0-9] still allows an hour of 19 (and also an hour of 0), so maybe it is better to break the hour part into two possibilities: 1[012] for two-digit hours and [1-9] for single-digit hours. The result is (1[012]|[1-9]).

The minute part is easier. The first digit should be [0-5]. For the second, we can stick with the current [0-9]. This gives (1[012]|[1-9]):[0-5][0-9]•(am|pm) when we put it all together.

Using the same logic, can you extend this to handle 24-hour time with hours from 0 through 23? As a challenge, allow for a leading zero, at least through to 09:59.
Try building your solution, and then click here to check mine.

1.5.4 Regular Expression Nomenclature

1.5.4.1 Regex

As you might guess, using the full phrase "regular expression" can get a bit tiring, particularly in writing. Instead, I normally use "regex." It just rolls right off the tongue (it rhymes with "FedEx," with a hard g sound like "regular" and not a soft one like in "Regina") and it is amenable to a variety of uses like "when you regex...," "budding regexers," and even "regexification." ^[11] I use the phrase "regex engine" to refer to the part of a program that actually does the work of carrying out a match attempt.

^[11] You might also come across the decidedly unsightly "regexp." I'm not sure how one would pronounce that, but those with a lisp might find it a bit easier.

1.5.4.2 Matching

When I say a regex "matches" a string, I really mean that it matches in a string. Technically, the regex a doesn't match cat, but matches the a in cat. It's not something that people tend to confuse, but it's still worthy of mention.

1.5.4.3 Metacharacter

Whether a character is a metacharacter (or "metasequence"—I use the words inter-changeably) depends on exactly where in the regex it's used. For example, * is a metacharacter, but only when it's not within a character class and when not escaped. "Escaped" means that it has a backslash in front of it—usually. The star is escaped in \*, but not in \\* (where the first backslash escapes the second), although the star "has a backslash in front of it" in both examples.

Depending upon the regex flavor, there are various situations when certain characters are and aren't metacharacters. Chapter 3 discusses this in more detail.

1.5.4.4 Flavor

As I've hinted, different tools use regular expressions for many different things, and the set of metacharacters and other features that each support can differ. Let's look at word boundaries again as an example. Some versions of egrep support the \<···\> notation we've seen. However, some do not support the separate word-start and word-end, but one catch-all \b metacharacter (which we haven't seen yet — we'll see it in the next chapter). Still others support both, and many others support neither.

I use the term "flavor" to describe the sum total of all these little implementation decisions. In the language analogy, it's the same as a dialect of an individual speaker. Superficially, this concept refers to which metacharacters are and aren't supported, but there's much more to it. Even if two programs both support \<···\>, they might disagree on exactly what they do and don't consider to be a word. This concern is important when you use the tool.

Don't confuse "flavor" with "tool." Just as two people can speak the same dialect, two completely different programs can support exactly the same regex flavor. Also, two programs with the same name (and built to do the same task) often have slightly (and sometimes not-so-slightly) different flavors. Among the various programs called egrep, there is a wide variety of regex flavors supported.

In the late 1990s, the particularly expressive flavor offered by the Perl programming language was widely recognized for its power, and soon other languages were offering Perl-inspired regular expressions (many even acknowledging the inspirational source by labeling themselves "Perl-compatible"). The adopters include Python, many Java regex package, Microsoft's .NET Framework, Tcl, and a variety of C libraries, to name a few. Yet, all are different in important respects. On top of this, Perl's regular expressions themselves are evolving and growing (some-times, now, in response to advances seen with other tools). As always, the overall landscape continues to become more varied and confusing.

1.5.4.5 Subexpression

The term "subexpression" simply refers to part of a larger expression, although it often refers to some part of an expression within parentheses, or to an alternative of |. For example, with ^(Subject|Date):•, the Subject|Date is usually referred to as a subexpression. Within that, the alternatives Subject and Date are each referred to as subexpressions as well. But technically, S is a subexpression, as is u, and b, and j, . . .

Something such as 1-6 isn't considered a subexpression of H[1-6]•*, since the '1-6' is part of an unbreakable "unit," the character class. But, H, [1-6], and •* are all subexpressions of H[1-6]•*.

Unlike alternation, quantifiers (star, plus, and question mark) always work with the smallest immediately-preceding subexpression. This is why with mis+pell, the + governs the s, not the mis or is. Of course, when what immediately precedes a quantifier is a parenthesized subexpression, the entire subexpression (no matter how complex) is taken as one unit.

1.5.4.6 Character

The word "character" can be a loaded term in computing. The character that a byte represents is merely a matter of interpretation. A byte with such-and-such a value has that same value in any context in which you might wish to consider it, but which character that value represents depends on the encoding in which it's viewed. As a concrete example, two bytes with decimal values 64 and 53 represent the characters "@" and "5" respectively, if considered in the ASCII encoding, yet on the other hand are completely different if considered in the EBCDIC encoding (they are a space and some kind of a control character).

On the third hand, if those two bytes are considered in one of the popular encodings for Japanese characters, together they represent the single character figs/japscript3.jpg . Yet, to represent this same character in another of the Japanese encodings requires two completely different bytes. Those two different bytes, by the way, yield the two characters "Àm" in the popular Latin-1 encoding, but yield the one Korean character figs/japscript4.jpg in one of the Unicode encodings. ^[12] The point is this: how bytes are to be interpreted is a matter of perspective (called an encoding), and to be successful, you've got to make sure that your perspective agrees with the perspective taken by the tool you're using.

^[12] The definitive book on multiple-byte encodings is Ken Lunde's CJKV Information Processing, also published by O'Reilly & Associates. The CJKV stands for Chinese, Japanese, Korean, and Vietnamese, which are languages that tend to require multiple-byte encodings. Ken and Adobe kindly provided many of the special fonts used in this book.

Until recently, text-processing tools generally treated their data as a bunch of ASCII bytes, without regard to the encoding you might be intending. Recently, however, more and more systems are using some form of Unicode to process data internally (Chapter 3 includes an introduction to Unicode see Section 3.3.2). On such systems, if the regular-expression subsystem has been implemented properly, the user doesn't normally have to pay much attention to these issues. That's a big "if," which is why Chapter 3 looks at this issue in depth.

1.5.5 Improving on the Status Quo

When it comes down to it, regular expressions are not difficult. But, if you talk to the average user of a program or language that supports them, you will likely find someone that understands them "a bit," but does not feel secure enough to really use them for anything complex or with any tool but those they use most often.

Traditionally, regular expression documentation tends to be limited to a short and incomplete description of one or two metacharacters, followed by a table of the rest. Examples often use meaningless regular expressions like a*((ab)*|b*), and text like 'a•xxx•ce•xxxxxx•ci•xxx•d'. They also tend to completely ignore subtle but important points, and often claim that their flavor is the same as some other well-known tool, almost always forgetting to mention the exceptions where they inevitably differ. The state of regex documentation needs help.

Now, I don't mean to imply that this chapter fills the gap for all regular expressions, or even for egrep regular expressions. Rather, this chapter merely provides the foundation upon which the rest of this book is built. It may be ambitious, but I hope this book does fill the gaps for you. I received many gratifying responses to the first edition, and have worked very hard to make this one even better, both in breadth and in depth.

Perhaps because regular-expression documentation has traditionally been so lacking, I feel the need to make the extra effort to make things particularly clear. Because I want to make sure you can use regular expressions to their fullest potential, I want to make sure you really, really understand them.

This is both good and bad.

It is good because you will learn how to think regular expressions. You will learn which differences and peculiarities to watch out for when faced with a new tool with a different flavor. You will know how to express yourself even with a weak, stripped-down regular expression flavor. You will understand what makes one expression more efficient than another, and will be able to balance tradeoffs among complexity, efficiency, and match results. When faced with a particularly complex task, you will know how to work through an expression the way the program would, constructing it as you go. In short, you will be comfortable using regular expressions to their fullest.

The problem is that the learning curve of this method can be rather steep, with three separate issues to tackle:

How regular expressions are used Most programs use regular expressions in ways that are more complex than egrep. Before we can discuss in detail how to write a really useful expression, we need to look at the ways regular expressions can be used. We start in the next chapter.
Regular expression features Selecting the proper tool to use when faced with a problem seems to be half the battle, so I don't want to limit myself to only using one utility throughout this book. Different programs, and often even different versions of the same program, provide different features and metacharacters. We must survey the field before getting into the details of using them. This is the subject of Chapter 3.
How regular expressions really work Before we can learn from useful (but often complex) examples, we need to "look under the hood" to understand just how a regular expression search is conducted. As we'll see, the order in which certain metacharacters are checked can be very important. In fact, regular expression engines can be implemented in different ways, so different pro-grams sometimes do different things with the same expression. We examine this meaty subject in Chapters 4, 5, and 6.

This last point is the most important and the most difficult to address. The discussion is unfortunately sometimes a bit dry, with the reader chomping at the bit to get to the fun part — tackling real problems. However, understanding how the regex engine really works is the key to really understanding.

You might argue that you don't want to be taught how a car works when you simply want to know how to drive. But, learning to drive a car is a poor analogy for learning about regular expressions. My goal is to teach you how to solve problems with regular expressions, and that means constructing regular expressions. The better analogy is not how to drive a car, but how to build one. Before you can build a car, you have to know how it works.

Chapter 2 gives more experience with driving. Chapter 3 takes a short look at the history of driving, and a detailed look at the bodywork of a regex flavor. Chapter 4 looks at the all-important engine of a regex flavor. Chapter 5 shows some extended examples, Chapter 6 shows you how to tune up certain kinds of engines, and the chapters after that examine some specific makes and models. Particularly in Chapters 4, 5, and 6, we'll spend a lot of time under the hood, so make sure to have your coveralls and shop rags handy.

1.5.6 Summary

Table 1-3 summarizes the egrep metacharacters we've looked at in this chapter.

Table 3. Egrep Metacharacter Summary

Items to Match a Single Character

Metacharacter Matches

.
[···]
[^···]
\char dot
character class
negated character class
escaped character Matches any one character
Matches any one character listed
Matches any one character not listed
When char is a metacharacter, or the escaped combination is not
otherwise special, matches the literal char

Items Appended to Provide "Counting" : The Quantifiers

?
*
+
{ min,max } question
star
plus
specified range ^[13] One allowed, but it is optional
Any number allowed, but all are optional
At least one required; additional are optional
Min required, max allowed

^[13] not supported by all versions of egrep

Items That Match a Position

^
$
\<
\>
caret
dollar
word boundary ^[13]
word boundary ^[13]
Matches the position at the start of the line
Matches the position at the end of the line
Matches the position at the start of a word
Matches the position at the end of a word

Other

|
(···)

\1, \2, ... alternation
parentheses

backreference ^[13]

Matches either expression it separates
Limits scope of alternation, provides grouping for the quantifiers, and "captures" for backreferences
Matches text previously matched within first, second, etc., set of parentheses.

^[13] not supported by all versions of egrep

In addition, be sure that you understand the following points:

Not all egrep programs are the same. The metacharacters supported, as well as their exact meanings, are often different — see your local documentation (see Section 1.5).
Three reasons for using parentheses are constraining alternation (see Section 1.4.4), grouping (see Section 1.4.5), and capturing (see Section 1.4.10).
Character classes are special, and have their own set of metacharacters totally distinct from the "main" regex language (see Section 1.4.2.2).
Alternation and character classes are fundamentally different, providing unrelated services that appear, in only one limited situation, to overlap (see Section 1.4.4).
A negated character class is still a "positive assertion"—even negated, a character class must match a character to be successful. Because the listing of characters to match is negated, the matched character must be one of those not listed in the class (see Section 1.4.3).
The useful -i option discounts capitalization during a match (see Section 1.4.6).
There are three types of escaped items:
1. The pairing of \ and a metacharacter is a metasequence to match the literal character (for example, \* matches a literal asterisk).
2. The pairing of \ and selected non-metacharacters becomes a metasequence with an implementation-defined meaning (for example, \< often means "start of word").
3. The pairing of \ and any other character defaults to simply matching the character (that is, the backslash is ignored).
Remember, though, that a backslash within a character class is not special at all with most versions of egrep, so it provides no "escape services" in such a situation.
Items governed by a question mark or star don't need to actually match any characters to "match successfully." They are always successful, even if they don't match anything (see Section 1.4.8).

< Free Open Study >