
1.4 Egrep Metacharacters

Let's start to explore some of the egrep metacharacters that supply its regular-expression power. I'll go over them quickly with a few examples, leaving the detailed examples and descriptions for later chapters.

Typographical Conventions: Before we begin, please make sure to review the typographical conventions explained in the preface. This book forges a bit of new ground in the area of typesetting, so some of my notations may be unfamiliar at first.

1.4.1 Start and End of the Line

Probably the easiest metacharacters to understand are ┌^┘ (caret) and ┌$┘ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we've seen, the regular expression ┌cat┘ finds c·a·t anywhere on the line, but ┌^cat┘ matches only if the c·a·t is at the beginning of the line: the ┌^┘ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, ┌cat$┘ finds c·a·t only at the end of the line, such as a line ending with scat.
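Here is a minimal sketch of the two anchors at work. It assumes that `grep -E` stands in for egrep on your system, and the sample lines are made up for illustration.

```shell
# Made-up sample lines; grep -E is assumed to behave like egrep.
sample='cat at the start
a scat
concatenate
the cat'

printf '%s\n' "$sample" | grep -E '^cat'   # only the line that begins with cat
printf '%s\n' "$sample" | grep -E 'cat$'   # only the lines that end with cat
```

The unanchored ┌cat┘ would have reported all four lines, since each contains c·a·t somewhere.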

It's best to get into the habit of interpreting regular expressions in a rather literal way. For example, don't think

┌^cat┘ matches a line with cat at the beginning

but rather:

┌^cat┘ matches if you have the beginning of a line, followed immediately by c, followed immediately by a, followed immediately by t.

They both end up meaning the same thing, but reading it the more literal way allows you to intrinsically understand a new expression when you see it. How would egrep interpret ┌cat$┘, ┌^$┘, or even simply ┌^┘ alone? Work out your interpretations before reading on.

The caret and dollar are special in that they match a position in the line rather than any actual text characters themselves. Of course, there are various ways to actually match real text. Besides providing literal characters like ┌cat┘ in your regular expression, you can also use some of the items discussed in the next few sections.

1.4.2 Character Classes

1.4.2.1 Matching any one of several characters

Let's say you want to search for "grey," but also want to find it if it were spelled "gray." The regular-expression construct ┌[···]┘, usually called a character class, lets you list the characters you want to allow at that point in the match. While ┌e┘ matches just an e, and ┌a┘ matches just an a, the regular expression ┌[ea]┘ matches either. So, then, consider ┌gr[ea]y┘: this means to find "g, followed by r, followed by either an e or an a, all followed by y." Because I'm a really poor speller, I'm always using regular expressions like this against a huge list of English words to figure out proper spellings. One I use often is ┌sep[ea]r[ea]te┘, because I can never remember whether the word is spelled "seperate," "separate," "separete," or what. The one that pops up in the list is the proper spelling; regular expressions to the rescue.

Notice how outside of a class, literal characters (like the ┌g┘ and ┌r┘ of ┌gr[ea]y┘) have an implied "and then" between them: "match ┌g┘ and then match ┌r┘ . . ." It's completely opposite inside a character class. The contents of a class are a list of characters that can match at that point, so the implication is "or."
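As a quick illustration, here is a sketch of the spelling check run against a tiny stand-in for that huge word list. The list contents are my own assumption, and `grep -E` is assumed to stand in for egrep.

```shell
# A tiny stand-in for a real word list (assumed contents).
words='gray
grey
griy
separate
desperate'

printf '%s\n' "$words" | grep -E 'gr[ea]y'          # gray and grey, but not griy
printf '%s\n' "$words" | grep -E 'sep[ea]r[ea]te'   # only the proper spelling
```

Each class stands for exactly one character position, which is why griy is passed over: i is not in the list.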

As another example, maybe you want to allow capitalization of a word's first letter, such as with ┌[Ss]mith┘. Remember that this still matches lines that contain smith (or Smith) embedded within another word, such as with blacksmith. I don't want to harp on this throughout the overview, but this issue does seem to be the source of problems among some new users. I'll touch on some ways to handle this embedded-word problem after we examine a few more metacharacters.

You can list in the class as many characters as you like. For example, ┌[123456]┘ matches any of the listed digits. This particular class might be useful as part of ┌<H[123456]>┘, which matches <H1>, <H2>, <H3>, etc. This can be useful when searching for HTML headers.

Within a character class, the character-class metacharacter '-' (dash) indicates a range of characters: ┌<H[1-6]>┘ is identical to the previous example. ┌[0-9]┘ and ┌[a-z]┘ are common shorthands for classes to match digits and English lowercase letters, respectively. Multiple ranges are fine, so ┌[0123456789abcdefABCDEF]┘ can be written as ┌[0-9a-fA-F]┘ (or, perhaps, ┌[A-Fa-f0-9]┘, since the order in which ranges are given doesn't matter). These last three examples can be useful when processing hexadecimal numbers. You can freely combine ranges with literal characters: ┌[0-9A-Z_!.?]┘ matches a digit, uppercase letter, underscore, exclamation point, period, or a question mark.

Note that a dash is a metacharacter only within a character class; otherwise it matches the normal dash character. In fact, it is not even always a metacharacter within a character class. If it is the first character listed in the class, it can't possibly indicate a range, so it is not considered a metacharacter. Along the same lines, the question mark and period at the end of the class are usually regular-expression metacharacters, but only when not within a class (so, to be clear, the only special characters within the class in ┌[0-9A-Z_!.?]┘ are the two dashes).
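The following sketch tries a range, a pair of hexadecimal ranges, and a class whose leading dash is literal. The input strings and the `0x` prefix are made up for illustration, and `grep -E` is assumed to stand in for egrep.

```shell
# Range: only header levels 1 through 6 are accepted.
printf '%s\n' '<H1>' '<H6>' '<H7>' | grep -E '<H[1-6]>'

# Multiple ranges in one class: a hex digit must follow the (assumed) 0x prefix.
printf '%s\n' '0xFF' '0x9a' '0xZZ' | grep -E '0x[0-9a-fA-F]'

# A dash listed first cannot form a range, so it matches a literal dash.
printf '%s\n' 'a-b' 'a5b' 'axb' | grep -E 'a[-0-9]b'
```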

Consider character classes as their own mini language. The rules regarding which metacharacters are supported (and what they do) are completely different inside and outside of character classes.

We'll see more examples of this shortly.

1.4.2.2 Negated character classes

If you use ┌[^···]┘ instead of ┌[···]┘, the class matches any character that isn't listed. For example, ┌[^1-6]┘ matches a character that's not 1 through 6. The leading ^ in the class "negates" the list, so rather than listing the characters you want to include in the class, you list the characters you don't want to be included.

You might have noticed that the ^ used here is the same as the start-of-line caret introduced in Section 1.4.1. The character is the same, but the meaning is completely different. Just as the English word "wind" can mean different things depending on the context (sometimes a strong breeze, sometimes what you do to a clock), so can a metacharacter. We've already seen one example, the range-building dash. It is valid only inside a character class (and at that, only when not first inside the class). ^ is a line anchor outside a class, but a class metacharacter inside a class (but, only when it is immediately after the class's opening bracket; otherwise, it's not special inside a class). Don't fear — these are the most complex special cases; others we'll see later aren't so bad.

As another example, let's search that list of English words for odd words that have q followed by something other than u. Translating that into a regular expression, it becomes ┌q[^u]┘. I tried it on the list I have, and there certainly weren't many. I did find a few, including a number of words that I didn't even know were English.

Here's what happened. (What I typed follows the % prompt.)

% egrep 'q[^u]' word.list
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum
%

Two notable words not listed are "Qantas", the Australian airline, and "Iraq". Although both words are in the word.list file, neither was displayed by my egrep command. Why? Think about it for a bit before reading on.

Remember, a negated character class means "match a character that's not listed" and not "don't match what is listed." These might seem the same, but the Iraq example shows the subtle difference. A convenient way to view a negated class is that it is simply a shorthand for a normal class that includes all possible characters except those that are listed.
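To see the negated class pick out a character that is present but not listed, here is a small sketch. The four-word list is my own stand-in for word.list, and `grep -E` is assumed to stand in for egrep.

```shell
# Assumed miniature word list.
words='Iraqi
quit
qoph
queen'

# Needs a q AND a following character that is anything but u.
printf '%s\n' "$words" | grep -E 'q[^u]'
```

Iraqi and qoph are reported because an i and an o, respectively, sit right after the q; quit and queen are not, because the only character available after their q is a u.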

1.4.3 Matching Any Character with Dot

The metacharacter ┌.┘ (usually called dot or point) is a shorthand for a character class that matches any character. It can be convenient when you want to have an "any character here" placeholder in your expression. For example, if you want to search for a date such as 03/19/76, 03-19-76, or even 03.19.76, you could go to the trouble to construct a regular expression that uses character classes to explicitly allow '/', '-', or '.' between each number, such as ┌03[-./]19[-./]76┘. However, you might also try simply using ┌03.19.76┘.

Quite a few things are going on with this example that might be unclear at first. In ┌03[-./]19[-./]76┘, the dots are not metacharacters because they are within a character class. (Remember, the list of metacharacters and their meanings is different inside and outside of character classes.) The dashes are also not class metacharacters in this case because each is the first thing after [ or [^. Had they not been first, as with ┌[.-/]┘, they would be the class range metacharacter, which would be a mistake in this situation.

With ┌03.19.76┘, the dots are metacharacters: ones that match any character (including the dash, period, and slash that we are expecting). However, it is important to know that each dot can match any character at all, so it can match, say, 'lottery numbers: 19 203319 7639'.

So, ┌03[-./]19[-./]76┘ is more precise, but it's more difficult to read and write. ┌03.19.76┘ is easy to understand, but vague. Which should we use? It all depends upon what you know about the data being searched, and just how specific you feel you need to be. One important, recurring issue has to do with balancing your knowledge of the text being searched against the need to always be exact when writing an expression. For example, if you know that with your data it would be highly unlikely for ┌03.19.76┘ to match in an unwanted place, it would certainly be reasonable to use it. Knowing the target text well is an important part of wielding regular expressions effectively.
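The difference is easy to see by running both expressions over the same lines. This is a sketch that assumes `grep -E` stands in for egrep; the input simply collects the strings mentioned above.

```shell
dates='03/19/76
03-19-76
03.19.76
lottery numbers: 19 203319 7639'

printf '%s\n' "$dates" | grep -E '03.19.76'          # all four lines
printf '%s\n' "$dates" | grep -E '03[-./]19[-./]76'  # only the three dates
```

In the lottery line, the first dot matches the second 3 of 203319 and the second dot matches the space before 7639.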

1.4.4 Alternation

1.4.4.1 Matching any one of several subexpressions

A very convenient metacharacter is ┌|┘, which means "or." It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, ┌Bob┘ and ┌Robert┘ are separate expressions, but ┌Bob|Robert┘ is one expression that matches either. When combined this way, the subexpressions are called alternatives.

Looking back to our ┌gr[ea]y┘ example, it is interesting to realize that it can be written as ┌grey|gray┘, and even ┌gr(a|e)y┘. The latter case uses parentheses to constrain the alternation. (For the record, parentheses are metacharacters too.) Note that something like ┌gr[a|e]y┘ is not what we want: within a class, the '|' character is just a normal character, like ┌a┘ and ┌e┘.

With ┌gr(a|e)y┘, the parentheses are required because without them, ┌gra|ey┘ means "┌gra┘ or ┌ey┘," which is not what we want here. Alternation reaches far, but not beyond parentheses. Another example is ┌(First|1st)•[Ss]treet┘. [5] Actually, since both ┌First┘ and ┌1st┘ end with ┌st┘, the combination can be shortened to ┌(Fir|1)st•[Ss]treet┘. That's not necessarily quite as easy to read, but be sure to understand that ┌(first|1st)┘ and ┌(fir|1)st┘ effectively mean the same thing.

[5] Recall from the typographical conventions in the Preface that "•" is how I sometimes show a space character so it can be seen easily.
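Here is a small sketch of what the parentheses buy you, assuming `grep -E` stands in for egrep and using a made-up word list.

```shell
words='gray
grey
grab
they'

printf '%s\n' "$words" | grep -E 'gr(a|e)y'   # gray and grey only
printf '%s\n' "$words" | grep -E 'gra|ey'     # all four: each has gra or ey
```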

Here's an example involving an alternate spelling of my name. Compare and contrast the following three expressions, which are all effectively the same:

┌Jeffrey|Jeffery┘
┌Jeff(rey|ery)┘
┌Jeff(re|er)y┘

To have them match the British spellings as well, they could be:

┌(Geoff|Jeff)(rey|ery)┘
┌(Geo|Je)ff(rey|ery)┘
┌(Geo|Je)ff(re|er)y┘

Finally, note that these three match effectively the same as the longer (but simpler) ┌Jeffrey|Geoffery|Jeffery|Geoffrey┘. They're all different ways to specify the same desired matches.
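One way to convince yourself that these expressions agree is to count what each one finds in the same list. This is only a sketch: `count_matches` is a hypothetical helper of my own, the name list is made up, and `grep -E` is assumed to stand in for egrep.

```shell
names='Jeffrey
Jeffery
Geoffrey
Geoffery
Jeffry'

# Hypothetical helper: how many lines of $names does the expression in $1 match?
count_matches() { printf '%s\n' "$names" | grep -E -c "$1"; }

for re in '(Geoff|Jeff)(rey|ery)' '(Geo|Je)ff(rey|ery)' '(Geo|Je)ff(re|er)y' \
          'Jeffrey|Geoffery|Jeffery|Geoffrey'; do
  printf '%s -> %s\n' "$re" "$(count_matches "$re")"
done
```

Every expression reports the same four names and skips Jeffry. Agreement on one list is not a proof of equivalence, of course, but it is a handy sanity check.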

Although the ┌gr[ea]y┘ versus ┌gr(a|e)y┘ examples might blur the distinction, be careful not to confuse the concept of alternation with that of a character class. A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the "main" regular expression language. You'll find both to be extremely useful.

Also, take care when using caret or dollar in an expression that has alternation. Compare ┌^From|Subject|Date:•┘ with ┌^(From|Subject|Date):•┘. Both appear similar to our earlier email example, but what each matches (and therefore how useful it is) differs greatly. The first is composed of three alternatives, so it matches "┌^From┘ or ┌Subject┘ or ┌Date:•┘," which is not particularly useful. We want the leading caret and trailing ┌:•┘ to apply to each alternative. We can accomplish this by using parentheses to "constrain" the alternation:

┌^(From|Subject|Date):•┘

The alternation is constrained by the parentheses, so literally, this regex means "match the start of the line, then one of ┌From┘, ┌Subject┘, or ┌Date┘, and then match ┌:•┘." Effectively, it matches:

1) start-of-line, followed by F·r·o·m, followed by ':•'

or 2) start-of-line, followed by S·u·b·j·e·c·t, followed by ':•'

or 3) start-of-line, followed by D·a·t·e, followed by ':•'

Putting it less literally, it matches lines beginning with 'From:•', 'Subject:•', or 'Date:•', which is quite useful for listing the messages in an email file.

Here's an example:

% egrep '^(From|Subject|Date): ' mailbox
From: elvis@tabloid.org (The King)
Subject: be seein' ya around
Date: Thu, 22 Aug 2002 11:04:13
From: The Prez <president@whitehouse.gov>
Date: Tue, 27 Aug 2002 8:36:24
Subject: now, about your vote···
  .
  .
  .
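For contrast, here is a sketch of how the unparenthesized version misbehaves. The mailbox lines, including the X-Note header, are made up for illustration, and `grep -E` is assumed to stand in for egrep.

```shell
mailbox='From: elvis@tabloid.org (The King)
X-Note: Subject to change
Date: Thu, 22 Aug 2002 11:04:13
Sent From my desk'

# Three separate alternatives: ^From, or Subject anywhere, or "Date: " anywhere.
printf '%s\n' "$mailbox" | grep -E '^From|Subject|Date: '

# Caret and ": " now apply to every alternative.
printf '%s\n' "$mailbox" | grep -E '^(From|Subject|Date): '
```

The first command also reports the X-Note line, because a bare ┌Subject┘ is allowed to match anywhere on the line. The second reports only the From and Date header lines.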

1.4.5 Ignoring Differences in Capitalization

This email header example provides a good opportunity to introduce the concept of a case-insensitive match. The field types in an email header usually appear with leading capitalization, such as "Subject" and "From," but the email standard actually allows mixed capitalization, so things like "DATE" and "from" are also allowed. Unfortunately, the regular expression in the previous section doesn't match those.

One approach is to replace ┌From┘ with ┌[Ff][Rr][Oo][Mm]┘ to match any form of "from," but this is quite cumbersome, to say the least. Fortunately, there is a way to tell egrep to ignore case when doing comparisons, i.e., to perform the match in a case-insensitive manner in which capitalization differences are simply ignored. It is not a part of the regular-expression language, but is a related useful feature many tools provide. egrep's command-line option "-i" tells it to do a case-insensitive match. Place -i on the command line before the regular expression:


% egrep -i '^(From|Subject|Date): ' mailbox

This brings up all the lines we matched before, but also includes lines such as:


SUBJECT: MAKE MONEY FAST

I find myself using the -i option quite frequently (perhaps related to the footnote in Section 1.7.2!) so I recommend keeping it in mind. We'll see other convenient support features like this in later chapters.
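Here is a self-contained sketch of the option's effect, assuming `grep -E` stands in for egrep. The header lines are made up for illustration.

```shell
headers='Subject: hello
SUBJECT: MAKE MONEY FAST
from: someone
Received: by host'

printf '%s\n' "$headers" | grep -E    '^(From|Subject|Date): '   # one line
printf '%s\n' "$headers" | grep -E -i '^(From|Subject|Date): '   # three lines
```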

1.4.6 Word Boundaries

A common problem is that a regular expression that matches the word you want can often also match where the "word" is embedded within a larger word. I mentioned this briefly in the cat, gray, and Smith examples. It turns out, though, that some versions of egrep offer limited support for word recognition: namely the ability to match the boundary of a word (where a word begins or ends).

You can use the (perhaps odd looking) metasequences ┌\<┘ and ┌\>┘ if your version happens to support them (not all versions of egrep do). You can think of them as word-based versions of ┌^┘ and ┌$┘ that match the position at the start and end of a word, respectively. Like the line anchors caret and dollar, they anchor other parts of the regular expression but don't actually consume any characters during a match. The expression ┌\<cat\>┘ literally means "match if we can find a start-of-word position, followed immediately by c·a·t, followed immediately by an end-of-word position." More naturally, it means "find the word cat." If you wanted, you could use ┌\<cat┘ or ┌cat\>┘ to find words starting or ending with cat, respectively.

Note that ┌<┘ and ┌>┘ alone are not metacharacters; only when combined with a backslash do the sequences become special. This is why I called them "metasequences." It's their special interpretation that's important, not the number of characters, so for the most part I use these two meta-words interchangeably.

Remember, not all versions of egrep support these word-boundary metacharacters, and those that do don't magically understand the English language. The "start of a word" is simply the position where a sequence of alphanumeric characters begins; "end of word" is where such a sequence ends. Figure 1-2 shows a sample line with these positions marked.

The word-starts (as egrep recognizes them) are marked with up arrows, the word-ends with down arrows. As you can see, "start and end of word" is better phrased as "start and end of an alphanumeric sequence," but perhaps that's too much of a mouthful.

Figure 1-2. Start and end of "word" positions
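If your tool supports these metasequences, a sketch like the following shows all three forms at work. It assumes a grep that accepts ┌\<┘ and ┌\>┘ (GNU grep does); the sample lines are made up.

```shell
lines='the cat sat
concatenate
scat
cat-like'

printf '%s\n' "$lines" | grep -E '\<cat\>'   # cat as a whole "word"
printf '%s\n' "$lines" | grep -E '\<cat'     # a word starting with cat
printf '%s\n' "$lines" | grep -E 'cat\>'     # a word ending with cat
```

Note that cat-like counts in all three cases: the hyphen is not alphanumeric, so as far as the tool is concerned, cat is a complete "word" there.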

1.4.7 In a Nutshell

Table 1-1 summarizes the metacharacters we have seen so far.

Table 1-1. Summary of Metacharacters Seen So Far

Metacharacter   Name                      Matches
.               dot                       any one character
[···]           character class           any character listed
[^···]          negated character class   any character not listed
^               caret                     the position at the start of the line
$               dollar                    the position at the end of the line
\<              backslash less-than       [6] the position at the start of a word
\>              backslash greater-than    [6] the position at the end of a word
|               or; bar                   matches either expression it separates
(···)           parentheses               used to limit scope of ┌|┘, plus additional uses yet to be discussed

[6] not supported by all versions of egrep

In addition to the table, important points to remember include:

  • The rules about which characters are and aren't metacharacters (and exactly what they mean) are different inside a character class. For example, dot is a metacharacter outside of a class, but not within one. Conversely, a dash is a metacharacter within a class (usually), but not outside. Moreover, a caret has one meaning outside, another if specified inside a class immediately after the opening [, and a third if given elsewhere in the class.

  • Don't confuse alternation with a character class. The class ┌[abc]┘ and the alternation ┌(a|b|c)┘ effectively mean the same thing, but the similarity in this example does not extend to the general case. A character class can match exactly one character, and that's true no matter how long or short the specified list of acceptable characters might be.

    Alternation, on the other hand, can have arbitrarily long alternatives, each textually unrelated to the other: ┌\<(1,000,000|million|thousand•thou)\>┘. However, alternation can't be negated like a character class.

  • A negated character class is simply a notational convenience for a normal character class that matches everything not listed. Thus, ┌[^x]┘ doesn't mean "match unless there is an x," but rather "match if there is something that is not x." The difference is subtle, but important. The first concept matches a blank line, for example, while ┌[^x]┘ does not.

  • The useful -i option discounts capitalization during a match (see Section 1.4.5). [7]

    [7] Recall from the typographical conventions (Preface) that something like "see Section 1.4.5" is a shorthand for a reference to another section of this book.
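The point about negated classes and blank lines can be checked directly. In this sketch, `sample` is a hypothetical helper that prints four made-up lines (the first is blank), and `grep -E` is assumed to stand in for egrep. The -v option, which is not discussed in this chapter, makes grep print the lines that do not match, so I use it here only to play the part of "match unless there is an x."

```shell
# Hypothetical helper: prints a blank line, then x, xyz, and abc.
sample() { printf '%s\n' '' 'x' 'xyz' 'abc'; }

sample | grep -E '[^x]'   # xyz and abc: each has some character that is not x
sample | grep -v 'x'      # the blank line and abc: neither contains an x
```

Both commands report two lines, but not the same two: the blank line has no x in it, yet it also has no non-x character for ┌[^x]┘ to match.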

What we have seen so far can be quite useful, but the real power comes from optional and counting elements, which we'll look at next.

1.4.8 Optional Items

Let's look at matching color or colour. Since they are the same except that one has a u and the other doesn't, we can use ┌colou?r┘ to match either. The metacharacter ┌?┘ (question mark) means optional. It is placed after the character that is allowed to appear at that point in the expression, but whose existence isn't actually required to still be considered a successful match.

Unlike other metacharacters we have seen so far, the question mark attaches only to the immediately-preceding item. Thus, ┌colou?r┘ is interpreted as "┌c┘ then ┌o┘ then ┌l┘ then ┌o┘ then ┌u?┘ then ┌r┘."

The ┌u?┘ part is always successful: sometimes it matches a u in the text, while other times it doesn't. The whole point of the ?-optional part is that it's successful either way. This isn't to say that any regular expression that contains ? is always successful. For example, against 'semicolon', both ┌colo┘ and ┌u?┘ are successful (matching colo and nothing, respectively). However, the final ┌r┘ fails, and that's what disallows semicolon, in the end, from being matched by ┌colou?r┘.
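A minimal sketch, assuming `grep -E` stands in for egrep; the word colouur is deliberately misspelled to show that only one u is allowed.

```shell
printf '%s\n' color colour semicolon colouur | grep -E 'colou?r'
```

Only color and colour are reported. With semicolon, the final ┌r┘ finds an n instead; with colouur, it finds a second u.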

As another example, consider matching a date that represents July fourth, with the "July" part being either July or Jul, and the "fourth" part being fourth, 4th, or simply 4. Of course, we could just use ┌(July|Jul)•(fourth|4th|4)┘, but let's explore other ways to express the same thing.

First, we can shorten the ┌(July|Jul)┘ to ┌(July?)┘. Do you see how they are effectively the same? The removal of the ┌|┘ means that the parentheses are no longer really needed. Leaving the parentheses doesn't hurt, but with them removed, ┌July?┘ is a bit less cluttered. This leaves us with ┌July?•(fourth|4th|4)┘.

Moving now to the second half, we can simplify the ┌4th|4┘ to ┌4(th)?┘. As you can see, ┌?┘ can attach to a parenthesized expression. Inside the parentheses can be as complex a subexpression as you like, but "from the outside" it is considered a single unit. Grouping for ┌?┘ (and other similar metacharacters which I'll introduce momentarily) is one of the main uses of parentheses.

Our expression now looks like ┌July?•(fourth|4(th)?)┘. Although there are a fair number of metacharacters, and even nested parentheses, it is not that difficult to decipher and understand. This discussion of two essentially simple examples has been rather long, but in the meantime we have covered tangential topics that add a lot, if perhaps only subconsciously, to our understanding of regular expressions. Also, it's given us some experience in taking different approaches toward the same goal. As we advance through this book (and through to a better understanding), you'll find many opportunities for creative juices to flow while trying to find the optimal way to solve a complex problem. Far from being some stuffy science, writing regular expressions is closer to an art.
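To check that the compact form finds the same lines as the spelled-out alternation, here is a sketch over a handful of made-up dates, assuming `grep -E` stands in for egrep.

```shell
dates='July fourth
Jul 4th
Jul 4
July 5th
June 4'

printf '%s\n' "$dates" | grep -E '(July|Jul) (fourth|4th|4)'
printf '%s\n' "$dates" | grep -E 'July? (fourth|4(th)?)'
```

Both commands report the first three lines and skip the last two.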

1.4.9 Other Quantifiers: Repetition

Similar to the question mark are ┌+┘ (plus) and ┌*┘ (an asterisk, but as a regular-expression metacharacter, I prefer the term star). The metacharacter ┌+┘ means "one or more of the immediately-preceding item," and ┌*┘ means "any number, including none, of the item." Phrased differently, ┌···*┘ means "try to match it as many times as possible, but it's okay to settle for nothing if need be." The construct with plus, ┌···+┘, is similar in that it also tries to match as many times as possible, but different in that it fails if it can't match at least once. These three metacharacters, question mark, plus, and star, are called quantifiers because they influence the quantity of what they govern.

Like ┌···?┘, the ┌···*┘ part of a regular expression always succeeds, with the only issue being what text (if any) is matched. Contrast this to ┌···+┘, which fails unless the item matches at least once.

For example, ┌•?┘ allows a single optional space, but ┌•*┘ allows any number of optional spaces. We can use this to make Section 1.4.2's ┌<H[1-6]>┘ example flexible. The HTML specification [8] says that spaces are allowed immediately before the closing >, such as with <H3•> and <H4•••>. Inserting ┌•*┘ into our regular expression where we want to allow (but not require) spaces, we get ┌<H[1-6]•*>┘. This still matches <H1>, as no spaces are required, but it also flexibly picks up the other versions.

[8] If you are not familiar with HTML, never fear. I use these as real-world examples, but I provide all the details needed to understand the points being made. Those familiar with parsing HTML tags will likely recognize important considerations I don't address at this point in the book.
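A quick sketch of the star at work on header tags, assuming `grep -E` stands in for egrep; the last sample tag is deliberately malformed.

```shell
tags='<H1>
<H3 >
<H4   >
<H 5>'

# Any number of spaces (including none) is allowed just before the closing >.
printf '%s\n' "$tags" | grep -E '<H[1-6] *>'
```

The first three tags are reported. The last is not, because its space comes before the digit, where the expression makes no allowance for one.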

Exploring further, let's search for an HTML tag such as <HR•SIZE=14>, which indicates that a line (a Horizontal Rule) 14 pixels thick should be drawn across the screen. Like the <H3> example, optional spaces are allowed before the closing angle bracket. Additionally, they are allowed on either side of the equal sign. Finally, one space is required between the HR and SIZE, although more are allowed. To allow more, we could just add ┌•*┘ to the ┌•┘ already there, but instead let's change it to ┌•+┘. The plus allows extra spaces while still requiring at least one, so it's effectively the same as ┌••*┘, but more concise. All these changes leave us with ┌<HR•+SIZE•*=•*14•*>┘.

Although flexible with respect to spaces, our expression is still inflexible with respect to the size given in the tag. Rather than find tags with only one particular size such as 14, we want to find them all. To accomplish this, we replace the ┌14┘ with an expression to find a general number. Well, in this case, a "number" is one or more digits. A digit is ┌[0-9]┘, and "one or more" adds a plus, so we end up replacing ┌14┘ by ┌[0-9]+┘. (A character class is one "unit," so can be subject directly to plus, question mark, and so on, without the need for parentheses.)

This leaves us with ┌<HR•+SIZE•*=•*[0-9]+•*>┘, which is certainly a mouthful, even with the "visible space" symbol '•' used for clarity. (Luckily, egrep has the -i case-insensitive option, see Section 1.4.5, which means I don't have to use ┌[Hh][Rr]┘ instead of ┌HR┘.) The unadorned regular expression ┌<HR +SIZE *= *[0-9]+ *>┘ likely appears even more confusing. This example looks particularly odd because the subjects of most of the stars and pluses are space characters, and our eye has always been trained to treat spaces specially. That's a habit you will have to break when reading regular expressions, because the space character is a normal character, no different from, say, j or 4. (In later chapters, we'll see that some other tools support a special mode in which whitespace is ignored, but egrep has no such mode.)
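Here is a sketch that runs the unadorned expression over a few made-up tags, with and without -i. As before, `grep -E` is assumed to stand in for egrep.

```shell
tags='<HR SIZE=14>
<HR   SIZE = 3 >
<HRSIZE=14>
<HR SIZE=>
<hr size=14>'

printf '%s\n' "$tags" | grep -E    '<HR +SIZE *= *[0-9]+ *>'   # first two tags
printf '%s\n' "$tags" | grep -E -i '<HR +SIZE *= *[0-9]+ *>'   # also the lowercase tag
```

The third tag is rejected because ┌•+┘ demands at least one space after HR, and the fourth because ┌[0-9]+┘ demands at least one digit.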

Continuing to exploit a good example, let's consider that the size attribute is optional, so you can simply use <HR> if the default size is wanted. (Extra spaces are allowed before the >, as always.) How can we modify our regular expression so that it matches either type? The key is realizing that the size part is optional (that's a hint). Work out your answer before reading on.

Take a good look at our latest expression, ┌<HR(•+SIZE•*=•*[0-9]+)?•*>┘, to appreciate the differences among the question mark, star, and plus, and what they really mean in practice. Table 1-2 summarizes their meanings.

Note that each quantifier has some minimum number of matches required to succeed, and a maximum number of matches that it will ever attempt. With some, the minimum number is zero; with some, the maximum number is unlimited.

Table 1-2. Summary of Quantifier "Repetition Metacharacters"

     Minimum Required   Maximum to Try   Meaning
?    none               1                one allowed; none required ("one optional")
*    none               no limit         unlimited allowed; none required ("any amount okay")
+    1                  no limit         unlimited allowed; one required ("at least one")

1.4.9.1 Defined range of matches: intervals

Some versions of egrep support a metasequence for providing your own minimum and maximum: ┌···{min,max}┘. This is called the interval quantifier. For example, ┌···{3,12}┘ matches up to 12 times if possible, but settles for three. One might use ┌[a-zA-Z]{1,5}┘ to match a US stock ticker (from one to five letters). Using this notation, {0,1} is the same as a question mark.

Not many versions of egrep support this notation yet, but many other tools do, so it's covered in Chapter 3 when we look in detail at the broad spectrum of metacharacters in common use today.

1.4.10 Parentheses and Backreferences

So far, we have seen two uses for parentheses: to limit the scope of alternation, figs/boxdr.jpg|figs/boxul.jpg , and to group multiple characters into larger units to which you can apply quantifiers like question mark and star. I'd like to discuss another specialized use that's not common in egrep (although GNU's popular version does support it), but which is commonly found in many other tools.

In many regular-expression flavors, parentheses can "remember" text matched by the subexpression they enclose. We'll use this in a partial solution to the doubled-word problem at the beginning of this chapter. If you knew the the specific doubled word to find (such as "the" earlier in this sentence; did you catch it?), you could search for it explicitly, such as with figs/boxdr.jpg the•the figs/boxul.jpg . In this case, you would also find items such as the•theory, but you could easily get around that problem if your egrep supports the word-boundary metasequences figs/boxdr.jpg\<···\>figs/boxul.jpg mentioned in Section 1.4.6: figs/boxdr.jpg\<the•the\>figs/boxul.jpg . We could use figs/boxdr.jpg•+figs/boxul.jpg for the space for even more flexibility.

However, having to check for every possible pair of words would be an impossible task. Wouldn't it be nice if we could match one generic word, and then say "now match the same thing again"? If your egrep supports backreferencing, you can. Backreferencing is a regular-expression feature that allows you to match new text that is the same as some text matched earlier in the expression.

We start with figs/boxdr.jpg\<the•+the\>figs/boxul.jpg and replace the initial figs/boxdr.jpgthefigs/boxul.jpg with a regular expression to match a general word, say figs/boxdr.jpg[A-Za-z]+figs/boxul.jpg . Then, for reasons that will become clear in the next paragraph, let's put parentheses around it. Finally, we replace the second 'the' by the special metasequence figs/boxdr.jpg\1figs/boxul.jpg . This yields figs/boxdr.jpg\<([A-Za-z]+)•+\1\>figs/boxul.jpg .

With tools that support backreferencing, parentheses "remember" the text that the subexpression inside them matches, and the special metasequence figs/boxdr.jpg\1figs/boxul.jpg represents that text later in the regular expression, whatever it happens to be at the time.

Of course, you can have more than one set of parentheses in a regular expression. Use figs/boxdr.jpg\1figs/boxul.jpg , figs/boxdr.jpg\2figs/boxul.jpg , figs/boxdr.jpg\3figs/boxul.jpg , etc., to refer to the first, second, third, etc. sets. Pairs of parentheses are numbered by counting opening parentheses from the left, so with figs/boxdr.jpg([a-z])([0-9])\1\2figs/boxul.jpg , the figs/boxdr.jpg\1figs/boxul.jpg refers to the text matched by figs/boxdr.jpg[a-z]figs/boxul.jpg , and figs/boxdr.jpg\2figs/boxul.jpg refers to the text matched by figs/boxdr.jpg[0-9]figs/boxul.jpg .
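The numbering can be tried out with a short sketch. It assumes a grep whose -E mode supports backreferences, as GNU grep does; the sample strings and the names numbered_groups_pattern and match_numbered_groups are hypothetical.

```shell
# \1 must repeat the letter and \2 must repeat the digit matched earlier.
numbered_groups_pattern='([a-z])([0-9])\1\2'

# Hypothetical helper: print the made-up samples that match, joined by spaces.
match_numbered_groups() {
    printf '%s\n' 'a1a1' 'a1b2' 'a1a2' 'x7x7y' |
        grep -E "$numbered_groups_pattern" | paste -s -d ' ' -
}

match_numbered_groups    # prints: a1a1 x7x7y
```

The line a1b2 fails at \1 (b is not the remembered a), and a1a2 fails at \2 (2 is not the remembered 1).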

With our 'the•the' example, figs/boxdr.jpg[A-Za-z]+figs/boxul.jpg matches the first 'the'. It is within the first set of parentheses, so the 'the' matched becomes available via figs/boxdr.jpg\1figs/boxul.jpg . If the following figs/boxdr.jpg•+figs/boxul.jpg matches, the subsequent figs/boxdr.jpg\1figs/boxul.jpg will require another 'the'. If figs/boxdr.jpg\1figs/boxul.jpg is successful, then figs/boxdr.jpg\>figs/boxul.jpg makes sure that we are now at an end-of-word boundary (which we wouldn't be were the text 'the•theft'). If successful, we've found a repeated word. It's not always the case that that is an error (such as with "that" in this sentence), but that's for you to decide once the suspect lines are shown.
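The walk-through above can be reproduced with a few made-up lines. This sketch assumes a grep whose -E mode supports both the word-boundary metasequences and backreferences, as GNU grep does; the names doubled_word_pattern and find_doubled_words are invented for the example.

```shell
# A general word, one or more spaces, then the same word again,
# with word boundaries at both ends.
doubled_word_pattern='\<([A-Za-z]+) +\1\>'

# Hypothetical helper: print the made-up sample lines that contain a doubled word.
find_doubled_words() {
    printf '%s\n' \
        'find the the error' \
        'the theft of the century' \
        'it is true that that is odd' \
        'nothing doubled here' |
        grep -E "$doubled_word_pattern"
}

find_doubled_words
```

Two lines are reported, the one with "the the" and the one with "that that". The "the theft" line is passed over because the end-of-word check fails in front of the f.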

When I decided to include this example, I actually tried it on what I had written so far. (I used a version of egrep that supports both figs/boxdr.jpg\<···\>figs/boxul.jpg and backreferencing.) To make it more useful, so that 'The•the' would also be found, I used the case-insensitive -i option mentioned in Section 1.4.6. [9]

[9] Be aware that the popular GNU version of egrep has a bug with its -i option such that it doesn't apply to backreferences. Thus, it finds "the the" but not "The the."

Here's the command I ran:


% egrep -i '\<([a-z]+) +\1\>'   files···

I was surprised to find fourteen sets of mistakenly 'doubled•doubled' words! I corrected them, and since then have built this type of regular-expression check into the tools that I use to produce the final output of this book, to ensure none creep back in.

As useful as this regular expression is, it is important to understand its limitations. Since egrep considers each line in isolation, it isn't able to find when the ending word of one line is repeated at the beginning of the next. For this, a more flexible tool is needed, and we will see some examples in the next chapter.
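The line-at-a-time limitation is easy to demonstrate. The sketch below, which again assumes a grep supporting word boundaries and backreferences in -E mode (GNU grep does), feeds the same words to the expression twice: once split across two lines and once on a single line. The helper name count_doubled_lines and the sample text are hypothetical.

```shell
# Hypothetical helper: count the lines on standard input that contain
# a doubled word.
count_doubled_lines() {
    # grep exits with status 1 when no line matches, so "|| true" keeps
    # the helper's own exit status at zero either way.
    grep -E -i -c '\<([a-z]+) +\1\>' || true
}

# The repeated "the" straddles the line break, so nothing is found.
printf '%s\n' 'one line ends with the' 'the next starts with it' |
    count_doubled_lines    # prints 0

# The same words on a single line are caught.
printf '%s\n' 'one line ends with the the next starts with it' |
    count_doubled_lines    # prints 1
```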

1.4.11 The Great Escape

One important thing I haven't mentioned yet is how to actually match a character that a regular expression would normally interpret as a metacharacter. For example, if I searched for the Internet hostname ega.att.com using figs/boxdr.jpgega.att.comfigs/boxul.jpg , it could end up matching something like megawatt•computing. Remember, figs/boxdr.jpg.figs/boxul.jpg is a metacharacter that matches any character, including a space.

The metasequence to match an actual period is a period preceded by a backslash: figs/boxdr.jpgega\.att\.comfigs/boxul.jpg . The sequence figs/boxdr.jpg\.figs/boxul.jpg is described as an escaped period or escaped dot, and you can do this with all the normal metacharacters, except in a character class. [10]

[10] Most programming languages and tools allow you to escape characters within a character class as well, but most versions of egrep do not, instead treating '\' within a class as a literal backslash to be included in the list of characters.

A backslash used in this way is called an "escape" — when a metacharacter is escaped, it loses its special meaning and becomes a literal character. If you like, you can consider the sequence to be a special metasequence to match the literal character. It's all the same.
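To make the difference concrete, here is a small sketch using grep -E (the standard spelling of egrep) on the two strings from the hostname example. The helper count_hostname_lines is a made-up name.

```shell
# Hypothetical helper: count how many of the two sample lines match
# the expression given in $1.
count_hostname_lines() {
    printf '%s\n' 'ega.att.com' 'megawatt computing' | grep -E -c "$1"
}

count_hostname_lines 'ega.att.com'      # prints 2: each dot matches any character
count_hostname_lines 'ega\.att\.com'    # prints 1: each dot must be a real period
```

Note the single quotes around each expression, which keep the shell from interpreting the backslashes before grep sees them.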

As another example, you could use figs/boxdr.jpg \([a-zA-Z]+\) figs/boxul.jpg to match a word within parentheses, such as '(very)'. The backslashes in the figs/boxdr.jpg\(figs/boxul.jpg and figs/boxdr.jpg\)figs/boxul.jpg sequences remove the special interpretation of the parentheses, leaving them as literals to match parentheses in the text.
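The same contrast holds for parentheses, as this sketch with grep -E suggests. The sample lines and the helper name count_paren_lines are invented for illustration.

```shell
# Hypothetical helper: count how many of three made-up lines match
# the expression given in $1.
count_paren_lines() {
    printf '%s\n' \
        'a (very) good idea' \
        'a very good idea' \
        'call (123) now' |
        grep -E -c "$1"
}

count_paren_lines '\([a-zA-Z]+\)'    # prints 1: literal parentheses around a word
count_paren_lines '([a-zA-Z]+)'      # prints 3: parentheses only group, so any word will do
```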

When used before a non-metacharacter, a backslash can have different meanings depending upon the version of the program. For example, we have already seen how some versions treat figs/boxdr.jpg\<figs/boxul.jpg , figs/boxdr.jpg\>figs/boxul.jpg , figs/boxdr.jpg\1figs/boxul.jpg , etc. as metasequences. We will see many more examples in later chapters.
