< Free Open Study > |
1.4 Egrep MetacharactersLet's start to explore some of the egrep metacharacters that supply its regular expression power. I'll go over them quickly with a few examples, leaving the detailed examples and descriptions for later chapters. Typographical Conventions Before we begin, please make sure to review the typographical conventions explained in the preface . This book forges a bit of new ground in the area of typesetting, so some of my notations may be unfamiliar at first. 1.4.1 Start and End of the LineProbably the easiest metacharacters to understand are ^ (caret) and $ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we've seen, the regular expression cat finds c·a·t anywhere on the line, but ^cat matches only if the c·a·t is at the beginning of the linethe ^ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, cat$ finds c·a·t only at the end of the line, such as a line ending with scat. It's best to get into the habit of interpreting regular expressions in a rather literal way. For example, don't think
but rather:
They both end up meaning the same thing, but reading it the more literal way allows you to intrinsically understand a new expression when you see it. How would egrep interpret cat$ , ^$ , or even simply ^ alone? Click here to check your interpretations. The caret and dollar are special in that they match a position in the line rather than any actual text characters themselves. Of course, there are various ways to actually match real text. Besides providing literal characters like cat in your regular expression, you can also use some of the items discussed in the next few sections. 1.4.2 Character Classes1.4.2.1 Matching any one of several charactersLet's say you want to search for "grey," but also want to find it if it were spelled "gray." The regular-expression construct [···] , usually called a character class, lets you list the characters you want to allow at that point in the match. While e matches just an e, and a matches just an a, the regular expression [ea] matches either. So, then, consider gr[ea]y : this means to find " g, followed by r, followed by either an e or an a, all followed by y ." Because I'm a really poor speller, I'm always using regular expressions like this against a huge list of English words to figure out proper spellings. One I use often is sep[ea]r[ea]te , because I can never remember whether the word is spelled "seperate," "separate," "separete," or what. The one that pops up in the list is the proper spelling; regular expressions to the rescue. Notice how outside of a class, literal characters (like the g and r of gr[ae]y ) have an implied "and then" between them "match g and then match r . . ." It's completely opposite inside a character class. The contents of a class is a list of characters that can match at that point, so the implication is "or." As another example, maybe you want to allow capitalization of a word's first letter, such as with [Ss]mith . Remember that this still matches lines that contain smith (or Smith) embedded within another word, such as with blacksmith. I don't want to harp on this throughout the overview, but this issue does seem to be the source of problems among some new users. I'll touch on some ways to handle this embedded-word problem after we examine a few more metacharacters. You can list in the class as many characters as you like. For example, [123456] matches any of the listed digits. This particular class might be useful as part of <H[123456]> , which matches <H1>, <H2>, <H3>, etc. This can be useful when searching for HTML headers. Within a character class, the character-class metacharacter '-' (dash) indicates a range of characters: <H[1-6]> is identical to the previous example. [0-9] and [a-z] are common shorthands for classes to match digits and English lowercase letters, respectively. Multiple ranges are fine, so [0123456789abcdefABCDEF] can be written as [0-9a-fA-F] (or, perhaps, [A-Fa-f0-9] , since the order in which ranges are given doesn't matter). These last three examples can be useful when processing hexadecimal numbers. You can freely combine ranges with literal characters: [0-9A-Z_!.?] matches a digit, uppercase letter, underscore, exclamation point, period, or a question mark. Note that a dash is a metacharacter only within a character class otherwise it matches the normal dash character. In fact, it is not even always a metacharacter within a character class. If it is the first character listed in the class, it can't possibly indicate a range, so it is not considered a metacharacter. Along the same lines, the question mark and period at the end of the class are usually regular-expression metacharacters, but only when not within a class (so, to be clear, the only special characters within the class in [0-9A-Z_!.?] are the two dashes).
We'll see more examples of this shortly. 1.4.2.2 Negated character classesIf you use [^···] instead of [···] , the class matches any character that isn't listed. For example, [^1-6] matches a character that's not 1 through 6. The leading ^ in the class "negates" the list, so rather than listing the characters you want to include in the class, you list the characters you don't want to be included. You might have noticed that the ^ used here is the same as the start-of-line caret introduced in Section 1.4.1. The character is the same, but the meaning is completely different. Just as the English word "wind" can mean different things depending on the context (sometimes a strong breeze, sometimes what you do to a clock), so can a metacharacter. We've already seen one example, the range-building dash. It is valid only inside a character class (and at that, only when not first inside the class). ^ is a line anchor outside a class, but a class metacharacter inside a class (but, only when it is immediately after the class's opening bracket; otherwise, it's not special inside a class). Don't fear these are the most complex special cases; others we'll see later aren't so bad. As another example, let's search that list of English words for odd words that have q followed by something other than u. Translating that into a regular expression, it becomes q[^u] . I tried it on the list I have, and there certainly weren't many. I did find a few, including a number of words that I didn't even know were English. Here's what happened. (What I typed is in bold.)
% egrep 'q[^u]' word.list
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum%
Two notable words not listed are "Qantas", the Australian airline, and "Iraq". Although both words are in the word.list file, neither were displayed by my egrep command. Why? Think about it for a bit, and then click here to check your reasoning. Remember, a negated character class means "match a character that's not listed" and not "don't match what is listed." These might seem the same, but the Iraq example shows the subtle difference. A convenient way to view a negated class is that it is simply a shorthand for a normal class that includes all possible characters except those that are listed. 1.4.3 Matching Any Character with DotThe metacharacter . (usually called dot or point) is a shorthand for a character class that matches any character. It can be convenient when you want to have an "any character here" placeholder in your expression. For example, if you want to search for a date such as 03/19/76, 03-19-76, or even 03.19.76, you could go to the trouble to construct a regular expression that uses character classes to explicitly allow '/', '-', or '.' between each number, such as 03[-./]19[-./]76 . However, you might also try simply using 03.19.76 . Quite a few things are going on with this example that might be unclear at first. In 03[-./]19[-./]76 , the dots are not metacharacters because they are within a character class. (Remember, the list of metacharacters and their meanings are different inside and outside of character classes.) The dashes are also not class metacharacters in this case because each is the first thing after [ or [^. Had they not been first, as with [.-/] , it they would be the class range metacharacter, which would be a mistake in this situation. With 03.19.76 , the dots are metacharacters ones that match any character (including the dash, period, and slash that we are expecting). However, it is important to know that each dot can match any character at all, so it can match, say, 'lottery numbers: 19 203319 7639'. So, 03[-./]19[-./]76 is more precise, but it's more difficult to read and write. 03.19.76 is easy to understand, but vague. Which should we use? It all depends upon what you know about the data being searched, and just how specific you feel you need to be. One important, recurring issue has to do with balancing your knowledge of the text being searched against the need to always be exact when writing an expression. For example, if you know that with your data it would be highly unlikely for 03.19.76 to match in an unwanted place, it would certainly be reasonable to use it. Knowing the target text well is an important part of wielding regular expressions effectively. 1.4.4 Alternation1.4.4.1 Matching any one of several subexpressionsA very convenient metacharacter is | , which means "or." It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, Bob and Robert are separate expressions, but Bob|Robert is one expression that matches either. When combined this way, the subexpressions are called alternatives. Looking back to our gr[ea]y example, it is interesting to realize that it can be written as grey|gray , and even gr(a|e)y . The latter case uses parentheses to constrain the alternation. (For the record, parentheses are metacharacters too.) Note that something like gr[a|e]y is not what we want within a class, the '|' character is just a normal character, like a and e . With gr(a|e)y , the parentheses are required because without them, gra|ey means " gra or ey ," which is not what we want here. Alternation reaches far, but not beyond parentheses. Another example is (First|1st)•[Ss]treet . [5] Actually, since both First and 1st end with st , the combination can be shortened to (Fir|1)st•[Ss]treet . That's not necessarily quite as easy to read, but be sure to understand that (first|1st) and (fir|1)st effectively mean the same thing.
Here's an example involving an alternate spelling of my name. Compare and contrast the following three expressions, which are all effectively the same: Jeffrey|Jeffery Jeff(rey|ery) Jeff(re|er)y To have them match the British spellings as well, they could be: (Geoff|Jeff)(rey|ery) (Geo|Je)ff(rey|ery) (Geo|Je)ff(re|er)y Finally, note that these three match effectively the same as the longer (but simpler) Jeffrey|Geoffery|Jeffery|Geoffrey . They're all different ways to specify the same desired matches. Although the gr[ea]y versus gr(a|e)y examples might blur the distinction, be careful not to confuse the concept of alternation with that of a character class. A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the "main" regular expression language. You'll find both to be extremely useful. Also, take care when using caret or dollar in an expression that has alternation. Compare ^From|Subject|Date:• with ^(From|Subject|Date):• . Both appear similar to our earlier email example, but what each matches (and therefore how useful it is) differs greatly. The first is composed of three alternatives, so it matches " ^From or Subject or Date:• ," which is not particularly useful. We want the leading caret and trailing :• to apply to each alternative. We can accomplish this by using parentheses to "constrain" the alternation: ^(From|Subject|Date):• The alternation is constrained by the parentheses, so literally, this regex means "match the start of the line, then one of From , Subject , or Date , and then match :• ." Effectively, it matches: 1) start-of-line, followed by F·r·o·m, followed by ':•' or 2) start-of-line, followed by S·u·b·j·e·c·t, followed by ':•' or 3) start-of-line, followed by D·a·t·e, followed by ':•' Putting it less literally, it matches lines beginning with 'From:•', 'Subject:•', or 'Date:•', which is quite useful for listing the messages in an email file. Here's an example: % egrep '^(From|Subject|Date): ' mailbox
From: elvis@tabloid.org (The King)
Subject: be seein' ya around
Date: Thu, 22 Aug 2002 11:04:13
From: The Prez <president@whitehouse.gov>
Date: Tue, 27 Aug 2002 8:36:24
Subject: now, about your vote···
.
.
.
1.4.5 Ignoring Differences in CapitalizationThis email header example provides a good opportunity to introduce the concept of a case-insensitive match. The field types in an email header usually appear with leading capitalization, such as "Subject" and "From," but the email standard actually allows mixed capitalization, so things like "DATE" and "from" are also allowed. Unfortunately, the regular expression in the previous section doesn't match those. One approach is to replace From with [Ff][Rr][Oo][Mm] to match any form of "from," but this is quite cumbersome, to say the least. Fortunately, there is a way to tell egrep to ignore case when doing comparisons, i.e., to perform the match in a case insensitive manner in which capitalization differences are simply ignored. It is not a part of the regular-expression language, but is a related useful feature many tools provide. egrep's command-line option "-i" tells it to do a case-insensitive match. Place -i on the command line before the regular expression:
% egrep -i '^(From|Subject|Date): ' mailbox
This brings up all the lines we matched before, but also includes lines such as: SUBJECT: MAKE MONEY FAST I find myself using the -i option quite frequently (perhaps related to the footnote in Section 1.7.2!) so I recommend keeping it in mind. We'll see other convenient support features like this in later chapters. 1.4.6 Word BoundariesA common problem is that a regular expression that matches the word you want can often also match where the "word" is embedded within a larger word. I mentioned this briefly in the cat, gray, and Smith examples. It turns out, though, that some versions of egrep offer limited support for word recognition: namely the ability to match the boundary of a word (where a word begins or ends). You can use the (perhaps odd looking) metasequences \< and \> if your version happens to support them (not all versions of egrep do). You can think of them as word-based versions of ^ and $ that match the position at the start and end of a word, respectively. Like the line anchors caret and dollar, they anchor other parts of the regular expression but don't actually consume any characters during a match. The expression \<cat\> literally means " match if we can find a start-of-word position, followed immediately by c·a·t, followed immediately by an end-of-word position ." More naturally, it means "find the word cat." If you wanted, you could use \<cat or cat\> to find words starting and ending with cat. Note that < and > alone are not metacharacters when combined with a backslash, the sequences become special. This is why I called them "metasequences." It's their special interpretation that's important, not the number of characters, so for the most part I use these two meta-words interchangeably. Remember, not all versions of egrep support these word-boundary metacharacters, and those that do don't magically understand the English language. The "start of a word" is simply the position where a sequence of alphanumeric characters begins; "end of word" is where such a sequence ends. Figure 1-2 shows a sample line with these positions marked. The word-starts (as egrep recognizes them) are marked with up arrows, the wordends with down arrows. As you can see, "start and end of word" is better phrased as "start and end of an alphanumeric sequence," but perhaps that's too much of a mouthful. Figure 2. Start and end of "word" positions1.4.7 In a NutshellTable 1-1 summarizes the metacharacters we have seen so far.
In addition to the table, important points to remember include:
What we have seen so far can be quite useful, but the real power comes from optional and counting elements, which we'll look at next. 1.4.8 Optional ItemsLet's look at matching color or colour. Since they are the same except that one has a u and the other doesn't, we can use colou?r to match either. The metacharacter ? (question mark) means optional. It is placed after the character that is allowed to appear at that point in the expression, but whose existence isn't actually required to still be considered a successful match. Unlike other metacharacters we have seen so far, the question mark attaches only to the immediately-preceding item. Thus, colou?r is interpreted as " c then o then l then o then u? then r . " The u? part is always successful: sometimes it matches a u in the text, while other times it doesn't. The whole point of the ?-optional part is that it's successful either way. This isn't to say that any regular expression that contains ? is always successful. For example, against 'semicolon', both colo and u? are successful (matching colo and nothing, respectively). However, the final r fails, and that's what disallows semicolon, in the end, from being matched by colou?r . As another example, consider matching a date that represents July fourth, with the "July" part being either July or Jul, and the "fourth" part being fourth, 4th, or simply 4. Of course, we could just use (July|Jul)•(fourth|4th|4) , but let's explore other ways to express the same thing. First, we can shorten the (July|Jul) to (July?) . Do you see how they are effectively the same? The removal of the | means that the parentheses are no longer really needed. Leaving the parentheses doesn't hurt, but with them removed, July? is a bit less cluttered. This leaves us with July?•(fourth|4th|4) . Moving now to the second half, we can simplify the 4th|4 to 4(th)? . As you can see, ? can attach to a parenthesized expression. Inside the parentheses can be as complex a subexpression as you like, but "from the outside" it is considered a single unit. Grouping for ? (and other similar metacharacters which I'll introduce momentarily) is one of the main uses of parentheses. Our expression now looks like July?•(fourth|4(th)?) . Although there are a fair number of metacharacters, and even nested parentheses, it is not that difficult to decipher and understand. This discussion of two essentially simple examples has been rather long, but in the meantime we have covered tangential topics that add a lot, if perhaps only subconsciously, to our understanding of regular expressions. Also, it's given us some experience in taking different approaches toward the same goal. As we advance through this book (and through to a better understanding), you'll find many opportunities for creative juices to flow while trying to find the optimal way to solve a complex problem. Far from being some stuffy science, writing regular expressions is closer to an art. 1.4.9 Other Quantifiers: RepetitionSimilar to the question mark are + (plus) and * (an asterisk, but as a regularexpr ession metacharacter, I prefer the term star). The metacharacter + means "one or more of the immediately-preceding item," and * means "any number, including none, of the item." Phrased differently, ···* means "try to match it as many times as possible, but it's okay to settle for nothing if need be." The construct with plus, ···+ , is similar in that it also tries to match as many times as possible, but different in that it fails if it can't match at least once. These three metacharacters, question mark, plus, and star, are called quantifiers because they influence the quantity of what they govern. Like ···? , the ···* part of a regular expression always succeeds, with the only issue being what text (if any) is matched. Contrast this to ···+ , which fails unless the item matches at least once. For example, •? allows a single optional space, but •* allows any number of optional spaces. We can use this to make Section 1.4.2's <H[1-6]> example flexible. The HTML specification [8] says that spaces are allowed immediately before the closing >, such as with <H3•> and <H4•••>. Inserting •* into our regular expression where we want to allow (but not require) spaces, we get <H[1-6]•*>. This still matches <H1>, as no spaces are required, but it also flexibly picks up the other versions.
Exploring further, let's search for an HTML tag such as <HR•SIZE=14>, which indicates that a line (a Horizontal Rule) 14 pixels thick should be drawn across the screen. Like the <H3> example, optional spaces are allowed before the closing angle bracket. Additionally, they are allowed on either side of the equal sign. Finally, one space is required between the HR and SIZE, although more are allowed. To allow more, we could just add •* to the • already there, but instead let's change it to •+ . The plus allows extra spaces while still requiring at least one, so it's effectively the same as ••* , but more concise. All these changes leave us with <HR•+SIZE•*=•*14•*> . Although flexible with respect to spaces, our expression is still inflexible with respect to the size given in the tag. Rather than find tags with only one particular size such as 14, we want to find them all. To accomplish this, we replace the 14 with an expression to find a general number. Well, in this case, a "number" is one or more digits. A digit is [0-9] , and "one or more" adds a plus, so we end up replacing 14 by [0-9]+ . (A character class is one "unit," so can be subject directly to plus, question mark, and so on, without the need for parentheses.) This leaves us with <HR•+SIZE•*=•*[0-9]+•*> , which is certainly a mouthful even though I've presented it with the metacharacters bold, added a bit of spacing to make the groupings more apparent, and am using the "visible space" symbol '•' for clarity. (Luckily, egrep has the -i case-insensitive option, see Section 1.4.6, which means I don't have to use [Hh][Rr] instead of HR .) The unadorned regular expression <HR +SIZE *= *[0-9]+ *> likely appears even more confusing. This example looks particularly odd because the subjects of most of the stars and pluses are space characters, and our eye has always been trained to treat spaces specially. That's a habit you will have to break when reading regular expressions, because the space character is a normal character, no different from, say, j or 4. (In later chapters, we'll see that some other tools support a special mode in which whitespace is ignored, but egrep has no such mode.) Continuing to exploit a good example, let's consider that the size attribute is optional, so you can simply use <HR> if the default size is wanted. (Extra spaces are allowed before the >, as always.) How can we modify our regular expression so that it matches either type? The key is realizing that the size part is optional (that's a hint). Click here to check your answer. Take a good look at our latest expression (in the answer box) to appreciate the differences among the question mark, star, and plus, and what they really mean in practice. Table 1-2 summarizes their meanings. Note that each quantifier has some minimum number of matches required to succeed, and a maximum number of matches that it will ever attempt. With some, the minimum number is zero; with some, the maximum number is unlimited.
1.4.9.1 Defined range of matches: intervalsSome versions of egrep support a metasequence for providing your own minimum and maximum: ···{ min,max } . This is called the interval quantifier. For example, ···{3,12} matches up to 12 times if possible, but settles for three. One might use [a-zA-Z]{1,5} to match a US stock ticker (from one to five letters). Using this notation, {0,1} is the same as a question mark. Not many versions of egrep support this notation yet, but many other tools do, so it's covered in Chapter 3 when we look in detail at the broad spectrum of metacharacters in common use today. 1.4.10 Parentheses and BackreferencesSo far, we have seen two uses for parentheses: to limit the scope of alternation, | , and to group multiple characters into larger units to which you can apply quanti- fiers like question mark and star. I'd like to discuss another specialized use that's not common in egrep (although GNU's popular version does support it), but which is commonly found in many other tools. In many regular-expression flavors, parentheses can "remember" text matched by the subexpression they enclose. We'll use this in a partial solution to the doubledword problem at the beginning of this chapter. If you knew the the specific doubled word to find (such as "the" earlier in this sentence did you catch it?), you could search for it explicitly, such as with the•the . In this case, you would also find items such as the•theory, but you could easily get around that problem if your egrep supports the word-boundary metasequences \<···\> mentioned in Section 1.4.6: \<the•the\> . We could use •+ for the space for even more flexibility. However, having to check for every possible pair of words would be an impossible task. Wouldn't it be nice if we could match one generic word, and then say "now match the same thing again"? If your egrep supports backreferencing, you can. Backreferencing is a regular-expression feature that allows you to match new text that is the same as some text matched earlier in the expression. We start with \<the•+the\> and replace the initial the with a regular expression to match a general word, say [A-Za-z]+ . Then, for reasons that will become clear in the next paragraph, let's put parentheses around it. Finally, we replace the second 'the' by the special metasequence \1 . This yields \<([A-Za-z]+)•+\1\> . With tools that support backreferencing, parentheses "remember" the text that the subexpression inside them matches, and the special metasequence \1 represents that text later in the regular expression, whatever it happens to be at the time. Of course, you can have more than one set of parentheses in a regular expression. Use \1 , \2 , \3 , etc., to refer to the first, second, third, etc. sets. Pairs of parentheses are numbered by counting opening parentheses from the left, so with ([a-z])([0-9])\1\2 , the \1 refers to the text matched by [a-z] , and \2 refers to the text matched by [0-9] . With our 'the•the' example, [A-Za-z]+ matches the first 'the'. It is within the first set of parentheses, so the 'the' matched becomes available via \1 . If the following •+ matches, the subsequent \1 will require another 'the'. If \1 is successful, then \> makes sure that we are now at an end-of-word boundary (which we wouldn't be were the text 'the•theft'). If successful, we've found a repeated word. It's not always the case that that is an error (such as with "that" in this sentence), but that's for you to decide once the suspect lines are shown. When I decided to include this example, I actually tried it on what I had written so far. (I used a version of egrep that supports both \<···\> and backreferencing.) To make it more useful, so that 'The•the' would also be found, I used the case-insensitive -i option mentioned in Section 1.4.6. [9]
Here's the command I ran:
% egrep -i '\<([a-z]+) +\1\>' files···
I was surprised to find fourteen sets of mistakenly 'doubled•doubled' words! I corrected them, and since then have built this type of regular-expression check into the tools that I use to produce the final output of this book, to ensure none creep back in. As useful as this regular expression is, it is important to understand its limitations. Since egrep considers each line in isolation, it isn't able to find when the ending word of one line is repeated at the beginning of the next. For this, a more flexible tool is needed, and we will see some examples in the next chapter. 1.4.11 The Great EscapeOne important thing I haven't mentioned yet is how to actually match a character that a regular expression would normally interpret as a metacharacter. For example, if I searched for the Internet hostname ega.att.com using ega.att.com , it could end up matching something like megawatt•computing. Remember, . is a metacharacter that matches any character, including a space. The metasequence to match an actual period is a period preceded by a backslash: ega\.att\.com . The sequence \. is described as an escaped period or escaped dot, and you can do this with all the normal metacharacters, except in a characterclass. [10]
A backslash used in this way is called an "escape" when a metacharacter is escaped, it loses its special meaning and becomes a literal character. If you like, you can consider the sequence to be a special metasequence to match the literal character. It's all the same. As another example, you could use \([a-zA-Z]+\) to match a word within parentheses, such as '(very)'. The backslashes in the \( and \) sequences remove the special interpretation of the parentheses, leaving them as literals to match parentheses in the text. When used before a non-metacharacter, a backslash can have different meanings depending upon the version of the program. For example, we have already seen how some versions treat \< , \> , \1 , etc. as metasequences. We will see many more examples in later chapters. |
< Free Open Study > |