< Free Open Study > |
1.5 Expanding the FoundationI hope the examples and explanations so far have helped to establish the basis for a solid understanding of regular expressions, but please realize that what I've provided so far lacks depth. There's so much more out there. 1.5.1 Linguistic DiversificationI mentioned a number of regular expression features that most versions of egrep support. There are other features, some of which are not supported by all versions, which I'll leave for later chapters. Unfortunately, the regular expression language is no different from any other in that it has various dialects and accents. It seems each new program employing regular expressions devises its own "improvements." The state of the art continually moves forward, but changes over the years have resulted in a wide variety of regular expression "flavors." We'll see many examples in the following chapters. 1.5.2 The Goal of a Regular ExpressionFrom the broadest top-down view, a regular expression either matches within a lump of text (with egrep, each line) or it doesn't. When crafting a regular expression, you must consider the ongoing tug-of-war between having your expression match the lines you want, yet still not matching lines you don't want. Also, while egrep doesn't care where in the line the match occurs, this concern is important for many other regular-expression uses. If your text is something such as ...zip is 44272. If you write, send $4.95 to cover postage and... and you merely want to find lines matching [0-9]+ , you don't care which numbers are matched. However, if your intent is to do something with the number (such as save to a file, add, replace, and such — we will see examples of this kind of processing in the next chapter), you'll care very much exactly which numbers are matched. 1.5.3 A Few More ExamplesAs with any language, experience is a very good thing, so I'm including a few more examples of regular expressions to match some common constructs. Half the battle when writing regular expressions is getting successful matches when and where you want them. The other half is to not match when and where you don't want. In practice, both are important, but for the moment, I would like to concentrate on the "getting successful matches" aspect. Even though I don't take these examples to their fullest depths, they still provide useful insight. 1.5.3.1 Variable namesMany programming languages have identifiers (variable names and such) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. They are matched by [a-zA-Z_][a-zA-Z_0-9]* . The first character class matches what the first character can be, the second (with its accompanying star) allows the rest of the identifier. If there is a limit on the length of an identifier, say 32 characters, you might replace the star with {0,31} if the { min,max } notation is supported. (This construct, the interval quantifier, was briefly mentioned in Section 1.4.9.1.) 1.5.3.2 A string within double quotesA simple solution to matching a string within double quotes might be: "[^"]*" The double quotes at either end are to match the opening and closing double quotes of the string. Between them, we can have anything... except another double quote! So, we use [^"] to match all characters except a double quote, and apply using a star to indicate we can have any number of such non double-quote characters. A more useful (but more complex) definition of a double-quoted string allows double quotes within the string if they are escaped with a backslash, such as in "nail•the•2\"x4\"•plank". We'll see this example several times in future chapters while covering the many details of how a match is actually carried out. 1.5.3.3 Dollar amount (with optional cents)One approach to matching a dollar amount is: \$[0-9]+(\.[0-9][0-9])? From a top-level perspective, this is a simple regular expression with three parts: \$ and ···+ and (···)? , which might be loosely paraphrased as "a literal dollar sign, a bunch of one thing, and finally perhaps another thing." In this case, the "one thing" is a digit (with a bunch of them being a number), and "another thing" is the combination of a decimal point followed by two digits. This example is a bit naïve for several reasons. For example, it considers dollar amounts like $1000, but not $1,000. It does allow for optional cents, but frankly, that's not really very useful when applied with egrep. egrep never cares exactly how much is matched, but merely whether there is a match. Allowing something optional at the end never changes whether there's an overall match to begin with. But, if you need to find lines that contain just a price, and nothing else, you can wrap the expression with ^···$ . In this case, the optional cents part becomes important since it might or might not come between the dollar amount and the end of the line, and allowing or disallowing it makes the difference in achieving an overall match. One type of value our expression doesn't match is '$.49'. To solve this, you might be tempted to change the plus to a star, but that doesn't work. As to why, I'll leave it as a teaser until we look at this example again in Chapter 5 (see Section 5.2.5). 1.5.3.4 An HTTP/HTML URLThe format of web URLs can be complex, so constructing a regular expression to match any possible URL can be equally complex. However, relaxing your standards slightly can allow you to match most common URLs with a fairly simple expression. One common reason I might do this, for example, would be to search my email archive for a URL that I vaguely remember having received, but which I think I might recognize when I see it. The general form of a common HTTP/HTML URL is along the lines of http://hostname/path.html although ending with .htm is common as well. The rules about what can and can't be a hostname (computer name, such as www.yahoo.com) are complex, but for our needs we can realize that if it follows 'http://', it's probably a hostname, so we can make do with something simple, such as [-a-z0-9_.]+ . The path part can be even more varied, so we'll use [-a-z0-9_:@&?=+,.!/~*'%$]* for that. Notice that these classes have the dash first, to ensure that it's taken as a literal character and included in the list, as opposed to part of a range (see Section 1.4.2). Putting these all together, we might use as our first attempt something like: % egrep -i '\<http://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*'%$]*\.html?\>' files Again, since we've taken liberties and relaxed what we'll match, we could well match something such as 'http://..../foo.html', which is certainly not a valid URL. Do we care about this? It all depends on what you're trying to do. For my scan of my email archive, it doesn't really matter if I get a few false matches. Heck, I could probably get away with even something as simple as: % egrep -i '\<http://[^ ]*\.html?\>' files... As we'll learn when getting deeper into how to craft an expression, knowing the data you'll be searching is an important aspect of finding the balance between complexity and completeness. We'll visit this example again, in more detail, in the next chapter. 1.5.3.5 An HTML tagWith a tool like egrep, it doesn't seem particularly common or useful to simply match lines with HTML tags. But, exploring a regular expression that matches HTML tags exactly can be quite fruitful, especially when we delve into more advanced tools in the next chapter. Looking at simple cases like '<TITLE>' and '<HR>', we might think to try <.*> . This simplistic approach is a frequent first thought, but it's certainly incorrect. Converting <.*> into English reads "match a '<', followed by as much of anything as can be matched, followed by '>'." Well, when phrased that way, it shouldn't be surprising that it can match more than just one tag, such as the marked portion of 'this <I>short</I> example'. This might have been a bit surprising, but we're still in the first chapter, and our understanding at this point is only superficial. I have this example here to high-light that regular expressions are not a difficult subject, but they can be tricky if you don't truly understand them. Over the next few chapters, we'll look at all the details required to understand and solve this problem. 1.5.3.6 Time of day, such as "9:17 am" or "12:30 pm"Matching a time can be taken to varying levels of strictness. Something such as [0-9]?[0-9]:[0-9][0-9]•(am|pm) picks up both 9:17•am and 12:30•pm, but also allows something nonsensical like 99:99•pm. Looking at the hour, we realize that if it is a two-digit number, the first digit must be a one. But, 1?[0-9] still allows an hour of 19 (and also an hour of 0), so maybe it is better to break the hour part into two possibilities: 1[012] for two-digit hours and [1-9] for single-digit hours. The result is (1[012]|[1-9]) . The minute part is easier. The first digit should be [0-5] . For the second, we can stick with the current [0-9] . This gives (1[012]|[1-9]):[0-5][0-9]•(am|pm) when we put it all together. Using the same logic, can you extend this to handle 24-hour time with hours from 0 through 23? As a challenge, allow for a leading zero, at least through to 09:59.
1.5.4 Regular Expression Nomenclature1.5.4.1 RegexAs you might guess, using the full phrase "regular expression" can get a bit tiring, particularly in writing. Instead, I normally use "regex." It just rolls right off the tongue (it rhymes with "FedEx," with a hard g sound like "regular" and not a soft one like in "Regina") and it is amenable to a variety of uses like "when you regex...," "budding regexers," and even "regexification." [11] I use the phrase "regex engine" to refer to the part of a program that actually does the work of carrying out a match attempt.
1.5.4.2 MatchingWhen I say a regex "matches" a string, I really mean that it matches in a string. Technically, the regex a doesn't match cat, but matches the a in cat. It's not something that people tend to confuse, but it's still worthy of mention. 1.5.4.3 MetacharacterWhether a character is a metacharacter (or "metasequence"—I use the words inter-changeably) depends on exactly where in the regex it's used. For example, * is a metacharacter, but only when it's not within a character class and when not escaped. "Escaped" means that it has a backslash in front of it—usually. The star is escaped in \* , but not in \\* (where the first backslash escapes the second), although the star "has a backslash in front of it" in both examples. Depending upon the regex flavor, there are various situations when certain characters are and aren't metacharacters. Chapter 3 discusses this in more detail. 1.5.4.4 FlavorAs I've hinted, different tools use regular expressions for many different things, and the set of metacharacters and other features that each support can differ. Let's look at word boundaries again as an example. Some versions of egrep support the \<···\> notation we've seen. However, some do not support the separate word-start and word-end, but one catch-all \b metacharacter (which we haven't seen yet — we'll see it in the next chapter). Still others support both, and many others support neither. I use the term "flavor" to describe the sum total of all these little implementation decisions. In the language analogy, it's the same as a dialect of an individual speaker. Superficially, this concept refers to which metacharacters are and aren't supported, but there's much more to it. Even if two programs both support \<···\> , they might disagree on exactly what they do and don't consider to be a word. This concern is important when you use the tool. Don't confuse "flavor" with "tool." Just as two people can speak the same dialect, two completely different programs can support exactly the same regex flavor. Also, two programs with the same name (and built to do the same task) often have slightly (and sometimes not-so-slightly) different flavors. Among the various programs called egrep, there is a wide variety of regex flavors supported. In the late 1990s, the particularly expressive flavor offered by the Perl programming language was widely recognized for its power, and soon other languages were offering Perl-inspired regular expressions (many even acknowledging the inspirational source by labeling themselves "Perl-compatible"). The adopters include Python, many Java regex package, Microsoft's .NET Framework, Tcl, and a variety of C libraries, to name a few. Yet, all are different in important respects. On top of this, Perl's regular expressions themselves are evolving and growing (some-times, now, in response to advances seen with other tools). As always, the overall landscape continues to become more varied and confusing. 1.5.4.5 SubexpressionThe term "subexpression" simply refers to part of a larger expression, although it often refers to some part of an expression within parentheses, or to an alternative of | . For example, with ^(Subject|Date):• , the Subject|Date is usually referred to as a subexpression. Within that, the alternatives Subject and Date are each referred to as subexpressions as well. But technically, S is a subexpression, as is u , and b , and j , . . . Something such as 1-6 isn't considered a subexpression of H[1-6]•* , since the '1-6' is part of an unbreakable "unit," the character class. But, H , [1-6] , and •* are all subexpressions of H[1-6]•* . Unlike alternation, quantifiers (star, plus, and question mark) always work with the smallest immediately-preceding subexpression. This is why with mis+pell , the + governs the s , not the mis or is . Of course, when what immediately precedes a quantifier is a parenthesized subexpression, the entire subexpression (no matter how complex) is taken as one unit. 1.5.4.6 CharacterThe word "character" can be a loaded term in computing. The character that a byte represents is merely a matter of interpretation. A byte with such-and-such a value has that same value in any context in which you might wish to consider it, but which character that value represents depends on the encoding in which it's viewed. As a concrete example, two bytes with decimal values 64 and 53 represent the characters "@" and "5" respectively, if considered in the ASCII encoding, yet on the other hand are completely different if considered in the EBCDIC encoding (they are a space and some kind of a control character). On the third hand, if those two bytes are considered in one of the popular encodings for Japanese characters, together they represent the single character . Yet, to represent this same character in another of the Japanese encodings requires two completely different bytes. Those two different bytes, by the way, yield the two characters "Àm" in the popular Latin-1 encoding, but yield the one Korean character in one of the Unicode encodings. [12] The point is this: how bytes are to be interpreted is a matter of perspective (called an encoding), and to be successful, you've got to make sure that your perspective agrees with the perspective taken by the tool you're using.
Until recently, text-processing tools generally treated their data as a bunch of ASCII bytes, without regard to the encoding you might be intending. Recently, however, more and more systems are using some form of Unicode to process data internally (Chapter 3 includes an introduction to Unicode see Section 3.3.2). On such systems, if the regular-expression subsystem has been implemented properly, the user doesn't normally have to pay much attention to these issues. That's a big "if," which is why Chapter 3 looks at this issue in depth. 1.5.5 Improving on the Status QuoWhen it comes down to it, regular expressions are not difficult. But, if you talk to the average user of a program or language that supports them, you will likely find someone that understands them "a bit," but does not feel secure enough to really use them for anything complex or with any tool but those they use most often. Traditionally, regular expression documentation tends to be limited to a short and incomplete description of one or two metacharacters, followed by a table of the rest. Examples often use meaningless regular expressions like a*((ab)*|b*) , and text like 'a•xxx•ce•xxxxxx•ci•xxx•d'. They also tend to completely ignore subtle but important points, and often claim that their flavor is the same as some other well-known tool, almost always forgetting to mention the exceptions where they inevitably differ. The state of regex documentation needs help. Now, I don't mean to imply that this chapter fills the gap for all regular expressions, or even for egrep regular expressions. Rather, this chapter merely provides the foundation upon which the rest of this book is built. It may be ambitious, but I hope this book does fill the gaps for you. I received many gratifying responses to the first edition, and have worked very hard to make this one even better, both in breadth and in depth. Perhaps because regular-expression documentation has traditionally been so lacking, I feel the need to make the extra effort to make things particularly clear. Because I want to make sure you can use regular expressions to their fullest potential, I want to make sure you really, really understand them. This is both good and bad. It is good because you will learn how to think regular expressions. You will learn which differences and peculiarities to watch out for when faced with a new tool with a different flavor. You will know how to express yourself even with a weak, stripped-down regular expression flavor. You will understand what makes one expression more efficient than another, and will be able to balance tradeoffs among complexity, efficiency, and match results. When faced with a particularly complex task, you will know how to work through an expression the way the program would, constructing it as you go. In short, you will be comfortable using regular expressions to their fullest. The problem is that the learning curve of this method can be rather steep, with three separate issues to tackle:
This last point is the most important and the most difficult to address. The discussion is unfortunately sometimes a bit dry, with the reader chomping at the bit to get to the fun part — tackling real problems. However, understanding how the regex engine really works is the key to really understanding. You might argue that you don't want to be taught how a car works when you simply want to know how to drive. But, learning to drive a car is a poor analogy for learning about regular expressions. My goal is to teach you how to solve problems with regular expressions, and that means constructing regular expressions. The better analogy is not how to drive a car, but how to build one. Before you can build a car, you have to know how it works. Chapter 2 gives more experience with driving. Chapter 3 takes a short look at the history of driving, and a detailed look at the bodywork of a regex flavor. Chapter 4 looks at the all-important engine of a regex flavor. Chapter 5 shows some extended examples, Chapter 6 shows you how to tune up certain kinds of engines, and the chapters after that examine some specific makes and models. Particularly in Chapters 4, 5, and 6, we'll spend a lot of time under the hood, so make sure to have your coveralls and shop rags handy. 1.5.6 SummaryTable 1-3 summarizes the egrep metacharacters we've looked at in this chapter.
In addition, be sure that you understand the following points:
|
< Free Open Study > |