![]() |
< Free Open Study > |
![]() |
1.5 Expanding the FoundationI hope the examples and explanations so far have helped to establish the basis for a solid understanding of regular expressions, but please realize that what I've provided so far lacks depth. There's so much more out there. 1.5.1 Linguistic DiversificationI mentioned a number of regular expression features that most versions of egrep support. There are other features, some of which are not supported by all versions, which I'll leave for later chapters. Unfortunately, the regular expression language is no different from any other in that it has various dialects and accents. It seems each new program employing regular expressions devises its own "improvements." The state of the art continually moves forward, but changes over the years have resulted in a wide variety of regular expression "flavors." We'll see many examples in the following chapters. 1.5.2 The Goal of a Regular ExpressionFrom the broadest top-down view, a regular expression either matches within a lump of text (with egrep, each line) or it doesn't. When crafting a regular expression, you must consider the ongoing tug-of-war between having your expression match the lines you want, yet still not matching lines you don't want. Also, while egrep doesn't care where in the line the match occurs, this concern is important for many other regular-expression uses. If your text is something such as ...zip is 44272. If you write, send $4.95 to cover postage and... and you merely want to find lines matching
1.5.3 A Few More ExamplesAs with any language, experience is a very good thing, so I'm including a few more examples of regular expressions to match some common constructs. Half the battle when writing regular expressions is getting successful matches when and where you want them. The other half is to not match when and where you don't want. In practice, both are important, but for the moment, I would like to concentrate on the "getting successful matches" aspect. Even though I don't take these examples to their fullest depths, they still provide useful insight. 1.5.3.1 Variable namesMany programming languages have identifiers (variable names and such) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. They are matched by
1.5.3.2 A string within double quotesA simple solution to matching a string within double quotes might be:
The double quotes at either end are to match the opening and closing double quotes of the string. Between them, we can have anything... except another double quote! So, we use
A more useful (but more complex) definition of a double-quoted string allows double quotes within the string if they are escaped with a backslash, such as in "nail•the•2\"x4\"•plank". We'll see this example several times in future chapters while covering the many details of how a match is actually carried out. 1.5.3.3 Dollar amount (with optional cents)One approach to matching a dollar amount is:
From a top-level perspective, this is a simple regular expression with three parts:
This example is a bit naïve for several reasons. For example, it considers dollar amounts like $1000, but not $1,000. It does allow for optional cents, but frankly, that's not really very useful when applied with egrep. egrep never cares exactly how much is matched, but merely whether there is a match. Allowing something optional at the end never changes whether there's an overall match to begin with. But, if you need to find lines that contain just a price, and nothing else, you can wrap the expression with
One type of value our expression doesn't match is '$.49'. To solve this, you might be tempted to change the plus to a star, but that doesn't work. As to why, I'll leave it as a teaser until we look at this example again in Chapter 5 (see Section 5.2.5). 1.5.3.4 An HTTP/HTML URLThe format of web URLs can be complex, so constructing a regular expression to match any possible URL can be equally complex. However, relaxing your standards slightly can allow you to match most common URLs with a fairly simple expression. One common reason I might do this, for example, would be to search my email archive for a URL that I vaguely remember having received, but which I think I might recognize when I see it. The general form of a common HTTP/HTML URL is along the lines of http://hostname/path.html although ending with .htm is common as well. The rules about what can and can't be a hostname (computer name, such as www.yahoo.com) are complex, but for our needs we can realize that if it follows 'http://', it's probably a hostname, so we can make do with something simple, such as
Putting these all together, we might use as our first attempt something like: % egrep -i '\<http://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*'%$]*\.html?\>' files Again, since we've taken liberties and relaxed what we'll match, we could well match something such as 'http://..../foo.html', which is certainly not a valid URL. Do we care about this? It all depends on what you're trying to do. For my scan of my email archive, it doesn't really matter if I get a few false matches. Heck, I could probably get away with even something as simple as: % egrep -i '\<http://[^ ]*\.html?\>' files... As we'll learn when getting deeper into how to craft an expression, knowing the data you'll be searching is an important aspect of finding the balance between complexity and completeness. We'll visit this example again, in more detail, in the next chapter. 1.5.3.5 An HTML tagWith a tool like egrep, it doesn't seem particularly common or useful to simply match lines with HTML tags. But, exploring a regular expression that matches HTML tags exactly can be quite fruitful, especially when we delve into more advanced tools in the next chapter. Looking at simple cases like '<TITLE>' and '<HR>', we might think to try
This might have been a bit surprising, but we're still in the first chapter, and our understanding at this point is only superficial. I have this example here to high-light that regular expressions are not a difficult subject, but they can be tricky if you don't truly understand them. Over the next few chapters, we'll look at all the details required to understand and solve this problem. 1.5.3.6 Time of day, such as "9:17 am" or "12:30 pm"Matching a time can be taken to varying levels of strictness. Something such as
picks up both 9:17•am and 12:30•pm, but also allows something nonsensical like 99:99•pm. Looking at the hour, we realize that if it is a two-digit number, the first digit must be a one. But,
The minute part is easier. The first digit should be
Using the same logic, can you extend this to handle 24-hour time with hours from 0 through 23? As a challenge, allow for a leading zero, at least through to 09:59.
1.5.4 Regular Expression Nomenclature1.5.4.1 RegexAs you might guess, using the full phrase "regular expression" can get a bit tiring, particularly in writing. Instead, I normally use "regex." It just rolls right off the tongue (it rhymes with "FedEx," with a hard g sound like "regular" and not a soft one like in "Regina") and it is amenable to a variety of uses like "when you regex...," "budding regexers," and even "regexification." [11] I use the phrase "regex engine" to refer to the part of a program that actually does the work of carrying out a match attempt.
1.5.4.2 MatchingWhen I say a regex "matches" a string, I really mean that it matches in a string.
Technically, the regex
1.5.4.3 MetacharacterWhether a character is a metacharacter (or "metasequence"—I use the words inter-changeably) depends on exactly where in the regex it's used. For example,
Depending upon the regex flavor, there are various situations when certain characters are and aren't metacharacters. Chapter 3 discusses this in more detail. 1.5.4.4 FlavorAs I've hinted, different tools use regular expressions for many different things, and the set of metacharacters and other features that each support can differ. Let's look at word boundaries again as an example. Some versions of egrep support the \<···\> notation we've seen. However, some do not support the separate word-start and word-end, but one catch-all
I use the term "flavor" to describe the sum total of all these little implementation decisions. In the language analogy, it's the same as a dialect of an individual speaker. Superficially, this concept refers to which metacharacters are and aren't supported, but there's much more to it. Even if two programs both support
Don't confuse "flavor" with "tool." Just as two people can speak the same dialect, two completely different programs can support exactly the same regex flavor. Also, two programs with the same name (and built to do the same task) often have slightly (and sometimes not-so-slightly) different flavors. Among the various programs called egrep, there is a wide variety of regex flavors supported. In the late 1990s, the particularly expressive flavor offered by the Perl programming language was widely recognized for its power, and soon other languages were offering Perl-inspired regular expressions (many even acknowledging the inspirational source by labeling themselves "Perl-compatible"). The adopters include Python, many Java regex package, Microsoft's .NET Framework, Tcl, and a variety of C libraries, to name a few. Yet, all are different in important respects. On top of this, Perl's regular expressions themselves are evolving and growing (some-times, now, in response to advances seen with other tools). As always, the overall landscape continues to become more varied and confusing. 1.5.4.5 SubexpressionThe term "subexpression" simply refers to part of a larger expression, although it
often refers to some part of an expression within parentheses, or to an alternative
of
Something such as 1-6 isn't considered a subexpression of
Unlike alternation, quantifiers (star, plus, and question mark) always work with the smallest immediately-preceding subexpression. This is why with
1.5.4.6 CharacterThe word "character" can be a loaded term in computing. The character that a byte represents is merely a matter of interpretation. A byte with such-and-such a value has that same value in any context in which you might wish to consider it, but which character that value represents depends on the encoding in which it's viewed. As a concrete example, two bytes with decimal values 64 and 53 represent the characters "@" and "5" respectively, if considered in the ASCII encoding, yet on the other hand are completely different if considered in the EBCDIC encoding (they are a space and some kind of a control character). On the third hand, if those two bytes are considered in one of the popular encodings for Japanese characters, together they represent the single character
Until recently, text-processing tools generally treated their data as a bunch of ASCII bytes, without regard to the encoding you might be intending. Recently, however, more and more systems are using some form of Unicode to process data internally (Chapter 3 includes an introduction to Unicode see Section 3.3.2). On such systems, if the regular-expression subsystem has been implemented properly, the user doesn't normally have to pay much attention to these issues. That's a big "if," which is why Chapter 3 looks at this issue in depth. 1.5.5 Improving on the Status QuoWhen it comes down to it, regular expressions are not difficult. But, if you talk to the average user of a program or language that supports them, you will likely find someone that understands them "a bit," but does not feel secure enough to really use them for anything complex or with any tool but those they use most often. Traditionally, regular expression documentation tends to be limited to a short and
incomplete description of one or two metacharacters, followed by a table of the
rest. Examples often use meaningless regular expressions like
Now, I don't mean to imply that this chapter fills the gap for all regular expressions, or even for egrep regular expressions. Rather, this chapter merely provides the foundation upon which the rest of this book is built. It may be ambitious, but I hope this book does fill the gaps for you. I received many gratifying responses to the first edition, and have worked very hard to make this one even better, both in breadth and in depth. Perhaps because regular-expression documentation has traditionally been so lacking, I feel the need to make the extra effort to make things particularly clear. Because I want to make sure you can use regular expressions to their fullest potential, I want to make sure you really, really understand them. This is both good and bad. It is good because you will learn how to think regular expressions. You will learn which differences and peculiarities to watch out for when faced with a new tool with a different flavor. You will know how to express yourself even with a weak, stripped-down regular expression flavor. You will understand what makes one expression more efficient than another, and will be able to balance tradeoffs among complexity, efficiency, and match results. When faced with a particularly complex task, you will know how to work through an expression the way the program would, constructing it as you go. In short, you will be comfortable using regular expressions to their fullest. The problem is that the learning curve of this method can be rather steep, with three separate issues to tackle:
This last point is the most important and the most difficult to address. The discussion is unfortunately sometimes a bit dry, with the reader chomping at the bit to get to the fun part — tackling real problems. However, understanding how the regex engine really works is the key to really understanding. You might argue that you don't want to be taught how a car works when you simply want to know how to drive. But, learning to drive a car is a poor analogy for learning about regular expressions. My goal is to teach you how to solve problems with regular expressions, and that means constructing regular expressions. The better analogy is not how to drive a car, but how to build one. Before you can build a car, you have to know how it works. Chapter 2 gives more experience with driving. Chapter 3 takes a short look at the history of driving, and a detailed look at the bodywork of a regex flavor. Chapter 4 looks at the all-important engine of a regex flavor. Chapter 5 shows some extended examples, Chapter 6 shows you how to tune up certain kinds of engines, and the chapters after that examine some specific makes and models. Particularly in Chapters 4, 5, and 6, we'll spend a lot of time under the hood, so make sure to have your coveralls and shop rags handy. 1.5.6 SummaryTable 1-3 summarizes the egrep metacharacters we've looked at in this chapter.
In addition, be sure that you understand the following points:
|
![]() |
< Free Open Study > |
![]() |