![]() |
< Free Open Study > |
![]() |
4.2 Match BasicsBefore looking at the differences among these engine types, let's first look at their similarities. Certain aspects of the drive train are the same (or for all practical purposes appear to be the same), so these examples can cover all engine types. 4.2.1 About the ExamplesThis chapter is primarily concerned with a generic, full-function regex engine, so some tools won't support exactly everything presented. In my examples, the dipstick might be to the left of the oil filter, while under your hood it might be behind the distributor cap. Your goal is to understand the concepts so that you can drive and maintain your favorite regex package (and ones you find interest in later). I'll continue to use Perl's notation for most of the examples, although I'll occasionally show others to remind you that the notation is superficial and that the issues under discussion transcend any one tool or flavor. To cut down on wordiness here, I'll rely on you to check Chapter 3 (see Section 3.4) if I use an unfamiliar construct. This chapter details the practical effects of how a match is carried out. It would be nice if everything could be distilled down to a few simple rules that could be memorized without needing to understand what is going on. Unfortunately, that's not the case. In fact, with all this chapter offers, I identify only two all-encompassing rules:
We'll look at these rules, their effects, and much more throughout this chapter. Let's start by diving into the details of the first rule. 4.2.2 Rule 1: The Match That Begins Earliest WinsThis rule says that any match that begins earlier (leftmost) in the string is always preferred over any plausible match that begins later. This rule doesn't say anything about how long the winning match might be (we'll get into that shortly), merely that among all the matches possible anywhere in the string, the one that begins leftmost in the string is chosen. Actually, since more than one plausible match can start at the same earliest point, perhaps the rule should read "a match..." instead of "the match...," but that sounds odd. Here's how the rule comes about: the match is first attempted at the very beginning of the string to be searched (just before the first character). "Attempted" means that every permutation of the entire (perhaps complex) regex is tested starting right at that spot. If all possibilities are exhausted and a match is not found, the complete expression is re-tried starting from just before the second character. This full retry occurs at each position in the string until a match is found. A "no match" result is reported only if no match is found after the full retry has been attempted at each position all the way to the end of the string (just after the last character). Thus, when trying to match
If you didn't know this rule, results might sometimes surprise you. For example,
when matching
The dragging belly indicates your cat is too fat the match is in indicates, not at the word cat that appears later in the line. This word cat could match, but the cat in indicates appears earlier in the string, so it is the one matched. For an application like egrep, the distinction is irrelevant because it cares only whether there is a match, not where the match might be. For other uses, such as with a search-and-replace, the distinction becomes paramount. Here's a (hopefully simple) quiz: where does
4.2.2.1 The "transmission" and the bump-alongIt might help to think of this rule as the car's transmission, connecting the engine to the drive train while adjusting for the gear you're in. The engine itself does the real work (turning the crank); the transmission transfers this work to the wheels. 4.2.2.1.1 The transmission's main work: the bump-alongIf the engine can't find a match starting at the beginning of the string, it's the transmission that bumps the regex engine along to attempt a match at the next position in the string, and the next, and the next, and so on. Usually. For instance, if a regex begins with a start-of-string anchor, the transmission can realize that any bump-along would be futile, for only the attempt at the start of the string could possibly be successful. This and other internal optimizations are discussed in Chapter 6. 4.2.3 Engine Pieces and PartsAn engine is made up of parts of various types and sizes. You can't possibly hope to truly understand how the whole thing works if you don't know much about the individual parts. In a regex, these parts are the individual units—literal characters, quantifiers (star and friends), character classes, parentheses, and so on, as described in Chapter 3 (see Section 3.4). The combination of these parts (and the engine's treatment of them) makes a regex what it is, so looking at ways they can be combined and how they interact is our primary interest. First, let's take a look at some of the individual parts:
4.2.3.1 No "electric" parentheses, backreferences, or lazy quantifiersI'd like to concentrate here on the similarities among the engines, but as foreshadowing of what's to come in this chapter, I'll point out a few interesting differences. Capturing parentheses (and the associated backreferences and $1 type functionality) are like a gas additive—they affect a gasoline (NFA) engine, but are irrelevant to an electric (DFA) engine. The same thing applies to lazy quantifiers. The way a DFA engine works completely precludes these concepts.[3] This explains why tools developed with DFAs don't provide these features. You'll notice that awk, lex, and egrep don't have backreferences or any $1 type functionality.
You might, however, notice that GNU's version of egrep does support backreferences. It does so by having two complete engines under the hood! It first uses a DFA engine to see whether a match is likely, and then uses an NFA engine (which supports the full flavor, including backreferences) to confirm the match. Later in this chapter, we'll see why a DFA engine can't deal with backreferences or capturing, and why anyone ever would want to use such an engine at all. (It has some major advantages, such as being able to match very quickly.) 4.2.4 Rule 2: The Standard Quantifiers Are GreedySo far, we have seen features that are quite straightforward. They are also rather boring—you can't do much without involving more-powerful metacharacters such as star, plus, alternation, and so on. Their added power requires more information to understand them fully. First, you need to know that the standard quantifiers ( ?, *, +, and {
min,max
}) are
greedy. When one of these governs a subexpression, such as
To be clear, the standard quantifiers settle for something less than the maximum
number of allowed matches if they have to, but they always attempt to match as
many times as they can, up to that maximum allowed. The only time they settle
for anything less than their maximum allowed is when matching too much ends
up causing some later part of the regex to fail. A simple example is using
If it turns out that the only way the rest of the regex can succeed is when the greedy construct in question matches nothing, well, that's perfectly fine, if zero matches are allowed (as with star, question, and {0, max } intervals). However, it turns out this way only if the requirements of some later subexpression force the issue. It's because the greedy quantifiers always (or, at least, try to) take more than they minimally need that they are called greedy. Greediness has many useful (but sometimes troublesome) implications. It explains,
for example, why
4.2.4.1 A subjective exampleOf course, this method of grabbing things is useful for more than just numbers.
Let's say you have a line from an email header and want to check whether it is the
subject line. As we saw in earlier chapters (see Section 2.3.4), you simply use
However, if you use
Before looking at why
So, why do we even bother adding
Once
4.2.4.2 Being too greedyLet's get back to the concept of a greedy quantifier being as greedy as it can be.
Consider how the matching and results would change if we add another
Does this mean that after
Let's consider the possibly useful
With this in mind, let's apply
Unfortunately, even though the first
4.2.4.3 First come, first servedConsider now using
4.2.4.4 Getting down to the detailsI should clear up a few things here. Phrases like " the
|
![]() |
< Free Open Study > |
![]() |