< Free Open Study > |
4.3 Regex-Directed Versus Text-DirectedThe two basic engine types reflect a fundamental difference in algorithms available for applying a regular expression to a string. I call the gasoline-driven NFA engine "regex-directed," and the electric-driven DFA "text-directed." 4.3.1 NFA Engine: Regex-DirectedLet's consider one way an engine might match to(nite|knight|night) against the text '···tonight···' . Starting with the t , the regular expression is examined one component at a time, and the "current text" is checked to see whether it is matched by the current component of the regex. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved. With the to(nite;knight;night) example, the first component is t , which repeatedly fails until a 't' is reached in the target string. Once that happens, the o is checked against the next character, and if it matches, control moves to the next component. In this case, the "next component" is (nite|knight|night) which really means " nite or knight or night . " Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we're matching tonight, the third alternative is the one that leads to a match. Despite their brainy origins (see Section 3.1.1)), a regex-directed engine can't come to that conclusion until actually going through the motions to check. Attempting the first alternative, nite , involves the same component-at-a-time treatment as before: " Try to match n , then i , then t , and finally e . " If this fails, as it eventually does, the engine tries another alternative, and so on until it achieves a match or must report failure. Control moves within the regex from component to component, so I call it "regex-directed." 4.3.1.1 The control benefits of an NFA engineIn essence, each subexpression of a regex in a regex-directed match is checked independently of the others. Other than backreferences, there's no interrelation among subexpressions, except for the relation implied by virtue of being thrown together to make a larger expression. The layout of the subexpressions and regex control structures (e.g., alternation, parentheses, and quantifiers) controls an engine's overall movement through a match. Since the regex directs the NFA engine, the driver (the writer of the regular expression) has considerable opportunity to craft just what he or she wants to happen. (Chapters 5 and 6 show how to put this to use to get a job done correctly and efficiently.) What this really means may seem vague now, but it will all be spelled out soon. 4.3.2 DFA Engine: Text-DirectedContrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches "currently in the works." In the tonight example, the moment the engine hits t, it adds a potential match to its list of those currently in progress:
Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes
with two possible matches in the works (and one alternative, knight, ruled out). With the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return success. I call this "text-directed" matching because each character scanned from the text controls the engine. As in the example, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a "partial match in progress" is also a full match. If the regex were to(···)? , for example, the parenthesized expression becomes optional, but it's still greedy, so it's always attempted. All the time that a partial match is in progress inside those parentheses, a full match (of 'to') is already confirmed and in reserve in case the longer matches don't pan out. If the engine reaches a character in the text that invalidates all the matches in the works, it must revert to one of the full matches in reserve. If there are none, it must declare that there are no matches at the current attempt's starting point. 4.3.3 First Thoughts: NFA and DFA in ComparisonIf you compare these two engines based only on what I've mentioned so far, you might conclude that the text-directed DFA engine is generally faster. The regexdir ected NFA engine might waste time attempting to match different subexpressions against the same text (such as the three alternatives in the example). You would be right. During the course of an NFA match, the same character of the target might be checked by many different parts of the regex (or even by the same part, over and over). Even if a subexpression can match, it might have to be applied again (and again and again) as it works in concert with the rest of the regex to find a match. A local subexpression can fail or match, but you just never know about the overall match until you eventually work your way to the end of the regex. (If I could find a way to include "It's not over until the fat lady sings." in this paragraph, I would.) On the other hand, a DFA engine is deterministic— each character in the target is checked once (at most). When a character matches, you don't know yet if it will be part of the final match (it could be part of a possible match that doesn't pan out), but since the engine keeps track of all possible matches in parallel, it needs to be checked only once, period. The two basic technologies behind regular-expression engines have the somewhat imposing names Nondeterministic Finite Automaton (NFA) and Deterministic Finite Automaton (DFA). With mouthfuls like this, you see why I stick to just "NFA" and "DFA." We won't be seeing these phrases spelled out again.[5]
4.3.3.1 Consequences to us as usersBecause of the regex-directed nature of an NFA, the details of how the engine attempts a match are very important. As I said before, the writer can exercise a fair amount of control simply by changing how the regex is written. With the tonight example, perhaps less work would have been wasted had the regex been written differently, such as in one of the following ways:
With any given text, these all end up matching exactly the same thing, but in doing so direct the engine in different ways. At this point, we don't know enough to judge which of these, if any, are better than the others, but that's coming soon. It's the exact opposite with a DFA — since the engine keeps track of all matches simultaneously, none of these differences in representation matter so long as in the end they all represent the same set of possible matches. There could be a hundr ed different ways to achieve the same result, but since the DFA keeps track of them all simultaneously (almost magically — more on this later), it doesn't matter which form the regex takes. To a pure DFA, even expressions that appear as different as abc and [aa-a](b|b{1}|b)c are utterly indistinguishable. Three things come to my mind when describing a DFA engine:
I'll eventually expand on all these points. The regex-directed nature of an NFA makes it interesting to talk about. NFAs provide plenty of room for creative juices to flow. There are great benefits in crafting an expression well, and even greater penalties for doing it poorly. A gasoline engine is not the only engine that can stall and conk out completely. To get to the bottom of this, we need to look at the essence of an NFA engine: backtracking. |
< Free Open Study > |