< Free Open Study > |
4.1 Start Your Engines!Let's see how much I can milk this engine analogy. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the sound system. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it? 4.1.1 Two Kinds of EnginesWell, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls. Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue. 4.1.2 New StandardsLet's stoke the fire by adding another variable: the California Emissions Standards.[1] Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say anything about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).
Come to think of it, I bet that an electric engine can qualify for the standard without much change—the standard just "blesses" the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed it—use the wrong kind of gas and you're in big trouble. 4.1.2.1 The impact of standardsBetter pollution standards are a good thing, but they require that the driver exercise more thought and foresight (well, at least for gas engines). Frankly, however, the standard doesn't impact most people since all the other states still do their own thing and don't follow California's standard. So, you realize that these four types of engines can be classified into three groups (the two kinds for gas, and electric in general). You know about the differences, and that in the end they all still turn the wheels. What you don't know is what the heck this has to do with regular expressions! More than you might imagine. 4.1.3 Regex Engine TypesThere are two fundamentally different types of regex engines: one called "DFA" (the electric engine of our story) and one called "NFA" (the gas engine). The details of what NFA and DFA mean follow shortly (see Section 4.3.3), but for now just consider them names, like Bill and Ted. Or electric and gas. Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include the .NET languages, Ruby, Perl, Python, GNU Emacs, ed, sed, PHP, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex . Some systems have a multi-engine hybrid system, using the most appropriate engine for the job (or even one that swaps between engines for different parts of the same regex, as needed to get the best combination of features and speed). Table 4-1 lists a few common programs and the regex engine that most versions use. If your favorite program is not in the list, the section Testing the Engine Type can help you find out which it is.
As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by clearly specifying not only which metacharacters and featur es an engine should support, as mentioned in the previous chapter, but also exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to this new standard, but the kind of results an NFA traditionally provided were different, so changes were needed. As a result, broadly speaking, there are three types of regex engines:
Here, we use "POSIX" to refer to the match semantics—the expected operation of a regex that the POSIX standard specifies (which we'll get to later in this chapter). Don't confuse this use of "POSIX" with uses surrounding regex features introduced in that same standard (see Section 3.4.2.7). Many programs support the features without supporting the full match semantics. Old (and minimally featured) programs like egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just con- firmed the status quo — no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIXcompliant. The gas engines that passed the California Emission Standards tests (POSIX NFA) were fine in that they produced results according to the standard, but the necessary changes only increased how fickle they were to proper tuning. Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be "good enough" now causes knocks and pings. But, so long as you know how to maintain your baby, the engine runs smoothly and cleanly. 4.1.4 From the Department of Redundancy DepartmentAt this point, I'll ask you to go back and review the story about engines. Every sentence there rings with some truth about regular expressions. A second reading should raise some questions. Particularly, what does it mean that an electric DFA regex engine more or less "just runs?" What affects a gas-powered NFA? How can I tune my regular expressions to run as I want on an NFA? What special concerns does an emissions-controlled POSIX NFA have? What's a "stalled engine" in the regex world? 4.1.5 Testing the Engine TypeBecause the type of engine used in a tool influences the type of features it can offer, and how those features appear to work, we can often learn the type of engine a tool has merely by checking to see how it handles a few test expressions. (After all, if you can't tell the difference, the difference doesn't matter.) At this point in the book, I wouldn't expect you to understand why the following test results indicate what they do, but I want to offer these tests now so that if your favorite tool is not listed in Table 4-1, you can investigate before continuing with this and the subsequent chapters. 4.1.5.1 Traditional NFA or not?The most commonly used engine is a Traditional NFA, and spotting it is easy. First, are lazy quantifiers (see Section 3.4.5.8) supported? If so, it's almost certainly a Traditional NFA. As we'll see, lazy quantifiers are not possible with a DFA, nor would they make any sense with a POSIX NFA. However, to make sure, simply apply the regex nfa|nfa•not to the string 'nfa•not'—if only 'nfa' matches, it's a Traditional NFA. If the entire 'nfa•not' matches, it's either a POSIX NFA or a DFA. 4.1.5.2 DFA or POSIX NFA?Differentiating between a POSIX NFA and a DFA is sometimes just as simple. Capturing parentheses and backreferences are not supported by a DFA, so that can be one hint, but there are systems that are a hybrid mix between the two engine types, and so may end up using a DFA if there are no capturing parentheses. Here's a simple test that can tell you a lot. Apply X(.+)+X to a string like '=XX======================', as with this egrep command: echo =XX========================================= | egrep 'X(.+)+X' If it takes a long time to finish, it's an NFA (and if not a Traditional NFA as per the test in the previous section, it must be a POSIX NFA). If it finishes quickly, it's either a DFA or an NFA with some advanced optimization. Does it display a warning message about a stack overflow or long match aborted? If so, it's an NFA. |
< Free Open Study > |