3.1 A Casual Stroll Across the Regex Landscape

I'd like to start with the story about the evolution of some regular expression flavors and their associated programs. So, grab a hot cup (or frosty mug) of your favorite brewed beverage and relax as we look at the sometimes wacky history behind the regular expressions we have today. The idea is to add color to our regex understanding, and to develop a feeling as to why "the way things are" are the way things are. There are some footnotes for those that are interested, but for the most part, this should be read as a light story for enjoyment.

3.1.1 The Origins of Regular Expressions

The seeds of regular expressions were planted in the early 1940s by two neurophysiologists, Warren McCulloch and Walter Pitts, who developed models of how they believed the nervous system worked at the neuron level.^[1] Regular expressions became a reality several years later when mathematician Stephen Kleene formally described these models in an algebra he called regular sets. He devised a simple notation to express these regular sets, and called them regular expressions.

^[1] "A logical calculus of the ideas imminent in nervous activity," first published in Bulletin of Math. Biophysics 5 (1943) and later reprinted in Embodiments of Mind (MIT Press, 1965). The article begins with an interesting summary of how neurons behave (did you know that intra-neuron impulse speeds can range from 1 all the way to 150 meters per second?), and then descends into a pit of formulae that is, literally, all Greek to me.

Through the 1950s and 1960s, regular expressions enjoyed a rich study in theoretical mathematics circles. Robert Constable has written a good summary^[2] for the mathematically inclined.

^[2] Robert L. Constable, "The Role of Finite Automata in the Development of Modern Computing Theory," in The Kleene Symposium, Eds. Barwise, Keisler, and Kunen (North-Holland Publishing Company, 1980), 61-83.

Although there is evidence of earlier work, the first published computational use of regular expressions I have actually been able to find is Ken Thompson's 1968 article Regular Expression Search Algorithm ^[3] in which he describes a regularexpression compiler that produced IBM 7094 object code. This led to his work on qed, an editor that formed the basis for the Unix editor ed.

^[3] Communications of the ACM, Vol.11, No. 6, June 1968.

ed 's regular expressions were not as advanced as those in qed, but they were the first to gain widespread use in non-technical fields. ed had a command to display lines of the edited file that matched a given regular expression. The command, " g/ Regular Expression /p ", was read "Global Regular Expression Print." This particular function was so useful that it was made into its own utility, grep (after which egrep —extended grep —was later modeled).

3.1.1.1 Grep's metacharacters

The regular expressions supported by grep and other early tools were quite limited when compared to egrep's. The metacharacter * was supported, but + and ? were not (the latter's absence being a particularly strong drawback). grep's capturing metacharacters were $···$, with unescaped parentheses representing literal text.^[4] grep supported line anchors, but in a limited way. If ^ appeared at the beginning of the regex, it was a metacharacter matching the beginning of the line. Otherwise, it wasn't a metacharacter at all and just matched a literal circumflex (also called a "caret"). Similarly, $ was the end-of-line metacharacter only at the end of the regex. The upshot was that you couldn't do something like end$|^start. But that's okay, since alternation wasn't supported either!

^[4] Historical trivia: ed (and hence grep) used escaped parentheses rather than unadorned parentheses as delimiters because Ken Thompson felt regular expressions would be used to work primarily with C code, where needing to match raw parentheses would be more common than backreferencing.

The way metacharacters interact is also important. For example, perhaps grep's largest shortcoming was that star could not be applied to a parenthesized expression, but only to a literal character, a character class, or dot. So, in grep, parentheses were useful only for capturing matched text, and not for general grouping. In fact, some early versions of grep didn't even allow nested parentheses.

3.1.1.2 Grep evolves

Although many systems have grep today, you'll note that I've been using past tense. The past tense refers to the flavor of the old versions, now upwards of 30 years old. Over time, as technology advances, older programs are sometimes retrofitted with additional features, and grep has been no exception.

Along the way, AT&T Bell Labs added some new features, such as incorporating the \{min,max\} notation from the program lex. They also fixed the -y option, which in early versions was supposed to allow case-insensitive matches but worked only sporadically. Around the same time, people at Berkeley added startand end-of-word metacharacters and renamed -y to -i. Unfortunately, you still couldn't apply star or the other quantifiers to a parenthesized expression.

3.1.1.3 Egrep evolves

By this time, Alfred Aho (also at AT&T Bell Labs) had written egrep, which provided most of the richer set of metacharacters described in Chapter 1. More importantly, he implemented them in a completely different (and generally better) way. Not only were + and ? added, but they could be applied to parenthesized expressions, greatly increasing egrep expressive power.

Alternation was added as well, and the line anchors were upgraded to "first-class" status so that you could use them almost anywhere in your regex. However, egrep had problems as well—sometimes it would find a match but not display the result, and it didn't have some useful features that are now popular. Nevertheless, it was a vastly more useful tool.

3.1.1.4 Other species evolve

At the same time, other programs such as awk, lex, and sed, were growing and changing at their own pace. Often, developers who liked a feature from one program tried to add it to another. Sometimes, the result wasn't pretty. For example, if support for plus was added to grep, + by itself couldn't be used because grep had a long history of a raw '+' not being a metacharacter, and suddenly making it one would have surprised users. Since '\+' was probably not something a grep user would have otherwise normally typed, it could safely be subsumed as the "one or more" metacharacter.

Sometimes new bugs were introduced as features were added. Other times, added features were later removed. There was little to no documentation for the many subtle points that round out a tool's flavor, so new tools either made up their own style, or attempted to mimic "what seemed to work" with other tools.

Multiply that by the passage of time and numerous programmers, and the result is general confusion (particularly when you try to deal with everything at once).^[5]

^[5] Such as when writing a book about regular expressions—ask me, I know!

3.1.1.5 POSIX—An attempt at standardization

POSIX, short for Portable Operating System Interface, is a wide-ranging standard put forth in 1986 to ensure portability across operating systems. Several parts of this standard deal with regular expressions and the traditional tools that use them, so it's of some interest to us. None of the flavors covered in this book, however, strictly adhere to all the relevant parts. In an effort to reorganize the mess that regular expressions had become, POSIX distills the various common flavors into just two classes of regex flavor, Basic Regular Expressions (BREs), and Extended Regular Expressions (EREs). POSIX programs then support one flavor or the other. Table 3-1below summarizes the metacharacters in the two flavors.

One important feature of the POSIX standard is the notion of a locale, a collection of settings that describe language and cultural conventions for such things as the display of dates, times, and monetary values, the interpretation of characters in the active encoding, and so on. Locales aim to allow programs to be internationalized. Thy are not regex-specific concept, although they can affect regular-expression use. For example, when working with a locale that describes the Latin-1 (ISO-8859-1) encoding, à and À (characters with ordinal values 224 and 160, respectively) are considered "letters," and any application of a regex that ignores capitalization would know to treat them as identical.

Table 1. Overview of POSIX Regex Flavors
Regex feature BREs EREs
dot, ^, $, [···], [^···]
"any number" quantifier * *
+ and ? quantifiers + ?
range quantifier \{min,max \} {min,max}
grouping $···$ (···)
can apply quantifiers to parentheses
backreferences \1 through \9
alternation

Another example is \w, commonly provided as a shorthand for a "word-constituent character" (ostensibly, the same as [a-zA-Z0-9_] in many flavors). This feature is not required by POSIX, but it is allowed. If supported, \w would know to allow all letters and digits defined in the locale, not just those in ASCII.

Note, however, that the need for this aspect of locales is mostly alleviated when working with tools that support Unicode. Unicode is discussed further beginning in Section 3.3.2.2.

3.1.1.6 Henry Spencer's regex package

Also first appearing in 1986, and perhaps of more importance, was the release by Henry Spencer of a regex package, written in C, which could be freely incorporate by others into their own programs — a first at the time. Every program that used Henry's package—and there were many—provided the same consistent regex flavor unless the program's author went to the explicit trouble to change it.

3.1.1.7 Perl evolves

At about the same time, Larry Wall started developing a tool that would later become the language Perl. He had already greatly enhanced distributed software development with his patch program, but Perl was destined to have a truly monumental impact.

Larry released Perl Version 1 in December 1987. Perl was an immediate hit because it blended so many useful features of other languages, and combined them with the explicit goal of being, in a day-to-day practical sense, useful.

One immediately notable feature was a set of regular expression operators in the tradition of the specialty tools sed and awk — a first for a general scripting language. For the regular expression engine, Larry borrowed code from an earlier project, his news reader rn (which based its regular expression code on that in James Gosling's Emacs).^[6] The regex flavor was considered powerful by the day's standards, but was not nearly as full-featured as it is today. Its major drawbacks were that it supported at most nine sets of parentheses, and at most nine alternatives with |, and worst of all, | was not allowed within parentheses. It did not support case-insensitive matching, nor allow \w within a class (it didn't support \s or \d anywhere). It didn't support the {min,max} range quantifier.

^[6] James Gosling would later go on to develop his own language, Java, which somewhat ironically does not natively support regular expressions. Java 1.4 however, does include a wonderful regular expression package, covered in depth in Chapter 8.

Perl 2 was released in June 1988. Larry had replaced the regex code entirely, this time using a greatly enhanced version of the Henry Spencer package mentioned in the previous section. You could still have at most nine sets of parentheses, but now you could use | inside them. Support for \d and \s was added, and support for \w was changed to include an underscore, since then it would match what characters were allowed in a Perl variable name. Furthermore, these metacharacters were now allowed inside classes. (Their opposites, \D, \W, and \S, were also newly supported, but weren't allowed within a class, and in any case sometimes didn't work correctly.) Importantly, the /i modifier was added, so you could now do case-insensitive matching.

Perl 3 came out more than a year later, in October 1989. It added the /e modifier, which greatly increased the power of the replacement operator, and fixed some backr eference-related bugs from the previous version. It added the {min,max} range quantifiers, although unfortunately, they didn't always work quite right. Worse still, with Version 3, the regular expression engine couldn't always work with 8-bit data, yielding unpredictable results with non-ASCII input.

Perl 4 was released half a year later, in March 1991, and over the next two years, it was improved until its last update in February 1993. By this time, the bugs were fixed and restrictions expanded (you could use \D and such within character classes, and a regular expression could have virtually unlimited sets of parentheses). Work also went into optimizing how the regex engine went about its task, but the real breakthrough wouldn't happen until 1994.

Perl 5 was officially released in October 1994. Overall, Perl had undergone a massive overhaul, and the result was a vastly superior language in every respect. On the regular-expression side, it had more internal optimizations, and a few metacharacters were added (including \G, which increased the power of iterative matches see Section 3.4.3.3), non-capturing parentheses (see Section 2.2.3.1), lazy quantifiers (see Section 3.4.5.9), lookahead (see Section 2.3.5.1), and the /x modifier^[7] (see Section 2.3.6.4).

^[7] My claim to fame is that Larry added the /x modifier after seeing a note from me discussing a long and complex regex. In the note, I had "pretty printed" the regular expression for clarity. Upon seeing it, he thought that it would be convenient to do so in Perl code as well, so he added /x.

More important than just for their raw functionality, these "outside the box" modifications made it clear that regular expressions could really be a powerful programming language unto themselves, and were still ripe for further development.

The newly-added non-capturing parentheses and lookahead constructs required a way to be expressed. None of the grouping pairs — (···), [···], <···>, or {···} — were available to be used for these new features, so Larry came up with the various '(?' notations we use today. He chose this unsightly sequence because it previously would have been an illegal combination in a Perl regex, so he was free to give it meaning. One important consideration Larry had the foresight to recognize was that there would likely be additional functionality in the future, so by restricting what was allowed after the '(?' sequences, he was able to reserve them for future enhancements.

Subsequent versions of Perl grew more robust, with fewer bugs, more internal optimizations, and new features. I like to believe that the first edition of this book played some small part in this, for as I researched and tested regex-related features, I would send my results to Larry and the Perl Porters group, which helped give some direction as to where improvements might be made.

New regex features added over the years include limited lookbehind (see Section 2.3.5.1), "atomic" grouping (see Section 3.4.5.4), and Unicode support. Regular expressions were brought to the next level by the addition of conditional constructs (see Section 3.4.5.6), allowing you to make if-then-else decisions right there as part of the regular expression. And if that wasn't enough, there are now constructs that allow you to intermingle Perl code within a regular expression, which takes things full circle (see Section 7.8). The version of Perl covered in this book is 5.8.

3.1.1.8 A partial consolidation of flavors

The advances seen in Perl 5 were perfectly timed for the World Wide Web revolution. Perl was built for text processing, and the building of web pages is just that, so Perl quickly became the language for web development. Perl became vastly more popular, and with it, its powerful regular expression flavor did as well.

Developers of other languages were not blind to this power, and eventually regular expression packages that were "Perl compatible" to one extent or another were created. Among these were packages for Tcl, Python, Microsoft's .NET suite of languages, Ruby, PHP, C/C++, and many packages for Java.

3.1.1.9 Versions as of this book

Table 3-2 shows a few of the version numbers for programs and libraries that I talk about in the book. Older versions may well have fewer features and more bugs, while newer versions may have additional features and bug fixes (and new bugs of their own).

Because Java did not originally come with regex support, numerous regex libraries have been developed over the years, so anyone wishing to use regular expressions in Java needed to find them, evaluate them, and ultimately select one to use. Chapter 6 looks at seven such packages, and ways to evaluate them. For reasons discussed there, the regex package that Sun eventually came up with (their java.util.regex, now standard as of Java 1.4) is what I use for most of the Java examples in this book.

Table 2. Versions of Some Tools Mentioned in This Book

GNU awk 3.1
GNU egrep/grep 2.4.2
GNU Emacs 21.2.1
flex 2.5.4
java.util.regex (Java 1.4.0)

MySQL 3.23.49
.NET Framework 2002 (1.0.3705)
PCRE 3.8
Perl 5.8
PHP (preg routines) 4.0.6

Procmail 3.22
Python 2.2.1
Ruby1.6.7
GNU sed 3.02
Tcl 8.4

3.1.2 At a Glance

A chart showing just a few aspects of some common tools gives a good clue to how different things still are. Table 3-3 provides a very superficial look at a few aspects of the regex flavors of a few tools.

Table 3. A (Very) Superficial Look at the Flavor of a Few Common Tools
Feature Modern
grep Modern
egrep GNU
Emacs Tcl Perl .NET Sun's Java
package
*, ^, $, [···]
? + | \? \+ \| ? + | ? + \| ? + | ? + | ? + | ? + |
grouping $···$ (···) $···$ (···) (···) (···) (···)
(?:···)
word boundary \< \> \< \> \b,\B \m, \M, \y \b,\B \b,\B \b,\B
\w, \W
backreferences
supported

A chart like Table 3-3 is often found in other books to show the differences among tools. But, this chart is only the tip of the iceberg—for every feature shown, there are a dozen important issues that are overlooked.

Foremost is that programs change over time. For example, Tcl used to not support backreferences and word boundaries, but now does. It first supported word boundaries with the ungainly-looking [:<:] and [:>:], and still does, although such use is deprecated in favor of its more-recently supported \m, \M, and \y (start of word boundary, end of word boundary, or either).

Along the same lines, programs such as grep and egrep, which aren't from a single provider but rather can be provided by anyone who wants to create them, can have whatever flavor the individual author of the program wishes. Human nature being what is, each tends to have its own features and peculiarities. (The GNU versions of many common tools, for example, are often more powerful and robust than other versions.)

And perhaps as important as the easily visible features are the many subtle (and some not-so-subtle) differences among flavors. Looking at the table, one might think that regular expressions are exactly the same in Perl, .NET, and Java, which is certainly not true. Just a few of the questions one might ask when looking at something like Table 3-3 are:

Are star and friends allowed to quantify something wrapped in parentheses?
Does dot match a newline? Do negated character classes match it? Do either match the null character?
Are the line anchors really line anchors (i.e., do they recognize newlines that might be embedded within the target string)? Are they first-class metacharacters, or are they valid only in certain parts of the regex?
Are escapes recognized in character classes? What else is or isn't allowed within character classes?
Are parentheses allowed to be nested? If so, how deeply (and how many parentheses are even allowed in the first place)?
If backreferences are allowed, when a case-insensitive match is requested, do backreferences match appropriately? Do backreferences "behave" reasonably in fringe situations?
Are octal escapes such as \123 allowed? If so, how do they reconcile the syntactic conflict with backreferences? What about hexadecimal escapes? Is it really the regex engine that supports octal and hexadecimal escapes, or is it some other part of the utility?
Does \w match only alphanumerics, or additional characters as well? (Among the programs shown supporting \w in Table 3-3, there are several different interpretations). Does \w agree with the various word-boundary metacharacters on what does and doesn't constitute a "word character"? Do they respect the locale, or understand Unicode?

Many issues must be kept in mind, even with a tidy little summary like Table 3-3 as a superficial guide. (As another example, peek ahead to Table 8-1 for a look at a chart showing some differences among Java packages.) If you realize that there's a lot of dirty laundry behind that nice façade, it's not too difficult to keep your wits about you and deal with it.

As mentioned at the start of the chapter, much of this is just superficial syntax, but many issues go deeper. For example, once you understand that something such as (Jul|July) in egrep needs to be written as $Jul\|July$ for GNU Emacs, you might think that everything is the same from there, but that's not always the case. The differences in the semantics of how a match is attempted (or, at least, how it appears to be attempted) is an extremely important issue that is often overlooked, yet it explains why these two apparently identical examples would actually end up matching differently: one always matches 'Jul', even when applied to 'July'. Those very same semantics also explain why the opposite, (July|Jul) and $July\|Jul$, do match the same text. Again, the entire next chapter is devoted to understanding this.

Of course, what a tool can do with a regular expression is often more important than the flavor of its regular expressions. For example, even if Perl's expressions were less powerful than egrep's, Perl's flexible use of regexes provides for more raw usefulness. We'll look at a lot of individual features in this chapter, and in depth at a few languages in later chapters.

< Free Open Study >