3.4 Common Metacharacters and Features

The following overview of current regex metacharacters covers common items and concepts. It doesn't discuss every issue, and no one tool includes everything presented here. In one respect, this is just a summary of much of what you've seen in the first two chapters, but in light of the wider, more complex world presented at the beginning of this chapter. During your first pass through this section, a light glance should allow you to continue on to the next chapters. You can come back here to pick up details as you need them.

Some tools add a lot of new and rich functionality and some gratuitously change common notations to suit their whim or special needs. Although I'll sometimes comment about specific utilities, I won't address too many tool-specific concerns here. Rather, in this section I'll just try to cover some common metacharacters and their uses, and some concerns to be aware of. I encourage you to follow along with the manual of your favorite utility.

The following is an outline of the constructs covered in this section, with pointers to the section where each is discussed:

Character Representations
Character Shorthands: \n, \t, \e, ... (see Section 3.4.1.1)
Octal Escapes: \num (see Section 3.4.1.3)
Hex/Unicode Escapes: \xnum, \x{num}, \unum, \Unum, ... (see Section 3.4.1.4)
Control Characters: \cchar (see Section 3.4.1.5)

Character Classes and class-like constructs
Normal classes: [a-z] and [^a-z] (see Section 3.4.2.1)
Almost any character: dot (see Section 3.4.2.2)
Class shorthands: \w, \d, \s, \W, \D, \S (see Section 3.4.2.4)
Unicode properties, blocks, and categories: \p{Prop}, \P{Prop} (see Section 3.4.2.5)
Class set operations: [[a-z]&&[^aeiou]] (see Section 3.4.2.6)
Unicode Combining Character Sequence: \X (see Section 3.4.2.7)
POSIX bracket-expression "character class": [[:alpha:]] (see Section 3.4.2.8)
POSIX bracket-expression "collating sequences": [[.span-ll.]] (see Section 3.4.2.9)
POSIX bracket-expression "character equivalents": [[=n=]] (see Section 3.4.2.10)
Emacs syntax classes (see Section 3.4.2.11)

Anchors and Other "Zero-Width Assertions"
Start of line/string: ^, \A (see Section 3.4.3.1)
End of line/string: $, \Z, \z (see Section 3.4.3.2)
Start of match (or end of previous match): \G (see Section 3.4.3.3)
Word boundaries: \b, \B, \<, \>, ... (see Section 3.4.3.5)
Lookahead (?=···), (?!···); Lookbehind (?<=···), (?<!···) (see Section 3.4.3.6)

Comments and mode-modifiers
Mode modifier: (?modifier), such as (?i) or (?-i) (see Section 3.4.4.1)
Mode-modified span: (?modifier:···), such as (?i:···) (see Section 3.4.4.2)
Comments: (?#···) and #··· (see Section 3.4.4.3)
Literal-text span: \Q···\E (see Section 3.4.4.4)

Grouping, Capturing, Conditionals, and Control
Capturing/grouping parentheses: (···), \1, \2, ... (see Section 3.4.5.1)
Grouping-only parentheses: (?:···) (see Section 3.4.5.2)
Named capture: (?<Name>···) (see Section 3.4.5.3)
Atomic grouping: (?>···) (see Section 3.4.5.4)
Alternation: ···|···|··· (see Section 3.4.5.5)
Conditional: (?if then|else) (see Section 3.4.5.6)
Greedy quantifiers: *, +, ?, {num,num} (see Section 3.4.5.7)
Lazy quantifiers: *?, +?, ??, {num,num}? (see Section 3.4.5.9)
Possessive quantifiers: *+, ++, ?+, {num,num}+ (see Section 3.4.5.10)

3.4.1 Character Representations

This group of metacharacters provides visually pleasing ways to match specific characters that are otherwise difficult to represent.

3.4.1.1 Character shorthands

Many utilities provide metacharacters to represent certain control characters that are sometimes machine-dependent, and which would otherwise be difficult to input or to visualize:

\a - Alert (e.g., to sound the bell when "printed"). Usually maps to the ASCII <BEL> character, 007 octal.
\b - Backspace. Usually maps to the ASCII <BS> character, 010 octal. (Note that \b is often a word-boundary metacharacter instead, as we'll see later.)
\e - Escape character. Usually maps to the ASCII <ESC> character, 033 octal.
\f - Form feed. Usually maps to the ASCII <FF> character, 014 octal.
\n - Newline. On most platforms (including Unix and DOS/Windows), usually maps to the ASCII <LF> character, 012 octal. On classic (pre-OS X) Mac OS systems, usually maps to the ASCII <CR> character, 015 octal. With Java or any .NET language, always the ASCII <LF> character regardless of platform.
\r - Carriage return. Usually maps to the ASCII <CR> character. On classic (pre-OS X) Mac OS systems, usually maps to the ASCII <LF> character. With Java or any .NET language, always the ASCII <CR> character regardless of platform.
\t - Normal (horizontal) tab. Usually maps to the ASCII <HT> character, 011 octal.
\v - Vertical tab. Usually maps to the ASCII <VT> character, 013 octal.

Table 3-6 lists a few common tools and some of the control shorthands they provide. As discussed earlier, some languages also provide many of the same shorthands for the string literals they support. Be sure to review that section (see Section 3.3) for some of the associated pitfalls.

3.4.1.2 These are machine dependent?

As noted in the list, \n and \r are operating-system dependent in many tools,^[10] so it's best to choose carefully when you use them. When you need, for example, "a newline" for whatever system your script will happen to run on, use \n. When you need a character with a specific value, such as when writing code for a defined protocol like HTTP, use \012 or whatever the standard calls for. (\012 is an octal escape.) If you wish to match DOS line-ending characters, use \015\012. To match either DOS or Unix line-ending characters, use \015?\012. (These actually match the line-ending characters. To match at the start or end of a line, use a line anchor; see Section 3.4.3.)
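As a minimal sketch of the \015?\012 idiom, the following Python snippet splits text on either DOS or Unix line endings. The sample strings (including the host name) are made up for illustration, and Python's re module is only one of many flavors that accept octal escapes.

```python
import re

# \015 is CR and \012 is LF, written as octal escapes so the values
# do not depend on how a platform happens to define \r and \n.
LINE_END = re.compile(r"\015?\012")

# Hypothetical sample data: one DOS-style text, one Unix-style text.
dos_text = "GET / HTTP/1.1\r\nHost: example.com\r\n"
unix_text = "first\nsecond\n"

# Splitting on the line ending leaves a trailing empty string,
# because each text ends with a line terminator.
dos_lines = LINE_END.split(dos_text)
unix_lines = LINE_END.split(unix_text)
```

Because the CR is optional, the same expression handles both conventions without knowing in advance which one the data uses.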

^[10] If the tool itself is written in C or C++, and converts its regex backslash escapes into C backslash escapes, the resulting value is dependent upon the compiler used, since the C standard leaves the actual values to the discretion of the compiler vendor. In practice, compilers for any particular platform are standardized around newline support, so it's safe to view these as operating-system dependent. Furthermore, it seems that only \n and \r vary across operating systems, so the others can be considered standard across all systems.

Table 3-6. A Few Utilities and Some of the Shorthand Metacharacters They Provide
Character shorthands (columns): \b (word boundary), \b (backspace), \a (alarm), \e (ASCII escape), \f (form feed), \n (newline), \r (carriage return), \t (tab), \v (vertical tab)
Programs (rows):
Python
Tcl (word boundary as \y)
Perl
Java
GNU awk
GNU sed
GNU Emacs
.NET
PHP
MySQL
GNU grep/egrep
flex
Ruby

Key to table entries:
supported
supported in class only
supported (also supported by string literals)
supported (but string literals have a different meaning for the same sequence)
not supported (but string literals have a different meaning for the same sequence)
not supported (but supported by string literals)
See Section 3.1.1.9 for version information.
This table assumes the most regex-friendly type of string per application (see Section 3.3).

3.4.1.3 Octal escape—`\num`

Implementations supporting octal (base 8) escapes generally allow two- and three-digit octal escapes to be used to indicate a byte or character with a particular value. For example, \015\012 matches an ASCII CR/LF sequence. Octal escapes can be convenient for inserting hard-to-type characters into an expression. In Perl, for instance, you can use \e for the ASCII escape character, but you can't in awk. Since awk does support octal escapes, you can use the ASCII code for the escape character directly: \033.
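The same idea can be tried in Python, whose re module accepts three-digit octal escapes. This is only a sketch with made-up sample strings; other tools differ in which digit counts they allow.

```python
import re

# Sample strings built with explicit code points so nothing depends
# on how this file is typed or displayed.
text_with_escape = "a\x1bb"          # 'a', ASCII ESC (033 octal), 'b'
text_with_crlf = "line one\r\nline two"

# \033 in the regex is an octal escape for the ESC character.
escape_position = re.search(r"\033", text_with_escape).start()

# \015\012 matches a CR immediately followed by an LF.
crlf_match = re.search(r"\015\012", text_with_crlf)
crlf_span = crlf_match.span()
```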

Table 3-7 shows the octal escapes some tools support.

Some implementations, as a special case, allow \0 to match a null byte. Some allow all one-digit octal escapes, but usually don't if backreferences such as \1 are supported. When there's a conflict, backreferences generally take precedence over octal escapes. Some allow four-digit octal escapes, usually to support a requirement that any octal escape begin with a zero (such as with java.util.regex).

You might wonder what happens with out-of-range values like \565 (8-bit octal values range from \000 to \377). It seems that half the implementations leave it as a larger-than-byte value (which may match a Unicode character if Unicode is supported), while the other half strip it to a byte. In general, it's best to limit octal escapes to \377 and below.

3.4.1.4 Hex and Unicode escapes: `\xnum, \x{num}, \unum, \Unum, ...`

Similar to octal escapes, many utilities allow a hexadecimal (base 16) value to be entered using \x, \u, or sometimes \U. If allowed with \x, for example, \x0D\x0A matches the CR/LF sequence. Table 3-7 shows the hex escapes that some tools support.

Besides the question of which escape is used, you must also know how many digits they recognize, and if braces may be (or must be) used around the digits. These are also indicated in Table 3-7.
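For a concrete case, Python's re module accepts \x with exactly two hex digits, \u with four, and \U with eight, and does not use braces. The snippet below is a small sketch of those three forms; the sample characters are arbitrary choices.

```python
import re

crlf = "\r\n"

# Two-digit \x escapes for CR (0D) and LF (0A).
matches_with_x = re.fullmatch(r"\x0D\x0A", crlf) is not None

# Four-digit \u escapes for the same two characters.
matches_with_u = re.fullmatch(r"\u000D\u000A", crlf) is not None

# \u00E9 names e-acute; the subject string is written with an
# explicit escape so the code point is unambiguous.
matches_e_acute = re.fullmatch(r"\u00E9", "\u00e9") is not None

# Eight-digit \U escape for a code point beyond U+FFFF.
matches_astral = re.fullmatch(r"\U0001F600", "\U0001F600") is not None
```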

3.4.1.5 Control characters: `\cchar`

Many flavors offer the \cchar sequence to match control characters with encoding values less than 32 (some allow a wider range). For example, \cH matches a Control-H, which represents a backspace in ASCII, while \cJ matches an ASCII linefeed (which is often also matched by \n, but sometimes by \r, depending on the platform; see Section 3.4.1).

Details aren't uniform among systems that offer this construct. You'll always be safe using uppercase English letters as in the examples. With most implementations, you can use lowercase letters as well, but Sun's Java regex package, for example, does not support them. And what exactly happens with non-alphabetics is very flavor-dependent, so I recommend using only uppercase letters with \c.
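The relationship between the letter and the control character it names is simple arithmetic on ASCII values. Python's re module does not offer a \c escape, so the sketch below computes the character directly; control_char is a hypothetical helper written for this illustration, and it accepts only uppercase letters, in line with the recommendation above.

```python
def control_char(letter):
    """Return the ASCII control character named by an uppercase letter."""
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError("expected a single uppercase English letter")
    # 'A' has the value 65 and Control-A has the value 1, so subtract 64.
    return chr(ord(letter) - 64)

backspace = control_char("H")   # Control-H, value 8
linefeed = control_char("J")    # Control-J, value 10
```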

Related Note: GNU Emacs supports this functionality, but with the rather ungainly metasequence ?\^char (e.g., ?\^H to match an ASCII backspace).

Table 3-7. A Few Utilities and the Octal and Hex Regex Escapes Their Regexes Support
Program      Backreferences   Octal escapes             Hex escapes
Python                        \0, \07, \377             \xFF
Tcl                           \0, \77, \777             \x··· \uFFFF; \UFFFFFFFF
Perl                          \0, \77, \377             \xFF; \x{···}
Java                          \07, \077, \0377          \xFF; \uFFFF
GNU awk                       \7, \77, \377             \x···
GNU sed
GNU Emacs
.NET                          \0, \77, \377             \xFF, \uFFFF
PHP                           \77, \377                 \xF, \xFF
MySQL
GNU egrep
GNU grep
flex                          \7, \77, \377             \xF, \xFF
Ruby                          \0, \77, \377, \0377      \xF, \xFF

\0 - \0 matches a null byte, but other one-digit octal escapes are not supported
\7, \77 - one- and two-digit octal escapes are supported
\07 - two-digit octal escapes are supported if leading digit is a zero
\077 - three-digit octal escapes are supported if leading digit is a zero
\377 - three-digit octal escapes are supported, until \377
\0377 - four-digit octal escapes are supported, until \0377
\777 - three-digit octal escapes are supported, until \777
\x··· - \x allows any number of digits
\x{···} - \x{···} allows any number of digits
\xF, \xFF - one- and two-digit hex escape is allowed with \x
\uFFFF - four-digit hex escape allowed with \u
\UFFFF - four-digit hex escape allowed with \U
\UFFFFFFFF - eight-digit hex escape allowed with \U
(See Section 3.1.1.9 for version information.)

3.4.2 Character Classes and Class-Like Constructs

Modern flavors provide a number of ways to specify a set of characters allowed at a particular point in the regex, but the simple character class is ubiquitous.

3.4.2.1 Normal classes: `[a-z]` and `[^a-z]`

The basic concept of a character class has already been well covered, but let me emphasize again that the metacharacter rules change depending on whether you're in a character class or not. For example, * is never a metacharacter within a class, while - usually is. Some metasequences, such as \b , sometimes have a different meaning within a class than outside of one (see Section 3.4.1.1).

With most systems, the order that characters are listed in a class makes no difference, and using ranges instead of listing characters is irrelevant to the execution speed (e.g., [0-9] should be no different from [9081726354]). However, some implementations don't completely optimize classes (Sun's Java regex package comes to mind), so it's usually best to use ranges, which tend to be faster, wherever possible.

A character class is always a positive assertion. In other words, it must always match a character to be successful. A negated class must still match a character, but one not listed. It might be convenient to consider a negated character class to be a "class to match characters not listed." (Be sure to see the warning about dot and negated character classes, in the next section.) It used to be true that something like [^LMNOP] was the same as [\x00-KQ-\xFF] . In strictly eight-bit systems, it still is, but in a system such as Unicode where character ordinals go beyond 255 (\xFF), a negated class like [^LMNOP] suddenly includes all the tens of thousands of characters in the encoding—all except L, M, N, O, and P.

Be sure to understand the underlying character set when using ranges. For example, [a-Z] is likely an error, and in any case certainly isn't "alphabetics." One specification for alphabetics is [a-zA-Z], at least for the ASCII encoding. (See \p{L} in "Unicode properties," Section 3.4.2.5.) Of course, when dealing with binary data, ranges like \x80-\xFF make perfect sense.
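Both points can be checked with a short Python sketch: a negated class reaches beyond the eight-bit range when the subject text is Unicode, and a range such as [a-Z] is rejected outright by this particular flavor (other tools may accept it silently). The sample characters are arbitrary.

```python
import re

negated = re.compile(r"[^LMNOP]")
eight_bit = re.compile(r"[\x00-KQ-\xFF]")

# 'A', 'M', e-acute (U+00E9), and a CJK character (U+4E2D).
samples = ["A", "M", "\u00e9", "\u4e2d"]

negated_hits = [s for s in samples if negated.fullmatch(s)]
eight_bit_hits = [s for s in samples if eight_bit.fullmatch(s)]

# In ASCII, 'a' (97) sorts after 'Z' (90), so the range is reversed.
try:
    re.compile(r"[a-Z]")
    reversed_range_accepted = True
except re.error:
    reversed_range_accepted = False
```

The two classes agree on every character up to \xFF and disagree only on the character beyond it.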

3.4.2.2 Almost any character: dot

In some tools, dot is a shorthand for a character class that can match any character, while in most others, it is a shorthand to match any character except a newline. It's a subtle difference that is important when working with tools that allow target text to contain multiple logical lines (or to span logical lines, such as in a text editor). Concerns about dot include:

In some Unicode-enabled systems, such as Sun's Java regex package, dot normally does not match a Unicode line terminator (see Section 3.3.2.2).
A match mode (see Section 3.3.3.3) can change the meaning of what dot matches.
The POSIX standard dictates that dot not match a null (a character with the value zero), although all the major scripting languages allow nulls in their text (and dot matches them).

3.4.2.3 Dot versus a negated character class

When working with tools that allow multiline text to be searched, take care to note that dot usually does not match a newline, while a negated class like [^"] usually does. This could yield surprises when changing from something such as ".*" to "[^"]*" . The matching qualities of dot can often be changed by a match mode—see "Dot-matches-all match mode" in Section 3.3.3.3.
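The surprise is easy to reproduce. In the following Python sketch (the sample sentence is invented), the quoted text spans two lines, so the dot version fails while the negated-class version succeeds. Python's name for the dot-matches-all mode is re.DOTALL.

```python
import re

text = 'He said "one\ntwo" and left.'

# Dot does not match the newline, so no quoted string is found.
with_dot = re.search(r'".*"', text)

# The negated class happily matches the newline inside the quotes.
with_negated = re.search(r'"[^"]*"', text)

# With dot-matches-all mode enabled, dot behaves like the class here.
with_dotall = re.search(r'".*"', text, re.DOTALL)
```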

3.4.2.4 Class shorthands: `\w, \d, \s, \W, \D, \S`

Support for the following shorthands is quite common:

\d - Digit. Generally the same as [0-9] or, in some Unicode-enabled tools, all Unicode digits.
\D - Non-digit. Generally the same as [^\d].
\w - Part-of-word character. Often the same as [a-zA-Z0-9_], although some tools omit the underscore, while others include all the extra alphanumeric characters in the locale (see Section 3.1.1.5). If Unicode is supported, \w usually refers to all alphanumerics (notable exception: Sun's Java regex package, whose \w is exactly [a-zA-Z0-9_]).
\W - Non-word character. Generally the same as [^\w].
\s - Whitespace character. On ASCII-only systems, this is often the same as [•\f\n\r\t\v]. Unicode-enabled systems sometimes also include the Unicode "next line" control character U+0085, and sometimes the "whitespace" property \p{Z} (described in the next section).
\S - Non-whitespace character. Generally the same as [^\s].

As described in Section 3.1.1.5, a POSIX locale could influence the meaning of these shorthands (in particular, \w ). Unicode-enabled programs likely have \w match a much wider scope of characters, such as \p{L} (discussed in the next section) plus an underscore.
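Python is one example of a Unicode-enabled flavor in which the shorthands change meaning with a mode. The sketch below compares the default behavior for text strings with the re.ASCII flag, which restricts \d and \w to their ASCII-only meanings. The sample strings are arbitrary.

```python
import re

arabic_indic = "\u0661\u0662\u0663"   # Arabic-Indic digits one, two, three
accented_word = "na\u00efve_1"        # contains i-diaeresis (U+00EF)

# Default (Unicode) behavior for text strings.
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None
unicode_word = re.fullmatch(r"\w+", accented_word) is not None

# ASCII-only behavior: \d is [0-9] and \w is [a-zA-Z0-9_].
ascii_digits = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None
ascii_word = re.fullmatch(r"\w+", accented_word, re.ASCII) is not None
```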

3.4.2.5 Unicode properties, scripts, and blocks: `\p{Prop}, \P{Prop}`

On its surface, Unicode is a mapping (see Section 3.3.2.2), but the Unicode Standard offers much more. It also defines qualities about each character, such as "this character is a lowercase letter," "this character is meant to be written right-to-left," "this character is a mark that's meant to be combined with another character," etc.

Regular-expression support for these qualities varies, but many Unicode-enabled programs support matching via at least some of them with \p{quality} (matches characters that have the quality) and \P{quality} (matches characters without it). One example is \p{L}, where 'L' is the quality meaning "letter" (as opposed to number, punctuation, accents, etc.). \p{L} is an example of a general property (also called a category). We'll soon see other "qualities" that can be tested by \p{···} and \P{···}, but the most commonly supported are the general properties.

The general properties are shown in Table 3-8. Each character (each code point actually, which includes those that have no characters defined) can be matched by just one general property. The general property names are one character ('L' for Letter, 'S' for symbol, etc.), but some systems support a more descriptive synonym ('Letter', 'Symbol', etc.) as well. Perl, for example, supports these.

Table 3-8. Basic Unicode Properties
Class Synonym and description
\p{L} \p{Letter} - Things considered letters.
\p{M} \p{Mark} - Various characters that are not meant to appear by themselves, but with other base characters (accent marks, enclosing boxes, . . . ).
\p{Z} \p{Separator} - Characters that separate things, but have no visual representation (various kinds of spaces . . . ).
\p{S} \p{Symbol} - Various types of Dingbats and symbols.
\p{N} \p{Number} - Any kind of numeric character.
\p{P} \p{Punctuation} - Punctuation characters.
\p{C} \p{Other} - Catch-all for everything else (rarely used for normal characters).

With some systems, single-letter property names may be referenced without the curly braces (e.g., using \pL instead of \p{L} ). Some systems may require (or simply allow) 'In' or 'Is' to prefix the letter (e.g., \p{IsL} ). As we look at additional qualities, we'll see examples of where an Is/In prefix is required.^[11]

^[11] As we'll see (and is illustrated in the table in Section 3.4.2.5), the whole Is/In prefix business is somewhat of a mess. Previous versions of Unicode recommended one thing, while early implementations often did another. During Perl 5.8's development, I worked with the development group to simplify things for Perl. The rule in Perl now is simply "You don't need to use 'Is' or 'In' unless you specifically want a Unicode Block (see Section 3.4.2.5), in which case you must prepend 'In'."

Each one-letter general Unicode property can be further subdivided into a set of two-letter sub-properties, as shown in Table 3-9. Additionally, some implementations support a special composite sub-property, \p{L&} , which is a shorthand for all "cased" letters: [\p{Lu}\p{Ll}\p{Lt}] .

Also shown are the full-length synonyms (e.g., "Lowercase_Letter" instead of "Ll"), which may be supported by some implementations. The standard suggests that a variety of forms be accepted ('LowercaseLetter', 'LOWERCASE_LETTER', 'Lowercase•Letter', 'lowercase-letter', etc.), but I recommend, for consistency, always using the form shown in Table 3-9.
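To see how the one-letter properties and two-letter sub-properties relate, it can help to look up a few characters. Python's standard re module does not provide \p{···}, so the following sketch uses the unicodedata module instead; has_property is a hypothetical helper written for this illustration, not part of any library.

```python
import unicodedata

def has_property(ch, prop):
    """Roughly mimic \\p{prop} for general properties and sub-properties."""
    category = unicodedata.category(ch)   # two-letter code such as 'Lu' or 'Nd'
    if prop == "L&":
        # Composite shorthand for all cased letters.
        return category in ("Lu", "Ll", "Lt")
    # 'L' matches any category starting with L; 'Lu' matches only 'Lu'.
    return category.startswith(prop)

checks = {
    "lowercase a is a Letter": has_property("a", "L"),
    "lowercase a is Ll": has_property("a", "Ll"),
    "lowercase a is Lu": has_property("a", "Lu"),
    "U+01C5 is Lt": has_property("\u01c5", "Lt"),
    "U+01C5 is cased (L&)": has_property("\u01c5", "L&"),
    "Hebrew alef is cased (L&)": has_property("\u05d0", "L&"),
    "digit 5 is a Number": has_property("5", "N"),
    "dollar sign is Sc": has_property("$", "Sc"),
    "combining grave is a Mark": has_property("\u0300", "M"),
}
```

Because every code point has exactly one two-letter category, testing the first letter is enough to decide the one-letter general property.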

Scripts. Some systems have support for matching via a script (writing system) name with \p{···} . For example, if supported, \p{Hebrew} matches characters that are specifically part of the Hebrew writing system. (A script does not match common characters that might be used by other writing systems as well, such as spaces and punctuation.)

Some scripts are language-based (such as Gujarati, Thai, Cherokee, ...). Some span multiple languages (e.g., Latin, Cyrillic), while some languages are composed of multiple scripts, such as Japanese, which uses characters from the Hiragana, Katakana, Han ("Chinese Characters"), and Latin scripts. See your system's documentation for the full list.

Table 3-9. Basic Unicode Sub-Properties
Property Synonym and description

\p{Ll} \p{Lowercase_Letter} - Lowercase letters.
\p{Lu} \p{Uppercase_Letter} - Uppercase letters.
\p{Lt} \p{Titlecase_Letter} - Letters that appear at the start of a word (e.g., the character ǅ is the title case of the lowercase ǆ and of the uppercase Ǆ).
\p{L&} A composite shorthand matching all \p{Ll}, \p{Lu}, and \p{Lt} characters.
\p{Lm} \p{Modifier_Letter} - A small set of letter-like special-use characters.
\p{Lo} \p{Other_Letter} - Letters that have no case, and aren't modifiers, including letters from Hebrew, Arabic, Bengali, Tibetan, Japanese, ...

\p{Mn} \p{Non_Spacing_Mark} - "Characters" that modify other characters, such as accents, umlauts, certain "vowel signs," and tone marks.
\p{Mc} \p{Spacing_Combining_Mark} - Modification characters that take up space of their own (mostly "vowel signs" in languages that have them, including Bengali, Gujarati, Tamil, Telugu, Kannada, Malayalam, Sinhala, Myanmar, and Khmer).
\p{Me} \p{Enclosing_Mark} - A small set of marks that can enclose other characters, such as circles, squares, diamonds, and "keycaps."

\p{Zs} \p{Space_Separator} - Various kinds of spacing characters, such as a normal space, non-break space, and various spaces of specific widths.
\p{Zl} \p{Line_Separator} - The LINE SEPARATOR character (U+2028).
\p{Zp} \p{Paragraph_Separator} - The PARAGRAPH SEPARATOR character (U+2029).

\p{Sm} \p{Math_Symbol} - +, ÷, a fraction slash, ...
\p{Sc} \p{Currency_Symbol} - $, ¢, ¥, €, ...
\p{Sk} \p{Modifier_Symbol} - Mostly versions of the combining characters, but as full-fledged characters in their own right.
\p{So} \p{Other_Symbol} - Various Dingbats, box-drawing symbols, Braille patterns, non-letter Chinese characters, ...

\p{Nd} \p{Decimal_Digit_Number} - Zero through nine, in various scripts (not including Chinese, Japanese, and Korean).
\p{Nl} \p{Letter_Number} - Mostly Roman numerals.
\p{No} \p{Other_Number} - Numbers as superscripts or symbols; characters representing numbers that aren't digits (Chinese, Japanese, and Korean not included).

\p{Pd} \p{Dash_Punctuation} - Hyphens and dashes of all sorts.
\p{Ps} \p{Open_Punctuation} - Characters like (, ...
\p{Pe} \p{Close_Punctuation} - Characters like ), ...
\p{Pi} \p{Initial_Punctuation} - Characters like «, “, ‹, ...
\p{Pf} \p{Final_Punctuation} - Characters like », ’, ›, ...
\p{Pc} \p{Connector_Punctuation} - A few punctuation characters with special linguistic meaning, such as an underscore.
\p{Po} \p{Other_Punctuation} - Catch-all for other punctuation: !, &, ·, :, ...

\p{Cc} \p{Control} - The ASCII and Latin-1 control characters (TAB, LF, CR, ...)
\p{Cf} \p{Format} - Non-visible characters intended to indicate some basic formatting (zero width joiner, activate Arabic form shaping, ...)
\p{Co} \p{Private_Use} - Code points allocated for private use (company logos, etc.).
\p{Cn} \p{Not_Assigned} - Code points that have no characters assigned.

A script does not include all characters used by the particular writing system, but rather, all characters used only (or predominantly) by that writing system. Common characters, such as spacing and punctuation marks, are not included within any script, but rather are included as part of the catch-all pseudo-script IsCommon, matched by \p{IsCommon} . A second pseudo-script, Inherited, is composed of certain combining characters that inherit the script from the base character that they follow.

Blocks. Similar (but inferior) to scripts, blocks refer to ranges of code points on the Unicode character map. For example, the Tibetan block refers to the 256 code points from U+0F00 through U+0FFF. Characters in this block are matched with \p{InTibetan} in Perl and java.util.regex, and with \p{IsTibetan} in .NET. (More on this in a bit.)

There are many blocks, including blocks for most systems of writing (Hebrew, Tamil, Basic_Latin, Hangul_Jamo, Cyrillic, Katakana, ...), and for special character types (Currency, Arrows, Box_Drawing, Dingbats, ...).

Tibetan is one of the better examples of a block, since all characters in the block that are defined relate to the Tibetan language, and there are no Tibetan-specific characters outside the block. Block qualities, however, are inferior to script qualities for a number of reasons:

Blocks can contain unassigned code points. For example, about 25% of the code points in the Tibetan block have no characters assigned to them.
Not all characters that would seem related to a block are actually part of that block. For example, the Currency block does not contain the universal currency symbol '¤', nor such notable currency symbols as $, ¢, £, €, and ¥. (Luckily, in this case, you can use the currency property, \p{Sc}, in its place.)
Blocks often have unrelated characters in them. For example, ¥ (Yen symbol) is found in the Latin_1_Supplement block.
What might be considered one script may be included within multiple blocks. For example, characters used in Greek can be found in both the Greek and Greek_Extended blocks.

Support for block qualities is more common than for script qualities. There is ample room for getting the two confused because there is a lot of overlap in the naming (for example, Unicode provides for both a Tibetan script and a Tibetan block).

Furthermore, as Table 3-10 shows, the nomenclature has not yet been standardized. With Perl and java.util.regex, the Tibetan block is \p{InTibetan} , but in the .NET Framework, it's \p{IsTibetan} (which, to add to the confusion, Perl allows as an alternate representation for the Tibetan script).

Other properties/qualities. Not everything talked about so far is universally supported. Table 3-10 gives a few details about what's been covered so far.

Additionally, Unicode defines many other qualities that might be accessible via the \p{···} construct, including ones related to how a character is written (left-to-right, right-to-left, etc.), vowel sounds associated with characters, and more. Some implementations even allow you to create your own properties on the fly. See your program's documentation for details on what's supported.

Table 3-10. Property/Script/Block Features
Feature Perl Java .NET

Basic Properties like \p{L}
Basic Properties shorthand like \pL
Basic Properties longhand like \p{IsL}
Basic Properties full like \p{Letter}

Composite \p{L&}
Script like \p{Greek}
Script longhand like \p{IsGreek}

Block like \p{Cyrillic}
Block longhand like \p{InCyrillic}
Block longhand like \p{IsCyrillic} if no script

Negated \P{···}
Negated \p{^···}

\p{Any}
\p{Assigned}
\p{Unassigned}

as \p{all}
as \P{Cn}
as \p{Cn}
as \P{Cn}
as \p{Cn}
Lefthand checkmarks are recommended for new implementations. (See Section 3.1.1.9 for version information.)

3.4.2.6 Class set operations: `[[a-z]&&[^aeiou]]`

Sun's Java regex package supports set operations within character classes. For example, you can match all non-vowel English letters with "[a-z] minus [aeiou]". The nomenclature for this may seem a bit odd at first: it's written as [[a-z]&&[^aeiou]], and read aloud as "this and not that." Before looking at that in more detail, let's look at the two basic class set operations, OR and AND.

OR allows you to add characters to the class by including what looks like an embedded class within the class: [abcxyz] can also be written as [[abc][xyz]], [abc[xyz]], or [[abc]xyz] , among others. OR combines sets, creating a new set that is the sum of the argument sets. Conceptually, it's similar to the "bitwise or" operator that many languages have via a '|' or 'or' operator. In character classes, OR is mostly a notational convenience, although the ability to include negated classes can be useful in some situations.

AND does a conceptual "bitwise AND" of two sets, keeping only those characters found in both sets. It is achieved by inserting the special class metasequence && between two sets of characters. For example, [\p{InThai}&&\P{Cn}] matches all assigned code points in the Thai block. It does this by taking the intersection between (i.e., keeping only characters in both) \p{InThai} and \P{Cn}. Remember, \P{···}, with a capital 'P', matches everything not part of the quality, so \P{Cn} matches everything not unassigned, which, in other words, means everything that is assigned. (Had Sun supported the Assigned quality, I could have used \p{Assigned} instead of \P{Cn} in this example.)

Be careful not to confuse OR and AND. How intuitive these names feel depends on your point of view. For example, [[this][that]] is normally read "accept characters that match [this] or [that]," yet it is equally true if read "the list of characters to allow is [this] and [that]." Two points of view for the same thing.

AND is less confusing in that [\p{InThai}&&\P{Cn}] is normally read as "match only characters matchable by \p{InThai} and \P{Cn}," although it is sometimes read as "the list of allowed characters is the intersection of \p{InThai} and \P{Cn}."

These differing points of view can make talking about this confusing: what I call OR and AND, some might choose to call AND and INTERSECTION.

Class subtraction. Thinking further about the [\p{InThai}&&\P{Cn}] example, it's useful to realize that \P{Cn} is the same as [^\p{Cn}], so the whole thing can be rewritten as the somewhat more complex looking [\p{InThai}&&[^\p{Cn}]]. Furthermore, matching "assigned characters in the Thai block" is the same as "characters in the Thai block, minus unassigned characters." The double negative makes it a bit confusing, but it shows that [\p{InThai}&&[^\p{Cn}]] means "\p{InThai} minus \p{Cn}."

This brings us back to the [[a-z]&&[^aeiou]] example from the start of the section, and shows how to do class subtraction. The pattern is that [this&&[^that]] means "[this] minus [that]." I find that the double negatives of && and [^···] tend to make my head swim, so I just remember the [···&&[^···]] pattern.
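The three operations map directly onto ordinary set arithmetic, which can make the double negative easier to follow. The sketch below uses Python's built-in set type rather than any regex syntax, and it assumes an ASCII-only universe of characters purely to keep the negated set small.

```python
import string

lowercase = set(string.ascii_lowercase)          # [a-z]
vowels = set("aeiou")                            # [aeiou]

# Assumption for this demo: the universe of characters is ASCII.
universe = {chr(code) for code in range(128)}

# OR: [[abc][xyz]] is the union of the two sets.
either = set("abc") | set("xyz")

# AND with a negated class: [[a-z]&&[^aeiou]].
not_vowels = universe - vowels                   # [^aeiou]
consonants_via_and = lowercase & not_vowels

# Subtraction: "[a-z] minus [aeiou]" gives the same result.
consonants_via_minus = lowercase - vowels
```

Intersecting with the complement of a set and subtracting that set always produce the same members, which is why the [this&&[^that]] pattern reads as "this minus that."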

Mimicking class set operations with lookaround. If your program doesn't support class set operations, but does support lookaround (see Section 3.4.3.6), you can mimic the set operations. With lookahead, [\p{InThai}&&[^\p{Cn}]] can be rewritten as (?!\p{Cn})\p{InThai}.^[12] Although not as efficient as well-implemented class set operations, using lookaround can be quite flexible. This example can be written four different ways (substituting IsThai for InThai in .NET; see Section 3.4.2.6):

(?!\p{Cn})\p{InThai}

(?=\P{Cn})\p{InThai}

\p{InThai}(?<!\p{Cn})

\p{InThai}(?<=\P{Cn})

^[12] Actually, in Perl, this particular example could probably be written simply as \p{Thai} , since in Perl \p{Thai} is a script, which never contains unassigned characters. Other differences between the Thai script and block are subtle. It's beneficial to have the documentation as to what is actually covered by any particular script or block. In this case, the script is actually missing a few special characters that are in the block.
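The same four shapes can be tried in a flavor that lacks both class set operations and \p{···}. The Python sketch below applies them to the simpler [[a-z]&&[^aeiou]] example instead of the Thai one; the sample word is arbitrary.

```python
import re

# Four ways to express "[a-z] minus [aeiou]" with lookaround.
patterns = [
    r"(?![aeiou])[a-z]",     # negative lookahead, then the class
    r"(?=[^aeiou])[a-z]",    # positive lookahead with a negated class
    r"[a-z](?<![aeiou])",    # the class, then negative lookbehind
    r"[a-z](?<=[^aeiou])",   # the class, then positive lookbehind
]

word = "regular"
results = ["".join(re.findall(p, word)) for p in patterns]
```

Each pattern consumes one character from [a-z] and uses the zero-width assertion only to veto the vowels, so all four produce the same matches.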

3.4.2.7 Unicode combining character sequence: `\X`

Perl supports \X as a shorthand for \P{M}\p{M}*, which is like an extended . (dot). It matches a base character (anything not \p{M}), followed by any number (including none) of combining characters (anything that is \p{M}).

As discussed earlier (see Section 3.3.2.2), Unicode uses a system of base and combining characters which, in combination, create what look like single, accented characters like à ('a' U+0061 combined with the grave accent '`' U+0300). You can use more than one combining character if that's what you need to create the final result. For example, if for some reason you need 'ç̆', that would be 'c' followed by a combining cedilla and a combining breve (U+0063 followed by U+0327 and U+0306).

If you wanted to match either "francais" or "français," it wouldn't be safe to just use fran.ais or fran[cç]ais, as those assume that the 'ç' is rendered with the single Unicode code point U+00E7, rather than 'c' followed by the cedilla (U+0063 followed by U+0327). You could perhaps use fran(c¸?|ç)ais if you needed to be very specific, but in this case, fran\Xais is a good substitute for fran.ais.

Besides the fact that \X matches trailing combining characters, there are two differences between it and dot. One is that \X always matches a newline and other Unicode line terminators (see Section 3.3.2.2), while dot is subject to dot-matches-all matchmode (see Section 3.3.3.3), and perhaps other match modes depending on the tool. Another difference is that a dot-matches-all dot is guaranteed to match all characters at all times, while \X doesn't match a leading combining character.
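To see why dot falls short here, consider this sketch in Python. Its standard re module has no \X, so the helper combining_sequences (a hypothetical name of my own) only approximates \P{M}\p{M}* by grouping each base character with the marks that follow it, using unicodedata's general categories:

```python
import re
import unicodedata

def combining_sequences(text):
    # Group each base character with the combining marks that follow it.
    # Unlike \X, a mark at the very start of the text becomes its own chunk here.
    chunks = []
    for ch in text:
        if unicodedata.category(ch).startswith("M") and chunks:
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks

precomposed = "fran\u00e7ais"   # single code point U+00E7
decomposed = "franc\u0327ais"   # 'c' followed by combining cedilla U+0327

# Dot consumes exactly one code point, so only the precomposed form matches.
dot_precomposed = re.fullmatch(r"fran.ais", precomposed) is not None
dot_decomposed = re.fullmatch(r"fran.ais", decomposed) is not None

print(dot_precomposed, dot_decomposed)         # True False
print(len(combining_sequences(precomposed)))   # 8
print(len(combining_sequences(decomposed)))    # 8
```

Counted as combining sequences, both spellings are eight "characters" long, which is the view of the text that \X gives you.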

3.4.2.8 POSIX bracket-expression "character class": `[[:alpha:]]`

What we normally call a character class, the POSIX standard calls a bracket expression. POSIX uses the term "character class" for a special feature used within a bracket expression^[13] that we might consider to be the precursor to Unicode's character properties.

^[13] In general, this book uses "character class" and "POSIX bracket expression" as synonyms to refer to the entire construct, while "POSIX character class" refers to the special range-like class feature described here.

A POSIX character class is one of several special metasequences for use within a POSIX bracket expression. An example is [:lower:], which represents any lowercase letter within the current locale (see Section 3.1.1.5). For English text, [:lower:] is comparable to a-z. Since this entire sequence is valid only within a bracket expression, the full class comparable to [a-z] is [[:lower:]] . Yes, it's that ugly. But, it has the advantage over [a-z] of including other characters, such as ö, ñ, and the like if the locale actually indicates that they are "lowercase letters."

The exact list of POSIX character classes is locale dependent, but the following are usually supported:

[:alnum:] alphabetic characters and numeric characters
[:alpha:] alphabetic characters
[:blank:] space and tab
[:cntrl:] control characters
[:digit:] digits
[:graph:] non-blank characters (not spaces, control characters, or the like)
[:lower:] lowercase alphabetics
[:print:] like [:graph:], but includes the space character
[:punct:] punctuation characters
[:space:] all whitespace characters ([:blank:], newline, carriage return, and the like)
[:upper:] uppercase alphabetics
[:xdigit:] digits allowed in a hexadecimal number (i.e., 0-9a-fA-F).

Systems that support Unicode properties (see Section 3.4.2.5) may or may not extend that Unicode support to these POSIX constructs. The Unicode property constructs are more powerful, so those should generally be used if available.

3.4.2.9 POSIX bracket-expression "collating sequences": `[[.span-ll.]]`

A locale can have collating sequences to describe how certain characters or sets of characters should be ordered. For example, in Spanish, the two characters ll (as in tortilla) traditionally sort as if they were one logical character between l and m, and the German ß is a character that falls between s and t, but sorts as if it were the two characters ss. These rules might be manifested in collating sequences named, for example, span-ll and eszet.

A collating sequence that maps multiple physical characters to a single logical character, such as the span-ll example, is considered "one character" to a fully compliant POSIX regex engine. This means that something like [^abc] matches the 'll' sequence.

A collating sequence element is included within a bracket expression using a [.···.] notation: torti[[.span-ll.]]a matches tortilla. A collating sequence allows you to match against those characters that are made up of combinations of other characters. It also creates a situation where a bracket expression can match more than one physical character.

3.4.2.10 POSIX bracket-expression "character equivalents": `[[=n=]]`

Some locales define character equivalents to indicate that certain characters should be considered identical for sorting and such. For example, a locale might define an equivalence class 'n' as containing n and ñ, or perhaps one named 'a' as containing a, à, and á. Using a notation similar to [:···:], but with '=' instead of a colon, you can reference these equivalence classes within a bracket expression: [[=n=][=a=]] matches any of the characters just mentioned.

If a character equivalence with a single-letter name is used but not defined in the locale, it defaults to the collating sequence of the same name. Locales normally include normal characters as collating sequences — [.a.], [.b.], [.c.], and so on—so in the absence of special equivalents, [[=n=][=a=]] defaults to [na] .

3.4.2.11 Emacs syntax classes

GNU Emacs doesn't support the traditional \w , \s , etc.; rather, it uses special sequences to reference "syntax classes":

\s char matches characters in the Emacs syntax class as described by char
\S char matches characters not in the Emacs syntax class

\sw matches a "word constituent" character, and \s- matches a "whitespace character." These would be written as \w and \s in many other systems.

Emacs is special because the choice of which characters fall into these classes can be modified on the fly, so, for example, the concept of which characters are word constituents can be changed depending upon the kind of text being edited.

3.4.3 Anchors and Other "Zero-Width Assertions"

Anchors and other "zero-width assertions" don't match actual text, but rather positions in the text.

3.4.3.1 Start of line/string: ^, `\A`

Caret ^ matches at the beginning of the text being searched, and, if in an enhanced line-anchor match mode (see Section 3.3.3.4), after any newline. In some systems, an enhanced-mode ^ can match after Unicode line terminators, as well (see Section 3.3.2.2).

When supported, \A always matches only at the start of the text being searched, regardless of any match mode.
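As a quick illustration, here is how the two anchors behave in Python, where the enhanced line-anchor mode is spelled re.MULTILINE. This is a sketch of one tool's behavior only; the sample text is my own:

```python
import re

text = "one\ntwo\nthree"

# Without the mode, ^ matches only at the start of the text.
default_caret = re.findall(r"^\w+", text)

# In enhanced line-anchor mode, ^ also matches after each newline.
multiline_caret = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the mode and matches only at the start of the text.
start_only = re.findall(r"\A\w+", text, re.MULTILINE)

print(default_caret)    # ['one']
print(multiline_caret)  # ['one', 'two', 'three']
print(start_only)       # ['one']
```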

3.4.3.2 End of line/string: `$, \Z, \z`

As Table 3-11 below shows, the concept of "end of line" can be a bit more complex than its start-of-line counterpart. $ has a variety of meanings among different tools, but the most common meaning is that it matches at the end of the target string, and before a string-ending newline, as well. The latter is common, to allow an expression like s$ (ostensibly, to match "a line ending with s") to match '···s', a line ending with s that's capped with an ending newline.

Two other common meanings for $ are to match only at the end of the target text, and to match after any newline. In some Unicode systems, the special meaning of newline in these rules is replaced by Unicode line terminators (see Section 3.3.2.2).

A match mode (see Section 3.3.3.4) can change the meaning of $ to match before any embedded newline (or Unicode line terminator as well).

When supported, \Z usually matches what the "unmoded" $ matches, which often means to match at the end of the string, or before a string-ending newline. To complement these, \z matches only at the end of the string, period, without regard to any newline. See Table 3-11 for a few exceptions.
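Python makes a handy test bed for these distinctions, partly because it is one of the exceptions: its \Z matches only at the very end of the string. The following sketch (with sample text of my own) shows the default $, the enhanced-mode $, and \Z side by side:

```python
import re

text = "ab\ncd\n"

# Default $: end of string, or just before a string-ending newline.
end_default = re.findall(r"\w$", text)

# Enhanced line-anchor mode: $ also matches before each embedded newline.
end_multiline = re.findall(r"\w$", text, re.MULTILINE)

# Python's \Z: only at the very end, so the final newline gets in the way.
end_strict = re.findall(r"\w\Z", text)

print(end_default)    # ['d']
print(end_multiline)  # ['b', 'd']
print(end_strict)     # []
```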

Table 3-11. Line Anchors for Some Scripting Languages
Concern Java Perl PHP Python Ruby Tcl .NET
Normally . . .
^ matches at start of string
^ matches after any newline
$ matches at end of string
$ matches before string-ending newline
$ matches before any newline

Has enhanced line-anchor mode (see Section 3.3.3.4)
In enhanced line-anchor mode . . .
^ matches at start of string
^ matches after any newline
$ matches at end of string
$ matches before any newline

N/A
N/A
N/A
N/A

\A always matches like normal ^ ·4
\Z always matches like normal $ ·3 ·5
\z always matches only at end of string N/A N/A
Notes:

1. Sun's Java regex package supports Unicode's line terminator (see Section 3.3.2.2) in these cases.
2. Ruby's $ and ^ match at embedded newlines, but its \A and \Z do not.
3. Python's \Z matches only at the end of the string.
4. Ruby's \A, unlike its ^, matches only at the start of the string.
5. Ruby's \Z, unlike its $, matches at the end of the string, or before a string-ending newline.

(see Section 3.1.1.9 for version information.)

3.4.3.3 Start of match (or end of previous match): `\G`

\G was first introduced by Perl to be useful when doing iterative matching with /g (see Section 2.3.2), and ostensibly matches the location where the previous match left off. On the first iteration, \G matches only at the beginning of the string, just like \A .

If a match is not successful, the location at which \G matches is reset back to the beginning of the string. Thus, when a regex is applied repeatedly, as with Perl's s/···/···/g or another language's "match all" function, the failure that causes the "match all" to fail also resets the location for \G for the next time a match of some sort is applied.

Perl's \G has three unique aspects that I find quite interesting and useful:

1. The location associated with \G is an attribute of each target string, not of the regexes that are setting that location. This means that multiple regexes can match against a string, in turn, each using \G to ensure that they pick up exactly where one of the others left off.
2. Perl's regex operators have an option (Perl's /c modifier; see Section 7.5.4.3) that indicates that a failing match should not reset the \G location, but rather leave it where it was. This works well with the first point to allow tests with a variety of expressions to be performed at one point in the target string, advancing only when there's a match.
3. The location associated with \G can be inspected and modified by non-regex constructs (Perl's pos function; see Section 7.5.4.1). One might want to explicitly set the location to "prime" a match to start at a particular location, and match only at that location. Also, if the language supports this point, the functionality of the previous point can be mimicked with this feature, if it's not already supported directly.

See the sidebar below for an example of these features in action. Despite these convenient features, Perl's \G does have a problem in that it works reliably only when it's the first thing in the regex. Luckily, that's where it's most naturally used.
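Languages without \G can sometimes approximate the third point by tracking the position in program code. Python's re module, for instance, has no \G, but a compiled pattern's match method accepts a starting position and succeeds only if the match begins exactly there. The following is a rough sketch of that idea, not a translation of any Perl code; the token patterns and the names WORD, NUMBER, PUNCT, and tokenize are hypothetical:

```python
import re

# Hypothetical token patterns, each tried at the current position in turn.
WORD = re.compile(r"[A-Za-z_]\w*")
NUMBER = re.compile(r"\d+")
PUNCT = re.compile(r"[=+\-*/]")

def tokenize(text):
    pos = 0          # plays the role of the location associated with \G
    tokens = []
    while pos < len(text):
        for kind, pattern in (("word", WORD), ("number", NUMBER), ("punct", PUNCT)):
            m = pattern.match(text, pos)   # must match starting exactly at pos
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()              # advance only when something matched
                break
        else:
            # Nothing matched here; a failed attempt leaves pos untouched.
            raise ValueError(f"Unexpected text at position {pos}: {text[pos:pos + 12]!r}")
    return tokens

print(tokenize("x=42+y"))
# [('word', 'x'), ('punct', '='), ('number', '42'), ('punct', '+'), ('word', 'y')]
```

Because the position lives in an ordinary variable, a failed attempt never disturbs it, which is the effect Perl's /c modifier provides.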

3.4.3.4 End of previous match, or start of the current match?

One detail that differs among implementations is whether \G actually matches the "start of the current match" or "end of the previous match." In the vast majority of cases, the two meanings are the same, so it's a non-issue most of the time. Uncommonly, they can differ. There is a realistic example of how this might arise in Section 5.4.2.1, but the issue is easiest to understand with a contrived example: consider applying x? to 'abcde'. The regex can match successfully at the start of 'abcde', but doesn't actually match any text. In a global search-and-replace situation, where the regex is applied repeatedly, picking up each time from where it left off, unless the transmission does something special, the "where it left off" will always be the same as where it started. To avoid an infinite loop, the transmission forcefully bumps along to the next character when it recognizes this situation. You can see this by applying s/x?/!/g to 'abcde', yielding '!a!b!c!d!e!'.
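The bump-along is easy to observe in other tools, too. As a small sketch, Python's re.sub applied to the same contrived example gives the same result:

```python
import re

# x? matches nothingness at every position, so the engine must bump along
# by one character after each empty match to avoid looping forever.
result = re.sub(r"x?", "!", "abcde")

print(result)  # !a!b!c!d!e!
```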

Advanced Use of \G with Perl

Here's the outline of a snippet that performs simple validation on the HTML in the variable $html, ensuring that it contains constructs from among only a very limited subset of HTML (simple <IMG> and <A> tags are allowed, as well as simple entities like &gt;). I've used this method at Yahoo!, for example, to validate that a user's HTML submission met certain guidelines.

This code relies heavily on the behavior of Perl's m/···/gc match operator, which applies the regular expression to the target string once, picking up from where the last successful match left off, but not resetting that position if it fails (see Section 7.5.4.3).

Using this feature, the various expressions used below all "tag team" to work their way through the string. It's similar in theory to having one big alternation with all the expressions, but this approach allows program code to be executed with each match, and to include or exclude expressions on the fly.

    my $need_close_anchor = 0; # True if we've seen <A>, but not its closing </A>.

    while (not $html =~ m/\G\z/gc) # While we haven't worked our way to the end . . .
    {
       if ($html =~ m/\G(\w+)/gc) {
          # . . . have a word or number in $1 -- can now check for profanity, for example . . .
       } elsif ($html =~ m/\G[^<>&\w]+/gc) {
          # Other non-HTML stuff -- simply allow it.
       } elsif ($html =~ m/\G<img\s+([^>]+)>/gci) {
          # . . . have an image tag -- can check that it's appropriate . . .
       } elsif (not $need_close_anchor and $html =~ m/\G<A\s+([^>]+)>/gci) {
          # . . . have a link anchor -- can validate it . . .
          $need_close_anchor = 1; # Note that we now need </A>
       } elsif ($need_close_anchor and $html =~ m{\G</A>}gci) {
          $need_close_anchor = 0; # Got what we needed; don't allow again
       } elsif ($html =~ m/\G&(#\d+|\w+);/gc) {
          # Allow entities like &gt; and &#123;
       } else {
          # Nothing matched at this point, so it must be an error. Note the location first, then
          # grab a dozen or so characters from the HTML for an informative error message.
          my $location = pos($html) || 0; # Note where the unexpected HTML starts.
          my ($badstuff) = $html =~ m/\G(.{1,12})/s; # No /g here, so pos is left alone.
          die "Unexpected HTML at position $location: $badstuff\n";
       }
    }

    # Make sure there's no dangling <A>
    if ($need_close_anchor) {
       die "Missing final </A>"
    }

One side effect of the transmission having to step in this way is that the "end of the previous match" then differs from "the start of the current match." When this happens, the question becomes: which of the two locations does \G match? In Perl, actually applying s/\Gx?/!/g to 'abcde' yields just '!abcde', so in Perl, \G really does match only the end of the previous match. If the transmission does the artificial bump-along, Perl's \G is guaranteed to fail.

On the other hand, applying the same search-and-replace with some other tools yields the original '!a!b!c!d!e!', showing that their \G matches successfully at the start of each current match, as decided after the artificial bump-along.

You can't always rely on the documentation that comes with a tool to tell you which is which, as I've found that both Microsoft's .NET and Sun's Java documentation are incorrect. My testing has shown that java.util.regex and Ruby have \G match at the start of the current match, while Perl and the .NET languages have it match at the end of the previous match. (Sun tells me that the next release of java.util.regex will have its \G behavior match the documentation.)

3.4.3.5 Word boundaries: `\b, \B, \<, \>, ...`

Like line anchors, word-boundary anchors match a location in the string. There are two distinct approaches. One provides separate metasequences for start- and end-of-word boundaries (often \< and \>), while the other provides one catch-all word boundary metasequence (often \b). Either generally provides a not-word-boundary metasequence as well (often \B). Table 3-12 shows a few examples. Tools that don't provide separate start- and end-of-word anchors, but do support lookaround, can mimic word-boundary anchors with the lookaround. In the table, I've filled in the otherwise empty spots that way, wherever practical.

A word boundary is generally defined as a location where there is a "word character" on one side, and not on the other. Each tool has its own idea of what constitutes a "word character," as far as word boundaries go. It would make sense if the word boundaries agree with \w, but that's not always the case. With Sun's Java regex package, for example, \w applies only to ASCII and not the full Unicode that Java supports, so in the table I've used lookaround with the Unicode letter property \pL (which is a shorthand for \p{L} see Section 3.4.2.5).

Whatever the word boundaries consider to be "word characters," word boundary tests are always a simple test of adjoining characters. No regex engine actually does linguistic analysis to decide about words: all consider "NE14AD8" to be a word, but not "M.I.T."

Table 3-12. A Few Utilities and Their Word Boundary Metacharacters
Program     Start-of-word . . . End-of-word           Word boundary      Not word-boundary
GNU egrep   \< . . . \>                               \b                 \B
GNU Emacs   \< . . . \>                               \b                 \B
GNU awk     \< . . . \>                               \y                 \B
MySQL       [[:<:]] . . . [[:>:]]                     [[:<:]]|[[:>:]]
Perl        (?<!\w)(?=\w) . . . (?<=\w)(?!\w)         \b                 \B
PHP         (?<!\w)(?=\w) . . . (?<=\w)(?!\w)         \b                 \B
Python      (?<!\w)(?=\w) . . . (?<=\w)(?!\w)         \b                 \B
Ruby                                                  \b                 \B
GNU sed     \< . . . \>                               \b                 \B
Java        (?<!\pL)(?=\pL) . . . (?<=\pL)(?!\pL)     \b                 \B
Tcl         \m . . . \M                               \y                 \Y
.NET        (?<!\w)(?=\w) . . . (?<=\w)(?!\w)         \b                 \B
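To make the lookaround versions concrete, here is a short Python sketch that marks the start-of-word and end-of-word positions the table's Python row describes. The sample text echoes the "NE14AD8" and "M.I.T." example; the markers '<' and '>' are simply my way of making the zero-width positions visible:

```python
import re

text = "NE14AD8 M.I.T."

# \b is the catch-all boundary: every run of word characters is a "word".
words = re.findall(r"\b\w+\b", text)

# Start-of-word: no word character behind, a word character ahead.
starts = re.sub(r"(?<!\w)(?=\w)", "<", text)

# End-of-word: a word character behind, no word character ahead.
ends = re.sub(r"(?<=\w)(?!\w)", ">", text)

print(words)   # ['NE14AD8', 'M', 'I', 'T']
print(starts)  # <NE14AD8 <M.<I.<T.
print(ends)    # NE14AD8> M>.I>.T>.
```

As the output shows, the test really is just about adjoining characters: "M.I.T." comes out as three one-letter words.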

3.4.3.6 Lookahead `(?=···), (?!···)`; Lookbehind, `(?<=···), (?<!···)`

Lookahead and lookbehind constructs (collectively, lookaround) are discussed with an extended example in the previous chapter's "Adding Commas to a Number with Lookaround" (see Section 2.3.5). One important issue not discussed there relates to what kind of expression can appear within either of the lookbehind constructs. Most implementations have restrictions about the length of text matchable within lookbehind (but not within lookahead, which is unrestricted).

The most restrictive rule exists in Perl and Python, where the lookbehind can match only fixed-length strings. For example, (?<!\w) and (?<!this|that) are allowed, but (?<!books?) and (?<!^\w+:) are not, as they can match a variable amount of text. In some cases, such as with (?<!books?), you can accomplish the same thing by rewriting the expression, as with (?<!book)(?<!books) , although that's certainly not easy to read at first glance.

The next level of support allows alternatives of different lengths within the lookbehind, so (?<!books?) can be written as (?<!book|books). PCRE (and the pcre routines in PHP) allows this.

The next level allows for regular expressions that match a variable amount of text, but only if it's of a finite length. This allows (?<!books?) directly, but still disallows (?<!^\w+:) since the \w+ is open-ended. Sun's Java regex package supports this level.

When it comes down to it, these first three levels of support are really equivalent, since they can all be expressed, although perhaps somewhat clumsily, with the most restrictive fixed-length matching level of support. The intermediate levels are just "syntactic sugar" to allow you to express the same thing in a more pleasing way. The fourth level, however, allows the subexpression within lookbehind to match any amount of text, including the (?<!^\w+:) example. This level, supported by Microsoft's .NET languages, is truly superior to the others, but does carry a potentially huge efficiency penalty if used unwisely. (When faced with lookbehind that can match any amount of text, the engine is forced to check the lookbehind subexpression from the start of the string, which may mean a lot of wasted effort when requested from near the end of a long string.)
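The most restrictive level is easy to probe with Python's standard re module, which (at least in the versions I'm aware of) rejects variable-width lookbehind when the pattern is compiled. This sketch checks which forms are accepted, then uses the two-lookbehind rewrite of (?<!books?); the helper name compiles and the sample text are my own:

```python
import re

def compiles(pattern):
    # Report whether re accepts the pattern at all.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_ok = compiles(r"(?<!this|that)")   # alternatives of equal length
variable_ok = compiles(r"(?<!books?)")   # can match 4 or 5 characters

# The rewrite: two fixed-length lookbehinds in a row.
text = "books.com book.com cars.com"
hits = re.findall(r"\w+(?<!book)(?<!books)\.com", text)

print(fixed_ok, variable_ok)  # True False
print(hits)                   # ['cars.com']
```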

3.4.4 Comments and Mode Modifiers

With many flavors, the regex modes and match modes described earlier (see Section 3.3.3) can be modified within the regex (on the fly, so to speak) by the following constructs.

3.4.4.1 Mode modifier: `(?modifier)`, such as `(?i)` or `(?-i)`

Many flavors now allow some of the regex and match modes (see Section 3.3.3) to be set within the regular expression itself. A common example is the special notation (?i), which turns on case-insensitive matching, and (?-i), which turns it off. For example, <B>(?i)very(?-i)</B> has the very part match with case insensitivity, while still keeping the tag names case-sensitive. This matches '<B>VERY</B>' and '<B>Very</B>', for example, but not '<b>Very</b>'.

This example works with most systems that support (?i), including Perl, java.util.regex, Ruby, and the .NET languages. But, some systems have different semantics. With Python, for example, the appearance of (?i) anywhere in the regex turns on case-insensitive matching for the entire regex, and Python doesn't support turning it off with (?-i). Tcl's case-insensitive matching is also all-or-nothing, but Tcl requires the (?i) to be at the beginning of the regex; anywhere else is an error. Ruby has a bug whereby sometimes (?i) doesn't apply to |-separated alternatives that are lowercase (but does if they're uppercase). PHP has the special case that if (?i) is used outside of all parentheses, it applies to the entire regex. So, in PHP, we'd have to write our example with an extra set of "constraining" parentheses: <B>(?:(?i)very(?-i))</B>.

Actually, that last PHP example can be simplified a bit because with many implementations (including PHP's), when (?i) is used within any type of parentheses, its effects are limited by the parentheses (i.e., turn off at the closing parentheses). So, the (?-i) can simply be eliminated: <B>(?:(?i)very)</B>.

The mode-modifier constructs support more than just 'i'. With most systems, you can use at least those shown in Table 3-13.

Table 3-13. Common Mode Modifiers
Letter Mode
i case-insensitivity match mode (see Section 3.3.3.1)
x free-spacing and comments regex mode (see Section 3.3.3.2)
s dot-matches-all match mode (see Section 3.3.3.3)
m enhanced line-anchor match mode (see Section 3.3.3.4)

Some systems have additional letters for additional functions. Tcl has a number of different letters for turning its various modes on and off — see its documentation for the complete list.

3.4.4.2 Mode-modified span: `(?modifier:···)`, such as `(?i:···)`

The example from the previous section can be made even simpler for systems that support a mode-modified span. Using a syntax like (?i:···), a mode-modified span turns on the mode only for what's matched within the parentheses. Using this, the <B>(?:(?i)very)</B> example is simplified to <B>(?i:very)</B>.

When supported, this form generally works for all mode-modifier letters the system supports. Tcl and Python are two examples that support the (?i) form, but not the mode-modified span (?i:···) form.
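Support for these constructs shifts between releases, so it pays to test the version at hand. As one data point, the re module in Python 3.6 and later does accept the span form, unlike the older Python versions described here. The sketch below assumes such a version and wraps the span in literal <B> tags so that the case-sensitive part is visible:

```python
import re

# Only "very" is matched case-insensitively; the tags stay case-sensitive.
pattern = re.compile(r"<B>(?i:very)</B>")

samples = ["<B>VERY</B>", "<B>Very</B>", "<b>Very</b>"]
results = [bool(pattern.fullmatch(s)) for s in samples]

print(results)  # [True, True, False]
```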

3.4.4.3 Comments: `(?#···)` and `#···`

Some flavors support comments via (?#···) . In practice, this is rarely used, in favor of the free-spacing and comments regex mode (see Section 3.3.3.2). However, this type of comment is particularly useful in languages for which it's difficult to get a newline into a string literal, such as VB.NET (see Section 3.2.3.2, Section 9.3.1.2).

3.4.4.4 Literal-text span: `\Q···\E`

First introduced with Perl, the special sequence \Q···\E turns off all regex metacharacters between them, except for \E itself. (If the \E is omitted, they are turned off until the end of the regex.) It allows what would otherwise be taken as normal metacharacters to be treated as literal text. This is especially useful when including the contents of a variable while building a regular expression.

For example, to respond to a web search, you might accept what the user types as $query, and search for it with m/$query/i. As it is, this would certainly have unexpected results if $query were to contain, say, 'C:\WINDOWS\', which results in a run-time error because the search term contains something that isn't a valid regular expression (the trailing lone backslash). To get around this, you could use m/\Q$query\E/i, which effectively turns 'C:\WINDOWS\' into 'C:\\WINDOWS\\', resulting in a search that finds 'C:\WINDOWS\' as the user expects.

This kind of feature is less useful in systems with procedural and object-oriented handling (see Section 3.2.2), as they accept normal strings. While building the string to be used as a regular expression, it's fairly easy to call a function to make the value from the variable "safe" for use in a regular expression. In VB, for example, one would use the Regex.Escape method.
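Python's counterpart to that kind of escaping function is re.escape. As a minimal sketch of the web-search scenario (the variable names query and text are mine), the raw search term is not even a valid regex, while the escaped one finds the literal text:

```python
import re

query = "C:\\WINDOWS\\"               # the eleven characters C:\WINDOWS\
text = "Look in C:\\WINDOWS\\ first"

# Used as-is, the trailing lone backslash makes the regex invalid.
try:
    re.compile(query)
    raw_is_valid = True
except re.error:
    raw_is_valid = False

# Escaped, every metacharacter in the query is treated as literal text.
safe = re.search(re.escape(query), text, re.IGNORECASE)

print(raw_is_valid)   # False
print(safe.group())   # C:\WINDOWS\
```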

Currently, the only regex engine I know of that fully supports \Q···\E is Sun's java.util.regex engine. Considering that I just mentioned that this was introduced with Perl (and I gave an example in Perl), you might wonder why I don't include Perl in that statement. Perl supports \Q···\E within regex literals (regular expressions typed directly in the program), but not within the contents of variables that might be interpolated into them. See Chapter 7 (Section 7.2.1.1) for details.

3.4.5 Grouping, Capturing, Conditionals, and Control

3.4.5.1 Capturing/Grouping Parentheses: `(···)` and `\1, \2, ...`

Common, unadorned parentheses generally perform two functions, grouping and capturing. Common parentheses are almost always of the form (···), but a few flavors use \(···\). These include GNU Emacs, sed, vi, and grep.

Capturing parentheses are numbered by counting their opening parentheses from the left, as shown in figures in Section 2.2.2, Section 2.2.3, and Section 2.3.4. If backreferences are available, the text matched via an enclosed subexpression can itself be matched later in the same regular expression with \1 , \2 , etc.

One of the most common uses of parentheses is to pluck data from a string. The text matched by a parenthesized subexpression (also called "the text matched by the parentheses") is made available after the match in different ways by different programs, such as Perl's $1, $2, etc. (A common mistake is to try to use the \1 syntax outside the regular expression; something allowed only with sed and vi.)

Table 3-14 below shows how a number of programs make the captured text available after a match. It shows how to access the text matched by the whole expression, and the text matched by a set of capturing parentheses.

Table 3-14. A Few Utilities and Their Access to Captured Text
Program                          Entire match                      First set of parentheses
GNU egrep                        N/A                               N/A
GNU Emacs                        (match-string 0)                  (match-string 1)
GNU awk                          substr($text, RSTART, RLENGTH)    N/A
MySQL                            N/A                               N/A
Perl (see Section 2.2.2)         $&                                $1
PHP                              $Matches[0]                       $Matches[1]
Python (see Section 3.2.3)       MatchObj.group(0)                 MatchObj.group(1)
Ruby                             $&                                $1
GNU sed                          & (in replacement only)           \1 (in replacement only)
Java (see Section 3.2.2)         MatcherObj.group()                MatcherObj.group(1)
Tcl                              set to user-selected variables via regexp command
VB.NET (see Section 3.2.2.2)     MatchObj.Groups(0)                MatchObj.Groups(1)
C#                               MatchObj.Groups[0]                MatchObj.Groups[1]
vi                               &                                 \1
(see Section 3.1.1.9 for version information.)
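Here is the Python row of the table in action, as a small sketch with sample text of my own. The regex uses the backreference \1 inside the pattern to find a doubled word, and the match object's group method to get at the text afterward:

```python
import re

# \1 inside the regex matches again whatever the first parentheses matched.
m = re.search(r"\b(\w+) \1\b", "it was the the best")

entire_match = m.group(0)   # text matched by the whole expression
first_parens = m.group(1)   # text matched by the first set of parentheses

print(entire_match)  # the the
print(first_parens)  # the
```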

3.4.5.2 Grouping-only parentheses: `(?:···)`

Now supported by many common flavors, grouping-only parentheses (?:···) don't capture, but just group regex components for alternation and the application of quantifiers. Grouping-only parentheses are not counted as part of $1, $2, etc. After a successful match of (1|one)(?:and|or)(2|two) , for example, $1 contains '1' or 'one', while $2 contains '2' or 'two'. Grouping-only parentheses are also called non-capturing parentheses.

Non-capturing parentheses are useful for a number of reasons. They can help make the use of a complex regex more clear in that the reader doesn't need to wonder if what's matched by what they group is accessed elsewhere by $1 or the like. They can also be more efficient — if the regex engine doesn't need to keep track of the text matched for capturing purposes, it can work faster and use less memory. (Efficiency is covered in detail in Chapter 6.)

Non-capturing parentheses are useful when building up a regex from parts. Recall the example from Section 2.3.6.7, where the variable $HostnameRegex holds a regex to match a hostname. Imagine now using that to pluck out the whitespace around a hostname, as in the Perl snippet m/(\s*)$HostnameRegex(\s*)/. After this, you might expect $1 to hold any leading whitespace, and $2 to hold trailing whitespace, but that's not the case: the trailing whitespace is actually in $4 because the definition of $HostnameRegex uses two sets of capturing parentheses:


$HostnameRegex = qr/[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)/i;

Were those sets of parentheses non-capturing instead, $HostnameRegex could be used without generating this surprise:


$HostnameRegex = qr/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i;
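The same surprise, and its cure, can be reproduced in any language that builds regexes from strings. This Python sketch reuses the two hostname expressions above; the variable names and the sample text are assumptions of mine:

```python
import re

capturing = r"[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)"
grouping_only = r"[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)"

text = "  www.example.com  "

# With capturing parentheses inside the hostname, the trailing whitespace
# lands in group 4; with grouping-only parentheses, it is group 2 as expected.
m1 = re.search(r"(\s*)" + capturing + r"(\s*)", text, re.IGNORECASE)
m2 = re.search(r"(\s*)" + grouping_only + r"(\s*)", text, re.IGNORECASE)

print(m1.re.groups, repr(m1.group(4)))  # 4 '  '
print(m2.re.groups, repr(m2.group(2)))  # 2 '  '
```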

Another way to avoid the surprise, although not available in Perl, is to use named capture, discussed next.

3.4.5.3 Named capture: `(?<Name>···)`

Python and .NET languages support captures to named locations. Python uses the syntax (?P<name>···) , while the .NET languages offer a syntax that I prefer, (?<name>···) . Here's an example:


 \b(?<Area>\d\d\d)-(?<Exch>\d\d\d)-(?<Num>\d\d\d\d)\b

This "fills the names" Area, Exch, and Num with the components of a phone number. The program can then refer to each matched substring through its name, for example, RegexObj.Groups("Area") in VB.NET and most other .NET languages, RegexObj.Groups["Area"] in C#, and RegexObj.group("Area") in Python. The result is clearer code.

Within the regular expression itself, the captured text is available via \k<Area> with .NET, and (?P=Area) in Python.

You can use the same name more than once within the same expression. For example, to match an area code that looks like '(###)' as well as '###-', you might use ···(?:\((?<Area>\d\d\d)\)|(?<Area>\d\d\d)-)···. When either matches, the three-digit code is saved to the name Area.
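For comparison, here is the phone-number example in Python's (?P<name>···) syntax, as a sketch with sample text of my own. Note that Python's re module, unlike .NET, does not let the same name appear twice in one expression, so the reuse just described doesn't carry over:

```python
import re

phone = re.compile(r"\b(?P<Area>\d\d\d)-(?P<Exch>\d\d\d)-(?P<Num>\d\d\d\d)\b")
m = phone.search("Call 800-555-1212 today")

area = m.group("Area")      # access by name . . .
also_area = m.group(1)      # . . . the number still works, too

# (?P=Word) refers back to a named capture from within the regex itself.
doubled = re.search(r"(?P<Word>\w+) (?P=Word)", "the the")

print(area, m.group("Exch"), m.group("Num"))  # 800 555 1212
print(doubled.group("Word"))                  # the
```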

3.4.5.4 Atomic grouping: `(?>···)`

Atomic grouping, (?>···), will be very easy to explain once the important details of how the regex engine carries out its work are understood (see Section 4.5.6). Here, I'll just say that once the parenthesized subexpression matches, what it matches is fixed (becomes atomic, unchangeable) for the rest of the match, unless it turns out that the whole set of atomic parentheses needs to be abandoned or revisited. A simple example helps to illustrate this indivisible, "atomic" nature of text matched by these parentheses.

The regex ¡.*! matches '¡Hola!', but that string is not matched if the .* is wrapped with atomic grouping, ¡(?>.*)!. In either case, the .* first internally matches as much as it can ('Hola!'), but in the first case, the ending ! forces the .* to give up some of what it had matched (the final '!') to complete the overall match. In the second case, the .* is inside atomic grouping (which never "gives up" anything once the matching leaves it), so nothing is left for the final !, and it can never match.

This example gives no hint to the usefulness of atomic grouping, but atomic grouping has important uses. In particular, it can help make matching more efficient (see Section 4.5.6.1.1), and can be used to finely control what can and can't be matched (see Section 6.6.6.1).
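You can watch the '¡Hola!' example fail even in a tool without (?>···). Python's re module gained atomic grouping only in version 3.11, so this sketch instead mimics it with a lookahead and a backreference, (?=(···))\1, a substitution that relies on the engine never revisiting a lookahead once it has succeeded. Treat it as an approximation of atomic grouping rather than the real thing:

```python
import re

text = "¡Hola!"

# Ordinary .* gives back the final '!' so the overall match can succeed.
plain = re.search(r"¡.*!", text)

# The lookahead captures everything .* can match; \1 then consumes exactly
# that text, and nothing is ever given back for the final '!'.
atomic_like = re.search(r"¡(?=(.*))\1!", text)

print(plain.group())         # ¡Hola!
print(atomic_like is None)   # True
```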

3.4.5.5 Alternation: ···|···|···

Alternation allows several subexpressions to be tested at a given point. Each subexpression is called an alternative. The | symbol is called various things, but or and bar seem popular. Some flavors use \| instead.

Alternation is a high-level construct (one that has very low precedence) in almost all regex flavors. This means that this and|or that has the same meaning as (this and)|(or that) , and not this (and|or) that , even though visually, the and|or looks like a unit.

Most flavors allow an empty alternative, like with (this|that|). The empty subexpression means to always match, so this example is logically comparable to (this|that)?.^[14] The POSIX standard disallows an empty alternative, as do lex and most versions of awk. I think it's useful for its notational convenience or clarity. As Larry Wall explained to me once, "It's like having a zero in your numbering system."

^[14] Actually, to be pedantic, (this|that|) is logically comparable to ((?:this|that)?) . With either of these, the subexpression within the capturing parentheses is always able to match (although it may match nothingness, but that's the whole point of the empty alternative or the question mark quantifier). On the other hand, with (this|that)? , it may be that the whole set of capturing parentheses does not match. The difference may seem minor, but some languages provide a way to find out if a certain set of capturing parentheses participated in the match, and with (this|that|) the answer is always yes, but with (this|that)? , the answer could be no.
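Both points, the low precedence of alternation and the pedantic difference noted in the footnote, show up clearly in a quick Python sketch (sample strings are my own). In Python, a set of capturing parentheses that did not participate in the match reports None rather than an empty string:

```python
import re

# Alternation binds loosely: this is (this and)|(or that).
low_precedence = bool(re.fullmatch(r"this and|or that", "this and"))
not_a_unit = bool(re.fullmatch(r"this and|or that", "this or that"))
grouped = bool(re.fullmatch(r"this (and|or) that", "this or that"))

# Empty alternative: the parentheses always participate, matching nothingness.
empty_alt = re.match(r"(this|that|)", "other").group(1)

# Optional parentheses: here they do not participate at all.
optional = re.match(r"(this|that)?", "other").group(1)

print(low_precedence, not_a_unit, grouped)  # True False True
print(repr(empty_alt), repr(optional))      # '' None
```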

3.4.5.6 Conditional: (?if then |else)

This construct allows you to express an if/then/else within a regex. The if part is a special kind of conditional expression discussed in a moment. Both the then and else parts are normal regex subexpressions. If the if part tests true, the then expression is attempted. Otherwise, the else part is attempted. (The else part may be omitted, and if so, the '|' before it may be omitted as well.)

The kinds of if tests available are flavor-dependent, but most implementations allow at least special references to capturing subexpressions and lookaround.

Using a special reference to capturing parentheses as the test. If the if part is a number in parentheses, it evaluates to "true" if that numbered set of capturing parentheses has participated in the match to this point. Here's an example that matches an <IMG> HTML tag, either alone, or surrounded by <A>···</A> link tags. It's shown in a free-spacing mode with comments, and the conditional construct (which in this example has no else part) is on the last line:


     ( <A\s+[^>]+> \s* )?  # Match leading <A> tag, if there.
     <IMG\s+[^>]+>         # Match <IMG> tag.
     (?(1)\s*</A>)         # Match a closing </A>, if we'd matched an <A> before.

The (1) in (?(1)···) tests whether the first set of capturing parentheses participated in the match. "Participating in the match" is very different from "actually matched some text," as a simple example illustrates...
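As a rough sketch, the <IMG> example can be tried in Python, whose re module supports the (?(1)···) form and a free-spacing mode via re.VERBOSE. The helper name whole_match and the sample tags are made up for this illustration.

```python
import re

IMG_TAG = re.compile(r"""
    ( <A\s+[^>]+> \s* )?   # Match leading <A> tag, if there.
    <IMG\s+[^>]+>          # Match <IMG> tag.
    (?(1)\s*</A>)          # Match a closing </A>, if we'd matched an <A> before.
""", re.VERBOSE)

def whole_match(text):
    """Return True if the entire text is matched by IMG_TAG."""
    return IMG_TAG.fullmatch(text) is not None

print(whole_match('<A HREF="page.html"><IMG SRC="pic.gif"></A>'))  # True
print(whole_match('<IMG SRC="pic.gif">'))                          # True
print(whole_match('<IMG SRC="pic.gif"></A>'))                      # False
print(whole_match('<A HREF="page.html"><IMG SRC="pic.gif">'))      # False
```

The last two cases fail because the closing </A> is required exactly when the opening <A> took part in the match.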

Consider these two approaches to matching a word optionally wrapped in "<···>": (<)?\w+(?(1)>) works, but (<?)\w+(?(1)>) does not. The only difference between them is the location of the first question mark. In the first (correct) approach, the question mark governs the capturing parentheses, so the parentheses (and all they contain) are optional. In the flawed second approach, the capturing parentheses are not optional — only the < matched within them is, so they "participate in the match" regardless of a '<' being matched or not. This means that the if part of (?(1)···) always tests "true."

If named capture (see Section 3.4.5.3) is supported, you can generally use the name in parentheses instead of the number.
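The two approaches to the "<···>" example can be compared with a small sketch, again assuming Python's re module. The named variant uses Python's (?P<name>···) capture syntax, and the group name open is invented for this example.

```python
import re

correct = re.compile(r'(<)?\w+(?(1)>)')   # the parentheses themselves are optional
flawed = re.compile(r'(<?)\w+(?(1)>)')    # the parentheses always participate
named = re.compile(r'(?P<open><)?\w+(?(open)>)')  # same as correct, by name

def accepts(regex, text):
    """Return True if regex matches the entire text."""
    return regex.fullmatch(text) is not None

samples = ['word', '<word>', 'word>', '<word']
correct_results = [accepts(correct, s) for s in samples]
flawed_results = [accepts(flawed, s) for s in samples]
named_results = [accepts(named, s) for s in samples]

print(correct_results)  # [True, True, False, False]
print(flawed_results)   # [False, True, True, False]
print(named_results)    # [True, True, False, False]
```

The flawed version demands a closing '>' every time, so it rejects a bare word and accepts an unbalanced word>.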

Using lookaround as the test. A full lookaround construct, such as (?=···) and (?<=···), can be used as the if test. If the lookaround matches, it evaluates to "true," and so the then part is attempted. Otherwise, the else part is attempted. A somewhat contrived example that illustrates this is (?(?<=NUM:)\d+|\w+), which attempts \d+ at positions just after NUM:, but attempts \w+ at other positions. The lookbehind (?<=NUM:) is the conditional test.

Other tests for the conditional. Perl adds an interesting twist to this conditional construct by allowing arbitrary Perl code to be executed as the test. The return value of the code is the test's value, indicating whether the then or else part should be attempted. This is covered in Chapter 7, in Section 7.8.

3.4.5.7 Greedy quantifiers: *, +, ?, {num,num}

The quantifiers (star, plus, question mark, and intervals, the metacharacters that affect the quantity of what they govern) have already been discussed extensively. However, note that in some tools, \+ and \? are used instead of + and ?. Also, with some older tools, quantifiers can't be applied to a backreference or to a set of parentheses.

3.4.5.8 Intervals: {min,max} or \{min,max\}

Intervals can be considered a "counting quantifier" because you specify exactly the minimum number of matches you wish to require, and the maximum number of matches you wish to allow. If only a single number is given (such as in [a-z]{3} or [a-z]\{3\}, depending upon the flavor), it matches exactly that many of the item. This example is the same as [a-z][a-z][a-z] (although one may be more or less efficient than the other; see Section 6.4.6.9).
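A quick sketch of the counting behavior, assuming Python's re module (which uses the unescaped {min,max} form). The sample strings are arbitrary.

```python
import re

interval = re.compile(r'[a-z]{3}')
spelled_out = re.compile(r'[a-z][a-z][a-z]')
ranged = re.compile(r'[a-z]{2,4}')   # at least two, at most four

samples = ['abc', 'ab', 'abcd', 'a1c']
interval_results = [bool(interval.fullmatch(s)) for s in samples]
spelled_results = [bool(spelled_out.fullmatch(s)) for s in samples]
ranged_results = [bool(ranged.fullmatch(s)) for s in ('a', 'ab', 'abcd', 'abcde')]

print(interval_results)  # [True, False, False, False]
print(spelled_results)   # [True, False, False, False]
print(ranged_results)    # [False, True, True, False]
```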

One caution: don't think you can use something like X{0,0} to mean "there must not be an X here." X{0,0} is a meaningless expression because it means "no requirement to match X, and, in fact, don't even bother trying to match any. Period." It's the same as if the whole X{0,0} wasn't there at all. If there is an X present, it could still be matched by something later in the expression, so your intended purpose is defeated.^[15] Use negative lookahead for a true "must not be here" construct.

^[15] In theory, what I say about {0,0} is correct. In practice, what actually happens is even worse — it's almost random! In many programs (including GNU awk, GNU grep, and older versions of Perl) it seems that {0,0} means the same as *, while in many others (including most versions of sed that I've seen, and some versions of grep) it means the same as ?. Crazy!
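The contrast with negative lookahead can be sketched as follows. This assumes Python's re module, where X{0,0} behaves as the theory says as far as I can tell; given how much the tools vary, that is worth verifying on your own version. The subject strings are invented for the illustration.

```python
import re

# X{0,0} requires nothing, so the X that follows "foo" is no obstacle.
no_op = re.search(r'fooX{0,0}', 'fooXbar')

# Negative lookahead really does insist that no X follows "foo".
blocked = re.search(r'foo(?!X)', 'fooXbar')
allowed = re.search(r'foo(?!X)', 'fooYbar')

print(no_op.group() if no_op else None)      # foo
print(blocked)                               # None
print(allowed.group() if allowed else None)  # foo
```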

3.4.5.9 Lazy quantifiers: *?, +?, ??, {num,num}?

Some tools offer the rather ungainly looking *?, +?, ??, and {min,max}?. These are the lazy versions of the quantifiers. (They are also called minimal matching, non-greedy, and ungreedy.) Quantifiers are normally "greedy," and try to match as much as possible. Conversely, these non-greedy versions match as little as possible, just the bare minimum needed to satisfy the match. The difference has far-reaching implications, covered in detail in the next chapter (see Section 4.4.2).
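A small sketch shows the difference, assuming Python's re module (one of the tools that offers lazy quantifiers). The sample text is made up.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as it can, so the match runs to the last '>'.
greedy = re.search(r'<.+>', text).group()

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r'<.+?>', text).group()

all_lazy = re.findall(r'<.+?>', text)

print(greedy)    # <b>bold</b> and <i>italic</i>
print(lazy)      # <b>
print(all_lazy)  # ['<b>', '</b>', '<i>', '</i>']
```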

3.4.5.10 Possessive quantifiers: *+, ++, ?+, {num,num}+

Currently supported only by java.util.regex, but likely to gain popularity, possessive quantifiers are like normally greedy quantifiers, but once they match something, they never "give it up." Like the atomic grouping to which they're related, understanding possessive quantifiers is much easier once the underlying match process is understood (which is the subject of the next chapter).

In one sense, possessive quantifiers are just syntactic sugar, as they can be mimicked with atomic grouping. Something like .++ has exactly the same result as (?>.+), although a smart implementation can optimize possessive quantifiers more than atomic grouping (see Section 6.4.6.7).
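The equivalence can be sketched in Python, with one loud assumption: the re module accepts possessive quantifiers and atomic grouping only from Python 3.11 onward, so the sketch checks the version and does nothing on older interpreters. The helper name full and the sample strings are invented for the illustration.

```python
import re
import sys

# Assumption: possessive quantifiers and (?>...) need Python 3.11 or later.
HAS_POSSESSIVE = sys.version_info >= (3, 11)

def full(pattern, text):
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

results = {}
if HAS_POSSESSIVE:
    # Greedy \w+ gives back the final '1' so that \d can match it.
    results['greedy'] = full(r'\w+\d', 'abc1')
    # Possessive \w++ never gives anything up, so \d has nothing left.
    results['possessive'] = full(r'\w++\d', 'abc1')
    # Atomic grouping behaves the same way as the possessive form.
    results['atomic'] = full(r'(?>\w+)\d', 'abc1')
    # .++ and (?>.+) match the same text.
    results['same_text'] = (re.match(r'.++', 'abc').group()
                            == re.match(r'(?>.+)', 'abc').group())

print(results)
```

On a supporting interpreter, only the greedy version matches 'abc1', and the possessive and atomic forms agree with each other.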
