9.1 .NET's Regex Flavor

.NET has been built with a Traditional NFA regex engine, so all the important NFA-related lessons from Chapters 4, 5, and 6 are applicable. Table 9-1 below summarizes .NET's regex flavor, most of which is discussed in Chapter 3.

Certain aspects of the flavor can be modified by match modes (see Section 3.3.3), turned on via option flags to the various functions and constructors that accept regular expressions, or in some cases, turned on and off within the regex itself via (?mods-mods) and (?mods-mods : ···) constructs. The modes are listed in Table 9-2 in Section 9.1.1.
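As a small illustration of the three ways a mode can be applied, here is a sketch in C#. The class and method names (ModeDemo, ViaOption, and so on) are made up for this example; only Regex.IsMatch and RegexOptions.IgnoreCase come from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModeDemo
{
    // Mode supplied as an option flag to the function.
    public static bool ViaOption(string text) =>
        Regex.IsMatch(text, "^hello$", RegexOptions.IgnoreCase);

    // The same mode turned on within the regex itself via (?i).
    public static bool ViaInlineModifier(string text) =>
        Regex.IsMatch(text, "(?i)^hello$");

    // Mode-modified span: only "world" is matched case-insensitively.
    public static bool ViaSpan(string text) =>
        Regex.IsMatch(text, "^Hello (?i:world)$");

    public static void Main()
    {
        Console.WriteLine(ViaOption("HELLO"));         // True
        Console.WriteLine(ViaInlineModifier("HeLLo")); // True
        Console.WriteLine(ViaSpan("Hello WORLD"));     // True
        Console.WriteLine(ViaSpan("HELLO world"));     // False
    }
}
```

The span form is handy when only part of an expression should be affected, since the rest of the regex keeps its default behavior.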

A regex flavor can't be described with just a simple table or two, so here are some notes to augment Table 9-1:

In the table, "raw" escapes like \w are shown. These can be used directly in VB.NET string literals ("\w"), and in C# verbatim strings (@"\w"). In languages without regex-friendly string literals, such as C++, each backslash in the regex requires two in the string literal ("\\w"). See "Strings as Regular Expressions" (see Section 3.3.1).
\b is valid as a backspace only within a character class (outside, it matches a word boundary).
\x## allows exactly two hexadecimal digits, e.g., \xFCber matches 'über'.
\u#### allows exactly four hexadecimal digits, e.g., \u00FCber matches 'über', and \u20AC matches '€'.
\w, \d, and \s (and their uppercase counterparts) normally match the full range of appropriate Unicode characters, but change to an ASCII-only mode with the RegexOptions.ECMAScript option (see Section 9.1.1.7).
In its default mode, \w matches the Unicode properties \p{Ll}, \p{Lu}, \p{Lt}, \p{Lo}, \p{Nd}, and \p{Pc}. Note that this does not include the \p{Lm} property. (See Table 3-9 for the property list.)
In its default mode, \s matches [•\f\n\r\t\v\x85\p{Z}]. U+0085 is the Unicode NEXT LINE control character, and \p{Z} matches Unicode "separator" characters (see Section 3.4.2.4).
\p{···} and \P{···} support most standard Unicode properties and blocks. Unicode scripts are not supported. Only the short property names like \p{Lu} are supported; long names like \p{Lowercase_Letter} are not. (See the tables in Section 3.4.2.4.) Note, however, that the special composite property \p{L&} is not supported, nor, for some reason, are the \p{Pi} and \p{Pf} properties. Single-letter properties do require the braces (that is, the \pL shorthand for \p{L} is not supported).
Also not supported are the special properties \p{All}, \p{Assigned}, and \p{Unassigned}. Instead, you might use (?s:.), \P{Cn}, and \p{Cn}, respectively.
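Several of the notes above can be checked with a few lines of C#. This is only a sketch: the EscapeDemo class and its method names are invented for illustration, and the subject strings are written with C# "\u00FC" escapes so that the example does not depend on the source file's encoding.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // A verbatim string and a doubled-backslash string hold the same regex text.
    public static bool SameLiteral() => @"\w" == "\\w";

    // \x## takes exactly two hex digits; \u#### takes exactly four.
    public static bool HexEscape() => Regex.IsMatch("\u00FCber", @"^\xFCber$");
    public static bool UnicodeEscape() => Regex.IsMatch("\u00FCber", @"^\u00FCber$");

    // Inside a character class, \b is a backspace (here "\b" is the C# backspace character).
    public static bool BackspaceInClass() => Regex.IsMatch("\b", @"^[\b]$");

    // Short property names with braces are accepted.
    public static bool ShortProperty() => Regex.IsMatch("A", @"^\p{Lu}$");

    // The brace-less \pL shorthand is rejected when the regex is parsed.
    public static bool BracelessRejected()
    {
        try { new Regex(@"\pL"); return false; }
        catch (ArgumentException) { return true; }
    }

    public static void Main()
    {
        Console.WriteLine(SameLiteral());
        Console.WriteLine(HexEscape());
        Console.WriteLine(UnicodeEscape());
        Console.WriteLine(BackspaceInClass());
        Console.WriteLine(ShortProperty());
        Console.WriteLine(BracelessRejected());
    }
}
```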

Table 9-1. Overview of .NET's Regular-Expression Flavor

Character Shorthands

see Section 3.4.1 (c) \a \b \e \f \n \r \t \v \ octal \x## \u#### \c char

Character Classes and Class-Like Constructs

see Section 3.4.2 Classes: [···] [^···]
see Section 3.4.2.2 Any character except newline: dot (sometimes any character at all)
see Section 3.4.2.4 (c) Class shorthands: \w \d \s \W \D \S
see Section 3.4.2.4 (c) Unicode properties and blocks: \p{ Prop } \P{ Prop }

Anchors and other Zero-Width Tests

see Section 3.4.3.1 Start of line/string: ^ \A
see Section 3.4.3.2 End of line/string: $ \z \Z
see Section 3.4.3.3 End of previous match: \G
see Section 3.4.3.5 Word boundary: \b \B
see Section 3.4.3.6 Lookaround: (?=···) (?!···) (?<=···) (?<!···)

Comments and Mode Modifiers

see Section 3.4.4.1 Mode modifiers: (? mods - mods ) Modifiers allowed: x s m i n (see Table 9-2)
see Section 3.4.4.2 Mode-modified spans: (? mods - mods : ···)
see Section 3.4.4.3 Comments: (?#···)

Grouping, Capturing, Conditional, and Control

see Section 3.4.5.1 Capturing parentheses: (···) \1 \2 . . .
see Section 9.6.2 Balanced grouping: (?< name - name >···)
see Section 3.4.5.3 Named capture, backreference: (?< name >···) \k< name >
see Section 3.4.5.2 Grouping-only parentheses: (?:···)
see Section 3.4.5.4 Atomic grouping: (?>···)
see Section 3.4.5.5 Alternation: |
see Section 3.4.5.10 Greedy quantifiers: * + ? {n} {n,} {x,y}
see Section 3.4.5.9 Lazy quantifiers: *? +? ?? {n}? {n,}? {x,y}?
see Section 3.4.5.6 Conditional: (? if then | else ) - "if" can be lookaround, ( num ), or ( name )

(c) - may be used within a character class

This package understands Unicode blocks as of Unicode Version 3.1. Additions and modifications since Version 3.1 are not known (see Section 3.3.2.2).
Block names require the 'Is' prefix (see Table 3-10), and only the raw form unadorned with spaces and underscores may be used. For example, \p{Is_Greek_Extended} and \p{Is Greek Extended} are not allowed; \p{IsGreekExtended} is required.
\G matches the end of the previous match, despite the documentation's claim that it matches at the beginning of the current match (see Section 3.4.3.3).
Both lookahead and lookbehind can employ arbitrary regular expressions. As of this writing, the .NET regex engine is the only one that I know of that allows lookbehind with a subexpression that can match an arbitrary amount of text (see Section 3.4.3.6).
The RegexOptions.ExplicitCapture option (also available via the (?n) mode modifier) turns off capturing for raw (···) parentheses. Explicitly-named captures like (?<num>\d+) still work (see Section 3.4.5.3). If you use named captures, this option allows you to use the visually more pleasing (···) for grouping instead of (?:···).
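The last two notes lend themselves to a quick check. The following C# sketch uses made-up names (CaptureDemo, GroupCount, Cents) and made-up sample text; it counts the groups a Match reports with and without explicit capture, and uses a lookbehind whose subexpression can match a variable amount of text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Number of groups reported by a match (group 0 is the overall match).
    public static int GroupCount(string pattern, RegexOptions options) =>
        new Regex(pattern, options).Match("a42").Groups.Count;

    // Lookbehind containing \d+, which can match any number of digits.
    public static string Cents(string text) =>
        Regex.Match(text, @"(?<=\$\d+\.)\d+").Value;

    public static void Main()
    {
        const string pattern = @"(\w)(?<num>\d+)";

        // Groups 0, 1, and "num".
        Console.WriteLine(GroupCount(pattern, RegexOptions.None));            // 3
        // With explicit capture, the plain (\w) no longer captures.
        Console.WriteLine(GroupCount(pattern, RegexOptions.ExplicitCapture)); // 2
        // The (?n) mode modifier has the same effect.
        Console.WriteLine(GroupCount("(?n)" + pattern, RegexOptions.None));   // 2

        Console.WriteLine(Cents("cost: $123.45")); // 45
    }
}
```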

Table 9-2. The .NET Match and Regex Modes
RegexOptions option (? mode ) Description
.Singleline s Causes dot to match any character (see Section 3.3.3.3)
.Multiline m Expands where ^ and $ can match (see Section 3.3.3.5)
.IgnorePatternWhitespace x Sets free-spacing and comment mode (see Section 2.3.6.4)
.IgnoreCase i Turns on case-insensitive matching
.ExplicitCapture n Turns capturing off for (···), so only (?< name >···) capture
.ECMAScript Restricts \w , \s , and \d to match ASCII characters only, and more (see Section 9.1.1.7)
.RightToLeft The transmission applies the regex normally, but in the opposite direction (starting at the end of the string and moving toward the start). Unfortunately, buggy (see Section 9.1.1.5)
.Compiled Spends extra time up front optimizing the regex so it matches more quickly when applied (see Section 9.1.1.4)

9.1.1 Additional Comments on the Flavor

A few issues merit longer discussion than a bullet point allows.

9.1.1.1 Named capture

.NET supports named capture (see Section 3.4.5.3) through the (?< name >···) or (?' name '···) syntax. Both syntaxes mean the same thing and you can use either freely, but I prefer the syntax with <···>, as I believe it will be more widely used.

You can backreference the text matched by a named capture within the regex with \k< name > or \k' name '.

After the match (once a Match object has been generated; an overview of .NET's object model follows, starting in Section 9.2.3), the text matched within the named capture is available via the Match object's Groups( name ) property. (C# requires Groups[ name ] instead.)

Within a replacement string (see Section 9.3.2.1), the results of named capture are available via a ${ name } sequence.

In order to allow all groups to be accessed numerically, which may be useful at times, named-capture groups are also given numbers. They receive their numbers after all the non-named ones receive theirs:

The text matched by the \d+ part of this example is available via both Groups("Num") and Groups(3). It's still just one group, but with two names.
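To make the numbering concrete, here is a C# sketch. The regex (\w)(?<Num>\d+)(\s+) is an illustrative pattern chosen to fit the description above (a named \d+ group between two unnamed groups), and the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedCaptureDemo
{
    // Illustrative pattern: two unnamed groups and one group named Num.
    static readonly Regex R = new Regex(@"(\w)(?<Num>\d+)(\s+)");

    public static string ByName(string text) => R.Match(text).Groups["Num"].Value;
    public static string ByNumber(string text, int n) => R.Match(text).Groups[n].Value;
    public static int NumberOfNum() => R.GroupNumberFromName("Num");

    // ${Num} in a replacement string refers to the named capture.
    public static string Bracketed(string text) => R.Replace(text, "[${Num}]");

    public static void Main()
    {
        const string text = "x123 y";
        Console.WriteLine(ByNumber(text, 1));  // x      (first unnamed group)
        Console.WriteLine(ByNumber(text, 2));  // a single space (second unnamed group)
        Console.WriteLine(ByNumber(text, 3));  // 123    (the named group, numbered last)
        Console.WriteLine(ByName(text));       // 123
        Console.WriteLine(NumberOfNum());      // 3
        Console.WriteLine(Bracketed(text));    // [123]y
    }
}
```

In C#, Groups["Num"] and Groups[3] are the indexer forms of the Groups("Num") and Groups(3) notation used in the text.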

9.1.1.2 An unfortunate consequence

It's not recommended to mix normal capturing parentheses and named captures, but if you do, the way the capturing groups are assigned numbers has important consequences that you should be aware of. The ordering becomes important when capturing parentheses are used with Split (see Section 9.3.2.1), and for the meaning of '$+' in a replacement string (see Section 9.3.2.1). Both currently have additional, unrelated problems that make them more or less broken anyway (although Microsoft is working on a fix for the 2004 release of .NET).

9.1.1.3 Conditional tests

The if part of an (? if then | else ) conditional (see Section 3.4.5.6) can be any type of lookaround, or a captured group number or captured group name in parentheses. Plain text (or a plain regex) in this location is automatically treated as positive lookahead (that is, it has an implicit (?=···) wrapped around it). This can lead to an ambiguity: for instance, the (Num) of ···(?(Num) then | else )··· is turned into (?=Num) (lookahead for 'Num') if there is no (?<Num>···) named capture elsewhere in the regex. If there is such a named capture, whether it was successful is the result of the if.

I recommend not relying on "auto-lookaheadification." Use the explicit (?=···) to make your intentions clearer to the human reader, and also to avert a surprise if some future version of the regex engine adds additional if syntax.
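The two unambiguous forms of the if part can be sketched as follows in C#. The patterns, sample strings, and names (ConditionalDemo, ByGroupName, ByLookahead) are illustrative assumptions, not examples from elsewhere in this chapter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // "if" is a group name: the "then" branch applies only when Num participated.
    public static bool ByGroupName(string text) =>
        Regex.IsMatch(text, @"^(?<Num>\d+)?(?(Num)-items|none)$");

    // "if" is an explicit lookahead, as recommended above.
    public static bool ByLookahead(string text) =>
        Regex.IsMatch(text, @"^(?(?=\d)\d{3}|[a-z]+)$");

    public static void Main()
    {
        Console.WriteLine(ByGroupName("12-items")); // True:  Num matched, "then" branch used
        Console.WriteLine(ByGroupName("none"));     // True:  Num did not match, "else" branch used
        Console.WriteLine(ByGroupName("-items"));   // False: Num did not match, "else" branch fails

        Console.WriteLine(ByLookahead("123"));      // True
        Console.WriteLine(ByLookahead("abc"));      // True
        Console.WriteLine(ByLookahead("12a"));      // False
    }
}
```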

9.1.1.4 "Compiled" expressions

In earlier chapters, I use the word "compile" to describe the pre-application work any regex system must do to check that a regular expression is valid, and to convert it to an internal form suitable for its actual application to text. For this, .NET regex terminology uses the word "parsing." It uses two versions of "compile" to refer to optimizations of that parsing phase.

Here are the details, in order of increasing optimization:

Parsing. The first time a regex is seen during the run of a program, it must be checked and converted into an internal form suitable for actual application by the regex engine. This process is referred to as "compile" elsewhere in this book (see Section 6.4.3).
On-the-Fly Compilation. RegexOptions.Compiled is one of the options available when building a regex. Using it tells the regex engine to go further than simply converting the regex to the default internal form, and to compile it to low-level MSIL (Microsoft Intermediate Language) code, which is then amenable to being optimized into even faster native machine code by the JIT ("Just-In-Time" compiler) when the regex is actually applied. It takes more time and memory to do this, but it allows the resulting regular expression to work faster. These tradeoffs are discussed later in this section.
Pre-Compiled Regexes. A Regex object (or objects) can be encapsulated into an assembly written to disk in a DLL (a Dynamic-Link Library, i.e., a shared library). This makes it available for general use in other programs. This is called "compiling the assembly." For more, see "Regex Assemblies" (see Section 9.6.1).

When considering on-the-fly compilation with RegexOptions.Compiled, there are important tradeoffs among initial startup time, ongoing memory usage, and regex match speed:

Metric          Without RegexOptions.Compiled    With RegexOptions.Compiled
Startup time    Faster                           Slower (by 60x)
Memory usage    Low                              High (about 5-15k each)
Match speed     Not as fast                      Up to 10x faster

The initial regex parsing (the default kind, without RegexOptions.Compiled) that must be done the first time each regex is seen in the program is relatively fast. Even on my clunky old 550MHz NT box, I benchmark about 1,500 complex compilations/second. When RegexOptions.Compiled is used, that goes down to about 25/second, and memory usage increases by about 10k bytes per regex. More importantly, that memory remains used for the life of the program; there's no way to unload it.

It definitely makes sense to use RegexOptions.Compiled in time-sensitive areas where processing speed is important, particularly for expressions that work with a lot of text. On the other hand, it makes little sense to use it on simple regexes that aren't applied to a lot of text. It's less clear which is best for the multitude of situations in between; you'll just have to weigh the benefits and decide on a case-by-case basis.
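Requesting on-the-fly compilation is just a matter of passing the option when the Regex object is built, as this C# sketch shows. The date-like pattern, the sample text, and the CompiledDemo names are assumptions made for illustration. The point is that both objects find the same matches; only the up-front cost and the match speed differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledDemo
{
    const string Pattern = @"\b\d{4}-\d{2}-\d{2}\b";

    // Default: parsed to the internal form only.
    static readonly Regex Interpreted = new Regex(Pattern);

    // On-the-fly compilation: slower to build, intended to match faster.
    // Built once and kept, since the compilation cost is paid per object.
    static readonly Regex Compiled = new Regex(Pattern, RegexOptions.Compiled);

    public static int CountInterpreted(string text) => Interpreted.Matches(text).Count;
    public static int CountCompiled(string text) => Compiled.Matches(text).Count;

    public static void Main()
    {
        const string text = "from 2001-01-02 until 2002-03-04";
        Console.WriteLine(CountInterpreted(text)); // 2
        Console.WriteLine(CountCompiled(text));    // 2
    }
}
```

Whether the compiled version actually pays off depends on how much text it is applied to, so it is worth timing both on representative data before committing to one.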

In some cases, it may make sense to encapsulate an application's compiled expressions into its own DLL, as pre-compiled Regex objects. This uses less memory in the final program (the loading of the whole regex compilation package is bypassed), and allows faster loading (since they're compiled when the DLL is built, you don't have to wait for them to be compiled when you use them). A nice byproduct of this is that the expressions are made available to other programs that might wish to use them, so it's a great way to make a personal regex library. See "Creating Your Own Regex Library With an Assembly" in Section 9.6.1.

9.1.1.5 Right-to-left matching

The concept of "backwards" matching (matching from right to left in a string, rather than from left to right) has long intrigued regex developers. Perhaps the biggest issue facing the developer is to define exactly what "right-to-left matching" really means. Is the regex somehow reversed? Is the target text flipped? Or is it just that the regex is applied normally from each position within the target string, with the difference being that the transmission starts at the end of the string instead of at the beginning, and moves backwards with each bump-along rather than forward?

Just to think about it in concrete terms for a moment, consider applying \d+ to the string '123•and•456'. We know a normal application matches '123', and instinct somehow tells us that a right-to-left application should match '456'. However, if the regex engine uses the semantics described at the end of the previous paragraph, where the only difference is the starting point of the transmission and the direction of the bump-along, the results may be surprising. In these semantics, the regex engine works normally ("looking" to the right from where it's started), so the first attempt of \d+, at the very end of the string (just after the '6'), doesn't match. The second attempt, one position to the left (just before the '6'), does match, as the bump-along has placed it "looking at" the '6', which certainly matches \d+. So, we have a final match of only the final '6'.

One of .NET's regex options is RegexOptions.RightToLeft. What are its semantics? The answer is: "that's a good question." The semantics are not documented, and my own tests indicate only that I can't pin them down. In many cases, such as the '123•and•456' example, it acts surprisingly intuitively (it matches '456'). However, it sometimes fails to find any match, and at other times finds a match that seems to make no sense when compared with other results.
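The '123•and•456' case is easy to try for yourself. This C# sketch (with an invented RightToLeftDemo class) simply reports the first match found in each direction. Given the caveats above, treat the right-to-left result as an observation about this one input, not as a statement of the option's general semantics.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // Returns the first match of \d+ under the given options.
    public static string FirstNumber(string text, RegexOptions options) =>
        Regex.Match(text, @"\d+", options).Value;

    public static void Main()
    {
        const string text = "123 and 456";
        Console.WriteLine(FirstNumber(text, RegexOptions.None));        // 123
        Console.WriteLine(FirstNumber(text, RegexOptions.RightToLeft)); // 456
    }
}
```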

If you have a need for it, you may find that RegexOptions.RightToLeft seems to work exactly as you wish, but in the end, you use it at your own risk. Microsoft is working on pinning down the semantics (to be released in the 2004 or 2005 version of .NET), and so the semantics that you happen to see now may change.

9.1.1.6 Backslash-digit ambiguities

When a backslash is followed by a number, it's either an octal escape or a backreference. Which of the two it's interpreted as, and how, depends on whether the RegexOptions.ECMAScript option has been specified. If you don't want to have to understand the subtle differences, you can always use \k< num > for a backreference, or start the octal escape with a zero (e.g., \012 ) to ensure it's taken as one. These work consistently, regardless of whether RegexOptions.ECMAScript is used.

If RegexOptions.ECMAScript is not used, single-digit escapes from \1 through \9 are always backreferences, and an escaped number beginning with zero is always an octal escape (e.g., \012 matches an ASCII linefeed character). If it's not either of these cases, the number is taken as a backreference if it would "make sense" to do so (i.e., if there are at least that many capturing parentheses in the regex). Otherwise, so long as it has a value between \000 and \377, it's taken as an octal escape. For example, \12 is taken as a backreference if there are at least 12 sets of capturing parentheses, or an octal escape otherwise.
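The rules for the default (non-ECMAScript) mode can be exercised with a short C# sketch. The class name, method names, and test strings are made up for the example; "\n" in the C# subject strings is an ASCII linefeed, which is octal 012.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackslashDigitDemo
{
    // \1 through \9 are always backreferences.
    public static bool SingleDigit() => Regex.IsMatch("aa", @"^(a)\1$");

    // \k<num> is an unambiguous backreference.
    public static bool ExplicitBackref() => Regex.IsMatch("aa", @"^(a)\k<1>$");

    // A leading zero always means an octal escape; \012 is a linefeed.
    public static bool LeadingZeroOctal() => Regex.IsMatch("\n", @"\012");

    // With only one capturing group, \12 cannot be a backreference,
    // so it is taken as the octal escape for a linefeed.
    public static bool TwoDigitsAsOctal() => Regex.IsMatch("a\n", @"(a)\12");

    public static void Main()
    {
        Console.WriteLine(SingleDigit());      // True
        Console.WriteLine(ExplicitBackref());  // True
        Console.WriteLine(LeadingZeroOctal()); // True
        Console.WriteLine(TwoDigitsAsOctal()); // True
    }
}
```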

The semantics for when RegexOptions.ECMAScript is specified are described in the next section.

9.1.1.7 ECMAScript mode

ECMAScript is a standardized version of JavaScript^[2] with its own semantics of how regular expressions should be parsed and applied. A .NET regex attempts to mimic those semantics if created with the RegexOptions.ECMAScript option. If you don't know what ECMAScript is, or don't need compatibility with it, you can safely ignore this section.

^[2] ECMA stands for "European Computer Manufacturers Association," a group formed in 1960 to standardize aspects of the growing field of computers.

When RegexOptions.ECMAScript is in effect, the following apply:

Only the following may be combined with RegexOptions.ECMAScript:

     RegexOptions.IgnoreCase

     RegexOptions.Multiline

     RegexOptions.Compiled

\w, \d, and \s (and \W, \D, and \S ) change to ASCII-only matching.
When a backslash-digit sequence is found in a regex, the ambiguity between backreference and octal escape changes to favor a backreference, even if that means having to ignore some of the trailing digits. For example, with (···)\10 , the \10 is taken as a backreference to the first group, followed by a literal '0'.
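The three points above can be sketched in C# as follows. The EcmaDemo class, its method names, and the sample strings are illustrative assumptions. U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit outside ASCII, and "\b" in the C# subject string is a backspace (octal 10).

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaDemo
{
    // \d is Unicode-aware by default and ASCII-only in ECMAScript mode.
    public static bool DigitMatches(RegexOptions options) =>
        Regex.IsMatch("\u0663", @"\d", options);

    // With one group, ECMAScript mode reads \10 as \1 followed by a literal '0'.
    public static bool BackrefThenZero() =>
        Regex.IsMatch("aa0", @"^(a)\10$", RegexOptions.ECMAScript);

    // Without ECMAScript mode, the same \10 is the octal escape for a backspace.
    public static bool OctalByDefault() =>
        Regex.IsMatch("a\b", @"^(a)\10$");

    // Combining ECMAScript with an option outside the permitted set is rejected.
    public static bool SinglelineRejected()
    {
        try
        {
            new Regex("a", RegexOptions.ECMAScript | RegexOptions.Singleline);
            return false;
        }
        catch (ArgumentException) { return true; }
    }

    public static void Main()
    {
        Console.WriteLine(DigitMatches(RegexOptions.None));       // True
        Console.WriteLine(DigitMatches(RegexOptions.ECMAScript)); // False
        Console.WriteLine(BackrefThenZero());                     // True
        Console.WriteLine(OctalByDefault());                      // True
        Console.WriteLine(SinglelineRejected());                  // True
    }
}
```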
