< Free Open Study > |
9.1 .NET's Regex Flavor
.NET has been built with a Traditional NFA regex engine, so all the important NFA-related lessons from Chapters 4, 5, and 6 are applicable. Table 9-1 below summarizes .NET's regex flavor, most of which is discussed in Chapter 3. Certain aspects of the flavor can be modified by match modes (see Section 3.3.3), turned on via option flags to the various functions and constructors that accept regular expressions, or in some cases, turned on and off within the regex itself via (?mods-mods) and (?mods-mods:···) constructs. The modes are listed in Table 9-2 in Section 9.1.1. A regex flavor can't be described with just a simple table or two, so here are some notes to augment Table 9-1:
9.1.1 Additional Comments on the Flavor
A few issues merit longer discussion than a bullet point allows.
9.1.1.1 Named capture
.NET supports named capture (see Section 3.4.5.3), through the (?<name>···) or (?'name'···) syntax. Both syntaxes mean the same thing and you can use either freely, but I prefer the syntax with <···>, as I believe it will be more widely used. You can backreference the text matched by a named capture within the regex with \k<name> or \k'name'.
After the match (once a Match object has been generated; an overview of .NET's object model follows, starting in Section 9.2.3), the text matched within the named capture is available via the Match object's Groups(name) property. (C# requires Groups[name] instead.) Within a replacement string (see Section 9.3.2.1), the results of named capture are available via a ${name} sequence.
In order to allow all groups to be accessed numerically, which may be useful at times, named-capture groups are also given numbers. They receive their numbers after all the non-named ones receive theirs:
The text matched by the \d+ part of this example is available via both Groups("Num") and Groups(3). It's still just one group, but with two names.
9.1.1.2 An unfortunate consequence
It's not recommended to mix normal capturing parentheses and named captures, but if you do, the way the capturing groups are assigned numbers has important consequences that you should be aware of. The ordering becomes important when capturing parentheses are used with Split (see Section 9.3.2.1), and for the meaning of '$+' in a replacement string (see Section 9.3.2.1). Both currently have additional, unrelated problems that make them more or less broken anyway (although Microsoft is working on a fix for the 2004 release of .NET).
9.1.1.3 Conditional tests
The if part of an (?if then|else) conditional (see Section 3.4.5.6) can be any type of lookaround, or a captured group number or captured group name in parentheses. Plain text (or a plain regex) in this location is automatically treated as positive lookahead (that is, it has an implicit (?=···) wrapped around it). This can lead to an ambiguity: for instance, the (Num) of ···(?(Num)then|else)··· is turned into (?=Num) (lookahead for 'Num') if there is no (?<Num>···) named capture elsewhere in the regex. If there is such a named capture, whether it was successful is the result of the if.
I recommend not relying on "auto-lookaheadification." Use the explicit (?=···) to make your intentions clearer to the human reader, and also to avert a surprise if some future version of the regex engine adds additional if syntax.
9.1.1.4 "Compiled" expressions
In earlier chapters, I use the word "compile" to describe the pre-application work any regex system must do to check that a regular expression is valid, and to convert it to an internal form suitable for its actual application to text. For this, .NET regex terminology uses the word "parsing." It uses two versions of "compile" to refer to optimizations of that parsing phase.
Here are the details, in order of increasing optimization:
When considering on-the-fly compilation with RegexOptions.Compiled, there are important tradeoffs among initial startup time, ongoing memory usage, and regex match speed:
The initial regex parsing (the default kind, without RegexOptions.Compiled) that must be done the first time each regex is seen in the program is relatively fast. Even on my clunky old 550MHz NT box, I benchmark about 1,500 complex compilations/second. When RegexOptions.Compiled is used, that goes down to about 25/second, and memory usage increases by about 10k bytes per regex. More importantly, that memory remains used for the life of the program; there's no way to unload it.
It definitely makes sense to use RegexOptions.Compiled in time-sensitive areas where processing speed is important, particularly for expressions that work with a lot of text. On the other hand, it makes little sense to use it on simple regexes that aren't applied to a lot of text. It's less clear which is best for the multitude of situations in between; you'll just have to weigh the benefits and decide on a case-by-case basis.
In some cases, it may make sense to encapsulate an application's compiled expressions into its own DLL, as pre-compiled Regex objects. This uses less memory in the final program (the loading of the whole regex compilation package is bypassed), and allows faster loading (since they're compiled when the DLL is built, you don't have to wait for them to be compiled when you use them). A nice byproduct of this is that the expressions are made available to other programs that might wish to use them, so it's a great way to make a personal regex library. See "Creating Your Own Regex Library With an Assembly" in Section 9.6.1.
9.1.1.5 Right-to-left matching
The concept of "backwards" matching (matching from right to left in a string, rather than from left to right) has long intrigued regex developers. Perhaps the biggest issue facing the developer is to define exactly what "right-to-left matching" really means. Is the regex somehow reversed? Is the target text flipped?
Or is it just that the regex is applied normally from each position within the target string, with the difference being that the transmission starts at the end of the string instead of at the beginning, and moves backwards with each bump-along rather than forward?
Just to think about it in concrete terms for a moment, consider applying \d+ to the string '123•and•456'. We know a normal application matches '123', and instinct somehow tells us that a right-to-left application should match '456'. However, if the regex engine uses the semantics described at the end of the previous paragraph, where the only difference is the starting point of the transmission and the direction of the bump-along, the results may be surprising. In these semantics, the regex engine works normally ("looking" to the right from where it's started), so the first attempt of \d+, at the very end of the string (after the '6'), doesn't match. The second attempt, one position to the left (just before the '6'), does match, as the bump-along has placed it "looking at" the '6', which certainly matches \d+. So, we have a final match of only the final '6'.
One of .NET's regex options is RegexOptions.RightToLeft. What are its semantics? The answer is: "that's a good question." The semantics are not documented, and my own tests indicate only that I can't pin them down. In many cases, such as the '123•and•456' example, it acts surprisingly intuitively (it matches '456'). However, it sometimes fails to find any match, and at other times finds a match that seems to make no sense when compared with other results.
If you have a need for it, you may find that RegexOptions.RightToLeft seems to work exactly as you wish, but in the end, you use it at your own risk. Microsoft is working on pinning down the semantics (to be released in the 2004 or 2005 version of .NET), and so the semantics that you happen to see now may change.
9.1.1.6 Backslash-digit ambiguities
When a backslash is followed by a number, it's either an octal escape or a backreference.
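Stepping back for a moment to the \d+ example from Section 9.1.1.5: the "start at the end and bump along leftward" semantics can be simulated in a few lines. The following is a Python sketch of that one hypothetical semantics only. It is not .NET code and makes no claim about what RegexOptions.RightToLeft actually does; the function name is made up for illustration, and a plain space stands in for the '•' marker.

```python
import re

def match_bumping_leftward(pattern, text):
    """Apply the regex normally (looking rightward) at each start position,
    beginning at the end of the text and bumping along to the left.
    Returns (matched_text, start_index) for the first success, else None."""
    regex = re.compile(pattern)
    for start in range(len(text), -1, -1):
        # match() with a position makes one anchored attempt at that position.
        m = regex.match(text, start)
        if m:
            return m.group(0), start
    return None

text = "123 and 456"
print(re.search(r"\d+", text).group(0))       # normal application: 123
print(match_bumping_leftward(r"\d+", text))   # ('6', 10)
```

The attempt at the very end of the string fails, and the next attempt, one position to the left, succeeds immediately with just the final '6', which is the surprising result described in that section.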
Which of the two it's interpreted as, and how, depends on whether the RegexOptions.ECMAScript option has been specified. If you don't want to have to understand the subtle differences, you can always use \k<num> for a backreference, or start the octal escape with a zero (e.g., \08) to ensure it's taken as one. These work consistently, regardless of RegexOptions.ECMAScript being used or not.
If RegexOptions.ECMAScript is not used, single-digit escapes from \1 through \9 are always backreferences, and an escaped number beginning with zero is always an octal escape (e.g., \012 matches an ASCII linefeed character). If it's not either of these cases, the number is taken as a backreference if it would "make sense" to do so (i.e., if there are at least that many capturing parentheses in the regex). Otherwise, so long as it has a value between \000 and \377, it's taken as an octal escape. For example, \12 is taken as a backreference if there are at least 12 sets of capturing parentheses, or an octal escape otherwise.
The semantics for when RegexOptions.ECMAScript is specified are described in the next section.
9.1.1.7 ECMAScript mode
ECMAScript is a standardized version of JavaScript[2] with its own semantics of how regular expressions should be parsed and applied. A .NET regex attempts to mimic those semantics if created with the RegexOptions.ECMAScript option. If you don't know what ECMAScript is, or don't need compatibility with it, you can safely ignore this section.
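Before turning to the ECMAScript rules, the non-ECMAScript rules from Section 9.1.1.6 can be written down as a small decision function. This is a Python sketch that encodes only the rules as stated in that section, not the .NET parser itself; the name classify_escape and its return labels are hypothetical, and any case the text does not cover is reported as "unspecified" rather than guessed at.

```python
def classify_escape(digits, group_count):
    """Classify the digits that follow a backslash, per the stated rules
    for when RegexOptions.ECMAScript is NOT in effect.

    digits      -- the digit string after the backslash, e.g. "12"
    group_count -- number of capturing parentheses in the regex
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more ASCII digits")
    if digits[0] == "0":
        return "octal"            # a leading zero always means octal
    if len(digits) == 1:
        return "backreference"    # \1 through \9 are always backreferences
    if int(digits) <= group_count:
        return "backreference"    # enough groups for it to "make sense"
    if all(c in "01234567" for c in digits) and int(digits, 8) <= 0o377:
        return "octal"            # value within \000 through \377
    return "unspecified"          # not covered by the rules in the text

print(classify_escape("12", 12))   # backreference
print(classify_escape("12", 3))    # octal
print(classify_escape("012", 20))  # octal
```

The \12 case from that section shows up directly: with at least 12 capturing groups it is a backreference, and with fewer it falls through to the octal test.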
When RegexOptions.ECMAScript is in effect, the following apply: