9.3 Core Object Details

Now that we've seen an overview, let's look at the details. First, we'll look at how to create a Regex object, followed by how to apply it to a string to yield a Match object, and how to work with that object and its Group objects.

In practice, you can often avoid having to explicitly create a Regex object, but it's good to be comfortable with them, so during this look at the core objects, I'll always explicitly create them. We'll see later what shortcuts .NET provides to make things more convenient.

In the lists that follow, I don't mention little-used methods that are merely inherited from the Object class.

9.3.1 Creating `Regex` Objects

The constructor for creating a Regex object is uncomplicated. It accepts either one argument (the regex, as a string), or two arguments (the regex and a set of options). Here's a one-argument example:

     Dim StripTrailWS = new Regex("\s+$") ' for removing trailing whitespace

This just creates the Regex object, preparing it for use; no matching has been done to this point.

Here's a two-argument example:

     Dim GetSubject = new Regex("^subject: (.*)", RegexOptions.IgnoreCase)

That passes one of the RegexOptions flags, but you can pass multiple flags if they're OR'd together, as with:

     Dim GetSubject = new Regex("^subject: (.*)", _
        RegexOptions.IgnoreCase OR RegexOptions.Multiline)

9.3.1.1 Catching exceptions

An ArgumentException error is thrown if a regex with an invalid combination of metacharacters is given. You don't normally need to catch this exception when using regular expressions you know to work, but it's important to catch it if using regular expressions from "outside" the program (e.g., entered by the user, or read from a configuration file). Here's an example:

     Dim R As Regex
     Try
        R = New Regex(SearchRegex)
     Catch e As ArgumentException
        Console.WriteLine("*ERROR* bad regex: " & e.ToString)
        Exit Sub
     End Try

Of course, depending on the application, you may want to do something other than writing to the console upon detection of the exception.

9.3.1.2 `Regex` options

The following option flags are allowed when creating a Regex object:

RegexOptions.IgnoreCase: This option indicates that when the regex is applied, it should be done in a case-insensitive manner (see Section 3.3.3.1).

RegexOptions.IgnorePatternWhitespace

This option indicates that the regex should be parsed in a free-spacing and comments mode (see Section 3.3.3.2). If you use raw #··· comments, be sure to include a newline at the end of each logical line, or the first raw comment "comments out" the entire rest of the regex.

In VB.NET, this can be achieved with chr(10), as in this example:


     Dim R as Regex = New Regex( _
        "# Match a floating-point number ...        " & chr(10) & _
        " \d+(?:\.\d*)? # with a leading digit...   " & chr(10) & _
        " |             # or ...                    " & chr(10) & _
        " \.\d+         # with a leading decimal point", _
        RegexOptions.IgnorePatternWhitespace)

That's cumbersome; in VB.NET, (?#···) comments can be more convenient:


     Dim R as Regex = New Regex( _
        "(?# Match a floating-point number ...           )" & _
        " \d+(?:\.\d*)? (?# with a leading digit...      )" & _
        " |             (?# or ...                       )" & _
        " \.\d+         (?# with a leading decimal point )", _
        RegexOptions.IgnorePatternWhitespace)

RegexOptions.Multiline: This option indicates that the regex should be applied in an enhanced line-anchor mode (see Section 3.3.3.5). This allows ^ and $ to match at embedded newlines in addition to the normal beginning and end of string, respectively.

RegexOptions.Singleline: This option indicates that the regex should be applied in a dot-matches-all mode (see Section 3.3.3.3). This allows dot to match any character, rather than any character except a newline.

RegexOptions.ExplicitCapture

This option indicates that even raw (···) parentheses, which are normally capturing parentheses, should not capture, but rather behave like (?:···) grouping-only non-capturing parentheses. This leaves named-capture (?<name>···) parentheses as the only type of capturing parentheses.

If you're using named capture and also want non-capturing parentheses for grouping, it makes sense to use normal (···) parentheses and this option, as it keeps the regex more visually clear.

RegexOptions.RightToLeft: This option sets the regex to a right-to-left match mode (see Section 9.1.1.5).

RegexOptions.Compiled

This option indicates that the regex should be compiled, on the fly, to a highly-optimized format, which generally leads to much faster matching. This comes at the expense of increased compile time the first time it's used, and increased memory use for the duration of the program's execution.

If a regex is going to be used just once, or sparingly, it makes little sense to use RegexOptions.Compiled, since its extra memory remains used even when a Regex object created with it has been disposed of. But if a regex is used in a time-critical area, it's probably advantageous to use this flag.

You can see an example in Section 6.3.3, where this option cuts the time for one benchmark about in half. Also, see the discussion about compiling to an assembly (see Section 9.6.1).

RegexOptions.ECMAScript: This option indicates that the regex should be parsed in a way that's compatible with ECMAScript (see Section 9.1.1.7). If you don't know what ECMAScript is, or don't need compatibility with it, you can safely ignore this option.

RegexOptions.None: This is a "no extra options" value that's useful for initializing a RegexOptions variable, should you need to. As you decide options are required, they can be OR'd into it.
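As a minimal sketch of starting from RegexOptions.None and OR-ing flags in as they're found to be needed, consider the following. The BuildOptions helper and its two Boolean parameters are hypothetical names invented for this illustration; they are not part of .NET.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OptionsDemo
    ' Hypothetical helper: builds a RegexOptions value from two settings.
    Function BuildOptions(ByVal WantIgnoreCase As Boolean, _
                          ByVal WantMultiline As Boolean) As RegexOptions
        Dim Opts As RegexOptions = RegexOptions.None   ' start with no options
        If WantIgnoreCase Then Opts = Opts Or RegexOptions.IgnoreCase
        If WantMultiline Then Opts = Opts Or RegexOptions.Multiline
        Return Opts
    End Function

    Sub Main()
        Dim GetSubject As New Regex("^subject: (.*)", BuildOptions(True, True))
        Console.WriteLine(GetSubject.Options.ToString())   ' IgnoreCase, Multiline
    End Sub
End Module
```

Because the flags are distinct bits, the order in which they're OR'd in doesn't matter.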

9.3.2 Using `Regex` Objects

Just having a regex object is not useful unless you apply it, so the following methods swing it into action.

RegexObj.IsMatch(target)   Return type: Boolean
RegexObj.IsMatch(target, offset)

The IsMatch method applies the object's regex to the target string, returning a simple Boolean indicating whether the attempt is successful. Here's an example:

     Dim R as Regex = New Regex("^\s*$")
        .
        .
        .
     If R.IsMatch(Line) Then
             ' Line is blank . . .
               .
               .
               .
     End If

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is first attempted.

RegexObj.Match(target)   Return type: Match object
RegexObj.Match(target, offset)
RegexObj.Match(target, offset, maxlength)

The Match method applies the object's regex to the target string, returning a Match object. With this Match object, you can query information about the results of the match (whether it was successful, the text matched, etc.), and initiate the "next" match of the same regex in the string. Details of the Match object follow, starting in Section 9.3.3.

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is first attempted.

If you provide a maxlength argument, it puts matching into a special mode where the maxlength characters starting offset characters into the target string are taken as the entire target string, as far as the regex engine is concerned. It pretends that characters outside the range don't even exist, so, for example, ^ can match at offset characters into the original target string, and $ can match at maxlength characters after that. It also means that lookaround can't "see" the characters outside of that range. This is all very different from when only offset is provided, as that merely influences where the transmission begins applying the regex — the engine still "sees" the entire target string.

This table shows examples that illustrate the meaning of offset and maxlength. Each column gives the result when RegexObj is built with the regex shown in the column heading:

| Method call | \d\d | ^\d\d | ^\d\d$ |
|---|---|---|---|
| RegexObj.Match("May 16, 1998") | match '16' | fail | fail |
| RegexObj.Match("May 16, 1998", 9) | match '99' | fail | fail |
| RegexObj.Match("May 16, 1998", 9, 2) | match '99' | match '99' | match '99' |
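The results in the table can be reproduced with a short program along these lines. It's only a sketch: the Describe helper is a hypothetical name introduced here to format each result the way the table does.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchRangeDemo
    ' Hypothetical helper: summarizes a Match object the way the table does.
    Function Describe(ByVal M As Match) As String
        If M.Success Then
            Return "match '" & M.Value & "'"
        End If
        Return "fail"
    End Function

    Sub Main()
        Dim Target As String = "May 16, 1998"
        Dim Patterns As String() = {"\d\d", "^\d\d", "^\d\d$"}
        Dim Pattern As String
        For Each Pattern In Patterns
            Dim R As New Regex(Pattern)
            Console.WriteLine(Pattern)
            ' No offset: the whole string is searched from the start.
            Console.WriteLine("  (target):        " & Describe(R.Match(Target)))
            ' Offset only: searching starts at index 9, but ^ still means
            ' the start of the whole string.
            Console.WriteLine("  (target, 9):     " & Describe(R.Match(Target, 9)))
            ' Offset and maxlength: the engine sees only the two characters "99".
            Console.WriteLine("  (target, 9, 2):  " & Describe(R.Match(Target, 9, 2)))
        Next
    End Sub
End Module
```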

RegexObj.Matches(target)   Return type: MatchCollection
RegexObj.Matches(target, offset)

The Matches method is similar to the Match method, except Matches returns a collection of Match objects representing all the matches in the target, rather than just one Match object representing the first match. The returned object is a MatchCollection.

For example, after this initialization:

     Dim R as New Regex("\w+")
     Dim Target as String = "a few words"

this code snippet

     Dim BunchOfMatches as MatchCollection = R.Matches(Target)
     Dim I as Integer
     For I = 0 to BunchOfMatches.Count - 1
          Dim MatchObj as Match = BunchOfMatches.Item(I)
          Console.WriteLine("Match: " & MatchObj.Value)
     Next

produces this output:

     Match: a
     Match: few
     Match: words

The following example, which produces the same output, shows that you can dispense with the MatchCollection variable altogether:

     Dim MatchObj as Match
     For Each MatchObj in R.Matches(Target)
          Console.WriteLine("Match: " & MatchObj.Value)
     Next

Finally, as a comparison, here's how you can accomplish the same thing another way, with the Match (rather than Matches) method:

     Dim MatchObj as Match = R.Match(Target)
     While MatchObj.Success
          Console.WriteLine("Match: " & MatchObj.Value)
          MatchObj = MatchObj.NextMatch()
     End While

RegexObj.Replace(target, replacement)   Return type: String
RegexObj.Replace(target, replacement, count)
RegexObj.Replace(target, replacement, count, offset)

The Replace method does a search-and-replace on the target string, returning a (possibly changed) copy of it. It applies the Regex object's regular expression, but instead of returning a Match object, it replaces the matched text. What the matched text is replaced with depends on the replacement argument. The replacement argument is overloaded; it can be either a string or a MatchEvaluator delegate. If replacement is a string, it is interpreted according to the sidebar below. For example,

     Dim R_CapWord as New Regex("\b[A-Z]\w*")
        .
        .
        .
     Text = R_CapWord.Replace(Text, "<B>$&</B>")

wraps each capitalized word with <B>···</B>.

If count is given, only that number of replacements is done. (The default is to do all replacements.) To replace just the first match found, for example, use a count of one. If you know that there will be only one match, using an explicit count of one is more efficient than letting the Replace mechanics go through the work of trying to find additional matches. A count of -1 means "replace all" (which, again, is the default when no count is given).

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is applied. Bypassed characters are copied through to the result unchanged.

For example, this canonicalizes all whitespace (that is, reduces sequences of whitespace down to a single space):

     Dim AnyWS as New Regex("\s+")
        .
        .
        .
     Target = AnyWS.Replace(Target, " ")

This converts 'some•••••random•••••spacing' to 'some•random•spacing'. The following does the same, except it leaves any leading whitespace alone:

     Dim AnyWS as New Regex("\s+")
     Dim LeadingWS as New Regex("^\s+")
        .
        .
        .
     Target = AnyWS.Replace(Target, " ", -1, LeadingWS.Match(Target).Length)

This converts '•••••some•••random•••••spacing' to '•••••some•random•spacing'. It uses the length of what's matched by LeadingWS as the offset (as the count of characters to skip) when doing the search and replace. It uses a convenient feature of the Match object, returned here by LeadingWS.Match(Target), that its Length property may be used even if the match fails. (Upon failure, the Length property has a value of zero, which is exactly what we need to apply AnyWS to the entire target.)
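To make the effect of count and offset concrete, here is a small sketch that applies the same regex three ways. The sample string and the '#' replacement are arbitrary choices made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceCountDemo
    Sub Main()
        Dim Digits As New Regex("\d+")
        Dim Target As String = "a1b22c333"

        ' No count: every match is replaced.
        Console.WriteLine(Digits.Replace(Target, "#"))         ' a#b#c#

        ' Count of 1: only the first match is replaced.
        Console.WriteLine(Digits.Replace(Target, "#", 1))      ' a#b22c333

        ' Count of 1, offset of 2: "a1" is bypassed and copied through
        ' unchanged, so the first match considered is "22".
        Console.WriteLine(Digits.Replace(Target, "#", 1, 2))   ' a1b#c333
    End Sub
End Module
```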

Special Per-Match Replacement Sequences

Both the Regex.Replace method and the Match.Result method accept a "replacement" string that's interpreted specially. Within it, the following sequences are replaced by appropriate text from the match:

| Sequence | Replaced by |
|---|---|
| $& | text matched by the regex (also available as $0) |
| $1, $2, . . . | text matched by the corresponding set of capturing parentheses |
| ${name} | text matched by the corresponding named capture |
| $` | text of the target string before the match location |
| $' | text of the target string after the match location |
| $$ | a single '$' character |
| $_ | a copy of the entire original target string |
| $+ | (see text below) |

The $+ sequence is fairly useless as currently implemented. Its origins lie with Perl's useful $+ variable, which references the highest-numbered set of capturing parentheses that actually participated in the match. (There's an example of it in use in Section 5.3.2.) This .NET replacement-string $+, though, merely references the highest-numbered set of capturing parentheses in the regex. It's particularly useless in light of the capturing-parentheses renumbering that's automatically done when named captures are used (see Section 9.1.1.1).

Any uses of '$' in the replacement string in situations other than those described in the table are left unmolested.
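The sequences in this sidebar can be tried out with Match.Result, as in the following sketch. The named group day and the sample string are assumptions made for the illustration; the square brackets are there only to make leading and trailing spaces visible.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplacementSequencesDemo
    Sub Main()
        Dim R As New Regex("(?<day>\d+)")
        Dim M As Match = R.Match("May 16, 1998")

        Console.WriteLine(M.Result("[$&]"))       ' [16]
        Console.WriteLine(M.Result("[$1]"))       ' [16]
        Console.WriteLine(M.Result("[${day}]"))   ' [16]
        Console.WriteLine(M.Result("[$`]"))       ' [May ]
        Console.WriteLine(M.Result("[$']"))       ' [, 1998]
        Console.WriteLine(M.Result("[$_]"))       ' [May 16, 1998]
        Console.WriteLine(M.Result("[$$]"))       ' [$]

        ' There is no group 9 in this regex, so $9 is left as literal text.
        Console.WriteLine(M.Result("[$9]"))       ' [$9]
    End Sub
End Module
```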

9.3.2.1 Using a replacement delegate

The replacement argument isn't limited to a simple string. It can be a delegate (basically, a pointer to a function). The delegate function is called after each match to generate the text to use as the replacement. Since the function can do any processing you want, it's an extremely powerful replacement mechanism.

The delegate is of the type MatchEvaluator, and is called once per match. The function it refers to should accept the Match object for the match, do whatever processing you like, and return the text to be used as the replacement.

As examples for comparison, the following two code snippets produce identical results:


     Target = R.Replace(Target, "<<$&>>")
     ..........................................................
     Function MatchFunc(ByVal M as Match) as String
        return M.Result("<<$&>>")
     End Function
     Dim Evaluator as MatchEvaluator = New MatchEvaluator(AddressOf MatchFunc)
        .
        .
        .
     Target = R.Replace(Target, Evaluator)

Both snippets highlight each match by wrapping the matched text in <<···>>. The advantage of using a delegate is that you can include code as complex as you like in computing the replacement. Here's an example that converts Celsius temperatures to Fahrenheit:

     Function MatchFunc(ByVal M as Match) as String
        'Get numeric temperature from $1, then convert to Fahrenheit
        Dim Celsius as Double = Double.Parse(M.Groups(1).Value)
        Dim Fahrenheit as Double = Celsius * 9/5 + 32
        Return Fahrenheit & "F" 'Append an "F", and return
     End Function

     Dim Evaluator as MatchEvaluator = New MatchEvaluator(AddressOf MatchFunc)
        .
        .
        .
     Dim R_Temp as Regex = New Regex("(\d+)C\b", RegexOptions.IgnoreCase)
     Target = R_Temp.Replace(Target, Evaluator)

Given 'Temp is 37C.' in Target, it replaces it with 'Temp is 98.6F.'.

RegexObj.Split(target)   Return type: array of String
RegexObj.Split(target, count)
RegexObj.Split(target, count, offset)

The Split method applies the object's regex to the target string, returning an array of the strings separated by the matches. Here's a trivial example:

     Dim R as New Regex("\.")
     Dim Parts as String() = R.Split("209.204.146.22")

Here, R.Split returns an array of four strings ('209', '204', '146', and '22') that are separated by the three matches of \. in the text.

If a count is provided, no more than count strings will be returned (unless capturing parentheses are used—more on that in a bit). If count is not provided, Split returns as many strings as are separated by matches. Providing a count may mean that the regex stops being applied before the final match, and if so, the last string has the unsplit remainder of the line:

     Dim R as New Regex("\.")
     Dim Parts as String() = R.Split("209.204.146.22", 2)

This time, Parts receives two strings, '209' and '204.146.22'.

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is attempted. The bypassed text becomes part of the first string returned (unless RegexOptions.RightToLeft has been specified, in which case the bypassed text becomes part of the last string returned).
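As a sketch of count and offset working together, the following bypasses the first four characters of the same sample address, so the first dot is never considered as a split point. The particular count and offset values are arbitrary choices for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitOffsetDemo
    Sub Main()
        Dim R As New Regex("\.")

        ' At most 2 strings are returned. The first 4 characters ("209.")
        ' are bypassed, so they become part of the first string.
        Dim Parts As String() = R.Split("209.204.146.22", 2, 4)

        Console.WriteLine(String.Join(" | ", Parts))   ' 209.204 | 146.22
    End Sub
End Module
```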

9.3.2.2 Using `Split` with capturing parentheses

If capturing parentheses of any type are used, additional entries for captured text are usually inserted into the array. (We'll see in what cases they might not be inserted in a bit.) As a simple example, to separate a string like '2002-12-31' or '04/12/2003' into its component parts, you might split on [-/], like:

     Dim R as New Regex("[-/]")
     Dim Parts as String() = R.Split(MyDate)

This returns a list of the three numbers (as strings). However, adding capturing parentheses and using ([-/]) as the regex causes Split to return five strings: if MyDate contains '2002-12-31', the strings are '2002', '-', '12', '-', and '31'. The extra '-' elements are from the per-capture $1.

If there are multiple sets of capturing parentheses, they are inserted in their numerical ordering (which means that all named captures come after all unnamed captures; see Section 9.1.1.1).

Split works consistently with capturing parentheses so long as all sets of capturing parentheses actually participate in the match. However, there's a bug with the current version of .NET such that if there is a set of capturing parentheses that doesn't participate in the match, it and all higher-numbered sets don't add an element to the returned list.

As a somewhat contrived example, consider wanting to split on a comma with optional whitespace around it, yet have the whitespace added to the list of elements returned. You might use (\s+)?,(\s+)? for this. When applied with Split to 'this•,••that', four strings are returned, 'this', '•', '••', and 'that'. However, when applied to 'this,•that', the inability of the first set of capturing parentheses to match inhibits the element for it (and for all sets that follow) from being added to the list, so only two strings are returned, 'this' and 'that'. The inability to know beforehand exactly how many strings will be returned per match is a major shortcoming of the current implementation.

In this particular example, you could get around this problem simply by using (\s*),(\s*) (in which both groups are guaranteed to participate in any overall match). However, more complex expressions are not easily rewritten.
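The following sketch shows both the date example and the (\s*),(\s*) workaround. Since both groups always participate, each match contributes exactly two extra elements, even when one of them is an empty string. The Show helper is a hypothetical name introduced here so that empty and whitespace-only elements are visible in the output.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitCaptureDemo
    ' Hypothetical helper: wraps each element in brackets.
    Function Show(ByVal Parts As String()) As String
        Return "[" & String.Join("][", Parts) & "]"
    End Function

    Sub Main()
        ' Capturing parentheses add the separators to the result.
        Dim DateSep As New Regex("([-/])")
        Console.WriteLine(Show(DateSep.Split("2002-12-31")))   ' [2002][-][12][-][31]

        ' Both groups always participate, so four strings come back each time.
        Dim Comma As New Regex("(\s*),(\s*)")
        Console.WriteLine(Show(Comma.Split("this ,  that")))   ' [this][ ][  ][that]
        Console.WriteLine(Show(Comma.Split("this, that")))     ' [this][][ ][that]
    End Sub
End Module
```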

RegexObj.GetGroupNames()
RegexObj.GetGroupNumbers()
RegexObj.GroupNameFromNumber(number)
RegexObj.GroupNumberFromName(name)

These methods allow you to query information about the names (both numeric and, if named capture is used, by name) of capturing groups in the regex. They don't refer to any particular match, but merely to the names and numbers of groups that exist in the regex. The sidebar below shows an example of their use.

RegexObj.ToString()
RegexObj.RightToLeft
RegexObj.Options

These allow you to query information about the Regex object itself (as opposed to applying the regex object to a string). The ToString() method returns the pattern string originally passed to the regex constructor. The RightToLeft property returns a Boolean indicating whether RegexOptions.RightToLeft was specified with the regex. The Options property returns the RegexOptions that are associated with the regex. The following table shows the values of the individual options, which are added together when reported:

| Value | Option |
|---|---|
| 0 | None |
| 1 | IgnoreCase |
| 2 | Multiline |
| 4 | ExplicitCapture |
| 8 | Compiled |
| 16 | Singleline |
| 32 | IgnorePatternWhitespace |
| 64 | RightToLeft |
| 256 | ECMAScript |

The missing 128 value is for a Microsoft debugging option not available in the final product.

The sidebar below shows an example of these methods in use.

9.3.3 Using `Match` Objects

Match objects are created by a Regex's Match method, the Regex.Match static function (discussed in a bit), and a Match object's own NextMatch method. A Match object encapsulates all information relating to a single application of a regex. It has the following properties and methods:

MatchObj.Success
This returns a Boolean indicating whether the match was successful. If not, the object is a copy of the static Match.Empty object.

Displaying Information about a Regex Object

This displays what's known about the Regex object in the variable R:

     'Display information known about the Regex object in the variable R
     Console.WriteLine("Regex is: " & R.ToString())
     Console.WriteLine("Options are: " & R.Options)
     If R.RightToLeft
        Console.WriteLine("Is Right-To-Left: True")
     Else
        Console.WriteLine("Is Right-To-Left: False")
     End If

     Dim S as String
     For Each S in R.GetGroupNames()
        Console.WriteLine("Name """ & S & """ is Num #" & _
                          R.GroupNumberFromName(S))
     Next
     Console.WriteLine("---")
     Dim I as Integer
     For Each I in R.GetGroupNumbers()
        Console.WriteLine("Num #" & I & " is Name """ & _
                          R.GroupNameFromNumber(I) & """")
     Next

Run twice, once with each of the two Regex objects created with

     New Regex("^(\w+)://([^/]+)(/\S*)")
     New Regex("^(?<proto>\w+)://(?<host>[^/]+)(?<page>/\S*)", RegexOptions.Compiled)

the following output is produced (with one regex cut off to fit the page):

     Regex is: ^(\w+)://([^/]+)(/\S*)
     Options are: 0
     Is Right-To-Left: False
     Name "0" is Num #0
     Name "1" is Num #1
     Name "2" is Num #2
     Name "3" is Num #3
     ---
     Num #0 is Name "0"
     Num #1 is Name "1"
     Num #2 is Name "2"
     Num #3 is Name "3"

     Regex is: ^(?<proto>\w+)://(?<host> ···
     Options are: 8
     Is Right-To-Left: False
     Name "0" is Num #0
     Name "proto" is Num #1
     Name "host" is Num #2
     Name "page" is Num #3
     ---
     Num #0 is Name "0"
     Num #1 is Name "proto"
     Num #2 is Name "host"
     Num #3 is Name "page"

MatchObj.Value
MatchObj.ToString()
These return copies of the text actually matched.

MatchObj.Length
This returns the length of the text actually matched.

MatchObj.Index
This returns an integer indicating the position in the target text where the match was found. It's a zero-based index, so it's the number of characters from the start (left) of the string to the start (left) of the matched text. This is true even if RegexOptions.RightToLeft had been used to create the regex that generated this Match object.

MatchObj.Groups
This property is a GroupCollection object, in which a number of Group objects are encapsulated. It is a normal collection object, with Count and Item properties, but it's most commonly accessed by indexing into it, fetching an individual Group object. For example, M.Groups(3) is the Group object related to the third set of capturing parentheses, and M.Groups("HostName") is the Group object for the "HostName" named capture (e.g., after the use of (?<HostName>···) in a regex).

Note that C# requires M.Groups[3] and M.Groups["HostName"] instead.

The zeroth group represents the entire match itself. MatchObj.Groups(0).Value, for example, is the same as MatchObj.Value.

MatchObj.NextMatch()
The NextMatch() method re-invokes the original regex to find the next match in the original string, returning a new Match object.

MatchObj.Result(string)
Special sequences in the given string are processed as shown in the sidebar in Section 9.3.2, returning the resulting text. Here's a simple example:

     Dim M as Match = Regex.Match(SomeString, "\w+")

     Console.WriteLine(M.Result("The first word is '$&'"))

You can use this to get a copy of the text to the left and right of the match, with


     M.Result("$`") 'This is the text to the left of the match
     M.Result("$'") 'This is the text to the right of the match

During debugging, it may be helpful to display something along the lines of:


     M.Result("[$`<$&>$']")

Given a Match object created by applying \d+ to the string 'May 16, 1998', it returns '[May <16>, 1998]', clearly showing the exact match.

MatchObj.Synchronized()
This returns a new Match object that's identical to the current one, except that it's safe for multi-threaded use.

MatchObj.Captures
The Captures property is not used often, but is discussed starting in Section 9.6.3.
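To tie several of these Match members together, here is a brief sketch that applies the named-capture URL regex from the sidebar to a sample address. The address itself is made up for the illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchMembersDemo
    Sub Main()
        Dim R As New Regex("^(?<proto>\w+)://(?<host>[^/]+)(?<page>/\S*)")
        Dim M As Match = R.Match("http://www.example.com/index.html")

        If M.Success Then
            Console.WriteLine(M.Value)                       ' http://www.example.com/index.html
            Console.WriteLine(M.Index & ", " & M.Length)     ' 0, 33
            Console.WriteLine(M.Groups("proto").Value)       ' http
            Console.WriteLine(M.Groups(2).Value)             ' www.example.com
            Console.WriteLine(M.Groups("page").Index)        ' 22
            ' The zeroth group is the entire match.
            Console.WriteLine(M.Groups(0).Value = M.Value)   ' True
        End If
    End Sub
End Module
```

Groups can be fetched by name or by number interchangeably; here the host group is number 2, matching the numbering shown in the sidebar's output.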

9.3.4 Using `Group` Objects

A Group object contains the match information for one set of capturing parentheses (or, for the zeroth group, for the entire match). It has the following properties and methods:

GroupObj.Success
This returns a Boolean indicating whether the group participated in the match. Not all groups necessarily "participate" in a successful overall match. For example, if (this)|(that) matches successfully, one of the sets of parentheses is guaranteed to have participated, while the other is guaranteed to have not. See the footnote in Section 3.5.5 for another example.

GroupObj.Value
GroupObj.ToString()
These both return a copy of the text captured by this group. If the match hadn't been successful, these return an empty string.

GroupObj.Length
This returns the length of the text captured by this group. If the match hadn't been successful, it returns zero.

GroupObj.Index
This returns an integer indicating where in the target text the match was found. The return value is a zero-based index, so it's the number of characters from the start (left) of the string to the start (left) of the captured text. (This is true even if RegexOptions.RightToLeft had been used to create the regex that generated this Match object.)

GroupObj.Captures
The Group object also has a Captures property discussed starting in Section 9.6.3.
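The (this)|(that) example mentioned under Success can be checked with a sketch like the following, which shows that a group that didn't participate reports an empty Value and a zero Length. The target string is an arbitrary choice for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupSuccessDemo
    Sub Main()
        Dim R As New Regex("(this)|(that)")
        Dim M As Match = R.Match("that")

        ' The first set of parentheses did not participate in the match.
        Console.WriteLine(M.Groups(1).Success)   ' False
        Console.WriteLine(M.Groups(1).Length)    ' 0

        ' The second set did.
        Console.WriteLine(M.Groups(2).Success)   ' True
        Console.WriteLine(M.Groups(2).Value)     ' that
    End Sub
End Module
```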
