9.2 Using .NET Regular Expressions

.NET regular expressions are powerful, clean, and provided through a complete and easy-to-use class interface. But as wonderful a job as Microsoft did building the package, the documentation is just the opposite: it's horrifically bad. It's woefully incomplete, poorly written, disorganized, and sometimes even wrong. It took me quite a while to figure the package out, so it's my hope that the presentation in this chapter makes the use of .NET regular expressions clear for you.

9.2.1 Regex Quickstart

You can get quite a bit of use out of the .NET regex package without even knowing the details of its regex class model. Knowing the details lets you get more information more efficiently, but the following are examples of how to do simple operations without explicitly creating any classes. These are just examples; all the details follow shortly.

Any program that uses the regex library must have the line

     Imports System.Text.RegularExpressions

at the beginning of the file (see Section 9.2.2), so these examples assume that's there.

The following examples all work with the text in the String variable TestStr. As with all examples in this chapter, names I've chosen are in italic.

9.2.1.1 Quickstart: Checking a string for match

This example simply checks to see whether a regex matches a string:

     If Regex.IsMatch(TestStr, "^\s*$")
       Console.WriteLine("line is empty")
     Else
       Console.WriteLine("line is not empty")
     End If

This example uses a match option:

     If Regex.IsMatch(TestStr, "^subject:", RegexOptions.IgnoreCase)
       Console.WriteLine("line is a subject line")
     Else
       Console.WriteLine("line is not a subject line")
     End If

9.2.1.2 Quickstart: Matching and getting the text matched

This example identifies the text actually matched by the regex. If there's no match, TheNum is set to an empty string.

     Dim TheNum as String = Regex.Match(TestStr, "\d+").Value
     If TheNum <> ""
       Console.WriteLine("Number is: " & TheNum)
     End If

This example uses a match option:


     Dim ImgTag as String = Regex.Match(TestStr, "<img\b[^>]*>", _
                                        RegexOptions.IgnoreCase).Value
     If ImgTag <> ""
       Console.WriteLine("Image tag: " & ImgTag)
     End If

9.2.1.3 Quickstart: Matching and getting captured text

This example gets the first captured group (e.g., $1) as a string:

     Dim Subject as String = _
        Regex.Match(TestStr, "^Subject: (.*)").Groups(1).Value
     If Subject <> ""
        Console.WriteLine("Subject is: " & Subject)
     End If

Note that C# uses Groups[1] instead of Groups(1).

Here's the same thing, using a match option:

     Dim Subject as String = _
        Regex.Match(TestStr, "^subject: (.*)", _
                    RegexOptions.IgnoreCase).Groups(1).Value
     If Subject <> ""
        Console.WriteLine("Subject is: " & Subject)
     End If

This example is the same as the previous, but using named capture:

     Dim Subject as String = _
        Regex.Match(TestStr, "^subject: (?<Subj>.*)", _
                    RegexOptions.IgnoreCase).Groups("Subj").Value
     If Subject <> ""
        Console.WriteLine("Subject is: " & Subject)
     End If

9.2.1.4 Quickstart: Search and replace

This example makes our test string "safe" to include within HTML, converting characters special to HTML into HTML entities:


     TestStr = Regex.Replace(TestStr, "&", "&amp;")
     TestStr = Regex.Replace(TestStr, "<", "&lt;")
     TestStr = Regex.Replace(TestStr, ">", "&gt;")
     Console.WriteLine("Now safe in HTML: " & TestStr)

The replacement string (the third argument) is interpreted specially, as described in the sidebar in Section 9.3.2. For example, within the replacement string, '$&' is replaced by the text actually matched by the regex. Here's an example that wraps <B>···</B> around capitalized words:


     TestStr = Regex.Replace(TestStr, "\b[A-Z]\w*", "<B>$&</B>")
     Console.WriteLine("Modified string: " & TestStr)

This example replaces <B>···</B> (in a case-insensitive manner) with <I>···</I>:


     TestStr = Regex.Replace(TestStr, "<b>(.*?)</b>", "<I>$1</I>", _
                             RegexOptions.IgnoreCase)
     Console.WriteLine("Modified string: " & TestStr)

9.2.2 Package Overview

You can get the most out of .NET regular expressions by working with the package's rich and convenient class structure. To give us an overview, here's a complete console application that shows a simple match using explicit objects:

     Option Explicit On ' These are not specifically required to use regexes,
     Option Strict On   ' but their use is good general practice.

     ' Make regex-related classes easily available.
     Imports System.Text.RegularExpressions

     Module SimpleTest
     Sub Main()
         Dim SampleText as String = "this is the 1st test string"
         Dim R as Regex = New Regex("\d+\w+") 'Compile the pattern.
         Dim M as Match = R.Match(SampleText) 'Check against a string.
         If not M.Success
             Console.WriteLine("no match")
         Else
             Dim MatchedText as String = M.Value 'Query the results . . .
             Dim MatchedFrom as Integer = M.Index
             Dim MatchedLen as Integer = M.Length
             Console.WriteLine("matched [" & MatchedText & "]" & _
                               " from char#" & MatchedFrom.ToString() & _
                               " for " & MatchedLen.ToString() & " chars.")
         End If
     End Sub
     End Module

When executed from a command prompt, it applies the regex \d+\w+ to the sample text and displays:

     matched [1st] from char#12 for 3 chars.

9.2.2.1 Importing the regex namespace

Notice the Imports System.Text.RegularExpressions line near the top of the program? That's required in any VB program that wishes to access the .NET regex objects, to make them available to the compiler.

The analogous statement in C# is:

     using System.Text.RegularExpressions; // This is for C#

The example shows the use of the underlying raw regex objects. The two main action lines:

     Dim R as Regex = New Regex("\d+\w+") 'Compile the pattern.

     Dim M as Match = R.Match(SampleText) 'Check against a string.

can also be combined, as:

     Dim M as Match = Regex.Match(SampleText, "\d+\w+") 'Check pattern against string.

The combined version is easier to work with, as there's less for the programmer to type and fewer objects to keep track of. It does, however, come with a slight efficiency penalty (see Section 9.4.1). Over the coming sections, we'll first look at the raw objects, and then at the "convenience" functions like the Regex.Match static function, and when it makes sense to use them.
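As a rough illustration of the two styles side by side, the sketch below applies \d+ to several strings, first through one reused Regex object and then through the static Regex.Match function. The module name, variable names, and sample strings are invented for this illustration. Both loops find the same matches; they differ only in whether the pattern is built once up front or handed over on every call.

```vb
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module ReuseSketch   ' Hypothetical name, for illustration only.
    Sub Main()
        Dim Samples() As String = {"order 66", "no digits here", "room 101"}

        ' Style 1: build the Regex object once, then apply it repeatedly.
        Dim R As Regex = New Regex("\d+")
        For Each Sample As String In Samples
            Dim M As Match = R.Match(Sample)
            If M.Success Then
                Console.WriteLine("object: " & M.Value)
            End If
        Next

        ' Style 2: hand the pattern to the static function on every call.
        For Each Sample As String In Samples
            Dim M As Match = Regex.Match(Sample, "\d+")
            If M.Success Then
                Console.WriteLine("static: " & M.Value)
            End If
        Next
    End Sub
End Module
```

The second string contains no digits, so each loop prints two lines, one for 66 and one for 101.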

For brevity's sake, I'll generally not repeat the following lines in examples that are not complete programs:

     Option Explicit On
     Option Strict On
     Imports System.Text.RegularExpressions

It may also be helpful to look back at some of the VB examples earlier in the book, in Sections 3.2.2.2, 3.2.4, 5.3.4, 5.4.2.2, and 6.3.3.

9.2.3 Core Object Overview

Before getting into the details, let's first take a step back and look at the .NET regex object model. An object model is the set of class structures through which regex functionality is provided. .NET regex functionality is provided through seven highly-interwoven classes, but in practice, you'll generally need to understand only the three shown visually in Figure 9-1, which depicts the repeated application of \s+(\d+) to the string 'Mar•16,•1998'.

9.2.3.1 Regex objects

The first step is to create a Regex object, as with:

     Dim R as Regex = New Regex("\s+(\d+)")

Here, we've made a regex object representing \s+(\d+) and stored it in the R variable. Once you've got a Regex object, you can apply it to text with its Match(text) method, which returns information on the first match found:

     Dim M as Match = R.Match("May 16, 1998")

Figure 9-1. .NET's Regex-related object model

9.2.3.2 Match objects

A Regex object's Match(···) method provides information about a match result by creating and returning a Match object. A Match object has a number of properties, including Success (a Boolean value indicating whether the match was successful) and Value (a copy of the text actually matched, if the match was successful). We'll look at the full list of Match properties later.

Among the details you can get about a match from a Match object is information about the text matched within capturing parentheses. The Perl examples in earlier chapters used Perl's $1 variable to get the text matched within the first set of capturing parentheses. .NET offers two methods to retrieve this data: to get the raw text, you can index into a Match object's Groups property, such as with Groups(1).Value to get the equivalent of Perl's $1. (Note: C# requires a different syntax, Groups[1].Value, instead.) Another approach is to use the Result method, which is discussed starting in Section 9.3.3.
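Here is a minimal sketch of both approaches against the same match. The module name and the sample subject line are made up for illustration, and the replacement-style string passed to Result is just one possible choice.

```vb
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module CaptureSketch   ' Hypothetical name, for illustration only.
    Sub Main()
        Dim M As Match = Regex.Match("Subject: lunch on Friday", "^Subject: (.*)")
        If M.Success Then
            ' Raw text of the first set of capturing parentheses (Perl's $1).
            Console.WriteLine(M.Groups(1).Value)     ' lunch on Friday

            ' Result expands $1 inside a replacement-style string.
            Console.WriteLine(M.Result("Re: $1"))    ' Re: lunch on Friday
        End If
    End Sub
End Module
```

Checking Success first matters for the second call in particular, since Result is meant to be used only on a successful match.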

9.2.3.3 Group objects

The Groups(1) part in the previous paragraph actually references a Group object, and the subsequent .Value references its Value property (the text associated with the group). There is a Group object for each set of capturing parentheses, and a "virtual group," numbered zero, which holds the information about the overall match.

Thus, MatchObj.Value and MatchObj.Groups(0).Value are the same: a copy of the entire text matched. It's more concise and convenient to use the first, shorter approach, but it's important to know about the zeroth group because MatchObj.Groups.Count (the number of groups known to the Match object) includes it. The MatchObj.Groups.Count resulting from a successful match with \s+(\d+) is two (the whole-match "zeroth" group, and the $1 group).
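The zeroth group and the group count can be seen in a short sketch that applies \s+(\d+) repeatedly to 'Mar 16, 1998', stepping from one match to the next with the Match object's NextMatch method. The module name and output labels are invented for illustration.

```vb
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module GroupCountSketch   ' Hypothetical name, for illustration only.
    Sub Main()
        Dim R As Regex = New Regex("\s+(\d+)")
        Dim M As Match = R.Match("Mar 16, 1998")

        ' Walk through each successive match in the string.
        While M.Success
            Console.WriteLine("whole match: [" & M.Value & "]")
            Console.WriteLine("group 0:     [" & M.Groups(0).Value & "]")
            Console.WriteLine("group 1:     [" & M.Groups(1).Value & "]")
            Console.WriteLine("group count: " & M.Groups.Count.ToString())
            M = M.NextMatch()
        End While
    End Sub
End Module
```

The loop runs twice: the first match is ' 16' with '16' in group 1, the second is ' 1998' with '1998' in group 1. In both cases the whole-match and group 0 lines are identical, and the group count is 2.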

9.2.3.4 Capture objects

There is also a Capture object. It's not used often, but it's discussed starting in Section 9.6.3.

9.2.3.4.1 All results are computed at match time

When a regex is applied to a string, resulting in a Match object, all the results (where it matched, what each capturing group matched, etc.) are calculated and encapsulated into the Match object. Accessing properties and methods of the Match object, including its Group objects (and their properties and methods), merely fetches the results that have already been computed.
