
9.6 Advanced .NET

The following sections cover a few features that haven't fit into the discussion so far: building a regex library with regex assemblies, using an interesting .NET-only regex feature for matching nested constructs, and a discussion of the Capture object.

9.6.1 Regex Assemblies

.NET allows you to encapsulate Regex objects into an assembly, which is useful in creating a regex library. The example in the sidebar below shows how to build one.

When the sidebar example executes, it creates the file JfriedlsRegexLibrary.DLL in the project's bin directory.

I can then use that assembly in another project, after first adding it as a reference via Visual Studio .NET's Project > Add Reference dialog.

To make the classes in the assembly available, I first import them:

     Imports jfriedl

I can then use them just like any other class, as in this example:


     Dim FieldRegex as CSV.GetField = New CSV.GetField 'This makes a new Regex object
          .
          .
          .
     Dim FieldMatch as Match = FieldRegex.Match(Line) 'Apply the regex to a string . . .
     While FieldMatch.Success
       Dim Field as String
       If FieldMatch.Groups(1).Success
         Field = FieldMatch.Groups("QuotedField").Value
         Field = Regex.Replace(Field, """""", """") 'replace two double quotes with one
       Else
         Field = FieldMatch.Groups("UnquotedField").Value
       End If

       Console.WriteLine("[" & Field & "]")
       ' Can now work with 'Field'....

       FieldMatch = FieldMatch.NextMatch
     End While

In this example, I chose to import only from the jfriedl namespace, but could have just as easily imported from the jfriedl.CSV namespace, which then would allow the Regex object to be created with:

     Dim FieldRegex as GetField = New GetField 'This makes a new Regex object

The difference is mostly a matter of style. You can also choose to not import anything, but rather use them directly:

     Dim FieldRegex as jfriedl.CSV.GetField = New jfriedl.CSV.GetField

This is a bit more cumbersome, but documents clearly where exactly the object is coming from. Again, it's a matter of style.
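The field-extraction loop shown earlier can be tried without building the assembly first. The following is only a sketch under stated assumptions: it rebuilds the GetField pattern from the sidebar as an ordinary Regex object, and the module name CsvFieldDemo, the function name SplitCsvLine, and the sample line are hypothetical names chosen for illustration, not part of the library.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' The GetField pattern from the sidebar, built here as an ordinary Regex
    ' object (an assumption made so the sketch runs without the compiled DLL).
    Private ReadOnly FieldRegex As New Regex( _
        "\G(?:^|,)                                     " & _
        "(?:                                           " & _
        "   "" (?<QuotedField> (?> [^""]+ | """" )* ) "" " & _
        " |                                            " & _
        "   (?<UnquotedField> [^"",]* )                " & _
        ")", _
        RegexOptions.IgnorePatternWhitespace)

    ' Returns the fields of one CSV line, with doubled quotes collapsed to one.
    Public Function SplitCsvLine(ByVal Line As String) As List(Of String)
        Dim Fields As New List(Of String)
        Dim FieldMatch As Match = FieldRegex.Match(Line)
        While FieldMatch.Success
            If FieldMatch.Groups("QuotedField").Success Then
                Fields.Add(FieldMatch.Groups("QuotedField").Value.Replace("""""", """"))
            Else
                Fields.Add(FieldMatch.Groups("UnquotedField").Value)
            End If
            FieldMatch = FieldMatch.NextMatch()
        End While
        Return Fields
    End Function

    Sub Main()
        Dim Line As String = _
            "Ten Thousand,10000, 2710 ,,""10,000"",""It's """"10 Grand"""", baby"",10K"
        For Each Field As String In SplitCsvLine(Line)
            Console.WriteLine("[" & Field & "]")
        Next
    End Sub
End Module
```

With the compiled library referenced, the only change needed should be to replace the FieldRegex declaration with New jfriedl.CSV.GetField; the loop itself stays the same.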

Creating Your Own Regex Library With an Assembly

This example builds a small regex library. This complete program builds an assembly (DLL) that holds three pre-built Regex constructors I've named jfriedl.Mail.Subject, jfriedl.Mail.From, and jfriedl.CSV.GetField.

The first two are simple examples just to show how it's done, but the complexity of the final one really shows the promise of building your own library. Note that you don't have to give the RegexOptions.Compiled flag, as that's implied by the process of building an assembly.

See the text (in Section 9.6.1) for how to use the assembly after it's built.

Option Explicit On
Option Strict On

Imports System.Text.RegularExpressions
Imports System.Reflection

Module BuildMyLibrary
Sub Main()
  'The calls to RegexCompilationInfo below provide the pattern, regex options, name within the class,
  'class name, and a Boolean indicating whether the new class is public. The first class, for example,
  'will be available to programs that use this assembly as "jfriedl.Mail.Subject", a Regex constructor.
  Dim RCInfo() as RegexCompilationInfo = { _
    New RegexCompilationInfo( _
        "^Subject:\s*(.*)", RegexOptions.IgnoreCase, _
        "Subject", "jfriedl.Mail", true), _
    New RegexCompilationInfo( _
        "^From:\s*(.*)", RegexOptions.IgnoreCase, _
        "From", "jfriedl.Mail", true), _
    New RegexCompilationInfo( _
        "\G(?:^|,) " & _
        "(?: " & _
        " (?# Either a double-quoted field... ) " & _
        " "" (?# field's opening quote ) " & _
        " (?<QuotedField> (?> [^""]+ | """" )* ) " & _
        " "" (?# field's closing quote ) " & _
        " (?# ...or... ) " & _
        " | " & _
        " (?# ...some non-quote/non-comma text... ) " & _
        " (?<UnquotedField> [^"",]*) " & _
        " )", _
        RegexOptions.IgnorePatternWhitespace, _
        "GetField", "jfriedl.CSV", true) _
  }

  'Now do the heavy lifting to build and write out the whole thing . . .
  Dim AN as AssemblyName = new AssemblyName()
  AN.Name = "JfriedlsRegexLibrary" 'This will be the DLL's filename
  AN.Version = New Version("1.0.0.0")
  Regex.CompileToAssembly(RCInfo, AN) 'Build everything
End Sub
End Module


9.6.2 Matching Nested Constructs

Microsoft has included an interesting innovation for matching balanced constructs (historically, something not possible with a regular expression). It's not particularly easy to understand—this section is short, but be warned, it is very dense.

It's easiest to understand with an example, so I'll start with one:

     Dim R As Regex = New Regex(" \(                    " & _
                                "    (?>                " & _
                                "        [^()]+         " & _
                                "      |                " & _
                                "        \( (?<DEPTH>)  " & _
                                "      |                " & _
                                "        \) (?<-DEPTH>) " & _
                                "    )*                 " & _
                                "    (?(DEPTH)(?!))     " & _
                                " \)                    ",  _
               RegexOptions.IgnorePatternWhitespace)

This matches the first properly-paired nested set of parentheses, such as the '(yes (here) okay)' portion of 'before (nope (yes (here) okay) after'. The first parenthesis isn't matched because it has no associated closing parenthesis.

Here's the super-short overview of how it works:

  1. With each '(' matched, 「(?<DEPTH>)」 adds one to the regex's idea of how deep the parentheses are currently nested (at least, nested beyond the initial 「\(」 at the start of the regex).

  2. With each ')' matched, 「(?<-DEPTH>)」 subtracts one from that depth.

  3. 「(?(DEPTH)(?!))」 ensures that the depth is zero before allowing the final literal 「\)」 to match.

This works because the engine's backtracking stack keeps track of successfully matched groupings. 「(?<DEPTH>)」 is just a named-capture version of 「()」, which is always successful. Since it has been placed immediately after 「\(」, its success (which remains on the stack until removed) is used as a marker for counting opening parentheses.

Thus, the number of successful 'DEPTH' groupings matched so far is maintained on the backtracking stack. We want to subtract from that whenever a closing parenthesis is found. That's accomplished by .NET's special 「(?<-DEPTH>)」 construct, which removes the most recent "successful DEPTH" notation from the stack. If it turns out that there aren't any, the 「(?<-DEPTH>)」 itself fails, thereby disallowing the regex from over-matching an extra closing parenthesis.
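The counting can be observed directly. The sketch below is an illustration rather than part of the original example: it anchors a simplified version of the loop to the whole string and reports how many DEPTH entries are left at the end. Reading that number through Groups("DEPTH").Captures.Count, and the names DepthCounterDemo and UnclosedDepth, are choices made for this illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DepthCounterDemo
    ' Returns how many "(" are still unclosed at the end of Text, or -1 when a
    ' ")" appears with no "(" left to pair with (the regex then cannot match).
    Public Function UnclosedDepth(ByVal Text As String) As Integer
        Dim M As Match = Regex.Match(Text, _
            "^(?> [^()]+ | \( (?<DEPTH>) | \) (?<-DEPTH>) )*$", _
            RegexOptions.IgnorePatternWhitespace)
        If Not M.Success Then
            Return -1
        End If
        ' Each "(" pushed one DEPTH entry and each ")" removed one, so whatever
        ' remains corresponds to opening parentheses that were never closed.
        Return M.Groups("DEPTH").Captures.Count
    End Function

    Sub Main()
        Console.WriteLine(UnclosedDepth("(a)(b)").ToString())  ' balanced
        Console.WriteLine(UnclosedDepth("((a)").ToString())    ' one "(" left open
        Console.WriteLine(UnclosedDepth("a)").ToString())      ' extra ")" : no match
    End Sub
End Module
```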

Finally, 「(?(DEPTH)(?!))」 is a normal conditional that applies 「(?!)」 if the 'DEPTH' grouping is currently successful. If it's still successful by the time we get here, there was an unpaired opening parenthesis whose success had never been subtracted by a balancing 「(?<-DEPTH>)」. If that's the case, we want to exit the match (we don't want to match an unbalanced sequence), so we apply 「(?!)」, which is a normal negative lookahead of an empty subexpression, and guaranteed to fail.

Phew! That's how to match nested constructs with .NET regular expressions.
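To see the whole regex at work, here is a small sketch that applies it to the sample string from the start of this section. The pattern is the one shown above; the module name NestedParensDemo and the function name FirstBalanced are hypothetical names used only for this illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NestedParensDemo
    ' The nested-parentheses regex from this section, unchanged.
    Private ReadOnly Nested As New Regex( _
        " \(                    " & _
        "    (?>                " & _
        "        [^()]+         " & _
        "      |                " & _
        "        \( (?<DEPTH>)  " & _
        "      |                " & _
        "        \) (?<-DEPTH>) " & _
        "    )*                 " & _
        "    (?(DEPTH)(?!))     " & _
        " \)                    ", _
        RegexOptions.IgnorePatternWhitespace)

    ' Returns the first properly paired parenthesized text, or an empty
    ' string when Text contains none (a real match is never empty).
    Public Function FirstBalanced(ByVal Text As String) As String
        Dim M As Match = Nested.Match(Text)
        If M.Success Then
            Return M.Value
        End If
        Return String.Empty
    End Function

    Sub Main()
        ' Prints: (yes (here) okay)
        Console.WriteLine(FirstBalanced("before (nope (yes (here) okay) after"))
    End Sub
End Module
```

The attempt starting at '(nope' fails because no closing parenthesis is left to pair with it, so the engine moves on and succeeds at '(yes'.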

9.6.3 Capture Objects

There's an additional component to .NET's object model, the Capture object, which I haven't discussed yet. Depending on your point of view, it either adds an interesting new dimension to the match results, or adds confusion and bloat.

A Capture object is almost identical to a Group object in that it represents the text matched within a set of capturing parentheses. Like the Group object, it has the properties Value (the text matched), Length (the length of the text matched), and Index (the zero-based number of characters into the target string at which the match was found).

The main difference between a Group object and a Capture object is that each Group object contains a collection of Captures representing all the intermediary matches by the group during the match, as well as the final text matched by the group.

Here's an example with 「^(..)+」 applied to 'abcdefghijk':

     Dim M as Match = Regex.Match("abcdefghijk", "^(..)+")

The regex matches five sets of 「(..)」, which is most of the string: the 'abcdefghij' of 'abcdefghijk'. Since the plus is outside of the parentheses, they recapture with each iteration of the plus, and are left with only 'ij' (that is, M.Groups(1).Value is 'ij'). However, M.Groups(1) also contains a collection of Captures representing the complete 'ab', 'cd', 'ef', 'gh', and 'ij' that 「(..)」 walked through during the match:


     M.Groups(1).Captures(0).Value is 'ab'

     M.Groups(1).Captures(1).Value is 'cd'

     M.Groups(1).Captures(2).Value is 'ef'

     M.Groups(1).Captures(3).Value is 'gh'

     M.Groups(1).Captures(4).Value is 'ij'

     M.Groups(1).Captures.Count is 5.

You'll notice that the last capture has the same 'ij' value as the group as a whole, M.Groups(1).Value. It turns out that the Value of a Group is really just a shorthand notation for the group's final capture. M.Groups(1).Value is really:


M.Groups(1).Captures( M.Groups(1).Captures.Count - 1 ).Value
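The listing above can be reproduced with a short loop over the group's Captures collection. This is a minimal sketch; the module name CaptureDemo and the function name PairCaptures are hypothetical names introduced for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CaptureDemo
    ' Collects every intermediate capture made by group 1 of ^(..)+,
    ' in the order in which the group captured them.
    Public Function PairCaptures(ByVal Text As String) As List(Of String)
        Dim Result As New List(Of String)
        Dim M As Match = Regex.Match(Text, "^(..)+")
        For Each C As Capture In M.Groups(1).Captures
            Result.Add(C.Value)
        Next
        Return Result
    End Function

    Sub Main()
        Dim Pairs As List(Of String) = PairCaptures("abcdefghijk")
        For I As Integer = 0 To Pairs.Count - 1
            Console.WriteLine("Captures(" & I.ToString() & ") = " & Pairs(I))
        Next
        ' The final capture is what M.Groups(1).Value reports: ij
        Console.WriteLine("Last = " & Pairs(Pairs.Count - 1))
    End Sub
End Module
```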

Here are some additional points about captures:

  • M.Groups(1).Captures is a CaptureCollection, which, like any collection, has Item and Count properties. However, it's common to forgo the Item property and index directly through the collection to its individual items, as with M.Groups(1).Captures(3) (M.Groups[1].Captures[3] in C#).

  • A Capture object does not have a Success property; check the Group's Success instead.

  • So far, we've seen that Capture objects are available from a Group object. Although it's not particularly useful, a Match object also has a Captures property. M.Captures gives direct access to the Captures property of the zeroth group (that is, M.Captures is the same as M.Groups(0).Captures). Since the zeroth group represents the entire match, there are no iterations of it "walking through" a match, so the zeroth capture collection always has only one Capture. Since it contains exactly the same information as the zeroth Group, neither M.Captures nor M.Groups(0).Captures is particularly useful.
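The last point is easy to check. In this sketch (the names MatchCapturesDemo and WholeMatchCaptures are hypothetical), the Captures collection of the Match itself is gathered into a list; for a successful match it should hold a single entry, the text of the whole match.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MatchCapturesDemo
    ' Returns the values in M.Captures, the capture collection of the
    ' zeroth group, for ^(..)+ applied to Text.
    Public Function WholeMatchCaptures(ByVal Text As String) As List(Of String)
        Dim Result As New List(Of String)
        Dim M As Match = Regex.Match(Text, "^(..)+")
        For Each C As Capture In M.Captures
            Result.Add(C.Value)
        Next
        Return Result
    End Function

    Sub Main()
        Dim Whole As List(Of String) = WholeMatchCaptures("abcdefghijk")
        Console.WriteLine(Whole.Count.ToString())  ' 1
        Console.WriteLine(Whole(0))                ' abcdefghij, the same as M.Value
    End Sub
End Module
```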

.NET's Capture object is an interesting innovation that appears somewhat more complex and confusing than it really is by the way it's been "overly integrated" into the object model. After getting past the .NET documentation and actually understanding what these objects add, I've got mixed feelings about them. On one hand, it's an interesting innovation that I'd like to get to know. Uses for it don't immediately jump to mind, but that's likely because I've not had the same years of experience with it as I have with traditional regex features.

On the other hand, the construction of all these extra capture groups during a match, and then their encapsulation into objects after the match, seems an efficiency burden that I wouldn't want to pay unless I'd requested the extra information. The extra Capture groups won't be used in the vast majority of matches, but as it is, all Group and Capture objects (and their associated GroupCollection and CaptureCollection objects) are built when the Match object is built. So, you've got them whether you need them or not; if you can find a use for the Capture objects, by all means, use them.
