9.3 Core Object Details

Now that we've seen an overview, let's look at the details. First, we'll look at how to create a Regex object, followed by how to apply it to a string to yield a Match object, and how to work with that object and its Group objects.

In practice, you can often avoid having to explicitly create a Regex object, but it's good to be comfortable with them, so during this look at the core objects, I'll always explicitly create them. We'll see later what shortcuts .NET provides to make things more convenient.

In the lists that follow, I don't mention little-used methods that are merely inherited from the Object class.

9.3.1 Creating Regex Objects

The constructor for creating a Regex object is uncomplicated. It accepts either one argument (the regex, as a string), or two arguments (the regex and a set of options). Here's a one-argument example:

    Dim StripTrailWS = New Regex("\s+$") ' for removing trailing whitespace
This just creates the Regex object, preparing it for use; no matching has been done to this point. Here's a two-argument example:

    Dim GetSubject = New Regex("^subject: (.*)", RegexOptions.IgnoreCase)

That passes one of the RegexOptions flags, but you can pass multiple flags if they're OR'd together, as with:

    Dim GetSubject = New Regex("^subject: (.*)", _
                               RegexOptions.IgnoreCase Or RegexOptions.Multiline)

9.3.1.1 Catching exceptions

An ArgumentException error is thrown if a regex with an invalid combination of metacharacters is given. You don't normally need to catch this exception when using regular expressions you know to work, but it's important to catch it if using regular expressions from "outside" the program (e.g., entered by the user, or read from a configuration file). Here's an example:

    Dim R As Regex
    Try
        R = New Regex(SearchRegex)
    Catch e As ArgumentException
        Console.WriteLine("*ERROR* bad regex: " & e.ToString)
        Exit Sub
    End Try

Of course, depending on the application, you may want to do something other than writing to the console upon detection of the exception.

9.3.1.2 Regex options

The following option flags are allowed when creating a Regex object:
9.3.2 Using Regex Objects

Just having a regex object is not useful unless you apply it, so the following methods swing it into action.
The IsMatch method applies the object's regex to the target string, returning a simple Boolean indicating whether the attempt is successful. Here's an example:

    Dim R As Regex = New Regex("^\s*$")
    . . .
    If R.IsMatch(Line) Then
        ' Line is blank . . .
        . . .
    End If

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is first attempted.
The Match method applies the object's regex to the target string, returning a Match object. With this Match object, you can query information about the results of the match (whether it was successful, the text matched, etc.), and initiate the "next" match of the same regex in the string. Details of the Match object follow, starting in Section 9.3.3.

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is first attempted.

If you provide a maxlength argument, it puts matching into a special mode where the maxlength characters starting offset characters into the target string are taken as the entire target string, as far as the regex engine is concerned. It pretends that characters outside the range don't even exist, so, for example, ^ can match at offset characters into the original target string, and $ can match at maxlength characters after that. It also means that lookaround can't "see" the characters outside of that range. This is all very different from when only offset is provided, as that merely influences where the transmission begins applying the regex; the engine still "sees" the entire target string.

This table shows examples that illustrate the meaning of offset and maxlength:
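The difference between the two modes can also be seen in a small program. This is only a sketch: the target string and the two regexes are chosen here for illustration, and the comments show what the description above leads one to expect.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OffsetVersusMaxLength
    Sub Main()
        Dim Target As String = "May 16, 1998"
        Dim LeadingDigits As New Regex("^\d+")
        Dim TrailingDigits As New Regex("\d+$")

        ' Offset only: the engine still sees the whole string, so ^ cannot
        ' match four characters in, and the attempt fails.
        Console.WriteLine(LeadingDigits.Match(Target, 4).Success)      ' False

        ' Offset plus maxlength: only "16" is visible, so ^ matches at its start.
        Console.WriteLine(LeadingDigits.Match(Target, 4, 2).Value)     ' 16

        ' With an offset alone, $ still means the true end of the string...
        Console.WriteLine(TrailingDigits.Match(Target, 4).Value)       ' 1998

        ' ...but with maxlength it means the end of the two-character window.
        Console.WriteLine(TrailingDigits.Match(Target, 4, 2).Value)    ' 16
    End Sub
End Module
```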
The Matches method is similar to the Match method, except Matches returns a collection of Match objects representing all the matches in the target, rather than just one Match object representing the first match. The returned object is a MatchCollection. For example, after this initialization:

    Dim R As New Regex("\w+")
    Dim Target As String = "a few words"

this code snippet

    Dim BunchOfMatches As MatchCollection = R.Matches(Target)
    Dim I As Integer
    For I = 0 To BunchOfMatches.Count - 1
        Dim MatchObj As Match = BunchOfMatches.Item(I)
        Console.WriteLine("Match: " & MatchObj.Value)
    Next

produces this output:

    Match: a
    Match: few
    Match: words

The following example, which produces the same output, shows that you can dispense with the MatchCollection variable altogether:

    Dim MatchObj As Match
    For Each MatchObj In R.Matches(Target)
        Console.WriteLine("Match: " & MatchObj.Value)
    Next

Finally, as a comparison, here's how you can accomplish the same thing another way, with the Match (rather than Matches) method:

    Dim MatchObj As Match = R.Match(Target)
    While MatchObj.Success
        Console.WriteLine("Match: " & MatchObj.Value)
        MatchObj = MatchObj.NextMatch()
    End While
The Replace method does a search-and-replace on the target string, returning a (possibly changed) copy of it. It applies the Regex object's regular expression, but instead of returning a Match object, it replaces the matched text. What the matched text is replaced with depends on the replacement argument. The replacement argument is overloaded; it can be either a string or a MatchEvaluator delegate. If replacement is a string, it is interpreted according to the sidebar below. For example,

    Dim R_CapWord As New Regex("\b[A-Z]\w*")
    . . .
    Text = R_CapWord.Replace(Text, "<B>$0</B>")

wraps each capitalized word with <B>···</B>. (The regex has no capturing parentheses, so the replacement refers to the whole match with $0.)

If count is given, only that number of replacements is done. (The default is to do all replacements.) To replace just the first match found, for example, use a count of one. If you know that there will be only one match, using an explicit count of one is more efficient than letting the Replace mechanics go through the work of trying to find additional matches. A count of -1 means "replace all" (which, again, is the default when no count is given).

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is applied. Bypassed characters are copied through to the result unchanged.

For example, this canonicalizes all whitespace (that is, reduces sequences of whitespace down to a single space):

    Dim AnyWS As New Regex("\s+")
    . . .
    Target = AnyWS.Replace(Target, " ")

This converts 'some•••••random•••••spacing' to 'some•random•spacing'. The following does the same, except it leaves any leading whitespace alone:

    Dim AnyWS As New Regex("\s+")
    Dim LeadingWS As New Regex("^\s+")
    . . .
    Target = AnyWS.Replace(Target, " ", -1, LeadingWS.Match(Target).Length)

This converts '••••some•••random•••••spacing' to '••••some•random•spacing'. It uses the length of what's matched by LeadingWS as the offset (as the count of characters to skip) when doing the search and replace.
It uses a convenient feature of the Match object, returned here by LeadingWS.Match(Target), that its Length property may be used even if the match fails. (Upon failure, the Length property has a value of zero, which is exactly what we need to apply AnyWS to the entire target.)
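As a runnable sketch of the count and offset arguments together, the following uses the same two regexes on a sample string made up for illustration; the brackets in the output only make the leading whitespace visible.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceCountAndOffset
    Sub Main()
        Dim AnyWS As New Regex("\s+")
        Dim LeadingWS As New Regex("^\s+")
        Dim Target As String = "   some   random   spacing"

        ' count = 1: only the first run of whitespace (the leading one) is replaced.
        Console.WriteLine("[" & AnyWS.Replace(Target, " ", 1) & "]")
        ' [ some   random   spacing]

        ' count = -1 (replace all), offset = length of the leading whitespace,
        ' which is zero if LeadingWS fails to match.
        Dim Skip As Integer = LeadingWS.Match(Target).Length
        Console.WriteLine("[" & AnyWS.Replace(Target, " ", -1, Skip) & "]")
        ' [   some random spacing]
    End Sub
End Module
```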
9.3.2.1 Using a replacement delegate

The replacement argument isn't limited to a simple string. It can be a delegate (basically, a pointer to a function). The delegate function is called after each match to generate the text to use as the replacement. Since the function can do any processing you want, it's an extremely powerful replacement mechanism.

The delegate is of the type MatchEvaluator, and is called once per match. The function it refers to should accept the Match object for the match, do whatever processing you like, and return the text to be used as the replacement.

As examples for comparison, the following two code snippets produce identical results:

    Target = R.Replace(Target, "<<$&>>")

    ..........................................................

    Function MatchFunc(ByVal M As Match) As String
        Return M.Result("<<$&>>")
    End Function
    Dim Evaluator As MatchEvaluator = New MatchEvaluator(AddressOf MatchFunc)
    . . .
    Target = R.Replace(Target, Evaluator)

Both snippets highlight each match by wrapping the matched text in <<···>>. The advantage of using a delegate is that you can include code as complex as you like in computing the replacement. Here's an example that converts Celsius temperatures to Fahrenheit:

    Function MatchFunc(ByVal M As Match) As String
        'Get numeric temperature from $1, then convert to Fahrenheit
        Dim Celsius As Double = Double.Parse(M.Groups(1).Value)
        Dim Fahrenheit As Double = Celsius * 9/5 + 32
        Return Fahrenheit & "F" 'Append an "F", and return
    End Function

Given 'Temp is 37C.' in Target, it replaces it with 'Temp is 98.6F.'.
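The Celsius function above is shown without the regex that drives it. A complete sketch might look like the following, where the pattern (\d+(?:\.\d+)?)C\b and the name ConvertMatch are assumptions made for this illustration, and the invariant culture is used so that the decimal point does not depend on the machine's regional settings.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module CelsiusToFahrenheit
    ' Called once per match; $1 holds the numeric part of the temperature.
    Function ConvertMatch(ByVal M As Match) As String
        Dim Celsius As Double = Double.Parse(M.Groups(1).Value, CultureInfo.InvariantCulture)
        Dim Fahrenheit As Double = Celsius * 9 / 5 + 32
        Return Fahrenheit.ToString(CultureInfo.InvariantCulture) & "F"
    End Function

    Sub Main()
        ' Hypothetical pattern: a number (optionally with a fraction) followed by "C".
        Dim R As New Regex("(\d+(?:\.\d+)?)C\b")
        Dim Target As String = "Temp is 37C."
        Target = R.Replace(Target, New MatchEvaluator(AddressOf ConvertMatch))
        Console.WriteLine(Target)   ' Temp is 98.6F.
    End Sub
End Module
```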
The Split method applies the object's regex to the target string, returning an array of the strings separated by the matches. Here's a trivial example:

    Dim R As New Regex("\.")
    Dim Parts As String() = R.Split("209.204.146.22")

The R.Split returns the array of four strings ('209', '204', '146', and '22') that are separated by the three matches of \. in the text.

If a count is provided, no more than count strings will be returned (unless capturing parentheses are used; more on that in a bit). If count is not provided, Split returns as many strings as are separated by matches. Providing a count may mean that the regex stops being applied before the final match, and if so, the last string has the unsplit remainder of the line:

    Dim R As New Regex("\.")
    Dim Parts As String() = R.Split("209.204.146.22", 2)

This time, Parts receives two strings, '209' and '204.146.22'.

If an offset (an integer) is provided, that many characters in the target string are bypassed before the regex is attempted. The bypassed text becomes part of the first string returned (unless RegexOptions.RightToLeft has been specified, in which case the bypassed text becomes part of the last string returned).

9.3.2.2 Using Split with capturing parentheses

If capturing parentheses of any type are used, additional entries for captured text are usually inserted into the array. (We'll see in what cases they might not be inserted in a bit.) As a simple example, to separate a string like '2002-12-31' or '04/12/2003' into its component parts, you might split on [-/], like:

    Dim R As New Regex("[-/]")
    Dim Parts As String() = R.Split(MyDate)

This returns a list of the three numbers (as strings). However, adding capturing parentheses and using ([-/]) as the regex causes Split to return five strings: if MyDate contains '2002-12-31', the strings are '2002', '-', '12', '-', and '31'. The extra '-' elements are from the per-capture $1.
If there are multiple sets of capturing parentheses, they are inserted in their numerical ordering (which means that all named captures come after all unnamed captures; see Section 9.1.1.1).

Split works consistently with capturing parentheses so long as all sets of capturing parentheses actually participate in the match. However, there's a bug with the current version of .NET such that if there is a set of capturing parentheses that doesn't participate in the match, it and all higher-numbered sets don't add an element to the returned list.

As a somewhat contrived example, consider wanting to split on a comma with optional whitespace around it, yet have the whitespace added to the list of elements returned. You might use (\s+)?,(\s+)? for this. When applied with Split to 'this•,••that', four strings are returned: 'this', '•', '••', and 'that'. However, when applied to 'this,•that', the inability of the first set of capturing parentheses to match inhibits the element for it (and for all sets that follow) from being added to the list, so only two strings are returned, 'this' and 'that'. The inability to know beforehand exactly how many strings will be returned per match is a major shortcoming of the current implementation.

In this particular example, you could get around this problem simply by using (\s*),(\s*) (in which both groups are guaranteed to participate in any overall match). However, more complex expressions are not easily rewritten.
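The Split variations described above can be tried side by side with a short program. This sketch sticks to patterns in which every set of capturing parentheses participates in each match, since the handling of non-participating groups may differ from one .NET version to another; the last line is printed without a predicted result for the same reason of caution.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitWithCaptures
    Sub Main()
        Dim MyDate As String = "2002-12-31"

        ' No capturing parentheses: just the three numbers.
        Console.WriteLine(String.Join("|", New Regex("[-/]").Split(MyDate)))
        ' 2002|12|31

        ' Capturing parentheses: each separator is inserted as well.
        Console.WriteLine(String.Join("|", New Regex("([-/])").Split(MyDate)))
        ' 2002|-|12|-|31

        ' A count of 2 leaves the unsplit remainder in the last string.
        Console.WriteLine(String.Join("|", New Regex("\.").Split("209.204.146.22", 2)))
        ' 209|204.146.22

        ' The workaround from the text: both groups always participate.
        Dim Parts As String() = New Regex("(\s*),(\s*)").Split("this,  that")
        Console.WriteLine("{0} strings: [{1}]", Parts.Length, String.Join("][", Parts))
    End Sub
End Module
```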
These methods allow you to query information about the capturing groups in the regex, both by number and, if named capture is used, by name. They don't refer to any particular match, but merely to the names and numbers of groups that exist in the regex. The sidebar below shows an example of their use.
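The method names themselves did not survive in the text above; in the .NET Regex class, the group-query methods are GetGroupNames, GetGroupNumbers, GroupNameFromNumber, and GroupNumberFromName. A minimal sketch of their use, with a regex invented for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupNames
    Sub Main()
        ' One unnamed group and one named group; group 0 is the whole match.
        Dim R As New Regex("(\w+)://(?<HostName>[^/]+)")

        Console.WriteLine(String.Join(", ", R.GetGroupNames()))   ' 0, 1, HostName

        ' Map every group number back to its name.
        For Each N As Integer In R.GetGroupNumbers()
            Console.WriteLine("{0} -> {1}", N, R.GroupNameFromNumber(N))
        Next

        ' Named groups are numbered after all unnamed ones.
        Console.WriteLine(R.GroupNumberFromName("HostName"))      ' 2
    End Sub
End Module
```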
These allow you to query information about the Regex object itself (as opposed to applying the regex object to a string). The ToString() method returns the pattern string originally passed to the regex constructor. The RightToLeft property returns a Boolean indicating whether RegexOptions.RightToLeft was specified with the regex. The Options property returns the RegexOptions that are associated with the regex. The following table shows the values of the individual options, which are added together when reported:
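Because the option values are bit flags, the combined value reported by Options can be tested with And, and the individual values can be listed at run time rather than memorized. The following is a sketch only; the list it prints depends on the .NET version in use, so no output is predicted for the final loop.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSelfDescription
    Sub Main()
        Dim R As New Regex("^subject: (.*)", _
                           RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        Console.WriteLine(R.ToString())      ' ^subject: (.*)
        Console.WriteLine(R.RightToLeft)     ' False
        Console.WriteLine(CInt(R.Options))   ' 3  (IgnoreCase = 1 plus Multiline = 2)

        ' Test for one particular flag in the combined value.
        If (R.Options And RegexOptions.Multiline) <> RegexOptions.None Then
            Console.WriteLine("Multiline is set")
        End If

        ' List every option known to this runtime with its numeric value.
        For Each Opt As RegexOptions In [Enum].GetValues(GetType(RegexOptions))
            Console.WriteLine("{0,4}  {1}", CInt(Opt), Opt)
        Next
    End Sub
End Module
```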
The missing 128 value is for a Microsoft debugging option not available in the final product. The sidebar below shows an example of these methods in use.

9.3.3 Using Match Objects

Match objects are created by a Regex's Match method, the Regex.Match static function (discussed in a bit), and a Match object's own NextMatch method. A Match object encapsulates all information relating to a single application of a regex. It has the following properties and methods:
MatchObj.Success
MatchObj.Value
MatchObj.Length
MatchObj.Index
MatchObj.Groups
Note that C# requires square brackets, M.Groups[3] and M.Groups["HostName"], where Visual Basic uses M.Groups(3) and M.Groups("HostName"). The zeroth group represents the entire match itself. MatchObj.Groups(0).Value, for example, is the same as MatchObj.Value.
MatchObj.NextMatch()
MatchObj.Result(string)
    Dim M As Match = Regex.Match(SomeString, "\w+")
    Console.WriteLine(M.Result("The first word is '$&'"))

You can use this to get a copy of the text to the left and right of the match, with

    M.Result("$`") 'This is the text to the left of the match
    M.Result("$'") 'This is the text to the right of the match

During debugging, it may be helpful to display something along the lines of:

    M.Result("[$`<$&>$']")

Given a Match object created by applying \d+ to the string 'May 16, 1998', it returns '[May <16>, 1998]', clearly showing the exact match.
MatchObj.Synchronized()
MatchObj.Captures
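Several of these members can be seen working together in a short program. This is an illustrative sketch using the 'May 16, 1998' string from the Result discussion; it walks through every match of \d+ with NextMatch and reports each one.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchObjectTour
    Sub Main()
        Dim R As New Regex("\d+")
        Dim M As Match = R.Match("May 16, 1998")

        While M.Success
            ' Value, Index, and Length describe the matched text itself.
            Console.WriteLine("'{0}' at index {1}, length {2}", M.Value, M.Index, M.Length)

            ' Result expands $` (text before), $& (the match), and $' (text after).
            Console.WriteLine(M.Result("[$`<$&>$']"))

            M = M.NextMatch()
        End While
        ' '16' at index 4, length 2
        ' [May <16>, 1998]
        ' '1998' at index 8, length 4
        ' [May 16, <1998>]
    End Sub
End Module
```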
9.3.4 Using Group Objects

A Group object contains the match information for one set of capturing parentheses (or, if a zeroth group, for an entire match). It has the following properties and methods:
GroupObj.Success
GroupObj.Value
GroupObj.Length
GroupObj.Index
GroupObj.Captures
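A small sketch shows how these properties behave, including for a group that does not participate in the match. The regex and target string are invented for illustration: the third set of parentheses is optional and finds nothing to match in 'May 16'.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupObjectTour
    Sub Main()
        Dim R As New Regex("(\w+) (\d+)(?:, (\d+))?")
        Dim M As Match = R.Match("May 16")

        ' Group 0 is the whole match; groups 1 and 2 capture "May" and "16".
        ' Group 3 does not participate, so its Success is False and its
        ' Value is the empty string.
        For I As Integer = 0 To M.Groups.Count - 1
            Dim G As Group = M.Groups(I)
            Console.WriteLine("Group {0}: Success={1}, Value='{2}', Length={3}", _
                              I, G.Success, G.Value, G.Length)
        Next
    End Sub
End Module
```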