3.2 Care and Handling of Regular Expressions

The second concern outlined at the start of the chapter is the syntactic packaging that tells an application "Hey, here's a regex, and this is what I want you to do with it." egrep is a simple example because the regular expression is expected as an argument on the command line. Any extra syntactic sugar, such as the single quotes I used throughout the first chapter, is needed only to satisfy the command shell, not egrep. Complex systems, such as regular expressions in programming languages, require more complex packaging to inform the system exactly what the regex is and how it should be used.

The next step, then, is to look at what you can do with the results of a match. Again, egrep is simple in that it pretty much always does the same thing (displays lines that contain a match), but as the previous chapter began to show, the real power is in doing much more interesting things. The two basic actions behind those interesting things are match (to check if a regex matches in a string, and to perhaps pluck information from the string), and search-and-replace, to modify a string based upon a match. There are many variations of these actions, and many variations on how individual languages let you perform them.

In general, a programming language can take one of three approaches to regular expressions: integrated, procedural, and object-oriented. With the first, regular expression operators are built directly into the language, as with Perl. In the other two, regular expressions are not part of the low-level syntax of the language. Rather, normal strings are passed as arguments to normal functions, which then interpret the strings as regular expressions. Depending on the function, one or more regex-related actions are then performed. One derivative or another of this style is used by most (non-Perl) languages, including Java, the .NET languages, Tcl, Python, PHP, Emacs lisp, and Ruby.

3.2.1 Integrated Handling

We've already seen a bit of Perl's integrated approach, such as this example from Section 2.3.4:

if ($line =~ m/^Subject: (.*)/i) {
    $subject = $1;
}

Here, for clarity, variable names I've chosen are in italic, while the regex-related items are bold, and the regular expression itself is underlined. We know that Perl applies the regular expression ^Subject:•(.*) to the text held in $line, and if a match is found, executes the block of code that follows. In that block, the variable $1 represents the text matched within the regular expression's parentheses, and this gets assigned to the variable $subject.

Another example of an integrated approach is when regular expressions are part of a configuration file, such as for procmail (a Unix mail-processing utility). In the configuration file, regular expressions are used to route mail messages to the sections that actually process them. It's even simpler than with Perl, since the operands (the mail messages) are implicit.

What goes on behind the scenes is quite a bit more complex than these examples show. An integrated approach simplifies things for the programmer because it hides in the background some of the mechanics of preparing the regular expression, setting up for the match, applying the regular expression, and deriving results from that application. Hiding these steps makes the normal case very easy to work with, but as we'll see later, it can make some cases less efficient or clumsier to work with.

But, before getting into those details, let's uncover the hidden steps by looking at the other methods.

3.2.2 Procedural and Object-Oriented Handling

Procedural and object-oriented handling are fairly similar. In either case, regex functionality is provided not by built-in regular-expression operators, but by normal functions (procedural) or constructors and methods (object-oriented). In this case, there are no true regular-expression operands, but rather normal string arguments that the functions, constructors, or methods choose to interpret as regular expressions.
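To make the "normal string" point concrete, here is a minimal sketch in Python (one of the languages covered below). The variable names and sample text are mine, chosen only for illustration. The same string value is first used as plain text and then handed to a regex function, which chooses to interpret it as a regular expression:

```python
import re

pattern_text = "a.c"   # just an ordinary string value
text = "abc"

# Used as a plain string, the dot is a literal dot: "a.c" is not in "abc".
literal_hit = pattern_text in text

# Handed to a regex function, the same string is interpreted as a regex,
# and the dot matches any character.
regex_hit = re.search(pattern_text, text) is not None
```

Nothing about the string itself marks it as a regex; only the function that receives it decides that.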

The next sections show examples in Java, VB.NET, and Python.

3.2.2.1 Regex handling in Java

Let's look at the equivalent of the "Subject" example in Java, using Sun's java.util.regex package. (Java is covered in depth in Chapter 8.)

import java.util.regex.*; // Make regex classes easily available
             ·
             ·
             ·
[1] Pattern r = Pattern.compile("^Subject: (.*)", Pattern.CASE_INSENSITIVE);
[2] Matcher m = r.matcher(line);
[3] if (m.find()) {
[4]     subject = m.group(1);
    }

Variable names I've chosen are again in italic, the regex-related items are bold, and the regular expression itself is underlined. Well, to be precise, what's underlined is a normal string literal to be interpreted as a regular expression.

This example shows an object-oriented approach with regex functionality supplied by two classes in Sun's java.util.regex package: Pattern and Matcher. The actions performed are:

[1] Inspect the regular expression and compile it into an internal form that matches in a case-insensitive manner, yielding a "Pattern" object.

[2] Associate it with some text to be inspected, yielding a "Matcher" object.

[3] Actually apply the regex to see if there is a match in the previously-associated text, and let us know the result.

[4] If there is a match, make available the text matched within the first set of capturing parentheses.

Actions similar to these are required, explicitly or implicitly, by any program wishing to use regular expressions. Perl hides most of these details, and this Java implementation usually exposes them.

A procedural example. Sun's Java regex package does, however, provide a few procedural-approach "convenience functions" that hide much of the work. Rather than require you to first create a regex object, then use that object's methods to apply it, these static functions create a temporary object for you, throwing it away once done. Here's an example showing the Pattern.matches(···) function:

   if (! Pattern.matches("\\s*", line))
   {
       // . . . line is not blank . . .
   }

This function wraps an implicit ^···$ around the regex, and returns a Boolean indicating whether it can match the input string. It's common for a package to provide both procedural and object-oriented interfaces, just as Sun did here. The differences between them often involve convenience (a procedural interface can be easier to work with for simple tasks, but more cumbersome for complex tasks), functionality (procedural interfaces generally have less functionality and fewer options than their object-oriented counterparts), and efficiency (in any given situation, one is likely to be more efficient than the other, a subject covered in detail in Chapter 6).
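The whole-string behavior is easy to trip over, so here is a small Python sketch of the same distinction. The helper names are hypothetical; re.fullmatch and re.search are the standard functions of Python's re module. The first helper relies on a function that must match the entire string, and the second writes the anchors out by hand:

```python
import re

def is_blank(line):
    # fullmatch succeeds only if the entire string matches, much as if
    # the regex had been wrapped in an implicit ^...$
    return re.fullmatch(r"\s*", line) is not None

def is_blank_explicit(line):
    # search looks anywhere in the string, so the anchors are spelled out
    return re.search(r"^\s*$", line) is not None
```

For this particular regex the two helpers should agree; the difference is only in who supplies the anchoring.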

There are many regex packages for Java (half a dozen are discussed in Chapter 8), but Sun is in a position to integrate theirs with the language more than anyone else. For example, they've integrated it with the String class; the previous example can actually be written as:

   if (! line.matches("\\s*"))
   {
       // . . . line is not blank . . .
   }

Again, this is not as efficient as a properly-applied object-oriented approach, and so is not appropriate for use in a time-critical loop, but it's quite convenient for "casual" use.
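The usual remedy is to compile the regex once, outside the loop, and reuse the resulting object. Here is a sketch of both styles in Python; the sample lines and the name BLANK are made up for the illustration. Note that Python's re module caches recently compiled patterns behind its convenience functions, so the cost difference there is typically smaller than the Java discussion suggests, and this is a sketch of the coding pattern rather than a benchmark:

```python
import re

lines = ["", "   ", "Subject: hello", "\t"]

# Convenience style: the pattern string is handed over on every call.
blank_casual = [ln for ln in lines if re.fullmatch(r"\s*", ln)]

# Object-oriented style: compile once, outside the loop, and reuse the object.
BLANK = re.compile(r"\s*")
blank_compiled = [ln for ln in lines if BLANK.fullmatch(ln)]
```

Both lists end up with the same three blank lines; only the amount of per-call setup differs.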

3.2.2.2 Regex handling in VB and other .NET languages

Although all regex engines perform essentially the same basic tasks, they differ in how those tasks and services are exposed to the programmer, even among implementations sharing the same approach. Here's the "Subject" example in VB.NET (.NET is covered in detail in Chapter 9):

   Imports System.Text.RegularExpressions ' Make regex classes easily available
      .
      .
      .
   Dim R as Regex = New Regex("^Subject: (.*)", RegexOptions.IgnoreCase)
   Dim M as Match = R.Match(line)
   If M.Success
       subject = M.Groups(1).Value
   End If

Overall, this is generally similar to the Java example, except that .NET combines steps [2] and [3], and requires an extra Value in [4]. Why the differences? One is not inherently better or worse; each was simply what its developers thought was the best approach at the time. (More on this in a bit.)

.NET also provides a few procedural-approach functions. Here's one to check for a blank line:

   If Not Regex.IsMatch(Line, "^\s*$") Then
      ' . . . line is not blank . . .
   End If

Unlike Sun's Pattern.matches function, which adds an implicit ^···$ around the regex, Microsoft chose to offer this more general function. It's just a simple wrapper around the core objects, but it involves less typing and variable corralling for the programmer, at only a small efficiency expense.

3.2.2.3 Regex handling in Python

As a final example, let's look at the Subject example in Python:

   import re
      .
      .
      .
   R = re.compile("^Subject: (.*)", re.IGNORECASE)
   M = R.search(line)
   if M:
       subject = M.group(1)

Again, this looks very similar to what we've seen before.
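Like Java and .NET, Python also offers a procedural shortcut: module-level functions that take the regex string directly. A minimal sketch, with a sample line of my own invention:

```python
import re

line = "SUBJECT: Care and handling"

# Module-level function: compile, apply, and return a match object in one call.
m = re.search("^Subject: (.*)", line, re.IGNORECASE)
if m:
    subject = m.group(1)
```

Here re.search stands in for the separate compile and search steps of the object-oriented version.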

3.2.2.4 Why do approaches differ?

Why does one language do it one way, and another language another? There may be language-specific reasons, but it mostly depends on the whim and skills of the engineers who develop each package. In fact, there are many unrelated regular-expression packages for Java (see Chapter 8), each written by someone who wanted the functionality that Sun didn't originally provide. Each has its own strengths and weaknesses, but it's interesting to note that they all provide their functionality in quite different ways from each other, and from what Sun eventually decided to implement themselves.

3.2.3 A Search-and-Replace Example

The "Subject" example is pretty simple, so the various approaches don't have an opportunity to show how different they really are. In this section, we'll look at a somewhat more complex example, further highlighting the different designs.

In the previous chapter (see Section 2.3.6.5), we saw this Perl search-and-replace to "linkize" an email address:


   $text =~ s{
      \b
      # Capture the address to $1 . . .
      (
         \w[-.\w]*                          # username
         @
         [-\w]+(\.[-\w]+)*\.(com|edu|info)  # hostname
      )
      \b
   }{<a href="mailto:$1">$1</a>}gix;

Let's see how this is done in other languages.

3.2.3.1 Search-and-replace in Java

Here's the search-and-replace example with Sun's java.util.regex package:

   import java.util.regex.*; // Make regex classes easily available
         .
         .
         .
   Pattern r = Pattern.compile(
      "\\b                                                   \n"+
      "# Capture the address to $1 . . .                     \n"+
      "(                                                     \n"+
      "  \\w[-.\\w]*                            # username   \n"+
      "    @                                                 \n"+
      "  [-\\w]+(\\.[-\\w]+)*\\.(com|edu|info)  # hostname   \n"+
      ")                                                     \n"+
      "\\b                                                   \n",
      Pattern.CASE_INSENSITIVE|Pattern.COMMENTS);

   Matcher m = r.matcher(text);
   String result = m.replaceAll("<a href=\"mailto:$1\">$1</a>");
   System.out.println(result);

There are a number of things to note. Perhaps the most important is that each '\' wanted in the regular expression requires '\\' in the string literal. Thus, using '\\w' in the string literal results in '\w' in the regular expression. This is because regular expressions are provided as normal Java string literals, which, as we've seen before (see Section 2.2.3.1), require special handling. For debugging, it might be useful to use

   System.out.println(r.pattern());

to display the regular expression as the regex function actually received it. One reason that I include newlines in the regex is so that it displays nicely when printed this way. Another reason is that each '#' introduces a comment that goes until the next newline; so, at least some of the newlines are required to restrain the comments.
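The same debugging idea carries over to other languages that take regexes as strings. As a rough Python sketch (the variable names are mine, and the regex is just a fragment of the address pattern), the following builds one regex from a doubled-backslash literal and one from a raw string literal, which Python provides to pass backslashes through untouched, and then prints what the regex engine actually received:

```python
import re

# In an ordinary string literal, each regex backslash must be doubled...
doubled = re.compile("\\w[-.\\w]*@[-\\w]+")

# ...while a raw string literal passes backslashes through untouched.
raw = re.compile(r"\w[-.\w]*@[-\w]+")

# Both literals produce the same regex text; printing it shows what
# the regex engine actually received.
print(raw.pattern)
```

The printed text has single backslashes, which is the regex as the engine sees it, regardless of which literal style produced it.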

Perl uses notations like /g, /i, and /x to signify special conditions (these are the modifiers for replace-all, case-insensitivity, and free-formatting modes; see Section 3.4.4), but java.util.regex uses either different functions (replaceAll vs. replaceFirst) or flag arguments passed to the function (e.g., Pattern.CASE_INSENSITIVE and Pattern.COMMENTS).

3.2.3.2 Search-and-replace in VB.NET

The general approach in VB.NET is similar:

   Dim R As Regex = New Regex _
   ("\b                                                " & _
    "(?# Capture the address to $1 . . . )             " & _
    "(                                                 " & _
    "  \w[-.\w]*                        (?# username)  " & _
    "  @                                               " & _
    "  [-\w]+(\.[-\w]+)*\.(com|edu|info)(?# hostname)  " & _
    ")                                                 " & _
    "\b                                                ",  _
    RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)

   Dim Copy As String = R.Replace (text, "<a href=""mailto:${1}"">${1}</a>")
   Console.WriteLine(Copy)

Due to the inflexibility of VB.NET string literals (they can't span lines, and it's difficult to get newline characters into them), longer regular expressions are not as convenient to work with as in some other languages. On the other hand, because '\' is not a string metacharacter in VB.NET, the expression can be less visually cluttered. A double quote is a metacharacter in VB.NET string literals: to get one double quote into the string's value, you need two double quotes in the string literal.
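For comparison with the Java and VB.NET versions, here is how the same replacement might be sketched in Python. The names ADDRESS and linkize are hypothetical; re.VERBOSE and re.IGNORECASE are the re module's counterparts to Perl's /x and /i, and a raw triple-quoted string lets the regex span lines without doubled backslashes:

```python
import re

# re.VERBOSE plays the role of Perl's /x: whitespace in the pattern is
# ignored and '#' starts a comment that runs to the end of the line.
ADDRESS = re.compile(r"""
    \b
    # Capture the address to group 1 . . .
    (
      \w[-.\w]*                          # username
      @
      [-\w]+(\.[-\w]+)*\.(com|edu|info)  # hostname
    )
    \b
    """, re.IGNORECASE | re.VERBOSE)

def linkize(text):
    # \1 in the replacement refers to the text captured by group 1.
    return ADDRESS.sub(r'<a href="mailto:\1">\1</a>', text)
```

As with Java's replaceAll, the sub method replaces every match by default, so no equivalent of Perl's /g is needed.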

3.2.4 Search and Replace in Other Languages

Let's quickly look at a few examples from other traditional tools and languages.

3.2.4.1 Awk

Awk uses an integrated approach, /regex/, to perform a match on the current input line, and uses "var ~ ···" to perform a match on other data. You can see where Perl got its notation for matching. (Perl's substitution operator, however, is modeled after sed's.) The early versions of awk didn't support a regex substitution, but modern versions have the sub(···) operator:

   sub(/mizpel/, "misspell")

This applies the regex mizpel to the current line, replacing the first match with misspell. Note how this compares to Perl's (and sed's) s/mizpel/misspell/.

To replace all matches within the line, awk does not use any kind of /g modifier, but a different operator altogether: gsub(/mizpel/, "misspell").
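Other languages draw the first-versus-all line in yet other ways. In Python, for instance, a single function covers both cases, with an optional count argument limiting the number of replacements. A small sketch, using a sample line of my own:

```python
import re

line = "I mizpel, you mizpel"

# count=1 replaces only the first match, like awk's sub(...)
first_only = re.sub(r"mizpel", "misspell", line, count=1)

# with no count, every match is replaced, like awk's gsub(...)
all_matches = re.sub(r"mizpel", "misspell", line)
```

Unlike awk's sub and gsub, which modify the current line in place, re.sub returns a new string and leaves the original untouched.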

3.2.4.2 Tcl

Tcl takes a procedural approach that might look confusing if you're not familiar with Tcl's quoting conventions. To correct our misspellings with Tcl, we might use:

   regsub mizpel $var misspell newvar

This checks the string in the variable var, and replaces the first match of mizpel with misspell, putting the now possibly-changed version of the original string into the variable newvar (which is not written with a dollar sign in this case). Tcl expects the regular expression first, the target string to look at second, the replacement string third, and the name of the target variable fourth. Tcl also allows optional flags to its regsub, such as -all to replace all occurrences of the match instead of just the first:

   regsub -all mizpel $var misspell newvar

Also, the -nocase option causes the regex engine to ignore the difference between uppercase and lowercase characters (just like egrep's -i flag, or Perl's /i modifier).

3.2.4.3 GNU Emacs

The powerful text editor GNU Emacs (just "Emacs" from here on) supports elisp (Emacs lisp) as a built-in programming language. It provides a procedural regex interface with numerous functions providing various services. One of the main ones is re-search-forward, which accepts a normal string as an argument and interprets it as a regular expression. It then starts searching the text from the "current position," stopping at the first match, or aborting if no match is found. (This function is invoked when one invokes a "regexp search" while using the editor.)

As Table 3-3 shows, Emacs' flavor of regular expressions is heavily laden with backslashes. For example, \<\([a-z]+\)\([\n•\t]\|<[^>]+>\)+\1\> is an expression for finding doubled words, similar to the problem in the first chapter. We couldn't use this regex directly, however, because the Emacs regex engine doesn't understand \t and \n. Emacs double-quoted strings, however, do, and convert them to the tab and newline values we desire before the regex engine ever sees them. This is a notable benefit of using normal strings to provide regular expressions. One drawback, particularly with elisp's regex flavor's propensity for backslashes, is that regular expressions can end up looking like a row of scattered toothpicks. Here's a small function for finding the next doubled word:

   (defun FindNextDbl ()
      "move to next doubled word, ignoring <···> tags"
      (interactive)
      (re-search-forward "\\<\\([a-z]+\\)\\([\n \t]\\|<[^>]+>\\)+\\1\\>")
   )

Combine that with (define-key global-map "\C-x\C-d" 'FindNextDbl) and you can use the "Control-x Control-d" sequence to quickly search for doubled words.
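To see how much of the toothpick effect comes from the flavor rather than the task, here is a rough Python counterpart. The function name and its return convention are my own, and \b only approximates Emacs' \< and \> word-boundary anchors:

```python
import re

# Same idea as the elisp regex: a word, then one or more separators
# (whitespace or a <...> tag), then the same word again.
DOUBLED = re.compile(r"\b([a-z]+)(?:[\n \t]|<[^>]+>)+\1\b")

def find_next_doubled(text, start=0):
    """Return the offset of the next doubled word at or after start, or -1."""
    m = DOUBLED.search(text, start)
    return m.start() if m else -1
```

With a raw string and unescaped parentheses for grouping, the same regex needs far fewer backslashes than the elisp version.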

3.2.5 Care and Handling: Summary

As you can see, there's a wide range of functionalities and mechanics for achieving them. If you are new to these languages, it might be quite confusing at this point. But, never fear! When trying to learn any one particular tool, it is a simple matter to learn its mechanisms.
