Store Your Patterns Externally

Storing your regex patterns in an external file offers three important advantages. First, the pattern doesn't have to be delimited a second time for a Java String literal, so it's easier to read. Second, as a direct corollary of the first, you can use generic, non-Java-delimited regular expressions with your code. Third, moving the actual pattern to an external file allows you to change the pattern later without having to recompile the class.

Note

In this context, when I use the term "Java-delimited," I mean the double delimiting of the \ character with another \ character in Strings, so that \ is written as \\.

A Java regex pattern can be a little confusing, especially to someone unfamiliar with how the String and regex delimiters work together. For example, consider the following e-mail descriptor, which is freely available from http://www.regexlib.com. It's difficult enough to parse out exactly what ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ means without having to add the somewhat awkward \\ sequences that a Java String literal requires. It's enough to make you reconsider using regex altogether.

Also, the process of changing the regex to be Java-delimited strings introduces fat-finger risks. Unsurprisingly, because your Java regex won't look conventional, it's that much more difficult to solicit help from regex gurus, most of whom are probably more familiar with Perl.
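To make the double delimiting concrete, here is a minimal sketch. The class name EscapeDemo and the short e-mail-like pattern are illustrative assumptions, not taken from the text. The regex a Perl user would write as \w+@\w+\.com has to be typed with every backslash doubled in Java source, even though the resulting String holds single backslashes.

```java
import java.util.regex.Pattern;

class EscapeDemo {

    // Conventional regex: \w+@\w+\.com
    // As a Java literal, each backslash must be doubled.
    static final String JAVA_LITERAL = "\\w+@\\w+\\.com";

    // Returns true if the whole input matches the regex
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The String itself contains single backslashes
        System.out.println(JAVA_LITERAL);
        // 12 characters, although the literal above needed 15
        System.out.println(JAVA_LITERAL.length());
        System.out.println(matches(JAVA_LITERAL, "me@example.com"));
    }
}
```

Forgetting to double even one backslash either fails to compile or silently changes the pattern, which is the fat-finger risk described above.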

It's also very possible that as your requirements evolve, you may need to change the regular expression without changing any other part of the code. Thus, it's a good idea to store your regex patterns in an external file and then retrieve them when you need them. This offers an opportunity to kill two birds with one stone, because you would, conceivably, store your regex patterns externally and in such a manner that the double delimiting isn't required. But which persistence mechanism should you use? The next few sections discuss some options.

Don't Use Normal Property Files to Store Patterns

At first glance, the solution to storing patterns externally seems to be property files. They're already in place, tried and true, easy to use, easy to modify, intuitive in format, and object oriented. However, Java property files are, unfortunately, not quite up to the task, because of three very strong objections.

The first objection is that property values need to be delimited exactly as Strings need to be delimited in order to work properly. That is, the pattern \w would have to be stored as \\w, and the pattern ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ becomes the unwieldy

^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$

This is hardly an improvement as far as legibility is concerned.

Second, Properties objects preprocess certain characters, including \t, \n, and \\. That is, you have to understand the internal mechanisms of how the Properties object stores "special" characters before you use them to store your regex pattern. This is hardly an improvement, as far as abstraction is concerned.

Finally, property files internally use the \ character to delimit the end of a line, so that it can be continued on the next line. Thus, the property file entry

Produce = carrots, \
lettuce

is read as

Produce = carrots, lettuce

Accordingly, using Properties objects actually increases the complexity of the pattern, because it introduces new things you have to know about in addition to the regex. Again, this is hardly an improvement as far as ease of maintenance is concerned. All things considered, using property files to store regex patterns seems like an invitation for bugs. We can do better.
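The three objections can be reproduced with a small sketch. The class name PropertiesEscapeDemo and the sample keys are assumptions made for illustration, and the property-file text is supplied from memory rather than from disk. It loads the text through java.util.Properties and returns what actually comes back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEscapeDemo {

    // Loads the given property-file text and returns the value stored for key
    static String loadValue(String fileText, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(
                    fileText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // On disk this entry would read: word=\w+
        String single = "word=\\w+\n";
        // On disk this entry would read: word=\\w+
        String doubled = "word=\\\\w+\n";
        // A trailing backslash continues the entry on the next line
        String continued = "Produce = carrots, \\\nlettuce\n";

        System.out.println(loadValue(single, "word"));       // w+ (backslash dropped)
        System.out.println(loadValue(doubled, "word"));      // \w+
        System.out.println(loadValue(continued, "Produce")); // carrots, lettuce
    }
}
```

The single-backslash entry loses its backslash without any error, so a pattern pasted from a regex library would quietly stop meaning what it was supposed to mean.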

Don't Use XML to Store Patterns

XML is also a reasonable candidate for external storage of regex patterns. On the positive side, XML is an easy, universally accepted way to keep external system data, and there are many ways to access that data. As a matter of fact, J2SE 1.4 offers the extraordinarily useful XMLEncoder and XMLDecoder classes, which make it easier than ever to persist and retrieve XML data. XMLEncoder and XMLDecoder are new Java classes that can help store JavaBeans as serialized XML—better yet, the XML is human readable and maintainable, so changing the XML actually changes the attributes of the object when it is deserialized. Moreover, depending on which XML persistence mechanism you use, the XML parsers may not require you to Java-delimit your patterns.

There are two big problems with the XML persistence approach, however. The first problem is that the easiest XML persistence mechanism, the XMLEncoder and XMLDecoder classes, escapes certain characters in Strings as they're stored. Thus, the < character becomes &lt;, > becomes &gt;, and so on. Second, the XML persistence mechanism, simply by virtue of being XML, has a lot of cluttering metadata that isn't germane to regex. But the proof, as they say, is in the pudding. To illustrate this point, let's try an XML-based approach in the next section and see how it works out.

XML Persistence Example

Because you want to try storing your regular expressions as XML, you'll create a regex-friendly JavaBean, say Regex.java, that has two member variables: a String to hold the regex itself and a String description to explain what it's supposed to do. Listing 4-2 lists the code for the Regex bean. The idea is to easily store the Regex bean as XML by using XMLEncoder, thus making modifications to the XML easier.

Listing 4-2: The Regex Bean

/**
* Holds a regex-friendly object, so that we can try
* persisting regex descriptions as XML
*/
public class Regex implements java.io.Serializable {

    /**
    * sets the String description
    *
    *@param String description
    */
    public void setDescription(String description){
        this.description = description;
    }

    /**
    * gets String description
    *
    *@return String description
    */
    public String getDescription(){
        return this.description;
    }

    /**
    * sets the String regex
    *
    *@param String regex
    */
    public void setRegex(String regex){
        this.regex = regex;
    }

    /**
    * gets String regex
    *
    *@return String regex
    */
    public String getRegex(){
        return this.regex;
    }

    private String regex;
    private String description;
}

Next, you write the persistence code. I'm not going to delve into how to use the XMLEncoder and XMLDecoder objects; that is both off-topic and very easy to figure out from reading Listing 4-3. For a reference on XMLEncoder and XMLDecoder, please review Sun's documentation.

Listing 4-3: The Persistence Code

import java.io.*;
import java.beans.*;

/**
* Helps persist a Serializable object to XML
* and back again.
*/
public class XMLHelper{
    public static void main(String args[]){
        Regex regex = new Regex();
        regex.setRegex("<((?i)TITLE>)(.*?)</\\1");

        String desc =
        "extracts the title element from an html page";

        regex.setDescription(desc);

        saveXML(regex, "htmlTitle.xml");

    }

    /**
    * Saves the Serializable as an XML file
    * @param ser the object to persist
    * @param fileName the file to save it to.
    */
    public static void saveXML(Serializable ser, String fileName){
        try{
            XMLEncoder e = new XMLEncoder(
            new BufferedOutputStream(
            new FileOutputStream(fileName)));
            e.writeObject(ser);
            e.close();
        }
        catch(IOException ioe){
            ioe.printStackTrace();
        }

    }
    /**
    * Gets the Serializable from an XML file
    * @param fileName the file to get the data from.
    * @return the object restored from the file, or null
    * if the file could not be read
    */
    public static Serializable getXML(String fileName){
        Serializable retval = null;
        try{
            XMLDecoder d = new XMLDecoder(
            new BufferedInputStream(
            new FileInputStream(fileName)));
            retval = (Serializable)d.readObject();
            d.close();
        }
        catch(IOException ioe){
            ioe.printStackTrace();
        }

        return retval;
    }
}

The regex pattern you're storing in this case, <((?i)TITLE>)(.*?)</\1, extracts the content of the first title element from an HTML file. See the "FAQs" section for a precise breakdown of how this works.

The XMLHelper class has two methods, saveXML and getXML, which persist and retrieve the state of the object. The XML for the Regex object stored appears in Listing 4-4.

Listing 4-4: The XML State of the Persisted Regex Object

<?xml version="1.0" encoding="UTF-8"?>
<java version="1.4.1" class="java.beans.XMLDecoder">
 <object class="Regex">
  <void property="description">
   <string>extracts the title element from an html page</string>
  </void>
  <void property="regex">
     <string>&lt;((?i)TITLE&gt;)(.*?)&lt;/\1</string>
  </void>
 </object>
</java>

As you can see, most of this information is human legible and easy to maintain. If you were to change this XML file and deserialize the object, the deserialized object would reflect your changes. That's pretty powerful, but still not as robust as we would like. You'll notice the line

   <string>&lt;((?i)TITLE&gt;)(.*?)&lt;/\1</string>

actually changed the input, as predicted, replacing the < character with &lt; and the > character with &gt;. That's not particularly conducive to readability. Also, you'll notice that the XML has a great deal of metadata about the type of the object, the name of the property, and so on. We don't need any of this. All in all, this is an improvement over the Properties approach, but we can do better still.

Note

Observant readers may have noticed that manually changing the XML element to be a CDATA section can sidestep the issue of XML decoration and still allow the object deserialization process to work. However, as far as I know, that behavior isn't guaranteed to be platform independent, and it still doesn't address the metadata clutter issue.

Use FileChannels and ByteBuffers to Store Patterns

The Java new I/O (NIO) paradigm offers an easy and elegant solution for externalizing patterns. Without digging too deeply into NIO, you can use the code in Listing 4-5 to extract a regex pattern from a persisted file.

Listing 4-5: Extract a Non-Java-Delimited Regex Pattern from a File

import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.channels.*;

/**
* Provides an easy mechanism for extracting the regex contents
* of a file
*/
public class RegexHelper{

   /**
   * Extracts the contents of the given file. This
   * particular extraction process is specifically
   * expecting the content of the file to be a
   * non-Java-delimited regex pattern.
   *
   * @param fileName the name of the file that
   * has the regex pattern.
   * @return a string holding the content of the file,
   * or null if the file could not be read
   * @author Mehran Habibi
   **/
   public static String getRegex(String fileName){
       String retval = null;
       FileInputStream fis = null;
       try
       {
           //open a connection to the file
           fis = new FileInputStream(fileName);

           //get a file channel
           FileChannel fc = fis.getChannel();

           //create a ByteBuffer that is large enough
           //to hold the contents of the file
           ByteBuffer bb = ByteBuffer.allocate((int)fc.size());

           //a single read may not fill the buffer, so keep
           //reading until it is full or the file ends
           while (bb.hasRemaining() && fc.read(bb) != -1) { }
           bb.flip();

           //persist the content of the file as a String
           retval = new String(bb.array(), 0, bb.limit());
       }
       catch(IOException ioe)
       {
           ioe.printStackTrace();
       }
       finally
       {
           //release the stream, which also closes its FileChannel
           if (fis != null)
           {
               try { fis.close(); } catch(IOException ignored) { }
           }
       }

       return retval;
   }
}

Of course, it's not necessary to use FileChannels here; any mechanism that performs a byte-level read of a file would have yielded the same result. However, because reading bytes is so natural with NIO, it seems like the best overall solution. The result here is that a file could contain nothing but a pure regex pattern, without the Java delimitation, and still continue to work properly.
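As one illustration of that point, the sketch below reads a pattern with a plain byte-level read instead of a FileChannel. It assumes Java 8 or later (java.nio.file.Files and UncheckedIOException don't exist in J2SE 1.4). The names PlainReadDemo, readPattern, and roundTrip are hypothetical, and the ISO 8859-1 decoding is an assumption chosen so that every byte maps to exactly one character.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PlainReadDemo {

    // Reads the whole file as bytes and decodes it as ISO 8859-1
    static String readPattern(String fileName) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(fileName));
            return new String(bytes, StandardCharsets.ISO_8859_1);
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    // Writes the pattern to a temporary file, reads it back, and cleans up
    static String roundTrip(String pattern) {
        try {
            Path tmp = Files.createTempFile("regex", ".txt");
            try {
                Files.write(tmp, pattern.getBytes(StandardCharsets.ISO_8859_1));
                return readPattern(tmp.toString());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    public static void main(String[] args) {
        // Backslashes are doubled here only because this is Java source;
        // the file on disk holds the plain, undelimited pattern
        String pattern = "^\\d{1,2}\\/\\d{1,2}\\/\\d{4}$";
        System.out.println(roundTrip(pattern).equals(pattern));
    }
}
```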

Note

Make sure that the pattern doesn't have any extra spaces or returns after the last character, as that will be read in as part of the pattern proper and cause your searches to fail.
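The failure described in this note is easy to reproduce. The following sketch uses a hypothetical helper, stripTrailingLineBreaks, that removes only trailing carriage returns and newlines. It deliberately leaves trailing spaces alone, on the assumption that a space at the end of a pattern might be intentional.

```java
import java.util.regex.Pattern;

class TrailingNewlineDemo {

    // Removes any run of \r or \n characters at the very end of the text
    static String stripTrailingLineBreaks(String text) {
        return text.replaceAll("[\\r\\n]+\\z", "");
    }

    public static void main(String[] args) {
        // What a byte-level read returns if the file ends with a newline
        String fromFile = "\\d+\n";

        // The newline is now part of the pattern, so "123" no longer matches
        System.out.println(Pattern.matches(fromFile, "123"));

        // After stripping the line break, the pattern behaves as intended
        System.out.println(
            Pattern.matches(stripTrailingLineBreaks(fromFile), "123"));
    }
}
```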

Of course, this isn't as friendly as it could be—it would be nice, for example, if you could treat the file like a property file and define various keys in it. After all, you don't want to be forced to have a separate file for each regex pattern. For example, you could pass in a key, along with the filename, so that you could store numerous patterns in the same file. With this in mind, add the overloaded method getRegex, as shown in Listing 4-6.

Listing 4-6: Overloaded getRegex Method

/**
  * Extracts the contents of the given file at the given key.
  * This particular extraction process is specifically
  * expecting the content of the file to be a
  * non-Java-delimited regex pattern.
  *
  * @param fileName the name of the file that
  * has the regex pattern.
  * @param key the key that defines the regex in the file
  * @return the pattern stored under the given key, or null
  * if the file could not be read or the key was not found
  * @author Mehran Habibi
  **/
  public static String getRegex(String fileName, String key){
   String retval = null;

   //get content of the file
   String content = getRegex(fileName);

   //if the file has content, then try to find the key
   if (content != null)
   {
    //look for a beginning of line, followed by the key,
    //followed by an equal sign, and capture everything between
    //that key and the end of the line.
    String keyRegex = "^"+key+"=(.*)$";

    //we expect the output to have multiple lines
    Pattern pattern = Pattern.compile(keyRegex,Pattern.MULTILINE);

    //extract the matcher, and look for the value to the key
    Matcher matcher = pattern.matcher(content);

    if (matcher != null && matcher.find())
       retval = matcher.group(1);

    }

    return retval;
    }

The point here is that you can use and write your regex patterns in a non-Java-specific syntax, at least as far as the double delimiting of the String syntax is concerned. Try testing this code with the sample data shown in Listing 4-7.

Listing 4-7: Sample Content of a Regex Cache File

#Email validator that adheres directly to the specification
#for email address naming. It allows for everything from
#ipaddress and country-code domains to very rare characters
#in the username.
email=^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

#Matches UK postcodes according to the following rules 1. LN NLL
#eg N1 1AA 2. LLN NLL eg SW4 0QL 3. LNN NLL eg M23 4PJ 4. LLNN NLL
#eg WS14 0JT 5. LLNL NLL eg SW1N 4TB 6. LNL NLL eg W1C 8LQ Thanks
#to Simon Bell for informin ...
zip=^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$

#This regular expression matches dates of the form XX/XX/YYYY
#where XX can be 1 or 2 digits long and YYYY is always 4
#digits long.
dates=^\d{1,2}\/\d{1,2}\/\d{4}$

Thus, to extract a regex pattern, you would simply write the following code:

String regex = RegexHelper.getRegex("regexCache.txt", "email");

You'll notice that the data cache file also allows comments. You'll find these particularly useful as you drift away from the details of the pattern itself after having written it.
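The keyed lookup can also be tried without a file on disk. The sketch below is an illustration under stated assumptions, not the listing's code: the class KeyedLookupDemo and its lookup method are hypothetical, the cache content is an in-memory String, and the key is wrapped in Pattern.quote (available from Java 5) so that a key containing regex metacharacters is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyedLookupDemo {

    // Stand-in for the cache file. Backslashes are doubled only because
    // this is Java source; the file itself would hold single backslashes.
    static final String CONTENT =
          "#dates of the form XX/XX/YYYY\n"
        + "dates=^\\d{1,2}\\/\\d{1,2}\\/\\d{4}$\n"
        + "zip=^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$\n";

    // Returns the text after "key=" on the first line that starts
    // with the key, or null if no such line exists
    static String lookup(String content, String key) {
        Pattern p = Pattern.compile(
            "^" + Pattern.quote(key) + "=(.*)$", Pattern.MULTILINE);
        Matcher m = p.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String regex = lookup(CONTENT, "dates");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "12/25/2003"));
    }
}
```

Because the lookup anchors the key at the beginning of a line, the comment line that happens to contain the word "dates" is never mistaken for the entry itself.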

Finally, it should be noted that there are several different and valid approaches to take here. One approach would be to subclass the Properties class yourself and offer a regex-friendly implementation. In the future, I wouldn't be surprised if Sun decided to do something like this or offer some other convenience mechanism for extracting regex patterns from a file without requiring that the regex be Java-delimited. In the meantime, Listing 4-8 provides a read-only implementation of a custom property file reader that allows regex patterns to be stored without being double delimited.

Listing 4-8: Custom Property File Reader

import java.util.*;
import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.util.logging.Logger;

/**
* Provides a read-only extension of the java.util.Properties class.
* This class is unique because it is especially designed to read in
* regular expressions that are not double delimited, as the String
* class requires. Thus, \s is the actual string used to represent a
* whitespace character, not \\s. Accordingly, this class does not allow
* the regex patterns to be modified programmatically, nor does it
* follow the normal property file convention for \n,\t, etc., or
* multiline properties. Please see the documentation for the
* load method
*/
public class RegexProperties extends Properties{
    private static Logger log = Logger.getAnonymousLogger();

    /**
    * See load(FileInputStream inStream)
    *
    * @param inStream the name of the file to load
    * @throws IOException if there's an IO problem
    * @throws PatternSyntaxException if the File format isn't properly
    * formed, per the specification given below.
    */
    public void load(String inStream)
    throws IOException, PatternSyntaxException{
        FileInputStream in = new FileInputStream(inStream);
        try{
            load(in);
        }
        finally{
            //closing the stream also closes its FileChannel
            in.close();
        }
    }

    /**
    * Specialized property file for reading regular expressions
    * stored as properties. Reads a property list (key and
    * element pairs) from the input stream using a FileChannel,
    * thus allowing the usage of all characters. The stream is
    * assumed to be using the ISO 8859-1 character encoding.
    * Every property occupies one line of the input stream. Each
    * line is terminated by a line terminator (\n or \r or \r\n).
    * The entire contents of the file are read in.
    *
    * A line that contains only whitespace or whose first
    * nonwhitespace character is an ASCII # or ! is ignored
    * (thus, # or ! indicate comment lines).
    *
    * Every line other than a blank line or a comment line describes
    * one property to be added to the table. The line is split at
    * the first ASCII = character or, if the line contains no =
    * after its first character, at the first ASCII : character.
    * The text before the separator, with surrounding whitespace
    * removed, is the key. Whitespace characters after the = or :
    * are <B>not</B> skipped, and become part of the
    * value. This is a deliberate change from the default behavior of
    * the class, in order to support regular expressions, which may very
    * well need those characters. All remaining characters on the line
    * become part of the associated element string. If the last
    * character on the line is \, then the next line is <B>not</B>
    * treated as a continuation of the current line. Again, this is a
    * deliberate change from the default behavior of the class, in
    * order to support regular expressions.
    *
    * @param inStream the actual property file
    * @throws IOException if there's an IO problem
    * @throws PatternSyntaxException if the File format isn't properly
    * formed, per the specification given above.
    */
    public void load(FileInputStream inStream)
    throws IOException, PatternSyntaxException{
        // load the contents of the file
        FileChannel fc = inStream.getChannel();

        ByteBuffer bb = ByteBuffer.allocate((int)fc.size());

        //a single read may not fill the buffer, so keep
        //reading until it is full or the file ends
        while (bb.hasRemaining() && fc.read(bb) != -1) { }
        bb.flip();

        //decode with the encoding that the documentation promises
        String fileContent =
            new String(bb.array(), 0, bb.limit(), "ISO-8859-1");

        //define a pattern that breaks the contents down line by line
        Pattern pattern = Pattern.compile("^(.*)$",Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(fileContent);

        //iterate through the fileContent, line by line
        while (matcher.find()){
            //extract the relevant part of each line.
            //in this case, relevant means the characters
            //between the beginning of the line and its end
            String line = matcher.group(1);
            String trimmed = line.trim();

            //if the line is blank or a comment, ignore it
            if (
               !"".equals(trimmed) &&
               !trimmed.startsWith("#") &&
               !trimmed.startsWith("!")
            )
            {
                String keyValue[] = null;

                //was the key value entry split with the '='
                //character or the ':' character? Both are legal.
                if (line.indexOf("=") > 0 )
                  keyValue = line.split("=",2);
                else
                  keyValue = line.split(":",2);

                //a line with no separator is malformed; report it
                //rather than failing on a missing array element
                if (keyValue.length < 2)
                {
                    String msg = "no '=' or ':' separator found";
                    throw new PatternSyntaxException(msg, line, -1);
                }

                //the key is trimmed, the value is stored untouched
                super.put(keyValue[0].trim(),keyValue[1]);
            }
        }
    }

    /**
    * Not supported. This is designed to be a read-only class
    * only. Throws UnsupportedOperationException.
    * @param out the stream that would have been written to
    * @param header a description of the property list
    * @throws UnsupportedOperationException always
    */
    public void store(OutputStream out, String header)
    throws UnsupportedOperationException
    {
        String msg = "unsupported for this class";
        throw new UnsupportedOperationException(msg);
    }

    /**
    * Not supported.
    * @param t Mappings to be stored in this map.
    * @throws UnsupportedOperationException always
    */
    public void putAll(Map t)
    {
        String msg = "unsupported for this class";
        throw new UnsupportedOperationException(msg);
    }
}
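The line-handling rules that the listing documents (skip blank and comment lines, split at the first = or :, trim the key, leave the value untouched) can be exercised in isolation. The following is a condensed sketch of those rules, not the listing's code: the class RegexLineParserDemo and its parse method are hypothetical, the input is an in-memory String rather than a file, and a line with no separator is silently skipped here, which is an assumption made to keep the sketch short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RegexLineParserDemo {

    // Parses property-style text without any backslash processing
    static Map<String, String> parse(String content) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : content.split("\\r?\\n|\\r")) {
            String trimmed = line.trim();

            // skip blank lines and comment lines
            if (trimmed.isEmpty()
                    || trimmed.startsWith("#")
                    || trimmed.startsWith("!")) {
                continue;
            }

            // prefer '=' as the separator, fall back to ':'
            int sep = line.indexOf('=');
            if (sep <= 0) {
                sep = line.indexOf(':');
            }
            if (sep <= 0) {
                continue; // no usable separator on this line
            }

            // the key is trimmed; the value is stored exactly as written
            result.put(line.substring(0, sep).trim(),
                       line.substring(sep + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // On disk the first entry would read: ws=\s+
        String content = "#whitespace run\nws=\\s+\nlabel: a b\n";
        Map<String, String> props = parse(content);
        System.out.println(props.get("ws"));
        System.out.println("[" + props.get("label") + "]");
    }
}
```

Note that the value for label keeps its leading space, which is exactly the behavior a regex that begins with a space depends on.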