8.5 A Quick Look at Jakarta-ORO

Jakarta-ORO (from now on, just "ORO") is a vast, modular framework of mostly regex-related text-processing features containing a dizzying eight interfaces and 35+ classes. When first faced with the documentation, you can be intimidated until you realize that you can get an amazing amount of use out of it by knowing just one class, Perl5Util, described next.

8.5.1 ORO's `Perl5Util`

This ORO version of the example from Section 8.4.3 shows how simple Perl5Util is to work with:

      import org.apache.oro.text.perl.Perl5Util;

      public class SimpleRegexTest {
        public static void main(String[] args)
        {
           String sampleText = "this is the 1st test string";
           Perl5Util engine = new Perl5Util();

           if (engine.match("/\\d+\\w+/", sampleText)) {
               String matchedText = engine.group(0);
               int    matchedFrom = engine.beginOffset(0);
               int    matchedTo   = engine.endOffset(0);
               System.out.println("matched [" + matchedText + "] from " +
                                  matchedFrom + " to " + matchedTo + ".");
           } else {
               System.out.println("didn't match");
           }
        }
      }

One class hides all the messy details about working with regular expressions behind a simple façade that somewhat mimics regular-expression use in Perl.
Where Perl has

     $input =~ /^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i

(from an example in Chapter 2; see Section 2.2.3.2), ORO allows:

     engine.match("/^([-+]?[0-9]+(\\.[0-9]*)?)\\s*([CF])$/i", input)

Where Perl then has

     $InputNum = $1; # Save to named variables to make the ...
     $Type     = $3; # ... rest of the program easier to read.

ORO provides for:

     inputNum = engine.group(1); // Save to named variables to make the ...
     type     = engine.group(3); // ... rest of the program easier to read.

If you're not familiar with Perl, the /···/i trappings may seem a bit odd, and they can be cumbersome at times, but they lower the barrier to regex use about as far as it can go in Java.^[3] (Unfortunately, not even ORO can get around the extra escaping required to get regex backslashes and double quotes into Java string literals.)

^[3] One further step, I think, would be to remove the Perl trappings and just have separate arguments for the regex and modifier. The whole m/···/ bit may be convenient for those coming to Java from a Perl background, but it doesn't seem "natural" in Java.

Even substitutions can be simple. An example from Chapter 2 to "commaify" a number (see Section 2.3.5.5) looks like this in Perl:

      $text =~ s/(\d)(?=(\d\d\d)+(?!\d))/$1,/g;

and this with ORO:

      text = engine.substitute("s/(\\d)(?=(\\d\\d\\d)+(?!\\d))/$1,/g", text);
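For comparison, here is a minimal sketch of the same commaify substitution written against java.util.regex instead of ORO. The class and method names (Commaify, commaify) are made up for illustration. The regex itself is unchanged; only the Perl-style s/···/···/g wrapper is gone, because String.replaceAll takes the regex and the replacement as separate arguments and always replaces every match.

```java
class Commaify {
    // Insert a comma after each digit that is followed by one or more
    // complete groups of three digits and then by no further digit.
    static String commaify(String text) {
        return text.replaceAll("(\\d)(?=(\\d\\d\\d)+(?!\\d))", "$1,");
    }

    public static void main(String[] args) {
        System.out.println(commaify("1234567"));
        System.out.println(commaify("population 281421906 or so"));
    }
}
```

Numbers of three or fewer digits pass through untouched, since the lookahead can never succeed for them.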

Traditionally, regular-expression use in Java has a class model that involves precompiling the regex to some kind of pattern object, and then using that object later when you actually need to apply the regex. The separation is for efficiency, so that repeated uses of a regex don't have to suffer the cost of compiling each time.

So, how does Perl5Util, with its procedural approach of accepting the raw regex each time, stay reasonably efficient? It caches the results of the compile, keeping a behind-the-scenes mapping between a string and the resulting regex object. (See "Compile caching in the procedural approach" in Chapter 6, Section 6.4.4.1.2.)

It's not perfectly efficient, as the argument string must be parsed for the regex delimiters and modifiers each time, so there's some extra overhead, but the caching keeps it reasonable for casual use.
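The idea of compile caching can be sketched in a few lines. This is not ORO's implementation, just an illustration of the technique using java.util.regex and a plain HashMap; the class name CachingMatcher and its compileCount counter are invented for the example. The raw regex string is the cache key, so a second call with the same string skips compilation entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class CachingMatcher {
    // Maps each raw regex string to its compiled form.
    private final Map<String, Pattern> cache = new HashMap<String, Pattern>();
    private int compileCount = 0;   // only here to make the caching visible

    private Pattern patternFor(String regex) {
        Pattern p = cache.get(regex);
        if (p == null) {            // first time we've seen this string
            p = Pattern.compile(regex);
            compileCount++;
            cache.put(regex, p);
        }
        return p;
    }

    // Procedural-style entry point: raw regex in, boolean out.
    boolean match(String regex, String target) {
        return patternFor(regex).matcher(target).find();
    }

    int compileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        CachingMatcher engine = new CachingMatcher();
        engine.match("\\d+\\w+", "this is the 1st test string");
        engine.match("\\d+\\w+", "and this is the 2nd");
        System.out.println("compiles: " + engine.compileCount());
    }
}
```

A real cache would also want a size limit and some thought about thread safety, both of which this sketch ignores.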

8.5.2 A Mini `Perl5Util` Reference

The ORO suite of text-processing tools at first seems complex because of the raw number of classes and interfaces. Although the documentation is well-written, it's hard to know exactly where to start. The Perl5Util part of the documentation, however, is fairly self-contained, so it's the only thing you really need at first. The next sections briefly go over the main methods.

8.5.2.1 `Perl5Util` basics: initiating a match

match( expression, target )

Given a match expression in Perl notation, and a target string, returns true if the regex can match somewhere in the string:

     if (engine.match("/^Subject: (.*)/im", emailMessageText))
     {
         .
         .
         .

As with Perl, you can pick your own delimiters, but unlike Perl, the leading m is not required, and ORO does not support nested delimiters (e.g., m{···}).

Modifier letters may be placed after the closing delimiter. The modifiers allowed are:

i (case-insensitive match; see Section 3.3.3.1)
x (free-spacing and comments mode; see Section 3.3.3.2)
s (dot-matches-all; see Section 3.3.3.3)
m (enhanced line anchor mode; see Section 3.3.3.4)

If there's a match, the various methods described in the next section are available for querying additional information about the match.
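To make the notation concrete, here is a rough sketch of how a Perl-style match expression might be taken apart and handed to java.util.regex. It is only an illustration (ORO's own parsing is certainly more careful), and the class name PerlStyleExpression is hypothetical. It assumes the delimiter is a non-letter character, treats a leading m as optional, and maps the four modifier letters onto what I take to be the corresponding java.util.regex flags.

```java
import java.util.regex.Pattern;

class PerlStyleExpression {
    // Turn something like "/^Subject: (.*)/im" into a compiled Pattern.
    static Pattern parse(String expr) {
        // Skip the optional leading 'm'.
        int start = (expr.length() > 1 && expr.charAt(0) == 'm') ? 1 : 0;
        if (expr.length() < start + 2) {
            throw new IllegalArgumentException("too short: " + expr);
        }
        char delim = expr.charAt(start);
        int close = expr.lastIndexOf(delim);   // the last delimiter ends the regex
        if (close <= start) {
            throw new IllegalArgumentException("missing closing delimiter: " + expr);
        }
        String regex = expr.substring(start + 1, close);

        int flags = 0;
        for (char c : expr.substring(close + 1).toCharArray()) {
            switch (c) {
                case 'i': flags |= Pattern.CASE_INSENSITIVE; break;
                case 'x': flags |= Pattern.COMMENTS;         break;
                case 's': flags |= Pattern.DOTALL;           break;
                case 'm': flags |= Pattern.MULTILINE;        break;
                default:
                    throw new IllegalArgumentException("unknown modifier: " + c);
            }
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        Pattern p = parse("/^Subject: (.*)/im");
        System.out.println(p.matcher("To: jfriedl\nsubject: hello\n").find());
    }
}
```

Doing this parse on every call is exactly the per-call overhead that a procedural interface has to pay even when the compiled regex itself is cached.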

substitute( expression, target )

Given a string showing a Perl-like substitute expression, apply it to the target text, returning a possibly-modified copy:

     headerLine = engine.substitute("s/\\b(Re:\\s*)*//i", headerLine);

The modifiers mentioned for match can be placed after the final delimiter, as can g, which has the substitution continue after the first match, applying the regex to the rest of the string in looking for subsequent matches to replace.^[4]

^[4] An o modifier is also supported. It's not particularly useful, so I don't cover it in this book, but it's important to note that it is completely unrelated to Perl's /o modifier.

The substitution part of the expression is interpreted specially. Instances of $1, $2, etc. are replaced by the associated text matched by the first, second, etc., set of capturing parentheses. $0 and $& are replaced with the entire matched text. \U···\E and \L···\E cause the text between to be converted to upper- and lowercase, respectively, while \u and \l cause just the next character to be converted. Unicode case conversion is supported.

Here's an example that turns words in all caps to leading-caps:

     phrase = engine.substitute("s/\\b([A-Z])([A-Z]+)/$1\\L$2\\E/g", phrase);

(In Perl this would be better written as s/\b([A-Z]+)/\L\u$1\E/g, but ORO currently doesn't support the combination of \L···\E with \u or \l.)
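java.util.regex has no \L···\E or \u in its replacement strings, so the same all-caps-to-leading-caps conversion needs a loop there. The following is a minimal sketch of one way to do it with Matcher.appendReplacement; the class and method names are mine, and the regex is the one from the ORO example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingCaps {
    // Convert each all-caps word (two or more letters) to leading-caps.
    static String leadingCaps(String phrase) {
        Matcher m = Pattern.compile("\\b([A-Z])([A-Z]+)").matcher(phrase);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the first letter, lowercase the rest of the word.
            String word = m.group(1) + m.group(2).toLowerCase(Locale.ROOT);
            // quoteReplacement guards against '$' or '\' in the new text.
            m.appendReplacement(out, Matcher.quoteReplacement(word));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(leadingCaps("HELLO THERE world"));
    }
}
```

It is more typing than the one-line substitute call, which is a fair illustration of what the \L···\E support buys you.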

substitute( result, expression, target ): This version of the substitute method writes the possibly-modified version of the target string into a StringBuffer result, and returns the number of replacements actually done.

split( collection, expression, target, limit )

The m/···/ expression (formatted in the same way as for the match method) is applied to the target string, filling collection with the text separated by matches. There is no return value.

The collection should be an object implementing the java.util.Collection interface, such as java.util.ArrayList or java.util.Vector.

The limit argument, which is optional, limits the number of times the regex is applied to limit minus one. When the regex has no capturing parentheses, this limits the returned collection to at most limit elements.

For example, if your input is a string of values separated by simple commas, perhaps with spaces before or after, and you want to isolate just the first two values, you would use a limit of three:

     java.util.ArrayList list = new java.util.ArrayList();
     engine.split(list,"m/\\s* , \\s*/x",input,3);

An input string of "USA, NY, NYC, Bronx" results in a list of three elements: 'USA', 'NY', and 'NYC, Bronx'. Because you want just the first two, you could then eliminate the "everything else" third element.

An omitted limit allows all matches to happen, as does a non-positive one.
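The same limit arithmetic can be seen with java.util.regex, whose Pattern.split also applies the regex at most limit minus one times when the limit is positive. The sketch below is only an analogy to the ORO call, with names of my own choosing; it splits on a comma with optional surrounding whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Split on commas (with optional whitespace on either side),
    // applying the regex at most limit-1 times.
    static List<String> fields(String input, int limit) {
        return Arrays.asList(Pattern.compile("\\s*,\\s*").split(input, limit));
    }

    public static void main(String[] args) {
        List<String> parts = fields("USA, NY, NYC, Bronx", 3);
        System.out.println(parts);              // third element is the unsplit remainder
        System.out.println(parts.subList(0, 2)); // just the first two values
    }
}
```

One difference to keep in mind: with java.util.regex, a limit of zero discards trailing empty strings while a negative limit keeps them, so the two non-positive cases are not interchangeable there.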

If the regex has capturing parentheses, additional elements associated with each $1, $2, etc., may be inserted for each successful regex application. With ORO's split, they are inserted only if not empty (i.e., empty elements are not created from capturing parentheses). Also, note that the limit restricts the number of regex applications, not the number of elements returned, which depends on the number of matches as well as the number of capturing parentheses that actually capture text.
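The capturing-parentheses behavior is easier to see spelled out as a loop. What follows is a sketch of the described semantics written with java.util.regex (whose own split never returns captured text), not a copy of ORO's code. The class name is invented, there is no limit argument, and zero-length matches get no special treatment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitWithCaptures {
    static List<String> split(Pattern p, String target) {
        List<String> out = new ArrayList<String>();
        Matcher m = p.matcher(target);
        int last = 0;                       // end of the previous match
        while (m.find()) {
            // The text between the previous match and this one.
            out.add(target.substring(last, m.start()));
            // Then each capturing group that actually captured something.
            for (int g = 1; g <= m.groupCount(); g++) {
                String captured = m.group(g);
                if (captured != null && captured.length() > 0) {
                    out.add(captured);
                }
            }
            last = m.end();
        }
        out.add(target.substring(last));    // whatever follows the final match
        return out;
    }

    public static void main(String[] args) {
        Pattern br = Pattern.compile("\\s*(<BR>)\\s*", Pattern.CASE_INSENSITIVE);
        System.out.println(split(br, "this<br>that<BR>other"));
    }
}
```

With two matches and one capturing group, three pieces of separated text become five elements, which is why the element count can't be predicted from the limit alone.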

Perl's split operator has a number of somewhat odd rules as to when it returns leading and trailing empty elements that might result from matches at the beginning and end of the string (see Section 7.7.1.3). As of Version 2.0.6, ORO does not support these, but there is talk among the developers of doing so in a future release.

Here's a simple little program that's convenient for testing split:

     import org.apache.oro.text.perl.Perl5Util;
     import java.util.*;

     public class OroSplitTest {
         public static void main(String[] args) {
           Perl5Util engine = new Perl5Util();
           List list = new ArrayList();
           engine.split(list, args[0], args[1], Integer.parseInt(args[2]));
           System.out.println(list);
         }
     }

The println call shows each element within [···], separated by commas. Here are a few examples:

     % java OroSplitTest '/\./' '209.204.146.22' -1
     [209, 204, 146, 22]

     % java OroSplitTest '/\./' '209.204.146.22' 2
     [209, 204.146.22]

     % java OroSplitTest 'm|/+|' '/usr/local/bin//java' -1
     [, usr, local, bin, java]

     % java OroSplitTest 'm/(?=(?:\d\d\d)+$)/' 1234567890 -1
     [1, 234, 567, 890]

     % java OroSplitTest 'm/\s*<BR>\s*/i' 'this<br>that<BR>other' -1
     [this, that, other]

     % java OroSplitTest 'm/\s*(<BR>)\s*/i' 'this<br>that<BR>other' -1
     [this, <br>, that, <BR>, other]

Note that with most shells, you don't need to double the backslashes if you use single quotes to delimit the arguments, as you do when entering the same expressions as Java string literals.

8.5.2.2 `Perl5Util` basics: inspecting the results of a match

The following Perl5Util methods are available to report on the most recent successful match of a regular expression (an unsuccessful attempt does not reset these). They throw NullPointerException if called when there hasn't yet been a successful match.

group( num ): Returns the text matched by the num-th set of capturing parentheses, or by the whole match if num is zero. Returns null if there aren't at least num sets of capturing parentheses, or if the requested set did not participate in the match.

toString(): Returns the text matched (the same as group(0)).

length(): Returns the length of the text matched (the same as group(0).length()).

beginOffset( num ): Returns the number of characters from the start of the target string to the start of the text returned by group( num ). Returns -1 in cases where group( num ) returns null.

endOffset( num ): Returns the number of characters from the start of the target string to the first character after the text returned by group( num ). Returns -1 in cases where group( num ) returns null.

groups(): Returns the number of capturing groups in the regex, plus one (the extra is to account for the virtual group zero of the entire match). All num values to the methods just mentioned must be less than this number.

getMatch(): Returns an org.apache.oro.text.regex.MatchResult object, which has all the result-querying methods listed so far. It's convenient when you want to save the results of the latest match beyond the next use of the Perl5Util object. getMatch() is valid only after a successful match, and not after a substitute or split.

preMatch(): Returns the part of the target string before (to the left of) the match.

postMatch(): Returns the part of the target string after (to the right of) the match.
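If it helps to see how these pieces relate, here is a small sketch that derives the same kinds of values from a java.util.regex Matcher: the matched text, its begin and end offsets, and the text before and after the match. It is an analogy rather than ORO code, and the class and method names are hypothetical. The "pre" and "post" strings are simply substrings cut at the two offsets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchInfoDemo {
    static String describe(String regex, String target) {
        Matcher m = Pattern.compile(regex).matcher(target);
        if (!m.find()) {
            return "didn't match";
        }
        String pre  = target.substring(0, m.start()); // text left of the match
        String post = target.substring(m.end());      // text right of the match
        return "matched [" + m.group() + "] from " + m.start() + " to " + m.end()
             + ", pre [" + pre + "], post [" + post + "]";
    }

    public static void main(String[] args) {
        System.out.println(describe("\\d+\\w+", "this is the 1st test string"));
    }
}
```

As with endOffset, the end value is the offset of the first character after the match, so end minus begin is the match length.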

8.5.3 Using ORO's Underlying Classes

If you need to do things that Perl5Util doesn't allow, but still want to use ORO, you'll need to use the underlying classes (the "vast, modular framework") directly. As an example, here's an ORO version of the CSV-processing script in Section 8.4.4.3.

First, we need these 11 classes:

     import org.apache.oro.text.regex.PatternCompiler;
     import org.apache.oro.text.regex.Perl5Compiler;
     import org.apache.oro.text.regex.Pattern;
     import org.apache.oro.text.regex.PatternMatcher;
     import org.apache.oro.text.regex.Perl5Matcher;
     import org.apache.oro.text.regex.MatchResult;
     import org.apache.oro.text.regex.Substitution;
     import org.apache.oro.text.regex.Util;
     import org.apache.oro.text.regex.Perl5Substitution;
     import org.apache.oro.text.regex.PatternMatcherInput;
     import org.apache.oro.text.regex.MalformedPatternException;

Then, we prepare the regex engine (this is needed just once per thread):

     PatternCompiler compiler = new Perl5Compiler();
     PatternMatcher  matcher  = new Perl5Matcher();

Now we declare the variables for our two regexes, and also initialize an object representing the replacement text for when we change '""' to '"':

     Pattern rCSVmain  = null;
     Pattern rCSVquote = null;
     // When rCSVquote matches, we'll want to replace with one double quote:
     Substitution sCSVquote = new Perl5Substitution("\"");

Now we create the regex objects. The raw ORO classes require pattern exceptions to always be caught or thrown, even though we know the hand-constructed regex will always work (well, after we've tested it once to make sure we've typed it correctly).

     try {
        rCSVmain = compiler.compile(
             "  (?:^|,)                                       \n"+
             "  (?:                                           \n"+
             "     # Either a double-quoted field...          \n"+
             "     \" # field's opening quote                 \n"+
             "      ( [^\"]* (?: \"\" [^\"]* )* )             \n"+
             "     \" # field's closing quote                 \n"+
             "   # ... or ...                                 \n"+
             "   |                                            \n"+
             "     # ... some non-quote/non-comma text ...    \n"+
             "     ( [^\",]* )                                \n"+
             "   )                                            \n",
             Perl5Compiler.EXTENDED_MASK);
        rCSVquote = compiler.compile("\"\"");
     }
     catch (MalformedPatternException e) {
        System.err.println("Error parsing regular expression.");
        System.err.println("Error: " + e.getMessage());
        System.exit(1);
     }

ORO's \G doesn't work properly (at least as of Version 2.0.6), so I've removed it. You'll recall from the original discussion in Chapter 5 (see Section 5.4.2.1) that \G had been used as a precaution, and wasn't strictly required, so it's okay to remove here.

Finally, this snippet actually does the processing:

     PatternMatcherInput inputObj = new PatternMatcherInput(inputCSVtext);
     while ( matcher.contains(inputObj, rCSVmain) )
     {
         String field; // We'll fill this in with $1 or $2
         String first = matcher.getMatch().group(2);
         if ( first != null ) {
             field = first;
         } else {
             field = matcher.getMatch().group(1);
             // If $1, must replace paired double quotes with one double quote
             field = Util.substitute(matcher,      // the matcher to use
                                     rCSVquote,    // the pattern to match with it
                                     sCSVquote,    // the replacement to be done
                                     field,        // the target string
                                     Util.SUBSTITUTE_ALL); // do all replacements
         }
         // We can now work with the field . . .
         System.out.println("Field [" + field + "]");
     }

Phew! Seeing all that's involved certainly helps you to appreciate Perl5Util!
