8.5 A Quick Look at Jakarta-ORO
Jakarta-ORO (from now on, just "ORO") is a vast, modular framework of mostly
regex-related text-processing features containing a dizzying eight interfaces and
35+ classes. When first faced with the documentation, you can be intimidated until
you realize that you can get an amazing amount of use out of it by knowing just
one class, Perl5Util, described next.
8.5.1 ORO's Perl5Util
This ORO version of the example from Section 8.4.3 shows how simple Perl5Util is
to work with:
import org.apache.oro.text.perl.Perl5Util;

public class SimpleRegexTest {
    public static void main(String[] args)
    {
        String sampleText = "this is the 1st test string";
        Perl5Util engine = new Perl5Util();
        if (engine.match("/\\d+\\w+/", sampleText)) {
            String matchedText = engine.group(0);
            int matchedFrom = engine.beginOffset(0);
            int matchedTo = engine.endOffset(0);
            System.out.println("matched [" + matchedText + "] from " +
                               matchedFrom + " to " + matchedTo + ".");
        } else {
            System.out.println("didn't match");
        }
    }
}
One class hides all the messy details about working with regular expressions
behind a simple façade that somewhat mimics regular-expression use in Perl.
Where Perl has
$input =~ /^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i
(from an example in Chapter 2; see Section 2.2.3.2), ORO allows:
engine.match("/^([-+]?[0-9]+(\\.[0-9]*)?)\\s*([CF])$/i", input)
Where Perl then has
$InputNum = $1; # Save to named variables to make the ...
$Type = $3; # ... rest of the program easier to read.
ORO provides for:
inputNum = engine.group(1); // Save to named variables to make the ...
type = engine.group(3); // ... rest of the program easier to read.
If you're not familiar with Perl, the /···/i trappings may seem a bit odd, and they
can be cumbersome at times, but they lower the barrier to regex use about as far as
it can go in Java. (Unfortunately, not even ORO can get around the extra escaping
required to get regex backslashes and double quotes into Java string literals.)
Even substitutions can be simple. An example from Chapter 2 to "commaify" a
number (see Section 2.3.5.5) looks like this in Perl:
$text =~ s/(\d)(?=(\d\d\d)+(?!\d))/$1,/g;
and this with ORO:
text = engine.substitute("s/(\\d)(?=(\\d\\d\\d)+(?!\\d))/$1,/g", text);
Traditionally, regular-expression use in Java has a class model that involves precompiling
the regex to some kind of pattern object, and then using that object
later when you actually need to apply the regex. The separation is for efficiency,
so that repeated uses of a regex don't have to suffer the repeated costs of compiling
each time.
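To make the precompile-then-apply model concrete, here is a minimal sketch. It uses java.util.regex (the package from Section 8.4) rather than ORO, and the class name Commaify and its method are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Commaify {
    // Compiled once, when the class is loaded, and reused on every call.
    private static final Pattern COMMA_SPOT =
        Pattern.compile("(\\d)(?=(\\d\\d\\d)+(?!\\d))");

    // Applies the precompiled pattern; no compile cost is paid here.
    static String commaify(String text) {
        Matcher m = COMMA_SPOT.matcher(text);
        return m.replaceAll("$1,");
    }

    public static void main(String[] args) {
        // prints: The population is 281,421,906
        System.out.println(commaify("The population is 281421906"));
    }
}
```

The compile step and the apply step are separate statements here, which is exactly the separation that Perl5Util hides behind a single call.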
So, how does Perl5Util, with its procedural approach of accepting the raw regex
each time, stay reasonably efficient? It caches the results of the compile, keeping a
behind-the-scenes mapping between a string and the resulting regex object. (See
"Compile caching in the procedural approach" in Chapter 6, Section 6.4.4.1.2.)
It's not perfectly efficient, as the argument string must be parsed for the regex
delimiters and modifiers each time, so there's some extra overhead, but the
caching keeps it reasonable for casual use.
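The caching idea can be sketched in a few lines. This is not ORO's actual implementation: it is a simplified illustration built on java.util.regex, the RegexCache class and its methods are hypothetical, and it is neither thread-safe nor bounded in size.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexCache {
    // Maps the raw regex string to the object produced by compiling it.
    private final Map<String, Pattern> compiled = new HashMap<>();
    private int compileCount = 0;

    // Returns the cached pattern, compiling only on the first request.
    Pattern get(String regex) {
        Pattern p = compiled.get(regex);
        if (p == null) {
            p = Pattern.compile(regex);
            compiled.put(regex, p);
            compileCount++;
        }
        return p;
    }

    // Procedural-style entry point: the caller passes the raw regex each time.
    boolean matches(String regex, String target) {
        return get(regex).matcher(target).find();
    }

    // Number of real compilations performed so far.
    int compileCount() {
        return compileCount;
    }
}
```

Calling matches twice with the same regex string compiles it only once; the second call pays just for the map lookup.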
8.5.2 A Mini Perl5Util Reference
The ORO suite of text-processing tools at first seems complex because of the raw
number of classes and interfaces. Although the documentation is well-written, it's
hard to know exactly where to start. The Perl5Util part of the documentation,
however, is fairly self-contained, so it's the only thing you really need at first. The
next sections briefly go over the main methods.
8.5.2.1 Perl5Util basics—initiating a match
match(expression, target)
Given a match expression in Perl notation, and a target string, returns true if
the regex can match somewhere in the string:
if (engine.match("/^Subject: (.*)/im", emailMessageText))
{
    ...
As with Perl, you can pick your own delimiters, but unlike Perl, the leading m
is not required, and ORO does not support nested delimiters (e.g., m{···}).
Modifier letters may be placed after the closing delimiter. The modifiers
allowed are:
i (case-insensitive match; see Section 3.3.3.1)
x (free-spacing and comments mode; see Section 3.3.3.2)
s (dot-matches-all; see Section 3.3.3.3)
m (enhanced line anchor mode; see Section 3.3.3.4)
If there's a match, the various methods described in the next section are available
for querying additional information about the match.
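To illustrate what parsing such an expression involves, here is a rough sketch that splits a Perl-style match expression into its regex and modifiers and maps the four modifier letters onto java.util.regex flags. It is not how ORO does it: the PerlStyleExpression class is hypothetical, it assumes the last occurrence of the delimiter character is the closing one, and java.util.regex semantics may differ in detail from ORO's Perl5 emulation.

```java
import java.util.regex.Pattern;

public class PerlStyleExpression {
    // Turns "/regex/mods" or "m|regex|mods" into a compiled Pattern.
    static Pattern parse(String expr) {
        int open = 0;
        // Skip an optional leading 'm' that is followed by a delimiter.
        if (expr.length() > 1 && expr.charAt(0) == 'm'
                && !Character.isLetterOrDigit(expr.charAt(1))) {
            open = 1;
        }
        if (expr.length() < open + 2) {
            throw new IllegalArgumentException("too short: " + expr);
        }
        char delim = expr.charAt(open);
        int close = expr.lastIndexOf(delim);
        if (close <= open) {
            throw new IllegalArgumentException("no closing delimiter: " + expr);
        }
        String regex = expr.substring(open + 1, close);

        // Everything after the closing delimiter is a modifier letter.
        int flags = 0;
        for (char c : expr.substring(close + 1).toCharArray()) {
            switch (c) {
                case 'i': flags |= Pattern.CASE_INSENSITIVE; break;
                case 'x': flags |= Pattern.COMMENTS;         break;
                case 's': flags |= Pattern.DOTALL;           break;
                case 'm': flags |= Pattern.MULTILINE;        break;
                default:
                    throw new IllegalArgumentException("unknown modifier: " + c);
            }
        }
        return Pattern.compile(regex, flags);
    }
}
```

With this sketch, parse("/^Subject: (.*)/im") yields a pattern whose ^ can match after an embedded newline and whose letters match in either case.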
substitute(expression, target)
Given a string showing a Perl-like substitute expression, apply it to the target
text, returning a possibly-modified copy:
headerLine = engine.substitute("s/\\b(Re:\\s*)*//i", headerLine);
The modifiers mentioned for match can be placed after the final delimiter, as
can g, which has the substitution continue after the first match, applying the
regex to the rest of the string in looking for subsequent matches to replace.
The substitution part of the expression is interpreted specially. Instances of $1,
$2, etc. are replaced by the associated text matched by the first, second, etc.,
set of capturing parentheses. $0 and $& are replaced with the entire matched
text. \U···\E and \L···\E cause the text between to be converted to upper- and
lowercase, respectively, while \u and \l cause just the next character to be
converted. Unicode case conversion is supported.
Here's an example that turns words in all caps to leading-caps:
phrase = engine.substitute("s/\\b([A-Z])([A-Z]+)/$1\\L$2\\E/g", phrase);
(In Perl this would be better written as s/\b([A-Z]+)/\L\u$1\E/g, but ORO
currently doesn't support the combination of \L···\E with \u or \l.)
substitute(result, expression, target)
This version of the substitute method writes the possibly-modified version
of the target string into a StringBuffer result, and returns the number of replacements actually done.
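The write-into-a-buffer-and-return-a-count shape can be sketched with java.util.regex, reusing the all-caps-to-leading-caps example from above. This is an analogy, not ORO code: the CountingSubstitute class and its leadingCaps method are hypothetical, and the lowercasing is done in Java because java.util.regex replacement strings have no \L···\E.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountingSubstitute {
    private static final Pattern ALL_CAPS_WORD =
        Pattern.compile("\\b([A-Z])([A-Z]+)");

    // Appends the possibly-modified text to result; returns the number
    // of replacements actually done.
    static int leadingCaps(StringBuffer result, String target) {
        Matcher m = ALL_CAPS_WORD.matcher(target);
        int count = 0;
        while (m.find()) {
            // Keep the first letter, lowercase the rest of the word.
            String replacement =
                m.group(1) + m.group(2).toLowerCase(Locale.ROOT);
            m.appendReplacement(result, Matcher.quoteReplacement(replacement));
            count++;
        }
        m.appendTail(result);  // copy whatever follows the last match
        return count;
    }

    public static void main(String[] args) {
        StringBuffer result = new StringBuffer();
        int n = leadingCaps(result, "HELLO big WORLD");
        // prints: Hello big World (2 replacements)
        System.out.println(result + " (" + n + " replacements)");
    }
}
```

A return value of zero tells the caller that the buffer holds an unmodified copy of the target.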
split(collection, expression, target, limit)
The m/···/ expression (formatted in the same way as for the match method) is
applied to the target string, filling collection with the text separated by
matches. There is no return value.
The collection should be an object implementing the java.util.Collection
interface, such as java.util.ArrayList or java.util.Vector.
The limit argument, which is optional, limits the number of times the regex is
applied to limit minus one. When the regex has no capturing parentheses, this
limits the returned collection to at most limit elements.
For example, if your input is a string of values separated by simple commas,
perhaps with spaces before or after, and you want to isolate just the first two
values, you would use a limit of three:
java.util.ArrayList list = new java.util.ArrayList();
engine.split(list, "m/\\s* , \\s*/x", input, 3);
An input string of "USA, NY, NYC, Bronx" results in a list of three elements,
'USA', 'NY', and 'NYC, Bronx'. Because you want just the first two, you could
then eliminate the "everything else" third element.
An omitted limit allows all matches to happen, as does a non-positive one.
If the regex has capturing parentheses, additional elements associated with
each $1, $2, etc., may be inserted for each successful regex application. With
ORO's split, they are inserted only if not empty (that is, empty elements are not
created from capturing parentheses). Also, note that the limit limits the number
of regex applications, not the number of elements returned, which is
dependent upon the number of matches, as well as the number of capturing
parentheses that actually capture text.
Perl's split operator has a number of somewhat odd rules as to when it
returns leading and trailing empty elements that might result from matches at
the beginning and end of the string (see Section 7.7.1.3). As of Version 2.0.6, ORO does not
support these, but there is talk among the developers of doing so in a future
release.
Here's a simple little program that's convenient for testing split:
import org.apache.oro.text.perl.Perl5Util;
import java.util.*;

public class OroSplitTest {
    public static void main(String[] args) {
        Perl5Util engine = new Perl5Util();
        List list = new ArrayList();
        engine.split(list, args[0], args[1], Integer.parseInt(args[2]));
        System.out.println(list);
    }
}
The println call shows each element within [···], separated by commas.
Here are a few examples:
% java OroSplitTest '/\./' '209.204.146.22' -1
[209, 204, 146, 22]
% java OroSplitTest '/\./' '209.204.146.22' 2
[209, 204.146.22]
% java OroSplitTest 'm|/+|' '/usr/local/bin//java' -1
[, usr, local, bin, java]
% java OroSplitTest 'm/(?=(?:\d\d\d)+$)/' 1234567890 -1
[1, 234, 567, 890]
% java OroSplitTest 'm/\s*<BR>\s*/i' 'this<br>that<BR>other' -1
[this, that, other]
% java OroSplitTest 'm/\s*(<BR>)\s*/i' 'this<br>that<BR>other' -1
[this, <br>, that, <BR>, other]
Note that with most shells, you don't need to double the backslashes if you
use single quotes to delimit the arguments, as you do when entering the same
expressions as Java string literals.
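The limit and capturing-parentheses rules described for split can be modeled in plain Java. The following is a sketch of the semantics as stated in this section, written with java.util.regex; it is not ORO's code, the SplitSketch class is hypothetical, and it makes no claim to reproduce ORO's behavior in edge cases beyond the examples shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSketch {
    static List<String> split(Pattern regex, String target, int limit) {
        List<String> out = new ArrayList<>();
        Matcher m = regex.matcher(target);
        int pos = 0;           // where the next element starts
        int applications = 0;  // successful regex applications so far

        // A positive limit allows at most limit-1 applications;
        // a non-positive limit allows them all.
        while ((limit <= 0 || applications < limit - 1) && m.find()) {
            out.add(target.substring(pos, m.start()));
            // Captured text is inserted only when it is not empty.
            for (int g = 1; g <= m.groupCount(); g++) {
                String captured = m.group(g);
                if (captured != null && !captured.isEmpty()) {
                    out.add(captured);
                }
            }
            pos = m.end();
            applications++;
        }
        out.add(target.substring(pos));  // the "everything else" element
        return out;
    }

    public static void main(String[] args) {
        Pattern br = Pattern.compile("\\s*(<BR>)\\s*", Pattern.CASE_INSENSITIVE);
        // prints: [this, <br>, that, <BR>, other]
        System.out.println(split(br, "this<br>that<BR>other", -1));
    }
}
```

Because the limit is checked before each attempt, a limit of 2 stops after one application and leaves the rest of the string as the final element.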
8.5.2.2 Perl5Util basics—inspecting the results of a match
The following Perl5Util methods are available to report on the most recent successful
match of a regular expression (an unsuccessful attempt does not reset
these). They throw NullPointerException if called when there hasn't yet been a
successful match.
group(num)
Returns the text matched by the num-th set of capturing parentheses, or by the
whole match if num is zero. Returns null if there aren't at least num sets of
capturing parentheses, or if the named set did not participate in the match.
toString()
Returns the text matched—the same as group(0).
length()
Returns the length of the text matched—the same as group(0).length().
beginOffset(num)
Returns the number of characters from the start of the target string to the start
of the text returned by group(num). Returns -1 in cases where group(num)
returns null.
endOffset(num)
Returns the number of characters from the start of the target string to the first
character after the text returned by group(num). Returns -1 in cases where
group(num) returns null.
groups()
Returns the number of capturing groups in the regex, plus one (the extra is to
account for the virtual group zero of the entire match). All num values to the
methods just mentioned must be less than this number.
getMatch()
Returns an org.apache.oro.text.regex.MatchResult object, which has all
the result-querying methods listed so far. It's convenient when you want to
save the results of the latest match beyond the next use of the Perl5Util
object. getMatch() is valid only after a successful match, and not after a
substitute or split.
preMatch()
Returns the part of the target string before (to the left of) the match.
postMatch()
Returns the part of the target string after (to the right of) the match.
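For comparison, the same kinds of match data can be read from a java.util.regex Matcher, where start and end play the roles of beginOffset and endOffset and the pre-match and post-match text are recovered with substring. This is a sketch of the java.util.regex side only; the MatchInspection class is hypothetical and does not use ORO.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchInspection {
    // Summarizes the first match of regex in target.
    static String describe(String regex, String target) {
        Matcher m = Pattern.compile(regex).matcher(target);
        if (!m.find()) {
            return "didn't match";
        }
        String pre  = target.substring(0, m.start());  // left of the match
        String post = target.substring(m.end());       // right of the match
        return "matched [" + m.group() + "] from " + m.start()
             + " to " + m.end()
             + ", pre [" + pre + "], post [" + post + "]";
    }

    public static void main(String[] args) {
        // prints: matched [1st] from 12 to 15, pre [this is the ], post [ test string]
        System.out.println(describe("\\d+\\w+", "this is the 1st test string"));
    }
}
```

As with the Perl5Util methods, a group that did not participate in the match yields null from group(num) and -1 from start(num) and end(num).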
8.5.3 Using ORO's Underlying Classes
If you need to do things that Perl5Util doesn't allow, but still want to use ORO,
you'll need to use the underlying classes (the "vast, modular framework") directly.
As an example, here's an ORO version of the CSV-processing script in Section 8.4.4.3.
First, we need these 11 classes:
import org.apache.oro.text.regex.PatternCompiler;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.MatchResult;
import org.apache.oro.text.regex.Substitution;
import org.apache.oro.text.regex.Util;
import org.apache.oro.text.regex.Perl5Substitution;
import org.apache.oro.text.regex.PatternMatcherInput;
import org.apache.oro.text.regex.MalformedPatternException;
Then, we prepare the regex engine—this is needed just once per thread:
PatternCompiler compiler = new Perl5Compiler();
PatternMatcher matcher = new Perl5Matcher();
Now we declare the variables for our two regexes, and also initialize an object
representing the replacement text for when we change '""' to '"':
Pattern rCSVmain = null;
Pattern rCSVquote = null;
// When rCSVquote matches, we'll want to replace with one double quote:
Substitution sCSVquote = new Perl5Substitution("\"");
Now we create the regex objects. The raw ORO classes require pattern exceptions
to always be caught or thrown, even though we know the hand-constructed regex
will always work (well, after we've tested it once to make sure we've typed it
correctly).
try {
    rCSVmain = compiler.compile(
        " (?:^|,) \n"+
        " (?: \n"+
        " # Either a double-quoted field... \n"+
        " \" # field's opening quote \n"+
        " ( [^\"]* (?: \"\" [^\"]* )* ) \n"+
        " \" # field's closing quote \n"+
        " # ... or ... \n"+
        " | \n"+
        " # ... some non-quote/non-comma text ... \n"+
        " ( [^\",]* ) \n"+
        " ) \n",
        Perl5Compiler.EXTENDED_MASK);
    rCSVquote = compiler.compile("\"\"");
}
catch (MalformedPatternException e) {
    System.err.println("Error parsing regular expression.");
    System.err.println("Error: " + e.getMessage());
    System.exit(1);
}
ORO's \G doesn't work properly (at least as of Version 2.0.6), so I've removed it.
You'll recall from the original discussion in Chapter 5 (see Section 5.4.2.1) that
\G had been used as a precaution, and wasn't strictly required, so it's okay to
remove here.
Finally, this snippet actually does the processing:
PatternMatcherInput inputObj = new PatternMatcherInput(inputCSVtext);
while ( matcher.contains(inputObj, rCSVmain) )
{
    String field; // We'll fill this in with $1 or $2
    String first = matcher.getMatch().group(2);
    if ( first != null ) {
        field = first;
    } else {
        field = matcher.getMatch().group(1);
        // If $1, must replace paired double quotes with one double quote
        field = Util.substitute(matcher,               // the matcher to use
                                rCSVquote,             // the pattern to match with it
                                sCSVquote,             // the replacement to be done
                                field,                 // the target string
                                Util.SUBSTITUTE_ALL);  // do all replacements
    }
    // We can now work with the field . . .
    System.out.println("Field [" + field + "]");
}
Phew! Seeing all that's involved certainly helps you to appreciate Perl5Util!