8.5 A Quick Look at Jakarta-ORO

Jakarta-ORO (from now on, just "ORO") is a vast, modular framework of mostly regex-related text-processing features containing a dizzying eight interfaces and 35+ classes. When first faced with the documentation, you can be intimidated until you realize that you can get an amazing amount of use out of it by knowing just one class, Perl5Util, described next.

8.5.1 ORO's Perl5Util

This ORO version of the example from Section 8.4.3 shows how simple Perl5Util is to work with:

    import org.apache.oro.text.perl.Perl5Util;

    public class SimpleRegexTest {
        public static void main(String[] args) {
            String sampleText = "this is the 1st test string";
            Perl5Util engine = new Perl5Util();

            if (engine.match("/\\d+\\w+/", sampleText)) {
                String matchedText = engine.group(0);
                int matchedFrom = engine.beginOffset(0);
                int matchedTo = engine.endOffset(0);
                System.out.println("matched [" + matchedText + "] from " +
                                   matchedFrom + " to " + matchedTo + ".");
            } else {
                System.out.println("didn't match");
            }
        }
    }

One class hides all the messy details about working with regular expressions
behind a simple façade that somewhat mimics regular-expression use in Perl. Where Perl has

    $input =~ /^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i

(from an example in Chapter 2; see Section 2.2.3.2), ORO allows:

    engine.match("/^([-+]?[0-9]+(\\.[0-9]*)?)\\s*([CF])$/i", input)

Where Perl then has

    $InputNum = $1;  # Save to named variables to make the ...
    $Type     = $3;  # ... rest of the program easier to read.

ORO provides for:

    inputNum = engine.group(1);  // Save to named variables to make the ...
    type     = engine.group(3);  // ... rest of the program easier to read.

If you're not familiar with Perl, the /···/i trappings may seem a bit odd, and they can be cumbersome at times, but this approach lowers the barrier to regex use about as far as it can go in Java.[3] (Unfortunately, not even ORO can get around the extra escaping required to get regex backslashes and double quotes into Java string literals.)
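To make the escaping point concrete, here is a minimal sketch of the temperature regex written as a Java string literal. It uses java.util.regex as a stand-in so that it runs without ORO on the classpath; the class name TemperatureGroupsSketch and its parse helper are made up for illustration. With ORO, the corresponding calls would be engine.match and engine.group, and the backslash doubling would be exactly the same.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; java.util.regex stands in for ORO here.
class TemperatureGroupsSketch {
    // In the Java literal every regex backslash is doubled.
    // The regex engine itself receives \. and \s with single backslashes.
    static final String REGEX = "^([-+]?[0-9]+(\\.[0-9]*)?)\\s*([CF])$";

    // Returns {number, unit}, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE).matcher(input);
        if (!m.find()) {
            return null;
        }
        // Group 1 is the whole number, group 3 is the C or F letter.
        return new String[] { m.group(1), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println("regex as the engine sees it: " + REGEX);
        String[] parts = parse("37.5 c");
        if (parts != null) {
            System.out.println("number=" + parts[0] + " unit=" + parts[1]);
        }
    }
}
```

Printing REGEX is a quick way to check what the regex engine actually receives after the Java compiler has processed the string literal.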
Even substitutions can be simple. An example from Chapter 2 to "commaify" a number (see Section 2.3.5.5) looks like this in Perl:

    $text =~ s/(\d)(?=(\d\d\d)+(?!\d))/$1,/g;

and this with ORO:

    text = engine.substitute("s/(\\d)(?=(\\d\\d\\d)+(?!\\d))/$1,/g", text);

Traditionally, regular-expression use in Java has a class model that involves precompiling the regex to some kind of pattern object, and then using that object later when you actually need to apply the regex. The separation is for efficiency, so that repeated uses of a regex don't have to suffer the repeated costs of compiling each time.

So, how does Perl5Util, with its procedural approach of accepting the raw regex each time, stay reasonably efficient? It caches the results of the compile, keeping a behind-the-scenes mapping between a string and the resulting regex object. (See "Compile caching in the procedural approach" in Chapter 6, Section 6.4.4.1.2.)

It's not perfectly efficient, as the argument string must be parsed for the regex delimiters and modifiers each time, so there's some extra overhead, but the caching keeps it reasonable for casual use.

8.5.2 A Mini Perl5Util Reference

The ORO suite of text-processing tools at first seems complex because of the sheer number of classes and interfaces. Although the documentation is well-written, it's hard to know exactly where to start. The Perl5Util part of the documentation, however, is fairly self-contained, so it's the only thing you really need at first. The next sections briefly go over the main methods.

8.5.2.1 Perl5Util basics: initiating a match
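A match is initiated by handing Perl5Util a Perl-style string such as "/\\d+\\w+/" together with the target text. As Section 8.5.1 describes, that string is parsed for its delimiters and modifiers on every call, while the compiled regex is looked up in a cache. The following is a minimal sketch of that parse-then-cache idea, not ORO's actual implementation: it uses java.util.regex, accepts only the /regex/flags form with the i, m, s, and x modifiers, and the names CachingMatchSketch, lookup, and compileCount are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of compile caching in a procedural regex API.
class CachingMatchSketch {
    private final Map<String, Pattern> cache = new HashMap<>();
    private int compileCount = 0;

    // Parses "/regex/flags" (done on every call), then compiles only on a cache miss.
    Pattern lookup(String perlStyle) {
        int last = perlStyle.lastIndexOf('/');
        if (perlStyle.length() < 2 || perlStyle.charAt(0) != '/' || last == 0) {
            throw new IllegalArgumentException("expected /regex/flags: " + perlStyle);
        }
        String regex = perlStyle.substring(1, last);
        int flags = 0;
        for (char c : perlStyle.substring(last + 1).toCharArray()) {
            switch (c) {
                case 'i': flags |= Pattern.CASE_INSENSITIVE; break;
                case 'm': flags |= Pattern.MULTILINE; break;
                case 's': flags |= Pattern.DOTALL; break;
                case 'x': flags |= Pattern.COMMENTS; break;
                default:
                    throw new IllegalArgumentException("unknown modifier: " + c);
            }
        }
        // Assumption: the cache key combines the flags and the regex text.
        String key = flags + ":" + regex;
        Pattern compiled = cache.get(key);
        if (compiled == null) {
            compiled = Pattern.compile(regex, flags);
            compileCount++;
            cache.put(key, compiled);
        }
        return compiled;
    }

    // Procedural entry point: raw regex string in, yes/no answer out.
    boolean match(String perlStyle, String input) {
        return lookup(perlStyle).matcher(input).find();
    }

    int compileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        CachingMatchSketch engine = new CachingMatchSketch();
        for (int i = 0; i < 3; i++) {
            engine.match("/\\d+\\w+/", "this is the 1st test string");
        }
        System.out.println("compiles performed: " + engine.compileCount());
    }
}
```

The delimiter parsing runs on every call, which is the extra overhead mentioned earlier; only the comparatively expensive compile step is skipped on repeat uses.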
8.5.2.2 Perl5Util basics: inspecting the results of a match

The following Perl5Util methods are available to report on the most recent successful match of a regular expression (an unsuccessful attempt does not reset these). They throw NullPointerException if called when there hasn't yet been a successful match.
8.5.3 Using ORO's Underlying Classes

If you need to do things that Perl5Util doesn't allow, but still want to use ORO, you'll need to use the underlying classes (the "vast, modular framework") directly. As an example, here's an ORO version of the CSV-processing script in Section 8.4.4.3.

First, we need these 11 classes:

    import org.apache.oro.text.regex.PatternCompiler;
    import org.apache.oro.text.regex.Perl5Compiler;
    import org.apache.oro.text.regex.Pattern;
    import org.apache.oro.text.regex.PatternMatcher;
    import org.apache.oro.text.regex.Perl5Matcher;
    import org.apache.oro.text.regex.MatchResult;
    import org.apache.oro.text.regex.Substitution;
    import org.apache.oro.text.regex.Util;
    import org.apache.oro.text.regex.Perl5Substitution;
    import org.apache.oro.text.regex.PatternMatcherInput;
    import org.apache.oro.text.regex.MalformedPatternException;

Then, we prepare the regex engine; this is needed just once per thread:

    PatternCompiler compiler = new Perl5Compiler();
    PatternMatcher matcher = new Perl5Matcher();

Now we declare the variables for our two regexes, and also initialize an object representing the replacement text for when we change '""' to '"':

    Pattern rCSVmain = null;
    Pattern rCSVquote = null;
    // When rCSVquote matches, we'll want to replace with one double quote:
    Substitution sCSVquote = new Perl5Substitution("\"");

Now we create the regex objects. The raw ORO classes require pattern exceptions to always be caught or thrown, even though we know the hand-constructed regex will always work (well, after we've tested it once to make sure we've typed it correctly).

    try {
        rCSVmain = compiler.compile(
              "  (?:^|,)                                     \n"+
              "  (?:                                         \n"+
              "     # Either a double-quoted field...        \n"+
              "     \" # field's opening quote               \n"+
              "      ( [^\"]* (?: \"\" [^\"]* )* )           \n"+
              "     \" # field's closing quote               \n"+
              "   # ... or ...                               \n"+
              "   |                                          \n"+
              "     # ... some non-quote/non-comma text ...  \n"+
              "     ( [^\",]* )                              \n"+
              "  )                                           \n",
              Perl5Compiler.EXTENDED_MASK);

        rCSVquote = compiler.compile("\"\"");
    }
    catch (MalformedPatternException e) {
        System.err.println("Error parsing regular expression.");
        System.err.println("Error: " + e.getMessage());
        System.exit(1);
    }

ORO's \G doesn't work properly (at least as of Version 2.0.6), so I've removed it. You'll recall from the original discussion in Chapter 5 (see Section 5.4.2.1) that \G had been used as a precaution, and wasn't strictly required, so it's okay to remove here.

Finally, this snippet actually does the processing:

    PatternMatcherInput inputObj = new PatternMatcherInput(inputCSVtext);

    while ( matcher.contains(inputObj, rCSVmain) )
    {
        String field; // We'll fill this in with $1 or $2
        String first = matcher.getMatch().group(2);
        if ( first != null ) {
            field = first;
        } else {
            field = matcher.getMatch().group(1);
            // If $1, must replace paired double quotes with one double quote
            field = Util.substitute(matcher,   // the matcher to use
                                    rCSVquote, // the pattern to match with it
                                    sCSVquote, // the replacement to be done
                                    field,     // the target string
                                    Util.SUBSTITUTE_ALL); // replace every occurrence
        }
        // ... work with field here ...
    }

Phew! Seeing all that's involved certainly helps you to appreciate Perl5Util!
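If you want to experiment with the CSV regex without ORO on the classpath, the following is a rough sketch of the same field-extraction logic using java.util.regex as a stand-in. It is not the ORO code above: the class name CsvFieldsSketch and the parse helper are invented, \G is left out as in the ORO version, and one detail differs on purpose. The sketch prepends a comma to the line and uses a plain comma in place of (?:^|,), an assumption-laden workaround so that an empty first field never produces an empty overall match, which java.util.regex's find() would otherwise step past.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; java.util.regex stands in for the ORO classes.
class CsvFieldsSketch {
    private static final Pattern CSV_MAIN = Pattern.compile(
          "  ,                                      # every field is led by a comma \n"
        + "  (?:                                                                    \n"
        + "     \" ( [^\"]* (?: \"\" [^\"]* )* ) \"   # group 1: double-quoted field \n"
        + "   |                                                                     \n"
        + "     ( [^\",]* )                         # group 2: unquoted field       \n"
        + "  )                                                                      \n",
        Pattern.COMMENTS);

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        // Prepend a comma so the first field looks like all the others.
        Matcher m = CSV_MAIN.matcher("," + line);
        while (m.find()) {
            String field = m.group(2);
            if (field == null) {
                // Quoted field: collapse each pair of double quotes to one.
                field = m.group(1).replace("\"\"", "\"");
            }
            fields.add(field);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "Ten Thousand,10000, 2710 ,,\"10,000\",\"It's \"\"10 Grand\"\", baby\",10K";
        for (String field : parse(line)) {
            System.out.println("Field [" + field + "]");
        }
    }
}
```

As with the ORO version, the unquoted alternative is tested first in the Java code (group 2), and group 1 is used only when the field was double-quoted.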