8.4 Sun's Regex Package

Sun's regex package, java.util.regex, comes standard with Java as of Version 1.4. It provides powerful and innovative functionality with an uncluttered (if somewhat simplistic) class interface to its "match state" object model (see Section 8.2.1.2). It has fairly good Unicode support, clear documentation, and good efficiency.

We've seen examples of java.util.regex in earlier chapters (see Section 2.3.7.1, Section 3.2.2, Section 3.2.3.1, Section 5.4.2.1, Section 6.3.2). We'll see more later in this chapter when we look at its object model and how to actually put it to use, but first, we'll take a look at the regex flavor it supports, and the modifiers that influence that flavor.

8.4.1 Regex Flavor

java.util.regex is powered by a Traditional NFA, so the rich set of lessons from Chapters 4, 5, and 6 apply. Table 8-2 below summarizes its metacharacters. Certain aspects of the flavor are modified by a variety of match modes, turned on via flags to the various functions and factories, or turned on and off via (? mods-mods ) and (? mods-mods:···) modifiers embedded within the regular expression itself. The modes are listed in Table 8-3 in Section 8.4.1.

A regex flavor certainly can't be described with just a tidy little table, so here are some notes to augment Table 8-2:

The table shows "raw" backslashes, not the doubled backslashes required when regular expressions are provided as Java string literals. For example, \n in the table must be written as "\\n" as a Java string. See "Strings as Regular Expressions" (see Section 3.3.1).

With the Pattern.COMMENTS option (see Section 8.4.1), #··· sequences are taken as comments. (Don't forget to add newlines to multiline string literals, as in the sidebar in Section 8.4.4.2.) Unescaped ASCII whitespace is ignored. Note: unlike most implementations that support this type of mode, comments and free whitespace are recognized within character classes.
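The two notes above (doubled backslashes in string literals, and newlines to end #··· comments) can be seen together in a minimal sketch. The class name CommentsModeDemo, the TIME pattern, and the isTime helper are invented for this illustration.

```java
import java.util.regex.Pattern;

class CommentsModeDemo {
    // Free-spacing regex. Each piece ends with "\n" so that a # comment
    // stops at the end of its own line instead of swallowing the rest.
    static final Pattern TIME = Pattern.compile(
        "  (\\d\\d)   # hours               \n" +
        "  :          # a literal colon     \n" +
        "  (\\d\\d)   # minutes             \n",
        Pattern.COMMENTS);

    // True if text is exactly two digits, a colon, and two digits.
    static boolean isTime(String text) {
        return TIME.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTime("09:41"));    // true
        System.out.println(isTime("09 : 41"));  // false: spaces in the regex are ignored, not matched
    }
}
```

Without the "\n" at the end of each piece, the first # would comment out everything that follows it in the concatenated string.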

Table 8-2. Overview of Sun's java.util.regex Flavor

Character Shorthands

see Section 3.4.1.1(c) \a \b \e \f \n \r \t \0 octal \x ## \u #### \c char

Character Classes and Class-Like Constructs

see Section 3.4.2.1(c) Classes: [···] [^···] (may contain class set operators see Section 3.4.2.5)
see Section 3.4.2.2 Almost any character: dot (various meanings, changes with modes)
see Section 3.4.2.4(c) Class shorthands: \w \d \s \W \D \S
see Section 3.4.2.4(c) Unicode properties and blocks \p{ Prop } \P{ Prop }

Anchors and other Zero-Width Tests

see Section 3.4.3.1 Start of line/string: ^ \A
see Section 3.4.3.2 End of line/string: $ \z \Z
see Section 3.4.3.3 Start of current match: \G
see Section 3.4.3.4 Word boundary: \b \B
see Section 3.4.3.5 Lookaround: (?=···) (?!···) (?<=···) (?<!···)

Comments and Mode Modifiers

see Section 3.4.4 Mode modifiers: (? mods - mods ) Modifiers allowed: x d s m i u
see Section 3.4.4.2 Mode-modified spans: (? mods - mods :···)
see Section 3.3.3.5(c) Literal-text mode: \Q···\E

Grouping, Capturing, Conditional, and Control

see Section 3.4.5 Capturing parentheses: (···) \1 \2. . .
see Section 3.4.5.2 Grouping-only parentheses: (?:···)
see Section 3.4.5.4 Atomic grouping: (?>···)
see Section 3.4.5.5 Alternation: |
see Section 3.4.5.7 Greedy quantifiers: * + ? {n} {n,} {x,y}
see Section 3.4.5.9 Lazy quantifiers: *? +? ?? {n}? {n,}? {x,y}?
see Section 3.4.5.10 Possessive quantifiers: *+ ++ ?+ {n}+ {n,}+ {x,y}+

(c) - may be used within a character class (See text for notes on many items)

\b is valid as a backspace only within a character class (outside, it matches a word boundary).
\x## allows exactly two hexadecimal digits, e.g., \xFCber matches 'über'.
\u#### allows exactly four hexadecimal digits, e.g., \u00FCber matches 'über', and \u20AC matches '€'.
\0 octal requires the leading zero, with one to three following octal digits.

\c char is case sensitive, blindly xoring the ordinal value of the following character with 64. This bizarre behavior means that, unlike any other regex flavor I've ever seen, \cA and \ca are different. Use uppercase letters to get the traditional meaning of \x01. As it happens, \ca is the same as \x21, matching '!'. (The case sensitivity is scheduled to be fixed in Java 1.4.2.)
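A few of the character shorthands described above can be exercised with a short sketch. The class name ShorthandDemo and the fullMatch helper are invented for this illustration, and the results are what I would expect from the rules as stated.

```java
import java.util.regex.Pattern;

class ShorthandDemo {
    // True if regex matches all of text.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Each regex is written with doubled backslashes so that the regex
        // engine, not the Java compiler, interprets the escape.
        System.out.println(fullMatch("\\xFCber", "\u00FCber"));    // two hex digits
        System.out.println(fullMatch("\\u00FCber", "\u00FCber"));  // four hex digits
        System.out.println(fullMatch("\\0101", "A"));              // octal 101 is 'A'
        System.out.println(fullMatch("\\cA", "\u0001"));           // control-A, uppercase form
    }
}
```

Each line should print true.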

Table 8-3. The java.util.regex Match and Regex Modes
Compile-Time Option        (?mode)  Description
Pattern.UNIX_LINES         d        Changes how dot and ^ match (see Section 8.4.2)
Pattern.DOTALL             s        Causes dot to match any character (see Section 3.3.3.2)
Pattern.MULTILINE          m        Expands where ^ and $ can match (see Section 8.4.2)
Pattern.COMMENTS           x        Free-spacing and comment mode (see Section 2.3.6.2); applies even inside character classes
Pattern.CASE_INSENSITIVE   i        Case-insensitive matching for ASCII characters
Pattern.UNICODE_CASE       u        Case-insensitive matching for non-ASCII characters
Pattern.CANON_EQ                    Unicode "canonical equivalence" match mode (different encodings of the same character match as identical; see Section 3.3.2.2)

\w, \d, and \s (and their uppercase counterparts) match only ASCII characters, and don't include the other alphanumerics, digits, or whitespace in Unicode. That is, \d is exactly the same as [0-9], \w is the same as [0-9a-zA-Z_], and \s is the same as [• \t\n\f\r\x0B] (\x0B is the little-used ASCII VT character).
For full Unicode coverage, you can use Unicode properties (see Section 3.4.2.4): use \p{L} for \w, use \p{Nd} for \d, and use \p{Z} for \s. (Use the \P{···} version of each for \W, \D, and \S.)
\p{···} and \P{···} support most standard Unicode properties and blocks. Unicode scripts are not supported. Only the short property names like \p{Lu} are supported—long names like \p{Lowercase_Letter} are not supported. (see the tables in Section 3.4.2.5 and Section 3.4.2.5.) One-letter property names may omit the braces: \pL is the same as \p{L}. Note, however, that the special composite property \p{L&} is not supported. Also, for some reason, \p{P} does not match characters matched by \p{Pi} and \p{Pf}. \p{C} doesn't match characters matched by \p{Cn}.
\p{all} is supported, and is equivalent to (?s:.). \p{assigned} and \p{unassigned} are not supported: use \P{Cn} and \p{Cn} instead.
This package understands Unicode blocks as of Unicode Version 3.1. Blocks added to or modified in Unicode since Version 3.1 are not known (see Section 3.3.2.2).
Block names require the 'In' prefix (see the table in Section 3.4.2.6), and only the raw form unadorned with spaces and underscores may be used. For example, \p{In_Greek_Extended} and \p{In Greek Extended} are not allowed; \p{InGreekExtended} is required.
$ and \Z actually match line terminators when they should only match at the line terminators (for example, a pattern of "(.*$)" actually captures the line terminator). This is scheduled to be fixed in Java 1.4.1.
\G matches the location where the current match started, despite the documentation's claim that it matches at the ending location of the previous match (see Section 3.4.3.3). \G is scheduled to be fixed (to agree with the documentation and match at the end of the previous match) in Java 1.4.1.
The \b and \B word boundary metacharacters' idea of a "word character" is not the same as \w and \W's. The word boundaries understand the properties of Unicode characters, while \w and \W match only ASCII characters.
Lookahead constructs can employ arbitrary regular expressions, but lookbehind is restricted to subexpressions whose possible matches are finite in length. This means, for example, that ? is allowed within lookbehind, but * and + are not. See the description in Chapter 3, starting in Section 3.4.3.6.
At least until Java 1.4.2 is released, character classes with many elements are not optimized, and so are very slow; use ranges when possible (e.g., use [0-9A-F] instead of [0123456789ABCDEF] ), and if there are characters or ranges that are likely to match more often than others, put them earlier in the class's list.
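The notes on ASCII-only shorthands, Unicode properties, block names, and bounded lookbehind can be checked with a sketch like the following. The class name UnicodeClassDemo and the fullMatch helper are invented here, and the sample characters are my own choices.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // True if regex matches all of text.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";      // LATIN SMALL LETTER E WITH ACUTE
        String arabicFive = "\u0665";  // ARABIC-INDIC DIGIT FIVE

        System.out.println(fullMatch("\\w", eAcute));          // ASCII-only shorthand
        System.out.println(fullMatch("\\p{L}", eAcute));       // Unicode letter property
        System.out.println(fullMatch("\\pL", eAcute));         // same, braces omitted
        System.out.println(fullMatch("\\d", arabicFive));      // ASCII-only shorthand
        System.out.println(fullMatch("\\p{Nd}", arabicFive));  // Unicode decimal digit
        System.out.println(fullMatch("\\p{InGreekExtended}", "\u1F00")); // block with 'In' prefix

        // A bounded lookbehind: ? is allowed inside (?<=...).
        System.out.println(Pattern.compile("(?<=ab?)c").matcher("abc").find());
    }
}
```

The two shorthand lines (for \w and \d) should print false; the rest should print true.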

8.4.2 Using java.util.regex

The mechanics of wielding regular expressions with java.util.regex are fairly simple. Its object model is the "match state" model discussed in Section 8.2.1.2. The functionality is provided with just three classes:


        java.util.regex.Pattern

        java.util.regex.Matcher

        java.util.regex.PatternSyntaxException

Informally, I'll refer to the first two simply as "Pattern" and "Matcher". In short, the Pattern object is a compiled regular expression that can be applied to any number of strings, and a Matcher object is an individual instance of that regex being applied to a specific target string. The third class is the exception thrown upon the attempted compilation of an ill-formed regular expression.

Sun's documentation is sufficiently complete and clear that I refer you to it for the complete list of all methods for these objects (if you don't have the documentation locally, see regex.info for links). The rest of this section highlights just the main points.

Sun's java.util.regex "Line Terminators"

Traditionally, pre-Unicode regex flavors treat a newline specially with respect to dot, ^, $, and \Z. However, the Unicode standard suggests the larger set of "line terminators" discussed in Chapter 3 (Section 3.3.2.2). Sun's package supports a subset of these, consisting of these five characters and one character sequence:

Character Codes   Nicknames      Description
U+000A            LF \n          ASCII Line Feed
U+000D            CR \r          ASCII Carriage Return
U+000D U+000A     CR/LF \r\n     ASCII Carriage Return / Line Feed
U+0085            NEL            Unicode NEXT LINE
U+2028            LS             Unicode LINE SEPARATOR
U+2029            PS             Unicode PARAGRAPH SEPARATOR

This list is related to the dot, ^, $, and \Z metacharacters, but the relationships are neither constant (they change with modes), nor consistent (one would expect ^ and $ to be treated similarly, but they are not).

Both the Pattern.UNIX_LINES and Pattern.DOTALL match modes (available also via (?d) and (?s)) influence what dot matches.

^ can always match at the beginning of the string, but can match elsewhere under the (?m) Pattern.MULTILINE mode. It also depends upon the (?d) Pattern.UNIX_LINES mode.

$ and \Z can always match at the end of the string, but they can also match just before certain string-ending line terminators. With the Pattern.MULTILINE mode, $ can match after certain embedded line terminators as well. With Java 1.4.0, Pattern.UNIX_LINES does not influence $ and \Z in the same way (but it's slated to be fixed in 1.4.1 such that it does). The following table summarizes the relationships as of 1.4.0.

LF CR CR/LF NEL LS PS
Default action, without modifiers
dot matches all but:
^ matches at beginning of line only
$ and \Z match before line-ending:
With Pattern.MULTILINE or (?m)
^ matches after any:
$ matches before any:
With Pattern.DOTALL or (?s)
dot matches any character

— does not apply if Pattern.UNIX_LINES or (?d) is in effect

Finally, note that there is a bug in Java 1.4.0 that is slated to be fixed in 1.4.1: $ and \Z actually match the line terminators, when present, rather than merely matching at line terminators.
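The way the modes change what dot and ^ treat as a line terminator can be observed by counting matches. This is only a sketch: the class name LineTerminatorDemo and the countMatches helper are invented, and the counts noted are what I would expect from the documented mode definitions rather than from the 1.4.0 bugs mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminatorDemo {
    // Count how many times regex matches within text under the given flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One ordinary character followed by five line terminators.
        String terminators = "a\n\r\u0085\u2028\u2029";
        System.out.println(countMatches(".", 0, terminators));                   // only 'a'
        System.out.println(countMatches(".", Pattern.UNIX_LINES, terminators));  // all but LF
        System.out.println(countMatches(".", Pattern.DOTALL, terminators));      // everything

        // Four "lines" separated by LF, CR, and LS.
        String lines = "one\ntwo\rthree\u2028four";
        System.out.println(countMatches("^", 0, lines));
        System.out.println(countMatches("^", Pattern.MULTILINE, lines));
        System.out.println(countMatches("^", Pattern.MULTILINE | Pattern.UNIX_LINES, lines));
    }
}
```

The dot counts should be 1, 5, and 6; the ^ counts should be 1, 4, and 2.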

Here's a complete example showing a simple match:


 public class SimpleRegexTest {
   public static void main(String[] args)
   {
      String sampleText = "this is the 1st test string";
      String sampleRegex = "\\d+\\w+";
      java.util.regex.Pattern p = java.util.regex.Pattern.compile(sampleRegex);
      java.util.regex.Matcher m = p.matcher(sampleText);
      if (m.find()) {
          String matchedText = m.group();
          int    matchedFrom = m.start();
          int    matchedTo   = m.end();
          System.out.println("matched [" + matchedText + "] from " +
                             matchedFrom + " to " + matchedTo + ".");
      } else {
          System.out.println("didn't match");
      }
   }
 }

This prints 'matched [1st] from 12 to 15.'. As with all examples in this chapter, names I've chosen are in italic. Notice the Matcher object, after having been created by associating a Pattern object and a target string, is used to instigate the actual match (with its m.find() method), and to query the results (with m.group(), etc.).

The java.util.regex. package prefixes can be omitted if

import java.util.regex.*;

or perhaps

import java.util.regex.Pattern;
import java.util.regex.Matcher;

are inserted at the head of the program, just as with the examples in Chapter 3 (see Section 3.2.2.1). Doing so makes the code more manageable, and is the standard approach. The rest of this chapter assumes the import statement is always supplied. A more involved example is shown in the sidebar in Section 8.4.4.3.

8.4.3 The `Pattern.compile()` Factory

A Pattern regular-expression object is created with Pattern.compile(···). The first argument is a string to be interpreted as a regular expression (see Section 3.3.1). Optionally, compile-time options shown in Table 8-3 in Section 8.4.1 can be provided as a second argument. Here's a snippet that creates a Pattern object from the string in the variable sampleRegex, to be matched in a case-insensitive manner:

Pattern pat = Pattern.compile(sampleRegex,
     Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

A call to Pattern.compile(···) can throw two kinds of exceptions: an invalid regular expression throws PatternSyntaxException, and an invalid option value throws IllegalArgumentException.
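As a rough sketch of compiling with options and guarding against an ill-formed regex, consider the following. The class name CompileDemo, the tryCompile helper, and the sample strings are invented for illustration; returning null on failure is just one possible policy.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileDemo {
    // Compile regex with the given flags; return null if it is ill-formed.
    static Pattern tryCompile(String regex, int flags) {
        try {
            return Pattern.compile(regex, flags);
        } catch (PatternSyntaxException e) {
            System.err.println("Bad regex: " + e.getDescription()
                               + " (near index " + e.getIndex() + ")");
            return null;
        }
    }

    public static void main(String[] args) {
        String regex = "\u00FCber";  // u-umlaut, then "ber"
        String text  = "\u00DCBER";  // the same word in uppercase

        Pattern asciiOnly = tryCompile(regex, Pattern.CASE_INSENSITIVE);
        Pattern unicode   = tryCompile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        System.out.println(asciiOnly.matcher(text).matches());  // umlaut is not case-folded
        System.out.println(unicode.matcher(text).matches());    // umlaut is case-folded
        System.out.println(tryCompile("(unclosed", 0) == null); // ill-formed regex
    }
}
```

The first line should print false and the other two true, which shows why the snippet above combines both case-related flags.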

8.4.3.1 Pattern's `matcher(···)` method

A Pattern object offers some convenience methods we'll look at shortly, but for the most part, all the work is done through just one method: matcher(···). It accepts a single argument: the string to search.^[1] It doesn't actually apply the regex, but prepares the general Pattern object to be applied to a specific string. The matcher(···) method returns a Matcher object.

^[1] Actually, matcher's argument can be any object implementing the CharSequence interface (of which String, StringBuffer, and CharBuffer are examples). This provides the flexibility to apply regular expressions to a wide variety of data, including text that's not even kept in contiguous strings.
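To illustrate the footnote, here is a small sketch that passes three different kinds of CharSequence to matcher(···). The class name CharSequenceDemo and the firstNumber helper are invented for this example.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSequenceDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Return the first run of digits in text, or null if there is none.
    static String firstNumber(CharSequence text) {
        Matcher m = NUMBER.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        StringBuffer buffer = new StringBuffer("order ").append(42).append(" shipped");
        System.out.println(firstNumber(buffer));                       // 42
        System.out.println(firstNumber(CharBuffer.wrap("room 101")));  // 101
        System.out.println(firstNumber("no digits here"));             // null
    }
}
```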

8.4.4 The `Matcher` Object

Once you've associated a regular expression with a target string by creating a Matcher object, you can instruct it to apply the regex in various ways, and query the results of that application. For example, given a Matcher object m, the call m.find() actually applies m's regex to its string, returning a Boolean indicating whether a match is found. If a match is found, the call m.group() returns a string representing the text actually matched.

The next sections list the various Matcher methods that actually apply a regex, followed by those that query the results.

8.4.4.1 Applying the regex

Here are the main Matcher methods for actually applying its regex to its string:

find(): Applies the object's regex to the object's string, returning a Boolean indicating whether a match is found. If called multiple times, the next match is returned each time.

find( offset ): If find(···) is given an integer argument, the match attempt starts from the given offset number of characters from the start of the string. It throws IndexOutOfBoundsException if the offset is negative or beyond the end of the string.

matches(): This method returns a Boolean indicating whether the object's regex exactly matches the object's string. That is, the regex is wrapped with an implied \A···\z.^[2] This is also available via String's matches() method. For example, "123".matches("\\d+") is true.
^[2] Due to the bug with \Z mentioned at the bottom of Section 8.4.2, with version 1.4.0, the regex actually appears to be wrapped with an implied \A···\Z instead.

lookingAt(): Returns a Boolean indicating whether the object's regex matches the object's string from its beginning. That is, the regex is applied with an implied leading \A. Unlike matches(), this has no String counterpart, so it must be called on a Matcher object. For example, Pattern.compile("\\w+:").matcher("Subject:•spam").lookingAt() is true.
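The four methods can be compared side by side in a short sketch. The class name ApplyDemo and the findAll and findFrom helpers are invented for this illustration; each helper builds a fresh Matcher so that one call's match state does not affect the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApplyDemo {
    // Collect every match of regex in text by calling find() repeatedly.
    static List<String> findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> hits = new ArrayList<String>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // First match at or after offset, or null if there is none.
    static String findFrom(String regex, String text, int offset) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(offset) ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "10 20 30"));      // [10, 20, 30]
        System.out.println(findFrom("\\d+", "10 20 30", 3));  // 20

        Matcher m = Pattern.compile("\\w+:").matcher("Subject: spam");
        System.out.println(m.lookingAt());  // true: the regex matches at the start
        System.out.println(m.matches());    // false: it does not match the whole text
    }
}
```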

8.4.4.2 Querying the results

The following Matcher methods return information about a successful match. They throw IllegalStateException if the object's regex hasn't yet been applied to the object's string, or if the previous application was not successful. The methods that accept a num argument (referring to a set of capturing parentheses) throw IndexOutOfBoundsException when an invalid num is given.

group(): Returns the text matched by the previous regex application.

groupCount(): Returns the number of sets of capturing parentheses in the object's regex. Numbers up to this value can be used in the group( num ) method, described next.

group( num ): Returns the text matched by the num-th set of capturing parentheses, or null if that set didn't participate in the match. A num of zero indicates the entire match, so group(0) is the same as group().

start( num ): Returns the offset, in characters, from the start of the string to the start of where the num-th set of capturing parentheses matched. Returns -1 if the set didn't participate in the match.

start(): The offset to the start of the match; this is the same as start(0).

end( num ): Returns the offset, in characters, from the start of the string to the end of where the num-th set of capturing parentheses matched. Returns -1 if the set didn't participate in the match.

end(): The offset to the end of the match; this is the same as end(0).
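The query methods can be seen together in the following sketch, including the null and -1 results for a group that did not participate. The class name QueryDemo, the NUMBER pattern, and the describe helper are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryDemo {
    // An integer part, optionally followed by a dot and a fractional part.
    static final Pattern NUMBER = Pattern.compile("(\\d+)(?:\\.(\\d+))?");

    // Summarize the current match of m, which must have just succeeded.
    static String describe(Matcher m) {
        return "[" + m.group() + "] " + m.start() + "-" + m.end()
             + " int=" + m.group(1)
             + " frac=" + m.group(2) + "@" + m.start(2);
    }

    public static void main(String[] args) {
        Matcher m = NUMBER.matcher("v 12 and 3.5");
        System.out.println(m.groupCount());  // 2 sets of capturing parentheses
        while (m.find()) {
            System.out.println(describe(m));
        }
        try {
            m.group();  // the last find() failed, so there is no match to query
        } catch (IllegalStateException e) {
            System.out.println("no current match");
        }
    }
}
```

For the first match, '12', the second set of parentheses does not participate, so the summary shows frac=null@-1; for '3.5' it shows frac=5@11.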

8.4.4.3 Reusing `Matcher` objects for efficiency

The whole point of having separate compile and apply steps is to increase efficiency, alleviating the need to recompile a regex with each use (see Section 6.4.3). Additional efficiency can be gained by reusing Matcher objects when applying the same regex to new text. This is done with the reset method, described next.

CSV Parsing with java.util.regex

Here's the java.util.regex version of the CSV example from Chapter 6 (see Section 6.6.7.3). The regex has been updated to use possessive quantifiers (see Section 3.4.5.10) for a bit of extra efficiency.

First, we set up Matcher objects that we'll use in the actual processing. The '\n' at the end of each line is needed because we use #··· comments, which end at a newline.

     // Prepare the regexes we'll use
     Pattern pCSVmain = Pattern.compile(
         "  \\G(?:^|,)                                      \n"+
         "  (?:                                             \n"+
         "     # Either a double-quoted field...            \n"+
         "     \" # field's opening quote                   \n"+
         "      ( (?> [^\"]*+ ) (?> \"\" [^\"]*+ )*+ )      \n"+
         "     \" # field's closing quote                   \n"+
         "   # ... or ...                                   \n"+
         "   |                                              \n"+
         "     # ... some non-quote/non-comma text ...      \n"+
         "     ( [^\",]*+ )                                 \n"+
         "  )                                               \n",
         Pattern.COMMENTS);
     Pattern pCSVquote = Pattern.compile("\"\"");

     // Now create Matcher objects, with dummy text, that we'll use later.
     Matcher mCSVmain  = pCSVmain.matcher("");
     Matcher mCSVquote = pCSVquote.matcher("");

Then, to parse the string in csvText as CSV text, we use those Matcher objects to actually apply the regex and use the results:

     mCSVmain.reset(csvText); // Tie the target text to the mCSVmain object
     while ( mCSVmain.find() )
     {
         String field; // We'll fill this in with $1 or $2 . . .
         String first = mCSVmain.group(2);
         if ( first != null )
             field = first;
         else {
             // If $1, must replace paired double-quotes with one double quote
             mCSVquote.reset(mCSVmain.group(1));
             field = mCSVquote.replaceAll("\"");
         }
         // We can now work with field . . .
         System.out.println("Field [" + field + "]");
     }

This is more efficient than the similar version shown in Section 5.4.2.1 for two reasons: the regex is more efficient (as per the Chapter 6 discussion), and one Matcher object is reused, rather than creating and disposing of new ones each time (as per the discussion in Section 8.4.4.2).

reset( text ): This method reinitializes the Matcher object with the given String (or any object that implements a CharSequence), such that the next regex operation will start at the beginning of this text. This is more efficient than creating a new Matcher object (see Section 8.4.4.2). You can omit the argument to keep the current text but reset the match state to the beginning.

Reusing the Matcher object saves the Java mechanics of disposing of the old object and creating a new one, and requires only about one fourth the overhead of creating a new Matcher object.

In practice, you usually need only one Matcher object per regex, at least if you intend to apply the regex to only one string at a time, as is commonly the case. The CSV-parsing sidebar in this section shows this in action. Dummy strings are immediately associated with each Pattern object to create the Matcher objects. It's okay to start with a dummy string because the object's reset(···) method is called with the real text to match against before the object is used further.

In fact, there's really no need to actually save the Pattern objects to variables, since they're not used except to create the Matcher objects. The lines:

     Pattern pCSVquote = Pattern.compile("\"\"");
     Matcher mCSVquote = pCSVquote.matcher("");

can be replaced by

     Matcher mCSVquote = Pattern.compile("\"\"").matcher("");

thus eliminating the pCSVquote variable altogether.
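Outside of the CSV example, the same reuse idiom might look like the following minimal sketch. The class name ReuseDemo, the DIGITS matcher, and the countNumbers helper are invented here. Note the assumption of single-threaded use: a Matcher holds match state, so one shared instance must not be used by several threads at once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReuseDemo {
    // One Matcher, created once with dummy text and re-aimed with reset(text).
    // Assumes single-threaded use, since a Matcher carries match state.
    private static final Matcher DIGITS = Pattern.compile("\\d+").matcher("");

    // Count the runs of digits in line.
    static int countNumbers(String line) {
        DIGITS.reset(line);  // tie the new text to the existing Matcher
        int count = 0;
        while (DIGITS.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = { "10 20 30", "none", "route 66" };
        for (String line : lines) {
            System.out.println(line + " -> " + countNumbers(line));
        }
    }
}
```

This should report 3, 0, and 1 for the three sample lines.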

8.4.4.4 Simple search and replace

You can implement search-and-replace operations using just the methods mentioned so far, but the Matcher object offers convenient methods to do simple search-and-replace for you:

replaceAll( replacement )

The Matcher object is reset, and its regex is repeatedly applied to its string. The return value is a copy of the object's string, with any matches replaced by the replacement string.

This is also available via a String's replaceAll method:


    string.replaceAll(regex, replacement);

is equivalent to:

    Pattern.compile(regex).matcher(string).replaceAll(replacement)

replaceFirst( replacement )

The Matcher object is reset, and its regex is applied once to its string. The return value is a copy of the object's string, with the first match (if any) replaced by the replacement string.

This is also available via a String's replaceFirst method, as just described with replaceAll.

With any of these functions, the replacement string receives special parsing:

Instances of '$1', '$2', etc., within the replacement string are replaced by the text matched by the associated set of capturing parentheses. ($0 is replaced by the entire text matched.)
IllegalArgumentException is thrown if the character following the '$' is not an ASCII digit.
Only as many digits after the '$' as "make sense" are used. For example, if there are three capturing parentheses, '$25' in the replacement string is interpreted as $2 followed by the character '5'. However, in the same situation, '$6' in the replacement string throws IndexOutOfBoundsException.
A backslash escapes the character that follows, so use '···\$···' in the replacement string to include a dollar sign in it. By the same token, use '···\\···' to get a backslash into the replacement value. (And if you're providing the replacement string as a Java string literal, that means you need "···\\\\···" to get a backslash into the replacement value.) Also, if there are, say, 12 sets of capturing parentheses and you'd like to include the text matched by the first set, followed by '2', you can use a replacement value of '···$1\2···'.
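These replacement-string rules can be tried out with a short sketch. The class name ReplacementDemo, the rewrite helper, and the sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

class ReplacementDemo {
    // Replace every match of regex in text with the given replacement string.
    static String rewrite(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // $1, $2, and $3 refer to the three sets of capturing parentheses.
        System.out.println(rewrite("2003-07-04", "(\\d+)-(\\d+)-(\\d+)", "$3/$2/$1"));
        // \$ in the replacement (written "\\$" in Java) yields a literal dollar sign.
        System.out.println(rewrite("cost: 5", "(\\d+)", "\\$$1"));
        // Four backslashes in the Java literal yield one backslash in the result.
        System.out.println(rewrite("a/b", "/", "\\\\"));
        // Only three groups exist, so $25 means $2 followed by '5'.
        System.out.println(rewrite("abc", "(a)(b)(c)", "$25"));
        try {
            rewrite("abc", "(a)(b)(c)", "$6");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no group 6");
        }
    }
}
```

The expected output lines are '04/07/2003', 'cost: $5', 'a\b', 'b5', and 'no group 6'.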

8.4.4.5 Advanced search and replace

Two additional methods provide raw access to Matcher's search-and-replace mechanics. Together, they build a result in a StringBuffer that you provide. The first is called after each match, to fill the result with the replacement string, as well as the text between the matches. The second is called after all matches have been found, to tack on the text remaining after the final match.

appendReplacement( stringBuffer, replacement )

Called immediately after a regex has been successfully applied (e.g., with find), this method appends two strings to the given stringBuffer : first, it copies in the text of the original target string prior to the match. Then it appends the replacement string, as per the special processing described in the previous section.

For example, let's say we've got a Matcher object m that associates the regex \w+ with the string '-->one+test<--'. The first time through this while loop:


     while (m.find())
          m.appendReplacement(sb, "XXX");

the find matches 'one' in '-->one+test<--'. The call to appendReplacement fills the stringBuffer sb with the text before the match, '-->', then bypasses what matched, instead appending the replacement string, 'XXX', to sb.

The second time through the loop, find matches 'test'. The call to appendReplacement appends the text before the match, '+', then again appends the replacement string, 'XXX'.

This leaves sb with '-->XXX+XXX', and the original target string within the m object marked just after 'test', so that only '<--' remains to be processed.

appendTail( stringBuffer )

Called after all matches have been found (or, at least, after the desired matches have been found — you can stop early if you like), this method appends the remaining text. Continuing the previous example,


    m.appendTail(sb)

appends '<--' to sb. This leaves it with '-->XXX+XXX<--', completing the search and replace.
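The walk-through above can be reproduced with a sketch that records the buffer after each step. The class name AppendDemo and the trace helper are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendDemo {
    // Run a search and replace by hand, recording the buffer after each step.
    static List<String> trace(String regex, String text, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuffer sb = new StringBuffer();
        List<String> steps = new ArrayList<String>();
        while (m.find()) {
            m.appendReplacement(sb, replacement);  // text before the match, then the replacement
            steps.add(sb.toString());
        }
        m.appendTail(sb);                          // whatever follows the final match
        steps.add(sb.toString());
        return steps;
    }

    public static void main(String[] args) {
        for (String step : trace("\\w+", "-->one+test<--", "XXX")) {
            System.out.println(step);
        }
    }
}
```

This should print '-->XXX', then '-->XXX+XXX', then '-->XXX+XXX<--'.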

Here's an example showing how you might implement your own version of replaceAll using these. (Not that you'd want to, but it's illustrative.)

       public static String replaceAll(Matcher m, String replacement)
       {
            m.reset(); // Be sure to start with a fresh Matcher object
            StringBuffer result = new StringBuffer(); // We'll build the updated copy here
            while (m.find())
                m.appendReplacement(result, replacement);
            m.appendTail(result);
            return result.toString(); // Convert to a String and return
       }

Here's a slightly more involved snippet, which prints a version of the string in the variable metric, with Celsius temperatures converted to Fahrenheit:

       // Build a matcher to find numbers followed by "C" within the variable "metric"
       Matcher m = Pattern.compile("(\\d+(?:\\.(\\d+))?)C\\b").matcher(metric);

       StringBuffer result = new StringBuffer(); // We'll build the updated copy here
       while (m.find()) {
         float celsius = Float.parseFloat(m.group(1));     // Get the number, as a number
         int fahrenheit = (int) (celsius * 9/5 + 32);      // Convert to a Fahrenheit value

         m.appendReplacement(result, fahrenheit + "F");    // Insert it
       }
       m.appendTail(result);
       System.out.println(result.toString()); // Display the result

For example, if the variable metric contains 'from 36.3C to 40.1C.', it displays 'from 97F to 104F.'.

8.4.5 Other `Pattern` Methods

In addition to the main compile(···) factories, the Pattern class contains some helper functions and methods that don't add new functionality, but make the current functionality more easily accessible.

Pattern.matches( pattern, text )

This static function returns a Boolean indicating whether the string pattern can match the CharSequence (e.g., String) text. Essentially, this is:

    Pattern.compile(pattern).matcher(text).matches();

If you need to pass compile options, or need to gain access to more information about the match than whether it was successful, you'll have to use the methods described earlier.
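A brief sketch of the static function follows. The class name StaticMatchesDemo and the isAllDigits helper are invented here. Because there is no flags argument, the last line uses an embedded (?i) modifier instead.

```java
import java.util.regex.Pattern;

class StaticMatchesDemo {
    // True if text consists entirely of one or more ASCII digits.
    static boolean isAllDigits(CharSequence text) {
        return Pattern.matches("\\d+", text);
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("123"));               // true
        System.out.println(isAllDigits("123abc"));            // false: the whole text must match
        System.out.println(Pattern.matches("(?i)ok", "OK"));  // true: mode set within the regex
    }
}
```

Since the regex is compiled on every call, this form is best kept for one-off checks.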

8.4.5.1 Pattern's split method, with one argument

split( text ): This Pattern method accepts text (a CharSequence) and returns an array of strings from text that are delimited by matches of the object's regex. This is also available via a String's split method.

This trivial example

     String[] result = Pattern.compile("\\.").split("209.204.146.22");

returns the array of four strings ('209', '204', '146', and '22') that are separated by the three matches of \. in the text. This simple example splits on only a single literal character, but you can split on an arbitrary regular expression. For example, you might approximate splitting a string into "words" by splitting on non-alphanumerics:

     String[] result = Pattern.compile("\\W+").split(Text);

When given a string like 'What's up, Doc' it returns the four strings ('What', 's', 'up', and 'Doc') delimited by the three matches of the regex. (If you had non-ASCII text, you'd probably want to use \P{L}+, or perhaps [^\p{L}\p{N}_], as the regex, instead of \W+; see Section 8.4.1.)

Empty elements with adjacent matches

If the object's regex can match at the beginning of the text, the first string returned by split is an empty string (a valid string, but one that contains no characters). Similarly, if the regex can match two or more times in a row, empty strings are returned for the zero-length text "separated" by the adjacent matches. For example,

  String[] result = Pattern.compile("\\s*,\\s*").split(", one, two , ,, 3");

splits on a comma and any surrounding whitespace, returning an array of six strings: an empty string, 'one', 'two', two empty strings, and '3'.

Finally, any empty strings that might appear at the end of the list are suppressed:

     String[] result = Pattern.compile(":").split(":xx:");

This produces just two strings: an empty string and 'xx'. To keep trailing empty elements, use the two-argument version of split(···), described next.
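The one-argument examples from this section can be run together in a sketch like this, which prints each array's length along with its contents. The class name SplitDemo and the split helper are invented for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Split text on matches of regex, using the one-argument form.
    static String[] split(String regex, String text) {
        return Pattern.compile(regex).split(text);
    }

    // Show the element count followed by the elements themselves.
    static String show(String[] parts) {
        return parts.length + " " + Arrays.toString(parts);
    }

    public static void main(String[] args) {
        System.out.println(show(split("\\.", "209.204.146.22")));
        System.out.println(show(split("\\W+", "What's up, Doc")));
        System.out.println(show(split("\\s*,\\s*", ", one, two , ,, 3")));  // leading and interior empties kept
        System.out.println(show(split(":", ":xx:")));                       // trailing empty suppressed
    }
}
```

Empty strings show up as nothing between the commas in the printed array.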

8.4.5.2 Pattern's split method, with two arguments

split( text, limit ): This version of split(···) provides some control over how many times the Pattern's regex is applied, and what is done with trailing empty elements.

The limit argument takes on different meanings depending on whether it's less than zero, zero, or greater than zero.

Split with a limit less than zero
Any limit less than zero means to keep trailing empty elements in the array. Thus,

     String[] result = Pattern.compile(":").split(":xx:", -1);

returns an array of three strings (an empty string, 'xx', and another empty string).

Split with a limit of zero
An explicit limit of zero is the same as if there were no limit given, i.e., trailing empty elements are suppressed.

Split with a limit greater than zero
With a limit greater than zero, split(···) returns an array of at most limit elements. This means that the regex is applied at most limit -1 times. (A limit of three, for example, requests three strings separated by two matches.)

After having matched limit -1 times, no further matches are checked, and the entire remainder of the string after the final match is returned as the last string in the array. For example, if you had a string with

     Friedl,Jeffrey,Eric Francis,America,Ohio,Rootstown

and wanted to isolate just the three name components, you'd split the string into four parts (the three name components, and one final "everything else" string):

     String[] NameInfo = Pattern.compile(",").split(Text, 4);
     // NameInfo[0] is the family name
     // NameInfo[1] is the given name
     // NameInfo[2] is the middle name (or in my case, middle names)
     // NameInfo[3] is everything else, which we don't need, so we'll just ignore it.

The reason to limit split in this way is for enhanced efficiency — why bother going through the work of finding the rest of the matches, creating new strings, making a larger array, etc., when there's no intention to use the results of that work? Supplying a limit allows just the required work to be done.
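The three kinds of limit can be compared in one short sketch. The class name SplitLimitDemo and the split helper are invented for this illustration; the sample strings are the ones used in the text.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Split text on matches of regex, using the two-argument form.
    static String[] split(String regex, String text, int limit) {
        return Pattern.compile(regex).split(text, limit);
    }

    public static void main(String[] args) {
        // Negative limit: trailing empty elements are kept.
        System.out.println(Arrays.toString(split(":", ":xx:", -1)));
        // Zero limit: same as the one-argument form.
        System.out.println(Arrays.toString(split(":", ":xx:", 0)));
        // Positive limit: at most four elements, so the regex is applied at most three times.
        String text = "Friedl,Jeffrey,Eric Francis,America,Ohio,Rootstown";
        String[] nameInfo = split(",", text, 4);
        System.out.println(nameInfo.length);
        System.out.println(nameInfo[3]);  // the unsplit remainder
    }
}
```

The last two lines should print 4 and 'America,Ohio,Rootstown'.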
