The Pattern Object

Figure 2-1 shows the methods of the Pattern class. The figure is a UML rendering of the Pattern class that illustrates the various methods and constants of the class.

Figure 2-1: A UML representation of the Pattern class

Let's examine the fields and methods of the Pattern class in detail. If you aren't familiar with UML, here's a quick guide to reading Figure 2-1:

The name of the class is Pattern, and it's in the topmost section of the rectangle.
The middle section is a grouping of the field variables. The plus sign (+) preceding these indicates that they're public. The underline indicates that they're static, and the ": int" indicates that they're of type int. The "= num" indicates the default value.
The bottommost section of the rectangle holds the class's methods. Again, the plus sign indicates public access, and the underline indicates that the method is static. The parentheses designate the parameters for a given method; thus, flags() takes no parameters, whereas matcher(input : CharSequence) takes variable named input of type CharSequence. The colon (:) toward the end indicates a type.

The following sections describe the fields and methods of the Pattern class.

public static final int UNIX_LINES

The UNIX_LINES flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. Use this flag when you parse data that originates on a UNIX machine.

On many flavors of UNIX, the invisible character \n is used to note termination of a line. This is distinct from other operating systems, including flavors of Windows, which may use \r\n, \n, \r, \u2028, or \u0085 for a line terminator.

If you transport a file that originated on a UNIX machine to a Windows platform and open it, you may notice that the lines will sometimes not terminate in the expected manner, depending on which editor you use to view the text. This happens because the two systems can use different syntax to denote the end of a line.

The UNIX_LINES flag simply tells the regex engine that it's dealing with UNIX-style lines, which affects the matching behavior of the regular expression metacharacters ^ and $. Using the UNIX_LINES flag, or the equivalent (?d) regex pattern, doesn't degrade performance. By default, this flag isn't set.

public static final int CASE_INSENSITIVE

The CASE_INSENSITIVE field is used when constructing the second parameter of the Pattern.compile(String regex, int flags) method. It's useful when you need to match ASCII characters, regardless of case.

Using this flag or the equivalent (?I) regular expression can cause performance to degrade slightly. By default, this flag isn't set.

public static final int COMMENTS

The COMMENTS flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the regex pattern has an embedded comment in it. Specifically, it tells the regex engine to ignore any comments in the pattern, starting with the spaces leading up to the # character and everything thereafter, until the end of the line.

Thus, the regex pattern A #matches uppercase US-ASCII char code 65 will use A as the regular expression, but the spaces leading to the # character and everything after it until the end of the line will be ignored. Your code might end up looking something like this:

Pattern p =
Pattern.compile("A    #matches uppercase US-ASCII char code 65",Pattern.COMMENTS);

Think of the # character as the regex equivalent of the java comment //. By using the COMMENTS flag in compiling your regex, you're telling the regex engine that your expression contains comments, which should be ignored. This can be useful if your pattern is particularly complex or subtle. When you don't set this flag, the regex engine will attempt to interpret and use your comments as part of the regular expression.

Using this flag or the equivalent (?x) regular expression doesn't degrade performance.

public static final int MULTILINE

The MULTILINE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that regex input isn't a single line of code; rather, it contains several lines that have their own termination characters.

This means that the beginning-of-line character, ^, and the end-of-line character, $, will potentially match several lines within the input String. For example, imagine that your input String is This is a sentence.\n So is this.. If you use the MULTILINE flag to compile the regular expression pattern

Pattern p = Pattern.compile("^", Pattern.MULTILINE);

then the beginning of line character, ^ , will match before the T in This is a sentence. It will also match just before the S in So is this. When you don't use the MULTILINE flag, the match will only find the T in This is a sentence.

Using this flag or the equivalent (?m) regular expression may degrade performance.

public static final int DOTALL

The DOTALL flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. The DOTALL flag tells the regex engine to allow the metacharacter period to match any character, including line termination characters. What does this mean?

Imagine that your candidate String is Test\n. If your corresponding regex pattern is . you would normally have four matches: one for the T, another for the e, another for s, and the fourth for t. This is because the regex metacharacter . will normally match any character except a line termination character.

Enabling the DOTALL flag as follows:

Pattern p = Pattern.compile(".", Pattern.DOTALL);

would generate five matches. Your pattern would match the T, e, s, and t characters. In addition, it would match the \n character at the end of the line.

Using this flag or the equivalent (?s) regular expression doesn't degrade performance.

public static final int UNICODE_CASE

The UNICODE_CASE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It is used in conjunction with the CASE_INSENITIVE flag to generate case-insensitive matches for the international character sets.

Using this flag or the equivalent (?u) regular expression can degrade performance.

public static final int CANON_EQ

The CANON_EQ flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. As you know, characters are actually stored as numbers. For example, in the ASCII character set, the character A is represented by the number 65. Depending on the character set that you're using, the same character can be represented by different numeric combinations. For example, à can be represented by both +00E0 and U+0061U+0300. A CANON_EQ match would match either representation.

Using this flag may degrade performance.

public static Pattern compile(String regex) Throws a PatternSyntaxException

You'll notice that the Pattern class doesn't have a public constructor. This means that you can't write the following type of code:

Pattern p = new Pattern("my regex");//wrong!

To get a reference to a Pattern object, you must use the static method compile(String regex). Thus, your first line of regex code might look like the following:

Pattern p = Pattern.compile("my regex");//Right!

The parameter for this method is a String that represents a regular expression. When you pass a String to a method that expects a regular expression, it's important to delimit any \ characters that the regular expressions might have by appending another \ character to them. This is because String objects internally use the \ character to delimit metacharacters in character sequences, regardless of whether those character sequences are regular expressions. This was true long before regular expressions were part of Java. Thus, the regular expression \d becomes \\d. To match a single digit, your regular expression becomes the following:

Pattern p = Pattern.compile("\\d");

The point here being that the regular expression \d becomes the String \\d.

The delimitation of the String parameter can sometimes be tricky, so it's important to understand it well. By and large, it means that you double the \ characters that might already be present in the regular expression. It does not mean that you simply append a single \ character. I present an example to illustrate this shortly.

The compile method will throw a java.util.regex.PatternSyntaxException if the regular expression itself is badly formed. For example, if you pass in a String that contains [4, the compile method will throw a PatternSyntaxException at runtime, because the syntax of the regular expression [4 is illegal, as shown in Listing 2-1.

Listing 2-1: Using the compile Method

import java.util.regex.*;

public class DelimitTest{
  public static void main(String args[]){


    //throws exception
    Pattern p = Pattern.compile("[4");
  }
}

Does this mean that you have to catch a PatternSyntaxException every time you use a regular expression? No. PatternSyntaxException doesn't have to be explicitly caught, because it extends from a RuntimeException, and a RuntimeException doesn't need to be explicitly caught.

The compile(String regex) method returns a Pattern object.

public static Pattern compile(String regex, int flags) Throws a PatternSyntaxException

The (String regex, int flags) method is a more powerful form of the compile(String) method. The first parameter for this method, regex, is a String that represents a regular expression, as detailed in the Pattern.compile(String regex) method presented earlier. For details on how you must format the String parameter, please see the "public static Pattern compile(String regex) Throws a PatternSyntaxException" section.

The flexibility of this compile method is fully realized by using the second parameter, int flags. The int flags parameter can consist of the following flags or a bit mask created by OR-ing combinations thereof:

CANON_EQ
CASE_INSENSTIVE
COMMENTS
DOTALL
MULTILINE
UNICODE_CASE
UNIX_LINES

For example, if you want a match to be successful regardless of the case of the candidate String, then your pattern might look like the following:

   Pattern p = Pattern.compile(regex,Pattern.CASE_INSENSITIVE);

You can combine the flags by using the | operator. For example, to achieve case-insensitive Unicode matches that include a comment, you might use the following:

Pattern p =
Pattern.compile("t # a compound flag example",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE|
Pattern.COMMENT);

The compile(String regex, int flags) method returns a Pattern object.

public String pattern()

This method returns a simple String representation of the regex compiled. It can sometimes be misleading in two ways. First, the string that's returned doesn't reflect any flags that were set when the pattern was compiled. Second, the regex String you pass in isn't always the pattern String you get back out. Specifically, the original String delimitations aren't shown. Thus, if your original code was this:

Pattern p = Pattern.compile("\\d");

you should expect your output to be \d, with a single \ character.

A question naturally arises here: If this method strips out the original delimiting, can you use the resulting String as a regular expression to feed another expression? For example, does Listing 2-2 work?

Listing 2-2: Pattern Matching Example

import java.util.regex.*;

public class PatternMethodExample{
  public static void main(String args[]){
      reusePatternMethodExample();
  }

   public static void reusePatternMethodExample(){
      //match a single digit
      Pattern p = Pattern.compile("\\d");
      Matcher matcher = p.matcher("5");
      boolean isOk = matcher.matches();
      System.out.println("original pattern matches " + isOk);

      //recycle the pattern
      String tmp = p.pattern();
      Pattern p2 = Pattern.compile(tmp);
      matcher = p.matcher("5");
      isOk = matcher.matches();
      System.out.println("second pattern matches " + isOk);
   }
}

Will this method throw a RuntimeException? After all, the pattern() method returns \d, and an attempt to create a regex pattern using \d as a String will fail to compile.

The answer is no, it won't throw an exception. Remember that the doubling of the \ character is a requirement of the String object's constructor—it has nothing to do with the regex pattern that the String represents. Thus, once the String is created, the conflict disintegrates.

public Matcher matcher(CharSequence input)

Remember that you create a Pattern object by compiling a description of what you're looking for. A Pattern is a bit like a personal ad: It lists the features of the thing you're looking for. Speaking purely conceptually, your patterns might look like the following:

Pattern p = Pattern.compile("She must have red hair, and a temper");

Correspondingly, you'll need to compare that description against candidates. That is, you'll want to examine a given String to see if it matches the description you provided.

The Matcher object is designed specifically to help you do this sort of interrogation. I discuss Matcher in detail in the next major section of this chapter, but for now you should know that the Pattern.matcher(CharSequence input) method returns the Matcher that will help get details about how your candidate String compares with the description you passed in.

Pattern.matcher(CharSequence input) takes a CharSequence parameter as an input parameter. CharSequence is a new interface introduced in J2SE 1.4 and retroactively implemented by the String object. Because String implements CharSequence, you can simply pass a String object as the parameter to the Pattern.matcher(CharSequence input) method. I discuss the CharSequence parameter in detail shortly.

In the preceding example, again speaking purely conceptually, you might get your Matcher object as follows:

Matcher m = pattern.matches("Anna");

In J2SE, this Matcher object's matches() would return true. In real life, YMMV.

public int flags()

Earlier I discussed the constant flags that you can use in compiling your regex pattern. The flags method simply returns an int that represents those flags. For example, to see whether your Pattern class is currently using a given flag (say, the Pattern.COMMENTS flag), simply extract the flag:

int flgs = myPattern.flags();

then "and" (&) that flag to the Pattern.COMMENTS flag:

boolean isUsingCommentFlag =( Pattern.COMMENTS == (Pattern.COMMENTS & flgs)) ;

Similarly, to see if you're using the CASE_INSENSITIVE flag, use the following code:

boolean isUsingCaseInsensitiveFlag =
(Pattern.CASE_INSENSITIVE == (Pattern. CASE_INSENSITIVE & flgs));

public static boolean matches (String regex,CharSequence input)

Very often, you'll find that all you need to know about a String is whether it matches a given regular expression exactly. You don't want to have to create a Pattern object, extract its Matcher object, and interrogate that Matcher.

This static utility method is designed to do exactly that. Internally, it creates the Pattern and Matcher objects you need, compares the regex to the input String, and returns a boolean that tells you whether the two match exactly. Listing 2-3 presents an example of its use.

Listing 2-3: Matches Example

import java.util.regex.*;
public class PatternMatchesTest{
  public static void main(String args[]){

      String regex = "ad*";
      String input = "add";

      boolean isMatch = Pattern.matches(regex,input);
      System.out.println(isMatch);//return true
  }
}

If you're going to do a lot of comparisons, then it's more efficient to explicitly create a Pattern object and do your matches manually. However, if you aren't going to do a lot of comparisons, then matches is a handy utility method.

The Pattern.matches(String regex, CharSequence input) method is also used internally by the String class. As of J2SE 1.4, String has a new method called matches that internally defers to the Pattern.matches method. You might already be using this method without being aware of it.

Of course, this method can throw a PatternSyntaxException if the regex pattern under consideration isn't well formed.

public String[] split(CharSequence input)

This method can be particularly helpful if you need to break up a String into an array of substrings based on some criteria. In concept, it's similar to the StringTokenizer, but it's much more powerful and more resource intensive than StringTokenizer because it allows your program to use regular expressions as the splitting criteria.

This method always returns at least one element. If the split candidate, input, can't be found, then a String array is returned that contains exactly one String— namely, the original input.

If the input can be found, then a String array is returned. That array contains every substring after an occurrence of the input. Thus, for the pattern

Pattern p = new Pattern.compile(",");

the split method for Hello, Dolly will return a String array consisting of two elements. The first element of the array will contain Hello, and the second will contain Dolly. That String array is obtained as follows:

String tmp[] = p.split("Hello,Dolly");

In this case, the value returned is

//tmp is equal to { "Hello", "Dolly"}

You should be aware of some subtleties when you work with this method. If the candidate String had been Hello,Dolly, with a trailing comma character after the y in Dolly then this method would still have returned a two-element String array consisting of Hello and Dolly. The implicit behavior is that trailing spaces aren't returned.

If the input String had been Hello,,,Dolly the resulting String array would have had four elements. The return value of the split method, as applied to the pattern, is

// p.split("Hello,,,Dolly")  returns {"Hello","","","Dolly"}

Listing 2-4 provides an example in which the split method is used to split a String into an array based on a single space character.

Listing 2-4: Pattern Splitting Example

import java.util.regex.*;
public class PatternSplitExample{
  public static void main(String args[]){
      splitTest();
  }

  public static void splitTest(){

    Pattern p =
    Pattern.compile(" ");
    String tmp = "this is the String I want to split up";

    String[] tokens = p.split(tmp);

    for (int i=0; i<tokens.length; i++){
       System.out.println(tokens[i]);
    }

  }
}

Of course, this is a misuse of the method: You could have used a StringTokenizer to achieve the same result, and it would have been less resource intensive. In light of what you now know, consider Listing 2-5, which is a slightly modified version of Listing 1-12 from Chapter 1 in that it uses the Pattern.split method. Output 2-1 shows the result of running the program.

Listing 2-5: PatternSplit.java

import java.util.regex.*;

public class PatternSplit{
  public static void main(String args[]){

      String statement = "I will not compromise. I will not "+
      "cooperate. There will be no concession, no conciliation, no "+
      "finding the middle ground, and no give and take.";

      String tokens[] =null;
      String splitPattern= "compromise|cooperate|concession|"+
      "conciliation|(finding the middle ground)|(give and take)";

      Pattern p = Pattern.compile(splitPattern);

      tokens=p.split(statement);

      System.out.println("REGEX PATTERN:\n"+splitPattern + "\n");

      System.out.println("STATEMENT:\n"+statement + "\n");


      System.out.println("TOKENS:");
      for (int i=0; i < tokens.length; i++){
      System.out.println(tokens[i]);
      }
  }
}

Output 2-1: Result of Running PatternSplit.java

C:\RegEx\code\chapter1>java Split
REGEX PATTERN:
compromise|cooperate|concession|conciliation|(finding the middle group)|(give
and take)

STATEMENT:
I will not compromise. I will not cooperate. There
will be no concession, no conciliation,
no finding the middle group, and no give and take.

TOKENS:
I will not
. I will not
. There will be no
, no
, no
, and no
.

You'll notice that Listing 2-5 uses the Pattern.split method, whereas Listing 1-12 uses the new String.split method. In effect, the two are identical because the String.split method simply defers to this method internally.

What you've done is really quite amazing and might have been ridiculously convoluted without regular expressions. You're actually using complicated artificial constructs—namely, English language synonyms—to decompose text. This isn't your father's J2SE.

Note

The String method further optimizes its search criteria by placing an invisible ^ before the pattern and a $ after it.

public String[] split(CharSequence input, int limit)

This method works in exactly the same way that Pattern.split(CharSequence input) does, except for one variation. The second parameter, limit, allows you to control how many elements are returned, as shown in the following sections.

Limit == 0

If you specify that the second parameter, limit, should equal 0, then this method behaves exactly like its overloaded counterpart. That is, it returns an array containing as many matching substrings as possible, and trailing spaces are discarded. Thus, the pattern

Pattern p = new Pattern.compile(",");

will return an array consisting of two elements when split against the candidate Hello, Dolly. An example of the usage of the method follows:

String tmp[] = p.split("Hello,Dolly", 0);

Similarly, split will return two elements when matched against the String Hello, Dolly, with a trailing comma character after the y in Dolly:

String tmp[] = p.split("Hello,Dolly,", 0);

However, you may not always want this behavior. For example, there may be a time when you want to limit the number of elements returned.

Limit > 0

Use a positive limit if you're interested in only a certain number of matches. You should use that number +1 as the limit. To split the String Hello, Dolly, You, Are, My, Favorite when you want only the first two tokens, use this:

String[] tmp = pattern.split("Hello, Dolly, You, Are, My, Favorite",3);

The value of the resulting String is as follows:

//tmp[0] is  "Hello",
// tmp[1] is "Dolly";

The interesting behavior here is that a third element is returned. In this case, the third element is

//tmp[2] is  "You, Are, My, Favorite";

Using a positive limit can potentially lead to performance enhancements, because the regex engine can stop searching when it meets the specified number of matches.

Limit < 0

Using a negative number—any negative number—for the limit tells the regex engine that you want to return as many matches as possible and that you want trailing spaces, if any, to be returned. Thus, for the regex pattern

Pattern p = Pattern.compile(",");

and the candidate String Hello,Dolly, the command

String tmp[] = p.split("Hello,Dolly", -1);

results in the following condition:

//tmp is equal to {"Hello","Dolly"};

However, for the String Hello, Dolly, with trailing spaces after the comma following Dolly, the method call

String tmp[] = p.split("Hello,Dolly,    ", -1);

results in this:

//tmp is equal to {"Hello","Dolly","    "};

Notice that the actual value of the negative limit doesn't matter, thus

p.split("Hello,Dolly", -1);

is exactly equivalent to this:

p.split("Hello,Dolly", -100);