Team LiB

Appendix B: Pattern and Matcher Methods

This appendix provides a summary of the methods of the Pattern and Matcher classes in Java. It's intended to be a quick reference for working with the various regex utilities you'll be using. For more detailed descriptions, please see the appropriate section in the text.

Pattern Class Fields

UNIX_LINES

The UNIX_LINES flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. Use this flag when parsing data that originates on a UNIX machine.

On many flavors of UNIX, the invisible character \n marks the end of a line. Other operating systems differ; Windows, for example, typically uses \r\n. By default, Java's regex engine recognizes \n, \r\n, \r, \u0085, \u2028, and \u2029 as line terminators.

If you've ever transported a file that originated on a UNIX machine to a Windows platform and opened it, you may have noticed that the lines sometimes don't terminate as you might expect, depending on which editor you use to view the text. This happens because the two systems can use different characters to denote the end of the line.

The UNIX_LINES flag simply tells the regex engine that it's dealing with UNIX-style lines, so that only \n is treated as a line terminator. This affects the matching behavior of the regular expression metacharacters period (.), ^, and $.
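To make the effect concrete, here is a minimal sketch that counts matches with and without the flag. The class name UnixLinesDemo, the helper countMatches, and the sample inputs are assumptions made up for illustration; the MULTILINE flag is combined with UNIX_LINES only so that ^ has more than one place to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Hypothetical helper: counts how many times the regex matches in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A bare carriage return follows "one"; a newline follows "two".
        String input = "one\rtwo\nthree";

        // Default terminators: ^ matches at the start and after both \r and \n.
        System.out.println(countMatches("^", Pattern.MULTILINE, input)); // 3

        // UNIX_LINES: only \n ends a line, so ^ matches at the start and after \n.
        System.out.println(
            countMatches("^", Pattern.MULTILINE | Pattern.UNIX_LINES, input)); // 2

        // The period skips \r by default but matches it under UNIX_LINES.
        System.out.println(countMatches(".", 0, "a\rb")); // 2
        System.out.println(countMatches(".", Pattern.UNIX_LINES, "a\rb")); // 3
    }
}
```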

Note 

Using the UNIX_LINES flag, or the equivalent (?d) regex pattern, doesn't degrade performance. By default, this flag isn't set.

CASE_INSENSITIVE

The CASE_INSENSITIVE field is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It's useful when you need to match U.S. ASCII characters, regardless of case.
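As a rough illustration, the following sketch compares a pattern compiled with and without the flag, and with the embedded (?i) form. The class name CaseInsensitiveDemo, the helper matches, and the sample strings are assumptions chosen for the example.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without the flag, case must match exactly.
        System.out.println(matches("java", 0, "JAVA")); // false

        // With the flag, US-ASCII letters match regardless of case.
        System.out.println(matches("java", Pattern.CASE_INSENSITIVE, "JAVA")); // true

        // The embedded flag expression (?i) has the same effect.
        System.out.println(matches("(?i)java", 0, "JaVa")); // true
    }
}
```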

Note 

Using this flag, or the equivalent (?i) regular expression, can cause performance to degrade slightly. By default, this flag is not set.

COMMENTS

The COMMENTS field is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine to permit whitespace and comments in the pattern. Specifically, the engine ignores whitespace in the pattern, along with any embedded comment, which starts at a # character and runs to the end of the line.

Thus, the regex pattern A #matches uppercase US-ASCII char code 65 will use A as the regular expression, but the spaces leading up to the # character and everything thereafter until the end of the line will be ignored.
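The sketch below tries the commented pattern from the text with and without the flag, and then a slightly longer pattern spread over several lines. The class name CommentsDemo, the helper matches, and the phone-number-style pattern are assumptions introduced only for illustration.

```java
import java.util.regex.Pattern;

class CommentsDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String commented = "A #matches uppercase US-ASCII char code 65";

        // With COMMENTS, only the A is treated as the regular expression.
        System.out.println(matches(commented, Pattern.COMMENTS, "A")); // true

        // Without the flag, the space and the comment text are literal characters.
        System.out.println(matches(commented, 0, "A")); // false

        // Each line of a longer pattern can carry its own comment.
        String phone = "\\d{3}  # three leading digits\n"
                     + "-       # separator\n"
                     + "\\d{4}  # four trailing digits";
        System.out.println(matches(phone, Pattern.COMMENTS, "555-1234")); // true
    }
}
```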

Note 

Using this flag, or the equivalent (?x) regular expression, doesn't degrade performance.

MULTILINE

The MULTILINE field is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the input isn't a single line of text; rather, it contains several lines, each with its own line terminator.

This means that the beginning-of-line metacharacter, ^, and the end-of-line metacharacter, $, can match at the start and end of each line within the input String, not just at the start and end of the entire input.

For example, imagine that your input String is This is a sentence.\nSo is this. If you use the MULTILINE flag to compile the regular expression pattern:

Pattern p = Pattern.compile("^", Pattern.MULTILINE);

then the beginning-of-line metacharacter, ^, will match just before the T in This is a sentence. It will also match just before the S in So is this. Without the MULTILINE flag, ^ matches only once, just before the T in This is a sentence.
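The behavior described above can be checked with a short sketch that records the offset of every ^ match. The class name MultilineDemo and the helper matchOffsets are assumptions for illustration; the input is the two-sentence example, written with no space after the \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Hypothetical helper: returns the start offset of every match.
    static List<Integer> matchOffsets(String regex, int flags, String input) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "This is a sentence.\nSo is this.";

        // Without MULTILINE, ^ matches only at the start of the input.
        System.out.println(matchOffsets("^", 0, input)); // [0]

        // With MULTILINE, ^ also matches just before the S (offset 20).
        System.out.println(matchOffsets("^", Pattern.MULTILINE, input)); // [0, 20]
    }
}
```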

Note 

Using this flag, or the equivalent (?m) regular expression, may degrade performance.

DOTALL

The DOTALL flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method.

The DOTALL flag tells the regex engine to allow the metacharacter period (.) to match any character, including a line termination character. What does this mean?

Imagine that your candidate String were Test\n. If your corresponding regex pattern were the period (.), then you would normally have four matches: one for the T, another for the e, another for s, and the fourth for t. This is because the regex metacharacter period (.) will normally match any character, except line termination characters.

Enabling the DOTALL flag

Pattern p = Pattern.compile(".", Pattern.DOTALL);

would generate five matches. Your pattern would match the T, e, s, and t characters. In addition, it would match the \n character at the end of the line.
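The match counts for the Test\n example can be verified with a small sketch like the following. The class name DotallDemo and the helper countMatches are assumptions introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical helper: counts how many times the regex matches in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String candidate = "Test\n";

        // By default the period does not match the trailing \n.
        System.out.println(countMatches(".", 0, candidate)); // 4

        // With DOTALL the period matches the \n as well.
        System.out.println(countMatches(".", Pattern.DOTALL, candidate)); // 5

        // The embedded flag expression (?s) behaves the same way.
        System.out.println(countMatches("(?s).", 0, candidate)); // 5
    }
}
```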

Note 

Using this flag, or the equivalent (?s) regular expression, doesn't degrade performance.

UNICODE_CASE

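The UNICODE_CASE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. By default, case-insensitive matching applies only to U.S. ASCII characters. The UNICODE_CASE flag, in conjunction with the CASE_INSENSITIVE flag, generates case-insensitive matches for international character sets.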
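A minimal sketch of the difference, using an accented letter that lies outside U.S. ASCII. The class name UnicodeCaseDemo, the helper matches, and the choice of the letter e with an acute accent are assumptions for illustration; the strings are written with Unicode escapes so the example doesn't depend on the source file's encoding.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // lowercase e with acute accent
        String upper = "\u00c9"; // uppercase E with acute accent

        // CASE_INSENSITIVE alone folds case only for US-ASCII letters.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper)); // false

        // Adding UNICODE_CASE extends case folding to non-ASCII letters.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches(lower, flags, upper)); // true

        // Embedded flags can be combined as (?iu).
        System.out.println(matches("(?iu)" + lower, 0, upper)); // true
    }
}
```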

Note 

Using this flag, or the equivalent (?u) regular expression, can degrade performance.

CANON_EQ

As you know, characters are actually stored as numbers. For example, in the U.S. ASCII character set, the character A is represented by the number 65. In Unicode, the same character can sometimes be represented by more than one sequence of code points. For example, à can be represented by both U+00E0 and the sequence U+0061 U+0300 (the letter a followed by a combining grave accent). With the CANON_EQ flag set, a pattern matches either representation.
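The following sketch compares the two representations of the accented a with and without the flag. The class name CanonEqDemo and the helper matches are assumptions for illustration, and the strings use Unicode escapes so the code points are explicit. The exact results with CANON_EQ may depend on the JDK version in use, so the sketch prints them rather than asserting them.

```java
import java.util.regex.Pattern;

class CanonEqDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String precomposed = "\u00e0"; // one code point: a with grave accent
        String decomposed = "a\u0300"; // letter a plus combining grave accent

        // Without the flag, the two code point sequences are simply different.
        System.out.println(matches(decomposed, 0, precomposed));

        // With CANON_EQ, the decomposed pattern is expected to match both forms.
        System.out.println(matches(decomposed, Pattern.CANON_EQ, precomposed));
        System.out.println(matches(decomposed, Pattern.CANON_EQ, decomposed));
    }
}
```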

Note 

Using this flag may degrade performance. Unlike the other flags, CANON_EQ has no equivalent embedded flag expression.

