
Appendix A: Regular Expression Reference

This appendix provides a comprehensive quick reference for your day-to-day Java regular expression needs. The material in this appendix is presented as pure regex patterns, not the Java String-delimited counterparts. For example, when referring to a digit, \d, not \\d, is used here.
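To make the distinction concrete, here is a minimal sketch of how the pure patterns in this appendix translate into Java source. The class and method names (EscapingDemo, isDigit, isBackslash) are made up for illustration; only Pattern.matches comes from java.util.regex.

```java
import java.util.regex.Pattern;

public class EscapingDemo {
    // The pure regex \d is written "\\d" in Java source, because the
    // compiler consumes one level of backslashes in a String literal.
    static boolean isDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // The pure regex \\ (one literal backslash) is written "\\\\".
    static boolean isBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(isDigit("7"));      // true
        System.out.println(isDigit("x"));      // false
        System.out.println(isBackslash("\\")); // true
    }
}
```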

Table A-1: Common Characters

Regex

Description

Notes

q

The character q

Could be any character, not including special regex characters or punctuation.

\\

The backslash (\) character

\ delimits special regex characters, of which the backslash is a member.

\t

The tab character

 

\n

The newline or linefeed character

 

\r

The carriage-return character

 

\f

The form-feed character

 
Table A-2: Predefined Character Classes

Regex

Description

Notes

.

Any single character

Matches any single character, including spaces, punctuation, letters, or digits. By default, it doesn't match line terminators; it matches them only when the DOTALL flag is active. Which characters count as line terminators depends on the UNIX_LINES flag (when it is set, only \n is recognized), not on the operating system. Thus, it's probably a good idea to explicitly set, or turn off, DOTALL support in your patterns, so that the intent is clear to whoever reads or ports your code.

\d

Any single digit from 0 to 9

 

\D

Any single character except a digit

This includes line terminators, because they aren't digits.

\s

A whitespace character: [ \t\n\x0B\f\r]

This matches the space, tab, newline (linefeed), vertical-tab (\x0B), form-feed, and carriage-return characters.

\S

A non-whitespace character

Matches anything that is not a whitespace character, as described previously. Thus, 7 would be matched, as would punctuation.

\w

A word character: [a-zA-Z_0-9]

Any uppercase or lowercase letter, digit, or the underscore character.

\W

A nonword character; the opposite of \w

Anything that isn't a word character, as described previously. Thus, the minus sign will match, as will a space or a line terminator. The end-of-line $ and beginning-of-line ^ anchors are positions, not characters, so \W has nothing to match there.
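The following sketch exercises a few of the predefined classes against single-character inputs. The class name PredefinedClassesDemo and the helper matches are illustrative only; the flags and methods are the standard java.util.regex ones.

```java
import java.util.regex.Pattern;

public class PredefinedClassesDemo {
    // True if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String verticalTab = String.valueOf((char) 0x0B);

        System.out.println(matches(".", 0, "\n"));              // false
        System.out.println(matches(".", Pattern.DOTALL, "\n")); // true
        System.out.println(matches("\\D", 0, "\n"));            // true
        System.out.println(matches("\\s", 0, verticalTab));     // true
        System.out.println(matches("\\S", 0, "7"));             // true
        System.out.println(matches("\\w", 0, "_"));             // true
        System.out.println(matches("\\W", 0, "-"));             // true
    }
}
```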

Table A-3: Character Classes

Regex

Description

Notes

[abc]

a, b, or c

It matches exactly one character, so it won't match the two-character String ab as a whole.

[^abc]

Any character except a, b, or c

This will match any single character except a, b, or c, including line terminators. The end-of-line $ and beginning-of-line ^ anchors are positions, not characters, so they aren't affected.

[a-zA-Z]

Any uppercase or lowercase letter

When working with numbers, [0-25] doesn't mean 0 to 25. It means 0 to 2, or just 5. If you wanted 0 to 25, you would need to actually write an expression, such as \d|1\d|2[0-5]. Note that [0-9] is exactly equal to \d.

[a-c[x-z]]

a through c, or x through z

For example, [1-3[7-9]] matches 1 through 3, or 7 through 9. No other digit will do.

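[a-z&&[ae]]

a or e

[a-z&&[aeiou]] matches all lowercase vowels. Don't separate the members with commas: inside a character class, a comma is just another literal character.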

[a-z&&[^bc]]

All lowercase letters except for b and c

For example, all the prime numbers between 1 and 9 would be [1-9&&[^14689]]. That is, 1 through 9, excluding 1, 4, 6, 8, and 9.

[a-d&&[^b-c]]

a through d, but not b through c

[1-9&&[^4-6]] matches 1 through 3, or 7 through 9. Compare this to the union example presented earlier in this table.
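As a quick check of the union, intersection, and subtraction forms in this table, here is a small sketch. CharacterClassDemo and its matches helper are hypothetical names; each call tests whether the entire input matches.

```java
import java.util.regex.Pattern;

public class CharacterClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [0-25] is a single character: 0, 1, 2, or 5.
        System.out.println(matches("[0-25]", "5"));            // true
        System.out.println(matches("[0-25]", "25"));           // false
        // An alternation is needed for the numbers 0 to 25.
        System.out.println(matches("\\d|1\\d|2[0-5]", "25"));  // true
        System.out.println(matches("\\d|1\\d|2[0-5]", "26"));  // false
        // Union, intersection, and subtraction.
        System.out.println(matches("[1-3[7-9]]", "8"));        // true
        System.out.println(matches("[a-z&&[aeiou]]", "e"));    // true
        System.out.println(matches("[a-z&&[aeiou]]", "b"));    // false
        System.out.println(matches("[1-9&&[^4-6]]", "5"));     // false
    }
}
```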

Table A-4: POSIX Character Classes

Regex

Description

Notes

\p{Lower}

A lowercase alphabetic character

 

\p{Upper}

An uppercase alphabetic character

 

\p{ASCII}

An ASCII character

 

\p{Alpha}

An alphabetic character

 

\p{Digit}

A decimal digit

 

\p{Alnum}

An alphanumeric character

 

\p{Punct}

Punctuation

This is a good way to deal with punctuation in general, without having to delimit special characters such as periods, parentheses, brackets, and such. It matches !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

\p{Graph}

A visible character

Exactly equal to [\p{Alnum}\p{Punct}].

\p{Print}

A printable character

Exactly equal to [\p{Graph}\x20]; that is, the visible characters plus the space character.

\p{Blank}

A space or a tab

 

\p{Space}

Any whitespace character

It matches [ \t\n\x0B\f\r].

\p{Cntrl}

A control character

 

\p{XDigit}

A hexadecimal digit
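A few of the POSIX classes can be tried out with a sketch like the following. PosixClassDemo and matches are illustrative names, and the inputs are arbitrary samples; by default these classes cover US-ASCII only.

```java
import java.util.regex.Pattern;

public class PosixClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{Alnum}+", "abc123"));  // true
        System.out.println(matches("\\p{Punct}+", "!?.,"));    // true
        System.out.println(matches("\\p{XDigit}+", "CAFE12")); // true
        System.out.println(matches("\\p{XDigit}+", "G1"));     // false
        System.out.println(matches("\\p{Blank}", "\t"));       // true
        // The space is printable but not a visible (graphic) character.
        System.out.println(matches("\\p{Print}", " "));        // true
        System.out.println(matches("\\p{Graph}", " "));        // false
    }
}
```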

 
Table A-5: Boundary Matchers

Regex

Description

Notes

^

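Beginning of a line

This matches a position, not a character. By default, it matches only at the beginning of the input; when the Pattern.MULTILINE flag is active, it also matches just after each line terminator.

$

End of a line

This also matches a position, not a character. By default, it matches only at the end of the input (or just before a final line terminator); when the Pattern.MULTILINE flag is active, it also matches just before each line terminator.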

\b

A word boundary

This is the position of a word boundary. Its usage requires some caution, because \b doesn't match characters; it matches a position. Thus, the String anna marie doesn't match the regex anna\bmarie. However, it does match anna\b\smarie. That's because there's a word boundary at the position after the last a in anna, but the space that follows is a real character that \b doesn't consume, so \s is necessary to match it before marie can follow. Because \b matches a position, it is meaningless to add quantifiers to it or to repeat it. Thus, \b+, \b\b\b\b\b, and \b all match exactly the same thing.

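Further complicating the picture is the fact that in many regex flavors, \b inside a character class means a backspace character, because the word boundary has no place inside a character class. Thus, in those flavors [\b] describes a backspace character, because it is surrounded by [ and ]. Be careful in Java, though: java.util.regex has historically rejected [\b] as an illegal escape sequence, so \x08 is the safer way to describe a backspace character.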

\B

A non-word boundary

This is the opposite of word boundary, as described previously.

\A

The beginning of the input

\A matches the beginning of the input, but it isn't just a synonym for the ^ pattern. This distinction becomes clear if you use the Pattern.MULTILINE flag when you compile your pattern. \A matches the beginning of the input, which is the very beginning of the file. By contrast, ^ matches the beginning of each line when the Pattern.MULTILINE flag is active.

\Z

The end of the input except for the final line terminator, if any

\Z matches the end of the input, but it isn't just a synonym for the $ character. This distinction becomes clear if you use the Pattern.MULTILINE flag when you compile your pattern. \Z matches the end of the input, which is the very end of the file. By contrast, $ matches the end of each line when the Pattern.MULTILINE flag is active.

\G

The end of the previous match

 

\z

The end of the input

This behaves exactly like \Z with a capital Z, except that it matches only at the very end of the input, after any final line terminator.
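The differences between these boundary matchers are easiest to see side by side. The sketch below uses made-up names (BoundaryDemo, count, found) and arbitrary sample text; it counts or tests matches found with Matcher.find().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Number of matches that repeated find() calls locate.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";
        System.out.println(count("^\\w+", 0, lines));                  // 1
        System.out.println(count("^\\w+", Pattern.MULTILINE, lines));  // 3
        System.out.println(count("\\A\\w+", Pattern.MULTILINE, lines)); // 1

        // \Z tolerates a final line terminator; \z does not.
        System.out.println(found("\\w+\\Z", "end\n")); // true
        System.out.println(found("\\w+\\z", "end\n")); // false

        // \b matches a position, so the space still needs \s.
        System.out.println(found("anna\\bmarie", "anna marie"));     // false
        System.out.println(found("anna\\b\\smarie", "anna marie"));  // true
    }
}
```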

Table A-6: Greedy Quantifiers

Regex

Description

Notes

X?

X, once or not at all

A? would match A, or the absence of A. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X*

X, zero or more times

This pattern is very much like the ? pattern, except that it matches zero or more occurrences. It doesn't match "any character," as its usage in DOS might indicate. Thus, A* would match A, AA, AAA, or the absence of A. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X+

X, one or more times

This quantifier is very much like the * pattern, except that it looks for the existence of one or more occurrences instead of zero or more occurrences. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n}

X, exactly n times

This quantifier demands the occurrence of the target exactly n times. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n,}

X, at least n times

This quantifier demands the occurrence of the target at least n times. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n,m}

X, at least n but not more than m times

This quantifier demands the occurrence of the target at least n times, but not more than m times. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).
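Because each quantifier binds to whatever immediately precedes it (a character, a group, or a character class), the same quantifier can mean quite different things. A minimal sketch, with GreedyDemo and matches as illustrative names:

```java
import java.util.regex.Pattern;

public class GreedyDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // + applies to the character b only.
        System.out.println(matches("ab+", "abbb"));   // true
        System.out.println(matches("ab+", "abab"));   // false
        // + applies to the group (ab).
        System.out.println(matches("(ab)+", "abab")); // true
        // + applies to the character class [ab].
        System.out.println(matches("[ab]+", "abba")); // true

        // Counted quantifiers.
        System.out.println(matches("\\d{3}", "123"));     // true
        System.out.println(matches("\\d{3}", "12"));      // false
        System.out.println(matches("\\d{2,}", "12345"));  // true
        System.out.println(matches("\\d{2,4}", "12345")); // false

        // ? and * both accept the absence of A.
        System.out.println(matches("A?", ""));    // true
        System.out.println(matches("A*", "AAA")); // true
    }
}
```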

Table A-7: Reluctant Quantifiers

Regex

Description

Notes

X??

X, once or not at all

This pattern is very much like the ? pattern, except that it prefers to match nothing at all. When it's used with the Matcher.matches() method, ?? functions in exactly the same way as the ? pattern. However, when it's used with Matcher.find(), the behavior is different. For example, the pattern x??, as applied to the String xx, will actually not find x, yet consider that lack of finding a success. That's because we asked it to be reluctant to match, and the most reluctant thing it can do is match zero occurrences of x. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X*?

X, zero or more times

This pattern is very much like the * pattern, except that it prefers to match as little as possible. When it's used with the Matcher.matches() method, *? functions in exactly the same way as the * pattern. However, when it's used with Matcher.find(), the behavior is different. For example, the pattern x*?, as applied to the String xx, will actually not find x, yet consider that a success. That's because we asked it to be reluctant to match, and the most reluctant thing it can do is match zero occurrences of x. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X+?

X, one or more times

This pattern is very much like the + pattern, except that it prefers to match as little as possible. When it's used with the Matcher.matches() method, +? functions in exactly the same way as the + pattern. However, when it's used with Matcher.find(), the behavior is different. For example, the pattern x+?, as applied to the String xx, will actually find one x, yet consider that a success. That's because we asked it to be reluctant to match, and the most reluctant thing it can do is match one occurrence of x. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n}?

X, exactly n times

This pattern is exactly like the X{n} pattern. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n,}?

X, at least n times

This pattern is very much like the X{n,} pattern, except that it prefers to match as little as possible. When it's used with the Matcher.matches() method, X{n,}? functions in exactly the same way as the X{n,} pattern. However, when it's used with Matcher.find(), the behavior is different. For example, the pattern x{3,}?, as applied to the String xxxxx, will actually only find xxx, yet consider that a success. Compare this with just x{3,}, which would have found xxxxx. That's because we asked it to be reluctant to match, and the most reluctant thing it can do is match three occurrences of x. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n,m}?

X, at least n but not more than m times

This pattern is very much like the X{n,m} pattern, except that it prefers to match as little as possible. When it's used with the Matcher.matches() method, X{n,m}? functions in exactly the same way as the X{n,m} pattern. However, when it's used with Matcher.find(), the behavior is different. For example, the pattern x{3,5}?, as applied to the String xxxxx, will actually find xxx, yet consider that a success. Compare this with just x{3,5}, which would have found xxxxx. This happens because we asked it to be reluctant to match, and the most reluctant thing it can do is match three occurrences of x. Notice that if there were six x characters, such as xxxxxx, repeated calls to find() would have matched twice: once for the first three x characters and then again for the other three. Again, this is because three is the minimum requirement. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).
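The contrast between greedy and reluctant quantifiers under Matcher.find() can be sketched as follows. ReluctantDemo, first, and count are hypothetical helper names; first returns the text of the first match, which may be the empty String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // Text of the first match found, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Number of matches that repeated find() calls locate.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("[" + first("x?", "xx") + "]");  // [x]
        System.out.println("[" + first("x??", "xx") + "]"); // []
        System.out.println("[" + first("x+", "xx") + "]");  // [xx]
        System.out.println("[" + first("x+?", "xx") + "]"); // [x]
        System.out.println(first("x{3,5}", "xxxxx"));       // xxxxx
        System.out.println(first("x{3,5}?", "xxxxx"));      // xxx
        System.out.println(count("x{3,5}?", "xxxxxx"));     // 2
        // With matches(), the whole input must be consumed either way.
        System.out.println(Pattern.matches("x{3,5}?", "xxxxx")); // true
    }
}
```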

Table A-8: Possessive Quantifiers

Regex

Description

Notes

X?+

X, once or not at all

Very much like the ? pattern, this pattern prefers to match as much as possible. However, this pattern won't release matches to help the entire expression match as a whole. For example, the pattern \w?+\d, as applied to the String 2, will actually not match, because \w?+ consumes the 2 and won't release it for the greater good of allowing the entire expression to match. Thus, \d is unable to match. By contrast, the greedy \w?\d does match 2, because \w? gives the character back. Note that \w?+\d does match A2: \w?+ can consume at most one character, the A, which leaves the 2 for \d. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X*+

X, zero or more times

Very much like the * pattern, this pattern prefers to match as much as possible. However, this pattern won't release matching to help the entire expression as a whole match. For example, the pattern \w*+\d, as applied to the String Java2, will actually not match, because the first \w*+ consumes the String Java2 and won't release it for the greater good of allowing the entire expression to match. Thus, \d is unable to match. This is because the pattern \w*+ is possessive, and the most possessive thing it can do is match the entire Java2 and not release anything. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X++

X, one or more times

Very much like the + pattern, this pattern prefers to match as much as possible. However, this pattern won't release matching to help the entire expression as a whole match. For example, the pattern \w++\d, as applied to the String Java2, will actually not match, because the first \w++ consumes the String Java2 and won't release it for the greater good of allowing the entire expression to match. Thus, \d is unable to match. This is because the pattern \w++ is possessive, and the most possessive thing it can do is match the entire Java2 and not release anything. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n}+

X, exactly n times

This pattern is exactly like the X{n} pattern. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n,}+

X, at least n times

Very much like the X{n,} pattern, this pattern prefers to match as much as possible. However, this pattern won't release matching to help the entire expression as a whole match. For example, the pattern \w{4,}+\d, as applied to the String Java2, will actually not match, because the \w{4,}+ consumes the String Java2 and won't release it for the greater good of allowing the entire expression to match. Thus, \d is unable to match. This is because the pattern \w{4,}+ is possessive, and the most possessive thing it can do is match the entire Java2 and not release anything. This pattern applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).

X{n,m}+

X, at least n but not more than m times

Very much like the X{n,m} pattern, this pattern prefers to match as much as possible. However, this pattern won't release matching to help the entire expression as a whole match. For example, the pattern \w{2,5}+\d, as applied to the String Java2, will actually not match, because \w{2,5}+ consumes the String Java2 and won't release it for the greater good of allowing the entire expression to match. Thus, \d is unable to match. This is because the pattern \w{2,5}+ is possessive, and the most possessive thing it can do is match the entire Java2 and not release anything. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it).
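The effect of possessiveness is easiest to see by comparing each possessive quantifier with its greedy counterpart on the same input. In this sketch, PossessiveDemo and matches are illustrative names.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy quantifiers give back the final 2 so that \d can match.
        System.out.println(matches("\\w*\\d", "Java2"));       // true
        System.out.println(matches("\\w+\\d", "Java2"));       // true
        System.out.println(matches("\\w{2,5}\\d", "Java2"));   // true

        // Possessive quantifiers keep everything they consumed.
        System.out.println(matches("\\w*+\\d", "Java2"));      // false
        System.out.println(matches("\\w++\\d", "Java2"));      // false
        System.out.println(matches("\\w{4,}+\\d", "Java2"));   // false
        System.out.println(matches("\\w{2,5}+\\d", "Java2"));  // false
    }
}
```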

Table A-9: Logical Operators

Regex

Description

Notes

XY

X followed by Y

This is the default relationship assumption between characters. Note that spaces are a valid part of this syntax. Thus, A B means the character A, followed by a space, followed by the character B.

X|Y

Either X or Y

AB|CD will match either AB or CD. Similarly, the pattern hello sir|madam will match hello sir or it will match madam. Specifically, it won't match hello madam. This is because of the nature of the And pattern discussed previously. When the regex engine sees hello sir, it assumes you mean that hello, followed by a space, followed by sir should be treated as a single logical unit. These are all Anded together. Then the engine sees the Or pattern, so it assumes that the logical alternative is madam.

If you actually want to accept hello sir or hello madam, you'll have to use groups—thus, the pattern hello (sir|madam). Or better yet, you can use the noncapturing group hello (?:sir|madam).

(X)

X, the capturing group

A capturing group is a logical unit that is conceptually similar to the parenthesized units you're familiar with from algebra. Thus, (\d\d\d\d) is a capturing group that defines four digits. Capturing groups can be referred to later in your expression by using a back reference, as explained later. They're counted left to right by their opening parentheses and can be nested. Thus, h(ello (world)) has two explicit capturing groups, plus the implicit group 0. Capturing group 0 is the entire expression, which matches the String hello world. Capturing group 1 is ello world, because you count from left to right, and the first group starts with the ( right before the e in hello. Group 2 is world, because the second group starts right before the w in world.

\n

The nth capturing group matched

In this context, I'm not referring to newline, even though \n looks like the newline symbol. The n in this case refers to a number. The regex engine allows you to access the information captured by a previous group, even as the search is executing. For example, if you want to find repeated words, all you need is the pattern (\w+)\W\1, which says, "Look for a group of word characters, followed by a nonword character, followed by the exact same text that was captured in group 1." If you attempt to refer to a group that doesn't exist, a PatternSyntaxException will be thrown.

If you happen to have, say, 13 captured groups, then \13 will mean that you want the thirteenth capturing group. If you don't have 13 groups, then the same expression \13 will mean the first capturing group, followed by the digit 3.
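Alternation scope, group numbering, and back references can all be checked with a short sketch. GroupingDemo, matches, and repeatedWord are hypothetical names, and the sample sentences are arbitrary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupingDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // The first word that is immediately repeated, or null if none is.
    static String repeatedWord(String text) {
        Matcher m = Pattern.compile("(\\w+)\\W\\1").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Without a group, the alternatives are "hello sir" and "madam".
        System.out.println(matches("hello sir|madam", "hello madam"));     // false
        System.out.println(matches("hello sir|madam", "madam"));           // true
        System.out.println(matches("hello (?:sir|madam)", "hello madam")); // true

        // Groups are numbered by their opening parentheses.
        Matcher m = Pattern.compile("h(ello (world))").matcher("hello world");
        if (m.matches()) {
            System.out.println(m.groupCount()); // 2
            System.out.println(m.group(0));     // hello world
            System.out.println(m.group(1));     // ello world
            System.out.println(m.group(2));     // world
        }

        System.out.println(repeatedWord("it is is fine")); // is
    }
}
```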

Table A-10: Quotation

Regex

Description

Notes

\

Quotes the following character

This quotes the metacharacter that follows, so it will actually be treated as a character. Thus, if you were looking for a dollar sign, you would use \$ as the pattern. By contrast, $ would have matched the end-of-line position. Remember that for regex patterns written as Java String literals, you need to double every \ character you see here. Thus, in a Java String, \s becomes \\s.

\Q

Quotes all characters until \E

This works in conjunction with \E to quote a sequence of characters. If you need to quote a lot of characters in sequence, then use \Q to open your quote and \E to close it. For example, if you want the characters \([?*, the expression \Q\([?*\E will do the job.

\E

Ends quote started by \Q
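Here is a minimal sketch of the quoting forms as they appear in Java source. QuotingDemo and matches are illustrative names. Pattern.quote is a standard java.util.regex method that quotes an arbitrary String for you; it is shown as an alternative and isn't something the table above covers.

```java
import java.util.regex.Pattern;

public class QuotingDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The pure regex \$ matches a literal dollar sign.
        System.out.println(matches("\\$", "$")); // true

        // The pure regex \Q\([?*\E matches the five characters \([?*
        System.out.println(matches("\\Q\\([?*\\E", "\\([?*")); // true

        // Unquoted, + is a quantifier; quoted, it is a literal plus sign.
        System.out.println(matches("1+1=2", "1+1=2"));                // false
        System.out.println(matches(Pattern.quote("1+1=2"), "1+1=2")); // true
    }
}
```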

 
Table A-11: Noncapturing Group Constructs

Regex

Description

Notes

(?:X)

Defines a subpattern as a logical unit

Noncapturing groups don't store the information that actually matches the pattern for later access. These are much more efficient than capturing groups if you're only using grouping for logical purposes. This pattern is noncapturing.

(?idmsux-idmsux)

i for CASE_INSENSITIVE

x for COMMENTS

s for DOTALL

u for UNICODE_CASE

m for MULTILINE

d for UNIX_LINES

The pattern (?i)hel(?-i)LO will match the String HELLO, because (?i) indicates a case-insensitive match starting from h, and (?-i) signals an end to that case insensitivity after the first l. This pattern is noncapturing.

(?idmsux-idmsux:X)

X, with the given flags on or off

The pattern (?i:hel)LO will match the String HELLO, because (?i: indicates a case-insensitive match starting from h and ending with the first l. This pattern is noncapturing.
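The embedded flag forms can be compared with the equivalent compile-time flag in a short sketch. FlagsDemo and matches are hypothetical names.

```java
import java.util.regex.Pattern;

public class FlagsDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Case insensitivity covers hel only; LO must be uppercase.
        System.out.println(matches("(?i)hel(?-i)LO", "HELLO")); // true
        System.out.println(matches("(?i)hel(?-i)LO", "helLO")); // true
        System.out.println(matches("(?i)hel(?-i)LO", "hello")); // false

        // The scoped form expresses the same thing more compactly.
        System.out.println(matches("(?i:hel)LO", "HELLO"));     // true
        System.out.println(matches("(?i:hel)LO", "hello"));     // false

        // A flag passed to compile() applies to the whole pattern.
        Pattern p = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);
        System.out.println(p.matcher("HeLLo").matches());       // true
    }
}
```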

Table A-12: Lookarounds

Regex

Description

Notes

(?=X)

X, using zero-width positive lookahead

This pattern glances to the right of the current position, at whatever remains to be parsed from the candidate String, and succeeds only if the expression X matches there. It doesn't consume any characters. For example, if you want to extract all of the inline comments from a text file, you might try the pattern (?=//).*$ and extract group 0. This pattern is noncapturing.

(?!X)

X, using zero-width negative lookahead

This pattern glances to the right of the current position, at whatever remains to be parsed from the candidate String, and succeeds only if the expression X doesn't match there. For example, if you want to skip the spaces leading up to some content, you could use (?!\s).* with Matcher.find(). This pattern is noncapturing.

(?<=X)

X, using zero-width positive lookbehind

This pattern glances to the left of the current position, at what has already been parsed from the candidate String, and succeeds only if the expression X matches immediately before that position. For example, if you want to extract the text of inline comments from a text file without the leading slashes, you might try the pattern (?<=//).*$. This pattern is noncapturing.

(?<!X)

X, using zero-width negative lookbehind

This pattern glances to the left of the current position, at what has already been parsed from the candidate String, and succeeds only if the expression X doesn't match immediately before that position. For example, the pattern (?<!un)happy finds happy in the String happy, but finds nothing in the String unhappy. This pattern is noncapturing.

(?>X)

X, as an independent, noncapturing group

This pattern refuses to release the contents of the match, regardless of the consequences on the rest of the pattern's ability to match. Thus, whereas the pattern \w+\d matches the String java2, the pattern (?>\w+)\d does not, because the (?>\w+) consumes java and 2, and refuses to release the 2 so that \d can match.
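The lookaround constructs are easier to follow with concrete input. The sketch below uses made-up names (LookaroundDemo, first) and sample Strings chosen for illustration; the (?<!un)happy pattern is just one simple way to show a negative lookbehind.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Text of the first match found, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "int x = 1; // counter";

        // Positive lookahead: the match starts where // begins.
        System.out.println(first("(?=//).*$", line));   // "// counter"
        // Positive lookbehind: the match starts just after //.
        System.out.println(first("(?<=//).*$", line));  // " counter"
        // Negative lookahead: the match can't start on whitespace.
        System.out.println(first("(?!\\s).*", "   content")); // "content"
        // Negative lookbehind: happy must not be preceded by un.
        System.out.println(first("(?<!un)happy", "happy"));   // "happy"
        System.out.println(first("(?<!un)happy", "unhappy")); // null

        // Independent group: \w+ keeps the 2, so \d has nothing left.
        System.out.println(Pattern.matches("\\w+\\d", "java2"));     // true
        System.out.println(Pattern.matches("(?>\\w+)\\d", "java2")); // false
    }
}
```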

Table A-13: Less Common Characters

Regex

Description

Notes

\0n

The character with octal value 0n

0 <= n <= 7

\0nn

The character with octal value 0nn

0 <= n <= 7

\0mnn

The character with octal value 0mnn

0 <= m <= 3, 0 <= n <= 7 (This can't exceed 377.)

\xhh

The character with hexadecimal value 0xhh

h is a hexadecimal digit: 0 through 9, or A through F (a through f is also accepted)

\uhhhh

The character with hexadecimal value 0xhhhh

h is a hexadecimal digit: 0 through 9, or A through F (a through f is also accepted)

\a

The alert (bell) character ('\u0007')

 

\e

The escape character ('\u001B')

 

\cx

The control character corresponding to x
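As a small illustration, the letter A (decimal 65, octal 101, hexadecimal 41) can be written in each of the numeric forms above. LessCommonDemo and matches are hypothetical names; the bell and escape characters are built from their numeric codes to keep the source readable.

```java
import java.util.regex.Pattern;

public class LessCommonDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Three spellings of the letter A: octal, two-digit hex,
        // and the four-hex-digit Unicode form.
        System.out.println(matches("\\0101", "A"));  // true
        System.out.println(matches("\\x41", "A"));   // true
        System.out.println(matches("\\u0041", "A")); // true

        // The alert (bell) and escape characters.
        String bell = String.valueOf((char) 0x07);
        String escape = String.valueOf((char) 0x1B);
        System.out.println(matches("\\a", bell));    // true
        System.out.println(matches("\\e", escape));  // true
    }
}
```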

 
Table A-14: Unicode Blocks and Categories

Regex

Description

Notes

\p{InGreek}

A character in the Greek block

 

\p{Lu}

An uppercase letter

\p{Lu} matches any uppercase character.

\p{Sc}

A currency symbol

If you need to find or swap out, say, a dollar sign, this is a good way to do so without having to deal with the various delimiting complexities of not matching the end-of-line character $.

\P{InGreek}

Any character except one in the Greek block (negation)

Notice the use of the capital P here. In general, uppercase \P is the opposite of lowercase \p. Thus, \P{Lower} matches any character that isn't a lowercase letter, which includes uppercase letters but also digits, punctuation, and whitespace.

[\p{L}&&[^\p{Lu}]]

Any non-uppercase letter

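This matches any letter that \p{Lu} doesn't: lowercase letters, plus letters that have no case at all. It doesn't match digits or punctuation, and it is the opposite of \p{Upper} within the letters, not its equal.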
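The block and category classes can be tried against a few sample characters. In this sketch, UnicodeDemo and matches are illustrative names; the sample characters are the Greek small alpha (U+03B1), the Greek capital omega (U+03A9), and the euro sign (U+20AC).

```java
import java.util.regex.Pattern;

public class UnicodeDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1";
        String omega = "\u03A9";
        String euro = "\u20AC";

        System.out.println(matches("\\p{InGreek}", alpha)); // true
        System.out.println(matches("\\P{InGreek}", "a"));   // true
        System.out.println(matches("\\p{Lu}", omega));      // true
        System.out.println(matches("\\p{Sc}", "$"));        // true
        System.out.println(matches("\\p{Sc}", euro));       // true

        // \P{Lower} is "anything but a lowercase letter".
        System.out.println(matches("\\P{Lower}", "A"));     // true
        System.out.println(matches("\\P{Lower}", "7"));     // true
        System.out.println(matches("\\P{Lower}", "a"));     // false

        // Letters that are not uppercase letters.
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a")); // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A")); // false
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "7")); // false
    }
}
```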

