This appendix provides a comprehensive quick reference for your day-to-day Java regular expression needs. The material in this appendix is presented as pure regex patterns, not the Java String-delimited counterparts. For example, when referring to a digit, \d, not \\d, is used here.
Regex |
Description |
Notes |
---|---|---|
q |
The character q |
Could be any character, not including special regex characters or punctuation. |
\\ |
The backslash (\) character |
\ escapes special regex characters, and the backslash is itself one of them, so a literal backslash must be escaped. |
\t |
The tab character | |
\n |
The newline or linefeed character | |
\r |
The carriage-return character | |
\f |
The form-feed character | |
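The difference between a pure regex and its Java String form, mentioned at the start of this appendix, can be sketched as follows. This is a minimal illustration; the class name `EscapingSketch` and its helper methods are hypothetical and exist only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class EscapingSketch {

    // The regex for one digit is written "\\d" in Java source.
    static boolean isDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // The regex for one literal backslash needs four backslashes in Java source.
    static boolean isBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    // The regex for a tab is written "\\t"; the input "\t" is a real tab character.
    static boolean isTab(String s) {
        return Pattern.matches("\\t", s);
    }

    public static void main(String[] args) {
        System.out.println(isDigit("7"));
        System.out.println(isBackslash("\\"));
        System.out.println(isTab("\t"));
    }
}
```

Each pattern in the tables therefore gains one extra backslash per backslash when it is pasted into a Java String literal.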
Regex |
Description |
Notes |
---|---|---|
. |
Any single character |
Matches any single character, including spaces, punctuation, letters, or digits. By default it doesn't match line-termination characters; it does when the DOTALL flag is active. Which characters count as line terminators also depends on the UNIX_LINES flag. Thus, it's a good idea to set, or turn off, DOTALL explicitly in your patterns rather than rely on the default. |
\d |
Any single digit from 0 to 9 | |
\D |
Any single character except a digit |
Unlike the dot, this does match line terminators. |
\s |
A whitespace character: [ \t\n\x0B\f\r] |
This matches the space, tab, newline, vertical-tab, form-feed, and carriage-return characters. |
\S |
A non-whitespace character |
Matches anything that is not a whitespace character, as described previously. Thus, 7 would be matched, as would punctuation. |
\w |
A word character: [a-zA-Z_0-9] |
Any uppercase or lowercase letter, digit, or the underscore character. |
\W |
A nonword character; the opposite of \w |
Anything that isn't a word character, as described previously. Thus, the minus sign will match, as will a space or a line terminator. The anchors $ and ^ are positions rather than characters, so \W never consumes them; a literal $ or ^ character in the input does match. |
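The behavior of the dot and of the negated classes around line terminators can be checked with a short sketch. The class `DotAndClassesSketch` and its method names are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class DotAndClassesSketch {

    // Does the dot match a newline, with or without the DOTALL flag?
    static boolean dotMatchesNewline(boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(".", flags).matcher("\n").matches();
    }

    // A negated predefined class is not restricted the way the dot is.
    static boolean nonDigitMatchesNewline() {
        return Pattern.matches("\\D", "\n");
    }

    // Removes every nonword character, leaving letters, digits, and underscores.
    static String keepWordChars(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(false));
        System.out.println(dotMatchesNewline(true));
        System.out.println(nonDigitMatchesNewline());
        System.out.println(keepWordChars("a-b c_1$"));
    }
}
```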
Regex |
Description |
Notes |
---|---|---|
[abc] |
a, b, or c |
It matches exactly one character, so it won't match ab as a whole. |
[^abc] |
Any character except a, b, or c |
This matches any single character except a, b, or c, including line terminators. The anchors $ and ^ are positions rather than characters, so the class never consumes them. |
[a-zA-Z] |
Any uppercase or lowercase letter |
When working with numbers, [0-25] doesn't mean 0 to 25. It means 0 to 2, or just 5. If you wanted 0 to 25, you would need to write an alternation such as \d|1\d|2[0-5]. Note that [0-9] is equivalent to \d under the default flags. |
[a-c[x-z]] |
a through c, or x through z |
For example, [1-3[7-9]] matches 1 through 3, or 7 through 9. No other digit will do. |
[a-z&&[ae]] |
a or e |
[a-z&&[aeiou]] matches all lowercase vowels. Don't separate the members with commas: inside a character class, a comma is a literal character. |
[a-z&&[^bc]] |
All lowercase letters except for b and c |
For example, all the prime numbers between 1 and 9 would be [1-9&&[^14689]]. That is, 1 through 9, excluding 1, 4, 6, 8, and 9. |
[a-d&&[^b-c]] |
a through d, but not b through c |
[1-9&&[^4-6]] matches 1 through 3, or 7 through 9. Compare this to the union example presented earlier in this table. |
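The range, union, intersection, and subtraction forms in this table can be tried out with a small filter. This is a sketch only; `CharClassSketch`, `keep`, and `isZeroTo25` are hypothetical names introduced for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class CharClassSketch {

    // Returns the characters of input that the given character class accepts.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Numeric ranges need alternation, not a character class.
    static boolean isZeroTo25(String s) {
        return Pattern.matches("\\d|1\\d|2[0-5]", s);
    }

    public static void main(String[] args) {
        String digits = "0123456789";
        System.out.println(keep("[0-25]", digits));        // 0 to 2, or 5
        System.out.println(keep("[1-3[7-9]]", digits));    // union
        System.out.println(keep("[1-9&&[^4-6]]", digits)); // subtraction
        System.out.println(keep("[a-z&&[^bc]]", "abcdX")); // lowercase except b and c
        System.out.println(isZeroTo25("25") + " " + isZeroTo25("26"));
    }
}
```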
Regex |
Description |
Notes |
---|---|---|
\p{Lower} |
A lowercase alphabetic character | |
\p{Upper} |
An uppercase alphabetic character | |
\p{ASCII} |
An ASCII character | |
\p{Alpha} |
An alphabetic character | |
\p{Digit} |
A decimal digit | |
\p{Alnum} |
An alphanumeric character | |
\p{Punct} |
Punctuation |
This is a good way to deal with punctuation in general, without having to escape special characters such as periods, parentheses, brackets, and such. It matches !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. |
\p{Graph} |
A visible character |
Exactly equal to [\p{Alnum}\p{Punct}]. |
\p{Print} |
A printable character |
Exactly equal to [\p{Graph}\x20], that is, a visible character or a space. |
\p{Blank} |
A space or a tab | |
\p{Space} |
Any whitespace character |
It matches [ \t\n\x0B\f\r]. |
\p{Cntrl} |
A control character | |
\p{XDigit} |
A hexadecimal digit | |
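A few of the POSIX classes above can be exercised as follows. This is a minimal sketch under the default flags (ASCII-only classes); `PosixClassSketch` and its helpers are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class PosixClassSketch {

    // Removes ASCII punctuation without escaping any individual character.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // Thin wrapper so the examples below read uniformly.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Hello, world! (ok)"));
        System.out.println(matches("\\p{Graph}", " "));  // a space is not visible
        System.out.println(matches("\\p{Print}", " "));  // but it is printable
        System.out.println(matches("\\p{XDigit}+", "CAFE42"));
        System.out.println(matches("\\p{Blank}", "\n")); // only space or tab
    }
}
```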
Regex |
Description |
Notes |
---|---|---|
^ |
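Beginning of a line |
This is a zero-width anchor: it matches a position, not a character. Without the MULTILINE flag, it matches only at the beginning of the input. |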
$ |
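End of a line |
This is a zero-width anchor: it matches a position, not a character. Without the MULTILINE flag, it matches only at the end of the input, or just before a final line terminator. |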
\b |
A word boundary |
\b matches the position of a word boundary. Its usage requires some caution, because \b doesn't match characters; it matches a position. Thus, the String anna marie doesn't match the regex anna\bmarie. However, it does match anna\b\smarie. That's because there's a word boundary at the position after the last a in anna, but the space that follows still has to be consumed, so \s is necessary to match it, and marie must then follow. Because \b matches a position, it is meaningless to add quantifiers to it or to repeat it. Thus, \b+, \b\b\b\b\b, and \b all match exactly the same thing. Further complicating the picture is the fact that in many regex flavors, \b inside a character class, as in [\b], means a backspace character, because a word boundary has no place inside a character class. The java.util.regex documentation doesn't list that usage, so in Java it is safer to write a backspace as \x08. |
\B |
A non-word boundary |
This is the opposite of word boundary, as described previously. |
\A |
The beginning of the input |
\A matches the beginning of the input, but it isn't just a synonym for the ^ pattern. This distinction becomes clear if you use the Pattern.MULTILINE flag when you compile your pattern. \A still matches only at the very beginning of the entire input. By contrast, ^ matches at the beginning of each line when the Pattern.MULTILINE flag is active. |
\Z |
The end of the input except for the final line terminator, if any |
\Z matches at the end of the input, or just before a final line terminator, but it isn't just a synonym for the $ pattern. This distinction becomes clear if you use the Pattern.MULTILINE flag when you compile your pattern. \Z still matches only at the end of the entire input. By contrast, $ matches at the end of each line when the Pattern.MULTILINE flag is active. |
\G |
The end of the previous match | |
\z |
The end of the input |
This behaves like \Z with a capital Z, except that it matches only at the very end of the input: it doesn't match before a final line terminator. |
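The differences between ^ and \A, between \Z and \z, and the zero-width nature of \b can be observed with a short sketch. The class `AnchorSketch` and its helper methods are hypothetical and serve only this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class AnchorSketch {

    // Counts the matches of regex in input, optionally with MULTILINE active.
    static int count(String regex, boolean multiline, String input) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True if regex matches somewhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True if regex matches the whole input.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        System.out.println(count("^\\w+", true, text));    // one match per line
        System.out.println(count("\\A\\w+", true, text));  // start of input only
        System.out.println(found("end\\Z", "the end\n"));  // final terminator is tolerated
        System.out.println(found("end\\z", "the end\n"));  // final terminator is not
        System.out.println(matchesFully("anna\\bmarie", "anna marie"));
        System.out.println(matchesFully("anna\\b\\smarie", "anna marie"));
    }
}
```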
Regex |
Description |
Notes |
---|---|---|
X? |
X, once or not at all |
A? would match A, or the absence of A. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it). |
X* |
X, zero or more times |
This pattern is very much like the ? pattern, except that it matches zero or more occurrences. It doesn't match "any character," as its usage in DOS might indicate. Thus, A* would match A, AA, AAA, or the absence of A. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it). |
X+ |
X, one or more times |
This quantifier is very much like the * pattern, except that it looks for the existence of one or more occurrences instead of zero or more occurrences. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it). |
X{n} |
X, exactly n times |
This quantifier demands the occurrence of the target exactly n times. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it). |
X{n,} |
X, at least n times |
This quantifier demands the occurrence of the target at least n times. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it). |
X{n,m} |
X, at least n but not more than m times |
This quantifier demands the occurrence of the target at least n times, but not more than m times. This applies to either the character that immediately precedes it, or a group (if the group immediately precedes it), or a character class (if the character class immediately precedes it). |
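The quantifiers in this table, applied to a single character, a group, and a character class, can be sketched as follows. `QuantifierSketch` and `matches` are hypothetical names used only here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class QuantifierSketch {

    // True if regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("A?", ""));           // zero occurrences are fine
        System.out.println(matches("A*", "AAA"));        // zero or more
        System.out.println(matches("A+", ""));           // at least one is required
        System.out.println(matches("\\d{3}", "123"));    // exactly three
        System.out.println(matches("\\d{2,}", "1"));     // at least two
        System.out.println(matches("\\d{2,4}", "12345")); // between two and four
        System.out.println(matches("(ab)+", "ababab"));  // quantifier on a group
        System.out.println(matches("[ab]{2}", "ba"));    // quantifier on a character class
    }
}
```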
Regex |
Description |
Notes |
---|---|---|
\ |
Quotes the following character |
This quotes the metacharacter that follows, so it is treated as a literal character. Thus, if you were looking for a dollar sign, you would use \$ as the pattern. By contrast, an unescaped $ would have matched the end of the line. Remember that for regular expressions written as Java String literals, you need to double every \ character you see. Thus, in a Java String, \s becomes \\s. |
\Q |
Quotes all characters until \E |
This works in conjunction with \E to quote a sequence of characters. If you need to quote a lot of characters in sequence, then use \Q to open your quote and \E to close it. For example, if you want the literal characters \([?*, the expression \Q\([?*\E will do the job. |
\E |
Ends quote started by \Q | |
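Single-character quoting and \Q...\E quoting can be compared in a short sketch. It also uses `Pattern.quote`, a standard library method not mentioned in the table, which wraps a literal in \Q and \E for you. The class `QuoteSketch` and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class QuoteSketch {

    // True if regex matches somewhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True if regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Lets the library build the quoted form of a literal.
    static String quote(String literal) {
        return Pattern.quote(literal);
    }

    public static void main(String[] args) {
        // An escaped dollar sign matches the character itself.
        System.out.println(found("\\$", "cost: $5"));
        // Everything between the quote markers is literal, including the backslash.
        System.out.println(matches("\\Q\\([?*\\E", "\\([?*"));
        // A quoted dot no longer matches arbitrary characters.
        System.out.println(matches(quote("a.b"), "a.b"));
        System.out.println(matches(quote("a.b"), "axb"));
    }
}
```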
Regex |
Description |
Notes |
---|---|---|
(?:X) |
Defines a subpattern as a logical unit |
Noncapturing groups don't store the text that matched for later access. They can be more efficient than capturing groups if you're only using grouping for logical purposes. |
(?idmsux-idmsux) |
Turns match flags on or off: i for CASE_INSENSITIVE, d for UNIX_LINES, m for MULTILINE, s for DOTALL, u for UNICODE_CASE, x for COMMENTS |
The pattern (?i)hel(?-i)LO will match the String HELLO, because (?i) indicates a case-insensitive match starting from h, and (?-i) signals an end to that case insensitivity after the first l. This pattern is noncapturing. |
(?idmsux-idmsux:X) |
X, with the given flags on or off |
The pattern (?i:hel)LO will match the String HELLO, because (?i: indicates a case-insensitive match starting from h and ending with the first l. This pattern is noncapturing. |
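The inline flag forms and the noncapturing group can be checked with a brief sketch. `GroupFlagSketch`, `matches`, and `groupCount` are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class GroupFlagSketch {

    // True if regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Number of capturing groups the pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Case insensitivity covers only "hel"; "LO" must be uppercase.
        System.out.println(matches("(?i)hel(?-i)LO", "HELLO"));
        System.out.println(matches("(?i)hel(?-i)LO", "hello"));
        System.out.println(matches("(?i:hel)LO", "HeLLO"));
        // The noncapturing group is not counted; only (c) is.
        System.out.println(groupCount("(?:ab)+(c)"));
        System.out.println(groupCount("(ab)+(c)"));
    }
}
```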
Regex |
Description |
Notes |
---|---|---|
(?=X) |
X, using zero-width positive lookahead |
This construct checks, without consuming any input, whether X matches immediately to the right of the current position. For example, if you want to extract all of the inline comments from a text file, you might try the pattern (?=//).*$ and extract group 0. The lookahead itself is zero-width and noncapturing. |
(?!X) |
X, using zero-width negative lookahead |
This construct checks, without consuming any input, that X doesn't match immediately to the right of the current position. For example, to skip the spaces leading up to some content, you could search with (?!\s).*; the first match starts at the first non-whitespace character. The lookahead itself is zero-width and noncapturing. |
(?<=X) |
X, using zero-width positive lookbehind |
This construct checks, without consuming any input, whether X matches immediately to the left of the current position, that is, in the text already scanned. For example, if you want to extract the text of inline comments from a text file without the // marker, you might try the pattern (?<=//).*$. The lookbehind itself is zero-width and noncapturing. |
(?<!X) |
X, using zero-width negative lookbehind |
This construct checks, without consuming any input, that X doesn't match immediately to the left of the current position. For example, the pattern (?<!//)TODO matches TODO only where it isn't immediately preceded by //. The lookbehind itself is zero-width and noncapturing. |
(?>X) |
X, as an independent, noncapturing group |
This pattern refuses to release the contents of the match, regardless of the consequences on the rest of the pattern's ability to match. Thus, whereas the pattern \w+\d matches the String java2, the pattern (?>\w+)\d does not, because the (?>\w+) consumes java and 2, and refuses to release the 2 so that \d can match. |
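The lookaround constructs and the independent group can be compared side by side in a sketch like the following. The class `LookaroundSketch`, its helpers, and the sample inputs are hypothetical and chosen only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class LookaroundSketch {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // True if regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String line = "int x = 1; // counter";
        // Lookahead: the match starts at the slashes and includes them.
        System.out.println(firstMatch("(?=//).*$", line));
        // Lookbehind: the match starts just after the slashes.
        System.out.println(firstMatch("(?<=//).*$", line));
        // Negative lookahead: the match cannot start on whitespace.
        System.out.println(firstMatch("(?!\\s).*", "   content"));
        // Negative lookbehind: no match when the slashes come right before.
        System.out.println(firstMatch("(?<!//)TODO", "//TODO"));
        // The independent group keeps the trailing 2, so \d has nothing left.
        System.out.println(matches("\\w+\\d", "java2"));
        System.out.println(matches("(?>\\w+)\\d", "java2"));
    }
}
```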
Regex |
Description |
Notes |
---|---|---|
\0n |
The character with octal value 0n |
0 <= n <= 7 |
\0nn |
The character with octal value 0nn |
0 <= n <= 7 |
\0mnn |
The character with octal value 0mnn |
0 <= m <= 3, 0 <= n <= 7 (This can't exceed 377.) |
\xhh |
The character with hexadecimal value 0xhh |
Each h is a hexadecimal digit: 0 through 9, or A through F (either case). |
\uhhhh |
The character with hexadecimal value 0xhhhh |
Each h is a hexadecimal digit: 0 through 9, or A through F (either case). |
\a |
The alert (bell) character ('\u0007') | |
\e |
The escape character ('\u001B') | |
\cx |
The control character corresponding to x | |
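The numeric and control escapes above can be verified against known character codes. This is a sketch only; `EscapeSketch` and `matchesChar` are hypothetical names, and the sample codes (65 for A, 9 for tab, 7 for bell, 27 for escape) are standard ASCII values.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class EscapeSketch {

    // True if regex matches the single character with the given code.
    static boolean matchesChar(String regex, int code) {
        return Pattern.matches(regex, String.valueOf((char) code));
    }

    public static void main(String[] args) {
        System.out.println(matchesChar("\\0101", 'A'));  // octal 101 is 65
        System.out.println(matchesChar("\\x41", 'A'));   // hexadecimal 41 is 65
        System.out.println(matchesChar("\\u0041", 'A')); // four-digit hexadecimal form
        System.out.println(matchesChar("\\011", 9));     // octal 11 is the tab
        System.out.println(matchesChar("\\cI", 9));      // control-I is also the tab
        System.out.println(matchesChar("\\a", 7));       // alert (bell)
        System.out.println(matchesChar("\\e", 27));      // escape
    }
}
```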
Regex |
Description |
Notes |
---|---|---|
\p{InGreek} |
A character in the Greek block | |
\p{Lu} |
An uppercase letter |
\p{Lu} matches any uppercase character. |
\p{Sc} |
A currency symbol |
If you need to find or swap out, say, a dollar sign, this is a good way to do so without having to escape $, which otherwise acts as the end-of-line anchor. |
\P{InGreek} |
Any character except one in the Greek block (negation) |
Notice the use of the capital P here. In general, uppercase \P is the opposite of lowercase \p. Thus, \P{Lower} matches any character that isn't a lowercase letter, including uppercase letters, digits, and punctuation. |
[\p{L}&&[^\p{Lu}]] |
Any non-uppercase letter |
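This is character class subtraction: it matches any letter that isn't an uppercase letter. It is the opposite of \p{Lu} within the letters, not an equivalent of \p{Upper}. |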
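The Unicode block and category classes, along with their negations, can be tried with a few sample characters. This is a minimal sketch; `UnicodeSketch` and `matches` are hypothetical names, and the Greek letters are written as Java String escapes to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class UnicodeSketch {

    // True if regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // Greek small letter alpha
        String omega = "\u03A9"; // Greek capital letter omega
        System.out.println(matches("\\p{InGreek}", alpha));
        System.out.println(matches("\\P{InGreek}", "a"));
        System.out.println(matches("\\p{Lu}", omega));
        System.out.println(matches("\\p{Sc}", "$"));
        // The negated class accepts anything that is not a lowercase letter.
        System.out.println(matches("\\P{Lower}", "7"));
        // Subtraction: a letter, but not an uppercase one.
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));
    }
}
```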