The following sections introduce Java's regular expression syntax. For the sake of clarity, the material is grouped into small, logical units, followed by a brief example that demonstrates usage. The examples progress from those that emphasize the role of the Pattern to those that start to rely on the Matcher more.
Note: Please keep in mind that these are working examples only. We're not ready to bulletproof our code yet.
The regex language contains metacharacters designed to help you describe search criteria. Because reading a pattern without being aware of these characters can be a bewildering experience, I've listed the most popular metacharacters in Table 1-1.
| Pattern | Name | Description |
|---|---|---|
| . | Period | Matches any character. |
| $ | Dollar sign | Matches the end of a line. |
| ^ | Caret | Matches the beginning of a line. |
| { | Opening curly bracket | Defines a range opening. |
| [ | Opening bracket | Defines a character class opening. |
| ( | Opening parenthesis | Defines the beginning of a group. |
| \| | Pipe symbol | A symbol meaning OR. |
| } | Closing curly bracket | Defines a range closing. |
| ] | Closing bracket | Defines a character class closing. |
| ) | Closing parenthesis | Defines the closing of a group. |
| * | Asterisk | The preceding is repeated zero or more times. |
| + | Plus sign | The preceding is repeated one or more times. |
| ? | Question mark | The preceding is repeated zero or one time. |
| \ | Backward slash | The following is not to be treated as a metacharacter. |
These characters are effectively reserved words, just as new is a reserved word in Java. They serve as building blocks for more complex search criteria. I discuss this in more detail soon.
If you're reading a character in a regex pattern and it isn't one of the characters listed in Table 1-1, then the character you're reading probably stands for itself. For example, Table 1-2 shows how the pattern hello* should be read.
| Letter | Description |
|---|---|
| h | The character h |
| e | Followed by the character e |
| l | Followed by the character l |
| l | Followed by the character l |
| o | Followed by the character o |
| * | Followed by a metacharacter that, in this case, means o should be repeated zero or more times |
* In English: Look for the word hell, followed by any number of trailing o characters.
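As a quick check of this reading, the following sketch runs hello* against a few inputs using String.matches, which requires the whole input to match. The class name HelloStarDemo, the helper method, and the sample inputs are my own illustrations rather than part of the original example.

```java
public class HelloStarDemo {
    // Returns true when the entire input matches the pattern hello*.
    public static boolean matchesHelloStar(String input) {
        return input.matches("hello*");
    }

    public static void main(String[] args) {
        System.out.println(matchesHelloStar("hell"));    // true: zero o characters
        System.out.println(matchesHelloStar("hello"));   // true: one o character
        System.out.println(matchesHelloStar("hellooo")); // true: three o characters
        System.out.println(matchesHelloStar("help"));    // false: p isn't in the pattern
    }
}
```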
If you actually need to find one of these characters, such as the * character, simply precede the character you're searching for with a \ character. For example, to find the * character, use \*.
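One wrinkle is worth sketching here: inside a Java string literal the backslash is itself an escape character, so the regex \* has to be written as "\\*" in source code. The class and method names below are hypothetical. Pattern.quote is a standard java.util.regex method (available since Java 5) that escapes an entire piece of text for you.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // True when the input contains a literal asterisk.
    public static boolean containsStar(String input) {
        // The regex is \* ; the Java string literal doubles the backslash.
        return Pattern.compile("\\*").matcher(input).find();
    }

    // True when the input contains the given text, with every character
    // of that text treated literally rather than as a metacharacter.
    public static boolean containsLiteral(String input, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsStar("2 * 3"));       // true
        System.out.println(containsStar("2 + 3"));       // false
        System.out.println(containsLiteral("a.b", ".")); // true
        System.out.println(containsLiteral("ab", "."));  // false: the period is literal here
    }
}
```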
Regular expressions also contain characters that take on special meaning when they're preceded by the \ character. These facilitate finding common tokens, such as word boundaries, whitespace, tabs, alphanumeric characters, and so on. For example, \n and \t are special characters that represent a newline and a tab, respectively.
In this section, I cover these common boundary characters and provide examples of their use.
Certain types of characters occur often enough that regular expression languages have developed a shorthand for referring to them. For example, a digit is designated by the \d expression. Without the \ character preceding the d, the expression would simply refer to the fourth letter of the English alphabet, in lowercase. Table 1-3 lists some of these common characters.
| Character | Description |
|---|---|
| . | Matches any character; may also match line terminators. |
| \d | A digit [0-9]. This will match any single digit from 0 to 9. Notice that an input of 19 will need to match twice: once for the 1 and once again for the 9. |
| \D | A nondigit [^0-9]. This will match any character that isn't a digit, including a whitespace character. |
| \w | A word character [a-zA-Z_0-9]. This will match any character from a to z or A to Z, an underscore, or any single digit from 0 to 9. |
| \W | A nonword character [^\w]. This will match any character that isn't a word character, such as a punctuation mark, including whitespace characters. |
| \t | The tab character. |
| \n | The newline (linefeed) character. |
| \r | The carriage-return character. |
| \f | The form-feed character. |
| \s | A whitespace character. This includes the space, tab, newline, vertical-tab, form-feed, and carriage-return characters. |
| \S | A non-whitespace character, also known as [^\s]. This will match any character that isn't a whitespace character, as described previously. |
| ^ | The beginning of a line. |
| $ | The end of a line. |
| \b | A word boundary. A word boundary is the position between a word character (\w, as described previously) and a nonword character, or between a word character and the beginning or end of the input. It matches no character itself. Most often, this position falls next to a space, a tab, an end of a line, or a beginning of a line. |
| \B | A non-word boundary. |
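To make a few of these shorthands concrete, here is a minimal sketch that counts how many separate times a pattern matches within an input. The class ShorthandDemo and its countMatches helper are hypothetical names of my own. The sketch relies only on the standard Pattern and Matcher classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShorthandDemo {
    // Counts how many separate times the regex matches within the input.
    public static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("\\d", "19"));     // 2: once for 1, once for 9
        System.out.println(countMatches("\\D", "a 1"));    // 2: the letter a and the space
        System.out.println(countMatches("\\s", "a\tb c")); // 2: the tab and the space
        System.out.println(countMatches("\\W", "a_1!"));   // 1: only the exclamation point
    }
}
```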
Imagine that you need to verify that a given String consists of any alphanumeric character, including underscores, followed by a digit. Thus, you would accept A1, but not !1, because the ! symbol isn't an alphanumeric character or an underscore. The pattern you want in this case consists of an alphanumeric character (or underscore) followed by a digit; thus, \w\d, per Table 1-3.
The pattern \w\d will match h1, k9, A1, or 11, because each consists of an alphanumeric character followed by a digit. It won't match AA, 9A, or *5, because these don't consist of an alphanumeric character followed by a digit. Table 1-4 dissects the pattern.
| Regex | Description |
|---|---|
| \w | Any character ranging from a to z, A to Z, 0 to 9, or an underscore |
| \d | Followed by a single digit ranging from 0 to 9 |
* In English: Look for any alphanumeric character, or the underscore character, followed by a single digit.
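The accept and reject cases above can be checked with a short sketch like the following. The class and method names are illustrative assumptions. String.matches is used because the requirement is that the whole String fit the pattern.

```java
public class WordDigitDemo {
    // True when the input is exactly one word character followed by one digit.
    public static boolean isWordCharThenDigit(String input) {
        // The regex is \w\d ; each backslash is doubled in the Java literal.
        return input.matches("\\w\\d");
    }

    public static void main(String[] args) {
        System.out.println(isWordCharThenDigit("A1")); // true
        System.out.println(isWordCharThenDigit("11")); // true: a digit is also a word character
        System.out.println(isWordCharThenDigit("!1")); // false
        System.out.println(isWordCharThenDigit("AA")); // false
        System.out.println(isWordCharThenDigit("9A")); // false
    }
}
```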
Regular expressions also provide a mechanism for finding common character boundaries. These include newlines, end-of-line characters, end-of-file characters, tabs, and so on. These are listed in the latter part of Table 1-3.
Say you want to match the word anna from an input string, but only if it's at the beginning of a word. Thus, Hanna wouldn't fit your criteria. The pattern you want in this case consists of a word boundary, \b, followed by the characters a, n, n, and a, thus the regex \banna.
The pattern \banna will match anna but not Hanna, because anna either begins the input or is preceded by a character, such as a space, that isn't a word character. The position between that nonword character and the first a is a word boundary. This isn't true of Hanna, because the character immediately preceding the a character in Hanna is an H, and H is a word character, so no word boundary falls between the two. Table 1-5 dissects the pattern.
| Regex | Description |
|---|---|
| \b | A word boundary |
| a | Followed by the character a |
| n | Followed by the character n |
| n | Followed by the character n |
| a | Followed by the character a |
* In English: Look for anna if it is the beginning of a word.
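Here is a minimal sketch of the anna example, assuming you want to know whether the pattern occurs anywhere in the input. The class name, constant, and sample inputs are my own. Matcher.find is used because the pattern describes only part of the input.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // The regex is \banna ; the backslash is doubled in the Java literal.
    private static final Pattern ANNA_AT_WORD_START = Pattern.compile("\\banna");

    // True when anna appears somewhere in the input at the start of a word.
    public static boolean containsAnnaAtWordStart(String input) {
        return ANNA_AT_WORD_START.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAnnaAtWordStart("anna"));      // true
        System.out.println(containsAnnaAtWordStart("say anna"));  // true
        System.out.println(containsAnnaAtWordStart("Hanna"));     // false
        System.out.println(containsAnnaAtWordStart("annabelle")); // true: no boundary is required after anna
    }
}
```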
Quantifiers and alternates allow you to specify the number of tokens you need to find or alternative tokens you're willing to accept. Table 1-6 lists some of the quantifiers and alternates in regex.
| Regex | Description |
|---|---|
| ? | The preceding is repeated once or not at all. |
| * | The preceding is repeated zero or more times. |
| + | The preceding is repeated one or more times. |
| {n} | The preceding is repeated exactly n times. |
| {n,} | The preceding is repeated at least n times. |
| {n,m} | The preceding is repeated at least n times, but no more than m times. This includes m repetitions. |
| \| | The element preceding the \| or the element following it. |
The following sections offer some examples that demonstrate working with quantifiers.
The pattern An+a will match Ana, Anna, or Annnna because each contains an A character immediately followed by one or more n characters followed by an a character. It won't match Aa or ANna because these don't consist of a single A character immediately followed by at least one n character followed by an a character. Notice that matching is case sensitive: a capital N doesn't match a lowercase n. Table 1-7 dissects the pattern.
| Regex | Description |
|---|---|
| A | The character A |
| n+ | Followed by one or more n characters |
| a | Followed by the character a |
* In English: Look for a capital A, followed by one or more n characters, followed by an a character.
There is some interesting behavior that can be elicited here. If this match had been performed using the String.matches method, the pattern would not have matched AnnaMarie, because the String.matches method requires an exact match, and the Marie part of AnnaMarie would have ruined that exactness. However, the Matcher.find method would have matched AnnaMarie because it's more permissive. Stay tuned—more details coming soon.
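The difference between the two methods can be seen in a small sketch such as the one below. The class and helper names are hypothetical. String.matches and Matcher.find are the standard methods named in the paragraph above.

```java
import java.util.regex.Pattern;

public class MatchesVersusFindDemo {
    private static final Pattern AN_PLUS_A = Pattern.compile("An+a");

    // Exact match: the whole input must fit the pattern.
    public static boolean matchesExactly(String input) {
        return input.matches("An+a");
    }

    // Partial match: the pattern may occur anywhere inside the input.
    public static boolean findsAnywhere(String input) {
        return AN_PLUS_A.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("Anna"));      // true
        System.out.println(matchesExactly("Aa"));        // false: no n characters
        System.out.println(matchesExactly("ANna"));      // false: capital N isn't n
        System.out.println(matchesExactly("AnnaMarie")); // false: Marie is left over
        System.out.println(findsAnywhere("AnnaMarie"));  // true: Anna is found inside
    }
}
```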
The pattern A{2,7} will match AA, AAAA, or AAAAAAA because each of these contains at least two A characters and no more than seven A characters. The pattern won't match A because it contains fewer than two A characters, and the pattern won't match AAAAAAAA because it contains more than seven A characters. Table 1-8 dissects the pattern.
| Regex | Description |
|---|---|
| A | The character A |
| { | Open repeating group |
| 2 | Repeated at least two times |
| , | But not more than |
| 7 | Seven times |
| } | Close repeating group |
* In English: Look for a sequence of the character A repeated two, three, four, five, six, or seven times.
Note: In the example at the beginning of this chapter, you needed a pattern to match four consecutive digits and derived \d\d\d\d. As noted, this isn't the most elegant pattern possible. An alternative, yet equivalent, way of expressing the same pattern is \d{4}, per Table 1-6; that is, a sequence of exactly four digits.
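The range quantifier and the \d{4} shorthand from the note can both be exercised with a sketch along these lines. The class name, method names, and sample inputs are assumptions made for illustration.

```java
public class RangeQuantifierDemo {
    // True when the input is made up of two to seven A characters and nothing else.
    public static boolean isTwoToSevenAs(String input) {
        return input.matches("A{2,7}");
    }

    // True when the input is exactly four digits; equivalent to \d\d\d\d.
    public static boolean isFourDigits(String input) {
        return input.matches("\\d{4}");
    }

    public static void main(String[] args) {
        System.out.println(isTwoToSevenAs("AA"));       // true: two A characters
        System.out.println(isTwoToSevenAs("AAAAAAA"));  // true: seven A characters
        System.out.println(isTwoToSevenAs("A"));        // false: only one
        System.out.println(isTwoToSevenAs("AAAAAAAA")); // false: eight is too many
        System.out.println(isFourDigits("1234"));       // true
        System.out.println(isFourDigits("123"));        // false
    }
}
```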
The pattern A|B will match A or B, because each consists of either an A character or a B character. It won't match P, Q, or jelly because these don't consist strictly of either an A or a B character. Table 1-9 dissects this pattern.
| Regex | Description |
|---|---|
| A | The character A |
| \| | Or |
| B | The character B |
* In English: Look for either a capital A or a capital B.
The pattern anna|marie will match anna or marie, because anna matches the first alternative and marie matches the second. It won't match Josie, Ralph, or Doctor. Table 1-10 dissects the pattern.
| Regex | Description |
|---|---|
| anna | The characters a, n, n, and a, in order |
| \| | Or |
| marie | The characters m, a, r, i, and e, in order |
* In English: Look for either the word anna or the word marie.
So would the pattern match annamarie as a single word? In a word, maybe. I provide detailed information about this topic in later chapters, but here's the nickel tour. Java 2 Standard Edition's (J2SE's) regex allows you to specify whether you need an exact or partial match. Thus, annamarie would match the pattern anna|marie twice for a partial match, and not at all for an exact match. Without going into too much detail, String.matches only provides for exact matches, whereas the Matcher class can provide more lenient matches using the find method.
What about the pattern Miss anna|marie? Will it match Miss marie and Miss anna, or just one of them? Or will it match neither? A strict match will match Miss anna but reject Miss marie. The alternative | will read Miss anna as a single option and the pattern marie as another. Because the pattern marie isn't equal to the candidate Miss marie, the search will reject Miss marie.
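The following sketch checks these alternation cases. The class and helper names are hypothetical. The last line uses a grouped variant, Miss (anna|marie), which is my own addition to show one way of limiting the alternation to the two names; groups are covered later in this chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Counts how many times the regex is found within the input (partial matches).
    public static int countFinds(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("annamarie".matches("anna|marie"));         // false: no exact match
        System.out.println(countFinds("anna|marie", "annamarie"));     // 2: anna, then marie
        System.out.println("Miss anna".matches("Miss anna|marie"));    // true
        System.out.println("Miss marie".matches("Miss anna|marie"));   // false
        System.out.println("Miss marie".matches("Miss (anna|marie)")); // true
    }
}
```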
There are times when you need to describe your search criteria as a class—that is, as a group that shares potentially complex commonalities that you need to be able to describe and for which there are no predefined classes. Fortunately, regex provides a mechanism for doing so through character classes, as shown in Table 1-11.
| Pattern | Description |
|---|---|
| [abc] | a, b, or c. (Of course, any character could be used, not just a, b, or c.) |
| [^abc] | Any character except a, b, or c. |
| [a-zA-Z] | a through z or A through Z. |
| [a-d[m-p]] | a through d, or m through p: [a-dm-p]. |
| [a-z&&[def]] | Whatever exists in both sets, namely d, e, or f. |
| [a-z&&[^bc]] | a through z, except for b and c: [ad-z]. |
| [a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z]. |
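Each row of the table can be tried out with a sketch like this one, which tests whether a single character belongs to a class. The class CharacterClassDemo and its inClass helper are illustrative names, not part of the Java API.

```java
public class CharacterClassDemo {
    // True when the input is a single character that belongs to the given class.
    public static boolean inClass(String characterClass, String input) {
        return input.matches(characterClass);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[abc]", "b"));         // true
        System.out.println(inClass("[^abc]", "b"));        // false: b is excluded
        System.out.println(inClass("[a-d[m-p]]", "n"));    // true: union of two ranges
        System.out.println(inClass("[a-z&&[def]]", "e"));  // true: in both sets
        System.out.println(inClass("[a-z&&[def]]", "g"));  // false: not in [def]
        System.out.println(inClass("[a-z&&[^bc]]", "c"));  // false: c is subtracted
        System.out.println(inClass("[a-z&&[^m-p]]", "q")); // true: q is outside m-p
    }
}
```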
There are also some predefined Portable Operating System Interface for UNIX (POSIX) character classes. These are American Standard Code for Information Interchange (ASCII) classes that experience has shown to be particularly useful. Thus, they're already in place, and you can simply refer to them for use. Table 1-12 contains the POSIX character classes.
| Pattern | Description |
|---|---|
| \p{Lower} | A lowercase letter: [a-z] |
| \p{Upper} | An uppercase letter: [A-Z] |
| \p{ASCII} | All ASCII characters: [\x00-\x7F] |
| \p{Alpha} | An upper- or lowercase letter: [\p{Lower}\p{Upper}] |
| \p{Digit} | A digit: [0-9] |
| \p{Alnum} | A number or a letter: [\p{Alpha}\p{Digit}] |
| \p{Punct} | Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| \p{Graph} | Any visible character: [\p{Alnum}\p{Punct}] |
| \p{Print} | A printable character: [\p{Graph}] |
| \p{Blank} | A tab or space |
| \p{Cntrl} | A control character: [\x00-\x1F\x7F] |
| \p{XDigit} | A hexadecimal digit: [0-9a-fA-F] |
| \p{Space} | A whitespace character: [ \t\n\x0B\f\r] |
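As a rough illustration of how these classes are referenced in code, the sketch below checks whether every character of an input belongs to a named POSIX class. The helper isAllOfClass and its className parameter are hypothetical. The class names passed in come straight from the table.

```java
public class PosixClassDemo {
    // True when the input is non-empty and every character belongs to the
    // named POSIX class, for example "Lower" for \p{Lower}.
    public static boolean isAllOfClass(String className, String input) {
        return input.matches("\\p{" + className + "}+");
    }

    public static void main(String[] args) {
        System.out.println(isAllOfClass("Lower", "abc"));     // true
        System.out.println(isAllOfClass("Upper", "abc"));     // false
        System.out.println(isAllOfClass("Alnum", "abc123"));  // true
        System.out.println(isAllOfClass("Punct", "!?"));      // true
        System.out.println(isAllOfClass("XDigit", "CAFE42")); // true
        System.out.println(isAllOfClass("Space", " \t\n"));   // true
    }
}
```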
Let's step through some simple examples. The pattern [0-5] will match any part of the input that contains a digit between 0 and 5. Thus, it will match on 0, 1, 2, 3, 4, or 5. It won't match 8, 6, or any nondigit characters. Table 1-13 dissects the pattern.
| Regex | Description |
|---|---|
| [ | A class consisting of |
| 0 | The digit 0 |
| - | Ranging through |
| 5 | The digit 5 |
| ] | Close class |
* In English: Look for any digit ranging from 0 to 5, including 0 and 5.
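To see which parts of an input the class [0-5] picks out, consider this small sketch. The class name, method name, and sample inputs are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitRangeDemo {
    // Returns, in order, every character of the input that is a digit from 0 to 5.
    public static String digitsZeroToFive(String input) {
        Matcher matcher = Pattern.compile("[0-5]").matcher(input);
        StringBuilder found = new StringBuilder();
        while (matcher.find()) {
            found.append(matcher.group());
        }
        return found.toString();
    }

    public static void main(String[] args) {
        System.out.println(digitsZeroToFive("a1b6c5")); // prints 15: the 6 is skipped
        System.out.println(digitsZeroToFive("789"));    // prints an empty line: nothing matches
    }
}
```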
The pattern [^A] will match any character except the character A. This includes other letters, spaces, tabs, punctuation, and so on. It's important to notice that the ^ character takes on its not meaning only when it appears first inside a class bracket, that is, immediately after the opening [. Outside the [ and ] brackets, it stands for the beginning of a line. I cover this topic in more detail later. Table 1-14 dissects the pattern.
| Regex | Description |
|---|---|
| [ | A class consisting of |
| ^ | Any character except |
| A | The character A |
| ] | Close class |
* In English: Look for any character except the capital letter A.
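The two meanings of the ^ character can be contrasted in a sketch such as the following. Both helper methods are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

public class CaretDemo {
    // Inside brackets, ^ negates: true for any single character other than A.
    public static boolean isSingleCharOtherThanA(String input) {
        return input.matches("[^A]");
    }

    // Outside brackets, ^ anchors: true when the input begins with A.
    public static boolean beginsWithA(String input) {
        return Pattern.compile("^A").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isSingleCharOtherThanA("B")); // true
        System.out.println(isSingleCharOtherThanA(" ")); // true: a space isn't A
        System.out.println(isSingleCharOtherThanA("A")); // false
        System.out.println(beginsWithA("Anna"));         // true
        System.out.println(beginsWithA("HAnna"));        // false: A isn't at the beginning
    }
}
```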
Groups are simply logical divisions of the text. When you describe a group in regex, you're providing a mechanism for the JVM to treat characters that fall into that group in a specific way.
Back references allow the regex pattern to refer to a group, even as it's in the middle of an operation. A pattern can refer to the last group it found, or the one before that, or even one further down the execution chain.
In the sections that follow, I cover the topics of groups and back references in more detail and present an example for each.
A group is a submatch. If you're familiar with SQL, it might be helpful to think of groups as the SQL equivalent of a subquery. Groups allow you to define parts of your pattern as logical subunits of the whole and then refer to the results of those subunits. Their syntax follows in Table 1-15.
| Regex | Description |
|---|---|
| ( | A group consisting of |
| … | Any regex pattern |
| ) | Close group |
As with most things, an example can be more illuminating than a description. Consider the pattern (\w+)_(\w+)@(\w+)\.org to match e-mail patterns. Table 1-16 dissects this pattern.
| Regex | Description |
|---|---|
| ( | A group consisting of |
| \w | An alphanumeric or underscore character |
| + | Repeated one or more times |
| ) | Close group |
| _ | Followed by an underscore character |
| ( | A group consisting of |
| \w | An alphanumeric or underscore character |
| + | Repeated one or more times |
| ) | Close group |
| @ | Followed by an at character |
| ( | A group consisting of |
| \w | An alphanumeric or underscore character |
| + | Repeated one or more times |
| ) | Close group |
| \. | Followed by the period character |
| o | Followed by the character o |
| r | Followed by the character r |
| g | Followed by the character g |
* In English: Look for a group of alphanumeric characters, followed by _, followed by a group of alphanumeric characters, followed by @, followed by a group of alphanumeric characters, followed by .org.
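Here is a minimal sketch of how the three groups might be read back after a match, using the standard Matcher.group method. The class name, the groupOf helper, and the sample address jane_doe@example.org are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailGroupDemo {
    // The regex is (\w+)_(\w+)@(\w+)\.org ; backslashes are doubled in the literal.
    private static final Pattern EMAIL = Pattern.compile("(\\w+)_(\\w+)@(\\w+)\\.org");

    // Returns the text captured by the given group, or null when the
    // input as a whole doesn't match the pattern. Group 0 is the entire match.
    public static String groupOf(String input, int groupNumber) {
        Matcher matcher = EMAIL.matcher(input);
        return matcher.matches() ? matcher.group(groupNumber) : null;
    }

    public static void main(String[] args) {
        String address = "jane_doe@example.org"; // hypothetical sample input
        System.out.println(groupOf(address, 1)); // jane
        System.out.println(groupOf(address, 2)); // doe
        System.out.println(groupOf(address, 3)); // example
        System.out.println(groupOf(address, 0)); // jane_doe@example.org
    }
}
```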
Back references are one of the most powerful features offered by regular expressions. Unfortunately, programmers often skip over them because they're not explained well in the regular expression literature. That's a mistake I hope to rectify here.
Back references allow a pattern to refer back to parts of itself. They always refer back to groups that were enclosed by the "(" and the ")" characters. Table 1-17 presents the syntax for back references.
| Regex | Description |
|---|---|
| \1 | The first group in the pattern |
| \2 | The second group in the pattern |
| \n | The nth group in the pattern, where n is the group's number |
Note: There are some idiosyncratic behaviors associated with how back references work in Java, which I explain later in this chapter and in Chapter 3. For right now, you have enough information on back references to get started.
Say you need to find matches in which a word is duplicated. That is, you don't know what the word you're looking for is, but you want to be alerted when the same word is repeated twice in a row. If you've used a word processor such as Microsoft Word, you'll notice that the application does this automatically. Let's explore how you might do this in Java.
You'll use the pattern \b(\w+) \1\b, which is dissected in Table 1-18. This pattern matches pizza pizza, Faster pussycat kill kill, or Never Never Never Never Never because each contains a word that's immediately repeated. It won't match 222 2222, sara sarah, or Faster pussycat kill, kill because these don't contain a word that's immediately repeated. The latter group won't match because 222 2222 has a lingering 2 in the second set, sara sarah has a lingering h in the second word, and in Faster pussycat kill, kill the second kill is separated from the first by a comma.
| Regex | Description |
|---|---|
| \b | A word boundary |
| ( | Followed by a group consisting of |
| \w | Any alphanumeric or underscore character |
| + | Repeated one or more times |
| ) | Close group |
| <space> | Followed by a space |
| \1 | Followed by the exact group of characters captured previously |
| \b | Followed by a word boundary |
* In English: Look for a word boundary, followed by a group of alphanumeric characters, followed by a space, followed by the exact same group of alphanumeric characters found previously, followed by a word boundary. In short, look for duplicate words.
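The duplicate-word pattern can be tried against the inputs discussed above with a sketch like this one. The class DuplicateWordDemo and its firstDuplicate helper are hypothetical names. The sketch uses Matcher.find because the repeated word may sit anywhere in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWordDemo {
    // The regex is \b(\w+) \1\b ; backslashes are doubled in the Java literal.
    private static final Pattern DUPLICATE = Pattern.compile("\\b(\\w+) \\1\\b");

    // Returns the first immediately repeated word, or null when there is none.
    public static String firstDuplicate(String input) {
        Matcher matcher = DUPLICATE.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDuplicate("pizza pizza"));                // pizza
        System.out.println(firstDuplicate("Faster pussycat kill kill"));  // kill
        System.out.println(firstDuplicate("222 2222"));                   // null: a 2 is left over
        System.out.println(firstDuplicate("sara sarah"));                 // null: an h is left over
        System.out.println(firstDuplicate("Faster pussycat kill, kill")); // null: the comma intervenes
    }
}
```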
In the next section, you'll examine some practical examples with corresponding Java code.