
Introducing the Regular Expression Syntax

The following sections introduce Java's regular expression syntax. For the sake of clarity, the material is grouped into small, logical units, each followed by a brief example that demonstrates usage. The examples progress from those that emphasize the role of the Pattern to those that start to rely more on the Matcher.

Note 

Please keep in mind that these are working examples only. We're not ready to bulletproof our code yet.

Reading Patterns

The regex language contains metacharacters designed to help you describe search criteria. Because reading a pattern without being aware of these characters can be a bewildering experience, I've listed the most popular metacharacters in Table 1-1.

Table 1-1: Basic Regex Delimiter Characters

Pattern

Name

Description

.

Period

Matches any character.

$

Dollar sign

Matches the end of a line.

^

Caret

Matches the beginning of a line.

{

Opening curly bracket

Defines a range opening.

[

Opening bracket

Defines a character class opening.

(

Opening parenthesis

Defines the beginning of a group.

|

Pipe symbol

A symbol meaning OR

}

Closing curly bracket

Defines a range closing.

]

Closing bracket

Defines a character class closing.

)

Closing parenthesis

Defines the closing of a group.

*

Asterisk

The preceding is repeated zero or more times.

+

Plus sign

The preceding is repeated one or more times.

?

Question mark

The preceding is repeated zero or one time.

\

Backward slash

The following is not to be treated as a metacharacter.

These characters are effectively reserved words, just as new is a reserved word in Java. They serve as building blocks for more complex search criteria. I discuss this in more detail soon.

If you're reading a character in a regex pattern and it isn't one of the characters listed in Table 1-1, then the character you're reading probably stands for itself. For example, Table 1-2 shows how the pattern hello* should be read.

Table 1-2: The Pattern hello [*]

Letter

Description

h

The character h

e

Followed by the character e

l

Followed by the character l

l

Followed by the character l

o

Followed by the character o

*

Followed by a metacharacter that, in this case, means o should be repeated zero or more times

[*]In English: Look for the word hell, followed by any number of trailing o characters.

If you actually need to find one of these characters, such as the * character, simply precede the character you're searching for with a \ character. For example, to find the * character, use \*.
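To make the hello* and \* patterns concrete, here is a minimal sketch using java.util.regex.Pattern. The class name MetacharacterDemo and the helper matchesAll are made up for illustration. Note that a regex backslash has to be doubled inside a Java string literal, because the Java compiler also treats \ as an escape character.

```java
import java.util.regex.Pattern;

class MetacharacterDemo {
    // Hypothetical helper: true when the WHOLE input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // hello* : "hell" followed by zero or more o characters
        System.out.println(matchesAll("hello*", "hell"));      // true
        System.out.println(matchesAll("hello*", "hellooo"));   // true
        System.out.println(matchesAll("hello*", "help"));      // false

        // The regex hello\* is written "hello\\*" in Java source.
        // Here the * is a literal asterisk, not a quantifier.
        System.out.println(matchesAll("hello\\*", "hello*"));  // true
        System.out.println(matchesAll("hello\\*", "hellooo")); // false
    }
}
```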

Common and Boundary Characters

Regular expressions also contain characters that take on special meaning when they're delimited by the \ character. These facilitate finding common tokens, such as word boundaries, empty spaces, tabs, alphanumeric characters, and so on. For example, \n and \t are special characters that represent a newline and a tab, respectively.

In this section, I cover these common boundary characters and provide examples of their use.

Common Characters

Certain types of characters occur often enough that regular expression languages have developed a shorthand for referring to them. For example, a digit is designated by the \d expression. Without the \ character delimiting the d, the expression would simply refer to the fourth letter of the English alphabet, in lowercase. Table 1-3 lists some of these common characters.

Table 1-3: Common and Boundary Characters

Character

Description

.

Matches any character; may also match line terminators.

\d

A digit [0-9]. This will match any single digit from 0 to 9. Notice that an input of 19 will need to match twice: Once for the 1 and once again for the 9.

\D

A nondigit [^0-9]. This will match any character that isn't a digit, including a whitespace character.

\w

A word character [a-zA-Z_0-9]. This will match any character from a to z or A to Z, an underscore, or any single digit from 0 to 9.

\W

A nonword character [^\w]. This will match any character that isn't a word character, such as a punctuation mark, including whitespace characters.

\t

The tab character.

\n

The newline (linefeed) character.

\r

The carriage-return character.

\f

The form-feed character.

\s

A whitespace character. This includes the space, tab, newline, vertical-tab, form-feed, and carriage-return characters.

\S

A non-whitespace character, also known as [^\s]. This will match any character that isn't a whitespace character, as described previously.

^

The beginning of a line.

$

The end of a line.

\b

A word boundary. This is a zero-width position rather than a character: it matches between a word character (\w, described previously) and a nonword character, or between a word character and the beginning or end of the input. Most often, the boundary falls next to a space, a tab, an end of a line, or a beginning of a line.

\B

A non-word boundary.

Common Characters Example

Imagine that you need to verify that a given String consists of a single alphanumeric character or underscore, followed by a digit. Thus, you would accept A1, but not !1, because the ! symbol isn't an alphanumeric character or an underscore. The pattern you want in this case consists of an alphanumeric character (or underscore) followed by a digit; thus, \w\d, per Table 1-3.

The pattern \w\d will match h1, k9, A1, or 11, because each consists of an alphanumeric character followed by a digit. It won't match AA, 9A, or *5, because these don't consist of an alphanumeric character followed by a digit. Table 1-4 dissects the pattern.

Table 1-4: The Pattern \w\d

Regex

Description

\w

Any character ranging from a to z, A to Z, 0 to 9, or an underscore

\d

Followed by a single digit ranging from 0 to 9

* In English: Look for any alphanumeric character, or the underscore character, followed by a single digit.
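As a quick check of the \w\d pattern, the following sketch runs it against the candidates discussed above. The class name WordDigitDemo and the method isWordThenDigit are illustrative names only. It uses Matcher.matches, which requires the whole input to fit the pattern.

```java
import java.util.regex.Pattern;

class WordDigitDemo {
    // The regex \w\d is written "\\w\\d" in Java source.
    private static final Pattern WORD_THEN_DIGIT = Pattern.compile("\\w\\d");

    // Hypothetical helper: true when the entire candidate is one word
    // character followed by one digit.
    static boolean isWordThenDigit(String candidate) {
        return WORD_THEN_DIGIT.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"h1", "k9", "A1", "11", "_7", "AA", "9A", "*5", "!1"};
        for (String candidate : candidates) {
            // The first five print true, the last four print false.
            System.out.println(candidate + " -> " + isWordThenDigit(candidate));
        }
    }
}
```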

Boundary Characters

Regular expressions also provide a mechanism for finding common character boundaries. These include newlines, end-of-line characters, end-of-file characters, tabs, and so on. These are listed in the latter part of Table 1-3.

Boundary Characters Example

Say you want to match the word anna from an input string, but only if it's at the beginning of a word. Thus, Hanna wouldn't fit your criteria. The pattern you want in this case consists of a word boundary, \b, followed by the characters a, n, n, and a, thus the regex \banna.

The pattern \banna will match anna on its own or after a space, because in both cases the first a sits at a word boundary: it's preceded either by the beginning of the input or by a nonword character. This isn't true of Hanna, because the character immediately preceding the a character in Hanna is an H, and the position between two word characters isn't a word boundary. Table 1-5 dissects the pattern.

Table 1-5: The Pattern \banna

Regex

Description

\b

A word boundary

a

Followed by the character a

n

Followed by the character n

n

Followed by the character n

a

Followed by the character a

* In English: Look for anna if it is the beginning of a word.
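The behavior of \banna can be sketched as follows. The class and method names are invented for this example, and it uses Matcher.find so that the pattern can be located anywhere inside a longer input.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // The regex \banna is written "\\banna" in Java source.
    private static final Pattern ANNA_AT_WORD_START = Pattern.compile("\\banna");

    // Hypothetical helper: true when "anna" begins a word somewhere in the input.
    static boolean containsAnnaAtWordStart(String input) {
        return ANNA_AT_WORD_START.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAnnaAtWordStart("anna"));       // true
        System.out.println(containsAnnaAtWordStart("dear anna"));  // true
        System.out.println(containsAnnaAtWordStart("Hanna"));      // false
        // There is no trailing \b, so a longer word starting with anna also qualifies.
        System.out.println(containsAnnaAtWordStart("annabelle"));  // true
    }
}
```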

Quantifiers and Alternates

Quantifiers and alternates allow you to specify the number of tokens you need to find or alternative tokens you're willing to accept. Table 1-6 lists some of the quantifiers and alternates in regex.

Table 1-6: Quantifiers

Regex

Description

?

The preceding is repeated once or not at all.

*

The preceding is repeated zero or more times.

+

The preceding is repeated one or more times.

{n}

The preceding is repeated exactly n times.

{n,}

The preceding is repeated at least n times.

{n,m}

The preceding is repeated at least n times, but no more than m times. This includes m repetitions.

|

The element preceding the | or the element following it.

The following sections offer some examples that demonstrate working with quantifiers.

Repeated Characters Example 1

The pattern An+a will match Ana, Anna, or Annnna because each consists of a single A character immediately followed by one or more n characters followed by an a character. It won't match Aa or ANna because these don't consist of a single A character immediately followed by at least one n character followed by an a character. Notice that matching is case sensitive: a capital N doesn't match a lowercase n. Table 1-7 dissects the pattern.

Table 1-7: The Pattern An+a

Regex

Description

A

The character A

n+

Followed by one or more n characters

a

Followed by the character a

* In English: Look for a capital A, followed by one or more n characters, followed by an a character.

There is some interesting behavior that can be elicited here. If this match had been performed using the String.matches method, the pattern would not have matched AnnaMarie, because the String.matches method requires an exact match, and the Marie part of AnnaMarie would have ruined that exactness. However, the Matcher.find method would have matched AnnaMarie because it's more permissive. Stay tuned—more details coming soon.
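The difference between the two methods can be sketched in a few lines. The names RepeatDemo, exact, and partial are placeholders chosen for this illustration; String.matches and Matcher.find are the methods named in the text.

```java
import java.util.regex.Pattern;

class RepeatDemo {
    private static final Pattern AN_A = Pattern.compile("An+a");

    // Exact match: the whole string must fit the pattern.
    static boolean exact(String input) {
        return input.matches("An+a");
    }

    // Partial match: the pattern may occur anywhere inside the string.
    static boolean partial(String input) {
        return AN_A.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(exact("Ana"));          // true
        System.out.println(exact("Annnna"));       // true
        System.out.println(exact("Aa"));           // false
        System.out.println(exact("ANna"));         // false
        System.out.println(exact("AnnaMarie"));    // false
        System.out.println(partial("AnnaMarie"));  // true
    }
}
```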

Repeated Characters Example 2

The pattern A{2,7} will match AA, AAAA, or AAAAAAA because each of these contains at least two A characters and no more than seven A characters. The pattern won't match A because it contains fewer than two A characters, and the pattern won't match AAAAAAAA because it contains more than seven A characters. Table 1-8 dissects the pattern.

Table 1-8: The Pattern A{2,7}

Regex

Description

A

The character A

{

Open repeating group

2

Repeated at least two times

,

But not more than

7

Seven times

}

Close repeated group

* In English: Look for a sequence of the character A repeated two, three, four, five, six, or seven times.

Note 

In the example at the beginning of this chapter, you needed a pattern to match four consecutive digits and derived \d\d\d\d. As noted, this isn't the most elegant pattern possible. An alternative, yet equivalent, way of expressing the same pattern is \d{4}, per Table 1-6—that is, a sequence of exactly four digits.
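The range quantifier and the \d{4} shorthand can both be checked with a short sketch. The helper names repeatA and matchesTwoToSevenAs are invented here; repeatA exists only so that the number of A characters is explicit rather than counted by eye.

```java
import java.util.regex.Pattern;

class RangeQuantifierDemo {
    // Builds a string of n capital A characters.
    static String repeatA(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('A');
        }
        return sb.toString();
    }

    // True when the whole input is between two and seven A characters.
    static boolean matchesTwoToSevenAs(String input) {
        return Pattern.matches("A{2,7}", input);
    }

    public static void main(String[] args) {
        // Prints false for 1 and 8, true for 2 through 7.
        for (int n = 1; n <= 8; n++) {
            System.out.println(n + " -> " + matchesTwoToSevenAs(repeatA(n)));
        }
        // \d\d\d\d and \d{4} accept exactly the same inputs.
        for (String s : new String[] {"123", "1234", "12345"}) {
            System.out.println(s + " -> " + Pattern.matches("\\d\\d\\d\\d", s)
                    + " / " + Pattern.matches("\\d{4}", s));
        }
    }
}
```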

Alternative Characters Example 1

The pattern A|B will match A or B, because each consists of either an A character or a B character. It won't match P, Q, or jelly because these don't consist strictly of either an A or a B character. Table 1-9 dissects this pattern.

Table 1-9: The Pattern A|B

Regex

Description

A

The character A

|

Or

B

The character B

* In English: Look for either a capital A or a capital B.

Alternative Characters Example 2

The pattern anna|marie will match anna or marie, because anna matches the first alternative and marie matches the second. It won't match Josie, Ralph, or Doctor. Table 1-10 dissects the pattern.

Table 1-10: The Pattern anna|marie

Regex

Description

anna

The characters a, n, n, and a, in order

|

Or

marie

The characters m, a, r, i, and e, in order

* In English: Look for either the word anna or the word marie.

So would the pattern match annamarie as a single word? In a word, maybe. I provide detailed information about this topic in later chapters, but here's the nickel tour. Java 2 Standard Edition's (J2SE's) regex allows you to specify whether you need an exact or partial match. Thus, annamarie would match the pattern anna|marie twice for a partial match, and not at all for an exact match. Without going into too much detail, String.matches only provides for exact matches, whereas the Matcher class can provide more lenient matches using the find method.

What about the pattern Miss anna|marie? Will it match Miss marie and Miss anna, or just one of them? Or will it match neither? A strict match will match Miss anna but reject Miss marie. The alternation operator | reads Miss anna as one option and marie as the other. Because the alternative marie isn't equal to the candidate Miss marie, the search will reject Miss marie.
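The alternation cases above can be verified with a small sketch. The class name AlternationDemo and the helper countMatches are illustrative. The last pattern, Miss (anna|marie), isn't part of the discussion above: it's shown only as one possible way to limit the alternation to the name, using the parentheses listed in Table 1-1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Hypothetical helper: counts how many partial matches find() locates.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("annamarie".matches("anna|marie"));        // false
        System.out.println(countMatches("anna|marie", "annamarie"));  // 2

        // | splits the pattern into "Miss anna" and "marie".
        System.out.println("Miss anna".matches("Miss anna|marie"));   // true
        System.out.println("Miss marie".matches("Miss anna|marie"));  // false
        System.out.println("marie".matches("Miss anna|marie"));       // true

        // Assumption for illustration: grouping confines | to the two names.
        System.out.println("Miss marie".matches("Miss (anna|marie)")); // true
    }
}
```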

Character Classes

There are times when you need to describe your search criteria as a class—that is, as a group that shares potentially complex commonalities that you need to be able to describe and for which there are no predefined classes. Fortunately, regex provides a mechanism for doing so through character classes, as shown in Table 1-11.

Table 1-11: Character Classes

Pattern

Description

[abc]

a, b, or c. (Of course, any character could be used, not just a, b, or c.)

[^abc]

Any character except a, b, or c.

[a-zA-Z]

a through z or A through Z.

[a-d[m-p]]

a through d, or m through p: [a-dm-p].

[a-z&&[def]]

Whatever exists in both sets, namely d, e, or f.

[a-z&&[^bc]]

a through z, except for b and c: [ad-z].

[a-z&&[^m-p]]

a through z, and not m through p: [a-lq-z].
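The union, intersection, and subtraction forms in Table 1-11 can be tried out one character at a time. In this sketch, CharacterClassDemo and inClass are made-up names; each call tests whether a single character belongs to the class.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: true when the single character c is in the class.
    static boolean inClass(String characterClass, char c) {
        return Pattern.matches(characterClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Union: a through d, or m through p
        System.out.println(inClass("[a-d[m-p]]", 'n'));     // true
        System.out.println(inClass("[a-d[m-p]]", 'e'));     // false

        // Intersection: only d, e, or f
        System.out.println(inClass("[a-z&&[def]]", 'e'));   // true
        System.out.println(inClass("[a-z&&[def]]", 'a'));   // false

        // Subtraction: a through z, except b and c
        System.out.println(inClass("[a-z&&[^bc]]", 'a'));   // true
        System.out.println(inClass("[a-z&&[^bc]]", 'b'));   // false

        // Subtraction of a range: a through z, but not m through p
        System.out.println(inClass("[a-z&&[^m-p]]", 'l'));  // true
        System.out.println(inClass("[a-z&&[^m-p]]", 'n'));  // false
    }
}
```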

There are also some predefined Portable Operating System Interface for UNIX (POSIX) character classes. These are American Standard Code for Information Interchange (ASCII) classes that experience has shown to be particularly useful. Thus, they're already in place, and you can simply refer to them for use. Table 1-12 contains the POSIX character classes.

Table 1-12: POSIX Character Classes

Pattern

Description

\p{Lower}

A lowercase letter: [a-z]

\p{Upper}

An uppercase letter: [A-Z]

\p{ASCII}

All ASCII characters: [\x00-\x7F]

\p{Alpha}

An upper- or lowercase letter: [\p{Lower}\p{Upper}]

\p{Digit}

A digit: [0-9]

\p{Alnum}

A number or a letter: [\p{Alpha}\p{Digit}]

\p{Punct}

Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

\p{Graph}

Any visible character: [\p{Alnum}\p{Punct}]

\p{Print}

A printable character: [\p{Graph}]

\p{Blank}

A tab or space

\p{Cntrl}

A control character: [\x00-\x1F\x7F]

\p{XDigit}

A hexadecimal digit: [0-9a-fA-F]

\p{Space}

A whitespace character: [ \t\n\x0B\f\r]
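A few of the POSIX classes from Table 1-12 are exercised below, combined with the + quantifier so that whole strings can be tested. The class name PosixClassDemo and the method fits are assumptions made for this sketch, and the sample inputs are arbitrary.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \p{Lower} is written "\\p{Lower}" in Java source.
        System.out.println(fits("\\p{Lower}+", "hello"));    // true
        System.out.println(fits("\\p{Lower}+", "Hello"));    // false

        System.out.println(fits("\\p{XDigit}+", "CAFE42"));  // true
        System.out.println(fits("\\p{XDigit}+", "CAGE42"));  // false, G isn't a hex digit

        System.out.println(fits("\\p{Punct}+", "!?"));       // true

        // A letter, then a tab or space, then a letter
        System.out.println(fits("\\p{Alpha}\\p{Blank}\\p{Alpha}", "a b")); // true
    }
}
```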

Simple Class Example

Let's step through some simple examples. The pattern [0-5] will match any part of the input that contains a digit between 0 and 5. Thus, it will match on 0, 1, 2, 3, 4, or 5. It won't match 8, 6, or any nondigit characters. Table 1-13 dissects the pattern.

Table 1-13: The Pattern [0-5]

Regex

Description

[

A class consisting of

0

The digit 0

-

Ranging through

5

The digit 5

]

Close class

* In English: Look for any digit ranging from 0 to 5, including 0 and 5.

Negation Example

The pattern [^A] will match any character except the character A. This includes other letters, spaces, tabs, punctuation, and so on. It's important to notice that the ^ character only means not when it appears inside a class bracket, immediately after the opening [. Outside those brackets, it stands for the beginning of a line. I cover this topic in more detail later. Table 1-14 dissects the pattern.

Table 1-14: The Pattern [^A]

Regex

Description

[

A class consisting of

^

Any character except

A

The character A

]

Close class

* In English: Look for any character except the capital letter A
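The [0-5] and [^A] classes, and the two meanings of ^, can be compared side by side. This is only a sketch: SimpleClassDemo and keepMatches are invented names, and keepMatches simply concatenates every piece of the input that find() locates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleClassDemo {
    // Hypothetical helper: joins together every match found in the input.
    static String keepMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder kept = new StringBuilder();
        while (matcher.find()) {
            kept.append(matcher.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Only the digits 0 through 5 survive.
        System.out.println(keepMatches("[0-5]", "a1b6c5 8"));  // 15

        // Inside the brackets, ^ means "not": everything except A survives.
        System.out.println(keepMatches("[^A]", "BANANA"));     // BNN

        // Outside the brackets, ^ anchors the match to the beginning,
        // so only the first A is found.
        System.out.println(keepMatches("^A", "AAB"));          // A
    }
}
```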

Groups and Back References

Groups are simply logical divisions of the text. When you describe a group in regex, you're providing a mechanism for the JVM to treat characters that fall into that group in a specific way.

Back references allow the regex pattern to refer to a group, even as it's in the middle of an operation. A pattern can refer to the last group it found, or the one before that, or even one further down the execution chain.

In the sections that follow, I cover the topics of groups and back references in more detail and present an example for each.

Groups

A group is a submatch. If you're familiar with SQL, it might be helpful to think of groups as the regex equivalent of a SQL subquery. Groups allow you to define parts of your pattern as logical subunits of the whole and then refer to the results of those subunits. Their syntax follows in Table 1-15.

Table 1-15: Groups

Regex

Description

(

A group consisting of

Any regex pattern

)

Close group

Groups Example

As with most things, an example can be more illuminating than a description. Consider the pattern (\w+)_(\w+)@(\w+)\.org to match e-mail patterns. Table 1-16 dissects this pattern.

Table 1-16: The Pattern (\w+)_(\w+)@(\w+)\.org

Regex

Description

(

A group consisting of

\w

An alphanumeric or underscore character

+

Repeated one or more times

)

Close group

_

Followed by an underscore character

(

A group consisting of

\w

An alphanumeric or underscore character

+

Repeated one or more times

)

Close group

@

Followed by an at character

(

A group consisting of

\w

An alphanumeric or underscore character

+

Repeated one or more times

)

Close group

\.

Followed by the period character

o

Followed by the character o

r

Followed by the character r

g

Followed by the character g

* In English: Look for a group of alphanumeric characters, followed by _, followed by a group of alphanumeric characters, followed by @, followed by a group of alphanumeric characters, followed by .org.
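To show what "referring to the results of those subunits" might look like in code, here is a minimal sketch that pulls the three groups out of a matching address with Matcher.group. The class name GroupDemo, the method parts, and the sample addresses are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    private static final Pattern ADDRESS =
            Pattern.compile("(\\w+)_(\\w+)@(\\w+)\\.org");

    // Hypothetical helper: returns the three captured groups, or an empty
    // array when the address doesn't fit the pattern.
    static String[] parts(String address) {
        Matcher matcher = ADDRESS.matcher(address);
        if (!matcher.matches()) {
            return new String[0];
        }
        // Groups are numbered from 1, by the order of their opening parentheses.
        return new String[] {matcher.group(1), matcher.group(2), matcher.group(3)};
    }

    public static void main(String[] args) {
        String[] found = parts("john_smith@example.org");
        System.out.println(found[0]);  // john
        System.out.println(found[1]);  // smith
        System.out.println(found[2]);  // example

        // No underscore before the @, so the pattern doesn't match.
        System.out.println(parts("johnsmith@example.org").length);  // 0
    }
}
```

Because \w also matches the underscore, an address such as a_b_c@example.org would put a_b in the first group and c in the second: the first \w+ takes as much as it can while still letting the rest of the pattern match.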

Back References

Back references are one of the most powerful features offered by regular expressions. Unfortunately, programmers often skip over them because they're not explained well in the regular expression literature. That's a mistake I hope to rectify here.

Back references allow a pattern to refer back to parts of itself. They always refer back to groups that were enclosed by the "(" and the ")" characters. Table 1-17 presents the syntax for back references.

Table 1-17: Back References

Regex

Description

\1

The first group in the pattern

\2

The second group in the pattern

\n

The nth group in the pattern

Note 

There are some idiosyncratic behaviors associated with how back references work in Java, which I explain later in this chapter and in Chapter 3. For right now, you have enough information on back references to get started.

Back References Example

Say you need to find matches in which a word is duplicated. That is, you don't know what the word you're looking for is, but you want to be alerted when the same word is repeated twice in a row. If you've used a word processor such as Microsoft Word, you'll notice that the application does this automatically. Let's explore how you might do this in Java.

You'll use the pattern \b(\w+) \1\b, which is dissected in Table 1-18. This pattern matches pizza pizza, Faster pussycat kill kill, or Never Never Never Never Never because each contains a word that's immediately repeated. It won't match 222 2222, sara sarah, or Faster pussycat kill, kill because these don't contain a word that's immediately repeated. The latter group won't match because 222 2222 has a lingering 2 in the second set, sara sarah has a lingering h in the second word, and in Faster pussycat kill, kill the second kill is separated from the first by a comma.

Table 1-18: The Pattern \b(\w+) \1\b

Regex

Description

\b

A word boundary

(

Followed by a group consisting of

\w

Any alphanumeric or underscore character

+

Repeated one or more times

)

Close group

<space>

Followed by a space

\1

Followed by the exact group of characters captured previously

\b

Followed by a word boundary

* In English: Look for a word boundary, followed by a group of alphanumeric characters, followed by a space, followed by the exact same group of alphanumeric characters found previously, followed by a word boundary. In short, look for duplicate words.
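As a rough preview, the duplicate-word pattern can be tried against the inputs discussed above. The names DuplicateWordDemo and firstDuplicate are placeholders for this sketch; the method returns the repeated word, or null when no word is immediately repeated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWordDemo {
    // The regex \b(\w+) \1\b is written with doubled backslashes in Java source.
    private static final Pattern DUPLICATE = Pattern.compile("\\b(\\w+) \\1\\b");

    // Hypothetical helper: the first immediately repeated word, or null if none.
    static String firstDuplicate(String input) {
        Matcher matcher = DUPLICATE.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDuplicate("pizza pizza"));                // pizza
        System.out.println(firstDuplicate("Faster pussycat kill kill"));  // kill
        System.out.println(firstDuplicate("222 2222"));                   // null
        System.out.println(firstDuplicate("sara sarah"));                 // null
        System.out.println(firstDuplicate("Faster pussycat kill, kill")); // null
    }
}
```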

In the next section, you'll examine some practical examples with corresponding Java code.

