This first example confirms that a given phone number has the normal format for a U.S. phone number.
Before beginning in earnest, I have to make a decision: Do I want to write my own pattern or try to find an existing one? Normally, the first thing I would do is check some online regex resource, such as http://www.regexlib.com, to find an existing pattern. However, because this is a relatively simple pattern, and because I want to demonstrate the process of writing it, I'll create it myself.
Because I've decided to write the pattern myself, the first question I need to answer is, what is a phone number? I begin by working backward from some sample numbers. This is the pull technique described in Chapter 1. It requires that I take some actual data and try to pull the pattern out of it. Say my sample set is this:
614-345-6789
345-6789
345 6789
345.6789
3456789
(614)345-6789
6143456789
I pick the first phone number on the list as my pattern model. In examining it, I see a fairly obvious blueprint: three digits, a hyphen, three digits, a hyphen, and then four digits. That leads to the pattern \d{3}-\d{3}-\d{4}. Table 5-1 shows the process of deriving the pattern.
Step | What I Did | Why I Did It | Justification | Resulting Pattern |
---|---|---|---|---|
Step 1 | Nothing | Initial state | N/A | 614-345-6789 |
Step 2 | Replaced digits with \d | To get a more generic description | \d stands for any single digit | \d\d\d-\d\d\d-\d\d\d\d |
Step 3 | Replaced \d\d\d with \d{3} | To produce a more succinct pattern | \d{3} is exactly equal to \d\d\d | \d{3}-\d{3}-\d\d\d\d |
Step 4 | Replaced \d\d\d\d with \d{4} | To produce a more succinct pattern | \d{4} is exactly equal to \d\d\d\d | \d{3}-\d{3}-\d{4} |
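The equivalence claimed in steps 2 through 4 can be checked mechanically. The following is a minimal sketch, not part of the chapter's listings: the class name `PatternDerivationCheck` and its helper methods are illustrative assumptions. It runs the three regex forms from the table against the sample set and reports whether they agree.

```java
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name): compares the regex forms from steps 2-4.
class PatternDerivationCheck {
    static final String STEP_2 = "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d";
    static final String STEP_3 = "\\d{3}-\\d{3}-\\d\\d\\d\\d";
    static final String STEP_4 = "\\d{3}-\\d{3}-\\d{4}";

    // True if the final (step 4) pattern matches the whole candidate.
    static boolean matchesFinal(String candidate) {
        return Pattern.matches(STEP_4, candidate);
    }

    // True only if all three forms give the same verdict for the candidate.
    static boolean formsAgree(String candidate) {
        boolean a = Pattern.matches(STEP_2, candidate);
        boolean b = Pattern.matches(STEP_3, candidate);
        boolean c = Pattern.matches(STEP_4, candidate);
        return a == b && b == c;
    }

    public static void main(String[] args) {
        String[] samples = {
            "614-345-6789", "345-6789", "345 6789", "345.6789",
            "3456789", "(614)345-6789", "6143456789"
        };
        for (String s : samples) {
            System.out.println(s + " -> matches: " + matchesFinal(s)
                    + ", forms agree: " + formsAgree(s));
        }
    }
}
```

Only the first sample matches at this stage, which is expected: the pattern was modeled on that one number.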
Of course, the pattern also has to accommodate numbers consisting of only seven digits, such as 345-6789. At present, it can't, because it's modeled after data that has ten digits. Reconciling the pattern to do so leads to (?:\d{3}-)?\d{3}-\d{4}. Table 5-2 shows the process of deriving the pattern.
Step | What I Did | Why I Did It | Justification | Resulting Pattern |
---|---|---|---|---|
Step 5 | Nothing | Initial state | N/A | \d{3}-\d{3}-\d{4} |
Step 6 | Grouped the leftmost \d{3}- in parentheses, producing (\d{3}-) | To treat \d{3}- as a single entity that might or might not exist. | Any part of a pattern can be subgrouped. | (\d{3}-)\d{3}-\d{4} |
Step 7 | Made (\d{3}-) optional by producing (\d{3}-)? | So that users can omit area codes. | Adding ? after a group makes it optional. | (\d{3}-)?\d{3}-\d{4} |
Step 8 | Added a ?: inside the opening ( of (\d{3}-) to produce (?:\d{3}-)? | It makes the expression more efficient. Non-capturing groups require less memory. | Adding ?: inside a group makes the group noncapturing. | (?:\d{3}-)?\d{3}-\d{4} |
Now the pattern will accept any seven- or ten-digit number, as long as the digits are grouped into sets of three and four and separated by hyphens.
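To see what the optional, noncapturing group buys, here is a small sketch under stated assumptions: the class name `OptionalAreaCodeDemo` and its methods are hypothetical and are not taken from the book. It tests the step 8 pattern against hyphenated and non-hyphenated samples, and uses `Matcher.groupCount()` to show that the step 7 form defines one capturing group while the step 8 form defines none.

```java
import java.util.regex.Pattern;

// Illustrative demo (hypothetical name) of the step 7 and step 8 patterns.
class OptionalAreaCodeDemo {
    static final Pattern CAPTURING = Pattern.compile("(\\d{3}-)?\\d{3}-\\d{4}");
    static final Pattern NON_CAPTURING = Pattern.compile("(?:\\d{3}-)?\\d{3}-\\d{4}");

    // True if the whole candidate matches the step 8 pattern.
    static boolean accepts(String candidate) {
        return NON_CAPTURING.matcher(candidate).matches();
    }

    // Number of capturing groups the given pattern defines.
    static int capturingGroups(Pattern p) {
        return p.matcher("").groupCount();
    }

    public static void main(String[] args) {
        String[] samples = {
            "614-345-6789", "345-6789", "345 6789", "345.6789",
            "3456789", "(614)345-6789", "6143456789"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
        System.out.println("step 7 capturing groups: " + capturingGroups(CAPTURING));
        System.out.println("step 8 capturing groups: " + capturingGroups(NON_CAPTURING));
    }
}
```

Only the two hyphenated samples are accepted, so the remaining formats in the sample set still need to be handled some other way.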
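The hyphenated pattern still rejects samples such as 345 6789, 345.6789, and (614)345-6789. Rather than complicate the pattern to cover every separator, I use String.replaceAll to remove all punctuation and spacing from the phone number first, as follows:
String tmp = phone.replaceAll("\\p{Punct}|\\s","");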
This replaces any and all punctuation or space characters with a zero-length string.
Note: \p{Punct} is a POSIX, U.S. ASCII predefined class that matches any punctuation character. Specifically, it matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and was introduced in Chapter 1, in Table 1-12.
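The effect of the scrubbing step on the sample set can be sketched as follows. The `replaceAll` call is the one used in this chapter; the class name `ScrubDemo` and the `scrub` helper are illustrative assumptions.

```java
// Illustrative demo (hypothetical name) of the scrubbing step.
class ScrubDemo {
    // Removes every ASCII punctuation character and every whitespace character.
    static String scrub(String phone) {
        return phone.replaceAll("\\p{Punct}|\\s", "");
    }

    public static void main(String[] args) {
        String[] samples = {
            "614-345-6789", "345-6789", "345 6789", "345.6789",
            "3456789", "(614)345-6789", "6143456789"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + scrub(s));
        }
    }
}
```

Every sample collapses to either 3456789 or 6143456789, which is the property the next simplification relies on.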
This way, I can count on the phone number being in the form 6143456789 or 3456789. This is great news, because I can now simplify the pattern even further, as shown in Table 5-3. By breaking the process down into a separate step, I've decreased the amount of complexity that will go into a given pattern.
Step | What I Did | Why I Did It | Justification | Resulting Pattern |
---|---|---|---|---|
Step 9 | Nothing | Initial state | N/A | (?:\d{3}-)?\d{3}-\d{4} |
Step 10 | Removed all references to the character - | This scrubbing guarantees that I'll never have to deal with punctuation in the phone number. Thus, it would be a mistake to expect it. | Removing the - simply means that the pattern won't check for, or require, the existence of a hyphen. | (?:\d{3})?\d{3}\d{4} |
Step 11 | Treated (\d{3}) as a single entity and checked for its existence one or two times; thus, replaced (?:\d{3})?\d{3} with (?:\d{3}){1,2} | To make the expressions less verbose. | (?:\d{3})?\d{3} means "three digits or six digits." (?:\d{3}){1,2} means exactly the same thing. Thus, they're logically equivalent statements. | (?:\d{3}){1,2}\d{4} |
Step 12 | Went back to using the pattern from step 10. | Although the pattern from step 11 is briefer, it's more difficult to read. | See previous. | (?:\d{3})?\d{3}\d{4} |
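The claim that the step 10 and step 11 patterns are logically equivalent can be spot-checked on scrubbed input. This is a minimal sketch with hypothetical names (`StepTenVersusEleven`, `digits`, `sameVerdict`). It compares both patterns on digit strings of length 0 through 12, which is a sanity check rather than a proof.

```java
import java.util.regex.Pattern;

// Illustrative comparison (hypothetical name) of the step 10 and step 11 patterns.
class StepTenVersusEleven {
    static final Pattern STEP_10 = Pattern.compile("(?:\\d{3})?\\d{3}\\d{4}");
    static final Pattern STEP_11 = Pattern.compile("(?:\\d{3}){1,2}\\d{4}");

    // Builds a string of n digits; for example, digits(7) is "0123456".
    static String digits(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i % 10);
        }
        return sb.toString();
    }

    // True if the step 10 pattern matches the whole string.
    static boolean step10Accepts(String s) {
        return STEP_10.matcher(s).matches();
    }

    // True if both patterns give the same verdict for the string.
    static boolean sameVerdict(String s) {
        return step10Accepts(s) == STEP_11.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 12; n++) {
            String s = digits(n);
            System.out.println(n + " digits -> accepted: " + step10Accepts(s)
                    + ", same verdict: " + sameVerdict(s));
        }
    }
}
```

Both patterns accept exactly the seven- and ten-digit strings in this range, so choosing between them is a question of readability, not behavior.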
Notice that in step 12 I back off from the change made in step 11. Although step 11 made the expression less verbose, it also made it more difficult to read. In this case, I'm willing to have the pattern be slightly longer if it will also be easier to read and maintain. Thus, the heart of my code consists of two lines. The first is line 32, which strips out any and all punctuation from the candidate:
String tmp = phone.replaceAll("\\p{Punct}|\\s","");
The second is line 38, which applies the pattern:
retval = tmp.matches(phoneNumberPattern); // in this case (\d{3})?\d{3}\d{4}
Listing 5-1 shows the full implementation.
    01 import java.util.regex.*;
    02 import java.util.logging.Logger;
    03 public class MatchPhoneNumber{
    04   private static Logger log = Logger.getAnonymousLogger();
    05   private static final String PHONE_NUMBER_KEY="phoneNumber";
    06   /**
    07   * Confirms that the format for the given phone number is valid.
    08   * @param phone a String representing the phone number.
    09   * @return true if the phone number format is acceptable.
    10   */
    11   public static boolean isPhoneValid(String phone){
    12     boolean retval=false;
    13     String msg = "\r\nCANDIDATE:" + phone;
    14     //make sure the candidate has a shot at passing
    15     if (phone != null && phone.length() > 6)
    16     {
    17       //load the regex properties file
    18       RegexProperties rb = new RegexProperties();
    19       try
    20       {
    21         rb.load("../regex.properties");
    22       }
    23       catch(Exception e)
    24       {
    25         e.printStackTrace();
    26       }
    27       //scrub the phone number, removing spaces
    28       //and punctuation. We could store this
    29       //pattern in the regex.property file as well,
    30       //but it's not really so complex that
    31       //it's confusing when Java-delimited
    32       String tmp = phone.replaceAll("\\p{Punct}|\\s","");
    33       //extract appropriate regex pattern and run check
    34       //in this case (\d{3})?\d{3}\d{4}
    36       String phoneNumberPattern=rb.getProperty(PHONE_NUMBER_KEY);
    37       //do the actual comparison
    38       retval= tmp.matches(phoneNumberPattern);
    39       //log for debug purposes
    40       msg += ":\r\nREGEX:" + phoneNumberPattern;
    41     }
    42     msg += "\r\nRESULT:" + retval +"\r\n";
    43     log.info(msg);
    44     return retval;
    45   }
    46   public static void main(String args[]) throws Exception{
    47     if (args != null && args.length == 1)
    48       System.out.println(isPhoneValid(args[0]));
    49     else
    50       System.out.println("usage: java MatchPhoneNumber <phoneNumber>");
    51   }
    52 }
Even programmers who don't know anything about regex can follow the code, which speaks to the elegance of J2SE regex.
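The listing reads its pattern through a RegexProperties class and a regex.properties file, neither of which is shown in this section. As a rough stand-in only, the sketch below performs the same key lookup with the standard java.util.Properties class. The class name `PatternLookupSketch`, the in-memory file text, and the stored pattern are assumptions; the real RegexProperties class may handle escaping differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative stand-in (hypothetical name); not the book's RegexProperties class.
class PatternLookupSketch {
    // Assumed contents of a properties file. java.util.Properties treats a backslash
    // as an escape character, so each regex backslash is doubled in the file text,
    // and doubled again here because this is a Java string literal.
    static final String FILE_TEXT = "phoneNumber=(?:\\\\d{3})?\\\\d{3}\\\\d{4}\n";

    // Loads the assumed file text and returns the value stored under the key.
    static String loadPattern(String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(FILE_TEXT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String pattern = loadPattern("phoneNumber");
        System.out.println("pattern: " + pattern);
        System.out.println("6143456789 matches: " + "6143456789".matches(pattern));
    }
}
```

Keeping the pattern outside the source file means it needs no Java string escaping, although with plain java.util.Properties it picks up the file format's own escaping rules instead.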
Is the extra verbosity justified? Would it have been better to simply write the regex in a single line for this particular case? This is the sort of decision you'll need to make on a case-by-case basis for your particular needs. In my opinion, it's better to err on the side of verbosity than to risk terse code.
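For comparison, the terse alternative might look like the following sketch. It hardcodes the scrub pattern and the step 12 pattern instead of loading them from a properties file, and it drops the logging. The class name `CompactPhoneCheck` is hypothetical, and this is one possible condensation rather than the chapter's implementation.

```java
import java.util.regex.Pattern;

// Illustrative compact variant (hypothetical name) of the phone number check.
class CompactPhoneCheck {
    // Precompiled once, so repeated calls do not recompile the expressions.
    private static final Pattern SCRUB = Pattern.compile("\\p{Punct}|\\s");
    private static final Pattern PHONE = Pattern.compile("(?:\\d{3})?\\d{3}\\d{4}");

    // Scrubs punctuation and whitespace, then requires exactly seven or ten digits.
    static boolean isPhoneValid(String phone) {
        if (phone == null) {
            return false;
        }
        String scrubbed = SCRUB.matcher(phone).replaceAll("");
        return PHONE.matcher(scrubbed).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "614-345-6789", "345-6789", "345 6789", "345.6789",
            "3456789", "(614)345-6789", "6143456789", "345-678"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isPhoneValid(s));
        }
    }
}
```

The compact form is shorter, but the pattern is now buried in Java escaping and cannot be changed without recompiling, which is the trade-off the questions above are weighing.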