Confirming the Format of a Phone Number

This first example confirms that a given phone number has the normal format for a U.S. phone number.

Before beginning in earnest, I have to make a decision: Do I want to write my own pattern or try to find an existing one? Normally, the first thing I would do is check some online regex resource, such as http://www.regexlib.com, to find an existing pattern. However, because this is a relatively simple pattern, and because I want to demonstrate the process of writing it, I'll create it myself.

Because I've decided to write the pattern myself, the first question I need to answer is, what is a phone number? I begin by working backward from some sample numbers. This is the pull technique described in Chapter 1. It requires that I take some actual data and try to pull the pattern out of it. Say my sample set is this:

614-345-6789
345-6789
345 6789
345.6789
3456789
(614)345-6789
6143456789

I pick the first phone number on the list as my pattern model. In examining it, I see a fairly obvious blueprint: three digits, a hyphen, three digits, a hyphen, and then four digits. That leads to the pattern \d{3}-\d{3}-\d{4}. Table 5-1 shows the process of deriving the pattern.

Table 5-1: Pulling a General Regex Pattern from 614-345-6789
Step	What I Did	Why I Did It	Justification	Resulting Pattern
Step 1	Nothing	Initial state	N/A	614-345-6789
Step 2	Replaced digits with \d	To get a more generic description	\d stands for any single digit	\d\d\d-\d\d\d-\d\d\d\d
Step 3	Replaced \d\d\d with \d{3}	To produce a more succinct pattern	\d{3} is exactly equal to \d\d\d	\d{3}-\d{3}-\d\d\d\d
Step 4	Replaced \d\d\d\d with \d{4}	To produce a more succinct pattern	\d{4} is exactly equal to \d\d\d\d	\d{3}-\d{3}-\d{4}
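The derivation in Table 5-1 can be checked with a short Java sketch. The class name PullPatternDemo and the helper sameVerdict are hypothetical names chosen for illustration; only String.matches comes from the standard library.

```java
final class PullPatternDemo {
    // Step 2: every digit replaced with \d
    static final String STEP_2 = "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d";
    // Step 4: runs of \d collapsed with {n} quantifiers
    static final String STEP_4 = "\\d{3}-\\d{3}-\\d{4}";

    // Hypothetical helper: true when both patterns give the same verdict.
    static boolean sameVerdict(String candidate) {
        return candidate.matches(STEP_2) == candidate.matches(STEP_4);
    }

    public static void main(String[] args) {
        String[] samples = {"614-345-6789", "345-6789", "6143456789"};
        for (String s : samples) {
            // String.matches requires the whole string to match the pattern
            System.out.println(s + " -> " + s.matches(STEP_4)
                    + " (same verdict as step 2: " + sameVerdict(s) + ")");
        }
    }
}
```

Only 614-345-6789 passes here. The seven-digit and unpunctuated samples are rejected, which is what the next refinements address.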

Of course, the pattern also has to accommodate numbers consisting of only seven digits, such as 345-6789. At present, it can't, because it's modeled after data that has ten digits. Reconciling the pattern to do so leads to (?:\d{3}-)?\d{3}-\d{4}. Table 5-2 shows the process of deriving the pattern.

Table 5-2: Pushing \d{3}-\d{3}-\d{4} to Accommodate Seven-Digit Numbers
Step	What I Did	Why I Did It	Justification	Resulting Pattern
Step 5	Nothing	Initial state	N/A	\d{3}-\d{3}-\d{4}
Step 6	Grouped the leftmost \d{3}- in parentheses, producing (\d{3}-)	To treat \d{3}- as a single entity that might or might not exist.	Any part of a pattern can be subgrouped.	(\d{3}-)\d{3}-\d{4}
Step 7	Made (\d{3}-) optional by producing (\d{3}-)?	So that users can omit area codes.	Adding ? after a group makes it optional.	(\d{3}-)?\d{3}-\d{4}
Step 8	Added a ?: inside the opening ( of (\d{3}-) to produce (?:\d{3}-)?	It makes the expression more efficient. Noncapturing groups require less memory.	Adding ?: inside a group makes the group noncapturing.	(?:\d{3}-)?\d{3}-\d{4}

Now the pattern will accept any seven- or ten-digit number, as long as the digits are grouped into sets of three and four and separated by hyphens.
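A brief sketch of the effect of steps 7 and 8 follows. The class name OptionalAreaCodeDemo is hypothetical. Pattern.compile and Matcher.groupCount are standard java.util.regex calls, and groupCount reports how many capturing groups a pattern declares.

```java
import java.util.regex.Pattern;

final class OptionalAreaCodeDemo {
    // Step 7: optional area code in a capturing group
    static final Pattern CAPTURING = Pattern.compile("(\\d{3}-)?\\d{3}-\\d{4}");
    // Step 8: the same group made noncapturing with ?:
    static final Pattern NON_CAPTURING = Pattern.compile("(?:\\d{3}-)?\\d{3}-\\d{4}");

    public static void main(String[] args) {
        String[] samples = {"614-345-6789", "345-6789", "345 6789", "6143456789"};
        for (String s : samples) {
            System.out.println(s + " -> " + NON_CAPTURING.matcher(s).matches());
        }
        // Number of capturing groups declared by each pattern
        System.out.println("capturing groups (step 7): " + CAPTURING.matcher("").groupCount());
        System.out.println("capturing groups (step 8): " + NON_CAPTURING.matcher("").groupCount());
    }
}
```

Both hyphenated samples now pass, while the space-separated and unpunctuated forms are still rejected. The two patterns accept the same strings; they differ only in whether the area code is stored as a group.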

If I were programming in Perl, the next natural step would probably be to account for punctuation in the candidate, deal with an opening parenthesis that might or might not be there, and so on. Of course, this would most likely be addressed in the pattern itself.

But I'm not using Perl; I'm using a full-featured, object-oriented language that's been designed to deal with nuisances while remaining clear. I decide to take advantage of that by scrubbing the data and relying more on programmatic logic and less on regex wizardry.

Next, I use String.replaceAll to remove all punctuation and spacing from the phone number, as follows:

String scrubbedPhone = phone.replaceAll("\\p{Punct}|\\s","");

This replaces every punctuation or whitespace character with a zero-length string.

Note

\p{Punct} is a POSIX, U.S. ASCII predefined class that matches any punctuation character. Specifically, it matches !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ It was introduced in Chapter 1, in Table 1-12.

This way, I can count on the phone number being in the form 6143456789 or 3456789. This is great news, because I can now simplify the pattern even further, as shown in Table 5-3. By breaking the process down into a separate step, I've decreased the amount of complexity that will go into a given pattern.
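To see the scrub applied to the whole sample set, here is a minimal sketch. The class ScrubDemo and the method scrub are hypothetical names; the method simply wraps the replaceAll call shown earlier.

```java
final class ScrubDemo {
    // Hypothetical helper: strips ASCII punctuation and whitespace.
    static String scrub(String phone) {
        return phone.replaceAll("\\p{Punct}|\\s", "");
    }

    public static void main(String[] args) {
        String[] samples = {
            "614-345-6789", "345-6789", "345 6789", "345.6789",
            "3456789", "(614)345-6789", "6143456789"
        };
        for (String s : samples) {
            // Every sample collapses to a bare run of seven or ten digits
            System.out.println(s + " -> " + scrub(s));
        }
    }
}
```

After the scrub, each sample is either 3456789 or 6143456789, so the pattern no longer needs to know about separators at all.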

Table 5-3: Removing References to - from (?:\d{3}-)?\d{3}-\d{4} to Accommodate the Data Scrub
Step	What I Did	Why I Did It	Justification	Resulting Pattern
Step 9	Nothing	Initial state	N/A	(?:\d{3}-)?\d{3}-\d{4}
Step 10	Removed all references to the character -	This scrubbing guarantees that I'll never have to deal with punctuation in the phone number. Thus, it would be a mistake to expect it.	Removing the - simply means that the pattern won't check for, or require, the existence of a hyphen.	(?:\d{3})?\d{3}\d{4}
Step 11	Treated (?:\d{3}) as a single entity and checked for its existence one or two times; thus, replaced (?:\d{3})?\d{3} with (?:\d{3}){1,2}	To make the expression less verbose.	(?:\d{3})?\d{3} means "three digits or six digits." (?:\d{3}){1,2} means exactly the same thing. Thus, they're logically equivalent statements.	(?:\d{3}){1,2}\d{4}
Step 12	Went back to using the pattern from step 10.	Although the pattern from step 11 is briefer, it's more difficult to read.	See previous.	(?:\d{3})?\d{3}\d{4}
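The claim that steps 10 and 11 are logically equivalent can be spot-checked with a small sketch. The class EquivalenceDemo and the method agree are hypothetical names. Checking a handful of inputs illustrates the equivalence but does not prove it.

```java
final class EquivalenceDemo {
    static final String STEP_10 = "(?:\\d{3})?\\d{3}\\d{4}";  // optional 3 digits, then 3, then 4
    static final String STEP_11 = "(?:\\d{3}){1,2}\\d{4}";    // one or two groups of 3, then 4

    // Hypothetical helper: true when both patterns give the same verdict.
    static boolean agree(String candidate) {
        return candidate.matches(STEP_10) == candidate.matches(STEP_11);
    }

    public static void main(String[] args) {
        // Lengths 0, 6, 7, 8, and 10; only 7 and 10 digits should be accepted
        String[] samples = {"", "123456", "3456789", "12345678", "6143456789"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" step 10: " + s.matches(STEP_10)
                    + ", step 11: " + s.matches(STEP_11));
        }
    }
}
```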

Notice that I back off from a change made in step 11 in step 12. Although it's true that step 11 made the expression less verbose, it also made it more difficult to read. In this case, I'm willing to have the pattern be slightly longer, if it will also be easier to read and maintain. Thus, the heart of my code consists of two lines. The first is line 32, which strips out any and all punctuation from the candidate:

String tmp = phone.replaceAll("\\p{Punct}|\\s","");

The second is line 38, which applies the pattern:

boolean retval = tmp.matches(PHONE_PATTERN); // (\d{3})?\d{3}\d{4}

Listing 5-1 shows the full implementation.

Listing 5-1: Searching for a Phone Number

01  import java.util.regex.*;
02  import java.util.logging.Logger;

03  public class MatchPhoneNumber{
04  private static Logger log = Logger.getAnonymousLogger();
05     private static final String PHONE_NUMBER_KEY="phoneNumber";
06     /**
07     * Confirms that the format for the given phone number is valid.
08     * @param phone a String representing the phone number.
09     * @return true if the phone number format is acceptable.
10    */
11    public static boolean isPhoneValid(String phone){
12       boolean retval=false;
13           String msg = "\r\nCANDIDATE:" + phone;


14       //make sure the candidate has a shot at passing
15       if (phone != null && phone.length() > 6)
16       {
17           //load the regex properties file
18           RegexProperties rb = new RegexProperties();
19           try
20           {
21             rb.load("../regex.properties");
22           }
23           catch(Exception e)
24           {
25                 e.printStackTrace();
26           }

27           //scrub the phone number, removing spaces
28           //and punctuation. We could store this
29           //pattern in the regex.properties file as well,
30           //but it's not really so complex that
31           //it's confusing when Java-delimited
32           String tmp = phone.replaceAll("\\p{Punct}|\\s","");

33           //extract appropriate regex pattern and run check
34           //in this case (\d{3})?\d{3}\d{4}
36           String phoneNumberPattern=rb.getProperty(PHONE_NUMBER_KEY);

37           //do the actual comparison
38           retval= tmp.matches(phoneNumberPattern);

39           //log for debug purposes
40           msg += ":\r\nREGEX:" + phoneNumberPattern;
41      }
42      msg += "\r\nRESULT:" + retval +"\r\n";
43      log.info(msg);
44      return retval;
45   }
46  public static void main(String args[]) throws Exception{
47    if (args != null && args.length == 1)
48       System.out.println(isPhoneValid(args[0]));
49    else
50       System.out.println("usage: java MatchPhoneNumber <phoneNumber>");
51   }
52 }

Even programmers who don't know anything about regex can follow the code, which speaks to the elegance of J2SE regex.

Is the extra verbosity justified? Would it have been better to simply write the regex in a single line for this particular case? This is the sort of decision you'll need to make on a case-by-case basis for your particular needs. In my opinion, it's better to err on the side of verbosity than to risk terse code.
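To make the trade-off concrete, the sketch below places the scrub-then-match approach next to one possible single-line pattern. Everything here is illustrative: the class PhoneCheckComparison, both method names, and the ONE_LINE pattern are assumptions that do not come from Listing 5-1, and the step 12 pattern is hardcoded rather than loaded from a properties file.

```java
import java.util.regex.Pattern;

final class PhoneCheckComparison {
    // Pattern from step 12, hardcoded for this sketch.
    private static final Pattern SCRUBBED = Pattern.compile("(?:\\d{3})?\\d{3}\\d{4}");

    // Illustrative single-line alternative (an assumption, not the author's pattern):
    // optional area code with optional parentheses, optional -, ., or whitespace separators.
    private static final Pattern ONE_LINE =
            Pattern.compile("(?:\\(?\\d{3}\\)?[-.\\s]?)?\\d{3}[-.\\s]?\\d{4}");

    // Two-step approach: scrub punctuation and whitespace, then apply the simple pattern.
    static boolean scrubThenMatch(String phone) {
        if (phone == null) {
            return false;
        }
        String tmp = phone.replaceAll("\\p{Punct}|\\s", "");
        return SCRUBBED.matcher(tmp).matches();
    }

    // One-step approach: the pattern itself has to describe every accepted layout.
    static boolean oneLineMatch(String phone) {
        return phone != null && ONE_LINE.matcher(phone).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "614-345-6789", "345-6789", "345 6789", "345.6789",
            "3456789", "(614)345-6789", "6143456789", "6-1-4-3-4-5-6-7-8-9"
        };
        for (String s : samples) {
            System.out.println(s + " -> scrub: " + scrubThenMatch(s)
                    + ", one line: " + oneLineMatch(s));
        }
    }
}
```

The two approaches agree on the seven samples from the start of this section, but they are not interchangeable. The scrub accepts any arrangement of punctuation around the right number of digits, such as 6-1-4-3-4-5-6-7-8-9, whereas the single-line pattern rejects it. Which behavior is preferable depends on how strict the validation needs to be.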