Extracting Phone Numbers From a File

In this example, I want to parse a file and extract any and all phone numbers. This is a program I wrote to help a friend who owns a small IT shop. He had all sorts of electronic documents and needed to extract phone numbers from them to call his clients back. I'll start this process at the very beginning, where I gathered requirements by asking the following questions:

Q:	Are you looking for U.S. numbers or international ones?
Q:	Is speed an issue? Is someone going to be tapping his foot, waiting for this to finish?
Q:	Do the phone numbers follow any sort of consistent format?
Q:	Do they have hyphens or spaces in them?
Q:	Is the format of the file subject to change?
Q:	If there had to be a mistake, would you prefer too many phone number candidates or too few?
Q:	Do you need these numbers returned in any particular kind of format?
Q:	Do you have some files you've already looked through that I can use for testing?
Q:	How big are these files?
Q:	How many files are there?
Q:	What types of files are these?

Answers

A:	U.S. numbers, but that could change.
A:	No, running overnight is fine.
A:	They're either seven or ten digits.
A:	It depends—sometimes they do.
A:	Yes.
A:	Too many.
A:	I hadn't thought of that, but a consistent format would be great.
A:	Yes.
A:	Not that big. I don't know.
A:	About ten per night.
A:	Microsoft Word documents.

I think I have enough information at this point to get started. It sounds like the client wants anything that might be a seven- or ten-digit phone number, and that speed isn't an issue. It also sounds like the files don't get that large. This should be as simple as defining a phone number pattern and using the previously presented search methods. After all, I can already access a file and search its content. I decide to keep the actual regex in an external property file, of course, so I can tweak it as needed. This is going to be an error-prone process until I get a sense of these files.
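As a rough illustration of keeping the pattern outside the code, here is a minimal sketch using java.util.Properties. The class and the parsePattern helper are hypothetical names, not the RegexUtil helper used later, and the file contents are simulated with a string so the example runs on its own. In practice a FileReader on the property file would take the place of the StringReader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

public class RegexPropertyDemo {

    // Hypothetical helper: parses properties-format text and compiles the
    // pattern stored under the given key.
    static Pattern parsePattern(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String regex = props.getProperty(key);
        if (regex == null) {
            throw new IllegalArgumentException("No pattern stored under key: " + key);
        }
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a property file. In the file itself each
        // regex backslash is written twice, because the properties format
        // treats a single backslash as an escape character.
        String fileContents = "allPhones=(?:\\\\d{3}-?)?\\\\d{3}-?\\\\d{4}\n";

        Pattern phones = parsePattern(fileContents, "allPhones");
        System.out.println("Loaded pattern: " + phones.pattern());
        System.out.println("Matches 345-6789: " + phones.matcher("345-6789").matches());
    }
}
```

Changing the pattern then means editing one line of the property file, with no recompilation.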

I'm ready to start. I decide to do a quick search of the Web, and I come up with a few patterns for phone numbers. Some of these are a little esoteric, but I'm willing to try them because my client wants as many candidates as possible. The patterns I found are as follows:

    ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
    ^([0-1]([\s-./\\])?)?(\(?[2-9]\d{2}\)?|[2-9]\d{3})([\s-./\\])?(\d{3}([\s-./\\])?\d{4}|[a-zA-Z0-9]{7})$
    ^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$

I write a quick pattern that ORs these patterns together, run it against some documents, and find that, in fact, it doesn't work. Although I'm getting some phone numbers back, I'm also getting matches that can't possibly be phone numbers because they include letters, long runs of spaces, and stray punctuation.
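To see why a borrowed pattern can be too permissive, the following sketch tests only the first pattern from the list against a few sample lines. The class name, the accepts helper, and the sample strings are illustrative assumptions, not part of the original program.

```java
import java.util.regex.Pattern;

public class LoosePatternDemo {

    // The first candidate pattern from the list, escaped for a Java string literal.
    static final Pattern LOOSE =
        Pattern.compile("^(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*$");

    // True when the whole line satisfies the pattern.
    static boolean accepts(String line) {
        return LOOSE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "614-345-6789", "", "__ -- ()", "1 2 3", "call me" };
        for (String s : samples) {
            System.out.println("[" + s + "] accepted: " + accepts(s));
        }
    }
}
```

Because every part of that pattern is optional, it accepts an empty line and a line made up only of underscores, hyphens, spaces, and parentheses, which is the kind of noise described above.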

I have two choices here: I can run a second validation on the candidates that do match, or I can tweak the pattern. This time, I decide to take my own pattern from earlier and try to work in the good traits from the other patterns, as shown in Table 5-5. This is the composition technique introduced in Chapter 1.

Table 5-5: Pulling a General Regex Pattern from 614-345-6789
Step	What I Did	Why I Did It	Justification	Resulting Pattern
Step 1	Nothing	Initial state	N/A	(?:\d{3})?\d{3}\d{4}
Step 2	Put optional - between the number groups	To get a more generic description	The phone number can be delimited with punctuation.	(?:\d{3}-?)?\d{3}-?\d{4}
Step 3	Put optional spaces between the number groups	To get a more generic description	The phone number can be delimited with spaces.	(?:\d{3}-?\s?)?\d{3}-?\s?\d{4}
Step 4	Swapped out - for \p{Punct}	To accommodate punctuation	\p{Punct} is a superset of -	(?:\d{3}\p{Punct}?\s?)?\d{3}\p{Punct}?\s?\d{4}
Step 5	Replaced (?:\d{3}\p{Punct}?\s?)?\d{3}\p{Punct}?\s? with (?:\d{3}\p{Punct}?\s?){1,2}	To create a more succinct pattern	The two are equivalent statements.	(?:\d{3}\p{Punct}?\s?){1,2}\d{4}
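The final pattern from the table can be tried out with a short sketch like the one below. The class name, the findCandidates helper, and the sample sentence are assumptions made for illustration. The sketch searches an in-memory string rather than a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhonePatternDemo {

    // The Step 5 pattern, escaped for a Java string literal: one or two
    // three-digit groups, each optionally followed by a punctuation character
    // and a whitespace character, then four digits.
    static final Pattern PHONE =
        Pattern.compile("(?:\\d{3}\\p{Punct}?\\s?){1,2}\\d{4}");

    // Collects every substring of the text that matches the pattern.
    static List<String> findCandidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Call 614-345-6789 or 345 6789, fax 614.345.6789.";
        for (String candidate : findCandidates(text)) {
            System.out.println(candidate);
        }
        // prints:
        // 614-345-6789
        // 345 6789
        // 614.345.6789
    }
}
```

The pattern is deliberately generous, which fits the client's stated preference for too many candidates over too few.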

Using this pattern, I find that the results look reasonable. Finally, just before finishing, I decide to format all of my output in the form ddd-dddd or ddd-ddd-dddd. The resulting code is shown in Listing 5-12.

Listing 5-12: Extracting Phone Numbers from a File

01  /**
02  * mines phone numbers out of the given file, and returns
03  * them as strings.
04  * @param filePath the path of the file to search
05  * @throws IOException if the file is not found or
06  * is corrupted
07  *
08  * @return ArrayList containing well-formatted phone numbers
09  * of the form ddd-ddd-dddd or ddd-dddd
10 */
11  public static ArrayList minePhoneNumbers(String filePath)
12  throws IOException{

13     ArrayList retval = new ArrayList();
14     //get pattern
15     String regex = RegexUtil.getProperty("../regex.properties","allPhones");
16     //find all the matches
17     Map result =
18        RegexUtil.searchFile(filePath, regex ,Pattern.MULTILINE);

19     //get the matching strings
20     Iterator it = result.values().iterator();

21     //provide a consistent format for phone numbers captured
22     while (it.hasNext())
23     {
24        String num = (String)it.next();
25        num = num.replaceAll("\\p{Punct}|\\s","");

26        if (num.length() == 7)
27          num=num.replaceAll("(\\d{3})(\\d{4})","$1-$2");
28        else
29         num=num.replaceAll("(\\d{3})(\\d{3})(\\d{4})","$1-$2-$3");

30       retval.add(num);
31     }

32     return retval;
33  }

Listing 5-12 is fairly self-explanatory. However, I do want to point out lines 27 and 29. Notice how easy it was to make a minor adjustment and produce well-formatted, consistent output here. Line 27, for example, simply says, "I'd like to capture the first three digits in group number 1 and the last four digits in group number 2. Then, I'd like to separate those two groups with a hyphen." Again, this involved very easy, but ultimately very powerful, code.
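The group-based reformatting can also be tried in isolation. The following is a minimal sketch, assuming the ddd-dddd and ddd-ddd-dddd output forms stated earlier. The class and method names are hypothetical and the input strings are made up for illustration.

```java
public class PhoneFormatDemo {

    // Strips punctuation and whitespace, then rebuilds the number with hyphens.
    // Assumes the input contains exactly seven or ten digits once stripped.
    static String format(String raw) {
        String digits = raw.replaceAll("\\p{Punct}|\\s", "");
        if (digits.length() == 7) {
            // Group 1 is the first three digits, group 2 the last four.
            return digits.replaceAll("(\\d{3})(\\d{4})", "$1-$2");
        }
        // Groups 1 and 2 are the three-digit blocks, group 3 the last four digits.
        return digits.replaceAll("(\\d{3})(\\d{3})(\\d{4})", "$1-$2-$3");
    }

    public static void main(String[] args) {
        System.out.println(format("614.345 6789")); // prints 614-345-6789
        System.out.println(format("345 6789"));     // prints 345-6789
    }
}
```

In the replacement string, $1, $2, and $3 refer to the text captured by the first, second, and third parenthesized groups of the pattern.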