Confirming Zip Codes

The next challenge is to provide a method that confirms zip codes for the United States. The method needs to accommodate punctuation, a space, or no delimiter at all between the five-digit and four-digit parts of the zip code. It needs to accommodate zip codes that are only five digits long. Suddenly, there's requirements creep: It now needs to validate zip codes for Canada, the United Kingdom, Argentina, Sweden, Japan, and the Netherlands as well.

The first thing I do is search the Web for patterns, starting at http://www.regexlib.com. This returns regular expressions for all of the countries previously mentioned. Next, I take those regular expressions and create entries in the regex.properties file, so I can use the RegexProperties class from Chapter 4.

The point of doing so, of course, is to externalize the expressions themselves and to avoid having to double-escape special characters, as Java string literals would require. I decide to use intelligent keys for the property keys. That is, I'm anticipating that I'll have access to the country code for each of these regex patterns. Therefore, I can define the property file keys based on that country code. For example, since the country code for Japan is JP, I define the key to the zip code pattern for Japan as zipJP. Listing 5-2 summarizes the entries made to the regex.properties file.

Listing 5-2: New Entries in the regex.properties File

#Japanese postal codes
zipJP=^\d{3}-\d{4}$

#US postal codes
zipUS=^\d{5}\p{Punct}?\s?(?:\d{4})?$

#Dutch postal code
zipNL=^[0-9]{4}\s*[a-zA-Z]{2}$

#Argentinean postal code
zipAR=^\d{3}-\d{4}$

#Swedish postal code
zipSE=^(s-|S-){0,1}[0-9]{3}\s?[0-9]{2}$

#Canadian postal code
zipCA=^([A-Z]\d[A-Z]\s\d[A-Z]\d)$

#UK postal code
zipUK=^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
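
As an aside on why a helper such as RegexProperties is useful for entries like these: the standard java.util.Properties loader treats the backslash as an escape character and silently drops it before characters such as d or s, so a raw entry like zipJP=^\d{3}-\d{4}$ does not survive a plain load. The sketch below shows that effect next to one possible workaround, a small parser that splits each line on the first = and keeps the value verbatim. The class RawPatternLoader and its methods are hypothetical names used for illustration only; this is not the RegexProperties implementation from Chapter 4.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the book's RegexProperties class.
final class RawPatternLoader {

    // Two entries from Listing 5-2. The doubled backslashes are only needed
    // because this sample lives in a Java string literal; the text itself
    // contains single backslashes, exactly as in the properties file.
    static final String SAMPLE =
          "#Japanese postal codes\n"
        + "zipJP=^\\d{3}-\\d{4}$\n"
        + "#Dutch postal code\n"
        + "zipNL=^[0-9]{4}\\s*[a-zA-Z]{2}$\n";

    // Splits each non-comment line on the first '=' and keeps the value as written.
    static Map<String, String> parse(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int eq = trimmed.indexOf('=');
            if (eq > 0) {
                entries.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return entries;
    }

    // Loads the same text with the standard Properties class, for comparison.
    static String viaStandardProperties(String text, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String raw = parse(SAMPLE).get("zipJP");
        String std = viaStandardProperties(SAMPLE, "zipJP");
        System.out.println("raw parser : " + raw);
        System.out.println("Properties : " + std);
        System.out.println("raw matches 123-4567: " + Pattern.matches(raw, "123-4567"));
        System.out.println("std matches 123-4567: " + Pattern.matches(std, "123-4567"));
    }
}
```

With the standard loader, the Japanese entry comes back as ^d{3}-d{4}$, which still compiles as a regex but no longer matches a real postal code. Whatever loader is used, it's worth checking that the backslashes arrive intact.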

Finally, I write the code. The algorithm is to look up the appropriate regex for a given country given the appropriate country code, apply the pattern, and return true or false as appropriate. Listing 5-3 shows the code that does this.

Listing 5-3: Matching Zip Codes for Various Countries

01  import java.io.*;
02  import java.util.logging.Logger;
03  import java.util.regex.*;

04  /**
05  *Validates zip codes from the given country.
06  *@author M Habibi
07  */
08  public class MatchZipCodes{
09     private static Logger log = Logger.getAnonymousLogger();
10     private static final String ZIP_PATTERN="zip";
11     private static RegexProperties regexProperties;
12     //load the regex properties file
13     //do this at the class level
14     static
15     {
16         try
17         {
18             regexProperties = new RegexProperties();
19             regexProperties.load("../regex.properties");
20         }
21         catch(Exception e)
22         {
23             e.printStackTrace();
24         }
25     }

26     public static void main(String args[]){
27         String msg = "usage: java MatchZipCodes countryCode Zip";

28         if (args != null && args.length == 2)
29             msg = ""+isZipValid(args[0],args[1]);

30         //output either the usage message, or the results
31         //of running the isZipValid method
32         System.out.println(msg);
33     }
34     /**
35     * Confirms that the format for the given zip code is valid.
36     * @param countryCode the <code>String</code> country code, such as US
37     * @param zip the <code>String</code> zip code to check
38     * @return <code>boolean</code>
39     *
40     * @author M Habibi
41     */
42     public static boolean isZipValid(String countryCode, String zip)
43     {
44         boolean retval=false;
45         //use the country code to form a unique key into the regex
46         //properties file
47         String zipPatternKey = ZIP_PATTERN + countryCode.toUpperCase();

48         //extract the regex pattern for the given country code
49         String zipPattern = regexProperties.getProperty(zipPatternKey);

50         //if there was some sort of problem, don't bother trying
51         //to execute the regex
52         if (zipPattern != null)
53             retval = zip.trim().matches(zipPattern);
54         else
55         {
56             String msg = "regex for country code "+countryCode;
57             msg+= " not found in property file ";
58             log.warning(msg);
59         }
60         //create log report
61         String msg = "regex="+zipPattern +
62         "\nzip="+zip+"\nCountryCode="+
63         countryCode+"\nmatch result="+retval;
64         log.finest(msg);

65         return retval;
66     }
67 }

Outside of the comments and such, the real work in this method is done in three lines. Line 47 forms the proper key based on the country code:

47      String zipPatternKey = ZIP_PATTERN + countryCode.toUpperCase();

For example, zipPatternKey equals zipUS for the US country code. Next, line 49 extracts the relevant pattern based on that key:

49   String zipPattern = regexProperties.getProperty(zipPatternKey);

Line 53 then matches the trimmed zip code against that pattern:

53             retval = zip.trim().matches(zipPattern);
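
These three steps can be tried in isolation with the minimal sketch below. So that it runs on its own, it swaps the RegexProperties lookup for an in-memory map holding two of the entries from Listing 5-2. The class name ZipLookupSketch and the map are assumptions made for illustration. The sketch also passes Locale.ROOT to toUpperCase, which the original listing does not do, to keep the key independent of the default locale.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative stand-in for Listing 5-3; the map replaces regex.properties.
final class ZipLookupSketch {
    private static final String ZIP_PATTERN = "zip";
    private static final Map<String, String> PATTERNS = new HashMap<>();

    static {
        // Patterns copied from Listing 5-2, escaped for Java string literals.
        PATTERNS.put("zipUS", "^\\d{5}\\p{Punct}?\\s?(?:\\d{4})?$");
        PATTERNS.put("zipJP", "^\\d{3}-\\d{4}$");
    }

    static boolean isZipValid(String countryCode, String zip) {
        // Step 1: form the key from the country code, e.g. "zipUS".
        String key = ZIP_PATTERN + countryCode.toUpperCase(Locale.ROOT);
        // Step 2: look up the pattern; null means the country is not configured.
        String pattern = PATTERNS.get(key);
        // Step 3: match the trimmed zip code against the whole pattern.
        return pattern != null && zip.trim().matches(pattern);
    }

    public static void main(String[] args) {
        String[][] samples = {
            {"us", "12345"}, {"US", "12345-6789"}, {"US", "12345 6789"},
            {"US", "123456789"}, {"US", "1234"}, {"jp", "123-4567"}, {"FR", "75001"}
        };
        for (String[] s : samples) {
            System.out.println(s[0] + " " + s[1] + " -> " + isZipValid(s[0], s[1]));
        }
    }
}
```

In this sketch an unconfigured country code such as FR simply yields false, whereas Listing 5-3 also logs a warning in that case.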

The only regex change I made in this example was to the U.S. pattern, which I made a little more memory efficient and a little more lenient, as shown in Table 5-4. Specifically, leniency means that the pattern will accept any punctuation, a space, or no delimiter at all between the first five digits and the last four digits of a U.S. zip code. The pattern will also accept five digits as a sufficient U.S. zip code.

Table 5-4: Making the Zip Code Pattern More Lenient and Efficient
Step	What I Did	Why I Did It	Justification	Resulting Pattern
Step 1	Nothing	Initial state	N/A	\d{5}(-\d{4})?
Step 2	Added ?: inside the capturing group *(-\d{4})* to produce *(?:-\d{4})*	To produce a more efficient pattern.	We don't need a capture here.	\d{5}(?:-\d{4})?
Step 3	Replaced - with *\p{Punct}?* to produce *(?:\p{Punct}?\d{4})?*	Any punctuation—or no punctuation at all—can be used as a delimiter.	*\p{Punct}?* is a superset of -, and it's optional, so the regex engine is now willing to accept any punctuation or no punctuation at all as a delimiter.	\d{5}(?:\p{Punct}?\d{4})?
Step 4	Added a *\s?* pattern to the list of acceptable delimiters between the five digits and the four digits of a U.S. zip code	Zip codes that use a space or empty string to separate the five digits from the four digits will pass.	A space between the five digits of a U.S. zip code and the following four digits is optional. This is simply a more lenient interpretation.	\d{5}(?:\p{Punct}?\s?\d{4})?
Step 5	Moved the *\p{Punct}?\s?* out of the noncapturing group	This improves readability. Optional subpatterns inside an optional noncapturing group can be hard to follow.	The two are nearly equivalent. The one difference is that a five-digit code followed by a lone delimiter, such as 12345-, is now accepted as well.	\d{5}\p{Punct}?\s?(?:\d{4})?
Step 6	Surrounded the pattern with a beginning-of-line ^ anchor and an end-of-line $ anchor	This states explicitly that the whole string must match. String.matches already requires a full match, so the anchors mainly protect the pattern if it's later used with a find-style search.	All zip codes will be coming into the method as extracted strings. Thus, they'll always have a beginning of line and an end of line.	^\d{5}\p{Punct}?\s?(?:\d{4})?$
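
To see what each step of Table 5-4 changes in practice, the following sketch runs the six resulting patterns against a handful of sample inputs. The class name ZipPatternSteps and the sample values are assumptions chosen for illustration. Pattern.matches is used throughout, so every pattern must match the entire input, with or without the anchors.

```java
import java.util.regex.Pattern;

// Illustrative harness for the "Resulting Pattern" column of Table 5-4.
final class ZipPatternSteps {
    static final String[] STEPS = {
        "\\d{5}(-\\d{4})?",                      // step 1: initial pattern
        "\\d{5}(?:-\\d{4})?",                    // step 2: noncapturing group
        "\\d{5}(?:\\p{Punct}?\\d{4})?",          // step 3: any or no punctuation
        "\\d{5}(?:\\p{Punct}?\\s?\\d{4})?",      // step 4: optional whitespace
        "\\d{5}\\p{Punct}?\\s?(?:\\d{4})?",      // step 5: delimiters moved out
        "^\\d{5}\\p{Punct}?\\s?(?:\\d{4})?$"     // step 6: anchored
    };

    // Returns true if the pattern from the given step (1-based) matches all of zip.
    static boolean accepts(int step, String zip) {
        return Pattern.matches(STEPS[step - 1], zip);
    }

    public static void main(String[] args) {
        String[] samples = {
            "12345", "12345-6789", "12345.6789",
            "123456789", "12345 6789", "12345-"
        };
        System.out.println("input         1 2 3 4 5 6");
        for (String zip : samples) {
            StringBuilder row = new StringBuilder(String.format("%-13s", zip));
            for (int step = 1; step <= STEPS.length; step++) {
                row.append(accepts(step, zip) ? " Y" : " n");
            }
            System.out.println(row);
        }
    }
}
```

Two rows are worth a look when running this. The input 12345 6789 is first accepted at step 4, and the input 12345- is rejected through step 4 but accepted from step 5 onward, because the delimiter no longer has to be followed by four digits.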

Because the regex patterns are externalized, they can be tweaked later to become more accommodating for the various regions. Better yet, more country codes can be added without requiring code changes: Simply add the appropriate entries to the regex.properties file.
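
A rough sketch of that idea follows: the same validation method is run against two versions of the configuration text, and only the text differs. The class ExtensibleZipCheck, its simple key=value parser, and the zipDE entry for five-digit German postal codes are all hypothetical additions for illustration; none of them appear in the listings in this chapter.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: new countries are added through configuration only.
final class ExtensibleZipCheck {

    // Parses key=value lines, keeping backslashes exactly as written.
    static Map<String, String> parse(String text) {
        Map<String, String> entries = new HashMap<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            if (!trimmed.startsWith("#") && eq > 0) {
                entries.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return entries;
    }

    // Same lookup-and-match logic regardless of which countries are configured.
    static boolean isZipValid(Map<String, String> patterns, String countryCode, String zip) {
        String pattern = patterns.get("zip" + countryCode.toUpperCase(Locale.ROOT));
        return pattern != null && zip.trim().matches(pattern);
    }

    public static void main(String[] args) {
        String original = "zipUS=^\\d{5}\\p{Punct}?\\s?(?:\\d{4})?$\n";
        // Hypothetical new entry; the Java code above stays untouched.
        String extended = original + "#German postal code\nzipDE=^\\d{5}$\n";

        System.out.println("before: " + isZipValid(parse(original), "DE", "10115"));
        System.out.println("after : " + isZipValid(parse(extended), "DE", "10115"));
    }
}
```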

The point here is that even using generic regex patterns found online, I still have a very Java-like flavor to the code. It's modular, adaptable, scalable, and clear.