Integrating Java with Regular Expressions

Thus far, you've worked almost exclusively with regular expressions, but not really with Java. Now it's time to consider how the two interact. The following examples differ from the preceding ones in that they incorporate Java code with regular expressions. They offer a more complete picture of how you can use some J2SE regex syntax.

Some of the regular expressions you'll see here are slightly more advanced than in the examples you've seen previously, as they build on the fundamentals discussed thus far in the chapter. For example, Listing 1-2 combines groups with quantifiers.

Listing 1-2: MatchPhoneNumber.java

import java.util.regex.*;
public class MatchPhoneNumber{
   public static void main(String args[]){
      isPhoneValid(args[0]);
   }

   /**
   * Confirms that the format for the given phone number is valid.
   * @param phone is a String representing the phone number.
   * @return true if the phone number format is acceptable.
   */
   public static boolean isPhoneValid(String phone){
      boolean retval=false;

      String phoneNumberPattern =
        "(\\d-)?(\\d{3}-)?\\d{3}-\\d{4}";

      retval= phone.matches(phoneNumberPattern);

      //prepare a message indicating success or failure
      String msg = "   NO MATCH: pattern:" + phone
             + "\r\n             regex: " + phoneNumberPattern;

      if (retval){
      msg = "   MATCH   : pattern:" + phone
          + "\r\n             regex: " + phoneNumberPattern;
      }

      System.out.println(msg +"\r\n");
      return retval;
   }
}

Don't be discouraged if the patterns themselves aren't completely clear to you right now. An intuitive understanding will develop as you continue to read this book. Focus on the concepts and become comfortable with how the Java code and the regex complement each other.

There are only two pieces of information you need to take full advantage of the following examples:

Any \-delimited regex metacharacter needs to be delimited once again when it's used in Java code. Thus, \d becomes \\d and \s becomes \\s in your Java code. Correspondingly, a more complex expression such as (\d-)?(\d{3}-)?\d{3}-\d{4}\s becomes (\\d-)?(\\d{3}-)?\\d{3}-\\d{4}\\s in Java code. All \ characters are doubled to produce \\ when they're used in a String literal.
In this book, when I talk about a regular expression in and of itself, I don't use the double delimiting mechanism. However, I do when working with specific coding examples.
The String.matches(String regex) method is a new method that has been added to the String class. It compares the String it's called on to the given regular expression, regex, and returns true if the regex pattern matches the String exactly. To match exactly means that the String in question can't contain any characters—not even invisible characters such as newlines and spaces—that aren't accounted for in the regex pattern.
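Both points can be checked with a few lines of code. The following is a minimal sketch rather than one of the book's listings; the class name EscapeAndExactMatch, the constant SEVEN_DIGITS, and the sample inputs are made up for illustration.

```java
public class EscapeAndExactMatch {

    // The regex \d{3}-\d{4}, written as a Java String literal:
    // every backslash is doubled.
    public static final String SEVEN_DIGITS = "\\d{3}-\\d{4}";

    /** Returns true only if the whole input matches the regex. */
    public static boolean isSevenDigitNumber(String input) {
        return input.matches(SEVEN_DIGITS);
    }

    public static void main(String[] args) {
        // Printing the String shows the regex as the regex engine sees it,
        // with single backslashes.
        System.out.println(SEVEN_DIGITS);                     // \d{3}-\d{4}

        System.out.println(isSevenDigitNumber("111-2222"));   // true
        // A trailing space is not accounted for in the pattern.
        System.out.println(isSevenDigitNumber("111-2222 "));  // false
        // Extra leading characters also prevent an exact match.
        System.out.println(isSevenDigitNumber("x111-2222"));  // false
    }
}
```

The doubled backslashes exist only in the source code. Once compiled, the String holds the single-backslash form, which is why the printed regex in the outputs throughout this section shows \d rather than \\d.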

Confirming Phone Number Formats Example

The code in Listing 1-2 simply determines whether the given phone number is well formatted. It takes advantage of two metacharacters introduced in Table 1-6. Specifically, it uses the range quantifier in its {n} form, indicating that the previous character or class must be repeated exactly n times (the longer form, {n,m}, means at least n times and no more than m times). It also uses the ? character, indicating that the previous character, class, or group must be present zero or one time.

The pattern as a whole checks for seven digits preceded by optional country and area codes. Output 1-2 shows the result of running the program, and Table 1-19 dissects the pattern.

Table 1-19: The Pattern (\d-)?(\d{3}-)?\d{3}-\d{4}*
Regex	Description
(	A group consisting of
\d	A digit
-	Followed by a hyphen (-)
)	The end of this group
?	Look for zero or one of the preceding
(	Followed by a group consisting of
\d	A digit
{	Repeated exactly
3	Three times
}	End repetition
-	Followed by a hyphen
)	The end of this group
?	Look for zero or one of the preceding
\d	Followed by a digit
{	Repeated exactly
3	Three times
}	End repetition
-	Followed by a hyphen
\d	Followed by a digit
{	Repeated exactly
4	Four times
}	End repetition
* In English: Look for a single digit followed by a hyphen. This is optional. Then, look for three digits followed by a hyphen. This is also optional. Next, look for three digits, followed by a hyphen, followed by four digits.

Output 1-2: Result of Running MatchPhoneNumber.java

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "1-999-111-2222"
   MATCH   : pattern:1-999-111-2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}


C:\RegEx\Examples\chapter1>java MatchPhoneNumber "999-111-2222"
   MATCH   : pattern:999-111-2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "1-111-2222"
   MATCH   : pattern:1-111-2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "111-2222"
   MATCH   : pattern:111-2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "1.999-111-2222"
   NO MATCH: pattern:1.999-111-2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "999 111-2222"
   NO MATCH: pattern:999 111-2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "1 111 2222"
   NO MATCH: pattern:1 111 2222
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}

C:\RegEx\Examples\chapter1>java MatchPhoneNumber "111-JAVA"
   NO MATCH: pattern:111-JAVA
             regex: (\d-)?(\d{3}-)?\d{3}-\d{4}
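Because the optional parts of the pattern are wrapped in groups, you can also ask which of them took part in a match. The sketch below is an illustration only: the class name PhoneParts and the method describe are hypothetical, and it uses Pattern and Matcher directly instead of String.matches so that the groups can be inspected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneParts {

    // Same pattern as Listing 1-2, compiled once.
    private static final Pattern PHONE =
        Pattern.compile("(\\d-)?(\\d{3}-)?\\d{3}-\\d{4}");

    /**
     * Reports the contents of the two optional groups,
     * or returns "invalid" if the format does not match.
     */
    public static String describe(String phone) {
        Matcher m = PHONE.matcher(phone);
        if (!m.matches()) {
            return "invalid";
        }
        // group(n) is null when an optional group did not take part in the match.
        String first = m.group(1);   // the single digit and hyphen, e.g. "1-"
        String second = m.group(2);  // the three digits and hyphen, e.g. "999-"
        return "group1=" + first + ", group2=" + second;
    }

    public static void main(String[] args) {
        System.out.println(describe("1-999-111-2222")); // group1=1-, group2=999-
        System.out.println(describe("999-111-2222"));   // group1=null, group2=999-
        System.out.println(describe("1-111-2222"));     // group1=1-, group2=null
        System.out.println(describe("111-2222"));       // group1=null, group2=null
        System.out.println(describe("111-JAVA"));       // invalid
    }
}
```

The third case shows why 1-111-2222 is accepted in Output 1-2: the first optional group consumes "1-", the second optional group is skipped, and the remaining seven digits satisfy the rest of the pattern.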

Confirming Zip Codes Example

The code in Listing 1-3 determines if the zip code meets the criterion of being well formatted. It checks for five digits optionally followed by a hyphen and four digits. Output 1-3 shows the result of running the program. Table 1-20 dissects the pattern.

Table 1-20: The Pattern \d{5}(-\d{4})?*
Regex	Description
\d	A digit
{	Repeated exactly
5	Five times
}	End repetition
(	Open group
-	Consisting of a hyphen
\d	A digit
{	Repeated exactly
4	Four times
}	End repetition
)	The end of this group
?	Look for zero or one of the preceding
* In English: Look for five digits, optionally followed by a hyphen and four digits.

Listing 1-3: MatchZipCodes.java

import java.util.regex.*;
import java.io.*;

public class MatchZipCodes{
   public static void main(String args[]){
      isZipValid(args[0]);
   }

   /**
   * Confirms that the format for the given zip code is valid.
   * @param zip is a String representing the zip code.
   * @return true if the zip code format is acceptable.
   */
   public static boolean isZipValid(String zip){
      boolean retval=false;
      String zipCodePattern = "\\d{5}(-\\d{4})?";
      retval = zip.matches(zipCodePattern);

      //prepare a message indicating success or failure
      String msg = "   NO MATCH: pattern:" + zip
             + "\r\n             regex: " + zipCodePattern;

      if (retval){
      msg = "   MATCH   : pattern:" + zip
          + "\r\n             regex: " + zipCodePattern;
      }

      System.out.println(msg +"\r\n");
      return retval;
   }
}

Output 1-3: Result of Running MatchZipCodes.java

C:\RegEx\Examples\chapter1>java MatchZipCodes "45643-4443"
   MATCH   : pattern:45643-4443
             regex: \d{5}(-\d{4})?

C:\RegEx\Examples\chapter1>java MatchZipCodes "45643"
   MATCH   : pattern:45643
             regex: \d{5}(-\d{4})?

C:\RegEx\Examples\chapter1>java MatchZipCodes "443"
   NO MATCH: pattern:443
             regex: \d{5}(-\d{4})?

C:\RegEx\Examples\chapter1>java MatchZipCodes "45643-44435"
   NO MATCH: pattern:45643-44435
             regex: \d{5}(-\d{4})?

C:\RegEx\Examples\chapter1>java MatchZipCodes "45643 44435"
   NO MATCH: pattern:45643 44435
             regex: \d{5}(-\d{4})?
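String.matches(regex) compiles the regular expression each time it's called. If you validate many zip codes, one option is to compile the pattern once and reuse it. The following is a minimal sketch under that assumption; the class name ZipCodeChecker and the constant ZIP are hypothetical, and the for-each loop assumes Java 5 or later.

```java
import java.util.regex.Pattern;

public class ZipCodeChecker {

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    /** Returns true if the zip code is five digits, optionally followed by -dddd. */
    public static boolean isZipValid(String zip) {
        // Matcher.matches() requires the entire input to match,
        // just as String.matches(regex) does.
        return ZIP.matcher(zip).matches();
    }

    public static void main(String[] args) {
        // The same inputs used in Output 1-3.
        String[] samples = {"45643-4443", "45643", "443", "45643-44435", "45643 44435"};
        for (String zip : samples) {
            System.out.println(zip + " -> " + isZipValid(zip));
        }
    }
}
```

For a one-off check, zip.matches(zipCodePattern) as used in Listing 1-3 is simpler and perfectly adequate.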

Confirming Dates Example

The code in Listing 1-4 checks the format of a given date. It confirms that the given date consists of one or two digits followed by a hyphen, followed by one or two digits, followed by a hyphen, followed by four digits. It checks the format only, not whether the date actually exists. Output 1-4 shows the result of running the program. Table 1-21 dissects the pattern.

Table 1-21: The Pattern \d{1,2}-\d{1,2}-\d{4}*
Regex	Description
\d	A digit
{	Repeated at least
1	One time
,	But no more than
2	Two times
}	Close repetition
-	Followed by a hyphen
\d	Followed by a digit
{	Repeated at least
1	One time
,	But no more than
2	Two times
}	Close repetition
-	Followed by a hyphen
\d	Followed by a digit
{	Repeated exactly
4	Four times
}	Close repetition
* In English: Look for one or two digits, followed by a hyphen, followed by one or two digits, followed by a hyphen, followed by four digits.

Listing 1-4: MatchDates.java

import java.util.regex.*;
import java.io.*;

public class MatchDates{
   public static void main(String args[]){
      isDateValid(args[0]);
   }
   /**
   * Confirms that given date format consists of one or two digits
   * followed by a hyphen, followed by one or two digits, followed
   * by a hyphen, followed by four digits
   * @param date is a String representing the date.
   * @return true if date format is acceptable.
   */
   public static boolean isDateValid(String date){
      boolean retval=false;

      String datePattern ="\\d{1,2}-\\d{1,2}-\\d{4}";
      retval = date.matches(datePattern);

      //prepare a message indicating success or failure
      String msg = "   NO MATCH: pattern:" + date
             + "\r\n             regexLength: " + datePattern;

      if (retval){
      msg = "   MATCH   : pattern:" + date
          + "\r\n             regexLength: " + datePattern;
      }

      System.out.println(msg +"\r\n");
      return retval;
   }
}

Output 1-4: Result of Running MatchDates.java

C:\RegEx\Examples\chapter1>java MatchDates "04-02-1999"
   MATCH   : pattern:04-02-1999
             regexLength: \d{1,2}-\d{1,2}-\d{4}

C:\RegEx\Examples\chapter1>java MatchDates "15-42-1999"
   MATCH   : pattern:15-42-1999
             regexLength: \d{1,2}-\d{1,2}-\d{4}

C:\RegEx\Examples\chapter1>java MatchDates "April fourth nineteen ninety nine"
   NO MATCH: pattern:April fourth nineteen ninety nine
             regexLength: \d{1,2}-\d{1,2}-\d{4}

C:\RegEx\Examples\chapter1>java MatchDates "15-42-20002"
   NO MATCH: pattern:15-42-20002
             regexLength: \d{1,2}-\d{1,2}-\d{4}

C:\RegEx\Examples\chapter1>java MatchDates "02-02-20002"
   NO MATCH: pattern:02-02-20002
             regexLength: \d{1,2}-\d{1,2}-\d{4}

C:\RegEx\Examples\chapter1>java MatchDates "04-02-02"
   NO MATCH: pattern:04-02-02
             regexLength: \d{1,2}-\d{1,2}-\d{4}


C:\RegEx\Examples\chapter1>java MatchDates "04-02-garbage"
   NO MATCH: pattern:04-02-garbage
             regexLength: \d{1,2}-\d{1,2}-\d{4}
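Note that 15-42-1999 matches: the pattern verifies the shape of the input, not its meaning. One way to go a step further is to capture the numeric fields and range-check them in Java. The sketch below is only an illustration. The class name DateRangeCheck and the method isPlausibleDate are hypothetical, the month-day-year field order is an assumption, and month lengths and leap years are deliberately ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRangeCheck {

    // Same shape as the pattern in Listing 1-4, with a capturing group
    // around each numeric field.
    private static final Pattern DATE =
        Pattern.compile("(\\d{1,2})-(\\d{1,2})-(\\d{4})");

    /**
     * Returns true if the date is well formatted AND its first field is 1-12
     * and its second field is 1-31. Month-day-year order is an assumption;
     * month lengths and leap years are not checked.
     */
    public static boolean isPlausibleDate(String date) {
        Matcher m = DATE.matcher(date);
        if (!m.matches()) {
            return false;
        }
        int month = Integer.parseInt(m.group(1));
        int day = Integer.parseInt(m.group(2));
        return month >= 1 && month <= 12 && day >= 1 && day <= 31;
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleDate("04-02-1999"));  // true
        System.out.println(isPlausibleDate("15-42-1999"));  // false: fields out of range
        System.out.println(isPlausibleDate("04-02-02"));    // false: wrong format
    }
}
```

The regex does the job it's good at, confirming the format and isolating the fields, and ordinary Java code handles the numeric comparison.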

Confirming Name Formats Example

The code in Listing 1-5 determines whether the given name is well formatted. It looks for a first name token, an optional middle name token, and finally a last name token. For this example's purposes, a name token consists of a capital letter followed by one or more lowercase letters.

Listing 1-5: MatchNameFormats.java

import java.util.regex.*;
import java.io.*;

public class MatchNameFormats{
   public static void main(String args[]){

      isNameValid(args[0]);
   }

   /**
   * Confirms that the format for the given name is valid.
   * @param name is a String representing the name.
   * @return true if the name format is acceptable.
   */
   public static boolean isNameValid(String name){
     boolean retval=false;

     String nameToken ="\\p{Upper}(\\p{Lower}+\\s?)";


     String namePattern = "("+nameToken+"){2,3}";

     retval = name.matches(namePattern);

     //prepare a message indicating success or failure
     String msg = "NO MATCH: pattern:" + name
          + "\r\n           regex :" + namePattern;

     if (retval){
     msg = "MATCH     pattern:" + name
          + "\r\n           regex :" + namePattern;
     }

     System.out.println(msg +"\r\n");
     return retval;
     }
}

This example is interesting because it takes advantage of Java's robustness to a degree that the previous example didn't. Specifically, you define what you mean when you say a "name token":

   String nameToken ="\\p{Upper}(\\p{Lower}+\\s?)";

Then you use that definition later:

   String namePattern = "("+nameToken+"){2,3}";

Note

\p{Upper} and \p{Lower} are described shortly. They simply mean any uppercase character and any lowercase character, respectively.

This helps to keep the regex pattern from becoming overwhelming, and it also helps to isolate errors. As the examples in this book grow more ambitious, you'll see that coupling regular expressions with Java lets you build patterns in readable pieces; written as a single regular expression, the same patterns would be, at best, terse. Listing 1-5 shows the program MatchNameFormats.java, Output 1-5 shows the result of running the program, and Table 1-22 dissects the pattern.

Table 1-22: The Pattern (\p{Upper}(\p{Lower}+\s?)){2,3}*
Regex	Description
(	A group consisting of
\p{Upper}	An uppercase character
(	Followed by an inner group consisting of
\p{Lower}	A lowercase character
+	Repeated one or more times
\s?	Followed by an optional space
)	The end of the inner group
)	The end of the outer group
{	Repeated at least
2	Two times
,	But no more than
3	Three times
}	End repetition
* In English: Look for two or three words, each beginning with a capital letter followed by one or more lowercase letters. Each word could be followed by a single space.

Output 1-5: Result of Running MatchNameFormats.java

C:\RegEx\Examples\chapter1>java MatchNameFormats "John Smith"
MATCH     pattern:John Smith
           regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

C:\RegEx\Examples\chapter1>java MatchNameFormats "John McGee"
MATCH     pattern:John McGee
           regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

C:\RegEx\Examples\chapter1>java MatchNameFormats "John Willliam Smith"
MATCH     pattern:John Willliam Smith
           regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

C:\RegEx\Examples\chapter1>java MatchNameFormats "John Q Smith"
NO MATCH: pattern:John Q Smith
           regex :(\p{Upper}(\p{Lower}+\s?)){2,3}


C:\RegEx\Examples\chapter1>java MatchNameFormats "John allen Smith"
NO MATCH: pattern:John allen Smith
           regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

C:\RegEx\Examples\chapter1>java MatchNameFormats "John"
NO MATCH: pattern:John
           regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

A few questions naturally arise from this example:

Why did John Q Smith fail? Because Q is not a name token, as you've defined name tokens (i.e., a capital letter followed by one or more lowercase letters).
Why did John allen Smith fail? Because allen doesn't start with a capital letter.
Why did John fail? Although John is a valid name token, the pattern requires two or three name tokens, and John is only one.
Why did John McGee pass? McGee isn't an uppercase letter followed by one or more lowercase letters. Try to puzzle this one out on your own. It's answered in the "FAQs" section at the end of the chapter.

This example uses the composition technique mentioned at the beginning of this chapter. That is, it uses previously defined patterns to compose a new pattern. If you think about it, this is a very engineer-like thing to do: Build small blocks, then use those blocks to build more complicated pieces.

Confirming Addresses Example

The code in Listing 1-6 simply determines if the given address meets the criterion of being well formatted. It takes advantage of the name and zip code patterns created earlier, and it adds its own address pattern. Output 1-6 shows the result of running the program. Table 1-23 dissects the pattern.

Table 1-23: The Pattern ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$*
Regex	Description
^	The beginning of a line followed by
(	A group consisting of
\p{Upper}	An uppercase character
(	Followed by an inner group consisting of
\p{Lower}	A lowercase character
+	Repeated one or more times
\s?	Followed by an optional space
)	The end of the inner group
)	The end of the outer group
{	Repeated at least
2	Two times
,	But no more than
3	Three times
}	End repetition
\w	Followed by any alphanumeric character
+	Repeated one or more times
<space>	Followed by a space
.	Followed by any character
*	Repeated any number of times
,	Followed by a comma
<space>	Followed by a space
\w	Followed by any alphanumeric character
+	Repeated one or more times
<space>	Followed by a space
\d	Followed by a digit
{	Repeated exactly
5	Five times
}	End repetition
(	Open group
-	Consisting of a hyphen
\d	A digit
{	Repeated exactly
4	Four times
}	End repetition
)	The end of this group
?	Look for zero or one of the preceding
$	Followed by the end of the line
* In English: Look for a name, as previously defined (two or three name tokens), followed by some words, a comma, and then more words, followed by a zip code. This example uses the composition technique.

Listing 1-6: MatchAddress.java

import java.util.regex.*;
import java.io.*;

public class MatchAddress{
    public static void main(String args[]){
        isAddressValid(args[0]);
    }

    /**
    * Confirms that the format for the given address is valid.
    * @param addr is a String representing the address
    * @return true if the address format is acceptable.
    */
    public static boolean isAddressValid(String addr){
       boolean retval = false;

       //use the name pattern created earlier.
       String nameToken ="\\p{Upper}(\\p{Lower}+\\s?)";

       String namePattern = "("+nameToken+"){2,3}";

       //use the zip code pattern created earlier.
       String zipCodePattern = "\\d{5}(-\\d{4})?";

      //construct an address pattern
      String addressPattern = "^" + namePattern
         + "\\w+ .*, \\w+ " + zipCodePattern +"$";

      retval= addr.matches(addressPattern);

      //prepare a message indicating success or failure
      String msg = "NO MATCH\npattern:\n " + addr
         + "\nregexLength:\n "
         + addressPattern;

      if (retval){
      msg = "MATCH\npattern:\n " + addr
         + "\nregexLength:\n "
         + addressPattern;
      }
      System.out.println(msg +"\r\n");
      return retval;
   }
}

Output 1-6: Result of Running MatchAddress.java

C:\RegEx\chapter_1\Examples\chapter1>
java MatchAddress "John Smith 888 Luck Street, NY 64332"
MATCH
pattern:
 John Smith 888 Luck Street, NY 64332
regexLength:
 ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$


C:\RegEx\chapter_1\Examples\chapter1>
java MatchAddress "John A. Smith 888 Luck Street, NY 64332-4453"
NO MATCH
pattern:
 John A. Smith 888 Luck Street, NY 64332-4453
regexLength:
 ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$


C:\RegEx\chapter_1\Examples\chapter1>
java MatchAddress "John Allen Smith 888 Luck Street, NY 64332-4453"
MATCH
pattern:
 John Allen Smith 888 Luck Street, NY 64332-4453
regexLength:
 ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$


C:\RegEx\chapter_1\Examples\chapter1>
java MatchAddress "888 Luck Street, NY 64332"
NO MATCH
pattern:
 888 Luck Street, NY 64332
regexLength:
 ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$

C:\RegEx\chapter_1\Examples\chapter1>
java MatchAddress "P.O. BOX 888 Luck Street, NY 64332-4453"
NO MATCH
pattern:
 P.O. BOX 888 Luck Street, NY 64332-4453
regexLength:
 ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$


C:\RegEx\chapter_1\Examples\chapter1>
java MatchAddress "John Allen Smith 888 Luck st., NY"
NO MATCH
pattern:
 John Allen Smith 888 Luck st., NY
regexLength:
^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$
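The address pattern begins with ^ and ends with $. Because String.matches already requires the whole String to match, those anchors shouldn't change the outcome here. The following sketch compares the two forms on the inputs from Output 1-6. The class name AnchorCheck, its constants, and its method names are hypothetical and exist only for this comparison.

```java
public class AnchorCheck {

    // The building blocks from Listing 1-6.
    static final String NAME = "(\\p{Upper}(\\p{Lower}+\\s?)){2,3}";
    static final String ZIP = "\\d{5}(-\\d{4})?";
    static final String BODY = NAME + "\\w+ .*, \\w+ " + ZIP;

    /** Matches using the anchored pattern, as Listing 1-6 does. */
    public static boolean withAnchors(String addr) {
        return addr.matches("^" + BODY + "$");
    }

    /** Matches using the same pattern without ^ and $. */
    public static boolean withoutAnchors(String addr) {
        return addr.matches(BODY);
    }

    public static void main(String[] args) {
        String[] samples = {
            "John Smith 888 Luck Street, NY 64332",
            "John A. Smith 888 Luck Street, NY 64332-4453",
            "John Allen Smith 888 Luck Street, NY 64332-4453",
            "888 Luck Street, NY 64332",
            "P.O. BOX 888 Luck Street, NY 64332-4453",
            "John Allen Smith 888 Luck st., NY"
        };
        for (String addr : samples) {
            // Both columns print the same value for every sample.
            System.out.println(withAnchors(addr) + " " + withoutAnchors(addr)
                + " : " + addr);
        }
    }
}
```

The anchors do matter when the same pattern is used with Matcher.find(), which looks for a match anywhere in the input, so keeping them makes the pattern's intent explicit.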

Finding Duplicate Words Example

I discussed the code in Listing 1-7 in the "Groups and Back References" section earlier. The point in reintroducing it here is to demonstrate how regular expressions actually interact with Java code.

Listing 1-7: MatchDuplicateWords.java

import java.util.regex.*;
import java.io.*;

public class MatchDuplicateWords{
   public static void main(String args[]){
      hasDuplicate(args[0]);
   }

   /**
   * Checks whether the given phrase contains duplicate words.
   * @param phrase is a String representing the phrase.
   * @return true if the phrase contains at least one
   * duplicated word.
   */
   public static boolean hasDuplicate(String phrase){
      boolean retval=false;

      String duplicatePattern =
      "\\b(\\w+) \\1\\b";
      // Compile the pattern
      Pattern p = null;
      try{
        p = Pattern.compile(duplicatePattern);
      }
      catch (PatternSyntaxException pex){
         pex.printStackTrace();
         System.exit(1); //nonzero status signals the failure
      }
      //count the number of matches.
      int matches = 0;

      //get the matcher
      Matcher m = p.matcher(phrase);
      String val=null;

      //find all matching Strings
      while (m.find()){
         retval = true;
        val = ":" + m.group() +":";
        System.out.println(val);
        matches++;
      }

      //prepare a message indicating success or failure
      String msg = "   NO MATCH: pattern:" + phrase
             + "\r\n             regex: "
             + duplicatePattern;

      if (retval){
      msg = " MATCH     : pattern:" + phrase
          + "\r\n         regex: "
          + duplicatePattern;
      }

      System.out.println(msg +"\r\n");
      return retval;
   }
}

As you read this example, notice that it uses a Pattern and Matcher, and not the String.matches(regex) method, as most of the examples in the previous sections have. Try to guess why this approach has been taken. For the answer, look in the "FAQs" section at the end of this chapter. Output 1-7 shows the result of running the program. The pattern is dissected in Table 1-24.

Table 1-24: The Pattern \b(\w+) \1\b*
Regex	Description
\b	A word boundary
(	Followed by a group consisting of
\w	An alphanumeric or underscore character
+	Repeated one or more times
)	Close group
<space>	Followed by a space
\1	Followed by the exact group of characters captured previously
\b	Followed by a word boundary
* In English: Look for a word boundary, followed by a group of alphanumeric characters, followed by a space, followed by the exact same group of alphanumeric characters found previously, followed by a word boundary. In short, look for duplicate words.

Output 1-7: Result of Running MatchDuplicateWords.java

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "pizza pizza"
:pizza pizza:
   MATCH   : pattern:pizza pizza
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "Faster pussycat kill kill"
:kill kill:
   MATCH   : pattern:Faster pussycat kill kill
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "The mayor of of simpleton"
:of of:
   MATCH   : pattern:The mayor of of simpleton
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "Never Never Never Never Never"
:Never Never:
:Never Never:
   MATCH   : pattern:Never Never Never Never Never
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "222 2222"
   NO MATCH: pattern:222 2222
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "sara sarah"
   NO MATCH: pattern:sara sarah
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords "Faster pussycat kill, kill"
   NO MATCH: pattern:Faster pussycat kill, kill
             regex: \b(\w+) \1\b

C:\RegEx\Examples\chapter1>java MatchDuplicateWords ". ."
   NO MATCH: pattern:. .
             regex: \b(\w+) \1\b
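Once you have a Matcher, you can ask it for more than the matched text. As a minimal sketch, the variation below reports each duplicated word and where the duplicate pair begins. The class name DuplicateWordReport and the method report are hypothetical; group(1) and start() are standard Matcher methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWordReport {

    // Same pattern as Listing 1-7.
    private static final Pattern DUPLICATE = Pattern.compile("\\b(\\w+) \\1\\b");

    /** Returns one line per duplicate found, e.g. "of at index 10". */
    public static String report(String phrase) {
        StringBuilder sb = new StringBuilder();
        Matcher m = DUPLICATE.matcher(phrase);
        while (m.find()) {
            // group(1) is the repeated word; start() is the index
            // at which the duplicate pair begins.
            sb.append(m.group(1)).append(" at index ").append(m.start()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report("The mayor of of simpleton"));      // of at index 10
        System.out.print(report("Never Never Never Never Never"));  // indexes 0 and 12
        System.out.print(report("sara sarah"));                     // prints nothing
    }
}
```

As in Output 1-7, five repetitions of Never yield two matches, because each call to find() resumes searching after the end of the previous match.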