Regular Expression Operations

In this section, you'll explore slightly more realistic uses of regular expressions. In practice, people use regular expressions for tasks in one of three broad categories:

Data validation: This is the process of making sure that your candidate String conforms to a specific format (e.g., making sure passwords are at least eight characters long and contain at least two digits).
Search and/or replace: This is another popular usage of regular expressions, and for good reason. Say you want to send a letter to all of your customers, and you want each letter to be personalized by interspersing the customer's name throughout the letter. Of course, this is a little more complex than it sounds, because different names have different lengths, and you don't want to overwrite the next word in your letter when you insert a longer name. Regex is a perfect solution for these types of problems.
Decomposing text: This can also be a challenging activity, particularly if the String in question needs to be split according to complex rules. Fortunately, doing so becomes much easier with regular expressions, as Listing 1-11 (which follows shortly) demonstrates.
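
As a rough illustration of the data validation category, the password rule mentioned above (at least eight characters, at least two digits) could be sketched as follows. The class name PasswordCheck, the method isValid, and the pattern itself are assumptions made for this illustration; they don't come from the listings in this chapter.

```java
public class PasswordCheck{
   // Hypothetical pattern: the lookahead requires two digits anywhere
   // in the input, and .{8,} requires at least eight characters overall.
   private static final String RULE = "(?=(?:.*\\d){2}).{8,}";

   public static boolean isValid(String candidate){
      // matches() succeeds only if the whole candidate fits the pattern
      return candidate.matches(RULE);
   }

   public static void main(String args[]){
      System.out.println(isValid("abc12def"));  // true: 8 chars, 2 digits
      System.out.println(isValid("abcdefg1"));  // false: only one digit
      System.out.println(isValid("a1b2"));      // false: too short
   }
}
```

Keeping the rule in a single pattern is only one way to do this; checking the length and the digit count separately would work just as well and may be easier to read.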

Data Validation

Data validation, or making sure that data matches a prescribed format, is one of the most common uses for regular expressions. This can be particularly challenging because data often takes inexact forms and is defined by unspoken rules.

J2SE 1.4 offers you several ways to validate data. The easiest is using the new method boolean String.matches(String regex). This method confirms that the pattern passed in exactly matches the String that it's called on.

This exactness can be tricky, so it's important to understand it well. For example, say you need to confirm that a given String contains the word Java, followed by a space, followed by a digit. Further, assume that your candidate String is I love Java 4. The next section demonstrates the process of working through this example.

Data Validation with Strings Example

This example seems simple enough, so you start out by testing the pattern Java \d. Table 1-25 shows a breakdown of the pattern.

Table 1-25: The Pattern Java \d
Regex	Description
J	A capital J
a	Followed by the character a
v	Followed by the character v
a	Followed by the character a
<space>	Followed by a single space
\d	Followed by a digit

That was pretty easy, so you confidently write your code, as shown in Listing 1-8.

Listing 1-8: ValidationTest.java

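  public class ValidationTest{
     public static void main(String args[]){
         String candidate = "I love Java 4";
         String pattern = "Java \\d";
         System.out.println("Does candidate : " + candidate);
         System.out.println("match pattern  : " + pattern + "?\n");
         // matches() succeeds only if the whole candidate matches
         System.out.println(candidate.matches(pattern));
     }
  }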

Then you run it:

java ValidationTest

and you watch it fail in Output 1-8.

Output 1-8: Result of Running ValidationTest.java

C:\RegEx\code>java ValidationTest
Does candidate : I love Java 4
match pattern  : Java \d?

false

What happened? Because your input string is I love Java 4, and the Java 4 is preceded by I love, the input isn't an exact match to the pattern Java \d. It's a partial match. So what do you do now?

You have two options. You could modify the pattern to allow for characters before and/or after the Java 4 you want to match on, or you could just use the Pattern and Matcher objects. Let's explore the pros and cons of each option.

To use the String.matches(String regex) method, you need to account for the characters that might precede or follow the pattern Java \d. Thus, you use the pattern .*\bJava \d( |$), which Table 1-26 dissects. Note that this pattern allows any text before Java, but at most a single space after the digit.

Table 1-26: The Pattern .*\bJava \d( |$)
Regex	Description
.	Any character
*	Repeated any number of times
\b	Followed by a word boundary
J	Followed by a capital J
a	Followed by the character a
v	Followed by the character v
a	Followed by the character a
<space>	Followed by a single space
\d	Followed by a digit
(	Followed by a group consisting of
<space>	A space
|	Or
$	The end of the line
)	Close group
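
To see how the pattern dissected in Table 1-26 behaves with String.matches, consider the following sketch. The class name ExactMatchCheck and the extra candidate strings are assumptions added for illustration, and the sketch assumes the final group is a space or the end of the line, as the table describes.

```java
public class ExactMatchCheck{
   // The leading .* absorbs any text before the word Java;
   // the final group allows one trailing space or the end of the input.
   static final String PATTERN = ".*\\bJava \\d( |$)";

   public static boolean check(String candidate){
      return candidate.matches(PATTERN);
   }

   public static void main(String args[]){
      System.out.println(check("I love Java 4"));       // true
      System.out.println(check("Java 4"));              // true
      System.out.println(check("I love Java 4 ever"));  // false
   }
}
```

The last candidate fails because matches requires the whole String to fit the pattern, and the pattern permits nothing beyond a single space after the digit.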

Data Validation with the Pattern and Matcher Objects Example

Writing the pattern in the preceding section involved a little bit more work than expected. Let's see if it's any easier to use the Pattern and Matcher objects in Listing 1-9. The output is shown in Output 1-9.

Listing 1-9: ValidationTestWithPatternAndMatcher.java

   import java.util.regex.*;

   public class ValidationTestWithPatternAndMatcher{
      public static void main(String args[]){
         // Compile the pattern
         Pattern p = null;
         try{
            p = Pattern.compile("Java \\d");
         }
         catch (PatternSyntaxException pex){
            pex.printStackTrace();
            // Exit with a nonzero status to signal the failure
            System.exit(1);
         }

         // Define the candidate string
         String candidate = "I love Java 4";

         // Get the matcher
         Matcher m = p.matcher(candidate);

         // find() succeeds if any part of the candidate matches
         System.out.println("result = " + m.find());
      }
   }

Output 1-9: Result of Running ValidationTestWithPatternAndMatcher.java

C:\RegEx\Examples\chapter1>java ValidationTestWithPatternAndMatcher
result = true

The pattern used in Listing 1-9 is less complicated than the one dissected in Table 1-26. It's simply the original pattern Java \d. But the Java code requires explicit use of the Pattern and Matcher objects, which is slightly more demanding of the programmer. You're doing this because you want explicit access to the Matcher.find method, which examines the input string to see whether any part of it matches the pattern. Again, this is in contrast to the String.matches(String regex) method, which requires an exact match.

Generally speaking, there are two types of validation. The first type requires an exact match. For these, the easiest validation method is probably String.matches(String regex), because it rejects anything that doesn't match fully and completely.

The second type of validation requires that the string contain the pattern at some point, but it doesn't require an exact match. For example, you might require that a password contain nonalphanumeric characters. These types of validations are best achieved by using the Matcher and Pattern objects. Chapter 5 provides more complex validation examples.
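
The second type of validation might be sketched as follows for the password case. The class name ContainsCheck, the method hasNonAlphanumeric, and the choice to treat only ASCII letters and digits as alphanumeric are assumptions made for this illustration.

```java
import java.util.regex.*;

public class ContainsCheck{
   // Hypothetical rule: any character other than an ASCII letter or digit
   private static final Pattern NON_ALNUM =
      Pattern.compile("[^A-Za-z0-9]");

   public static boolean hasNonAlphanumeric(String password){
      // find() succeeds if any part of the input matches the pattern
      return NON_ALNUM.matcher(password).find();
   }

   public static void main(String args[]){
      System.out.println(hasNonAlphanumeric("secret!1"));  // true
      System.out.println(hasNonAlphanumeric("secret1"));   // false
   }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which can matter if the check runs often.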

Search and Replace

One of the most powerful features of the new regex package is the ability to search for and replace Strings and substrings. As you may recall, this sort of activity was previously a tedious affair, as it required the use of tokenizers or the use of the String.substring methods, along with a lot of String arithmetic.

Thankfully, those days are over. There are two general ways to do search and replace operations in J2SE. The following example travels the easier path by taking advantage of two new methods added to the String class. (Chapter 4 contains more complex examples that use the Pattern and Matcher classes directly.) The two methods relevant for the following example are as follows:

replaceFirst(String regex, String replacement)
replaceAll(String regex, String replacement)

The first method, replaceFirst(String regex, String replacement), simply replaces the first occurrence of the regex pattern with the replacement String. The second method, replaceAll, replaces all occurrences of the pattern with the replacement String. I explain these new methods in detail in Chapter 2.
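
The difference between the two methods can be sketched with the personalized-letter scenario described earlier. The class name LetterExample, the NAME placeholder, and the sample text are all hypothetical. The sketch also assumes the customer's name contains no dollar sign or backslash, because those characters have special meaning in a replacement String.

```java
public class LetterExample{
   // Hypothetical placeholder, assumed not to occur elsewhere in the letter
   static final String PLACEHOLDER = "NAME";

   public static String firstOnly(String template, String name){
      // Replaces only the first occurrence of the placeholder
      return template.replaceFirst(PLACEHOLDER, name);
   }

   public static String everywhere(String template, String name){
      // Replaces every occurrence of the placeholder
      return template.replaceAll(PLACEHOLDER, name);
   }

   public static void main(String args[]){
      String template = "Dear NAME, thank you, NAME, for your order.";
      // Prints: Dear Alice, thank you, NAME, for your order.
      System.out.println(firstOnly(template, "Alice"));
      // Prints: Dear Alice, thank you, Alice, for your order.
      System.out.println(everywhere(template, "Alice"));
   }
}
```

Because the replacement is inserted rather than overlaid, a longer name simply lengthens the result and doesn't overwrite the text that follows.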

Search and Replace Example

If you're like me, you probably think about programming more than you should. Say you're writing an essay on boxing. Further, say you decide to update your essay on boxing programmatically instead of manually. Listing 1-10 shows the code for doing so. The example searches for and replaces some commonly misused phrases from the given paragraph. Output 1-10 shows the result of running the program.

Listing 1-10: StyleSearchAndReplace.java

public class StyleSearchAndReplace{
  public static void main(String args[]){

    String statement = "The question as to whether the jab is"+
    " superior to the cross has been debated for some time in"+
    " boxing circles. However, it is my opinion that this"+
    " false dichotomy misses the point. I call your attention"+
    " to the fact that the best boxers often use a combination of"+
    " the two. I call your attention to the fact that Mohammed"+
    " Ali,the Greatest of the sport of boxing, used both. He had"+
    " a tremendous jab, yet used his cross effectively, often,"+
    " and well";

    String newStmt=
    statement.replaceAll("The question as to whether","Whether");

    newStmt= newStmt.replaceAll(" of the sport of boxing","");
    newStmt=newStmt.replaceAll("amount of success","success");
    newStmt=
     newStmt.replaceAll("However, it is my opinion that this","This");

    newStmt= newStmt.replaceAll("a combination of the two","both");
    newStmt= newStmt.replaceAll("This is in spite of the fact that"
     +" the", "The");
    newStmt=
     newStmt.replaceAll("I call your attention to the fact that","");

    System.out.println("BEFORE:\n"+statement + "\n");
    System.out.println("AFTER:\n"+newStmt);
   }
}

As Output 1-10 shows, the clarity of the paragraph has improved somewhat as a result of this process.

Output 1-10: Result of Running StyleSearchAndReplace.java

C:\RegEx\Examples\chapter1>java StyleSearchAndReplace
BEFORE:
The question as to whether the jab is superior to the cross has been debated for
some time in boxing circles. However, it is my opinion that this false dichotomy
misses the point. I call your attention to the fact that the best boxers often
use a combination of the two. I call your attention to the fact that Mohammed
Ali,the Greatest of the sport of boxing, used both. He had a tremendous jab, yet
used his cross effectively, often, and well

AFTER:
Whether the jab is superior to the cross has been debated for some time in boxing
circles. This false dichotomy misses the point.  the best boxers often use
both.  Mohammed Ali,the Greatest, used both. He had a tremendous jab, yet used
his cross effectively, often, and well
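
Removing a phrase by replacing it with an empty String leaves behind the spaces that surrounded it, so two spaces can end up side by side. One possible cleanup pass, sketched here with a hypothetical class TidySpaces and method tidy, uses a quantifier to collapse such runs.

```java
public class TidySpaces{
   public static String tidy(String text){
      // Replace any run of two or more spaces with a single space
      return text.replaceAll(" {2,}", " ");
   }

   public static void main(String args[]){
      // Prints: misses the point. The best boxers
      System.out.println(tidy("misses the point.  The best boxers"));
   }
}
```

This is a case where the pattern is doing real regex work: a plain substring replacement would have to be repeated until no doubled spaces remained.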

Splitting a String

There are many mechanisms available for splitting a String, the most obvious being the StringTokenizer. However, splitting a String can be surprisingly difficult when the splitting criteria are complex. For example, it's easy enough to split a comma-separated file, but what about splitting a word into vowels and consonants? The latter can be ridiculously complicated. Fortunately, regular expressions can be particularly helpful in these sorts of situations, as you'll learn in the following sections.
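
As a taste of what such a split could look like, the following sketch breaks a word into alternating runs of vowels and consonants. The class name VowelSplit and the pattern are assumptions for illustration only, and the sketch assumes a single lowercase word in which a, e, i, o, and u are the only vowels.

```java
public class VowelSplit{
   // Zero-width split points: between a vowel and a non-vowel,
   // in either order. Lookbehind and lookahead consume no characters.
   static final String BOUNDARY =
      "(?<=[aeiou])(?=[^aeiou])|(?<=[^aeiou])(?=[aeiou])";

   public static String[] split(String word){
      return word.split(BOUNDARY);
   }

   public static void main(String args[]){
      // Prints str, ea, and m on separate lines
      String[] parts = split("stream");
      for (int i=0; i < parts.length; i++){
         System.out.println(parts[i]);
      }
   }
}
```

Because the pattern matches positions rather than characters, nothing is discarded from the word; the tokens joined together give back the original input.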

Splitting a String Example

In English rhetoric, we learn that one of the best ways to strengthen a sentence is to place positives and negatives in opposition. The code in Listing 1-11 takes a sentence and attempts to strengthen it by placing the positives and negatives in opposition. Output 1-11 shows the result.

Listing 1-11: StyleSplitExample.java

public class StyleSplitExample{
   public static void main(String args[]){
      String phrase1= "but simple justice, not charity";
      strengthenSentence(phrase1);

      String phrase2=
       "but that I love Rome more, not that I love Caesar less";
      strengthenSentence(phrase2);

      String phrase3=
      "ask what you can do for your country, ask not what your "
      + "country can do for you";
      strengthenSentence(phrase3);
   }
   /**
   * Splits and rearranges the given String, hopefully to a more
   * powerful effect.
   * @param sentence is a String representing the phrase we want to
   * strengthen.
   * @return a String representing the modified phrase, or null if
   * the sentence could not be split.
   */
   public static String strengthenSentence(String sentence){
      String retval = null;

      // Split on a comma plus any whitespace that follows it,
      // so the second token doesn't begin with a space.
      String splitPattern = ",\\s*";

      String[] tokens = sentence.split(splitPattern);

      // split never returns null; a sentence without a comma
      // comes back as a single token.
      if (tokens.length < 2){
         System.out.println("   NO MATCH: sentence: " + sentence
             + "\r\n             regex: " + splitPattern);
      }
      else{
         retval = tokens[1] + ", " + tokens[0];
         System.out.println("BEFORE: " + sentence);
         System.out.println("AFTER : " + retval +"\n");
      }
      return retval;
   }
}

Output 1-11: Result of Running StyleSplitExample.java

C:\RegEx\Examples\chapter1>java StyleSplitExample
BEFORE: but simple justice, not charity
AFTER : not charity, but simple justice

BEFORE: but that I love Rome more, not that I love Caesar less
AFTER : not that I love Caesar less, but that I love Rome more

BEFORE: ask what you can do for your country, ask not what your
country can do for you
AFTER : ask not what your country can do for you, ask what you can
do for your country

Conditional String Splitting Example

Regex becomes particularly useful when you have more complex String parsing needs. It's easy enough to split a string when it's in a well-defined format, such as a comma-delimited file. You don't need regex for that; a StringTokenizer will do just fine. But what if you want to split a string based on, say, a word or any of its synonyms?

Regular expressions can be helpful in these kinds of scenarios because they allow you to specify complex criteria for the split. Listing 1-12 splits the given phrase based on occurrences of the word compromise or its synonyms. Output 1-12 shows the result of running the program.

Listing 1-12: Split.java

public class Split{
   public static void main(String args[]){

      String statement = "I will not compromise. I will not "+
      "cooperate. There will be no concession, no conciliation, no "+
      "finding the middle group, and no give and take.";

      String tokens[] =null;

      String splitPattern= "compromise|cooperate|concession|"+
      "conciliation|(finding the middle group)|(give and take)";

      tokens=statement.split(splitPattern);

      System.out.println("REGEX PATTERN:\n"+splitPattern + "\n");

      System.out.println("STATEMENT:\n"+statement + "\n");
      System.out.println("TOKENS:");
      for (int i=0; i < tokens.length; i++){
         System.out.println(tokens[i]);
      }
   }
}

Output 1-12: Result of Running Split.java

C:\RegEx\Examples\chapter1>java Split
REGEX PATTERN:
compromise|cooperate|concession|conciliation|(finding the middle group)|(give
and take)

STATEMENT:
I will not compromise. I will not cooperate. There will be no concession, no
conciliation, no finding the middle group, and no give and take.

TOKENS:
I will not
. I will not
. There will be no
, no
, no
, and no
.

This example illustrates the new types of possibilities that now exist as part of the standard Java implementation. I discuss more sophisticated splits in Chapters 3 and 4.
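
As one small step in that direction, a split can also be driven by a precompiled Pattern, which makes it possible to pass flags such as case insensitivity. The sketch below is an illustration only: the class name CaseInsensitiveSplit, the shortened synonym list, and the sample sentence are assumptions, not part of Listing 1-12.

```java
import java.util.regex.*;

public class CaseInsensitiveSplit{
   // Hypothetical, shortened synonym list; the flag makes the
   // match ignore differences between uppercase and lowercase.
   private static final Pattern SYNONYMS =
      Pattern.compile("compromise|concession", Pattern.CASE_INSENSITIVE);

   public static String[] split(String text){
      return SYNONYMS.split(text);
   }

   public static void main(String args[]){
      String[] tokens = split("No Compromise and no CONCESSION here");
      for (int i=0; i < tokens.length; i++){
         // Brackets make the leading and trailing spaces visible
         System.out.println("[" + tokens[i] + "]");
      }
   }
}
```

With this input the sketch prints [No ], [ and no ], and [ here], each on its own line.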