Team LiB

Searching a File

Building on the previous example, I decide to provide a utility for searching the content of a file and returning all matching strings within that file. I'll use FileChannels for the actual file I/O. Although a discussion of FileChannels is beyond the scope of this book, in my opinion they're the best way to access files in Java.

My strategy is to use a FileChannel to open a file, read its content into a String, release the FileChannel, and then use the searchString method to search the String. This is usually faster than reading through the file line by line and examining each line, though it is memory intensive because the entire file is held in memory at once. Listing 5-7 shows the code for doing this.

Listing 5-7: Reading in File Content
Start example
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    /**
    * extracts the content of a file
    * @param fileName the name of the file to extract
    * @throws IOException if the file can't be opened or read
    *
    * @return String representing the contents of the file
    */
    public static String getFileContent(String fileName)
    throws IOException{
       //get access to the FileChannel; try-with-resources closes
       //the stream and its channel even if the read fails
       try (FileInputStream fis = new FileInputStream(fileName);
            FileChannel fc = fis.getChannel()) {
          //get the file content
          return getFileContent(fc);
       }
    }

    /**
    * extracts the content of a FileChannel
    * @param fc the open FileChannel to read from
    * @throws IOException if the channel can't be read
    *
    * @return String representing the contents of the channel
    */
    private static String getFileContent(FileChannel fc)
    throws IOException{
        //read the contents of the FileChannel; a single read() call
        //isn't guaranteed to fill the buffer, so loop until the
        //buffer is full or the end of the file is reached
        ByteBuffer bb = ByteBuffer.allocate((int)fc.size());
        while (bb.hasRemaining() && fc.read(bb) != -1) {
            //keep reading
        }

        //save the contents as a string, decoding only the bytes
        //actually read, using the platform's default charset
        bb.flip();
        return new String(bb.array(), 0, bb.limit());
    }
End example

Next, I need to provide a method that will load the file, search it, and return the results. Given the two previous methods, this becomes fairly easy, as shown in Listing 5-8.

Listing 5-8: Opening a File, Searching the File, and Returning the Results
Start example
    public static Map searchFile(
        String file,
        String searchPattern,
        int flags
    ) throws IOException
    {
       //load the file, then hand its content to searchString
       String fileContent = getFileContent(file);
       Map retval = searchString(fileContent,searchPattern,flags);
       return retval;
    }
End example
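
The searchString method comes from the previous example and isn't reproduced in this section. As a rough stand-in for readers who want to run the listings, the sketch below assumes it returns a Map from each match's starting offset to the matched text. That return shape, the class name SearchStringSketch, and the use of generics are assumptions made for illustration and may differ from the original method.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchStringSketch {
    // Hypothetical stand-in for searchString: maps the start offset
    // of each match to the text that matched, in order of discovery.
    public static Map<Integer, String> searchString(
            String content, String searchPattern, int flags) {
        Map<Integer, String> retval = new LinkedHashMap<>();
        Matcher m = Pattern.compile(searchPattern, flags).matcher(content);
        while (m.find()) {
            retval.put(m.start(), m.group());
        }
        return retval;
    }

    public static void main(String[] args) {
        String content = "alpha beta\ngamma beta\n";
        // prints {6=beta, 17=beta}
        System.out.println(searchString(content, "beta", 0));
    }
}
```

Note that only the matching token comes back, not the line that contains it.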

I take the program out for a spin and compare it to grep. To be honest, it seems to lack a bit in the comparison. The grep program returns the entire line of a matching token, whereas this method only returns the matching token. That's not terrible, because the client could request the entire line by using the correct regex pattern. But it's not really as friendly as it could be, especially for the average user.

I decide to "pad" the pattern to capture an entire line, assuming that the original search pattern has no punctuation, and thus no regex, in it. Listing 5-9 shows my modified searchFile method.

Listing 5-9: Modifying the searchFile Method to Make It Friendlier
Start example
  //requires java.util.regex.Pattern
  public static Map searchFile(
      String file,
      String searchPattern,
      int flags
  ) throws IOException
  {
     String fileContent = getFileContent(file);

     //if the search pattern doesn't have any punctuation
     //then assume it's not a regular expression and extract
     //the entire line in which it was found; find() also catches
     //punctuation at the very end, which split() would discard
     boolean hasPunctuation =
         Pattern.compile("\\p{Punct}").matcher(searchPattern).find();

     if (!hasPunctuation)
     {
         searchPattern = "^.*"+ searchPattern+".*$";
         //^ and $ match at line boundaries only in MULTILINE mode
         flags |= Pattern.MULTILINE;
     }

     Map retval = searchString(fileContent,searchPattern,flags);
     return retval;
  }

End example
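
To see what the padding buys you, here is a minimal, self-contained sketch of the same idea applied directly to a String. The class and method names (LinePaddingSketch, matchingLines) are hypothetical, and the use of Pattern.quote and Pattern.MULTILINE is my own choice rather than something the listing above requires. Without MULTILINE, ^ matches only at the very start of the input, so the padded pattern finds nothing when the content spans more than one line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinePaddingSketch {
    // Returns every line of content that contains the literal token.
    public static List<String> matchingLines(String content, String token) {
        // Pattern.quote treats the token as literal text; MULTILINE
        // lets ^ and $ match at the start and end of each line.
        Pattern p = Pattern.compile(
            "^.*" + Pattern.quote(token) + ".*$", Pattern.MULTILINE);
        List<String> lines = new ArrayList<>();
        Matcher m = p.matcher(content);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String content = "one fish\ntwo fish\nred bird\n";
        // prints [one fish, two fish]
        System.out.println(matchingLines(content, "fish"));
    }
}
```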

Discussion Point

At this point, there should be some reasonable questions on your mind. Isn't this supposed to be a regex book? There wasn't anything particularly regex-like about the search file and search string methods; they were pretty much straight Java code, which you already know how to write. What's going on?

The point here is that regex is just a tool. It doesn't change the fact that you're still writing Java code, and you need to follow good, modular, object-oriented principles, even as you're working with regular expressions. Regex allows you to bridge trouble spots you might never have crossed otherwise, but it's just a tool. Like any well-built engine, the java.util.regex engine announces its excellence by humming quietly along and not forcing you to worry about it.

Working with Very Large Files

Another valid question at this point is, what if the file you're trying to parse is too large to make reading all of it into memory a practical option? In general, you have two paths you can take here. You can use one of the java.nio features, such as MappedByteBuffers, or you can split the file into manageable sections and parse each of those in turn.

If you decide to use MappedByteBuffers for regex, Listing 5-10 contains an example showing how. I'm hesitant, however, to advocate MappedByteBuffers with regex too strongly for three reasons. First and foremost, their behavior is very system dependent, so you should probably rule them out if you need platform independence. Second, even within a given platform, their behavior isn't well defined. Thus, depending on what else you're doing with your operating system, you could get inconsistent results. Third, you need to consider the fact that, if the entire file can't be loaded into memory at one time, trying to apply a pattern that might have wildcards in it is going to be a tricky affair.

Listing 5-10: Accessing a File Through a MappedByteBuffer
Start example
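   public static boolean getFileContentUsingMappedByteBuffer
   (
       String fileName
   ) throws IOException
   {
       boolean retval = false;

       //searching only needs read access, so open the file and
       //map it read-only; try-with-resources closes both handles
       try (RandomAccessFile raf = new RandomAccessFile(fileName,"r");
            FileChannel fc = raf.getChannel())
       {
           MappedByteBuffer mbb =
            fc.map(FileChannel.MapMode.READ_ONLY,0,fc.size());

           //asCharBuffer() reads each pair of bytes as one char, so
           //this view is only meaningful for big-endian UTF-16 files;
           //other encodings must be decoded through a Charset
           CharSequence cb = mbb.asCharBuffer();

           //cb is a CharSequence, so it can be handed to a Matcher
       }

       return retval;
   }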
End example
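
As a hedged illustration of how a mapped file might actually be handed to the regex engine, the sketch below maps a file read-only, decodes it with an explicit Charset, and counts the matches. The names MappedSearchSketch, countMatches, and demo are hypothetical. Be aware that Charset.decode copies the characters onto the heap, so this variant demonstrates the API rather than saving memory, and that some platforms (Windows in particular) may refuse to delete a file while it is still mapped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MappedSearchSketch {
    // Counts regex matches in a file by mapping it read-only
    // and decoding the mapped bytes with the given charset.
    public static int countMatches(Path file, String regex, Charset cs)
            throws IOException {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mbb =
                fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            // decode() produces a heap CharBuffer, which is a CharSequence
            CharBuffer chars = cs.decode(mbb);
            Matcher m = Pattern.compile(regex).matcher(chars);
            int count = 0;
            while (m.find()) {
                count++;
            }
            return count;
        }
    }

    // Small self-check: writes a temporary file and searches it.
    public static int demo() {
        try {
            Path tmp = Files.createTempFile("mapped", ".txt");
            try {
                Files.write(tmp,
                    "cat hat\nbat cat\n".getBytes(StandardCharsets.UTF_8));
                return countMatches(tmp, "cat", StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```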

You may very well need to reconsider your patterns, and break the file up into logical blocks based on your insight into its structure. One strategy might be to check the size of the file and divide that by 10, 100, or whatever fraction is easily loadable given your system's memory limitations, and then search that portion. Although this isn't ideal, it is more predictable than the corresponding mapped-memory approach. The bottom line is that regardless of the regex flavor or provider you use, very large files require special treatment.
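
One way to sketch the block-by-block idea is shown below. Instead of dividing by byte count, it groups the input into blocks of a fixed number of lines, which is a swapped-in simplification: it assumes that no match ever spans a line break, so nothing is lost at a block boundary. The names BlockSearchSketch, searchInBlocks, and searchText are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockSearchSketch {
    // Searches the reader's content one block of lines at a time,
    // so only linesPerBlock lines are held in memory at once.
    public static List<String> searchInBlocks(
            BufferedReader in, Pattern p, int linesPerBlock)
            throws IOException {
        if (linesPerBlock < 1) {
            throw new IllegalArgumentException("linesPerBlock must be >= 1");
        }
        List<String> matches = new ArrayList<>();
        StringBuilder block = new StringBuilder();
        int linesInBlock = 0;
        String line;
        while ((line = in.readLine()) != null) {
            block.append(line).append('\n');
            if (++linesInBlock == linesPerBlock) {
                collect(p, block, matches);
                block.setLength(0);   // start a fresh block
                linesInBlock = 0;
            }
        }
        if (linesInBlock > 0) {
            collect(p, block, matches);   // final partial block
        }
        return matches;
    }

    // Appends every match found in the block to out.
    private static void collect(Pattern p, CharSequence block, List<String> out) {
        Matcher m = p.matcher(block);
        while (m.find()) {
            out.add(m.group());
        }
    }

    // Convenience wrapper for trying the method on an in-memory String.
    public static List<String> searchText(
            String text, String regex, int linesPerBlock) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            return searchInBlocks(in, Pattern.compile(regex), linesPerBlock);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real file, the BufferedReader could come from Files.newBufferedReader. If your patterns can match across line breaks, the blocks would need to overlap, and duplicate matches in the overlap would have to be filtered out.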

