1.3 The Regular-Expression Frame of Mind

As we'll soon see, complete regular expressions are built up from small building-block units. Each individual building block is quite simple, but since they can be combined in an infinite number of ways, knowing how to combine them to achieve a particular goal takes some experience. So, this chapter provides a quick overview of some regular-expression concepts. It doesn't go into much depth, but provides a basis for the rest of this book to build on, and sets the stage for important side issues that are best discussed before we delve too deeply into the regular expressions themselves.

While some examples may seem silly (because some are silly), they represent the kind of tasks that you will want to do; you just might not realize it yet. If each point doesn't seem to make sense, don't worry too much. Just let the gist of the lessons sink in. That's the goal of this chapter.

1.3.1 If You Have Some Regular-Expression Experience

If you're already familiar with regular expressions, much of this overview will not be new, but please be sure to at least glance over it anyway. Although you may be aware of the basic meaning of certain metacharacters, perhaps some of the ways of thinking about and looking at regular expressions will be new. Just as there is a difference between playing a musical piece well and making music, there is a difference between knowing about regular expressions and really understanding them. Some of the lessons present the same information that you are already familiar with, but in ways that may be new and which are the first steps to really understanding.

1.3.2 Searching Text Files: Egrep

Finding text is one of the simplest uses of regular expressions; many text editors and word processors allow you to search a document using a regular-expression pattern. Even simpler is the utility egrep.
Give egrep a regular expression and some files to search, and it attempts to match the regular expression to each line of each file, displaying only those lines in which a match is found. egrep is freely available for many systems, including DOS, MacOS, Windows, Unix, and so on. See this book's web site, regex.info, for links on how to obtain a copy of egrep for your system.

Returning to the email example from Section 1.1, the command I actually used to generate a makeshift table of contents from the email file is shown in Figure 1-1. egrep interprets the first command-line argument as a regular expression, and any remaining arguments as the file(s) to search. Note, however, that the single quotes shown in Figure 1-1 are not part of the regular expression, but are needed by my command shell. [3] When using egrep, I usually wrap the regular expression with single quotes. Exactly which characters are special, in what contexts, to whom (to the regular expression or to the tool), and in what order they are interpreted are all issues that grow in importance when you move to regular-expression use in full-fledged programming languages, something we'll see starting in the next chapter.
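The division of labor described here (the shell consumes the quotes, the first argument becomes the regular expression, and each line is tested on its own) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the command line, the pattern, the file name mailbox-file, and the sample lines are all invented stand-ins, not the contents of Figure 1-1. Python's shlex module merely approximates shell quoting, and Python's re module is a different regex flavor that happens to agree with egrep on a pattern this simple.

```python
import re
import shlex

# Hypothetical command line: the pattern and the file name are assumptions
# made up for this sketch, not a transcription of the book's figure.
command = "egrep '^(From|Subject):' mailbox-file"

# shlex.split approximates how a POSIX shell splits words. The single quotes
# are consumed at this step, so the program never sees them.
argv = shlex.split(command)
pattern, filenames = argv[1], argv[2:]


def egrep_like(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    regex = re.compile(pattern)
    # re.search tries the pattern at every position in the line, which is
    # the per-line behavior the text describes for egrep.
    return [line for line in lines if regex.search(line)]


# Invented sample data standing in for the contents of mailbox-file.
mailbox = [
    "From: elvis@example.com",
    "Subject: hello",
    "Date: 31 Oct 2001",
    "Reply-From: nobody@example.com",
    "Sorry I haven't been around lately.",
]

matches = egrep_like(pattern, mailbox)
for line in matches:
    print(line)
```

With this sample data only the first two lines are printed. The invented Reply-From: line is skipped because ^ ties the match to the start of the line.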
Figure 1-1. Invoking egrep from the command line

We'll start to analyze just what the various parts of the regex mean in a moment, but you can probably already guess just by looking that some of the characters have special meanings. In this case, the parentheses, the ^ , and the | characters are regular-expression metacharacters, and combine with the other characters to generate the result I want.

On the other hand, if your regular expression doesn't use any of the dozen or so metacharacters that egrep understands, it effectively becomes a simple "plain text" search. For example, searching for cat in a file finds and displays all lines with the three letters c·a·t in a row. This includes, for example, any line containing vacation. Even though the line might not have the word cat, the c·a·t sequence in vacation is still enough to be matched. Since it's there, egrep goes ahead and displays the whole line. The key point is that regular-expression searching is not done on a "word" basis: egrep can understand the concept of bytes and lines in a file, but it generally has no idea of English's (or any other language's) words, sentences, paragraphs, or other high-level concepts.
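The cat example is easy to try out. The following minimal sketch uses Python's re module as a stand-in for egrep, which is an assumption of this illustration; for a pattern with no metacharacters the two behave the same way. The sample lines are invented.

```python
import re

# Invented sample lines for illustration.
lines = [
    "The cat sat on the mat.",
    "We leave for vacation on Friday.",
    "Dogs are loyal companions.",
    "Please concatenate the two files.",
]

# The pattern has no metacharacters, so this is a plain substring search:
# a line is shown if the letters c, a, t appear in a row anywhere in it.
shown = [line for line in lines if re.search("cat", line)]

for line in shown:
    print(line)
```

Three of the four lines are shown: the one with the word cat, and also the ones containing vacation and concatenate. The search has no notion of where a word begins or ends.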