4.7. Linux and GNU grep
Linux uses the GNU version of grep, which works much like traditional UNIX grep but adds functionality. In addition to POSIX character classes (see Table 4.7 and Table 4.8), there are a number of new options, including -G, -E, -F, and -P, that allow you to use regular grep for everything and still get the functionality of both egrep and fgrep.
Table 4.7. The Basic Set: GNU grep's Regular Expression Metacharacters
Metacharacter | Function | Example | What It Matches |
---|---|---|---|
^ | Beginning-of-line anchor | ^love | Matches all lines beginning with love. |
$ | End-of-line anchor | love$ | Matches all lines ending with love. |
. | Matches one character | l..e | Matches lines containing an l, followed by two characters, followed by an e. |
* | Matches zero or more of the preceding character |  *love | Matches lines with zero or more spaces followed by the pattern love (the pattern begins with a space). |
[ ] | Matches one character in the set | [Ll]ove | Matches lines containing love or Love. |
[^] | Matches one character not in the set | [^A-K]ove | Matches lines containing one character not in the range A through K, followed by ove. |
\< | Beginning-of-word anchor | \<love | Matches lines containing a word that begins with love. |
\> | End-of-word anchor | love\> | Matches lines containing a word that ends with love. |
\(..\) | Tags matched characters | \(love\)able | Tags the marked portion in a register to be remembered later as number 1. To reference it later, use \1 to repeat the pattern. Up to nine tags may be used, numbered from the leftmost part of the pattern. For example, the pattern love is saved in register 1 to be referenced later as \1. |
x\{m\} | Repetition of character x exactly m times | o\{5\} | Matches if line has 5 occurrences of o. |
x\{m,\} | Repetition of character x at least m times | o\{5,\} | Matches if line has at least 5 occurrences of o. |
x\{m,n\} | Repetition of character x between m and n times | o\{5,10\} | Matches if line has between 5 and 10 occurrences of o. |
\w | Alphanumeric word character; [a-zA-Z0-9_] | l\w*e | Matches an l followed by zero or more word characters, and an e. |
\W | Nonalphanumeric (nonword) character; [^a-zA-Z0-9_] | love\W\+ | Matches love followed by one or more nonword characters (., ?, etc.). |
\b | Word boundary | \blove\b | Matches only the word love. |
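A few of the basic metacharacters from Table 4.7 can be tried out with a short script. This is only a sketch: the sample file and its five lines are made up for the illustration, and it assumes GNU grep, because \< and \> are GNU extensions.

```shell
# Hypothetical sample input, created only for this sketch.
sample=$(mktemp)
printf '%s\n' 'lovely day' 'glove box' 'loveable love' 'true love' 'I love you' > "$sample"

# ^ anchors the pattern at the beginning of the line.
starts=$(grep -c '^love' "$sample")

# \< and \> anchor the pattern at the beginning and end of a word.
whole_word=$(grep -c '\<love\>' "$sample")

# \(..\) tags love in register 1; \1 requires it to appear again on the line.
repeated=$(grep '\(love\).*\1' "$sample")

echo "lines beginning with love: $starts"
echo "lines with the word love:  $whole_word"
echo "lines repeating love:      $repeated"

rm -f "$sample"
```

Only lovely day and loveable love begin with love, while the word anchors reject lovely and glove because love is not a complete word there.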
Table 4.8. The Extended Set: Used with egrep and grep -E
Metacharacter | Function | Example | What It Matches |
---|---|---|---|
+ | Matches one or more of the preceding character | [a-z]+ove | Matches one or more lowercase letters, followed by ove. Would find move, approve, love, behoove, etc. |
? | Matches zero or one of the preceding character | lo?ve | Matches an l followed by either one o or none at all, and then ve. Would find love or lve. |
a|b|c | Matches either a or b or c | love|hate | Matches either expression, love or hate. |
( ) | Groups characters | love(able|rs) | Matches loveable or lovers. |
( ) | Groups characters | (ov)+ | Matches one or more occurrences of ov. |
(..) (...) \1 \2 | Tags matched characters | (love)ing | Tags the marked portion in a register to be remembered later as number 1. To reference it later, use \1 to repeat the pattern. Up to nine tags may be used, numbered from the leftmost part of the pattern. For example, the pattern love is saved in register 1 to be referenced later as \1. |
x{m} | Repetition of character x exactly m times | o{5} | Matches if line has 5 occurrences of o. |
x{m,} | Repetition of character x at least m times | o{5,} | Matches if line has at least 5 occurrences of o. |
x{m,n} | Repetition of character x between m and n times | o{5,10} | Matches if line has between 5 and 10 occurrences of o. |
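The extended metacharacters in Table 4.8 can be exercised the same way with grep -E. The sample file below is hypothetical and exists only to give each pattern something to match or reject.

```shell
# Hypothetical sample input, created only for this sketch.
sample=$(mktemp)
printf '%s\n' 'love' 'lve' 'lovers' 'loveable' 'hate' 'move' 'looove' > "$sample"

# ? makes the preceding o optional, so both love and lve match as whole lines.
optional=$(grep -E -c '^lo?ve$' "$sample")

# ( ) groups the alternatives separated by |.
grouped=$(grep -E 'love(able|rs)' "$sample")

# {m,n} bounds how many times the preceding o may repeat.
bounded=$(grep -E 'lo{2,4}ve' "$sample")

echo "lo?ve as a whole line: $optional lines"
echo "love(able|rs):"
echo "$grouped"
echo "lo{2,4}ve: $bounded"

rm -f "$sample"
```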
4.7.1 Basic and Extended Regular Expressions
The GNU grep command supports the same regular expression metacharacters as UNIX grep (see Table 4.7) and then some (see Table 4.8). It also accepts options that modify the way it searches or displays lines. For example, you can provide options to turn off case sensitivity, display line numbers, display filenames, and so on.
There are two sets of regular expression metacharacters: basic and extended. The regular version of GNU grep (also grep -G) uses the basic set (see Table 4.7), and egrep (or grep -E) uses the extended set (see Table 4.8). With GNU grep, both sets are available. The basic set consists of
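As a rough illustration of such options, the sketch below uses -i (ignore case), -n (show line numbers), and -l (list only the names of files that match). The two files and their contents are invented for the example.

```shell
# Hypothetical files, created only for this sketch.
dir=$(mktemp -d)
printf '%s\n' 'Love is blind' 'no match here' 'true love' > "$dir/a.txt"
printf '%s\n' 'nothing' > "$dir/b.txt"

# -i ignores case; -n prefixes each matching line with its line number.
numbered=$(grep -i -n 'love' "$dir/a.txt")

# -l prints only the names of the files that contain a match.
files=$(cd "$dir" && grep -l 'love' a.txt b.txt)

echo "$numbered"
echo "files with a match: $files"

rm -rf "$dir"
```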
^, $, ., *, [ ], [^ ], \< \>, and \( \)
In addition, GNU grep recognizes \b, \w, and \W, as well as a new class of POSIX metacharacters (see Table 4.9).
Table 4.9. The Bracketed Character Class
Bracket Class | Meaning |
---|---|
[:alnum:] | Alphanumeric characters |
[:alpha:] | Alphabetic characters |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters |
[:graph:] | Nonblank characters (not spaces, control characters, etc.) |
[:lower:] | Lowercase letters |
[:print:] | Like [:graph:], but includes the space character |
[:punct:] | Punctuation characters |
[:space:] | All whitespace characters (newlines, spaces, tabs) |
[:upper:] | Uppercase letters |
[:xdigit:] | Allows digits in a hexadecimal number (0-9a-fA-F) |
With the -E option to GNU grep, the extended (egrep) set is available. Even without the -E option, regular grep, the default, can use the extended set of metacharacters, provided that the metacharacters are preceded by a backslash. The extended set of metacharacters is
?, +, { }, |, ( )
The extended metacharacters have no special meaning to regular grep unless they are backslashed, as follows:
\?, \+, \{, \|, \(, \)
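The effect of the backslash can be seen by running the same pattern three ways. This is a sketch that assumes GNU grep, since the backslashed forms \+ and \| are GNU extensions to the basic set. The sample file is hypothetical.

```shell
# Hypothetical sample input, created only for this sketch.
sample=$(mktemp)
printf '%s\n' 'love' 'lve' 'hate' 'lo+ve' > "$sample"

# Regular grep: a bare + is an ordinary character, so only the text lo+ve matches.
literal=$(grep 'lo+ve' "$sample")

# Regular grep with a backslash: \+ means one or more of the preceding o.
basic=$(grep 'lo\+ve' "$sample")

# grep -E: the bare + already means one or more of the preceding o.
extended=$(grep -E 'lo+ve' "$sample")

# Alternation follows the same rule: \| in regular grep, | with -E.
alt_basic=$(grep -c 'love\|hate' "$sample")
alt_extended=$(grep -E -c 'love|hate' "$sample")

echo "grep 'lo+ve':     $literal"
echo "grep 'lo\\+ve':    $basic"
echo "grep -E 'lo+ve':  $extended"
echo "alternation counts: $alt_basic $alt_extended"

rm -f "$sample"
```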
The format for using GNU grep is shown in Table 4.6.
Table 4.6. GNU grep
Format | What It Understands |
---|---|
grep 'pattern' filename(s) | Basic RE metacharacters (the default) |
grep -G 'pattern' filename(s) | Same as above; the default |
grep -E 'pattern' filename(s) | Extended RE metacharacters |
grep -F 'pattern' filename | No RE metacharacters |
grep -P 'pattern' filename | Interpret the pattern as a Perl RE |
The POSIX Class
POSIX (the Portable Operating System Interface) is an industry standard to ensure that programs are portable across operating systems. In order to be portable, POSIX recognizes that different countries or locales may differ in the way they encode characters, represent currency, and represent times and dates. To handle different types of characters, POSIX added to the basic and extended regular expressions the bracketed character classes shown in Table 4.9.
The class [:alnum:], for example, is another way of saying A-Za-z0-9. To be recognized as a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9, by itself, is not a regular expression, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form depends on ASCII character encoding, whereas the second form allows characters from other languages to be represented in the class, such as Swedish rings and German umlauts.
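The double-bracket form can be sketched as follows. The sample file is hypothetical and contains only ASCII text, so it shows the syntax rather than any locale-dependent behavior.

```shell
# Hypothetical sample input, created only for this sketch.
sample=$(mktemp)
printf '%s\n' 'abc123' '---' 'A1' 'x y' > "$sample"

# [[:alnum:]] matches any line containing at least one letter or digit.
alnum=$(grep -c '[[:alnum:]]' "$sample")

# Each class sits inside its own outer brackets: one uppercase letter, then one digit.
upper_digit=$(grep '^[[:upper:]][[:digit:]]$' "$sample")

# -v inverts the match, leaving the lines with no letters or digits at all.
no_alnum=$(grep -v '[[:alnum:]]' "$sample")

echo "lines with a letter or digit: $alnum"
echo "uppercase letter plus digit:  $upper_digit"
echo "lines without either:         $no_alnum"

rm -f "$sample"
```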
Example 4.37.
1 % grep '[[:space:]]\.[[:digit:]][[:space:]]' datafile
southwest SW Lewis Dalsass 2.7 .8 2 18
southeast SE Patricia Hemenway 4.0 .7 4 17
2 % grep -E '[[:space:]]\.[[:digit:]][[:space:]]' datafile
southwest SW Lewis Dalsass 2.7 .8 2 18
southeast SE Patricia Hemenway 4.0 .7 4 17
3 % egrep '[[:space:]]\.[[:digit:]][[:space:]]' datafile
southwest SW Lewis Dalsass 2.7 .8 2 18
southeast SE Patricia Hemenway 4.0 .7 4 17
EXPLANATION
1, 2, 3 For all Linux variants of grep (other than fgrep), the POSIX bracketed class set is supported. In each of these examples, grep searches for a whitespace character, a literal period, a digit [0-9], and another whitespace character.