4.7. Linux and GNU grep

Linux uses the GNU version of grep, which is functionally much the same as UNIX grep, only better. In addition to the basic and extended metacharacters (see Table 4.7 and Table 4.8) and the POSIX character classes (see Table 4.9), GNU grep offers a number of new options, including -G, -E, -F, and -P, that allow you to use regular grep for everything and still get the functionality of both egrep and fgrep.^[2]

^[2] To use grep recursively, see Appendix A for GNU rgrep and xargs.

Table 4.7. The Basic Set—GNU grep's Regular Expression Metacharacters
Metacharacter   Function                                          Example         What It Matches
^               Beginning-of-line anchor                          ^love           Matches all lines beginning with love.
$               End-of-line anchor                                love$           Matches all lines ending with love.
.               Matches one character                             l..e            Matches lines containing an l, followed by two characters, followed by an e.
*               Matches zero or more of the preceding character   ' *love'        Matches lines with zero or more spaces followed by the pattern love. (The example pattern begins with a space; the quotes are not part of it.)
[ ]             Matches one character in the set                  [Ll]ove         Matches lines containing love or Love.
[^ ]            Matches one character not in the set              [^A-K]ove       Matches lines containing a character not in the range A through K, followed by ove.
\<^[a]          Beginning-of-word anchor                          \<love          Matches lines containing a word that begins with love.
\>              End-of-word anchor                                love\>          Matches lines containing a word that ends with love.
\(..\)^[b]      Tags matched characters                           \(love\)able    Tags the marked portion in a register to be remembered later as number 1. To reference it later, use \1 to repeat the pattern. Up to nine tags may be used, numbered from the leftmost part of the pattern. For example, the pattern love is saved in register 1 to be referenced later as \1.
x\{m\}^[c]      Repetition of character x, m times                o\{5\}          Matches if line has 5 occurrences of o.
x\{m,\}^[c]     Repetition of character x, at least m times       o\{5,\}         Matches if line has at least 5 occurrences of o.
x\{m,n\}^[c]    Repetition of character x, between m and n times  o\{5,10\}       Matches if line has between 5 and 10 occurrences of o.
\w              Alphanumeric word character; [a-zA-Z0-9_]         l\w*e           Matches an l followed by zero or more word characters, and an e.
\W              Nonalphanumeric word character; [^a-zA-Z0-9_]     love\W\+        Matches love followed by one or more nonword characters (., ?, etc.). With grep -E, write love\W+.
\b              Word boundary                                     \blove\b        Matches only the word love.

^[a] Won't work unless backslashed, even with grep -E and GNU egrep; does not work with UNIX egrep at all.

^[b] These metacharacters are really part of the extended set, but are placed here because they work with UNIX grep and GNU regular grep, if backslashed. They do not work with UNIX egrep at all.

^[c] The \{ \} metacharacters are not supported on all versions of UNIX or all pattern-matching utilities; they usually work with vi and grep. They don't work with UNIX egrep at all.
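The basic metacharacters in Table 4.7 can be tried on a small scratch file. The following is a minimal sketch, assuming GNU grep and a POSIX shell; the sample lines and the variable names are invented for illustration and do not come from the datafile used elsewhere in this chapter.

```shell
# Hypothetical sample data; any small text file will do.
tmp=$(mktemp)
printf '%s\n' 'love' 'Love story' 'lovely day' 'glove' \
              'my love' 'looooose' 'lovelove' > "$tmp"

# ^ anchors the pattern to the beginning of the line.
begins=$(grep -c '^love' "$tmp")

# $ anchors the pattern to the end of the line.
ends=$(grep -c 'love$' "$tmp")

# \< and \> anchor the pattern to the beginning and end of a word.
whole_word=$(grep -c '\<love\>' "$tmp")

# \(..\) tags a match in register 1; \1 repeats it.
doubled=$(grep '\(love\)\1' "$tmp")

# \{5,\} requires at least five of the preceding character.
five_os=$(grep 'o\{5,\}' "$tmp")

# \w matches one word character; here the whole line must be l...e.
l_to_e=$(grep -c '^l\w*e$' "$tmp")

echo "$begins $ends $whole_word $doubled $five_os $l_to_e"
rm -f "$tmp"
```

Using grep -c to count matching lines keeps the output short; drop the -c to see the lines themselves.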

Table 4.8. The Extended Set—Used with egrep and grep –E
Metacharacter           Function                                          Example         What It Matches
+                       Matches one or more of the preceding character    [a-z]+ove       Matches one or more lowercase letters, followed by ove. Would find move, approve, love, behoove, etc.
?                       Matches zero or one of the preceding character    lo?ve           Matches an l followed by either one o or none at all, then ve. Would find love or lve.
a|b|c                   Matches either a or b or c                        love|hate       Matches either expression, love or hate.
( )                     Groups characters                                 love(able|rs)   Matches loveable or lovers.
                                                                          (ov)+           Matches one or more occurrences of ov.
(..) (...) \1 \2^[a]    Tags matched characters                           (love)ing       Tags the marked portion in a register to be remembered later as number 1. To reference it later, use \1 to repeat the pattern. Up to nine tags may be used, numbered from the leftmost part of the pattern. For example, the pattern love is saved in register 1 to be referenced later as \1.
x{m}^[b]                Repetition of character x, m times                o{5}            Matches if line has 5 occurrences of o.
x{m,}^[b]               Repetition of character x, at least m times       o{5,}           Matches if line has at least 5 occurrences of o.
x{m,n}^[b]              Repetition of character x, between m and n times  o{5,10}         Matches if line has between 5 and 10 occurrences of o.

^[a] Tags and back references do not work with UNIX egrep.

^[b] The \{ \} metacharacters are not supported on all versions of UNIX or all pattern-matching utilities; they usually work with vi and grep. They do not work with UNIX egrep at all.
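The extended metacharacters in Table 4.8 can be exercised the same way with grep -E. This is a minimal sketch, assuming GNU grep and a POSIX shell; the word list and variable names are made up for the illustration.

```shell
# Hypothetical word list, one word per line.
tmp=$(mktemp)
printf '%s\n' 'love' 'lve' 'move' 'hate' 'lovers' 'loveable' 'looooove' > "$tmp"

# + : one or more lowercase letters followed by ove.
plus=$(grep -cE '[a-z]+ove' "$tmp")

# ? : an l, an optional o, then ve.
optional=$(grep -cE 'lo?ve' "$tmp")

# | : either love or hate.
either=$(grep -cE 'love|hate' "$tmp")

# ( ) : group the alternatives so that both share the prefix love.
grouped=$(grep -cE 'love(able|rs)' "$tmp")

# {5} : five consecutive occurrences of o.
interval=$(grep -E 'o{5}' "$tmp")

echo "$plus $optional $either $grouped $interval"
rm -f "$tmp"
```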

4.7.1 Basic and Extended Regular Expressions

GNU grep supports the same regular expression metacharacters as UNIX grep (see Table 4.7), and then some (see Table 4.8). It also accepts options that modify the way it searches or displays lines. For example, you can provide options to turn off case sensitivity, display line numbers, display filenames, and so on.

There are two versions of regular expression metacharacters: basic and extended. The regular version of GNU grep (also grep -G) uses the basic set (see Table 4.7), and egrep (or grep -E) uses the extended set (see Table 4.8). With GNU grep, both sets are available. The basic set consists of

^, $, ., *, [ ], [^ ], \< \>, and \( \)

In addition, GNU grep recognizes \b, \w, and \W, as well as a new class of POSIX metacharacters (see Table 4.9).

Table 4.9. The Bracketed Character Class
Bracket Class   Meaning
[:alnum:]       Alphanumeric characters
[:alpha:]       Alphabetic characters
[:cntrl:]       Control characters
[:digit:]       Numeric characters
[:graph:]       Nonblank characters (not spaces, control characters, etc.)
[:lower:]       Lowercase letters
[:print:]       Like [:graph:], but includes the space character
[:punct:]       Punctuation characters
[:space:]       All whitespace characters (newlines, spaces, tabs)
[:upper:]       Uppercase letters
[:xdigit:]      Allows digits in a hexadecimal number (0-9a-fA-F)

With the -E option to GNU grep, the extended (egrep) set is available, but even without -E, regular grep (the default) can use the extended set of metacharacters, provided that each metacharacter is preceded with a backslash.^[3] The extended set of metacharacters is

?, +, { }, |, ( )

^[3] In any version of grep, a metacharacter can be quoted with a backslash to turn off its special meaning.

The extended metacharacters have no special meaning to regular grep unless they are backslashed, as follows:

\?, \+, \{ \}, \|, \( \)
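The way the backslash flips meaning between the two modes is easiest to see side by side. The sketch below assumes GNU grep, since backslashed \? and \| in basic mode are GNU extensions; the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample data, including a line with a literal question mark.
tmp=$(mktemp)
printf '%s\n' 'love' 'lve' 'lo?ve' 'hate' > "$tmp"

# Basic mode: ? is an ordinary character, so only the literal text lo?ve matches.
basic_plain=$(grep 'lo?ve' "$tmp")

# Basic mode with a backslash: \? means zero or one of the preceding character.
basic_escaped=$(grep -c 'lo\?ve' "$tmp")

# Extended mode: ? is special without a backslash.
extended_plain=$(grep -cE 'lo?ve' "$tmp")

# Extended mode with a backslash: \? is a literal question mark again.
extended_escaped=$(grep -E 'lo\?ve' "$tmp")

# Basic mode alternation needs the backslashed form \|.
alternation=$(grep -c 'love\|hate' "$tmp")

echo "$basic_plain $basic_escaped $extended_plain $extended_escaped $alternation"
rm -f "$tmp"
```

In both modes one spelling is the operator and the other is the literal character; the -E option only swaps which is which.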

The format for using GNU grep is shown in Table 4.6.

Table 4.6. GNU grep
Format                            What It Understands
grep 'pattern' filename(s)        Basic RE metacharacters (the default)
grep -G 'pattern' filename(s)     Same as above; the default
grep -E 'pattern' filename(s)     Extended RE metacharacters
grep -F 'pattern' filename        No RE metacharacters
grep -P 'pattern' filename        Interpret the pattern as a Perl RE
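To see how the same pattern is read under the different options in Table 4.6, compare the results on a small file. This is a minimal sketch, assuming GNU grep and a POSIX shell; the sample lines and variable names are invented for illustration. The -P form is left out because not every build of GNU grep includes Perl regular expression support.

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'rate .8' 'rate 18' 'a+b' 'aab' > "$tmp"

# -G (the default): . matches any one character, so both "rate" lines match.
g_dot=$(grep -c -G '.8' "$tmp")

# -F: no metacharacters at all, so only the literal text .8 matches.
f_dot=$(grep -c -F '.8' "$tmp")

# -G: + is an ordinary character, so the literal text a+b matches.
g_plus=$(grep -G 'a+b' "$tmp")

# -E: + means one or more of the preceding character, so aab matches.
e_plus=$(grep -E 'a+b' "$tmp")

echo "$g_dot $f_dot $g_plus $e_plus"
rm -f "$tmp"
```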

The POSIX Class

POSIX (the Portable Operating System Interface) is an industry standard intended to ensure that programs are portable across operating systems. To that end, POSIX recognizes that countries or locales may differ in how they encode characters and in how they represent currency, times, and dates. To handle different types of characters, POSIX added to the basic and extended regular expressions the bracketed character classes shown in Table 4.9.

A class such as [:alnum:] is another way of saying A-Za-z0-9. To be recognized as part of a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9 by itself does not describe a set of characters, but [A-Za-z0-9] does; likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form depends on ASCII character encoding, whereas the second allows characters from other languages to be represented in the class, such as Swedish rings and German umlauts.
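The double-bracket form can be checked quickly at the command line. The following is a minimal sketch, assuming GNU grep and a POSIX shell; the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'abc123' 'ABC' '---' 'n:a' > "$tmp"

# The class name sits inside a bracket expression: [[:alnum:]].
alnum=$(grep -c '[[:alnum:]]' "$tmp")

# A class can be negated like any other set member: [^[:alnum:]].
non_alnum=$(grep -c '[^[:alnum:]]' "$tmp")

# Classes combine with anchors and * as ordinary sets do.
upper_only=$(grep '^[[:upper:]][[:upper:]]*$' "$tmp")

# [[:digit:]] stands in for [0-9].
with_digit=$(grep '[[:digit:]]' "$tmp")

echo "$alnum $non_alnum $upper_only $with_digit"
rm -f "$tmp"
```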

Example 4.37.


1   % grep '[[:space:]]\.[[:digit:]][[:space:]]' datafile
    southwest   SW      Lewis Dalsass          2.7     .8   2   18
    southeast   SE      Patricia Hemenway      4.0     .7   4   17

2   % grep -E '[[:space:]]\.[[:digit:]][[:space:]]' datafile
    southwest   SW      Lewis Dalsass          2.7     .8   2   18
    southeast   SE      Patricia Hemenway      4.0     .7   4   17

3   % egrep '[[:space:]]\.[[:digit:]][[:space:]]' datafile
    southwest   SW      Lewis Dalsass          2.7     .8   2   18
    southeast   SE      Patricia Hemenway      4.0     .7   4   17

EXPLANATION

1, 2, 3 For all Linux variants of grep (other than fgrep), the POSIX bracketed class set is supported. In each of these examples, grep searches for a whitespace character, a literal period, a digit [0-9], and another whitespace character.
