6.8. Regular Expressions

A regular expression to awk is a pattern that consists of characters enclosed in forward slashes. Awk supports the use of regular expression metacharacters (same as egrep) to modify the regular expression in some way. If a string in the input line is matched by the regular expression, the resulting condition is true, and any actions associated with the expression are executed. If no action is specified and an input line is matched by the regular expression, the record is printed. See Table 6.5.

Example 6.22.
% nawk '/Mary/' employees
Mary Adams 5346 11/4/63 28765

EXPLANATION
All lines in the employees file containing the regular expression pattern Mary are displayed.

Example 6.23.
% nawk '/Mary/{print $1, $2}' employees
Mary Adams

EXPLANATION
The first and second fields of all lines in the employees file containing the regular expression pattern Mary are displayed.

The metacharacters listed in Table 6.6 are supported by most versions of grep and sed, but are not supported by any version of awk.

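The two forms shown in Examples 6.22 and 6.23, a pattern with no action and a pattern followed by an action, can be compared side by side. The following is a minimal sketch, not the book's own listing: it assumes an awk on the PATH (plain awk is used instead of nawk, which may not be installed), it recreates the employees data in a temporary file with single spaces between fields (the original file's exact whitespace is an assumption), and the variable names are hypothetical.

```shell
# Recreate the sample data in a temporary file. Single-space field
# separators are an assumption; the original file may use tabs.
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# Pattern only: with no action, awk prints the whole matching record.
whole_record=$(awk '/Mary/' "$employees")

# Pattern plus action: only the requested fields are printed.
two_fields=$(awk '/Mary/ {print $1, $2}' "$employees")

printf '%s\n' "$whole_record" "$two_fields"
rm -f "$employees"
```

Storing each result in a variable makes it easy to see that the pattern selects the same record in both cases and that only the action changes what is printed.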
6.8.1 Matching on an Entire Line

A stand-alone regular expression is matched against the entire line, and if no action is given, each line in which a match occurs is printed in full. The regular expression can be anchored to the beginning of the line with the ^ metacharacter.

Example 6.24.
% nawk '/^Mary/' employees
Mary Adams 5346 11/4/63 28765

EXPLANATION
All lines in the employees file that start with the regular expression Mary are displayed.

Example 6.25.
% nawk '/^[A-Z][a-z]+ /' employees
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500

EXPLANATION
All lines in the employees file beginning with an uppercase letter, followed by one or more lowercase letters, followed by a space, are displayed.

6.8.2 The match Operator

The match operator, the tilde (~), is used to match an expression within a record or field.

Example 6.26.
% cat employees
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500

% nawk '$1 ~ /[Bb]ill/' employees
Billy Black 1683 9/23/44 336500

EXPLANATION
Any lines matching Bill or bill in the first field are displayed.

Example 6.27.
% nawk '$1 !~ /ly$/' employees
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765

EXPLANATION
Any lines in which the first field does not end in ly are displayed.

The POSIX Character Class

POSIX (the Portable Operating System Interface) is an industry standard to ensure that programs are portable across operating systems. To support portability, POSIX recognizes that different countries or locales may differ in how characters are encoded, in their alphabets, in the symbols used to represent currency, and in how times and dates are represented. To handle different types of characters, POSIX added the bracketed character classes shown in Table 6.7 to the basic and extended regular expressions.

Gawk supports these POSIX character classes, whereas older versions of awk and nawk do not.

The class [:alnum:] is another way of saying A-Za-z0-9. To be recognized as a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9 by itself is not a regular expression, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form depends on ASCII character encoding, whereas the second form allows characters from other languages to be represented in the class, such as Swedish rings and German umlauts.

Example 6.28.
% gawk '/[[:lower:]]+g[[:space:]]+[[:digit:]]/' employees
Sally Chang 1654 7/22/54 650000

EXPLANATION
Gawk searches for one or more lowercase letters, followed by a g, followed by one or more whitespace characters, followed by a digit. (On many Linux systems, awk is linked to gawk, making both awk and gawk valid commands.)
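
The match operator and the POSIX character classes can be combined so that a class is tested against a single field rather than the whole record. The following is a minimal sketch under stated assumptions: the awk on the PATH is assumed to understand POSIX classes (gawk does, according to the text above; other implementations may not), the employees data is recreated in a temporary file with single spaces between fields, and the variable names are hypothetical.

```shell
# Recreate the sample data in a temporary file. Single-space field
# separators are an assumption; the original file may use tabs.
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# Tilde operator: test only the first field for a trailing "ly".
ends_in_ly=$(awk '$1 ~ /ly$/ {print $1}' "$employees")

# POSIX classes on the second field: one uppercase letter, one or more
# lowercase letters, then a final g.
posix_match=$(awk '$2 ~ /^[[:upper:]][[:lower:]]+g$/ {print $1, $2}' "$employees")

# The same test written with ASCII ranges, for comparison.
range_match=$(awk '$2 ~ /^[A-Z][a-z]+g$/ {print $1, $2}' "$employees")

printf '%s\n' "$ends_in_ly" "$posix_match" "$range_match"
rm -f "$employees"
```

On this ASCII-only data the range form and the POSIX form select the same record; they would be expected to differ only when a field contains letters outside A-Z and a-z.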