6.8. Regular Expressions

To awk, a regular expression is a pattern of characters enclosed in forward slashes. Awk supports the same regular expression metacharacters as egrep (see Table 6.5), which modify how the pattern is matched. If any string in the input line matches the regular expression, the condition is true and any actions associated with the expression are executed. If no action is specified, each input line (record) matched by the regular expression is printed.

Example 6.22.


% nawk  '/Mary/'  employees

Mary Adams     5346     11/4/63     28765

Table 6.5. awk Regular Expression Metacharacters

Metacharacter    What It Does
^                Matches at the beginning of the string
$                Matches at the end of the string
.                Matches any single character
*                Matches zero or more of the preceding character
+                Matches one or more of the preceding character
?                Matches zero or one of the preceding character
[ABC]            Matches any one character in the set of characters A, B, or C
[^ABC]           Matches any one character not in the set of characters A, B, or C
[A-Z]            Matches any one character in the range from A to Z
A|B              Matches either A or B
(AB)+            Matches one or more sets of AB; e.g., AB, ABAB, ABABAB
\*               Matches a literal asterisk
&                Used in the replacement string to represent what was found in the search string

EXPLANATION

All lines in the employees file containing the regular expression pattern Mary are displayed.
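
As a minimal sketch of a few of the metacharacters from Table 6.5 working together, the commands below rebuild the four-line employees file used in this section in a temporary file and run three patterns against it. The sketch assumes a plain awk that behaves like nawk for these patterns; the shell variable names (employees, alt, ids, amp) are illustrative only.

```shell
# Recreate the sample employees file in a temporary location
# (data copied from the listing used in this section).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones       4424     5/12/66     543354
Mary Adams      5346     11/4/63     28765
Sally Chang     1654     7/22/54     650000
Billy Black     1683     9/23/44     336500
EOF

# ^, grouping, and alternation: lines that start with Tom or Mary.
alt=$(awk '/^(Tom|Mary) /{print $1}' "$employees")
echo "$alt"

# A bracketed range with +: a space, 16, one or more digits, a space.
ids=$(awk '/ 16[0-9]+ /{print $2}' "$employees")
echo "$ids"

# & in the replacement string stands for the text that was matched.
amp=$(awk '/^Mary/{sub(/Mary/, "&-Ann"); print $1, $2}' "$employees")
echo "$amp"

rm -f "$employees"
```

The first command should print Tom and Mary, the second Chang and Black, and the third Mary-Ann Adams.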

Example 6.23.


% nawk  '/Mary/{print $1, $2}'  employees

Mary Adams

EXPLANATION

The first and second fields of all lines in the employees file containing the regular expression pattern Mary are displayed.

The metacharacters listed in Table 6.6 are supported by most versions of grep and sed, but not by traditional versions of awk or nawk. (Gawk is more permissive; it accepts the word anchors \< and \>, for example.)

Table 6.6. Metacharacters NOT supported

Metacharacter    Function
\< \>            Word anchors
\( \)            Backreferencing
\{ \}            Repetition
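
Where word anchors and \{ \} repetition are unavailable, a comparable match can often be spelled out with the metacharacters that awk does support. The following is only a sketch of that idea, not a general replacement: it treats any non-letter (or the line boundary) as a word edge and writes a two-digit repetition out by hand. The employees data is the sample file from this section, rebuilt in a temporary file, and the variable names are hypothetical.

```shell
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones       4424     5/12/66     543354
Mary Adams      5346     11/4/63     28765
Sally Chang     1654     7/22/54     650000
Billy Black     1683     9/23/44     336500
EOF

# A plain pattern matches Bill inside Billy.
substr=$(awk '/Bill/{print $1}' "$employees")
echo "$substr"

# Stand-in for \<Bill\>: require a non-letter or a line boundary on both sides.
# No line contains Bill as a whole word, so nothing is printed.
whole=$(awk '/(^|[^A-Za-z])Bill([^A-Za-z]|$)/{print $1}' "$employees")
echo "$whole"

# The same construction does match when the whole word is present.
whole_black=$(awk '/(^|[^A-Za-z])Black([^A-Za-z]|$)/{print $1}' "$employees")
echo "$whole_black"

# Stand-in for [0-9]\{2\}: repeat the bracket expression explicitly.
# Only Mary's date begins with a two-digit month.
two_digits=$(awk '/ [0-9][0-9]\//{print $1}' "$employees")
echo "$two_digits"

rm -f "$employees"
```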

6.8.1 Matching on an Entire Line

A stand-alone regular expression is tested against the entire line; if no action is given, every line in which a match occurs is printed in full. The regular expression can be anchored to the beginning of the line with the ^ metacharacter.

Example 6.24.


% nawk  '/^Mary/'  employees

Mary Adams     5346     11/4/63     28765

EXPLANATION

All lines in the employees file that start with the regular expression Mary are displayed.

Example 6.25.


% nawk  '/^[A-Z][a-z]+ /'  employees

Tom Jones        4424     5/12/66     543354

Mary Adams       5346     11/4/63     28765

Sally Chang      1654     7/22/54     650000

Billy Black      1683     9/23/44     336500

EXPLANATION

All lines in the employees file beginning with an uppercase letter, followed by one or more lowercase letters, followed by a space, are displayed.
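
The $ metacharacter anchors a pattern to the end of the line in the same way that ^ anchors it to the beginning. The sketch below assumes the sample employees file from this section (rebuilt in a temporary file, with no trailing blanks after the last field) and a plain awk; the variable names are illustrative.

```shell
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones       4424     5/12/66     543354
Mary Adams      5346     11/4/63     28765
Sally Chang     1654     7/22/54     650000
Billy Black     1683     9/23/44     336500
EOF

# $ anchors the match at the end of the line: lines whose last character is 0.
# With no action, the whole matching line is printed.
ends_zero=$(awk '/0$/' "$employees")
echo "$ends_zero"

# ^ and $ together pin both ends: starts with T and ends with 4.
both=$(awk '/^T.*4$/{print $1, $2}' "$employees")
echo "$both"

rm -f "$employees"
```

The first command should display the lines for Sally Chang and Billy Black, and the second should display Tom Jones. A trailing space or tab at the end of a line would defeat the $ anchor, since the last character would then no longer be a digit.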

6.8.2 The `match` Operator

The match operator, the tilde (~), is used to match an expression within a record or field. Its negation, !~, is true when the expression does not match.

Example 6.26.


% cat employees

Tom Jones       4424    5/12/66     543354

Mary Adams      5346     11/4/63     28765

Sally Chang     1654     7/22/54     650000

Billy Black     1683     9/23/44     336500



% nawk '$1 ~ /[Bb]ill/' employees

Billy Black     1683     9/23/44     336500

EXPLANATION

Any lines matching Bill or bill in the first field are displayed.

Example 6.27.


% nawk '$1 !~ /ly$/' employees

Tom Jones       4424     5/12/66     543354

Mary Adams      5346     11/4/63     28765

EXPLANATION

Any lines in which the first field does not end in ly are displayed.
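
The match operator can be applied to any field and combined with an action, and the expression on its right-hand side may also be a string held in a variable. The sketch below illustrates this on the sample employees file (rebuilt in a temporary file). It assumes an awk that accepts the -v option, as nawk does; the variable name pat and the shell variable names are hypothetical.

```shell
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones       4424     5/12/66     543354
Mary Adams      5346     11/4/63     28765
Sally Chang     1654     7/22/54     650000
Billy Black     1683     9/23/44     336500
EOF

# Match against the fourth field (the date): years 60 through 69.
# The slash inside the pattern is escaped as \/.
sixties=$(awk '$4 ~ /\/6[0-9]$/ {print $1, $2}' "$employees")
echo "$sixties"

# The pattern can come from a variable; the string is used as a regular expression.
starts=$(awk -v pat='^[BS]' '$1 ~ pat {print $1}' "$employees")
echo "$starts"

# !~ selects records whose second field does NOT start with A, B, or C.
not_abc=$(awk '$2 !~ /^[A-C]/ {print $2}' "$employees")
echo "$not_abc"

rm -f "$employees"
```

The first command should print Tom Jones and Mary Adams, the second Sally and Billy, and the third Jones.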

The POSIX Character Class

POSIX (the Portable Operating System Interface) is an industry standard to ensure that programs are portable across operating systems. To be portable, POSIX recognizes that countries or locales may differ in how characters are encoded, in their alphabets, in the symbols used to represent currency, and in how times and dates are represented. To handle different types of characters, POSIX added the bracketed character classes shown in Table 6.7 to both basic and extended regular expressions. Gawk supports these character classes, whereas awk and nawk do not.

Table 6.7. Bracketed Character Class Added by POSIX

Bracket Class    Meaning
[:alnum:]        Alphanumeric characters
[:alpha:]        Alphabetic characters
[:cntrl:]        Control characters
[:digit:]        Numeric characters
[:graph:]        Nonblank characters (not spaces, control characters, etc.)
[:lower:]        Lowercase letters
[:print:]        Like [:graph:], but includes the space character
[:punct:]        Punctuation characters
[:space:]        All whitespace characters (newlines, spaces, tabs)
[:upper:]        Uppercase letters
[:xdigit:]       Allows digits in a hexadecimal number (0-9a-fA-F)

The class [:alnum:] is another way of saying A-Za-z0-9. To be recognized as a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9 by itself is not a regular expression, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first depends on ASCII character encoding, whereas the second allows characters from other languages to be included in the class, such as letters with Swedish rings and German umlauts.

Example 6.28.


% gawk '/[[:lower:]]+g[[:space:]]+[[:digit:]]/' employees

Sally Chang 1654 7/22/54 650000

EXPLANATION

Gawk searches for one or more lowercase letters, followed by a g, followed by one or more spaces, followed by a digit. (If you are a Linux user, awk is linked to gawk, making both awk and gawk valid commands.)
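
The bracketed classes can also be used with the match operator on individual fields. The following sketch runs gawk when it is installed and otherwise falls back to awk, on the assumption that the installed awk also accepts POSIX character classes (the original awk and nawk described here do not). The employees data is the sample file from this section, rebuilt in a temporary file; AWK and the other shell variable names are illustrative.

```shell
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones       4424     5/12/66     543354
Mary Adams      5346     11/4/63     28765
Sally Chang     1654     7/22/54     650000
Billy Black     1683     9/23/44     336500
EOF

# Prefer gawk; fall back to awk (assumed to understand POSIX classes).
AWK=$(command -v gawk || command -v awk)

# Dates that begin with two digits followed by a punctuation character.
two_digit_month=$("$AWK" '$4 ~ /^[[:digit:]][[:digit:]][[:punct:]]/ {print $1}' "$employees")
echo "$two_digit_month"

# First names made of one uppercase letter and exactly two lowercase letters.
short_name=$("$AWK" '$1 ~ /^[[:upper:]][[:lower:]][[:lower:]]$/ {print $1}' "$employees")
echo "$short_name"

rm -f "$employees"
```

With the sample data, the first command should print Mary and the second Tom.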
