
7.5 The Match Operator

The basic match

     $text =~ m/regex/

is the core of Perl regular-expression use. In Perl, a regular-expression match is an operator that takes two operands, a target string operand and a regex operand, and returns a value.

How the match is carried out, and what kind of value is returned, depend on the context the match is used in (see Section 7.3.1), and other factors. The match operator is quite flexible—it can be used to test a regular expression against a string, to pluck data from a string, and even to parse a string part by part in conjunction with other match operators. While powerful, this flexibility can make mastering it more complex. Some areas of concern include:

  • How to specify the regex operand

  • How to specify match modifiers, and what they mean

  • How to specify the target string to match against

  • A match's side effects

  • The value returned by a match

  • Outside influences that affect the match

The general form of a match is:


     StringOperand =~ RegexOperand

There are various shorthand forms, and it's interesting to note that each part is optional in one shorthand form or another. We'll see examples of all forms throughout this section.

7.5.1 Match's Regex Operand

The regex operand can be a regex literal or a regex object. (Actually, it can be a string or any arbitrary expression, but there is little benefit to that.) If a regex literal is used, match modifiers may also be specified.

7.5.1.1 Using a regex literal

The regex operand is most often a regex literal within m/···/ or just /···/. The leading m is optional if the delimiters for the regex literal are forward slashes or question marks (delimiters of question marks are special, discussed in a bit). For consistency, I prefer to always use the m, even when it's not required. As described earlier, you can choose your own delimiters if the m is present (see Section 7.2.1.2).

When using a regex literal, you can use any of the core modifiers described in Section 7.2.3. The match operator also supports two additional modifiers, /g and /c, discussed in a bit.

7.5.1.2 Using a regex object

The regex operand can also be a regex object, created with qr/···/. For example:

     my $regex = qr/regex/;

      .

      .

      .

     if ($text =~ $regex) {

          .

          .

          .

You can use m/···/ with a regex object. As a special case, if the only thing within the "regex literal" is the interpolation of a regex object, it's exactly the same as using the regex object alone. This example's if can be written as:

     if ($text =~ m/$regex/) {

         .

         .

         .

This is convenient because it perhaps looks more familiar, and also allows you to use the /g modifier with a regex object. (You can use the other modifiers that m/···/ supports as well, but they're meaningless in this case because they can never override the modes locked in a regex object; see Section 7.4.1.1.)
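As a minimal sketch of the /g point (the variable names and sample text are made up for illustration), wrapping a regex object in m/···/ gives /g somewhere to attach:

```perl
use strict;
use warnings;

# Hypothetical sample data; $word is a regex object built with qr//.
my $text = "alpha beta gamma";
my $word = qr/\b\w+\b/;

# A bare "$text =~ $word" has nowhere to put /g, but m/$word/g does.
my @words = $text =~ m/$word/g;

print join(',', @words), "\n";
```

In list context this collects every word, which the bare-object form cannot do in one step.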

7.5.1.3 The default regex

If no regex is given, such as with m// (or with m/$SomeVar/ where the variable $SomeVar is empty or undefined), Perl reuses the regular expression most recently used successfully within the enclosing dynamic scope. This used to be useful for efficiency reasons, but is now obsolete with the advent of regex objects (see Section 7.4).
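Purely to illustrate the behavior (not to recommend it), here is a small sketch of the default regex at work. The @lines data is invented for the example:

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @lines = ("width=120", "height=45");

# A successful match establishes the default regex ...
$lines[0] =~ m/\d+/ or die "expected a match";

# ... so the empty pattern below reuses \d+ instead of matching "nothing".
my $found;
if ($lines[1] =~ m//) {
    $found = $&;
}

print "found: $found\n";
```

A regex object held in a variable expresses the same intent far more clearly.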

7.5.1.4 Special match-once ?···?

In addition to the special cases for the regex-literal delimiters described earlier, the match operator treats the question mark as a special delimiter. The use of a question mark as the delimiter (as with m?···?) enables a rather esoteric feature such that after m?···? successfully matches once, it cannot match again until the function reset is called in the same package. Quoting from the Perl Version 1 manual page, this feature was "a useful optimization when you only want to see the first occurrence of something in each of a set of files," but for whatever reason, I have never seen it used in modern Perl.

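The question mark delimiters are a special case like the forward slash delimiters, in that the m is optional: ?···? by itself is treated as m?···?. (This bare form was later deprecated and then removed in Perl 5.22, so current versions require the m.)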
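A short sketch of match-once in action follows. The helper name first_a_words and the word list are hypothetical, chosen only to make the effect of reset visible:

```perl
use strict;
use warnings;

my @names = qw(apple avocado banana apricot);

# Hypothetical helper: collect the names accepted by a match-once operator.
sub first_a_words {
    my @hits;
    for my $name (@_) {
        # m?^a? succeeds once, then refuses to match until reset is called.
        push @hits, $name if $name =~ m?^a?;
    }
    return @hits;
}

my @first  = first_a_words(@names);   # only the first match gets through
my @second = first_a_words(@names);   # the operator is still "used up"
reset;                                # re-arm every m?...? in this package
my @third  = first_a_words(@names);

print "first: @first / second: @second / third: @third\n";
```

Note that the "used up" state belongs to that one m?···? operator in the source, not to the string being tested.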

7.5.2 Specifying the Match Target Operand

The normal way to indicate "this is the string to search" is using =~ , as with $text =~ m/···/ . Remember that =~ is not an assignment operator, nor is it a comparison operator. It is merely a funny-looking way of linking the match operator with one of its operands. (The notation was adapted from awk.)

Since the whole " expr =~ m/···/ " is an expression itself, you can use it wherever an expression is allowed. Some examples (each separated by a dotted line):

     $text =~ m/···/;    # Just do it, presumably, for the side effects.

     ...........................

     if ($text =~ m/···/) {

      # Do code if match is successful

       .

       .

       .

     ...........................

     $result = ( $text =~ m/···/ );  # Set $result to result of match against $text

     $result =   $text =~ m/···/ ;  # Same thing; =~ has higher precedence than =

     ...........................

       $copy = $text;                # Copy $text to $copy ...

       $copy           =~ m/···/;    # ... and perform match on $copy

     ( $copy = $text ) =~ m/···/;    # Same thing in one expression

7.5.2.1 The default target

If the target string is the variable $_, you can omit the " $_ =~ " parts altogether. In other words, the default target operand is $_.

Something like

     $text =~ m/regex/;

means " Apply regex to the text in $text, ignoring the return value but doing the side effects. " If you forget the '~', the resulting

     $text = m/regex/;

becomes "Apply regex to the text in $_, do the side effects, and return a true or false value that is then assigned to $text." In other words, the following are the same:


     $text =        m/regex/;

     $text = ($_ =~ m/regex/);
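The difference is easy to demonstrate. In this sketch the strings are invented, and $with_tilde and $clobbered are illustrative names only:

```perl
use strict;
use warnings;

$_ = "no digits here";
my $text = "room 101";

# With '=~', the match is applied to $text.
my $with_tilde = ($text =~ m/\d+/) ? 1 : 0;

# Without the '~', the match is applied to $_ and its true/false
# result overwrites whatever $text held.
$text = m/\d+/;
my $clobbered = $text;

print "with tilde: $with_tilde, after typo: '", $clobbered, "'\n";
```

Here the match against $_ fails, so $text ends up holding a false value and the original string is lost.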

Using the default target string can be convenient when combined with other constructs that have the same default (as many do). For example, this is a common idiom:

     while (<>)

     {

        if (m/···/) {

          .

          .

          .

       } elsif (m/···/) {

          .

          .

          .

In general, though, relying on default operands can make your code less approachable by less experienced programmers.

7.5.2.2 Negating the sense of the match

You can also use !~ instead of =~ to logically negate the sense of the return value. (Return values and side effects are discussed soon, but with !~, the return value is always a simple true or false value.) The following are identical:

     if ($text !~ m/···/)



     if (not $text =~ m/···/)

     

     unless ($text =~ m/···/)

Personally, I prefer the middle form. With any of them, the normal side effects, such as the setting of $1 and the like, still happen. !~ is merely a convenience in an "if this doesn't match" situation.

7.5.3 Different Uses of the Match Operator

You can always use the match operator as if it returns a simple true/false indicating the success of the match, but there are ways you can get additional information about a successful match, and to work in conjunction with other match operators. How the match operator works depends primarily on the context in which it's used (see Section 7.3.1), and whether the /g modifier has been applied.

7.5.3.1 Normal "does this match?"—scalar context without /g

In a scalar context, such as the test of an if, the match operator returns a simple true or false:

     if ($target =~ m/···/) {

         # . . . processing after successful match . . .

          .

          .

          .

     } else {

         # . . . processing after unsuccessful match . . .

          .

          .

          .

     }

You can also assign the result to a scalar for inspection later:

     my $success = $target =~ m/···/;

       .

       .

       .

     if ($success) {

       .

       .

       .

     }

7.5.3.2 Normal "pluck data from a string"—list context, without /g

A list context without /g is the normal way to pluck information from a string. The return value is a list with an element for each set of capturing parentheses in the regex. A simple example is processing a date of the form 69/8/31, using:

     my ($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x;

The three matched numbers are then available in the three variables (and $1, $2, and $3 as well). There is one element in the return-value list for each set of capturing parentheses, or an empty list upon failure.

It is possible for a set of capturing parentheses to not participate in the final success of a match. For example, one of the sets in m/(this)|(that)/ is guaranteed not to be part of the match. Such sets return the undefined value undef. If there are no sets of capturing parentheses to begin with, a successful list-context match without /g returns the list (1).
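The three return-value cases can be seen side by side in a small sketch (the test strings are made up for illustration):

```perl
use strict;
use warnings;

# A set of parentheses that doesn't participate comes back as undef.
my @alt  = "that" =~ m/(this)|(that)/;

# No capturing parentheses at all: success returns the list (1).
my @none = "that" =~ m/th[ai][st]/;

# Failure returns the empty list, however many parentheses there are.
my @fail = "other" =~ m/(this)|(that)/;

printf "alt has %d elements, none has %d, fail has %d\n",
       scalar(@alt), scalar(@none), scalar(@fail);
```

Because @alt still has two elements, test individual items with defined rather than relying on the length of the list.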

A list context can be provided in a number of ways, including assigning the results to an array, as with:

     my @parts = $text =~ m/^(\d+)-(\d+)-(\d+)$/;

If you're assigning to just one scalar variable, take care to provide a list context to the match if you want the captured parts instead of just a Boolean indicating the success. Compare the following tests:

     my ($word)  = $text =~ m/(\w+)/;

     my $success = $text =~ m/(\w+)/;

The parentheses around the variable in the first example cause its my to provide a list context to the assignment (in this case, to the match). The lack of parentheses in the second example provides a scalar context to the match, so $success merely gets a true/false result.

This example shows a convenient idiom:

     if ( my ($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x ) {

         # Process for when we have a match: $year and such are available

     } else {

         # here if no match . . .

     }

The match is in a list context (provided by the " my (···) = "), so the list of variables is assigned their respective $1, $2, etc., if the match is successful. However, once that's done, since the whole combination is in the scalar context provided by the if conditional, Perl must contort the list to a scalar. To do that, it takes the number of items in the list, which is conveniently zero if the match wasn't successful, and non-zero (i.e., true) if it was.
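Wrapped in a subroutine, the idiom might look like the following sketch. The name parse_date and the returned hash layout are assumptions made for the example, not anything defined elsewhere in this chapter:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference on success, undef on failure.
sub parse_date {
    my ($date) = @_;
    if ( my ($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x ) {
        # The list assignment yielded three items, so the test was true.
        return { year => $year, month => $month, day => $day };
    }
    # The list assignment yielded zero items, so the test was false.
    return;
}

my $ok  = parse_date("69/8/31");
my $bad = parse_date("August 31");

print "year=$ok->{year} month=$ok->{month} day=$ok->{day}\n";
print "second date did not parse\n" unless defined $bad;
```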

7.5.3.3 "Pluck all matches"—list context, with the /g modifier

This useful construct returns a list of all text matched within capturing parentheses (or if there are no capturing parentheses, the text matched by the whole expression), not only for one match, as in the previous section, but for all matches in the string.

A simple example is the following, to fetch all integers in a string:

     my @nums = $text =~ m/\d+/g;

If $text contains an IP address like '64.156.215.240', @nums then receives four elements, '64', '156', '215', and '240'. Combined with other constructs, here's an easy way to turn an IP address into an eight-digit hexadecimal number such as '409cd7f0', which might be convenient for creating compact log files:

     my $hex_ip = join '', map { sprintf("%02x", $_) } $ip =~ m/\d+/g;

You can convert it back with a similar technique:

     my $ip = join '.', map { hex($_) } $hex_ip =~ m/../g;
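Put together as a runnable round trip (the variable $back is introduced here just to hold the converted result):

```perl
use strict;
use warnings;

my $ip = '64.156.215.240';

# Each list-context /g match feeds its results straight into map.
my $hex_ip = join '',  map { sprintf("%02x", $_) } $ip     =~ m/\d+/g;
my $back   = join '.', map { hex($_) }             $hex_ip =~ m/../g;

print "$ip -> $hex_ip -> $back\n";
```

The "%02x" format matters: without the zero padding, an octet below 16 would produce a single hex digit and m/../g would split the string in the wrong places on the way back.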

As another example, to match all floating-point numbers on a line, you might use:

     my @nums = $text =~ m/\d+(?:\.\d+)?|\.\d+/g;

The use of non-capturing parentheses here is very important, since adding capturing ones changes what is returned. Here's an example showing how one set of capturing parentheses can be useful:

     my @Tags = $Html =~ m/<(\w+)/g;

This sets @Tags to the list of HTML tags, in order, found in $Html, assuming it contains no stray '<' characters.

Here's an example with multiple sets of capturing parentheses: consider having the entire text of a Unix mailbox alias file in a single string, where logical lines look like:

     alias  Jeff     jfriedl@regex.info

     alias  Perlbug  perl5-porters@perl.org

     alias  Prez     president@whitehouse.gov

To pluck an alias and full address from one of the logical lines, you can use m/^alias\s+(\S+)\s+(.+)/m (without /g). In a list context, this returns a list of two elements, such as ('Jeff', 'jfriedl@regex.info') . Now, to match all such sets, add /g. This returns a list like:

     ( 'Jeff', 'jfriedl@regex.info', 'Perlbug',

       'perl5-porters@perl.org', 'Prez', 'president@whitehouse.gov' )

If the list happens to fit a key/value pair pattern as in this example, you can actually assign it directly to an associative array. After running

     my %alias = $text =~ m/^alias\s+(\S+)\s+(.+)/mg;

you can access the full address of 'Jeff' with $alias{Jeff}.
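Here is the same thing as a self-contained sketch, with the alias file held in a here-document for illustration:

```perl
use strict;
use warnings;

# Single-quoted here-document, so the '@' characters are left alone.
my $text = <<'END';
alias  Jeff     jfriedl@regex.info
alias  Perlbug  perl5-porters@perl.org
alias  Prez     president@whitehouse.gov
END

# Two captures per match, so the flat list alternates key, value, key, value ...
my %alias = $text =~ m/^alias\s+(\S+)\s+(.+)/mg;

print "$_ => $alias{$_}\n" for sort keys %alias;
```

This works only because every match contributes exactly two items. A regex with one or three sets of capturing parentheses would scramble the key/value pairing.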

7.5.4 Iterative Matching: Scalar Context, with /g

A scalar-context m/···/g is a special construct quite different from the others. Like a normal m/···/, it does just one match, but like a list-context m/···/g, it pays attention to where previous matches occurred. Each time a scalar-context m/···/g is reached, such as in a loop, it finds the "next" match. If it fails, it resets the "current position," causing the next application to start again at the beginning of the string.

Here's a simple example:

     $text = "WOW! This is a SILLY test.";
     $text =~ m/\b([a-z]+\b)/g;

     print "The first all-lowercase word: $1\n";
     $text =~ m/\b([A-Z]+\b)/g;

     print "The subsequent all-uppercase word: $1\n";

With both scalar matches using the /g modifier, it results in:

     The first all-lowercase word: is

     The subsequent all-uppercase word: SILLY

The two scalar-/g matches work together: the first sets the "current position" to just after the matched lowercase word, and the second picks up from there to find the first uppercase word that follows. The /g is required for either match to pay attention to the "current position," so if either didn't have /g, the second line would refer to 'WOW'.

A scalar context /g match is quite convenient as the conditional of a while loop. Consider:

     while ($ConfigData =~ m/^(\w+)=(.*)/mg) {

         my($key, $value) = ($1, $2);

          .

          .

          .

     }

All matches are eventually found, but the body of the while loop is executed between the matches (well, after each match). Once an attempt fails, the result is false and the while loop finishes. Also, upon failure, the /g state is reset, which means that the next /g match starts over at the start of the string.
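A runnable version of that loop, using invented configuration data and a hypothetical %config hash to hold the results, might look like this:

```perl
use strict;
use warnings;

# Hypothetical configuration text; the comment line has no key=value form.
my $ConfigData = "host=example.org\nport=8080\n# a comment\nuser=guest\n";

my %config;
while ($ConfigData =~ m/^(\w+)=(.*)/mg) {
    my ($key, $value) = ($1, $2);
    $config{$key} = $value;   # the loop body runs once per match
}

# The final, failing attempt ended the loop and reset the /g state.
my $pos_after = pos($ConfigData);

print "$_=$config{$_}\n" for sort keys %config;
print "pos after loop is ", (defined $pos_after ? $pos_after : 'undef'), "\n";
```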

Compare

     while ($text =~ m/(\d+)/) { # dangerous!

        print "found: $1\n";

     }

and:

     while ($text =~ m/(\d+)/g) {

        print "found: $1\n";

     }

The only difference is /g, but it's a huge difference. If $text contained, say, our earlier IP example, the second prints what we want:

     found: 64

     found: 156

     found: 215

     found: 240

The first, however, prints "found: 64" over and over, forever. Without the /g, the match is simply "find the first 「(\d+)」 in $text," which is '64' no matter how many times it's checked. Adding the /g to the scalar-context match turns it into "find the next 「(\d+)」 in $text," which finds each number in turn.

7.5.4.1 The "current match location" and the pos() function

Every string in Perl has associated with it a "current match location" at which the transmission first attempts the match. It's a property of the string, and not associated with any particular regular expression. When a string is created or modified, the "current match location" starts out at the beginning of the string, but when a /g match is successful, it's left at the location where the match ended. The next time a /g match is applied to the string, the match begins inspecting the string at that same "current match location."

You have access to the target string's "current match location" via the pos(···) function. For example:

     my $ip = "64.156.215.240";

     while ($ip =~ m/(\d+)/g) {

        printf "found '$1' ending at location %d\n", pos($ip);

     }

This produces:

     found '64' ending at location 2

     found '156' ending at location 6

     found '215' ending at location 10

     found '240' ending at location 14

(Remember, string indices are zero-based, so "location 2" is just before the 3rd character into the string.) After a successful /g match, $+[0] (the first element of @+; see Section 7.3.3) is the same as the pos of the target string.

The default argument to the pos() function is the same default argument for the match operator: the $_ variable.
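Both points can be checked with a short sketch. The @ends array and $default_pos are names introduced only for this illustration:

```perl
use strict;
use warnings;

my $ip = "64.156.215.240";
my @ends;
while ($ip =~ m/(\d+)/g) {
    # After a successful /g match, pos agrees with $+[0].
    die "pos and \$+[0] disagree" unless pos($ip) == $+[0];
    push @ends, pos($ip);
}

# With no argument, pos reports on $_, just as a bare m/.../ matches $_.
$_ = "abc";
m/b/g;
my $default_pos = pos;

print "ends: @ends; pos of \$_: $default_pos\n";
```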

7.5.4.2 Pre-setting a string's pos

The real power of pos() is that you can write to it, to tell the regex engine where to start the next match (if that next match uses /g, of course). For example, the web server logs I work with at Yahoo! are in a custom format that contains 32 bytes of fixed-width data, followed by the page being requested, followed by other information. One way to pick out the page is to use 「^.{32}」 to skip over the fixed-width data:

     if ($logline =~ m/^.{32}(\S+)/) {

         $RequestedPage = $1;

     }

This brute-force method isn't elegant, and forces the regex engine to work to skip the first 32 bytes. That's less efficient and less clear than doing it explicitly ourselves:

     pos($logline) = 32; # The page starts at offset 32, so start the next match there . . .

     if ($logline =~ m/(\S+)/g) {

         $RequestedPage = $1;

     }

This is better, but isn't quite the same. It has the regex start where we want it to start, but doesn't require a match at that position the way the original does. If for some reason the character at offset 32 can't be matched by 「\S」, the original version correctly fails, but the new version, without anything to anchor it to a particular position in the string, is subject to the transmission's bump-along. Thus, it could return, in error, a match of 「\S+」 from later in the string. Luckily, the next section shows that this is an easy problem to fix.

7.5.4.3 Using 「\G」

「\G」 is the "anchor to where the previous match ended" metacharacter. It's exactly what we need to solve the problem in the previous section:

     pos($logline) = 32; # The page starts at offset 32, so start the next match there . . .

     if ($logline =~ m/\G(\S+)/g) {

         $RequestedPage = $1;

     }

「\G」 tells the transmission "don't bump-along with this regex; if you can't match successfully right away, fail."
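To see the difference the anchor makes, here is a sketch using two made-up log lines. The helper names page_unanchored and page_anchored are hypothetical, and the log format is simplified to 32 filler characters followed by the page:

```perl
use strict;
use warnings;

# Hypothetical log lines: 32 bytes of fixed-width data, then the page.
my $good = ('x' x 32) . "/index.html 200";
my $bad  = ('x' x 32) . " /late.html 200";   # whitespace at offset 32

sub page_unanchored {
    my ($logline) = @_;
    pos($logline) = 32;
    # No \G: the transmission may bump along past offset 32.
    return $logline =~ m/(\S+)/g ? $1 : undef;
}

sub page_anchored {
    my ($logline) = @_;
    pos($logline) = 32;
    # \G: the match must begin exactly at offset 32 or fail.
    return $logline =~ m/\G(\S+)/g ? $1 : undef;
}

my $loose_good  = page_unanchored($good);
my $strict_good = page_anchored($good);
my $loose_bad   = page_unanchored($bad);
my $strict_bad  = page_anchored($bad);

print "unanchored on bad line: $loose_bad\n";
print "anchored on bad line: ", (defined $strict_bad ? $strict_bad : 'no match'), "\n";
```

On the well-formed line the two agree. On the malformed one, only the anchored version reports the failure instead of quietly returning text from later in the string.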

There are discussions of 「\G」 in previous chapters: see the general discussion in Chapter 3 (see Section 3.4.3.3), and the extended example in Chapter 5 (see Section 5.4.2.3).

Note that Perl's 「\G」 is restricted in that it works predictably only when it is the first thing in the regex, and there is no top-level alternation. For example, in Chapter 6 when the CSV example is being optimized (see Section 6.6.7.3), the regex begins with 「\G(?:^|,)···」. Because there's no need to check for 「\G」 if the more restrictive 「^」 matches, you might be tempted to change this to 「(?:^|\G,)···」. Unfortunately, this doesn't work in Perl; the results are unpredictable.[7]

[7] This would work with most other flavors that support 「\G」, but even so, I would generally not recommend using it, as the optimization gains by having 「\G」 at the start of the regex usually outweigh the small gain by not testing 「\G」 an extra time (see Section 6.4.5).

7.5.4.4 "Tag-team" matching with /gc

Normally, a failing m/···/g match attempt resets the target string's pos to the start of the string, but adding the /c modifier to /g introduces a special twist, causing a failing match to not reset the target's pos. (/c is never used without /g, so I tend to refer to it as /gc.)
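The effect of /c on a failing match is easy to observe with pos. This is a minimal sketch with an invented string:

```perl
use strict;
use warnings;

my $s = "aaa bbb";

$s =~ m/a+/g;                 # succeeds, leaving pos at the end of 'aaa'
my $after_success = pos($s);

$s =~ m/zzz/gc;               # fails, but /c leaves pos where it was
my $after_gc_fail = pos($s);

$s =~ m/zzz/g;                # fails, and without /c pos is reset
my $after_g_fail = pos($s);

printf "after success: %d, after /gc failure: %d, after /g failure: %s\n",
       $after_success, $after_gc_fail,
       defined $after_g_fail ? $after_g_fail : 'undef';
```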

m/···/gc is most commonly used in conjunction with 「\G」 to create a "lexer" that tokenizes a string into its component parts. Here's a simple example to tokenize the HTML in variable $html:

     while (not $html =~ m/\G\z/gc) # While we haven't worked to the end . . .

     {

      if    ($html =~ m/\G( <[^>]+>    )/xgc) { print "TAG: $1\n"            }

      elsif ($html =~ m/\G( &\w+;      )/xgc) { print "NAMED ENTITY: $1\n"   }

      elsif ($html =~ m/\G( &\#\d+;    )/xgc) { print "NUMERIC ENTITY: $1\n" }

      elsif ($html =~ m/\G( [^<>&\n]+  )/xgc) { print "TEXT: $1\n"           }

      elsif ($html =~ m/\G  \n          /xgc) { print "NEWLINE\n"            }

      elsif ($html =~ m/\G( .          )/xgc) { print "ILLEGAL CHAR: $1\n"   }

      else {

          die "$0: oops, this shouldn't happen!";

      }

     }

The part of each regex that follows \G matches one type of HTML construct. Each is checked in turn starting from the current position (due to /gc), but can match only at the current position (due to 「\G」). The regexes are checked in order until the construct at that current position has been found and reported. This leaves $html's pos at the start of the next token, which is found during the next iteration of the loop.

The loop ends when m/\G\z/gc is able to match, which is when the current position (「\G」) has worked its way to the very end of the string (「\z」).

An important aspect of this approach is that one of the tests must match each time through the loop. If one doesn't (and if we don't abort), there would be an infinite loop, since nothing would be advancing or resetting $html's pos. This example has a final else clause that will never be invoked as the program stands now, but if we were to edit the program (as we will soon), we could perhaps introduce a mistake, so keeping the else clause is prudent. As it is now, if the data contains a sequence we haven't planned for (such as '<>'), it generates one warning message per unexpected character.

Another important aspect of this approach is the ordering of the checks, such as the placement of 「\G(.)」 as the last check. Or, consider extending this application to recognize <script> blocks with:

     $html =~ m{\G ( <script[^>]*>.*?</script> )}xgcsi

(Wow, we've used five modifiers!) Note the m{···} delimiters: with m/···/, the slash in </script> would end the regex early. To work properly, this must be inserted into the program before the currently-first 「<[^>]+>」. Otherwise, 「<[^>]+>」 would match the opening <script> tag "out from under" us.
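Here is a sketch of the extended lexer with the <script> check placed first. It is packaged as a hypothetical tokenize subroutine that returns the report lines instead of printing them, and the sample HTML is made up:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the tag-team loop; returns one string per token.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    while (not $html =~ m/\G\z/gc) {   # While we haven't worked to the end . . .
        # The <script> check must come before the general tag check.
        if    ($html =~ m{\G( <script[^>]*>.*?</script> )}xgcsi) { push @tokens, "SCRIPT: $1" }
        elsif ($html =~ m/\G( <[^>]+>    )/xgc) { push @tokens, "TAG: $1"            }
        elsif ($html =~ m/\G( &\w+;      )/xgc) { push @tokens, "NAMED ENTITY: $1"   }
        elsif ($html =~ m/\G( &\#\d+;    )/xgc) { push @tokens, "NUMERIC ENTITY: $1" }
        elsif ($html =~ m/\G( [^<>&\n]+  )/xgc) { push @tokens, "TEXT: $1"           }
        elsif ($html =~ m/\G  \n          /xgc) { push @tokens, "NEWLINE"            }
        elsif ($html =~ m/\G( .          )/xgc) { push @tokens, "ILLEGAL CHAR: $1"   }
        else {
            die "$0: oops, this shouldn't happen!";
        }
    }
    return @tokens;
}

my @tokens = tokenize("<p>a &amp; b</p>\n<script>if (a<b) {}</script>");
print "$_\n" for @tokens;
```

With the <script> check first, the '<' inside the script body stays part of a single SCRIPT token. If that check were moved below the general tag check, the opening <script> would be reported as an ordinary TAG and the body would be broken into TEXT and ILLEGAL CHAR tokens.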

There's a somewhat more advanced example of /gc in Chapter 3 (see Section 3.4.3.4).

7.5.4.5 Pos-related summary

Here's a summary of how the match operator interacts with the target string's pos:

     Type of match   Where match starts              pos upon success      pos upon failure

     m/···/          start of string (pos ignored)   reset to undef        reset to undef

     m/···/g         starts at target's pos          set to end of match   reset to undef

     m/···/gc        starts at target's pos          set to end of match   left unchanged

Also, modifying a string in any way causes its pos to be reset to undef (which is the initial value, meaning the start of the string).
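That last point is worth a quick illustration, because it can bite when a string is altered inside a /g loop. In this sketch (sample string invented), appending to the string is enough to discard its pos:

```perl
use strict;
use warnings;

my $s = "one two three";

$s =~ m/\w+/g;               # matches 'one'
my $before = pos($s);

$s .= " four";               # any modification resets pos
my $after = pos($s);

printf "before: %d, after: %s\n", $before, defined $after ? $after : 'undef';
```

If a loop needs to build up text while scanning, append to a separate variable so the target string's pos is left alone.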

7.5.5 The Match Operator's Environmental Relations

The following sections summarize what we've seen about how the match operator influences the Perl environment, and vice versa.

7.5.5.1 The match operator's side effects

Often, the side effects of a successful match are more important than the actual return value. In fact, it is quite common to use the match operator in a void context (i.e., in such a way that the return value isn't even inspected), just to obtain the side effects. (In such a case, it acts as if given a scalar context.) The following summarizes the side effects of a successful match attempt:

  • After-match variables like $1 and @+ are set for the remainder of the current scope (see Section 7.3.3).

  • The default regex is set for the remainder of the current scope (see Section 7.5.1.3).

  • If m?···? matches, it (the specific m?···? operator) is marked as unmatchable, at least until the next call of reset in the same package (see Section 7.5.1.4).

Again, these side effects occur only with a match that is successful—an unsuccessful match attempt has no influence on them. However, the following side effects happen with any match attempt:

  • pos is set or reset for the target string (see Section 7.5.4.1).

  • If /o is used, the regex is "fused" to the operator so that re-evaluation does not occur (see Section 7.9.2.3).

7.5.5.2 Outside influences on the match operator

What a match operator does is influenced by more than just its operands and modifiers. This list summarizes the outside influences on the match operator:

context

The context that a match operator is applied in (scalar, list, or void) has a large influence on how the match is performed, as well as on its return value and side effects.



pos(···)

The pos of the target string (set explicitly or implicitly by a previous match) indicates where in the string the next /g -governed match should begin. It is also where 「\G」 matches.



default regex

The default regex is used if the provided regex is empty (see Section 7.5.1.3).



study

It has no effect on what is matched or returned, but if the target string has been studied, the match might be faster (or slower). See "The Study Function" (see Section 7.9.4).



m?···? and reset

The invisible "has/hasn't matched" status of m?···? operators is set when m?···? matches or reset is called (see Section 7.5.1.4).



7.5.5.3 Keeping your mind in context (and context in mind)

Before leaving the match operator, I'll put a question to you. Particularly when changing among the while, if, and foreach control constructs, you really need to keep your wits about you. What do you expect the following to print?


     while ("Larry Curly Moe" =~ m/\w+/g) {

        print "WHILE stooge is $&.\n";

     }

     print "\n";

     

     if ("Larry Curly Moe" =~ m/\w+/g) {

        print "IF stooge is $&.\n";

     }

     print "\n";

     

     foreach ("Larry Curly Moe" =~ m/\w+/g) {

        print "FOREACH stooge is $&.\n";

     }

It's a bit tricky.
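Once you've made your prediction, one way to check it is to run the three constructs and look at what each reports. This sketch collects the lines in a hypothetical @out array rather than printing as it goes:

```perl
use strict;
use warnings;

my @out;

# Scalar context with /g: one match per trip through the loop.
while ("Larry Curly Moe" =~ m/\w+/g) {
    push @out, "WHILE stooge is $&.";
}

# Scalar context with /g, but tested only once.
if ("Larry Curly Moe" =~ m/\w+/g) {
    push @out, "IF stooge is $&.";
}

# List context with /g: every match is made before the body first runs,
# so $& already reflects the last of them on each iteration.
foreach ("Larry Curly Moe" =~ m/\w+/g) {
    push @out, "FOREACH stooge is $&.";
}

print "$_\n" for @out;
```

The while and if forms behave as the scalar-context discussion suggests, while the foreach body runs three times and names Moe each time. Inside that loop it is $_, not $&, that holds each matched word in turn.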
