
7.4 The qr/···/ Operator and Regex Objects

Introduced briefly in Chapter 2 and Chapter 6 (see Section 2.3.6.7; Section 6.7), qr/···/ is a unary operator that takes a regex operand and returns a regex object. The returned object can then be used as a regex operand of a later match, substitution, or split, or can be used as a sub-part of a larger regex.

Regex objects are used primarily to encapsulate a regex into a unit that can be used to build larger expressions, and for efficiency (to gain control over exactly when a regex is compiled, discussed later).

As described in Section 7.2.1.2, you can pick your own delimiters, such as qr{···} or qr!···!. The operator supports the core modifiers /i, /x, /s, /m, and /o.
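As a small illustration of the delimiter choice, the following sketch builds the same case-insensitive regex object three ways. The pattern and the sample text are made up for this example and are not from the chapters referenced above.

```perl
use strict;
use warnings;

# Three spellings of the same regex object; only the delimiters differ.
my $slashes = qr/\bperl\b/i;
my $braces  = qr{\bperl\b}i;
my $bangs   = qr!\bperl\b!i;

my $text = "I like Perl.";    # sample text, invented for this sketch

my @results;
for my $re ($slashes, $braces, $bangs) {
    push @results, ($text =~ $re) ? "match" : "no match";
}
print join(", ", @results), "\n";
```

All three objects should behave identically, so the choice of delimiter comes down to which characters the regex itself needs to contain.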

7.4.1 Building and Using Regex Objects

Consider the following, with expressions adapted from Chapter 2 (see Section 2.3.6.7):

     my $HostnameRegex = qr/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i;
     my $HttpUrl = qr{
        http:// $HostnameRegex \b # Hostname
        (?:
             / [-a-z0-9_:\@&?=+,.!/~*'%\$]* # Optional path
                (?<![.,?!]) # Not allowed to end with [.,?!]
        )?
     }ix;

The first line encapsulates our simplistic hostname-matching regex into a regular-expression object, and saves it to the variable $HostnameRegex. The next lines then use that in building a regex object to match an HTTP URL, saved to the variable $HttpUrl. Once constructed, they can be used in a variety of ways, such as

     if ($text =~ $HttpUrl) {
        print "There is a URL\n";
     }

to merely inspect, or perhaps

     while ($text =~ m/($HttpUrl)/g) {
        print "Found URL: $1\n";
     }

to find and display all HTTP URLs.
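To try these fragments end to end, here is a minimal self-contained sketch. The two regex objects are the ones defined above; the sample text and the @found array are assumptions added only for illustration.

```perl
use strict;
use warnings;

my $HostnameRegex = qr/[-a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info)/i;
my $HttpUrl = qr{
   http:// $HostnameRegex \b          # Hostname
   (?:
      / [-a-z0-9_:\@&?=+,.!/~*'%\$]*  # Optional path
        (?<![.,?!])                   # Not allowed to end with [.,?!]
   )?
}ix;

# Sample text invented for this sketch.
my $text = "See http://www.example.com/docs/index.html, or try http://perl.example.edu.";

my @found;    # collected only so the result can be inspected afterward
while ($text =~ m/($HttpUrl)/g) {
    push @found, $1;
    print "Found URL: $1\n";
}
```

With this sample text, the trailing comma and the final period are left out of the reported URLs: the lookbehind rejects a path that ends in punctuation, and the hostname must end with one of the listed suffixes.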

Now, consider changing the definition of $HostnameRegex to this, derived from Chapter 5 (see Section 5.3.5):

     my $HostnameRegex = qr{
        # One or more dot-separated parts···
        (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]{0,61}[-a-z0-9]\. )*
        # Followed by the final suffix part···
        (?: com|edu|gov|int|mil|net|org|biz|info|···|aero|[a-z][a-z] )
     }xi;

This is intended to be used in the same way as our previous version (for example, it doesn't have a leading ^ and trailing $, and has no capturing parentheses), so we're free to use it as a drop-in replacement. Doing so gives us a stronger $HttpUrl.

7.4.1.1 Match modes (or lack thereof) are very sticky

qr/···/ supports the core modifiers described in Section 7.2.3. Once a regex object is built, the match modes of the regex it represents can't be changed, even if that regex object is used inside a subsequent m/···/ that has its own modifiers. For example, the following does not work:

     my $WordRegex = qr/\b \w+ \b/; # Oops, missing the /x modifier!
        .
        .
        .
     if ($text =~ m/^($WordRegex)/x) {
         print "found word at start of text: $1\n";
     }

The /x modifier is used here ostensibly to modify how $WordRegex is applied, but this does not work because the modifiers (or lack thereof) are locked in by the qr/···/ when $WordRegex is created. So, the appropriate modifiers must be used at that time.

Here's a working version of the previous example:

     my $WordRegex = qr/\b \w+ \b/x; # This works!
        .
        .
        .
     if ($text =~ m/^($WordRegex)/) {
         print "found word at start of text: $1\n";
     }
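The difference between the broken and the working version can be checked directly. The following is a small sketch; the variable names $NoX and $WithX and the sample text are invented for this comparison.

```perl
use strict;
use warnings;

my $text = "hello world";        # sample text, invented for this sketch

my $NoX   = qr/\b \w+ \b/;       # no /x: the spaces must match literally
my $WithX = qr/\b \w+ \b/x;      # /x: the spaces are ignored

# The /x on the outer m// does not reach inside $NoX.
my $no_x_result   = ($text =~ m/^($NoX)/x)  ? "matched '$1'" : "no match";
my $with_x_result = ($text =~ m/^($WithX)/) ? "matched '$1'" : "no match";

print "without /x in qr: $no_x_result\n";
print "with /x in qr:    $with_x_result\n";
```

The first test fails because the regex object demands a literal space at the start of the text; the second finds the leading word.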

Now, contrast the original snippet with the following:

     my $WordRegex = '\b \w+ \b'; # Normal string assignment
        .
        .
        .
     if ($text =~ m/^($WordRegex)/x) {
         print "found word at start of text: $1\n";
     }

Unlike the original, this one works even though no modifiers are associated with $WordRegex when it is created. That's because in this case, $WordRegex is a normal variable holding a simple string that is interpolated into the m/···/ regex literal. Building up a regex in a string is much less convenient than using regex objects, for a variety of reasons, including the problem in this case of having to remember that this $WordRegex must be applied with /x to be useful.

Actually, you can solve that problem even when using strings by putting the regex into a mode-modified span (see Section 3.4.4.2) when creating the string:

     my $WordRegex = '(?x:\b \w+ \b)'; # Normal string assignment
        .
        .
        .
     if ($text =~ m/^($WordRegex)/) {
         print "found word at start of text: $1\n";
     }

In this case, after the m/···/ regex literal interpolates the string, the regex engine is presented with ^((?x:\b \w+ \b)), which works the way we want.

In fact, this is what logically happens when a regex object is created, except that a regex object always explicitly defines the "on" or "off" for each of the /i, /x, /m, and /s modes. Using qr/\b \w+ \b/x creates (?x-ism:\b \w+ \b). Notice how the mode-modified span, (?x-ism:···), has /x turned on, while /i, /s, and /m are turned off. Thus, qr/···/ always "locks in" each mode, whether given a modifier or not.
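The same locking applies to every mode, not just /x. The sketch below illustrates it with /i; the three variables and the sample text are invented for this example.

```perl
use strict;
use warnings;

my $object = qr/hello/;       # regex object: /i is locked "off"
my $string = 'hello';         # plain string: carries no modes of its own
my $span   = '(?i:hello)';    # plain string with its own mode-modified span

my $text = "HELLO";           # sample text, invented for this sketch

# The outer /i affects the interpolated string but not the regex object.
my $r_object = ($text =~ m/^$object$/i) ? "match" : "no match";
my $r_string = ($text =~ m/^$string$/i) ? "match" : "no match";
my $r_span   = ($text =~ m/^$span$/)    ? "match" : "no match";

print "regex object under /i: $r_object\n";
print "plain string under /i: $r_string\n";
print "string with (?i:...):  $r_span\n";
```

Only the regex object fails to match, because its case-sensitivity was fixed when qr/···/ built it.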

7.4.2 Viewing Regex Objects

The previous paragraph talks about how regex objects logically wrap their regular expression with mode-modified spans like (?x-ism:···). You can actually see this for yourself, because if you use a regex object where Perl expects a string, Perl kindly gives a textual representation of the regex it represents. For example:


     % perl -e 'print qr/\b \w+ \b/x, "\n"'
     (?x-ism:\b \w+ \b)

Here's what we get when we print the $HttpUrl from Section 7.4.1:

     (?ix-sm:
        http:// (?ix-sm:
        # One or more dot-separated parts···
        (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]{0,61}[-a-z0-9]\. )*
        # Followed by the final suffix part···
        (?: com|edu|gov|int|mil|net|org|biz|info|···|aero|[a-z][a-z] )
     ) \b # Hostname
        (?:
             / [-a-z0-9_:\@&?=+,.!/~*'%\$]* # Optional path
                (?<![.,?!]) # Not allowed to end with [.,?!]
        )?
     )

The ability to turn a regex object into a string is very useful for debugging.
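As a rough sketch of that debugging use, the following stringifies a regex object and then compiles the printed form again. One caveat, stated as an assumption about your installation: Perl 5.14 and later print the wrapper in a shorter form such as (?^x:···) rather than (?x-ism:···), so the exact text depends on the Perl version. The variable names are invented for this example.

```perl
use strict;
use warnings;

my $WordRegex = qr/\b \w+ \b/x;

# Interpolating a regex object into a string stringifies it.
my $as_string = "$WordRegex";
print "regex is: $as_string\n";

# The printed form carries its modes with it, so it can be compiled
# again from the string and still treats the spaces as ignorable.
my $recompiled = qr/$as_string/;
my ($word) = "hello world" =~ m/^($recompiled)/;
print "first word: $word\n";
```

Because the modes travel inside the printed text, a regex dumped to a log this way can be pasted back into a program without remembering which modifiers it was built with.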

7.4.3 Using Regex Objects for Efficiency

One of the main reasons to use regex objects is to gain control, for efficiency reasons, of exactly when Perl compiles a regex to an internal form. The general issue of regex compilation was discussed briefly in Chapter 6, but the more complex Perl-related issues, including regex objects, are discussed in "Regex Compilation, the /o Modifier, qr/···/, and Efficiency" (see Section 7.9.2).
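Without anticipating that later discussion, the general shape of the idea can be sketched as follows: build the regex objects once, before a loop, and reuse them inside it. The word list, the input lines, and all variable names here are hypothetical, and whether this is measurably faster than interpolating strings depends on the details covered in Section 7.9.2.

```perl
use strict;
use warnings;

# Hypothetical word list and input lines, invented for this sketch.
my @words = qw(error warning fatal);
my @lines = ("disk error on sda", "all good", "FATAL: out of memory");

# Build each regex object once, up front, so the point of
# compilation is explicit and sits outside the loops.
my @regexes = map { qr/\b\Q$_\E\b/i } @words;

my @flagged;
for my $line (@lines) {
    for my $re (@regexes) {
        if ($line =~ $re) {
            push @flagged, $line;
            last;    # one hit is enough for this line
        }
    }
}
print "$_\n" for @flagged;
```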
