
7.9 Perl Efficiency Issues

For the most part, efficiency with Perl regular expressions is achieved in the same way as with any tool that uses a Traditional NFA. The techniques discussed in Chapter 6 (the internal optimizations, the unrolling methods, the "Think" section) all apply to Perl.

There are, of course, Perl-specific issues as well, and in this section, we'll look at the following topics:

  • There's More Than One Way To Do It: Perl is a toolbox offering many approaches to a solution. Knowing which problems are nails comes with understanding The Perl Way, and knowing which hammer to use for any particular nail goes a long way toward making more efficient and more understandable programs. Sometimes efficiency and understandability seem to be mutually exclusive, but a better understanding allows you to make better choices.

  • Regex Compilation, qr/···/, the /o Modifier, and Efficiency: The interpolation and compilation of regex operands are fertile ground for saving time. The /o modifier, which I haven't discussed much yet, along with regex objects (qr/···/), gives you some control over when the costly re-compilation takes place.

  • The $& Penalty: The three match side-effect variables, $`, $&, and $', can be convenient, but there's a hidden efficiency gotcha waiting in store for any script that uses them, even once, anywhere. Heck, you don't even have to use them; the entire script is penalized if one of these variables even appears in the script.

  • The Study Function: Since ages past, Perl has provided the study(···) function. Using it supposedly makes regexes faster, but it seems that no one really understands if it does, or why. We'll see whether we can figure it out.

  • Benchmarking: When it comes down to it, the fastest program is the one that finishes first. (You can quote me on that.) Whether a small routine, a major function, or a whole program working with live data, benchmarking is the final word on speed. Benchmarking is easy and painless with Perl, although there are various ways to go about it. I'll show you the way I do it, a simple method that has served me well for the hundreds of benchmarks I've done while preparing this book.

  • Perl's Regex Debugging: Perl's regex-debug flag can tell you about some of the optimizations the regex engine and transmission do, or don't do, with your regexes. We'll look at how to do this and see what secrets Perl gives up.

7.9.1 "There's More Than One Way to Do It"

There are often many ways to go about solving any particular problem, so there's no substitute for really knowing all that Perl has to offer when balancing efficiency and readability. Let's look at the simple problem of padding an IP address like '18.181.0.24' such that each of the four parts becomes exactly three digits: '018.181.000.024'. One simple and readable solution is:

     $ip = sprintf("%03d.%03d.%03d.%03d", split(/\./, $ip));

This is a fine solution, but there are certainly other ways to do the job. In the interest of comparison, Table 7-6 examines various ways to achieve the same goal, and their relative efficiency (they're listed from the most efficient to the least). This example's goal is simple and not very interesting in and of itself, yet it represents a common text-handling task, so I encourage you to spend some time understanding the various approaches. You may even see some Perl techniques that are new to you.

Each approach produces the same result when given a correct IP address, but fails in different ways if given something else. If there is any chance that the data will be malformed, you'll need more care than any of these solutions provide. That aside, the practical differences lie in efficiency and readability. As for readability, #1 and #13 seem the most straightforward (although it's interesting to see the wide gap in efficiency). Also straightforward are #3 and #4 (similar to #1) and #8 (similar to #13). The rest all suffer from varying degrees of crypticness.
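Because every approach in the table assumes well-formed input, one cautious option is to validate before padding. The following is only a minimal sketch, not one of the benchmarked approaches; the helper name pad_ip and the choice to return undef for bad input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the table): pad a dotted-quad address,
# returning undef when the input is not four dot-separated numbers in 0..255.
sub pad_ip {
    my ($ip) = @_;
    return undef unless defined $ip;

    # \A and \z anchor the match to the very start and end of the string.
    my @parts = $ip =~ m/\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/;
    return undef unless @parts == 4;
    return undef if grep { $_ > 255 } @parts;

    return sprintf("%03d.%03d.%03d.%03d", @parts);
}

my $padded   = pad_ip('18.181.0.24');   # well-formed input
my $rejected = pad_ip('18.181.0');      # only three parts

print "$padded\n";
print defined($rejected) ? "accepted\n" : "rejected\n";
```

The extra checks cost time, of course, so a wrapper like this belongs where input quality is in doubt rather than in a tight loop over data already known to be clean.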

So, what about efficiency? Why are some less efficient than others? It's the interactions among how an NFA works (Chapter 4), Perl's many regex optimizations (Chapter 6), and the speed of other Perl constructs (such as sprintf, and the mechanics of the substitution operator). The substitution operator's /e modifier, while indispensable at times, does seem to be mostly at the bottom of the list.

It's interesting to compare two pairs, #3/#4 and #8/#14. The two regexes of each pair differ only in their use of parentheses — the one without the parentheses is just a bit faster than the one with. But #8's use of $& as a way to avoid parentheses comes at a high cost not shown by these benchmarks (see Section 7.9.3).

7.9.2 Regex Compilation, the /o Modifier, qr/···/, and Efficiency

An important aspect of Perl's regex-related efficiency relates to the setup work Perl must do behind the scenes when program execution reaches a regex operator, before actually applying the regular expression. The precise setup depends on the type of regex operand. In the most common situation, the regex operand is a regex literal, as with m/···/ or s/···/···/ or qr/···/. For these, Perl has to do a few different things behind the scenes, each taking some time we'd like to avoid, if possible. First, let's look at what needs to be done, and then at ways we might avoid it.

Table 7-6. A Few Ways to Pad an IP Address

Rank  Time  Approach

 1.   1.0x  $ip = sprintf("%03d.%03d.%03d.%03d", split(m/\./, $ip));

 2.   1.3x  substr($ip,  0, 0) = '0' if substr($ip,  1, 1) eq '.';
            substr($ip,  0, 0) = '0' if substr($ip,  2, 1) eq '.';
            substr($ip,  4, 0) = '0' if substr($ip,  5, 1) eq '.';
            substr($ip,  4, 0) = '0' if substr($ip,  6, 1) eq '.';
            substr($ip,  8, 0) = '0' if substr($ip,  9, 1) eq '.';
            substr($ip,  8, 0) = '0' if substr($ip, 10, 1) eq '.';
            substr($ip, 12, 0) = '0' while length($ip) < 15;

 3.   1.6x  $ip = sprintf("%03d.%03d.%03d.%03d", $ip =~ m/\d+/g);

 4.   1.8x  $ip = sprintf("%03d.%03d.%03d.%03d", $ip =~ m/(\d+)/g);

 5.   1.8x  $ip = sprintf("%03d.%03d.%03d.%03d",
                          $ip =~ m/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);

 6.   2.3x  $ip =~ s/\b(?=\d\b)/00/g;
            $ip =~ s/\b(?=\d\d\b)/0/g;

 7.   3.0x  $ip =~ s/\b(\d(\d?)\b)/$2 eq '' ? "00$1" : "0$1"/eg;

 8.   3.3x  $ip =~ s/\d+/sprintf("%03d", $&)/eg;

 9.   3.4x  $ip =~ s/(?:(?<=\.)|^)(?=\d\b)/00/g;
            $ip =~ s/(?:(?<=\.)|^)(?=\d\d\b)/0/g;

10.   3.4x  $ip =~ s/\b(\d\d?\b)/'0' x (3-length($1)) . $1/eg;

11.   3.4x  $ip =~ s/\b(\d\b)/00$1/g;
            $ip =~ s/\b(\d\d\b)/0$1/g;

12.   3.4x  $ip =~ s/\b(\d\d?\b)/sprintf("%03d", $1)/eg;

13.   3.5x  $ip =~ s/\b(\d{1,2}\b)/sprintf("%03d", $1)/eg;

14.   3.5x  $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;

15.   3.6x  $ip =~ s/\b(\d\d?(?!\d))/sprintf("%03d", $1)/eg;

16.   4.0x  $ip =~ s/(?:(?<=\.)|^)(\d\d?(?!\d))/sprintf("%03d", $1)/eg;

7.9.2.1 The internal mechanics of preparing a regex

The behind-the-scenes work done to prepare a regex operand is discussed generally in Chapter 6 (see Section 6.4.3), but Perl has its unique twists.

Perl's pre-processing of regex operands happens in two general phases.

  1. Regex-literal processing: If the operand is a regex literal, it's processed as described in "How Regex Literals Are Parsed" (see Section 7.2.2). One of the benefits provided by this stage is variable interpolation.

  2. Regex compilation: The regex is inspected, and if valid, compiled into an internal form appropriate for its actual application by the regex engine. (If invalid, the error is reported to the user.)

Once Perl has a compiled regex in hand, it can actually apply it to the target string, as per Chapters 4, 5, and 6.

All that pre-processing doesn't necessarily need to be done every time each regex operator is used. It must always be done the first time a regex literal is used in a program, but if execution reaches the same regex literal more than once (such as in a loop, or in a function that's called more than once), Perl can sometimes re-use some of the previously-done work. The next sections show when and how Perl might do this, and additional techniques available to the programmer to further increase efficiency.

7.9.2.2 Perl steps to reduce regex compilation

In the next sections, we'll look at two ways in which Perl avoids some of the pre-processing associated with regex literals: unconditional caching and on-demand recompilation.

7.9.2.2.1 Unconditional caching

If a regex literal has no variable interpolation, Perl knows that the regex can't change from use to use, so after the regex is compiled once, that compiled form is saved ("cached") for use whenever execution again reaches the same code. The regex is examined and compiled just once, no matter how often it's used during the program's execution. Most regular expressions shown in this book have no variable interpolation, and so are perfectly efficient in this respect.

Variables within embedded code and dynamic regex constructs don't count, as they're not interpolated into the value of the regex, but rather part of the unchanging code the regex executes. When my variables are referenced from within embedded code, there may be times that you wish it were interpreted every time: see the warning in Section 7.8.4.

Just to be clear, caching lasts only as long as the program executes — nothing is cached from one run to the next.

7.9.2.2.2 On-demand recompilation

Not all regex operands can be cached. Consider this snippet:

     my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];
     # $today now holds the day ("Mon", "Tue", etc., as appropriate)

     while (<LOGFILE>) {
         if (m/^$today:/i) {
             .
             .
             .

The regex in m/^$today:/ requires interpolation, but the way it's used in the loop ensures that the result of that interpolation will be the same every time. It would be inefficient to recompile the same thing over and over each time through the loop, so Perl automatically does a simple string check, comparing the result of the interpolation against the result the last time through. If they're the same, the cached regex that was used the previous time is used again this time, eliminating the need to recompile. But if the result of the interpolation turns out to be different, the regex is recompiled. So, for the price of having to redo the interpolation and check the result with the cached value, the relatively expensive compile is avoided whenever possible.

How much do these features actually save? Quite a lot. As an example, I benchmarked the cost of pre-processing three forms of the $HttpUrl example from Section 7.4.1 (using the extended $HostnameRegex). I designed the benchmarks to show the overhead of regex pre-processing (the interpolation, string check, compilation, and other background tasks), not the actual application of the regex, which is the same regardless of how you get there.

The results are pretty interesting. I ran a version that has no interpolation (the entire regex manually spelled out within m/···/), and used that as the basis of comparison. The interpolation and check, if the regex doesn't change each time, takes about 25x longer. The full pre-processing (which adds the recompilation of the regex each time) takes about 1,000x longer! Wow.

Just to put these numbers into context, realize that even the full pre-processing, despite being over 1,000x slower than the static regex literal pre-processing, still takes only about 0.00026 seconds on my system. (It benchmarked at a rate of about 3,846 per second; on the other hand, the static regex literal's pre-processing benchmarked at a rate of about 3.7 million per second.) Still, the savings of not having to do the interpolation are impressive, and the savings of not having to recompile are downright fantastic. In the next sections, we'll look at how you can take action to enjoy these savings in even more cases.

7.9.2.3 The "compile once" /o modifier

Put simply, if you use the /o modifier with a regex literal operand, the regex literal will be inspected and compiled just once, regardless of whether it uses interpolation. If there's no interpolation, adding /o doesn't buy you anything because expressions without interpolation are always cached automatically. If there is interpolation, the first time execution arrives at the regex literal, the normal full pre-processing happens, but because of /o, the internal form is cached. If execution comes back again to the same regex operator, that cached form is used directly.

Here's the example from the previous section, with the addition of /o:

     my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];

     while (<LOGFILE>) {
         if (m/^$today:/io) {
             .
             .
             .

This is now much more efficient because the regex ignores $today on all but the first iteration through the loop. Not having to interpolate or otherwise pre-process and compile the regex every time represents a real savings that Perl couldn't do for us automatically because of the variable interpolation: $today might change, so Perl must play it safe and reinspect it each time. By using /o, we tell Perl to "lock in" the regex after the regex literal is first pre-processed and compiled. It's safe to do this when we know that the variables interpolated into a regex literal won't change, or when we don't want Perl to use the new values even if they do change.

7.9.2.3.1 Potential "gotchas" of /o

There's an important "gotcha" to watch out for with /o. Consider putting our example into a function:

     sub CheckLogfileForToday()
     {
       my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];

       while (<LOGFILE>) {
           if (m/^$today:/io) { # dangerous -- has a gotcha
               .
               .
               .
           }
       }
     }

Remember, /o indicates that the regex operand should be compiled once. The first time CheckLogfileForToday() is called, a regex operand representing the current day is locked in. If the function is called again some time later, even though $today may change, it will not be inspected again; the original locked-in regex is used every time for the duration of execution.

This is a major shortcoming, but as we'll see in the next section, regex objects provide a best-of-both-worlds way around it.
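The gotcha can be reproduced in a few lines. The sketch below is illustrative only: the function names, the idea of passing the day as an argument, and the sample log lines are all assumptions, chosen so that the effect shows up without waiting for the calendar to change.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; the sample lines are made up.

# With /o: the regex is locked in the first time execution reaches it.
sub count_with_o {
    my ($today, @lines) = @_;
    my $count = 0;
    for my $line (@lines) {
        $count++ if $line =~ m/^$today:/io;   # gotcha: later values of $today are ignored
    }
    return $count;
}

# Without /o: Perl re-checks the interpolated text and recompiles when it changes.
sub count_without_o {
    my ($today, @lines) = @_;
    my $count = 0;
    for my $line (@lines) {
        $count++ if $line =~ m/^$today:/i;
    }
    return $count;
}

my @log = ('Mon: backup started', 'Tue: disk check', 'Tue: backup done');

my $mon_o   = count_with_o('Mon', @log);      # first call locks in the "Mon" regex
my $tue_o   = count_with_o('Tue', @log);      # still applies the "Mon" regex
my $tue_std = count_without_o('Tue', @log);   # sees the new value

print "with /o: Mon=$mon_o Tue=$tue_o; without /o: Tue=$tue_std\n";
```

The second call to count_with_o should report the Monday count again, since the regex built on the first call is the one that stays in use.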

7.9.2.4 Using regex objects for efficiency

All the discussion of pre-processing we've seen so far applies to regex literals. The goal has been to end up with a compiled regex with as little work as possible. Another approach to the same end is to use a regex object, which is basically a ready-to-use compiled regex encapsulated into a variable. They're created with the qr/···/ operator (see Section 7.4.1).

Here's a version of our example using a regex object:

     sub CheckLogfileForToday()
     {
       my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];

       my $RegexObj = qr/^$today:/i; # compiles once per function call

       while (<LOGFILE>) {
           if ($_ =~ $RegexObj) {
                .
                .
                .
           }
       }
     }

Here, a new regex object is created each time the function is called, but it is then used directly for each line of the log file. When a regex object is used as an operand, it undergoes none of the pre-processing discussed throughout this section. The pre-processing is done when the regex object is created, not when it's later used. You can think of a regex object, then, as a "floating regex cache," a ready-to-use compiled regex that you can apply whenever you like.

This solution has the best of both worlds: it's efficient, since only one regex is compiled during each function call (not with each line in the log file), but, unlike the previous example where /o was used inappropriately, this example actually works correctly with multiple calls to CheckLogfileForToday().

Be sure to realize that there are two regex operands in this example. The regex operand of the qr/···/ is not a regex object, but a regex literal supplied to qr/···/ to create a regex object. The object is then used as the regex operand for the =~ match operator in the loop.
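For experimenting with this pattern away from a real log file, here is a self-contained sketch. It is a hypothetical variant rather than the function above: the name count_lines_for_day, the decision to pass the day and the lines as arguments, and the sample data are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical variant that takes the day and the lines as arguments
# (instead of reading LOGFILE) so the sketch runs on its own.
sub count_lines_for_day {
    my ($today, @lines) = @_;

    my $regex_obj = qr/^$today:/i;   # regex literal compiled once per call

    my $count = 0;
    for my $line (@lines) {
        $count++ if $line =~ $regex_obj;   # the regex object is the operand here
    }
    return $count;
}

my @log = ('Mon: backup started', 'Tue: disk check', 'Tue: backup done');

my $mon = count_lines_for_day('Mon', @log);
my $tue = count_lines_for_day('Tue', @log);   # a later call gets a fresh regex object

print "Mon=$mon Tue=$tue\n";
```

Each call builds its own regex object, so a changed day is picked up on the next call while the per-line work inside the loop stays small.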

7.9.2.4.1 Using m/···/ with regex objects

The use of the regex object,

     if ($_ =~ $RegexObj) {

can also be written as:

     if (m/$RegexObj/) {

This is not a normal regex literal, even though it looks like one. When the only thing in the "regex literal" is a regex object, it's just the same as using a regex object. This is useful for several reasons. One is simply that the m/···/ notation may be more familiar, and perhaps more comfortable to work with. It also relieves you from explicitly stating the target string $_, which makes things look better in conjunction with other operators that use the same default. Finally, it allows you to use the /g modifier with regex objects.
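As a small illustration of that last point, the sketch below applies a regex object through m/···/g in both list and scalar context. The sample text and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text      = 'Mon: ok, Tue: ok, Wed: fail';                  # sample data
my $day_regex = qr/\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun):/;         # sample regex object

# In list context, m/.../g returns every match; assigning through
# an empty list yields the number of matches.
my $count = () = $text =~ m/$day_regex/g;

# In scalar context, m/.../g steps through the matches one at a time.
my @offsets;
while ($text =~ m/$day_regex/g) {
    push @offsets, $-[0];   # offset at which this match started
}

print "found $count day labels at offsets @offsets\n";
```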

7.9.2.4.2 Using /o with qr/···/

The /o modifier can be used with qr/···/, but you'd certainly not want to in this example. Just as when /o is used with any of the other regex operators, qr/···/o locks in the regex the first time it's used, so if used here, $RegexObj would get the same regex object each time the function is called, regardless of the value of $today. That would be the same mistake as when we used m/···/o in Section 7.9.2.3.

7.9.2.5 Using the default regex for efficiency

The default regex (see Section 7.5.2) feature of regex operators can be used for efficiency, although the need for it has mostly been eliminated with the advent of regex objects. Still, I'll describe it quickly. Consider:

     sub CheckLogfileForToday()
     {
       my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];

       # Keep trying until one matches, so the default regex is set.
       "Sun:" =~ m/^$today:/i or
       "Mon:" =~ m/^$today:/i or
       "Tue:" =~ m/^$today:/i or
       "Wed:" =~ m/^$today:/i or
       "Thu:" =~ m/^$today:/i or
       "Fri:" =~ m/^$today:/i or
       "Sat:" =~ m/^$today:/i;

       while (<LOGFILE>) {
           if (m//) { # Now use the default regex
                .
                .
                .
           }
       }
     }

The key to using the default regex is that a match must be successful for it to be set, which is why this example goes to such trouble to get a match after $today has been set. As you can see, it's fairly kludgey, and I wouldn't recommend it.

7.9.3 Understanding the "Pre-Match" Copy

While doing matches and substitutions, Perl sometimes must spend extra time and memory to make a pre-match copy of the target text. As we'll see, sometimes this copy is used in support of important features, but sometimes it's not. When the copy is made but not used, the wasted effort is an inefficiency we'd like to avoid, especially in situations where the target text is very long, or speed is particularly important.

In the next sections, we'll look at when and why Perl might make a pre-match copy of the target text, when the copy is actually used, and how we might avoid the copy when efficiency is at a premium.

7.9.3.1 Pre-match copy supports $1, $&, $', $+, . . .

Perl makes a pre-match copy of the original target text of a match or substitution to support $1, $&, and the other after-match variables that actually hold text (see Section 7.3.3). After each match, Perl doesn't actually create each of these variables because many (or all) may never be used by the program. Rather, Perl just files away a copy of the original text, remembers where in that original string the various matches happened, and then refers to that if and when $1 or the like is actually used. This requires less work up-front, which is good, because often, some or all of these after-match variables are not even used. This is a form of "lazy evaluation," and successfully avoids a lot of unneeded work.

Although Perl saves work by not creating $1 and the like until they're used, it still has to do the work of saving the extra copy of the target text. But why does this really need to be done? Why can't Perl just refer to that original text to begin with? Well, consider:

     $Subject =~ s/^(?:Re:\s*)+//;

After this, $& properly refers to the text that was removed from $Subject, but since it was removed from $Subject, Perl can't refer to $Subject itself when providing for a subsequent use of $&. The same logic applies for something like:

     if ($Subject =~ m/^SPAM:(.+)/i) {
         $Subject = "-- spam subject removed --";
         $SpamCount{$1}++;
     }

By the time $1 is referenced, the original $Subject has been erased. Thus, Perl must make an internal pre-match copy.

7.9.3.2 The pre-match copy is not always needed

In practice, the primary "users" of the pre-match copy are $1, $2, $3, and the like. But what if a regex doesn't even have capturing parentheses? If it doesn't, there's no need to even worry about $1, so any work needed to support it can be bypassed. So, at least those regexes that don't have capturing parentheses can avoid the costly copy? Not always . . .

7.9.3.2.1 The variables $`, $&, and $' are naughty

The three variables $`, $&, and $' aren't related to capturing parentheses. As the text before, of, and after the match, they can potentially apply to every match and substitution. Since it's impossible for Perl to tell which match any particular use of one of these variables refers to, Perl must make the pre-match copy every time.

It might sound like there's no opportunity to avoid the copy, but Perl is smart enough to realize that if these variables do not appear in the program, anywhere (including in any library that might be used), the blind copying to support them is no longer needed. Thus, ensuring that you don't use $`, $&, and $' allows all matches without capturing parentheses to dispense with the pre-match copy, a handsome optimization! Having even one $`, $&, or $' anywhere in the program means the optimization is lost. How unsociable! For this reason, I call these three variables "naughty."

7.9.3.3 How expensive is the pre-match copy?

I ran a simple benchmark, checking m/c/ against each of the 130,000 lines of C that make up the main Perl source. The benchmark noted whether a 'c' appeared on each line, but didn't do anything further, since the goal was to determine the effect of the behind-the-scenes copying. I ran the test two different ways: once where I made sure not to trigger the pre-match copy, and once where I made sure to do so. The only difference, therefore, was in the extra copy overhead.

The run with the pre-match copying consistently took over 40 percent longer than the one without. This represents an "average worst case," so to speak, since the benchmark didn't do any "real work," whose time would reduce the relative relevance of (and perhaps overshadow) the extra overhead.

On the other hand, in true worst-case scenarios, the extra copy might truly be an overwhelming portion of the work. I ran the same test on the same data, but this time as one huge line incorporating the more than 3.5 megabytes of data, rather than the 130,000 or so reasonably sized lines. Thus, the relative performance of a single match can be checked. The match without the pre-match copy returned almost immediately, since it was sure to find a 'c' somewhere near the start of the string. Once it did, it was finished. The test with the pre-match copy is the same except that it had to make a copy of the huge string first. It took over 7,000 times longer! Knowing the ramifications, therefore, of certain constructs allows you to tweak your code for better efficiency.

7.9.3.4 Avoiding the pre-match copy

It would be nice if Perl knew the programmer's intentions and made the copy only as necessary. But remember, the copies are not "bad" — Perl's handling of these bookkeeping drudgeries behind the scenes is why we use it and not, say, C or assembly language. Indeed, Perl was first developed in part to free users from the mechanics of bit fiddling so they could concentrate on creating solutions to problems.

Never use naughty variables. Still, it's nice to avoid the extra work if possible. Foremost, of course, is to never use $`, $&, or $' anywhere in your code. Often, $& is easy to eliminate by wrapping the regex with capturing parentheses, and using $1 instead. For example, rather than using s/<\w+>/\L$&\E/g to lowercase certain HTML tags, use s/(<\w+>)/\L$1\E/g instead.

$` and $' can often be easily mimicked if you still have an unmodified copy of the original target string. After a match against a given target, the following shows valid replacements:

Variable   Mimicked with
$`         substr(target, 0, $-[0])
$&         substr(target, $-[0], $+[0] - $-[0])
$'         substr(target, $+[0])

Since @- and @+ (see Section 7.3.3) are arrays of positions in the original target string, rather than actual text in it, they can be safely used without an efficiency penalty.

I've included a substitute for $& in there as well. This may be a better alternative to wrapping with capturing parentheses and using $1, as it may allow you to eliminate capturing parentheses altogether. Remember, the whole point of avoiding $& and friends is to avoid the copy for matches that have no capturing parentheses. If you make changes to your program to eliminate $&, but end up adding capturing parentheses to every match, you haven't saved anything.
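To make the substr replacements concrete, here is a minimal sketch applied to a made-up subject line. The variable names and the sample string are assumptions for illustration; note that the regex has no capturing parentheses.

```perl
use strict;
use warnings;

my $target = 'Subject: Re: lunch plans';   # sample data, left unmodified after the match

my ($before, $matched, $after);
if ($target =~ m/Re:/) {
    $before  = substr($target, 0, $-[0]);               # text before the match
    $matched = substr($target, $-[0], $+[0] - $-[0]);   # text of the match
    $after   = substr($target, $+[0]);                  # text after the match
}

print join('|', $before, $matched, $after), "\n";
```

This works only while $target still holds the text that was matched against; if the string has since been modified, the saved offsets no longer describe it.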

Don't use naughty modules. Of course, part of not using $`, $&, or $' is to not use modules that use them. The core modules that come with Perl do not use them, except for the English module. If you wish to use that module, you can have it not apply to these three variables by invoking it as:

     use English '-no_match_vars';

This makes it safe. If you download modules from CPAN or elsewhere, you may wish to check to see if they use the variables. See the sidebar below for a technique to check to see if your program is infected with any of these variables.

How to Check Whether Your Code is Tainted by $&

It's not always easy to notice whether your program is naughty (references $&, $`, or $'), especially with the use of libraries, but there are several ways to find out. The easiest is perhaps to use the -c and -Mre=debug command-line arguments (see Section 7.9.6) and look toward the end of the output for either 'Enabling $` $& $' support' or 'Omitting $` $& $' support'. If it's enabled, the code is tainted.

It's possible (but unlikely) that the code could be tainted by the use of a naughty variable within an eval that's not known to Perl until it's executed. One option to catch those as well is to install the Devel::SawAmpersand package from CPAN (www.cpan.org):

     END {
        require Devel::SawAmpersand;
        if (Devel::SawAmpersand::sawampersand) {
            print "Naughty variable was used!\n";
        }
     }

Included with Devel::SawAmpersand comes Devel::FindAmpersand, a package that purportedly shows you where the offending variable is located. Unfortunately, it doesn't work reliably with the latest versions of Perl. Also, they both have some installation issues, so your mileage may vary. (Check regex.info/ for possible updates.)

Also, it may be interesting to see how you can check for naughtiness by just checking for the performance penalty:


     use Time::HiRes;

     sub CheckNaughtiness()
     {
       my $text = 'x' x 10_000; # Create some non-small amount of data.

       # Calculate the overhead of a do-nothing loop.
       my $start = Time::HiRes::time();
       for (my $i = 0; $i < 5_000; $i++) { }
       my $overhead = Time::HiRes::time() - $start;

       # Now calculate the time for the same number of simple matches.
       $start = Time::HiRes::time();
       for (my $i = 0; $i < 5_000; $i++) { $text =~ m/^/ }
       my $delta = Time::HiRes::time() - $start;

       # A differential of 5 is just a heuristic.
       printf "It seems your code is %s (overhead=%.2f, delta=%.2f)\n",
         ($delta > $overhead*5) ? "naughty" : "clean", $overhead, $delta;
     }


7.9.4 The Study Function

In contrast to optimizing the regex itself, study(···) optimizes certain kinds of searches of a string. After studying a string, a regex (or multiple regexes) can benefit from the cached knowledge when applied to the string. It's generally used like this:


     while (<>)
     {
        study($_); # Study the default target $_ before doing lots of matches on it
        if (m/regex 1/) { ··· }
        if (m/regex 2/) { ··· }
        if (m/regex 3/) { ··· }
        if (m/regex 4/) { ··· }
     }

What study does is simple, but understanding when it's a benefit can be quite difficult. It has no effect whatsoever on any values or results of a program—the only effects are that Perl uses more memory, and that overall execution time might increase, stay the same, or (here's the goal) decrease.

When a string is studied, Perl takes some time and memory to build a list of the places in the string where each character is found. (On most systems, the memory required is four times the size of the string.) study's benefit can be realized with each subsequent regex match against the string, but only until the string is modified. Any modification of the string invalidates the study list, as does studying a different string.

How helpful it is to have the target string studied is highly dependent on the regex matching against it, and the optimizations that Perl is able to apply. For example, searching for literal text with m/foo/ can see a huge speedup due to study (with large strings, speedups of 10,000x are possible). But, if /i is used, that speedup evaporates, as /i currently removes the benefit of study (as well as some other optimizations).

7.9.4.1 When not to use study

  • Don't use study on strings you intend to check only with /i, or when all literal text is governed by (?i) or (?i:···), as these disable the benefits of study.

  • Don't use study when the target string is short. In such cases, the normal fixed-string cognizance optimization should suffice (see Section 6.4.6). How short is "short"? String length is just one part of a large, hard-to-pin-down mix, so when it comes down to it, only benchmarking your expressions on your data will tell you if study is a benefit. But for what it's worth, I generally don't even consider study unless the strings are at least several kilobytes long.

  • Don't use study when you plan only a few matches against the target string before it's modified, or before you study a different string. An overall speedup is more likely if the time spent to study a string is amortized over many matches. With just a few matches, the time spent building the study list can overshadow any savings.

  • Use study only on strings that you intend to search with regular expressions having "exposed" literal text (see Section 6.5.2). Without a known character that must appear in any match, study is useless. (Along these lines, one might think that study would benefit the index function, but it doesn't seem to.)

7.9.4.2 When study can help

study is best used when you have a large string you intend to match many times before the string is modified. A good example is a filter I use in the preparation of this book. I write in a home-grown markup that the filter converts to SGML (which is then converted to troff, which is then converted to PostScript). Within the filter, an entire chapter eventually ends up within one huge string (for instance, this chapter is about 475KB). Before exiting, I apply a bevy of checks to guard against mistaken markup leaking through. These checks don't modify the string, and they often look for fixed strings, so they're what study thrives on.
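The shape of such a filter's final checks might look like the sketch below. Everything in it is hypothetical: the generated "chapter," the check names, and the literal strings being searched for are stand-ins, not the actual markup or checks used for this book. One caveat worth stating loudly: in Perl 5.16 and later, study is documented as a no-op, so on those versions the call is harmless but has no effect, and any benefit applies only to older Perls.

```perl
use strict;
use warnings;

# Sample stand-in for a large chapter held in one string.
my $chapter = join '', map { "Paragraph $_ of plain text.\n" } 1 .. 5_000;
$chapter .= "A stray <<FIXME>> marker slipped through.\n";

# Hypothetical checks, each looking for fixed literal text.
my %checks = (
    'leftover FIXME marker' => qr/<<FIXME>>/,
    'leftover TODO note'    => qr/TODO:/,
    'doubled period'        => qr/\.\./,
);

study($chapter);   # study once, then run many read-only matches

# None of the checks modifies $chapter, so the study data (where it
# exists) stays valid for all of them.
my @problems = grep { $chapter =~ $checks{$_} } sort keys %checks;

print "problem: $_\n" for @problems;
```

As the section says, only benchmarking on real data can show whether the study call pays for itself.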

7.9.5 Benchmarking

If you really care about efficiency, it may be best to try benchmarking. Perl comes standard with the Benchmark module, which has fine documentation ("perldoc Benchmark"). Perhaps more out of habit than anything else, I tend to write my benchmarks from scratch. After

     use Time::HiRes 'time';

I wrap what I want to test in something simple like:


     my $start = time;
       .
       .
       .
     my $delta = time - $start;
     printf "took %.1f seconds\n", $delta;

Important issues with benchmarking include making sure to benchmark enough work to show meaningful times, and to benchmark as much of the work you want to measure while benchmarking as little of the work you don't. This is discussed in more detail in Chapter 6 (see Section 6.3). It might take some time to get used to benchmarking in a reasonable way, but the results can be quite enlightening and rewarding.
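To make those two points concrete, here is a small sketch of the from-scratch approach. The target string, the regex, and the iteration count are all assumptions chosen for illustration; substitute your own data. The setup happens before the clock starts, and the loop repeats the match enough times that the reported time is not dominated by noise.

```perl
use strict;
use warnings;
use Time::HiRes 'time';

# Setup, deliberately outside the timed region (hypothetical data).
my $target     = ('x' x 1000) . 'Subject: test';
my $iterations = 100_000;    # enough work to give a meaningful time
my $count      = 0;

my $start = time;
for (1 .. $iterations) {
    # Only the work we actually want to measure goes in here.
    $count++ if $target =~ m/Subject: (.*)/;
}
my $delta = time - $start;

printf "took %.3f seconds for %d matches\n", $delta, $count;
```

If the reported time is only a few hundredths of a second, it is probably worth raising the iteration count before drawing conclusions.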

7.9.6 Regex Debugging Information

Perl carries out a phenomenal number of optimizations to try to arrive at a match result quickly; some of the less esoteric ones are listed in Chapter 6's "Common Optimizations" (see Section 6.4), but there are many more. Most optimizations apply to only very specific cases, so any particular regex benefits from only some (or none) of them.

Perl has debugging modes that tell you about some of the optimizations. When a regex is first compiled, Perl figures out which optimizations go with the regex, and the debugging mode reports on some of them. The debugging modes can also tell you a lot about how the engine actually applies that expression. A detailed analysis of this debugging information is beyond the scope of even this book, but I'll provide a short introduction here.

You can turn on the debugging information by putting use re 'debug'; in your code, and you can turn it back off with no re 'debug';. Even without an explicit no, it turns off automatically at the end of the block or file in which the use is placed. (We've seen this use re pragma before, with different arguments, to allow embedded code in interpolated variables; see Section 7.8.3.)
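Here is a minimal sketch of that scoping, assuming a reasonably modern Perl. The sample string and variable names are hypothetical. The debugging output goes to STDERR, and only the regex inside the braces should generate any.

```perl
use strict;
use warnings;

my $line = "Subject: hello";    # hypothetical sample data
my ($inside, $outside) = (0, 0);

{
    use re 'debug';    # in effect only until the closing brace
    $inside = 1 if $line =~ m/^Subject: (.*)/;
}

# Back outside the block, this match should produce no debugging output.
$outside = 1 if $line =~ m/^Subject:/;

print "inside: $inside, outside: $outside\n";
```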

Alternatively, if you want to turn it on for the entire script, you can use the -Mre=debug command-line argument. This is particularly useful just for inspecting how a single regex is compiled. Here's an example (edited to remove some lines that are not of interest):

[1] % perl -cw -Mre=debug -e 'm/^Subject: (.*)/'
[2] Compiling REx '^Subject: (.*)'
[3] rarest char j at 3
[4]     1: BOL(2)
[5]     2: EXACT <Subject: >(6)
          ·
          ·
          ·
[6]    12: END(0)
[7] anchored 'Subject: ' at 0 (checking anchored) anchored(BOL) minlen 9
[8] Omitting $` $& $' support.

At [1], I invoke perl at my shell prompt, using the command-line flags -c (which means to check the script, but don't actually execute it), -w (issue warnings about things Perl thinks are dubious — always used as a matter of principle), and -Mre=debug to turn on regex debugging. The -e flag means that the following argument, 'm/^Subject:•(.*)/', is actually a mini Perl program to be run or checked.

Line [3] reports the "rarest" character (the least common, as far as Perl guesses) from among those in the longest fixed substring part of the regex. Perl uses this for some optimizations (such as the pre-check of a required character or substring; see Section 6.4.4.2).

Lines [4] through [6] represent Perl's compiled form of the regex. For the most part, we won't be concerned much about it here. However, in even a casual look, line [5] sticks out as understandable.

Line [7] is where most of the action is. Some of the information that might be shown here includes:

anchored 'string' at offset

Indicates that any match must have the given string, starting offset characters from the start of the match. If '$' is shown immediately after 'string', the string also ends the match.



floating 'string' at from..to

Indicates that any match must have the given string, but that it could start anywhere between from and to characters into the match. If '$' is shown immediately after 'string', the string also ends the match.



stclass 'list'

Shows the list of characters with which a match can begin.



anchored(MBOL), anchored(BOL), anchored(SBOL)

The regex leads with ^. The MBOL version appears when the /m modifier is used, while BOL and SBOL appear when it is not used. (The difference between BOL and SBOL is not relevant for modern Perl. SBOL relates to the regex-related $* variable, which has long been deprecated.)



anchored(GPOS)

The regex leads with \G.



implicit

The anchored(MBOL) is an implicit one added by Perl because the regex begins with .* (dot-star).



minlen length

Any match is at least length characters long.



with eval

The regex has (?{···}) or (??{···}).



Line [8] is not related to any particular regex. After loading the whole program, Perl reports if support for $& and friends has been enabled (see Section 7.9.3.2.1).
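To see several of these items for yourself, a small script along the following lines may help. It is only a sketch: the four regexes are hypothetical examples chosen to provoke different summary items, and the exact wording of the debugging output varies from one Perl version to the next, so the comments describe what to look for rather than what you will certainly see.

```perl
use strict;
use warnings;
use re 'debug';    # compile-time summaries are written to STDERR

# Leads with ^ and has fixed text at the start:
# look for an "anchored" string at offset 0 and a BOL-style anchor.
my $anchored = qr/^Subject: (.*)/;

# Fixed text preceded by a variable-length part:
# look for a "floating" entry mentioning 'foo'.
my $floating = qr/\d+foo/;

# Leads with \G: look for a GPOS-style anchor.
my $gpos = qr/\Gabc/;

# Begins with dot-star: look for "implicit" in the summary.
my $implicit = qr/.*foo/;

print "compiled four regexes\n";
```

Using qr/···/ here compiles each regex without applying it to anything, which keeps the output limited to the compile-time summaries.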

7.9.6.1 Run-time debugging information

We've already seen how we can use embedded code to get information about how a match progresses (see Section 7.8.2), but Perl's regex debugging can show much more. If you omit the -c compile-only option, Perl displays quite a lot of information detailing just how each match progresses.

If you see "Match rejected by optimizer," it means that one of the optimizations enabled the transmission to realize that the regex could never match the target text, and so the application is bypassed altogether. Here's an example:


     % perl -w -Mre=debug -e '"this is a test" =~ m/^Subject:/;'
         .
         .
         .
     Did not find anchored substr 'Subject:'···
     Match rejected by optimizer

When debugging is turned on, you'll see the debugging information for any regular expressions that are used, not necessarily just your own. For example


     % perl -w -Mre=debug -e 'use warnings'
     . . . lots of debugging information . . .
                      .
                      .
                      .

does nothing more than load the warnings module, but because that module has regular expressions, you see a lot of debugging information.

7.9.6.2 Other ways to invoke debugging messages

I've mentioned that you can use "use re 'debug';" or -Mre=debug to turn on regex debug information. However, if you use debugcolor instead of debug with either of these, and if you are using a terminal that understands ANSI terminal control escape sequences, the information is shown with highlighting that makes the output easier to read.

Another option is that if your perl binary has been compiled with extra debugging support turned on, you can use the -Dr command-line flag as a shorthand for -Mre=debug.
