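7.3 Regex-Related Perlisms

A variety of general Perl concepts pertain to our study of regular expressions. The next few sections discuss:

  * Expression context (Section 7.3.1)
  * Dynamic scope and regex match effects (Section 7.3.2)
  * Special variables modified by a match (Section 7.3.3)
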
7.3.1 Expression Context

The notion of context is important throughout Perl, and in particular, to the match operator. An expression might find itself in one of three contexts, list, scalar, or void, indicating the type of value expected from the expression. Not surprisingly, a list context is one where a list of values is expected of an expression. A scalar context is one where a single value is expected. These two are very common and of great interest to our use of regular expressions. Void context is one in which no value is expected.

Consider the two assignments:

    $s = expression one;
    @a = expression two;

Because $s is a simple scalar variable (it holds a single value, not a list), it expects a simple scalar value, so the first expression, whatever it may be, finds itself in a scalar context. Similarly, because @a is an array variable and expects a list of values, the second expression finds itself in a list context. Even though the two expressions might be exactly the same, they might return completely different values, and cause completely different side effects while they're at it. Exactly what happens depends on each expression.

For example, the localtime function, if used in a list context, returns a list of values representing the current year, month, date, hour, etc. But if used in a scalar context, it returns a textual version of the current time along the lines of 'Mon Jan 20 22:05:15 2003'. As another example, an I/O operator such as <MYDATA> returns the next line of the file in a scalar context, but returns a list of all (remaining) lines in a list context.

Like localtime and the I/O operator, many Perl constructs respond to their context. The regex operators do as well: the match operator m/···/, for example, sometimes returns a simple true/false value, and sometimes a list of certain match results. All the details are found later in this chapter.

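To make the effect of context concrete, here is a minimal sketch that evaluates the same two expressions (a localtime call and a match) once in scalar context and once in list context. The sample string and variable names are made up for illustration, and localtime is given a fixed argument of 0 only so the field count is easy to check.

```perl
use strict;
use warnings;

# Scalar context: localtime returns a single human-readable string.
my $stamp = localtime(0);

# List context: localtime returns the individual time fields.
my @fields = localtime(0);

# The match operator responds to context in the same way.
my $text = "width=640 height=480";            # hypothetical sample data

my $matched  = ($text =~ m/(\w+)=(\d+)/);     # scalar context: true or false
my @captures = ($text =~ m/(\w+)=(\d+)/);     # list context: the captured text

print "scalar localtime: $stamp\n";
print "list localtime has ", scalar(@fields), " fields\n";
print "scalar match: ", ($matched ? "true" : "false"), "\n";
print "list match: @captures\n";
```

Note that the parentheses on the right-hand side do not create the list context; it is the variable on the left of the assignment ($matched versus @captures) that decides.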
7.3.1.1 Contorting an expression

Not all expressions are natively context-sensitive, so Perl has rules about what happens when a general expression is used in a context that doesn't exactly match the type of value the expression normally returns. To make the square peg fit into a round hole, Perl "contorts" the value to make it fit. If a scalar value is returned in a list context, Perl makes a list containing the single value on the fly. Thus, @a = 42 is the same as @a = (42).

On the other hand, there's no general rule for converting a list to a scalar. If a literal list is given, such as with

    $var = ($this, &is, 0xA, 'list');

the comma-operator returns the last element, 'list', for $var. If an array is given, as with $var = @array, the length of the array is returned.

Some words used to describe how other languages deal with this issue are cast, promote, coerce, and convert, but I feel they are a bit too consistent (boring?) to describe Perl's attitude in this respect, so I use "contort."

7.3.2 Dynamic Scope and Regex Match Effects

Perl's two types of storage (global and private variables) and its concept of dynamic scoping are important to understand in their own right, but are of particular interest to our study of regular expressions because of how after-match information is made available to the rest of the program. The next sections describe these concepts, and their relation to regular expressions.

7.3.2.1 Global and private variables

On a broad scale, Perl offers two types of variables: global and private. Private variables are declared using my(···). Global variables are not declared, but just pop into existence when you use them. Global variables are always visible from anywhere and everywhere within the program, while private variables are visible, lexically, only to the end of their enclosing block.
That is, the only Perl code that can directly access the private variable is the code that falls between the my declaration and the end of the block of code that encloses the my.

The use of global variables is normally discouraged, except for special cases, such as the myriad of special variables like $1, $_, and @ARGV. Regular user variables are global unless declared with my, even if they might "look" private. Perl allows the names of global variables to be partitioned into groups called packages, but the variables are still global. A global variable $Debug within the package Acme::Widget has a fully qualified name of $Acme::Widget::Debug, but no matter how it's referenced, it's still the same global variable. If you use strict, all (non-special) globals must either be referenced via fully-qualified names, or via a name declared with our (our declares a name, not a new variable; see the Perl documentation for details).

7.3.2.2 Dynamically scoped values

Dynamic scoping is an interesting concept that few programming languages provide. We'll see the relevance to regular expressions soon, but in a nutshell, you can have Perl save a copy of the value of a global variable that you intend to modify within a block, and restore the original copy automatically at the time when the block ends. Saving a copy is called creating a new dynamic scope, or localizing.

One reason that you might want to do this is to temporarily update some kind of global state that's maintained in a global variable. Let's say that you're using a package, Acme::Widget, and it provides a debugging flag via the global variable $Acme::Widget::Debug. You can temporarily ensure that debugging is turned on with code like:

    . . .
    {
        local($Acme::Widget::Debug) = 1; # Ensure it's turned on
        # work with Acme::Widget while debugging is on
        . . .
    }
    # $Acme::Widget::Debug is now back to whatever it had been before
    . . .

It's that extremely ill-named function local that creates a new dynamic scope.
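Let me say up front that the call to local does not create a new variable. local is an action, not a declaration. Given a global variable, local does three things:

  1. Saves an internal copy of the variable's value
  2. Copies a new value into the variable (either undef, or a value assigned to the local)
  3. Slates the variable to have its original value restored when execution runs off the end of the block enclosing the local
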
This means that "local" refers only to how long any changes to the variable will last. The localized value lasts as long as the enclosing block is executing. Even if a subroutine is called from within that block, the localized value is seen. (After all, the variable is still a global variable.) The only difference from a non-localized global variable is that when execution of the enclosing block finally ends, the previous value is automatically restored.

An automatic save and restore of a global variable's value is pretty much all there is to local. For all the misunderstanding that has accompanied local, it's no more complex than the snippet on the right of Table 7-4 illustrates.

As a matter of convenience, you can assign a value to local($SomeVar), which is exactly the same as assigning to $SomeVar in place of the undef assignment. Also, the parentheses can be omitted to force a scalar context.

As a practical example, consider having to call a function in a poorly written library that generates a lot of "Use of uninitialized value" warnings. You use Perl's -w option, as all good Perl programmers should, but the library author apparently didn't. You are exceedingly annoyed by the warnings, but if you can't change the library, what can you do short of abandoning -w altogether? Well, you could set a local value of $^W, the in-code warnings flag (the variable name ^W can be either the two characters, caret and 'W', or an actual control-W character):

    {
        local $^W = 0; # Ensure warnings are off.
        UnrulyFunction(···);
    }
    # Exiting the block restores the original value of $^W.

The call to local saves an internal copy of the value of the global variable $^W, whatever it might be. Then that same $^W receives the new value of zero that we immediately scribble in. When UnrulyFunction is executing, Perl checks $^W and sees the zero we wrote, so doesn't issue warnings. When the function returns, our value of zero is still in effect.

So far, everything appears to work just as if local isn't used. However, when the block is exited right after the subroutine returns, the original value of $^W is restored. Your change of the value was local, in time, to the life of the block. You'd get the same effect by making and restoring a copy yourself, as in Table 7-4, but local conveniently takes care of it for you.

For completeness, let's consider what happens if I use my instead of local.[4] Using my creates a new variable with an initially undefined value. It is visible only within the lexical block it is declared in (that is, visible only by the code written between the my and the end of the enclosing block). It does not change, modify, or in any other way refer to or affect other variables, including any global variable of the same name that might exist. The newly created variable is not visible elsewhere in the program, including from within UnrulyFunction. In our example snippet, the new $^W is immediately set to zero but is never again used or referenced, so it's pretty much a waste of effort. (While executing UnrulyFunction and deciding whether to issue warnings, Perl checks the unrelated global variable $^W.)

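The three behaviors just described (local, a hand-rolled save and restore, and my) can be compared side by side. What follows is only a minimal sketch: the global $Debug and the helper debug_state are hypothetical stand-ins invented for the illustration, used in place of $^W so that the result is easy to observe.

```perl
use strict;
use warnings;

our $Debug = 0;                      # hypothetical global flag

sub debug_state { return $Debug }    # reads the global, whoever calls it

my @seen;

{
    local $Debug = 1;                # save old value, install 1 until the block ends
    push @seen, debug_state();       # the subroutine sees the localized value
}
push @seen, debug_state();           # the original value has been restored

{
    my $saved = $Debug;              # the same effect, done by hand
    $Debug = 1;
    push @seen, debug_state();
    $Debug = $saved;
}
push @seen, debug_state();

{
    my $Debug = 1;                   # a new, unrelated private variable
    push @seen, debug_state();       # the subroutine still reads the global
}

print "@seen\n";                     # prints "1 0 1 0 0"
```

The last block is the interesting one: the private $Debug is set to 1, yet debug_state still reports 0, because the subroutine never sees the private variable at all.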
7.3.2.3 A better analogy: clear transparencies

A useful analogy for local is that it provides a clear transparency (like those used with an overhead projector) over a variable, on which you scribble your own changes. You (and anyone else that happens to look, such as subroutines and signal handlers) will see the new values. They shadow the previous value until the point in time that the block is finally exited. At that point, the transparency is automatically removed, in effect, removing any changes that might have been made since the local.

This analogy is actually much closer to reality than saying "an internal copy is made." Using local doesn't actually make a copy, but instead puts your new value earlier in the list of those checked whenever a variable's value is accessed (that is, it shadows the original). Exiting a block removes any shadowing values added since the block started. Values are added manually, with local, but here's the whole reason we've been looking at localization: regex side-effect variables have their values dynamically scoped automatically.

7.3.2.4 Regex side effects and dynamic scoping

What does dynamic scoping have to do with regular expressions? A lot. A number of variables like $& (refers to the text matched) and $1 (refers to the text matched by the first parenthesized subexpression) are automatically set as a side effect of a successful match. They are discussed in detail in the next section. These variables have their value dynamically scoped automatically upon entry to every block.

To see the benefit of this design choice, realize that each call to a subroutine involves starting a new block, which means a new dynamic scope is created for these variables. Because the values before the block are restored when the block exits (that is, when the subroutine returns), the subroutine can't change the values that the caller sees.

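A small runnable sketch may help show this isolation at work. The subroutine inner_match and the sample strings are hypothetical, made up purely for the demonstration; the point is only that the subroutine's own successful match does not disturb the caller's $1.

```perl
use strict;
use warnings;

# Hypothetical helper that runs its own successful match, setting its own $1.
sub inner_match {
    "inner text" =~ m/(\w+)/;
    return $1;                       # the subroutine's capture
}

my ($inner, $outer);
if ("outer text" =~ m/(\w+)/) {
    $inner = inner_match();          # another match happens inside the call
    $outer = $1;                     # still the caller's capture
}

print "inner saw '$inner', outer still sees '$outer'\n";
# prints: inner saw 'inner', outer still sees 'outer'
```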
As an example, consider:

    if ( m/(···)/ )
    {
        DoSomeOtherStuff();
        print "the matched text was $1.\n";
    }

Because the value of $1 is dynamically scoped automatically upon entering each block, this code snippet neither cares, nor needs to care, whether the function DoSomeOtherStuff changes the value of $1 or not. Any changes to $1 by the function are contained within the block that the function defines, or perhaps within a sub-block of the function. Therefore, they can't affect the value this snippet sees with the print after the function returns.

The automatic dynamic scoping is helpful even when not so apparent:

    if ($result =~ m/ERROR=(.*)/) {
        warn "Hey, tell $Config{perladmin} about $1!\n";
    }

The standard library module Config defines an associative array %Config, of which the member $Config{perladmin} holds the email address of the local Perlmaster. This code could be very surprising if $1 were not automatically dynamically scoped, because %Config is actually a tied variable. That means any reference to it involves a behind-the-scenes subroutine call, and the subroutine within Config that fetches the appropriate value when $Config{···} is used invokes a regex match. That match lies between your match and your use of $1, so if $1 were not dynamically scoped, it would be destroyed before you used it. As it is, any changes in $1 during the $Config{···} processing are safely hidden by dynamic scoping.

7.3.2.5 Dynamic scoping versus lexical scoping

Dynamic scoping provides many rewards if used effectively, but haphazard dynamic scoping with local can create a maintenance nightmare, as readers of a program find it difficult to understand the increasingly complex interactions among the lexically dispersed local, subroutine calls, and references to localized variables.

As I mentioned, the my(···) declaration creates a private variable with lexical scope.
A private variable's lexical scope is the opposite of a global variable's global scope, but it has little to do with dynamic scoping (except that you can't local the value of a my variable). Remember, local is just an action, while my is both an action and, importantly, a declaration.

7.3.3 Special Variables Modified by a Match

A successful match or substitution sets a variety of global, read-only variables that are always automatically dynamically scoped. These values never change if a match attempt is unsuccessful, and are always set when a match is successful. When appropriate, they are set to the empty string (a string with no characters in it), or undefined (a "no value" value, similar to, yet testably distinct from, an empty string). Table 7-5 shows examples.

In more detail, here are the variables set after a match:
When a regex is applied repeatedly with the /g modifier, each iteration sets these variables afresh. That's why, for instance, you can use $1 within the replacement operand of s/···/···/g and have it represent a new slice of text with each match.

7.3.3.1 Using $1 within a regex?

The Perl man page makes a concerted effort to point out that \1 is not available as a backreference outside of a regex. (Use the variable $1 instead.) The variable $1 refers to a string of static text matched during some previously completed successful match. On the other hand, \1 is a true regex metacharacter that matches text similar to that matched within the first parenthesized subexpression at the time that the regex-directed NFA reaches the \1. What it matches might change over the course of an attempt as the NFA tracks and backtracks in search of a match.

The opposite question is whether $1 and other after-match variables are available within a regex operand. They are commonly used within the code parts of embedded-code and dynamic-regex constructs (see Section 7.8), but otherwise make little sense within a regex. A $1 appearing in the "regex part" of a regex operand is treated exactly like any other variable: its value is interpolated before the match or substitution operation even begins. Thus, as far as the regex is concerned, the value of $1 has nothing to do with the current match, but rather is left over from some previous match.
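The three points above (a fresh $1 for each /g iteration, \1 as a true backreference, and $1 being interpolated like any other variable) can be seen in a short sketch. The strings and variable names here are made up for illustration only.

```perl
use strict;
use warnings;

# With /g, $1 and $2 are set afresh for each match,
# so every replacement sees its own captures.
my $line = "a=1 b=2 c=3";
(my $swapped = $line) =~ s/(\w)=(\d)/$2=$1/g;

# \1 inside a regex refers to what the first group matched
# during the current attempt.
my $doubled = ("hello" =~ m/(.)\1/) ? $1 : undef;

# $1 inside a regex operand is interpolated before matching starts,
# so it carries the capture left over from an earlier successful match.
my $leftover;
if ("abc" =~ m/(b)/) {
    # $1 is "b" here, so the pattern below is effectively m/^(.)b/
    $leftover = ("xb" =~ m/^(.)$1/) ? "matched" : "did not match";
}

print "$swapped\n";     # prints "1=a 2=b 3=c"
print "$doubled\n";     # prints "l"
print "$leftover\n";    # prints "matched"
```

In the last match, the (.) captures "x", yet the $1 that follows it in the pattern was already replaced with "b" before the match began; a \1 in that position would have required a second "x" instead.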