2.2 Matching Text with Regular Expressions

Perl uses regular expressions in many ways, the simplest being to check if a regex matches text (or some part thereof) held in a variable. This snippet checks the string held in variable $reply and reports whether it contains only digits:
if ($reply =~ m/^[0-9]+$/) {
print "only digits\n";
} else {
print "not only digits\n";
}
The mechanics of the first line might seem a bit strange: the regular expression is ^[0-9]+$, while the surrounding m/···/ tells Perl what to do with it (attempt a match).
Don't confuse =~ with = or == . The operator == tests whether two numbers are the same. (The operator eq , as we will soon see, is used to test whether two strings are the same.) The = operator is used to assign a value to a variable, as with $celsius = 20 . Finally, =~ links a regex search with the target string to be searched. In the example, the search is m/^[0-9]+$/ and the target is $reply. Other languages approach this differently, and we'll see examples in the next chapter. It might be convenient to read =~ as "matches," such that if ($reply =~ m/^[0-9]+$/) becomes: if the text contained in the variable $reply
matches the regex ^[0-9]+$, then ...
The whole result of
$reply =~ m/^[0-9]+$/
is a true value if the regex matches the string held in $reply, and a false value otherwise.
Note that a test such as
$reply =~ m/[0-9]+/
(the same as before except the wrapping caret and dollar have been removed) would be true if $reply contained at least one digit anywhere. The surrounding ^ and $ ensure that $reply contains only digits.
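To make the difference between the anchored and unanchored tests concrete, here is a minimal sketch. The helper names only_digits and has_digit and the sample strings are made up for illustration; only the two regexes come from the text above.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two tests discussed above.
sub only_digits { my ($reply) = @_; return $reply =~ m/^[0-9]+$/ ? 1 : 0; }
sub has_digit   { my ($reply) = @_; return $reply =~ m/[0-9]+/   ? 1 : 0; }

# Sample values chosen for illustration only.
for my $reply ("2024", "abc123", "hello") {
    printf "%-8s only digits: %d, has a digit: %d\n",
        "'$reply'", only_digits($reply), has_digit($reply);
}
```

A value such as abc123 passes the unanchored test but fails the anchored one, which is the case the caret and dollar are there to reject.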
Let's combine the last two examples. We'll prompt the user to enter a value, accept that value, and then verify it with a regular expression to make sure it's a number. If it is, we calculate and display the Fahrenheit equivalent. Otherwise, we issue a warning message:

print "Enter a temperature in Celsius:\n";
$celsius = <STDIN>;   # this reads one line from the user
chomp($celsius);      # this removes the ending newline from $celsius
if ($celsius =~ m/^[0-9]+$/) {
    $fahrenheit = ($celsius * 9 / 5) + 32;   # calculate Fahrenheit
    print "$celsius C is $fahrenheit F\n";
} else {
    print "Expecting a number, so I don't understand \"$celsius\".\n";
}

Notice in the last print how we escaped the quotes to be printed, to distinguish them from the quotes that delimit the string? As with literal strings in most languages, there are occasions to escape some items, and this is very similar to escaping a metacharacter in a regex. The relationship between a string and a regex isn't quite as important with Perl, but is extremely important with languages like Java, Python, and the like. The section "A short aside: metacharacters galore" (see Section 2.2.3.1) discusses this in a bit more detail. (One notable exception is VB.NET, which requires '""' rather than '\"' to get a double quote into a string literal.)

If we put this program into the file c2f, we might run it and see:

% perl -w c2f
Enter a temperature in Celsius:
22
22 C is 71.599999999999994316 F

Oops. As it turns out, Perl's simple print isn't so good when it comes to floating-point numbers. I don't want to get bogged down describing all the details of Perl in this chapter, so I'll just say without further comment that you can use printf ("print formatted") to make this look better:

printf "%.2f C is %.2f F\n", $celsius, $fahrenheit;

The printf function is similar to the C language's printf, or the format of Pascal, Tcl, elisp, and Python. It doesn't change the values of the variables, but merely how they are displayed.
The result is now much nicer:
Enter a temperature in Celsius:
22
22.00 C is 71.60 F
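The point that formatting affects only the display can be checked with a short sketch. The input is hard-coded instead of read from the user, and sprintf is brought in here only as an assumption of convenience: it takes the same format string as printf but returns the text instead of printing it.

```perl
use strict;
use warnings;

my $celsius    = 22;
my $fahrenheit = ($celsius * 9 / 5) + 32;   # same formula as the program above

# sprintf builds the formatted string; printf would print the same text.
my $line = sprintf "%.2f C is %.2f F", $celsius, $fahrenheit;
print "$line\n";

# The variables themselves still hold the unformatted numbers.
print "celsius is still a plain number\n" if $celsius == 22;
```

Only the generated string carries the two decimal places; $celsius and $fahrenheit remain available for further arithmetic.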
2.2.1 Toward a More Real-World Example

Let's extend this example to allow negative and fractional temperature values. The
math part of the program is fine — Perl normally makes no distinction between
integers and floating-point numbers. We do, however, need to modify the regex to
let negative and floating-point values pass. We can insert a leading [-+]? to allow an optional sign.
To allow an optional decimal part, we add (\.[0-9]*)? after the digits.
Putting this all together, we get

if ($celsius =~ m/^[-+]?[0-9]+(\.[0-9]*)?$/) {

as our check line. It allows numbers such as 32, -3.723, and +98.6. It is actually not quite perfect: it doesn't allow a number that begins with a decimal point (such as .357). Of course, the user can just add a leading zero to allow it to match (e.g., 0.357), so I don't consider it a major shortcoming. This floating-point problem can have some interesting twists, and I look at it in detail in Chapter 5 (see Section 5.2.5).

2.2.2 Side Effects of a Successful Match

Let's extend the example further to allow someone to enter a value in either
Fahrenheit or Celsius. We'll have the user append a C or F to the temperature
entered. To let this pass our regular expression, we can simply add [CF] after the part of the regex that matches the number.
In Chapter 1, we saw how some versions of egrep support
We'll see examples of how other languages do this in the next chapter (see Section 3.4.5), but Perl provides the access via the variables $1, $2, $3, etc., which refer to the text matched by the first, second, third, etc., parenthesized subexpression. As odd as it might seem, these are variables. The variable names just happen to be numbers. Perl sets them every time the application of a regex is successful. To summarize, use the metacharacter
To keep the example uncluttered and focus on what's new, I'll remove the fractional-value part of the regex for now, but we'll return to it again soon. So, to see $1 in action, compare:

$celsius =~ m/^[-+]?[0-9]+[CF]$/
$celsius =~ m/^([-+]?[0-9]+)([CF])$/

Do the added parentheses change the meaning of the expression? Well, to answer
that, we need to know whether they provide grouping for star or other quantifiers,
or provide an enclosure for
Figure 1. Capturing parentheses

Figure 2. Temperature-conversion program's logic flow

Temperature-conversion program

print "Enter a temperature (e.g., 32F, 100C):\n";
$input = <STDIN>;   # This reads one line from the user.
chomp($input);      # This removes the ending newline from $input.

if ($input =~ m/^([-+]?[0-9]+)([CF])$/)
{
    # If we get in here, we had a match. $1 is the number, $2 is "C" or "F".
    $InputNum = $1;   # Save to named variables to make the ...
    $type     = $2;   # ... rest of the program easier to read.

    if ($type eq "C") {   # 'eq' tests if two strings are equal
        # The input was Celsius, so calculate Fahrenheit
        $celsius = $InputNum;
        $fahrenheit = ($celsius * 9 / 5) + 32;
    } else {
        # If not "C", it must be an "F", so calculate Celsius
        $fahrenheit = $InputNum;
        $celsius = ($fahrenheit - 32) * 5 / 9;
    }
    # At this point we have both temperatures, so display the results:
    printf "%.2f C is %.2f F\n", $celsius, $fahrenheit;
} else {
    # The initial regex did not match, so issue a warning.
    print "Expecting a number followed by \"C\" or \"F\",\n";
    print "so I don't understand \"$input\".\n";
}

If the above program is named convert, we can use it like this:

% perl -w convert
Enter a temperature (e.g., 32F, 100C):
39F
3.89 C is 39.00 F
% perl -w convert
Enter a temperature (e.g., 32F, 100C):
39C
39.00 C is 102.20 F
% perl -w convert
Enter a temperature (e.g., 32F, 100C):
oops
Expecting a number followed by "C" or "F",
so I don't understand "oops".

2.2.3 Intertwined Regular Expressions

With advanced programming languages like Perl, regex use can become quite intertwined with the logic of the rest of the program. For example, let's make three useful changes to our program: allow floating-point numbers as we did earlier, allow for the f or c entered to be lowercase, and allow spaces between the number and letter. Once all these changes are done, input such as '98.6•f' will be allowed.

Earlier, we saw how we can allow floating-point numbers by adding (\.[0-9]*)? to the expression:
if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)([CF])$/)
Notice that it is added inside the first set of parentheses. Since we use that first set to capture the number to compute, we want to make sure that they capture the fractional portion as well. However, the added set of parentheses, even though ostensibly used only to group for the question mark, also has the side effect of capturing into a variable. Since the opening parenthesis of the pair is the second (from the left), it captures into $2. This is illustrated in Figure 2-3.

Figure 3. Nesting parentheses
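The numbering of nested capturing parentheses can be checked with a small sketch. The helper name capture_parts and the sample inputs are hypothetical; the regex is the one shown above. Note the assumption made for display purposes: when the optional fractional group does not take part in the match, $2 is undefined, and the sketch shows it as an empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three captured pieces, or an empty list
# if the regex does not match.
sub capture_parts {
    my ($input) = @_;
    if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)([CF])$/) {
        # $2 is undefined when there is no fractional part; show it as "".
        return ($1, defined $2 ? $2 : "", $3);
    }
    return ();
}

# Sample inputs chosen for illustration only.
for my $input ("98.6F", "-3C", "oops") {
    my @parts = capture_parts($input);
    if (@parts) {
        print "$input: \$1='$parts[0]' \$2='$parts[1]' \$3='$parts[2]'\n";
    } else {
        print "$input: no match\n";
    }
}
```

For 98.6F, the outer pair captures the whole number, the inner pair captures only the fractional portion, and the letter lands in the third variable, because groups are numbered by the position of their opening parenthesis.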
Figure 2-3 illustrates how closing parentheses nest with opening ones. Adding a
set of parentheses earlier in the expression doesn't influence the meaning of
Next, allowing spaces between the number and letter is easier. We know that an
unadorned space in a regex requires exactly one space in the matched text, so a space followed by a star allows any number of spaces (but still doesn't require any):
if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?) *([CF])$/)
This does give a limited amount of flexibility to the user of our program, but since
we are trying to make something useful in the real world, let's construct the regex
to also allow for other kinds of whitespace as well. Tabs, for instance, are quite
common. Writing
Compare that with
In this book, spaces and tabs are easy to notice because of the • and Some other Perl convenience metacharacters are
2.2.3.1 A short aside: metacharacters galore

We saw \n in earlier examples, but in those cases, it was in a string, not a regular
expression. Like most languages, Perl strings have metacharacters of their own,
and these are completely distinct from regular expression metacharacters. It is a
common mistake for new programmers to get them confused. (VB.NET is a
notable language that has very few string metacharacters.) Some of these string
metacharacters conveniently look exactly the same as some comparable regex
metacharacters. You can use the string metacharacter \t to get a tab into your
string, while you can use the regex metacharacter \t to match a tab in your regex.
The similarity is convenient, but I can't stress enough how important it is to maintain the distinction between the different types of metacharacters. It may not seem important for such a simple example as \t, but as we'll later see when looking at numerous different languages and tools, knowing which metacharacters are being used in each situation is extremely important.

We have already seen multiple sets of metacharacters conflict. In Chapter 1, while working with egrep, we generally wrapped our regular expressions in single quotes. The whole egrep command line is written at the command-shell prompt, and the shell recognizes several of its own metacharacters. For example, to the shell, the space is a metacharacter that separates the command from the arguments and the arguments from each other. With many shells, single quotes are metacharacters that tell the shell to not recognize other shell metacharacters in the text between the quotes. (DOS uses double quotes.)

Using the quotes for the shell allows us to use spaces in our regular expression. Without the quotes, the shell would interpret the spaces in its own way instead of passing them through to egrep to interpret in its way. Many shells also recognize metacharacters such as $, *, ?, and so on, which are characters that we are likely to want to use in a regex.

Now, all this talk about other shell metacharacters and Perl's string metacharacters has nothing to do with regular expressions themselves, but it has everything to do with using regular expressions in real situations. As we move through this book, we'll see numerous (sometimes complex) situations where we need to take advantage of multiple levels of simultaneously interacting metacharacters.

And what about this
2.2.3.2 Generic "whitespace" with \s

While discussing whitespace, we left off with
Our test now looks like:

$input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/

Lastly, we want to allow a lowercase letter as well as uppercase. This is as easy as adding the lowercase letters to the class: [CFcf]. An alternative is to leave the class alone and add a modifier:

$input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i

The added i is called a modifier, and placing it after the m/···/ instructs Perl to do the match in a case-insensitive manner. It's not actually part of the regex, but part of the m/···/ syntactic packaging that tells Perl what you want to do (apply a regex), and which regex to do it with (the one between the slashes). We've seen this type of thing before, with egrep's -i option (see Section 1.4.6).

It's a bit too cumbersome to say "the i modifier" all the time, so normally "/i" is used even though you don't add an extra / when actually using it. This /i notation is one way to specify modifiers in Perl; in the next chapter, we'll see other ways to do it in Perl, and also how other languages allow for the same functionality. We'll also see other modifiers as we move along, including /g ("global match") and /x ("free-form expressions") later in this chapter.

Well, we've made a lot of changes. Let's try the new program:

% perl -w convert
Enter a temperature (e.g., 32F, 100C):
32 f
0.00 C is 32.00 F
% perl -w convert
Enter a temperature (e.g., 32F, 100C):
50 c
10.00 C is 50.00 F

Oops! Did you notice that in the second try we thought we were entering 50° Celsius, yet it was interpreted as 50° Fahrenheit? Looking at the program's logic, do you see why? Let's look at that part of the program again:

if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i)
{
    . . .
    $type = $3;   # save to a named variable to make rest of program more readable

    if ($type eq "C") {   # 'eq' tests if two strings are equal
        . . .
    } else {
        . . .

Although we modified the regex to allow a lowercase f, we neglected to update the rest of the program appropriately. As it is now, if $type isn't exactly 'C', we assume the user entered Fahrenheit.
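The faulty branch can be reproduced in isolation. In this sketch the helper name scale_by_eq is hypothetical; it mirrors only the eq test from the excerpt above and reports which scale the program would assume for each letter the regex now lets through.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the branch above: anything that is not
# exactly "C" is treated as Fahrenheit.
sub scale_by_eq {
    my ($type) = @_;
    return $type eq "C" ? "Celsius" : "Fahrenheit";
}

# The four letters accepted by [CF] with the /i modifier.
for my $type ("C", "c", "F", "f") {
    print "$type -> ", scale_by_eq($type), "\n";
}
```

The lowercase c is reported as Fahrenheit, which is exactly the misreading seen in the second run.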
Since we now also allow 'c' to mean Celsius, we need to update the $type test:

if ($type eq "C" or $type eq "c") {

Actually, since this is a book on regular expressions, perhaps I should use:

if ($type =~ m/c/i) {

In either case, it now works as we want. The final program is shown below. These examples show how the use of regular expressions can become intertwined with the rest of the program.

Temperature-conversion program - final listing

print "Enter a temperature (e.g., 32F, 100C):\n";
$input = <STDIN>;   # This reads one line from the user.
chomp($input);      # This removes the ending newline from $input.

if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i)
{
    # If we get in here, we had a match. $1 is the number, $3 is "C" or "F".
    $InputNum = $1;   # Save to named variables to make the ...
    $type     = $3;   # ... rest of the program easier to read.

    if ($type =~ m/c/i) {   # Is it "c" or "C"?
        # The input was Celsius, so calculate Fahrenheit
        $celsius = $InputNum;
        $fahrenheit = ($celsius * 9 / 5) + 32;
    } else {
        # If not "C", it must be an "F", so calculate Celsius
        $fahrenheit = $InputNum;
        $celsius = ($fahrenheit - 32) * 5 / 9;
    }
    # At this point we have both temperatures, so display the results:
    printf "%.2f C is %.2f F\n", $celsius, $fahrenheit;
} else {
    # The initial regex did not match, so issue a warning.
    print "Expecting a number followed by \"C\" or \"F\",\n";
    print "so I don't understand \"$input\".\n";
}

2.2.4 Intermission

Although we have spent much of this chapter coming up to speed with Perl, we've encountered a lot of new information about regexes: