7.7 The Split Operator

The multifaceted split operator (often called a function in casual conversation) is commonly used as the converse of a list-context m/···/g (see Section 7.5.3.3). The latter returns text matched by the regex, while a split with the same regex returns text separated by matches. The normal match

    $text =~ m/:/g

applied against a $text of 'IO.SYS:225558:95-10-03:-a-sh:optional' returns the four-element list

    ( ':', ':', ':', ':' )

which doesn't seem useful. On the other hand, split(/:/, $text) returns the five-element list:

    ( 'IO.SYS', '225558', '95-10-03', '-a-sh', 'optional' )

Both examples reflect that the regex matches four times; with split, those four matches partition the target into five chunks, which are returned as the list elements.
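The contrast between the two operators can be checked with a minimal sketch like the following. The variable names @matched and @fields are illustrative only and do not come from the text.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# In list context, m/:/g returns the text each match consumed: four colons.
my @matched = $text =~ m/:/g;

# split with the same regex returns the text between those matches.
my @fields = split(/:/, $text);

printf "%d matches, %d fields\n", scalar(@matched), scalar(@fields);
print join(' | ', @fields), "\n";
```

The counts differ by one because four separators always delimit five chunks.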
That example splits the target string on a single character, but you can split on any arbitrary regular expression. For example,

    @Paragraphs = split(m/\s*<p>\s*/i, $html);

splits the HTML in $html into chunks, at <p> or <P>, surrounded by optional whitespace. You can even split on locations, as with

    @Lines = split(m/^/m, $lines);

to break a string into its logical lines.

In its simplest form with simple data like this, split is as easy to understand as it is useful. However, there are many options, special cases, and special situations that complicate things. Before getting into the details, let me show two particularly useful special cases:
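The list of special cases did not survive in this copy of the text. As an assumption about which two were meant, the sketch below shows two special match operands that Perl's split is documented to support: an empty regex, which splits a string into its characters, and a string holding exactly one space, which splits on runs of whitespace and ignores leading whitespace. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Assumed special case 1: an empty regex splits into individual characters.
my @chars = split(//, "short test");

# Assumed special case 2: a single-space string splits on whitespace runs,
# skipping any leading whitespace.
my @words = split(' ', "   a short   test   ");

print scalar(@chars), " characters\n";
print join(',', @words), "\n";
```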
These and other special cases are discussed a bit later, but first, the next sections go over the basics.

7.7.1 Basic Split

split is an operator that looks like a function, and takes up to three operands:

    split(match operand, target string, chunk-limit operand)
The parentheses are optional. Default values (discussed later in this section) are provided for operands left off the end. split is always used in a list context. Common usage patterns include:

    ($var1, $var2, $var3, ···) = split(···);

    @array = split(···);

    for my $item (split(···)) {
        . . .
    }

7.7.1.1 Basic match operand

The match operand has several special-case situations, but it is normally the same as the regex operand of the match operator. That means that you can use /···/ and m{···} and the like, a regex object, or any expression that can evaluate to a string. Only the core modifiers described in Section 7.2.3 are supported. If you need parentheses for grouping, be sure to use the non-capturing (?:···) kind, because capturing parentheses change what split returns (see Section 7.7.4).
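The different forms of match operand can be compared side by side. This is only a sketch: the sample record and the variable names ($record, $sep, $pattern) are invented for illustration.

```perl
use strict;
use warnings;

my $record = 'alpha, beta;gamma ,delta';

# Regex literal with alternate delimiters; (?:...) groups without capturing.
my @by_literal = split(m{\s*(?:,|;)\s*}, $record);

# A precompiled regex object as the match operand.
my $sep = qr/\s*[,;]\s*/;
my @by_object = split($sep, $record);

# An expression that evaluates to a string is interpreted as a regex.
my $pattern = '\s*[,;]\s*';
my @by_string = split($pattern, $record);

# One of the common usage patterns: iterate over the returned list.
for my $item (@by_literal) {
    print "[$item]\n";
}
```

All three calls partition the record the same way, so the choice among them is mostly about where the regex comes from.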
7.7.1.2 Target string operand

The target string is inspected, but is never modified by split. The content of $_ is the default if no target string is provided.

7.7.1.3 Basic chunk-limit operand

In its primary role, the chunk-limit operand specifies a limit to the number of chunks that split partitions the string into. With the sample data from the first example, split(/:/, $text, 3) returns:

    ( 'IO.SYS', '225558', '95-10-03:-a-sh:optional' )

This shows that split stopped after /:/ matched twice, resulting in the requested three-chunk partition. It could have matched additional times, but that's irrelevant because of this example's chunk limit.

The limit is an upper bound, so no more than that many elements will ever be returned (unless the regex has capturing parentheses, which is covered in a later section). You may still get fewer elements than the chunk limit; if the data can't be partitioned enough to begin with, nothing extra is produced to "fill the count." With our example data, split(/:/, $text, 99) still returns only a five-element list. However, there is an important difference between split(/:/, $text) and split(/:/, $text, 99) which does not manifest itself with this example; keep this in mind when the details are discussed later.

Remember that the chunk-limit operand refers to the chunks between the matches, not to the number of matches themselves. If the limit were to refer to the matches themselves, the previous example with a limit of three would produce

    ( 'IO.SYS', '225558', '95-10-03', '-a-sh:optional' )

which is not what actually happens.

One comment on efficiency: let's say you intended to fetch only the first few fields, such as with:

    ($filename, $size, $date) = split(/:/, $text);

As a performance enhancement, Perl stops splitting after the fields you've requested have been filled. It does this by automatically providing a chunk limit of one more than the number of items in the list.
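The chunk-limit behavior described above can be tried with a short sketch using the same sample data. The array names are illustrative.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# A limit of 3 stops after two matches; the rest stays in the last chunk.
my @three = split(/:/, $text, 3);

# A limit larger than the data allows is only an upper bound.
my @all = split(/:/, $text, 99);

# Assigning to a fixed list of scalars keeps only the leading fields.
my ($filename, $size, $date) = split(/:/, $text);

print scalar(@three), " chunks, last is '$three[2]'\n";
print scalar(@all), " chunks with a limit of 99\n";
print "$filename $size $date\n";
```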
7.7.1.4 Advanced split

split can be simple to use, as with the examples we've seen so far, but it has three special issues that can make it somewhat complex in practice: returning empty elements, special regex operands, and a match operand with capturing parentheses.
The next sections cover these in detail.

7.7.2 Returning Empty Elements

The basic premise of split is that it returns the text separated by matches, but there are times when that returned text is an empty string (a string of length zero, e.g., ""). For example, consider

    @nums = split(m/:/, "12:34::78");

This returns

    ("12", "34", "", "78")

The regex matches three times, so four elements are returned. The empty third element reflects that the regex matched twice in a row, with no text between the two matches.
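A quick check of this example, as a minimal sketch:

```perl
use strict;
use warnings;

# Two adjacent colons leave nothing between them, so the third
# element is an empty string.
my @nums = split(m/:/, "12:34::78");

print scalar(@nums), " elements\n";
print "third element has length ", length($nums[2]), "\n";
```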
7.7.2.1 Trailing empty elements

Normally, trailing empty elements are not returned. For example,

    @nums = split(m/:/, "12:34::78:::");

sets @nums to the same four elements ("12", "34", "", "78") as the previous example, even though the regex was able to match a few extra times at the end of the string. By default, split does not return empty elements at the end of the list. However, you can have split return all trailing elements by using an appropriate chunk-limit operand.

7.7.2.2 The chunk-limit operand's second job

In addition to possibly limiting the number of chunks, any non-zero chunk-limit operand also preserves trailing empty items. (A chunk limit given as zero is exactly the same as if no chunk limit is given at all.) If you don't want to limit the number of chunks returned, but do want to leave trailing empty elements intact, simply choose a very large limit. Or, better yet, use -1, because a negative chunk limit is taken as an arbitrarily large limit: split(/:/, $text, -1) returns all elements, including any trailing empty ones.

At the other extreme, if you want to remove all empty items, you could put grep {length} before the split. This use of grep lets pass only list elements with non-zero lengths (in other words, elements that aren't empty):

    my @NonEmpty = grep { length } split(/:/, $text);

7.7.2.3 Special matches at the ends of the string

A match at the very beginning normally produces an empty element:

    @nums = split(m/:/, ":12:34::78");

That sets @nums to:

    ("", "12", "34", "", "78")

The initial empty element reflects the fact that the regex matched at the beginning of the string. However, as a special case, if the regex doesn't actually match any text when it matches at the start or end of the string, leading and/or trailing empty elements are not produced. A simple example is split(/\b/, "a simple test"), which can match at six locations in 'a simple test' (at the start and end of each of the three words), yet returns the five-element list ('a', ' ', 'simple', ' ', 'test') with no empty element at either end.
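The rules for leading and trailing empty elements can be put side by side in a short sketch. It uses the sample strings from this section; the array names are illustrative, and the \b case is shown only with the default chunk limit.

```perl
use strict;
use warnings;

my $text = "12:34::78:::";

# Default: trailing empty elements are dropped.
my @default = split(/:/, $text);

# A negative chunk limit keeps them.
my @keep_all = split(/:/, $text, -1);

# grep { length } removes every empty element, wherever it appears.
my @non_empty = grep { length } split(/:/, $text);

# A match that consumes text at the start yields a leading empty element.
my @leading = split(/:/, ":12:34::78");

# A zero-width match at the start of the string does not.
my @words = split(/\b/, "a simple test");

printf "%d %d %d %d %d\n",
    scalar(@default), scalar(@keep_all), scalar(@non_empty),
    scalar(@leading), scalar(@words);
```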
7.7.3 Split's Special Regex Operands

split's match operand is normally a regex literal or a regex object, as with the match operator, but there are some special cases:
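The list of special cases is missing from this copy of the text. As an assumption about what it covered, the following sketch shows three behaviors that Perl documents for split: a single-space string differs from the regex / / in how it treats leading whitespace, the regex /^/ is treated as if it had the /m modifier, and an omitted match operand splits $_ on whitespace. The sample strings are invented.

```perl
use strict;
use warnings;

# ' ' skips leading whitespace; the regex / / matches each space literally.
my @special = split(' ', "  a b");
my @literal = split(/ /, "  a b");

# /^/ behaves like /^/m, splitting a string into its lines.
my @lines = split(/^/, "line one\nline two\n");

# With no operands at all, split works on $_ and splits on whitespace.
local $_ = "  x  y  ";
my @default = split;

printf "%d %d %d %d\n",
    scalar(@special), scalar(@literal), scalar(@lines), scalar(@default);
```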
7.7.3.1 Split has no side effects

Note that a split match operand often looks like a match operator, but it has none of the side effects of one. The use of a regex with split doesn't affect the default regex for later match or substitution operators. The variables $&, $', $1, and so on are not set or otherwise affected by a split. A split is completely isolated from the rest of the program with respect to side effects.[8]
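One way to observe this isolation is to run a split between a match and a later look at the capture variables. This is a sketch with invented sample strings, not an exhaustive test of every side effect.

```perl
use strict;
use warnings;

# A successful match sets the capture variables.
'key=value' =~ /(\w+)=(\w+)/ or die "no match";
my @before = ($1, $2);

# The split in between uses its own regex, here even with a capturing group.
my @parts = split(/(,)/, 'a,b,c');

# $1 and $2 still refer to the earlier match.
my @after = ($1, $2);

print "before: @before\n";
print "after:  @after\n";
```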
7.7.4 Split's Match Operand with Capturing Parentheses

Capturing parentheses change the whole face of split. When they are used, the returned list has additional, independent elements interjected for the item(s) captured by the parentheses. This means that some or all text normally not returned by split is now included in the returned list.

For example, as part of HTML processing, split(/(<[^>]*>)/) turns

    ···•and•<B>very•<FONT•color=red>very</FONT>•much</B>•effort···

into:

    ( '···•and•', '<B>', 'very•', '<FONT•color=red>', 'very',
      '</FONT>', '•much', '</B>', '•effort···' )

With the capturing parentheses removed, split(/<[^>]*>/) returns:

    ( '···•and•', 'very•', 'very', '•much', '•effort···' )

The added elements do not count against a chunk limit. (The chunk limit limits the chunks that the original string is partitioned into, not the number of elements returned.)

If there are multiple sets of capturing parentheses, multiple items are added to the list with each match. If there are sets of capturing parentheses that don't contribute to a match, undef elements are inserted for them.
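The three points above (interjected elements, the chunk limit, and undef for non-participating parentheses) can be seen in a minimal sketch. Ordinary spaces stand in for the • markers, and the second and third sample strings are invented.

```perl
use strict;
use warnings;

my $html = '... and <B>very <FONT color=red>very</FONT> much</B> effort...';

# With capturing parentheses, the matched tags are returned as well.
my @with_tags = split(/(<[^>]*>)/, $html);

# Without them, only the text between tags is returned.
my @text_only = split(/<[^>]*>/, $html);

# The chunk limit counts chunks, not the captured separators:
# a limit of 2 yields two chunks plus one captured comma.
my @limited = split(/(,)/, 'a,b,c', 2);

# A set of parentheses that does not take part in a match yields undef.
my @mixed = split(/(-)|(,)/, '1-2,3');

printf "%d %d %d %d\n",
    scalar(@with_tags), scalar(@text_only), scalar(@limited), scalar(@mixed);
```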