7.7 The Split Operator

The multifaceted split operator (often called a function in casual conversation) is commonly used as the converse of a list-context m/···/g (see Section 7.5.3.3). The latter returns text matched by the regex, while a split with the same regex returns text separated by matches. The normal match $text =~ m/:/g applied against a $text of 'IO.SYS:225558:95-10-03:-a-sh:optional' returns the four-element list

     ( ':', ':', ':', ':' )

which doesn't seem useful. On the other hand, split(/:/, $text) returns the five-element list:

     ( 'IO.SYS', '225558', '95-10-03', '-a-sh', 'optional' )

Both examples reflect that : matches four times. With split, those four matches partition a copy of the target into five chunks, which are returned as a list of five strings.
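The contrast between the two operators can be checked directly. The following is a minimal sketch that uses the sample data from the text; the variable names @matched and @fields are mine, chosen only for illustration.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# A list-context m/:/g returns the text that matched: four colons.
my @matched = $text =~ m/:/g;

# split returns the text between those matches: five fields.
my @fields = split(/:/, $text);

print scalar(@matched), " matches, ", scalar(@fields), " fields\n";
print join(' | ', @fields), "\n";
```

Four matches always partition the string into five chunks, which is why the two counts differ by one.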

That example splits the target string on a single character, but you can split on any arbitrary regular expression. For example,

     @Paragraphs = split(m/\s*<p>\s*/i, $html);

splits the HTML in $html into chunks, at <p> or <P>, surrounded by optional whitespace. You can even split on locations, as with

     @Lines = split(m/^/m, $lines);

to break a string into its logical lines.
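Here is a small sketch of both splits applied to made-up data. The contents of $html and $lines are assumptions of mine, not taken from any real document.

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only.
my $html  = "First para.\n<P>\nSecond para. <p>Third para.";
my $lines = "line one\nline two\nline three\n";

# Split at <p> or <P>, swallowing any whitespace around the tag.
my @Paragraphs = split(m/\s*<p>\s*/i, $html);

# Split at the start of each logical line. The match consumes no
# text, so each element keeps its trailing newline.
my @Lines = split(m/^/m, $lines);

print "[$_]\n" for @Paragraphs;
print scalar(@Lines), " lines\n";
```

Because m/^/m matches a location rather than any characters, nothing is removed from the data; the string is merely cut at each line start.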

In its simplest form with simple data like this, split is as easy to understand as it is useful. However, there are many options, special cases, and special situations that complicate things. Before getting into the details, let me show two particularly useful special cases:

The special match operand // causes the target string to be split into its component characters. Thus, split(//, "short test") returns a list of ten elements: ("s", "h", "o", ···, "s", "t") .
The special match operand "•" (a normal string with a single space) causes the target string to be split on whitespace, similar to using m/\s+/ as the operand, except that any leading and trailing whitespace is ignored. Thus, split("•", "•••a•short•••test•••") returns the strings 'a', 'short', and 'test'.
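Both special cases are easy to try out. In this sketch the • of the text is written as an ordinary space, and the array names are illustrative only.

```perl
use strict;
use warnings;

# An empty regex splits the string into individual characters.
my @chars = split(//, "short test");

# A string holding one space splits on runs of whitespace and
# ignores whitespace at either end of the target.
my @words = split(' ', "   a short   test   ");

print scalar(@chars), " characters\n";
print join(',', @words), "\n";
```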

These and other special cases are discussed a bit later, but first, the next sections go over the basics.

7.7.1 Basic Split

split is an operator that looks like a function, and takes up to three operands:

     split(match operand, target string, chunk-limit operand)

The parentheses are optional. Default values (discussed later in this section) are provided for operands left off the end.

split is always used in a list context. Common usage patterns include:

     ($var1, $var2, $var3, ···) = split(···);

     @array = split(···);

     for my $item (split(···)) {
        ···
     }

7.7.1.1 Basic match operand

The match operand has several special-case situations, but it is normally the same as the regex operand of the match operator. That means that you can use /···/ and m{···} and the like, a regex object, or any expression that can evaluate to a string. Only the core modifiers described in Section 7.2.3 are supported.

If you need parentheses for grouping, be sure to use the (?:···) non-capturing kind. As we'll see in a few pages, the use of capturing parentheses with split turns on a very special feature.

7.7.1.2 Target string operand

The target string is inspected, but is never modified by split. The content of $_ is the default if no target string is provided.

7.7.1.3 Basic chunk-limit operand

In its primary role, the chunk-limit operand specifies a limit to the number of chunks that split partitions the string into. With the sample data from the first example, split(/:/, $text, 3) returns:

     ( 'IO.SYS', '225558', '95-10-03:-a-sh:optional' )

This shows that split stopped after /:/ matched twice, resulting in the requested three-chunk partition. It could have matched additional times, but that's irrelevant because of this example's chunk limit. The limit is an upper bound, so no more than that many elements will ever be returned (unless the regex has capturing parentheses, which is covered in a later section). You may still get fewer elements than the chunk limit; if the data can't be partitioned enough to begin with, nothing extra is produced to "fill the count." With our example data, split(/:/, $text, 99) still returns only a five-element list. However, there is an important difference between split(/:/, $text) and split(/:/, $text, 99) which does not manifest itself with this example — keep this in mind when the details are discussed later.

Remember that the chunk-limit operand refers to the chunks between the matches, not to the number of matches themselves. If the limit were to refer to the matches themselves, the previous example with a limit of three would produce

     ( 'IO.SYS', '225558', '95-10-03', '-a-sh:optional' )

which is not what actually happens.
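To make the chunk-versus-match distinction concrete, here is a short sketch with the same sample data. The array names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# A limit of 3 means three chunks, so only two colons are consumed.
my @three = split(/:/, $text, 3);

# A limit larger than the data allows adds nothing extra.
my @many = split(/:/, $text, 99);

print join(' | ', @three), "\n";
print scalar(@many), " chunks with a limit of 99\n";
```

The last element of @three holds the unsplit remainder, colons and all.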

One comment on efficiency: let's say you intended to fetch only the first few fields, such as with:

     ($filename, $size, $date) = split(/:/, $text);

As a performance enhancement, Perl stops splitting after the fields you've requested have been filled. It does this by automatically providing a chunk limit of one more than the number of items in the list.
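The automatic limit cannot be observed directly, because the extra chunk is thrown away. The sketch below therefore writes the limit of four explicitly to show what that discarded chunk would contain; the array @explicit is my own illustration, not something Perl exposes.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# Three variables on the left, so per the text Perl behaves as if a
# chunk limit of 4 had been given.
my ($filename, $size, $date) = split(/:/, $text);

# The same limit written out: the fourth chunk is the unsplit rest.
my @explicit = split(/:/, $text, 4);

print "$filename $size $date\n";
print "unsplit remainder: $explicit[3]\n";
```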

7.7.1.4 Advanced split

split can be simple to use, as with the examples we've seen so far, but it has three special issues that can make it somewhat complex in practice:

Returning empty elements
Special regex operands
A regex with capturing parentheses

The next sections cover these in detail.

7.7.2 Returning Empty Elements

The basic premise of split is that it returns the text separated by matches, but there are times when that returned text is an empty string (a string of length zero, e.g., ""). For example, consider

     @nums = split(m/:/, "12:34::78");

This returns

     ("12", "34", "", "78")

The regex : matches three times, so four elements are returned. The empty third element reflects that the regex matched twice in a row, with no text in between.

7.7.2.1 Trailing empty elements

Normally, trailing empty elements are not returned. For example,

     @nums = split(m/:/, "12:34::78:::");

sets @nums to the same four elements

     ("12", "34", "", "78")

as the previous example, even though the regex was able to match a few extra times at the end of the string. By default, split does not return empty elements at the end of the list. However, you can have split return all trailing elements by using an appropriate chunk-limit operand . . .

7.7.2.2 The chunk-limit operand's second job

In addition to possibly limiting the number of chunks, any non-zero chunk-limit operand also preserves trailing empty items. (A chunk limit given as zero is exactly the same as if no chunk limit is given at all.) If you don't want to limit the number of chunks returned, but do want to leave trailing empty elements intact, simply choose a very large limit. Or, better yet, use -1, because a negative chunk limit is taken as an arbitrarily large limit: split(/:/, $text, -1) returns all elements, including any trailing empty ones.
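The chunk limit's effect on trailing empty elements can be seen by counting what comes back. This is a minimal sketch with the data from the earlier example; the array names are illustrative.

```perl
use strict;
use warnings;

my $text = "12:34::78:::";

my @default = split(/:/, $text);       # trailing empties dropped
my @zero    = split(/:/, $text, 0);    # zero acts like no limit at all
my @all     = split(/:/, $text, -1);   # trailing empties preserved

printf "default=%d zero=%d all=%d\n",
    scalar(@default), scalar(@zero), scalar(@all);
```

The six colons partition the string into seven chunks. Only the -1 form returns all seven; the other two stop at the last non-empty one, "78".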

At the other extreme, if you want to remove all empty items, you could put grep {length} before the split. This use of grep lets pass only list elements with non-zero lengths (in other words, elements that aren't empty):

     my @NonEmpty = grep { length } split(/:/, $text);

7.7.2.3 Special matches at the ends of the string

A match at the very beginning normally produces an empty element:

     @nums = split(m/:/, ":12:34::78");

That sets @nums to:

     ("", "12", "34", "", "78")

The initial empty element reflects the fact that the regex matched at the beginning of the string. However, as a special case, if the regex doesn't actually match any text when it matches at the start or end of the string, leading and/or trailing empty elements are not produced. A simple example is split(/\b/, "a simple test"), which can match at six locations in 'a•simple•test': the start and the end of each of the three words. Even though it matches six times, it doesn't return seven elements, but rather only the five elements: ("a", "•", "simple", "•", "test"). Actually, we've already seen this special case, with the @Lines = split(m/^/m, $lines) example in Section 7.7.
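The difference between a match that consumes text and one that does not shows up in the element counts. A brief sketch, with array names of my own choosing:

```perl
use strict;
use warnings;

# The colon matches actual text at the start of the string, so a
# leading empty element is produced.
my @with_text = split(m/:/, ":12:34::78");

# \b matches only a location, so no empty element is produced for
# the match at the very start. The spaces come back as elements.
my @no_text = split(/\b/, "a simple test");

print scalar(@with_text), " elements, first is empty\n";
print join('', map { "[$_]" } @no_text), "\n";
```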

7.7.3 Split's Special Regex Operands

split's match operand is normally a regex literal or a regex object, as with the match operator, but there are some special cases:

An empty regex for split does not mean "Use the current default regex," but to split the target string into a list of characters. We saw this before at the start of the split discussion, noting that split(//, "short test") returns a list of ten elements: ("s", "h", "o", ···, "s", "t").
A match operand that is a string (not a regex) consisting of exactly one space is a special case. It's almost the same as /\s+/, except that leading whitespace is skipped. Trailing whitespace is ignored as well, unless a non-zero chunk-limit operand preserves the trailing empty element it produces. This is all meant to simulate the default input-record-separator splitting that awk does with its input, although it can certainly be quite useful for general use.
If you'd like to keep leading whitespace, just use m/\s+/ directly. If you'd like to keep trailing whitespace, use -1 as the chunk-limit operand.
If no regex operand is given, a string consisting of one space (the special case in the previous point) is used as the default. Thus, a raw split without any operands is the same as split('•', $_, 0).
If the regex ^ is used, the /m modifier (for the enhanced line-anchor match mode) is automatically supplied for you. (For some reason, this does not happen for $.) Since it's so easy to just use m/^/m explicitly, I would recommend doing so, for clarity. Splitting on m/^/m is an easy way to break a multiline string into individual lines.
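The special operands can be compared side by side. The following sketch uses invented sample strings, writes the • of the text as an ordinary space, and uses array names that are assumptions for illustration.

```perl
use strict;
use warnings;

my $target = "  a short  test  ";

# The lone-space string skips leading whitespace. With -1, the
# trailing whitespace shows up as one final empty element.
my @awk   = split(' ', $target);
my @kept  = split(' ', $target, -1);

# A real regex keeps the leading empty element.
my @regex = split(/\s+/, $target);

# With no operands at all, split works on $_ with the space operand.
my @default;
for ("  alpha beta   gamma ") {
    @default = split;
}

# A bare /^/ is treated as /^/m, so it breaks at every line start.
my @lines = split(/^/, "one\ntwo\nthree\n");

printf "awk=%d kept=%d regex=%d default=%d lines=%d\n",
    scalar(@awk), scalar(@kept), scalar(@regex),
    scalar(@default), scalar(@lines);
```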

7.7.3.1 Split has no side effects

Note that a split match operand often looks like a match operator, but it has none of the side effects of one. The use of a regex with split doesn't affect the default regex for later match or substitution operators. The variables $&, $', $1, and so on are not set or otherwise affected by a split. A split is completely isolated from the rest of the program with respect to side effects.^[8]

^[8] Actually, there is one side effect remaining from a feature that has been deprecated for many years, but has not actually been removed from the language yet. If split is used in a scalar context, it writes its results to the @_ variable (which is also the variable used to pass function arguments, so be careful not to use split in a scalar context by accident). use warnings or the -w command-line argument will warn you if split is used in a scalar context.

7.7.4 Split's Match Operand with Capturing Parentheses

Capturing parentheses change the whole face of split. When they are used, the returned list has additional, independent elements interjected for the item(s) captured by the parentheses. This means that some or all text normally not returned by split is now included in the returned list.

For example, as part of HTML processing, split(/(<[^>]*>)/) turns

     ···•and•<B>very•<FONT•color=red>very</FONT>•much</B>•effort···

into:

     ( '···•and•', '<B>', 'very•', '<FONT•color=red>',
       'very', '</FONT>', '•much', '</B>', '•effort···' )

With the capturing parentheses removed, split(/<[^>]*>/) returns:

     ( '···•and•', 'very•', 'very', '•much', '•effort···' )

The added elements do not count against a chunk limit. (The chunk limit limits the chunks that the original string is partitioned into, not the number of elements returned.)

If there are multiple sets of capturing parentheses, multiple items are added to the list with each match. If there are sets of capturing parentheses that don't contribute to a match, undef elements are inserted for them.
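The three points above (interjected elements, the chunk limit, and undef for non-participating parentheses) can be sketched as follows. The HTML string is a stand-in of my own for the elided text in the example, and the two-alternative regex is a second illustration that does not come from this section.

```perl
use strict;
use warnings;

# Stand-in data: the text's example with the elided ends written as "...".
my $html = '... and <B>very <FONT color=red>very</FONT> much</B> effort...';

my @with    = split(/(<[^>]*>)/, $html);     # tags are returned too
my @without = split(/<[^>]*>/,   $html);     # tags are discarded

# A limit of 2 still yields three elements, because the captured
# tag does not count against the chunk limit.
my @limited = split(/(<[^>]*>)/, $html, 2);

# Two sets of parentheses: whichever set did not take part in a
# given match contributes undef.
my @parts = split(/(-)|(,)/, "1-10,20");

printf "with=%d without=%d limited=%d\n",
    scalar(@with), scalar(@without), scalar(@limited);
print join(' ', map { defined $_ ? "'$_'" : 'undef' } @parts), "\n";
```

When looping over such a list, test each element with defined before using it, since use warnings complains about an undef used as a string.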
