6.20. Arrays

Arrays in awk are called associative arrays because the subscripts can be either numbers or strings. The subscript is often called the key and is associated with the value assigned to the corresponding array element. The keys and values are stored internally in a table where a hashing algorithm is applied to the value of the key in question. Due to the techniques used for hashing, the array elements are not stored in a sequential order, and when the contents of the array are displayed, they may not be in the order you expected.

An array, like a variable, is created by using it, and awk can infer whether it is used to store numbers or strings. Array elements are initialized with numeric value zero and string value null, depending on the context. You do not have to declare the size of an awk array. Awk arrays are used to collect information from records and may be used for accumulating totals, counting words, tracking the number of times a pattern occurred, and so forth.
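The points above can be sketched in a few lines. This is only an illustration under stated assumptions: a POSIX-compatible awk is installed as awk (substitute nawk to match the examples in this section), and the array names seen, missing, and other, along with the sample text, are made up for the demonstration.

```shell
# Elements come into existence when first used; no size is declared.
# An unassigned element acts as 0 in numeric context and "" in string context.
result=$(printf 'a b a\nb a c\n' | awk '
    { for (i = 1; i <= NF; i++) seen[$i]++ }     # count every word
    END {
        print "a:", seen["a"]
        print "missing+0:", missing["x"] + 0     # numeric context
        print "empty:[" other["y"] "]"           # string context
        # The in operator tests for a subscript without creating the element.
        print "z stored:", (("z" in seen) ? "yes" : "no")
    }')
printf '%s\n' "$result"
```

Note that merely referencing an element, as missing["x"] does here, creates it; the in operator is the way to test for a subscript without that side effect.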

6.20.1 Subscripts for Associative Arrays

Using Variables As Array Indexes

See Example 6.135 for a demonstration.

Example 6.135.


(The Input File)

% cat employees

Tom Jones             4424    5/12/66               543354

Mary Adams            5346    11/4/63               28765

Sally Chang           1654    7/22/54               650000

Billy Black           1683    9/23/44               336500



(The Command Line)

1    % nawk '{name[x++]=$2};END{for(i=0; i<NR; i++)\
       print i, name[i]}' employees

     0 Jones

     1 Adams

     2 Chang

     3 Black



2    % nawk '{id[NR]=$3};END{for(x = 1; x <= NR; x++)\
       print id[x]}' employees

     4424

     5346

     1654

     1683

EXPLANATION

The subscript in array name is a user-defined variable, x. The ++ indicates a numeric context. Awk initializes x to 0 and increments x by 1 after (post-increment operator) it is used. The value of the second field is assigned to each element of the name array. In the END block, the for loop is used to loop through the array, printing the value that was stored there, starting at subscript 0. Because the subscript is just a key, it does not have to start at 0. It can start at any value, either a number or a string.
The awk variable NR contains the number of the current record. By using NR as a subscript, the value of the third field is assigned to each element of the array for each record. At the end, the for loop will loop through the array, printing out the values that were stored there.
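Because the subscripts written by the second command are the consecutive numbers 1 through NR, the array can be read back in any order. The following is a minimal sketch, not part of the original example. It assumes a POSIX-compatible awk installed as awk (substitute nawk where appropriate) and pipes in three illustrative records instead of reading the employees file.

```shell
# Store field 3 under the record number, then walk the subscripts downward.
result=$(printf '%s\n' 'Tom Jones 4424' 'Mary Adams 5346' 'Sally Chang 1654' |
    awk '{ id[NR] = $3 }
         END { for (x = NR; x >= 1; x--) print x, id[x] }')
printf '%s\n' "$result"
```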

The Special-`for` Loop

The special-for loop is used to read through an associative array in cases where the for loop is not practical; that is, when strings are used as subscripts or the subscripts are not consecutive numbers. The special-for loop uses the subscript as a key into the value associated with it.

FORMAT


{for(item in arrayname){

    print arrayname[item]

    }

}

Example 6.136.


(The Input File)

% cat db

1    Tom Jones

2    Mary Adams

3    Sally Chang

4    Billy Black

5    Tom Savage

6    Tom Chung

7    Reggie Steel

8    Tommy Tucker



(The Command Line, for Loop)

1   % nawk '/^Tom/{name[NR]=$1};\
      END{for( i = 1; i <= NR; i++ )print name[i]}' db
    Tom



    Tom
    Tom

    Tommy

(The Command Line, Special-for Loop)

2   % nawk '/^Tom/{name[NR]=$1};\
      END{for(i in name){print name[i]}}' db

    Tom

    Tommy

    Tom

    Tom

EXPLANATION

If the regular expression Tom is matched against an input line, the name array is assigned a value. The NR value, the number of the current record, will be used as an index in the name array. Each time Tom is matched on a line, the name array is assigned the value of $1, the first field. When the END block is reached, the name array consists of four elements: name[1], name[5], name[6], and name[8]. Therefore, when printing the values for the name array with the traditional for loop, the values for indexes 2, 3, 4, and 7 are null.
The special-for loop iterates through the array, visiting only the subscripts that actually exist, so no null values are printed. The order of the printout is not guaranteed, because of the way associative arrays are stored (hashed); it can differ from one version of awk to another.
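When a predictable order matters, one common approach is to print the subscript along with each value and let sort arrange the lines. The sketch below is an illustration rather than part of the original example. It assumes a POSIX-compatible awk installed as awk, and it supplies the names from the db file inline, one per record, without the record numbers shown in the listing.

```shell
# Print "subscript value" pairs from the special-for loop, then sort numerically.
result=$(printf '%s\n' 'Tom Jones' 'Mary Adams' 'Sally Chang' 'Billy Black' \
                       'Tom Savage' 'Tom Chung' 'Reggie Steel' 'Tommy Tucker' |
    awk '/^Tom/ { name[NR] = $1 }
         END { for (i in name) print i, name[i] }' |
    sort -n)
printf '%s\n' "$result"
```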

Using Strings As Array Subscripts

A subscript may consist of a variable containing a string or literal string. If the string is a literal, it must be enclosed in double quotes.
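The difference between a quoted and an unquoted subscript is easy to overlook, so here is a small sketch of it. It assumes a POSIX-compatible awk installed as awk; the variable name key is hypothetical and chosen only for the illustration.

```shell
# count["tom"] uses the literal string tom as the subscript.
# count[tom] uses the value of the variable tom, which is unset here,
# so it refers to the element whose subscript is the empty string.
result=$(awk 'BEGIN {
    count["tom"] = 2
    print "quoted:[" count["tom"] "]"
    print "unquoted:[" count[tom] "]"
    key = "tom"                          # a variable containing the string
    print "variable:[" count[key] "]"
}')
printf '%s\n' "$result"
```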

Example 6.137.


(The Input File)

% cat datafile3

tom

mary

sean

tom

mary

mary

bob

mary

alex



(The Script)

     # awk.sc script

1    /tom/ { count["tom"]++ }

2    /mary/ { count["mary"]++ }

3    END{print "There are " count["tom"] " Toms in the file and " \
       count["mary"] " Marys in the file."}

(The Command Line)

    % nawk -f awk.sc datafile3

    There are 2 Toms in the file and 4 Marys in the file.

EXPLANATION

An array called count consists of two elements, count["tom"] and count["mary"]. The initial value of each of the array elements is 0. Every time tom is matched, the value of the array is incremented by 1.
The same procedure applies to count["mary"]. Note: Only one tom is recorded for each line, even if there are multiple occurrences on the line.
The END pattern prints the value stored in each of the array elements.

Figure 6.1. Using strings as subscripts in an array (Example 6.137).

Using Field Values As Array Subscripts

Any expression can be used as a subscript in an array. Therefore, fields can be used. The program in Example 6.138 counts the frequency of all names appearing in the second field and uses the special-for loop to print the results:


for( index_value in array ) statement

The for loop in the END block of Example 6.138 works as follows: The variable name is set, in turn, to each index value of the count array. On each iteration of the for loop, the print action is performed, first printing the value of the index and then the value stored in that element. (The order of the printout is not guaranteed.)

Example 6.138.


(The Input File)

% cat datafile4

4234  Tom     43

4567  Arch    45

2008  Eliza   65

4571  Tom     22

3298  Eliza   21

4622  Tom     53

2345  Mary    24

(The Command Line)

% nawk '{count[$2]++}END{for(name in count)print name,count[name] }' datafile4

Tom 3

Arch 1

Eliza 2

Mary 1

EXPLANATION

The awk statement first will use the second field as an index in the count array. The index varies as the second field varies, thus the first index in the count array is Tom and the value stored in count["Tom"] is 1.

Next, count["Arch"] is set to 1, count["Eliza"] to 1, and count["Mary"] to 1. When awk finds the next occurrence of Tom in the second field, count["Tom"] is incremented, now containing the value 2. The same thing happens for each occurrence of Arch, Eliza, and Mary.
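The same subscript can drive more than one array. As a hedged sketch that goes slightly beyond the example, the command below keeps a running total of the third field alongside the count. It assumes a POSIX-compatible awk installed as awk; the array name total is an addition for illustration, and the datafile4 records are supplied inline. The output is piped through sort only to make the order predictable.

```shell
# count[] tallies occurrences of each name; total[] sums field 3 per name.
result=$(printf '%s\n' '4234 Tom 43' '4567 Arch 45' '2008 Eliza 65' '4571 Tom 22' \
                       '3298 Eliza 21' '4622 Tom 53' '2345 Mary 24' |
    awk '{ count[$2]++; total[$2] += $3 }
         END { for (name in count) print name, count[name], total[name] }' |
    LC_ALL=C sort)
printf '%s\n' "$result"
```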

Example 6.139.


(The Input File)

% cat datafile4

4234  Tom    43

4567  Arch   45

2008  Eliza  65

4571  Tom    22

3298  Eliza  21

4622  Tom    53

2345  Mary   24



(The Command Line)

% nawk  '{dup[$2]++; if (dup[$2] > 1){name[$2]++ }}\
  END{print "The duplicates were";\
  for (i in name){print i, name[i]}}' datafile4



(The Output)

The duplicates were

Tom 2

Eliza 1

EXPLANATION

The subscript for the dup array is the value in the second field, that is, the name of a person. The value stored there is initially zero, and it is incremented by one each time a record with that name is processed. If the name is a duplicate, the value stored for that subscript will go up to two, and so forth. Whenever the value in the dup array is greater than one, a second array called name, which also uses the second field as a subscript, is incremented. It therefore holds the number of times each name was repeated after its first occurrence: two for Tom, who appears three times, and one for Eliza, who appears twice.
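If only the duplicated names are wanted, a single array is enough. The following sketch shows a related idiom, not the method used in Example 6.139. It assumes a POSIX-compatible awk installed as awk and supplies the datafile4 records inline.

```shell
# dup[$2]++ yields the old count, so the test is true only on the second
# occurrence of a name; each duplicated name is printed exactly once.
result=$(printf '%s\n' '4234 Tom 43' '4567 Arch 45' '2008 Eliza 65' '4571 Tom 22' \
                       '3298 Eliza 21' '4622 Tom 53' '2345 Mary 24' |
    awk 'dup[$2]++ == 1 { print $2 }')
printf '%s\n' "$result"
```

Because the names are printed as the records are read, the output follows input order and no special-for loop is needed.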

Arrays and the `split` Function

Awk's built-in split function allows you to split a string into words and store them in an array. You can define the field separator or use the value currently stored in FS.

FORMAT


split(string, array, field separator)

split (string, array)

Example 6.140.


(The Command Line)

% nawk 'BEGIN{ split( "3/15/2004", date, "/");\
  print "The month is " date[1] " and the year is " date[3] "."}' filename



(The Output)

The month is 3 and the year is 2004.

EXPLANATION

The string 3/15/2004 is stored in the array date, using the forward slash as the field separator. Now date[1] contains 3, date[2] contains 15, and date[3] contains 2004. The field separator is specified in the third argument; if not specified, the value of FS is used as the separator.
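The split function also returns the number of elements it stored, which is convenient for looping over the pieces in order. A minimal sketch, assuming a POSIX-compatible awk installed as awk; the variable names n, m, and words are illustrative.

```shell
result=$(awk 'BEGIN {
    n = split("3/15/2004", date, "/")    # n is the number of pieces stored
    for (i = 1; i <= n; i++)
        print i, date[i]
    m = split("Tom Jones", words)        # no third argument: FS is used
    print m, words[2]
}')
printf '%s\n' "$result"
```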

The `delete` Function

The delete function removes an array element. It is written as a statement, without parentheses around the element: delete array[subscript].

Example 6.141.


% nawk '{line[x++]=$2}END{for(x in line) delete line[x]}' filename

EXPLANATION

The value assigned to the array line is the value of the second field. After all the records have been processed, the special-for loop will go through each element of the array, and the delete function will in turn remove each element.
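To see the effect of a deletion, the in operator can be used to test whether a subscript still exists. The sketch below assumes a POSIX-compatible awk installed as awk; the two stored values are illustrative.

```shell
result=$(awk 'BEGIN {
    line[0] = "Jones"; line[1] = "Adams"
    delete line[0]                       # remove a single element
    print "0:", ((0 in line) ? "present" : "gone")
    print "1:", ((1 in line) ? "present" : "gone")
    n = 0
    for (x in line) n++                  # count the elements that remain
    print "elements left:", n
}')
printf '%s\n' "$result"
```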

Multidimensional Arrays (`nawk`)

Although awk does not officially support multidimensional arrays, a syntax is provided that gives the appearance of a multidimensional array. This is done by concatenating the indexes into a string separated by the value of a special built-in variable, SUBSEP. The SUBSEP variable contains the value "\034", an unprintable character that is so unusual that it is unlikely to be found as an index character. The expression matrix[2,8] is really the array matrix[2 SUBSEP 8], which evaluates to matrix["2\0348"]. The index becomes a unique string for an associative array.

Example 6.142.


(The Input File)

1 2 3 4 5

2 3 4 5 6

6 7 8 9 10



(The Script)

1    {nf=NF

2    for(x = 1; x <= NF; x++ ){

3        matrix[NR, x] = $x

         }

     }

4    END { for (x=1; x <= NR; x++ ){

         for (y = 1; y <= nf; y++ )

              printf "%d ", matrix[x,y]

     printf"\n"

         }

    }



(The Output)

1 2 3 4 5

2 3 4 5 6

6 7 8 9 10

EXPLANATION

The variable nf is assigned the value of NF, the number of fields. (This program assumes a fixed number of five fields per record.)
The for loop is entered, storing the number of each field on the line in the variable x.
The matrix array is a two-dimensional array. Its two indexes are NR (the number of the current record) and x (the number of the current field); the element they select is assigned the value of that field, $x.
In the END block, the two for loops are used to iterate through the matrix array, printing out the values stored there. This example does nothing more than demonstrate that multidimensional arrays can be simulated.
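Since a two-index subscript is really one string joined by SUBSEP, the separate indexes can be recovered with split when looping with the special-for loop. The following sketch illustrates this under the assumption of a POSIX-compatible awk installed as awk. It stores a single element so that the loop output does not depend on traversal order; the names key and idx are hypothetical.

```shell
result=$(awk 'BEGIN {
    matrix[2, 8] = "b"
    # The comma form and the explicit SUBSEP form name the same element.
    if (matrix[2 SUBSEP 8] == "b") print "matrix[2 SUBSEP 8] holds b"
    # A parenthesized index list tests membership without creating anything.
    if (!((3, 3) in matrix)) print "no element 3,3"
    # Recover the separate indexes by splitting the stored key on SUBSEP.
    for (key in matrix) {
        n = split(key, idx, SUBSEP)
        print n, idx[1], idx[2], matrix[key]
    }
}')
printf '%s\n' "$result"
```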

6.20.2 Processing Command Arguments (`nawk`)

`ARGV`

Command-line arguments are available to nawk (the new version of awk) with the built-in array called ARGV. These arguments include the command nawk, but not any of the options passed to nawk. The index of the ARGV array starts at zero. (This works only for nawk.)

`ARGC`

ARGC is a built-in variable that contains the number of command-line arguments.

Example 6.143.


(The Script)

# Scriptname: argvs

BEGIN{

    for ( i=0; i < ARGC; i++ ){

        printf("argv[%d] is %s\n", i, ARGV[i])

        }

    printf("The number of arguments, ARGC=%d\n", ARGC)

}



(The Output)

% nawk -f argvs datafile

argv[0] is nawk

argv[1] is datafile

The number of arguments, ARGC=2

EXPLANATION

In the for loop, i is set to zero, i is tested to see if it is less than the number of command-line arguments (ARGC), and the printf function displays each argument encountered, in turn. When all of the arguments have been processed, the last printf statement outputs the number of arguments, ARGC. The example demonstrates that nawk does not count command-line options as arguments.

Example 6.144.


(The Command Line)

% nawk -f argvs datafile "Peter Pan" 12

argv[0] is nawk

argv[1] is datafile

argv[2] is Peter Pan

argv[3] is 12

The number of arguments, ARGC=4

EXPLANATION

As in the last example, each of the arguments is printed. The nawk command is considered the first argument, whereas the -f option and script name, argvs, are excluded.

Example 6.145.


(The Datafile)

% cat datafile5

Tom Jones:123:03/14/56

Peter Pan:456:06/22/58

Joe Blow:145:12/12/78

Santa Ana:234:02/03/66

Ariel Jones:987:11/12/66



(The Script)

% cat arging.sc

# Scriptname: arging.sc

1   BEGIN{FS=":"; name=ARGV[2]

2   print "ARGV[2] is "ARGV[2]

    }

    $1  ~ name { print $0 }



(The Command Line)

% nawk -f arging.sc datafile5 "Peter Pan"

ARGV[2] is Peter Pan

Peter Pan:456:06/22/58

nawk: can't open Peter Pan

input record number 5, file Peter Pan

source line number 2

EXPLANATION

In the BEGIN block, the variable name is assigned the value of ARGV[2], Peter Pan.
Peter Pan is printed, but then nawk tries to open Peter Pan as an input file after it has processed and closed the datafile. Nawk treats arguments as input files.

Example 6.146.


(The Script)

% cat arging2.sc

BEGIN{FS=":"; name=ARGV[2]

   print "ARGV[2] is " ARGV[2]

   delete ARGV[2]

}

$1  ~ name { print $0 }



(The Command Line)

% nawk -f arging2.sc datafile5 "Peter Pan"

ARGV[2] is Peter Pan

Peter Pan:456:06/22/58

EXPLANATION

Nawk treats the elements of the ARGV array that follow ARGV[0] as input files and opens them, in order, once the BEGIN block has finished. If an element is deleted in the BEGIN block after its value has been saved, as ARGV[2] is here, nawk skips it and does not try to open it as an input file.
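Setting the element to the null string has the same effect as deleting it, because POSIX awk skips null elements of ARGV when it looks for input files. The sketch below is a variation on arging2.sc, not the original script. It assumes a POSIX-compatible awk installed as awk and the availability of mktemp, and it writes two records from datafile5 to a temporary file so that it is self-contained.

```shell
# Build a small stand-in for datafile5 in a temporary file.
tmp=$(mktemp)
printf '%s\n' 'Tom Jones:123:03/14/56' 'Peter Pan:456:06/22/58' > "$tmp"

# Save ARGV[2], then blank it so awk does not try to open it as a file.
result=$(awk 'BEGIN { FS = ":"; name = ARGV[2]; ARGV[2] = "" }
              $1 ~ name { print $0 }' "$tmp" "Peter Pan")
rm -f "$tmp"
printf '%s\n' "$result"
```

Another route in POSIX awk is to pass the value with the -v option, which assigns a variable before the BEGIN block runs and leaves ARGV holding only file names.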

