6.20. Arrays

Arrays in awk are called associative arrays because the subscripts can be either numbers or strings. The subscript is often called the key and is associated with the value assigned to the corresponding array element. The keys and values are stored internally in a table, and a hashing algorithm is applied to the key to locate its value. Because of the hashing, the array elements are not stored in sequential order, and when the contents of the array are displayed, they may not appear in the order you expected.

An array, like a variable, is created by using it, and awk can infer whether it is used to store numbers or strings. An array element that has not been assigned has the numeric value zero or the string value null, depending on the context. You do not have to declare the size of an awk array. Awk arrays are used to collect information from records and may be used for accumulating totals, counting words, tracking the number of times a pattern occurred, and so forth.

6.20.1 Subscripts for Associative Arrays

Using Variables As Array Indexes

See Example 6.135 for a demonstration.

Example 6.135.

(The Input File)
    % cat employees
    Tom Jones 4424 5/12/66 543354
    Mary Adams 5346 11/4/63 28765
    Sally Chang 1654 7/22/54 650000
    Billy Black 1683 9/23/44 336500

(The Command Line)
    % nawk '{name[x++]=$2};END{for(i=0; i<NR; i++)\
    print i, name[i]}' employees
    0 Jones
    1 Adams
    2 Chang
    3 Black

    % nawk '{id[NR]=$3};END{for(x = 1; x <= NR; x++)\
    print id[x]}' employees
    4424
    5346
    1654
    1683

EXPLANATION
In the first command, the variable x is the subscript of the name array. Because x is uninitialized, it starts at zero and is incremented after each record, so the second field of each record is stored in name[0], name[1], and so on. The END block loops from zero to NR-1 and prints each subscript and its value. In the second command, the record number NR is the subscript of the id array, so the third field of each record is stored in id[1] through id[NR], and the END block prints the elements in that order.
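Accumulating a total while indexing an array by record number can be sketched as follows. This is only an illustration: it uses awk rather than nawk on the assumption that a POSIX awk is installed, it feeds the employees records through a shell variable instead of a file, and the array name amt, the variable total, and the shell variable report are hypothetical names chosen for the sketch. Treating the fifth field as an amount is likewise an assumption.

```shell
#!/bin/sh
# Records copied from the employees file in Example 6.135.
employees='Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500'

# amt[] (hypothetical name) is indexed by NR; neither amt nor total is declared.
report=$(printf '%s\n' "$employees" | awk '
    { amt[NR] = $5; total += $5 }      # store field 5 and add it to the running total
    END {
        for (i = 1; i <= NR; i++)      # subscripts are consecutive, so a plain for loop works
            print i, amt[i]
        print "total", total
    }')
printf '%s\n' "$report"
```

Because the subscripts run from 1 to NR without gaps, the ordinary for loop prints the elements in input order.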
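The Special-for Loop

The special-for loop is used to read through an associative array in cases where the for loop is not practical; that is, when strings are used as subscripts or the subscripts are not consecutive numbers. The special-for loop assigns each subscript (key) of the array, in turn, to the loop variable, which is then used to retrieve the value associated with that key.

FORMAT
    {for(item in arrayname){
         print arrayname[item]
       }
    }

Example 6.136.

(The Input File)
    % cat db
    Tom Jones
    Mary Adams
    Sally Chang
    Billy Black
    Tom Savage
    Tom Chung
    Reggie Steel
    Tommy Tucker

(The Command Line, for Loop)
    % nawk '/^Tom/{name[NR]=$1};\
    END{for( i = 1; i <= NR; i++ )print name[i]}' db
    Tom



    Tom
    Tom

    Tommy

(The Command Line, Special-for Loop)
    % nawk '/^Tom/{name[NR]=$1};\
    END{for(i in name){print name[i]}}' db
    Tom
    Tommy
    Tom
    Tom

EXPLANATION
Both commands store the first field of every record that begins with Tom in the name array, using the record number NR as the subscript, so only elements 1, 5, 6, and 8 are assigned. The for loop steps through every subscript from 1 to NR and prints a blank line for each element that was never assigned. The special-for loop visits only the subscripts that exist, in no guaranteed order.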
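Since the special-for loop makes no promise about order, one common workaround is to print the subscript with each value and let sort arrange the lines. The following is a minimal sketch under some assumptions: awk stands in for nawk, the db records are supplied through a shell variable, and the shell variable sorted is a hypothetical name.

```shell
#!/bin/sh
# Records copied from the db file in Example 6.136.
db='Tom Jones
Mary Adams
Sally Chang
Billy Black
Tom Savage
Tom Chung
Reggie Steel
Tommy Tucker'

# Print "subscript value" pairs in whatever order awk yields them,
# then sort numerically on the subscript to get a stable listing.
sorted=$(printf '%s\n' "$db" | awk '
    /^Tom/ { name[NR] = $1 }
    END { for (i in name) print i, name[i] }
' | sort -n)
printf '%s\n' "$sorted"
```

Sorting outside awk keeps the program itself short; the cost is an extra process in the pipeline.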
Using Strings As Array Subscripts

A subscript may consist of a variable containing a string or a literal string. If the string is a literal, it must be enclosed in double quotes.

Example 6.137.

(The Input File)
    % cat datafile3
    tom
    mary
    sean
    tom
    mary
    mary
    bob
    mary
    alex

(The Script)
    # awk.sc script
    /tom/ { count["tom"]++ }
    /mary/ { count["mary"]++ }
    END{print "There are " count["tom"] " Toms in the file and " count["mary"]" Marys in the file."}

(The Command Line)
    % nawk -f awk.sc datafile3
    There are 2 Toms in the file and 4 Marys in the file.

EXPLANATION
Each line that matches tom increments the element count["tom"], and each line that matches mary increments count["mary"]. Both elements start at zero because they are created the first time they are used. After the last line has been read, the END block prints the two totals.
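A literal string subscript and a variable holding the same string refer to the same array element. The sketch below illustrates this with the datafile3 names; awk stands in for nawk, and the variable key and the shell variable counts are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Names copied from datafile3 in Example 6.137.
names='tom
mary
sean
tom
mary
mary
bob
mary
alex'

counts=$(printf '%s\n' "$names" | awk '
    /tom/  { count["tom"]++ }                # literal string subscript, in double quotes
    /mary/ { key = "mary"; count[key]++ }    # variable subscript holding the same kind of string
    END    { print count["tom"], count["mary"] }')
printf '%s\n' "$counts"
```

The END block reads count["mary"] with a literal even though the element was incremented through key, which works because both evaluate to the string mary.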
Figure 6.1. Using strings as subscripts in an array (Example 6.137).

Using Field Values As Array Subscripts

Any expression can be used as a subscript in an array. Therefore, fields can be used. The program in Example 6.138 counts the frequency of all names appearing in the second field and uses the special form of the for loop:

    for( index_value in array ) statement

The for loop found in the END block of Example 6.138 works as follows: The variable name is set to an index value of the count array. On each iteration of the for loop, the print action is performed, first printing the value of the index, and then the value stored in that element. (The order of the printout is not guaranteed.)

Example 6.138.

(The Input File)
    % cat datafile4
    4234 Tom 43
    4567 Arch 45
    2008 Eliza 65
    4571 Tom 22
    3298 Eliza 21
    4622 Tom 53
    2345 Mary 24

(The Command Line)
    % nawk '{count[$2]++}END{for(name in count)print name,count[name] }' datafile4
    Tom 3
    Arch 1
    Eliza 2
    Mary 1

EXPLANATION
The awk statement first will use the second field as an index in the count array. The index varies as the second field varies; thus the first index in the count array is Tom and the value stored in count["Tom"] is 1. Next, count["Arch"] is set to 1 and count["Eliza"] to 1. When awk finds the next occurrence of Tom in the second field, count["Tom"] is incremented, now containing the value 2. The same thing happens for each later occurrence of a name, so the array ends with one element for each of Tom, Arch, Eliza, and Mary.

Example 6.139.

(The Input File)
    % cat datafile4
    4234 Tom 43
    4567 Arch 45
    2008 Eliza 65
    4571 Tom 22
    3298 Eliza 21
    4622 Tom 53
    2345 Mary 24

(The Command Line)
    % nawk '{dup[$2]++; if (dup[$2] > 1){name[$2]++ }}\
    END{print "The duplicates were";\
    for (i in name){print i, name[i]}}' datafile4

(The Output)
    The duplicates were
    Tom 2
    Eliza 1

EXPLANATION
The subscript for the dup array is the value in the second field, that is, the name of a person. The value stored there is initially zero, and it is incremented by one each time a record with that name is processed.
If the name is a duplicate, the value stored for that subscript will go up to two, and so forth. If the value in the dup array is greater than one, a new array called name also uses the second field as a subscript and counts the repeat occurrences of that name.

Arrays and the split Function

Awk's built-in split function allows you to split a string into words and store them in an array. You can define the field separator or use the value currently stored in FS.

FORMAT
    split(string, array, field separator)
    split(string, array)

Example 6.140.

(The Command Line)
    % nawk 'BEGIN{ split( "3/15/2004", date, "/");\
    print "The month is " date[1] " and the year is " date[3] "."}' filename

(The Output)
    The month is 3 and the year is 2004.

EXPLANATION
The string 3/15/2004 is split on the forward slash and the pieces are stored in the array date. Now date[1] contains 3, date[2] contains 15, and date[3] contains 2004. The field separator is specified in the third argument; if not specified, the value of FS is used as the separator.

The delete Function

The delete statement removes an array element.

Example 6.141.
    % nawk '{line[x++]=$2}END{for(x in line) delete line[x]}' filename
EXPLANATION
The value assigned to each element of the array line is the value of the second field. After all the records have been processed, the special-for loop will go through each element of the array, and delete will in turn remove each element.

Multidimensional Arrays (nawk)

Although awk does not officially support multidimensional arrays, a syntax is provided that gives the appearance of a multidimensional array. This is done by concatenating the indexes into a string separated by the value of a special built-in variable, SUBSEP. The SUBSEP variable contains the value "\034", an unprintable character that is so unusual that it is unlikely to be found as an index character. The expression matrix[2,8] is really the array matrix[2 SUBSEP 8], which evaluates to matrix["2\0348"]. The index becomes a unique string for an associative array.

Example 6.142.

(The Input File)
    1 2 3 4 5
    2 3 4 5 6
    6 7 8 9 10

(The Script)
    {nf=NF
    for(x = 1; x <= NF; x++ ){
         matrix[NR, x] = $x
       }
    }
    END { for (x=1; x <= NR; x++ ){
            for (y = 1; y <= nf; y++ )
                printf "%d ", matrix[x,y]
            printf"\n"
          }
    }

(The Output)
    1 2 3 4 5
    2 3 4 5 6
    6 7 8 9 10

EXPLANATION
For each record, the number of fields is saved in nf, and each field is stored in the matrix array under a two-part subscript made from the record number NR and the field number x. In the END block, the outer loop steps through the rows and the inner loop through the columns, printing each element followed by a space and ending each row with a newline.
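Because a two-part subscript is really one string joined by SUBSEP, the parts can be recovered with split when the array is traversed with the special-for loop. The sketch below assumes a POSIX awk in place of nawk and uses a small two-row input made up for the illustration; the array idx and the shell variable cells are hypothetical names.

```shell
#!/bin/sh
# Build a 2 x 3 matrix, then list every element as "row column value".
cells=$(printf '1 2 3\n4 5 6\n' | awk '
    { for (x = 1; x <= NF; x++) matrix[NR, x] = $x }
    END {
        for (key in matrix) {
            split(key, idx, SUBSEP)      # idx[1] is the row, idx[2] is the column
            print idx[1], idx[2], matrix[key]
        }
    }' | sort -k1,1n -k2,2n)             # traversal order is unspecified, so sort it
printf '%s\n' "$cells"
```

This approach is useful when the row and column counts are not known in advance, since nested counting loops are not needed.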
6.20.2 Processing Command Arguments (nawk)

ARGV

Command-line arguments are available to nawk (the new version of awk) with the built-in array called ARGV. These arguments include the command nawk, but not any of the options passed to nawk. The index of the ARGV array starts at zero. (This works only for nawk.)

ARGC

ARGC is a built-in variable that contains the number of command-line arguments.

Example 6.143.

(The Script)
    # Scriptname: argvs
    BEGIN{
        for ( i=0; i < ARGC; i++ ){
            printf("argv[%d] is %s\n", i, ARGV[i])
        }
        printf("The number of arguments, ARGC=%d\n", ARGC)
    }

(The Output)
    % nawk -f argvs datafile
    argv[0] is nawk
    argv[1] is datafile
    The number of arguments, ARGC=2

EXPLANATION
In the for loop, i is set to zero, i is tested to see if it is less than the number of command-line arguments (ARGC), and the printf function displays each argument encountered, in turn. When all of the arguments have been processed, the last printf statement outputs the number of arguments, ARGC. The example demonstrates that nawk does not count command-line options as arguments.

Example 6.144.

(The Command Line)
    % nawk -f argvs datafile "Peter Pan" 12
    argv[0] is nawk
    argv[1] is datafile
    argv[2] is Peter Pan
    argv[3] is 12
    The number of arguments, ARGC=4

EXPLANATION
As in the last example, each of the arguments is printed. The nawk command is considered the first argument, whereas the -f option and script name, argvs, are excluded.

Example 6.145.

(The Datafile)
    % cat datafile5
    Tom Jones:123:03/14/56
    Peter Pan:456:06/22/58
    Joe Blow:145:12/12/78
    Santa Ana:234:02/03/66
    Ariel Jones:987:11/12/66

(The Script)
    % cat arging.sc
    # Scriptname: arging.sc
    BEGIN{FS=":"; name=ARGV[2]
       print "ARGV[2] is "ARGV[2]
    }
    $1 ~ name { print $0 }

(The Command Line)
    % nawk -f arging.sc datafile5 "Peter Pan"
    ARGV[2] is Peter Pan
    Peter Pan:456:06/22/58
    nawk: can't open Peter Pan
    input record number 5, file Peter Pan
    source line number 2

EXPLANATION
In the BEGIN block, the field separator is set to a colon and the variable name is assigned the value of ARGV[2], Peter Pan. The pattern $1 ~ name then prints the matching record from datafile5. After datafile5 has been read, nawk treats the next argument, Peter Pan, as an input file and reports an error because no file by that name exists.
Example 6.146.

(The Script)
    % cat arging2.sc
    BEGIN{FS=":"; name=ARGV[2]
       print "ARGV[2] is " ARGV[2]
       delete ARGV[2]
    }
    $1 ~ name { print $0 }

(The Command Line)
    % nawk -f arging2.sc datafile5 "Peter Pan"
    ARGV[2] is Peter Pan
    Peter Pan:456:06/22/58

EXPLANATION
Nawk treats the elements of the ARGV array as input files; after an argument is used, it is shifted to the left and the next one is processed, until the ARGV array is empty. If the argument is deleted immediately after it is used, it will not be processed as the next input file.
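The same idea extends to any number of non-file arguments: a loop in the BEGIN block can save each one and delete it from ARGV before awk starts opening files. The following is a sketch under stated assumptions, not a definitive recipe: awk stands in for nawk, mktemp is assumed to be available for creating a scratch copy of datafile5, and the names wanted, found, and datafile are hypothetical.

```shell
#!/bin/sh
# Scratch copy of the datafile5 records from Example 6.145.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
Tom Jones:123:03/14/56
Peter Pan:456:06/22/58
Joe Blow:145:12/12/78
Santa Ana:234:02/03/66
Ariel Jones:987:11/12/66
EOF

found=$(awk '
    BEGIN {
        FS = ":"
        for (i = 2; i < ARGC; i++) {   # ARGV[1] is the data file; the rest are names
            wanted[ARGV[i]] = 1        # remember the name as an array subscript
            delete ARGV[i]             # keep awk from opening the name as a file
        }
    }
    ($1 in wanted) { print $1, $2 }    # exact match on the first field
' "$datafile" "Peter Pan" "Joe Blow")
rm -f "$datafile"
printf '%s\n' "$found"
```

Unlike the regular expression match in Example 6.146, the in test compares the whole first field with each saved name, so a partial name would not match.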