
6.25. Odds and Ends

Some data (e.g., data read in from tape or from a spreadsheet) may not have obvious field separators but may instead have fixed-width columns. To preprocess this type of data, the substr function is useful.

6.25.1 Fixed Fields

In the following example, the fields are of a fixed width, but are not separated by a field separator. The substr function is used to create fields.

Example 6.167.

% cat fixed

031291ax5633(408)987–0124

021589bg2435(415)866–1345

122490de1237(916)933–1234

010187ax3458(408)264–2546

092491bd9923(415)134–8900

112990bg4567(803)234–1456

070489qr3455(415)899–1426



% nawk '{printf substr($0,1,6)" ";printf substr($0,7,6)" ";\

  print substr($0,13,length)}' fixed

031291  ax5633  (408)987–0124

021589  bg2435  (415)866–1345

122490  de1237  (916)933–1234

010187  ax3458  (408)264–2546

092491  bd9923  (415)134–8900

112990  bg4567  (803)234–1456

070489  qr3455  (415)899–1426


EXPLANATION

The first field is the substring of the entire record that starts at the first character and is 6 characters long. Next, a space is printed. The second field is the substring that starts at position 7 and is also 6 characters long, followed by a space. The last field is the substring that starts at position 13 and runs to the end of the line; its length argument is supplied by the length function, which returns the length of the current line, $0, when called without an argument.
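The same slicing can be sketched with one printf and an explicit format string. This is a minimal sketch rather than the book's script: the function name split_fixed and the variable names are hypothetical, the column layout is assumed from the sample file, and it assumes a POSIX awk installed as awk (substitute nawk where that is the available name).

```shell
# Split fixed-width records into three fields with substr(string, start, length).
# Assumed layout (from the sample file): columns 1-6 date, 7-12 ID, 13-end phone.
split_fixed() {
    awk '{
        date  = substr($0, 1, 6)    # 6 characters starting at position 1
        id    = substr($0, 7, 6)    # 6 characters starting at position 7
        phone = substr($0, 13)      # no length given: runs to the end of the record
        printf "%s %s %s\n", date, id, phone
    }'
}

printf '%s\n' '031291ax5633(408)987-0124' '021589bg2435(415)866-1345' | split_fixed
# prints:
# 031291 ax5633 (408)987-0124
# 021589 bg2435 (415)866-1345
```

Passing the data as arguments to a fixed format string, instead of using the data itself as the format, keeps a stray % in the input from being interpreted by printf.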

Empty Fields

If the data is stored in fixed-width fields, it is possible that some of the fields are empty. In the following example, the substr function is used to preserve the fields, regardless of whether they contain data.

Example 6.168.

1   % cat db

    xxx xxx

    xxx abc xxx

    xxx a   bbb

    xxx     xx



    % cat awkfix

    # Preserving empty fields. Field width is fixed.

    {

2   f[1]=substr($0,1,3)

3   f[2]=substr($0,5,3)

4   f[3]=substr($0,9,3)

5   line=sprintf("%-4s%-4s%-4s\n", f[1],f[2], f[3])

6   print line

    }

    % nawk -f awkfix db

    xxx xxx

    xxx abc xxx

    xxx a   bbb

    xxx     xx


EXPLANATION

  1. The contents of the file db are printed. There are empty fields in the file.

  2. The first element of the f array is assigned the substring of the record that starts at position 1 and is 3 characters long.

  3. The second element of the f array is assigned the substring of the record that starts at position 5 and is 3 characters long.

  4. The third element of the f array is assigned the substring of the record that starts at position 9 and is 3 characters long.

  5. The elements of the array are assigned to the user-defined variable line after being formatted by the sprintf function.

  6. The value of line is printed and the empty fields are preserved.
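To see why substr is needed here, the sketch below compares awk's default whitespace splitting with fixed-width slicing on a record whose middle column is blank. The function name count_fields is hypothetical, the column positions are assumed from the db sample, and a POSIX awk installed as awk is assumed.

```shell
# Compare default field splitting with fixed-width slicing on one record.
count_fields() {
    awk '{
        # Default splitting treats the run of blanks as one separator,
        # so the empty middle column disappears from the field count.
        by_space = NF
        # Fixed-width slicing always yields three columns
        # (assumed positions: 1-3, 5-7, 9-11, as in the db sample).
        f[1] = substr($0, 1, 3)
        f[2] = substr($0, 5, 3)
        f[3] = substr($0, 9, 3)
        printf "NF=%d fixed=[%s][%s][%s]\n", by_space, f[1], f[2], f[3]
    }'
}

printf '%s\n' 'xxx     xx' | count_fields
# prints: NF=2 fixed=[xxx][   ][xx]
```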

Numbers with $, Commas, or Other Characters

In the following example, the price field contains a dollar sign and comma. The script must eliminate these characters to add up the prices to get the total cost. This is done using the gsub function.

Example 6.169.

% cat vendor

access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00

alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00

alisa systems:gp262345:260:vax8700:alisa talk:02/03/91:$1,598.50

apple computer:zx342567:240:macs:e–mail:06/25/90:$575.75

caci:gp262313:280:sparc station:network11.5:05/12/91:$1,250.75

datalogics:bp132455:260:microvax2:pagestation maint:07/01/90:$1,200.00

dec:zx354612:220:microvax2:vms sms:07/20/90:$1,350.00



% nawk -F: '{gsub(/\$/,"");gsub(/,/,""); cost +=$7};\

END{print "The total is $" cost}' vendor

The total is $7474


EXPLANATION

The first gsub function globally substitutes the literal dollar sign (\$) with the null string, and the second gsub function substitutes commas with a null string. The user-defined cost variable is then totaled by adding the seventh field to cost and assigning the result back to cost. In the END block, the string The total is $ is printed, followed by the value of cost.[a]

[a] For details on how commas are added back into the program, see Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language (Boston: Addison-Wesley, 1988), p. 72.
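A variation on this example, offered as a sketch only: it strips the characters from a copy of the price field, so the rest of the record is left untouched, and prints the total with two decimal places. The function name total_cost and the variable price are hypothetical, and a POSIX awk installed as awk is assumed.

```shell
# Sum the seventh colon-separated field after removing "$" and "," from it.
total_cost() {
    awk -F: '{
        price = $7
        gsub(/[$,]/, "", price)    # strip dollar signs and commas from the copy only
        cost += price
    }
    END { printf "The total is $%.2f\n", cost }'
}

printf '%s\n' 'access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00' \
              'alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00' | total_cost
# prints: The total is $1499.00
```

Restricting gsub to the price keeps commas or dollar signs that appear in other fields from being removed as a side effect.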

6.25.2 Multiline Records

In the sample data files used so far, each record is on a line by itself. In the following sample datafile, called checkbook, the records are separated by blank lines and the fields are separated by newlines. To process this file, the record separator (RS) is assigned a value of null, and the field separator (FS) is assigned the newline.

Example 6.170.

(The Input File)

% cat checkbook

1/1/04

#125

–695.00

Mortgage



1/1/04

#126

–56.89

PG&E



1/2/04

#127

–89.99

Safeway



1/3/04

+750.00

Paycheck



1/4/04

#128

–60.00

Visa



(The Script)

    % cat awkchecker

1   BEGIN{RS=""; FS="\n";ORS="\n\n"}

2   {print  NR, $1,$2,$3,$4}



(The Output)

% nawk -f awkchecker checkbook

1 1/1/04  #125  –695.00  Mortgage



2 1/1/04  #126  –56.89  PG&E



3 1/2/04  #127  –89.99  Safeway



4 1/3/04  +750.00  Paycheck



5 1/4/04  #128  –60.00  Visa


EXPLANATION

  1. In the BEGIN block, the record separator (RS) is assigned null, the field separator (FS) is assigned a newline, and the output record separator (ORS) is assigned two newlines. Now each line is a field and each output record is separated by two newlines.

  2. The number of the record is printed, followed by each of the fields.
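Once RS and FS are set this way, each blank-line-separated block can be handled like any other record. The sketch below is an illustration under stated assumptions, not part of the original script: the function name summarize_checks is hypothetical, the amounts use a plain ASCII minus sign so that awk treats them as numbers, and a POSIX awk installed as awk is assumed. Because deposits have no check number, it addresses fields from the end of the record with NF.

```shell
# Treat blank-line-separated blocks as records and each line as a field.
summarize_checks() {
    awk 'BEGIN { RS = ""; FS = "\n" }
    {
        # The last field is the payee; the one before it is the amount.
        printf "%d: %s %s\n", NR, $NF, $(NF - 1)
        balance += $(NF - 1)
    }
    END { printf "net %.2f\n", balance }'
}

printf '1/1/04\n#125\n-695.00\nMortgage\n\n1/3/04\n+750.00\nPaycheck\n' | summarize_checks
# prints:
# 1: Mortgage -695.00
# 2: Paycheck +750.00
# net 55.00
```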

6.25.3 Generating Form Letters

The following example is modified from a program in The AWK Programming Language.[4] The tricky part of this is keeping track of what is actually being processed. The input file is called data.form. It contains just the data. Each field in the input file is separated by colons. The other file is called form.letter. It is the actual form that will be used to create the letter. This file is loaded into awk's memory with the getline function. Each line of the form letter is stored in an array. The program gets its data from data.form, and the letter is created by substituting real data for the special strings preceded by # and @ found in form.letter. A temporary variable, temp, holds the actual line that will be displayed after the data has been substituted. This program allows you to create personalized form letters for each person listed in data.form.

[4] Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language (Boston: Addison-Wesley, 1988). © 1988 Bell Telephone Laboratories, Inc. Reprinted by permission of Pearson Education, Inc.

Example 6.171.

(The Awk Script)

% cat form.awk

# form.awk is an awk script that requires access to 2 files: The

# first file is called "form.letter." This file contains the

# format for a form letter. The awk script uses another file,

# "data.form," as its input file. This file contains the

# information that will be substituted into the form letters in

# the place of the numbers preceded by pound signs. Today's date

# is substituted in the place of "@date" in "form.letter."

1   BEGIN{ FS=":"; n=1

2   while((getline < "form.letter") >  0)

3        form[n++] = $0   # Store lines from form.letter in an array

4   "date" | getline d; split(d, today, " ")

         # Output of date is Fri Mar 2 14:35:50   PST 2004

5   thisday=today[2]". "today[3]", "today[6]

6   }

7   { for( i = 1; i < n; i++ ){

8       temp=form[i]

9       for ( j = 1; j <=NF; j++ ){

             gsub("@date", thisday, temp)

10           gsub("#" j, $j , temp )

        }

11  print temp

    }

    }

The form letter, form.letter, looks like this:

% cat form.letter

*********************************************************

    Subject: Status Report for Project "#1"

    To: #2

    From: #3

    Date: @date

    This letter is to tell you, #2, that project "#1" is up to

    date.

    We expect that everything will be completed and ready for

    shipment as scheduled on #4.



    Sincerely,



    #3

**********************************************************



The file data.form is awk's input file. It contains the data that will replace #1 through #4 in form.letter; @date is replaced with the output of the date command.



% cat data.form

    Dynamo:John Stevens:Dana Smith, Mgr:4/12/2004

    Gallactius:Guy Sterling:Dana Smith, Mgr:5/18/2004



(The Command Line)



    % nawk  -f form.awk  data.form

    *********************************************************

    Subject: Status Report for Project "Dynamo"

    To: John Stevens

    From: Dana Smith, Mgr

    Date: Mar. 2, 2004

    This letter is to tell you, John Stevens, that project

    "Dynamo" is up to date.

    We expect that everything will be completed and ready for

    shipment as scheduled on 4/12/2004.

    Sincerely,



    Dana Smith, Mgr

    Subject: Status Report for Project "Gallactius"

    To: Guy Sterling

    From: Dana Smith, Mgr

    Date: Mar. 2, 2004

    This letter is to tell you, Guy Sterling, that project "Gallactius"

    is up to date.

    We expect that everything will be completed and ready for

    shipment as scheduled on 5/18/2004.



    Sincerely,



    Dana Smith, Mgr


EXPLANATION

  1. In the BEGIN block, the field separator (FS) is assigned a colon, and a user-defined variable n is assigned 1.

  2. In the while loop, the getline function reads a line at a time from the file called form.letter. If getline fails to find the file, it returns -1. When it reaches the end of file, it returns 0. Therefore, by testing for a return value greater than 0, we know that the function has read in a line from the input file.

  3. Each line from form.letter is assigned to an array called form.

  4. The output from the UNIX/Linux date command is piped to the getline function and assigned to the user-defined variable d. The split function then splits up the variable d with whitespace, creating an array called today.

  5. The user-defined variable thisday is assigned the month, day, and year.

  6. The BEGIN block ends.

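  7. The for loop runs once for each line stored in the form array (n-1 times, because n was incremented after the last line was stored).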

  8. The user-defined variable temp is assigned a line from the form array.

  9. The nested for loop is looping through a line from the input file, data.form, NF number of times. Each line stored in the temp variable is checked for the string @date. If @date is matched, the gsub function replaces it with today's date (the value stored in thisday).

  10. If a # and a number are found in the line stored in temp, the gsub function will replace the # and number with the value of the corresponding field in the input file, data.form. For example, if the first line stored is being tested, #1 would be replaced with Dynamo, #2 with John Stevens, #3 with Dana Smith, #4 with 4/12/2004, and so forth.

  11. The line stored in temp is printed after the substitutions.
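The core of the technique (load a template with getline, then substitute #N placeholders with gsub) can be sketched in a reduced form. This is not the book's program: the function name fill_template, the variable tmpl, and the two-line template are hypothetical, the @date handling is left out, and a POSIX awk installed as awk plus the mktemp utility are assumed.

```shell
# usage: fill_template TEMPLATE_FILE < colon-separated data
fill_template() {
    awk -F: -v tmpl="$1" '
    BEGIN {
        # Parenthesize the redirection so the comparison applies
        # to the return value of getline (1, 0, or -1).
        while ((getline line < tmpl) > 0)
            form[++n] = line
        close(tmpl)
    }
    {
        for (i = 1; i <= n; i++) {
            temp = form[i]
            for (j = 1; j <= NF; j++)
                gsub("#" j, $j, temp)    # replace #1, #2, ... with field values
            print temp
        }
    }'
}

tmpl=$(mktemp)
printf 'To: #2\nProject "#1" ships on #3.\n' > "$tmpl"
printf 'Dynamo:John Stevens:4/12/2004\n' | fill_template "$tmpl"
rm -f "$tmpl"
# prints:
# To: John Stevens
# Project "Dynamo" ships on 4/12/2004.
```

One caveat with this approach: an & in the replacement text is special to gsub (it stands for the matched text), so data such as PG&E would need the & escaped before substitution.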

6.25.4 Interaction with the Shell

Now that you have seen how awk works, you will find that awk is a very powerful utility when writing shell scripts. You can embed one-line awk commands or awk scripts within your shell scripts. The following is a sample of a Korn shell program embedded with awk commands.

Example 6.172.

#!/bin/ksh

# This korn shell script will collect data for awk to use in

# generating form letter(s). See above.

print "Hello $LOGNAME. "

print "This report is for the month and year:"

1   cal | nawk 'NR==1{print $0}'



    if [[ -f data.form  || -f formletter? ]]

    then

        rm data.form formletter?  2> /dev/null

    fi

    integer num=1

    while true

    do

        print "Form letter #$num:"

        read project?"What is the name of the project? "

        read sender?"Who is the status report from? "

        read recipient?"Who is the status report to? "

        read due_date?"What is the completion date scheduled? "

        echo $project:$recipient:$sender:$due_date > data.form

        print -n "Do you wish to generate another form letter? "

        read answer

        if [[ "$answer" != [Yy]* ]]

        then

              break

        else

2             nawk -f form.awk  data.form  > formletter$num

        fi

        (( num+=1 ))

    done

    nawk -f form.awk data.form > formletter$num


EXPLANATION

  1. The output of the UNIX cal command is piped to nawk. The first line, which contains the current month and year, is printed.

  2. The nawk script form.awk generates form letters, which are redirected to a UNIX file.
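Shell values can also be handed to awk directly rather than through a data file. The sketch below uses the -v option, which the example above does not use; the function name make_record and the variable names are hypothetical, and a POSIX awk installed as awk is assumed. It builds the same colon-separated record that the script writes to data.form.

```shell
# usage: make_record PROJECT RECIPIENT SENDER DUE_DATE
make_record() {
    # Each -v assignment copies a shell value into an awk variable
    # before the BEGIN block runs.
    awk -v p="$1" -v r="$2" -v s="$3" -v d="$4" \
        'BEGIN { OFS = ":"; print p, r, s, d }'
}

make_record "Dynamo" "John Stevens" "Dana Smith, Mgr" "4/12/2004"
# prints: Dynamo:John Stevens:Dana Smith, Mgr:4/12/2004
```

Note that awk processes backslash escape sequences in values assigned with -v, so input containing backslashes would need separate handling.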
