Classic Shell Scripting - Arnold Robbins [43]
rm quotas.sorted sales.sorted
The first step is to remove the comment lines with sed, and then to sort each file. The sorted temporary files become the input to the join command, and finally the script removes the temporary files. Here is what happens when it's run:
$ ./merge-sales.sh
chris 95 300
herman 80 150
jane 75 200
joe 50 100
Rearranging Fields with awk
awk is a useful programming language in its own right. In fact, we devote Chapter 9 to covering the most important parts of the language. Although you can do quite a lot with awk, it was purposely designed to be useful in shell scripting—for doing simple text manipulation, such as field extraction and rearrangement. In this section, we examine the basics of awk so that you can understand such "one-liners" when you see them.
Patterns and actions
awk's basic paradigm is different from that of many programming languages. It is similar in many ways to sed's:
awk 'program' [ file ... ]
awk reads records (lines) one at a time from each file named on the command line (or from standard input if none are given). For each record, it applies the commands specified by the program. The basic structure of an awk program is:
pattern { action }
pattern { action }
...
The pattern part can be almost any expression, but in one-liners, it's typically an ERE enclosed in slashes. The action can be any awk statement, but in one-liners, it's typically a plain print statement. (Examples are coming up.)
Either the pattern or the action may be omitted (but, of course, not both). A missing pattern executes the action for every input record. A missing action is equivalent to { print }, which (as we shall see shortly) prints the entire record. Most one-liners are of the form:
... | awk '{ print some-stuff }' | ...
For each record, awk tests each pattern in the program. If the pattern is true (e.g., the record matches the regular expression, or the general expression evaluates to true), then awk executes the code in the action.
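As a small illustration of pattern testing (the sample input here is ours, not from the book), the following program combines an ERE pattern with an expression pattern; each is tested against every record:

```shell
# Two rules: an ERE pattern and a general expression pattern.
# Both are tested against every input record.
printf 'ok one\nerror two\nerror three four five\n' |
awk '/error/ { print "matched:", $1 }
     NF > 3  { print "long line" }'
```

The last input line satisfies both patterns, so both actions run for it, in the order the rules appear in the program.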
Fields
awk has fields and records as a central part of its design. awk reads input records (usually just lines) and automatically splits each record into fields. It sets the built-in variable NF to the number of fields in each record.
By default, whitespace separates fields—i.e., runs of spaces and/or tab characters (like join). This is usually what you want, but you have other options. By setting the variable FS to a different value, you can change how awk separates fields. If you use a single character, then each occurrence of that character separates fields (like cut -d). Or, and here is where awk stands out, you can set it to a full ERE, in which case each occurrence of text that matches that ERE acts as a field separator.
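To see FS used as a full ERE, consider this sketch (the input string is an invented example): every run of digits acts as a field separator, so the letters become the fields.

```shell
# FS set to an ERE: each run of digits separates fields,
# leaving a, b, c, d as the four fields.
echo 'a1b22c333d' |
awk 'BEGIN { FS = "[0-9]+" } { print NF; print $2, $4 }'
```

This prints 4 (the field count) followed by "b d".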
Field values are designated as such with the $ character. Usually $ is followed by a numeric constant. However, it can be followed by an expression; most typically the name of a variable. Here are some examples:
awk '{ print $1 }' Print first field (no pattern)
awk '{ print $2, $5 }' Print second and fifth fields (no pattern)
awk '{ print $1, $NF }' Print first and last fields (no pattern)
awk 'NF > 0 { print $0 }' Print nonempty lines (pattern and action)
awk 'NF > 0' Same (no action, default is to print)
A special case is field number zero, $0, which represents the whole record.
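The examples above all use numeric constants after $. Here is a brief sketch (with made-up sample input) of $ followed by an expression, as mentioned earlier:

```shell
# $ followed by an expression rather than a constant.
echo 'one two three four' |
awk '{ n = 2
       print $n        # field chosen at runtime: "two"
       print $(NF-1)   # next-to-last field: "three"
       print $0 }'     # field zero: the whole record
```

Because $ applies to the value of the expression that follows it, $(NF-1) selects the next-to-last field no matter how many fields the record has.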
Setting the field separators
For simple programs, you can change the field separator with the -F option. For example, to print the username and full name from the /etc/passwd file:
$ awk -F: '{ print $1, $5 }' /etc/passwd    Process /etc/passwd
root root                                   Administrative accounts
...
tolstoy Leo Tolstoy                         Real users
austen Jane Austen
camus Albert Camus
...
The -F option sets the FS variable automatically. Note how the program does not have to reference FS directly, nor does it have to manage reading records and splitting them into fields; awk does it all automatically.
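Because -F simply sets FS, the two spellings below are interchangeable. (The passwd-style sample line and the filename sample.passwd are ours, used here so the sketch does not depend on a real /etc/passwd.)

```shell
# A single passwd-style record for demonstration.
printf 'camus:x:1001:100:Albert Camus:/home/camus:/bin/sh\n' > sample.passwd

# These two invocations are equivalent:
awk -F: '{ print $1, $5 }' sample.passwd
awk 'BEGIN { FS = ":" } { print $1, $5 }' sample.passwd

rm sample.passwd
```

The BEGIN block runs before any input is read, so assigning FS there takes effect for the first record, just as -F does.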
You may have noticed that each field in the output is separated with a space, even though the input field separator is a colon. Unlike