Classic Shell Scripting - Arnold Robbins [42]
Caveats
On POSIX systems, cut understands multibyte characters. Thus, "character" is not synonymous with "byte." See the manual pages for cut(1) for the details.
Some systems have limits on the size of an input line, particularly when multibyte characters are involved.
* * *
For example, the following command prints the login name and full name of each user on the system:
$ cut -d : -f 1,5 /etc/passwd                   Extract fields
root:root                                       Administrative accounts
...
tolstoy:Leo Tolstoy                             Real users
austen:Jane Austen
camus:Albert Camus
...
By choosing a different field number, we can extract each user's home directory:
$ cut -d : -f 6 /etc/passwd                     Extract home directory
/root                                           Administrative accounts
...
/home/tolstoy                                   Real users
/home/austen
/home/camus
...
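To experiment without depending on the contents of your own /etc/passwd, the same kind of extraction can be sketched on a few made-up records. The function name login_and_home and the sample accounts are our own inventions for illustration; only the cut options come from the text above.

```shell
#! /bin/sh
# login_and_home: read passwd-format records on standard input and
# print field 1 (login name) and field 6 (home directory).
login_and_home() {
    cut -d : -f 1,6
}

# Hypothetical records, not real accounts
login_and_home <<'EOF'
tolstoy:x:1001:100:Leo Tolstoy:/home/tolstoy:/bin/sh
austen:x:1002:100:Jane Austen:/home/austen:/bin/sh
EOF
```

The selected fields are printed with the same delimiter that separated them on input, so the first output line is tolstoy:/home/tolstoy.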
Cutting by character list can occasionally be useful. For example, to pull out just the permissions field from ls -l:
$ ls -l | cut -c 1-10
total 2878
-rw-r--r--
drwxr-xr-x
-r--r--r--
-rw-r--r--
...
However, this is riskier than using fields, since there is no guarantee that a given field has exactly the same width on every line. In general, we prefer field-based commands for extracting data.
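As one possible field-based alternative for the ls -l case, here is a minimal sketch that uses awk instead of cut. We switch tools because ls -l pads its columns with runs of spaces, and cut -d ' ' would treat every single space as a separator. The function name perms and the sample listing are hypothetical.

```shell
#! /bin/sh
# perms: print the first whitespace-separated field of each entry line.
# The "total" summary line has only two fields, so NF > 2 skips it.
perms() {
    awk 'NF > 2 { print $1 }'
}

# Hypothetical listing standing in for the output of ls -l
perms <<'EOF'
total 2878
-rw-r--r--   1 tolstoy  devel    10240 Jan  5 12:00 notes
drwxr-xr-x  12 tolstoy  devel     4096 Jan  5 12:01 src
EOF
```

In practice you would run ls -l | perms. Because awk splits on runs of whitespace, the result does not depend on how wide any column happens to be.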
Joining Fields with join
The join command lets you merge files whose records share a common key, that is, the field that identifies each record. Keys are often things such as usernames, last names, employee ID numbers, and so on. For example, you might have two files, one that lists how many items each salesperson sold and one that lists each salesperson's quota:
* * *
join
Usage
join [ options ... ] file1 file2
Purpose
To merge records in sorted files based on a common key.
Major options
-1 field1
-2 field2
Specifies the fields on which to join. -1 field1 specifies field1 from file1, and -2 field2 specifies field2 from file2. Fields are numbered from one, not from zero.
-o file.field
Make the output consist of field field from file file. The common field is not printed unless requested explicitly. Use multiple -o options to print multiple output fields.
-t separator
Use separator as the input field separator instead of whitespace. This character becomes the output field separator as well.
Behavior
Read file1 and file2, merging records based on a common key. By default, runs of whitespace separate fields. The output consists of the common key, the rest of the record from file1, followed by the rest of the record from file2. If file1 is -, join reads standard input. The first field of each file is the default key upon which to join; this can be changed with -1 and -2. Lines without keys in both files are not printed by default. (Options exist to change this; see the manual pages for join(1).)
Caveats
The -1 and -2 options are relatively new. On older systems, you may need to use -j1 field1 and -j2 field2.
* * *
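The options in the sidebar can be combined. The following is a rough sketch, assuming two small colon-separated files whose layouts, names, and contents are made up for illustration: the first holds an ID number and a login name, the second a login name and a full name. The function name id_to_name is also our own. The -o list is written here in the comma-separated form that POSIX allows.

```shell
#! /bin/sh
# id_to_name FILE1 FILE2
# FILE1 holds id:login and FILE2 holds login:fullname (hypothetical layouts).
# Both files must already be sorted on the login field.
id_to_name() {
    # Join field 2 of FILE1 with field 1 of FILE2; print only the ID
    # from FILE1 and the full name from FILE2.
    join -t : -1 2 -2 1 -o 1.1,2.2 "$1" "$2"
}

# Build the sample files in a scratch directory
dir=${TMPDIR:-/tmp}/join-demo.$$
mkdir "$dir" || exit 1

cat > "$dir/ids" <<'EOF'
101:austen
102:camus
103:tolstoy
EOF

cat > "$dir/names" <<'EOF'
austen:Jane Austen
camus:Albert Camus
tolstoy:Leo Tolstoy
EOF

id_to_name "$dir/ids" "$dir/names"

# Remove the scratch directory
rm -r "$dir"
```

Since the login name is not listed in -o, it does not appear in the output; the first line printed is 101:Jane Austen.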
$ cat sales                                     Show sales file
# sales data                                    Explanatory comments
# salesperson   amount
joe             100
jane            200
herman          150
chris           300
$ cat quotas                                    Show quotas file
# quotas
# salesperson   quota
joe             50
jane            75
herman          80
chris           95
Each record has two fields: the salesperson's name and the corresponding amount. In this instance, there are multiple spaces between the columns so that they line up nicely.
For join to work correctly, both input files must be sorted on the join field. The program in Example 3-2, merge-sales.sh, merges the two files using join.
Example 3-2. merge-sales.sh
#! /bin/sh
# merge-sales.sh
#
# Combine quota and sales data
# Remove comments and sort datafiles
sed '/^#/d' quotas | sort > quotas.sorted
sed '/^#/d' sales | sort > sales.sorted
# Combine on first key, results to standard output
join quotas.sorted sales.sorted
# Remove temporary files