Classic Shell Scripting - Arnold Robbins
Text File Conventions
Because Unix encourages the use of textual data, it's common to store data in a text file, with each line representing a single record. There are two conventions for separating the fields within a line. The first is simply to use whitespace (spaces or tabs):
$ cat myapp.data
# model units sold salesperson
xj11 23 jane
rj45 12 joe
cat6 65 chris
...
In this example, lines beginning with a # character represent comments, and are ignored. (This is a common convention. The ability to have comment lines is helpful, but it requires that your software be able to ignore such lines.) Each field is separated from the next by an arbitrary number of space or tab characters. The second convention is to use a particular delimiter character to separate fields, such as a colon:
$ cat myapp.data
# model:units sold:salesperson
xj11:23:jane
rj45:12:joe
cat6:65:chris
...
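As a minimal sketch of how a script might consume either layout, the following uses awk to skip the comment lines and print the model and salesperson from each record. The sample records are inlined in shell variables rather than read from myapp.data, and the variable names are illustrative only.

```shell
# Sample records in both layouts (inlined here instead of reading myapp.data).
ws_data='# model units sold salesperson
xj11 23 jane
rj45 12 joe
cat6 65 chris'

colon_data='# model:units sold:salesperson
xj11:23:jane
rj45:12:joe
cat6:65:chris'

# Whitespace-separated: awk's default splitting handles runs of spaces and tabs.
ws_out=$(printf '%s\n' "$ws_data" | awk '!/^#/ { print $1, $3 }')

# Colon-separated: name the delimiter explicitly with -F.
colon_out=$(printf '%s\n' "$colon_data" | awk -F: '!/^#/ { print $1, $3 }')

printf '%s\n' "$ws_out"
printf '%s\n' "$colon_out"
```

Both commands print the same three lines (xj11 jane, rj45 joe, cat6 chris); only the way awk is told to find the field boundaries differs.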
Each convention has advantages and disadvantages. When whitespace is the separator, it's difficult to have real whitespace within the fields' contents. (If you use a tab as the separator, you can use a space character within a field, but this is visually confusing, since you can't easily tell the difference just by looking at the file.) On the flip side, if you use an explicit delimiter character, it then becomes difficult to include that delimiter within your data. Often, though, it's possible to make a careful choice, so that the need to include the delimiter becomes minimal or nonexistent.
* * *
Tip
One important difference between the two approaches has to do with multiple occurrences of the delimiter character(s). When using whitespace, the convention is that multiple successive occurrences of spaces or tabs act as a single delimiter. However, when using a special character, each occurrence separates a field. Thus, for example, two colon characters in the second version of myapp.data (a "::") delimit an empty field.
* * *
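The difference described in the tip can be checked with a short experiment. This is only a sketch, using awk's NF variable (the number of fields in the current record) on two made-up input lines.

```shell
# Three spaces between the first two fields: awk's default splitting
# treats the whole run of whitespace as a single separator.
ws_count=$(printf 'xj11   23 jane\n' | awk '{ print NF }')

# Two adjacent colons: with -F: every colon separates a field,
# so the second field exists but is empty.
colon_count=$(printf 'xj11::jane\n' | awk -F: '{ print NF }')
empty_field=$(printf 'xj11::jane\n' | awk -F: '{ print "[" $2 "]" }')

echo "$ws_count $colon_count $empty_field"
```

The command prints 3 3 []: both lines have three fields, but in the colon-separated line the middle field is the empty string.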
The prime example of the delimiter-separated field approach is /etc/passwd. There is one line per user of the system, and the fields are colon-separated. We use /etc/passwd for many examples throughout the book, since a large number of system administration tasks involve it. Here is a typical entry:
tolstoy:x:2076:10:Leo Tolstoy:/home/tolstoy:/bin/bash
The seven fields of a password file entry are:
1. The username.
2. The encrypted password. (This can be an asterisk if the account is disabled, or possibly a different character if encrypted passwords are stored separately in /etc/shadow.)
3. The user ID number.
4. The group ID number.
5. The user's personal name and possibly other relevant data (office number, telephone number, and so on).
6. The home directory.
7. The login shell.
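One way to pull these seven fields apart in the shell itself is to set IFS to a colon for a single read command. The sketch below works on the sample entry shown above rather than on the real /etc/passwd, and the variable names are chosen here for illustration.

```shell
# The sample entry from the text, held in a variable.
entry='tolstoy:x:2076:10:Leo Tolstoy:/home/tolstoy:/bin/bash'

# IFS=: applies only to this read; -r keeps any backslashes literal.
# Because a colon is not whitespace, the space inside field 5 is preserved.
IFS=: read -r user passwd uid gid fullname homedir shell <<EOF
$entry
EOF

printf 'user=%s uid=%s gid=%s name=%s home=%s shell=%s\n' \
    "$user" "$uid" "$gid" "$fullname" "$homedir" "$shell"
```

Note that the fifth field keeps its embedded space ("Leo Tolstoy"), which is exactly the situation that is awkward to represent in a whitespace-separated file.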
Some Unix tools work better with whitespace-delimited fields, others with delimiter-separated fields, and some utilities are equally adept at working with either kind of file, as we're about to see.
Selecting Fields with cut
The cut command was designed for cutting out data from text files. It can work on either a field basis or a character basis. The latter is useful for cutting out particular columns from a file. Beware, though: a tab character counts as a single character![8]
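The warning about tabs can be illustrated with a small sketch. The input line here is invented for the example: two letters, one tab, then two more letters.

```shell
# A line with a single tab between "ab" and "cd". On a terminal the tab
# may be displayed as several columns of blank space.
line=$(printf 'ab\tcd')

# Character-based: positions 1-4 are a, b, TAB, c. The tab is one character.
chars=$(printf '%s\n' "$line" | cut -c 1-4)

# Field-based: with the default tab delimiter, the second field is "cd".
field2=$(printf '%s\n' "$line" | cut -f 2)

printf '%s\n' "$chars"
printf '%s\n' "$field2"
```

Even though the tab may look like a wide gap on screen, cut -c 1-4 stops after the "c", so column positions as displayed and character positions as counted by cut can disagree.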
* * *
cut
Usage
cut -c list [ file ... ]
cut -f list [ -d delim ] [ file ... ]
Purpose
To select one or more fields or groups of characters from an input file, presumably for further processing within a pipeline.
Major options
-c list
Cut based on characters. list is a comma-separated list of character numbers or ranges, such as 1,3,5-12,42.
-d delim
Use delim as the delimiter with the -f option. The default delimiter is the tab character.
-f list
Cut based on fields. list is a comma-separated list of field numbers or ranges.
Behavior
Cut out the named fields