Classic Shell Scripting - Arnold Robbins [48]

daemon:x:2:2:daemon:/sbin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93

Notice that the output is shorter: three users are in group 1000, but only one of them was output. We show another way to select unique records later in Section 4.2.

Sorting Text Blocks

Sometimes you need to sort data composed of multiline records. A good example is an address list, which is conveniently stored with one or more blank lines between addresses. For data like this, there is no constant sort-key position that could be used in a -k option, so you have to help out by supplying some extra markup. Here's a simple example:

$ cat my-friends                          # Show address file
# SORTKEY: Schloß, Hans Jürgen
Hans Jürgen Schloß
Unter den Linden 78
D-10117 Berlin
Germany

# SORTKEY: Jones, Adrian
Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

# SORTKEY: Brown, Kim
Kim Brown
1841 S Main Street
Westchester, NY 10502
USA

The sorting trick is to use the ability of awk to handle more-general record separators to recognize paragraph breaks, temporarily replace the line breaks inside each address with an otherwise unused character, such as an unprintable control character, and replace the paragraph break with a newline. sort then sees lines that look like this:

# SORTKEY: Schloß, Hans Jürgen^ZHans Jürgen Schloß^ZUnter den Linden 78^Z...
# SORTKEY: Jones, Adrian^ZAdrian Jones^Z371 Montgomery Park Road^Z...
# SORTKEY: Brown, Kim^ZKim Brown^Z1841 S Main Street^Z...

Here, ^Z is a Ctrl-Z character. A filter step downstream from sort restores the line breaks and paragraph breaks, and the sort key lines are easily removed, if desired, with grep. The entire pipeline looks like this:

cat my-friends |                                          # Pipe in address file
  awk -v RS="" '{ gsub("\n", "^Z"); print }' |            # Convert addresses to single lines
    sort -f |                                             # Sort address bundles, ignoring case
      awk -v ORS="\n\n" '{ gsub("^Z", "\n"); print }' |   # Restore line structure
        grep -v '# SORTKEY'                               # Remove markup lines

The gsub( ) function performs "global substitutions." It is similar to the s/x/y/g construct in sed. The RS variable is the input Record Separator. Normally, input records are separated by newlines, making each line a separate record. Using RS="" is a special case, whereby records are separated by blank lines; i.e., each block or "paragraph" of text forms a separate record. This is exactly the form of our input data. Finally, ORS is the Output Record Separator; each output record printed with print is terminated with its value. Its default is also normally a single newline; setting it here to "\n\n" preserves the input format with blank lines separating records. (More detail on these constructs may be found in Chapter 9.)

The output of this pipeline on our address file is:

Kim Brown
1841 S Main Street
Westchester, NY 10502
USA

Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

Hans Jürgen Schloß
Unter den Linden 78
D-10117 Berlin
Germany
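As a rough sketch, the pipeline above can be wrapped in a shell function and tried on a scratch copy of the data. The names sort_blocks and SEP and the temporary file are our own illustrative choices, not part of the original pipeline. Generating the Ctrl-Z byte with printf and handing it to awk with -v avoids typing a literal control character into the script; the sketch assumes that byte never occurs in the address data.

```shell
#!/bin/sh
# Sketch only: sort_blocks, SEP, and the scratch file are illustrative names.

# Ctrl-Z (octal 032), assumed never to occur in the address data.
SEP=$(printf '\032')

# sort_blocks FILE: sort blank-line-separated records by their first line.
sort_blocks() {
    awk -v RS="" -v sep="$SEP" '{ gsub("\n", sep); print }' "$1" |     # one record per line
    sort -f |                                                          # order records, ignoring case
    awk -v ORS="\n\n" -v sep="$SEP" '{ gsub(sep, "\n"); print }' |     # restore line breaks
    grep -v '# SORTKEY'                                                # drop the markup lines
}

# Demonstration on a scratch file with two of the sample addresses.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# SORTKEY: Jones, Adrian
Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

# SORTKEY: Brown, Kim
Kim Brown
1841 S Main Street
Westchester, NY 10502
USA
EOF
sort_blocks "$tmp"
rm -f "$tmp"
```

Because the sort key line comes first in every record, sort needs no -k option here: comparing the joined lines from their first character is enough.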

The beauty of this approach is that we can easily include additional keys in each address that can be used for both sorting and selection: for example, an extra markup line of the form:

# COUNTRY: UK

in each address, and an additional pipeline stage of grep '# COUNTRY: UK' just before the sort, would let us extract only the UK addresses for further processing.
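One way this selection stage might look is sketched below. The function name select_country, the SEP variable, and the sample file are hypothetical, and the sketch assumes every address carries a # COUNTRY: line. Note that a plain fixed-string match on "# COUNTRY: UK" would also accept any longer country name that begins with UK.

```shell
#!/bin/sh
# Sketch only: select_country, SEP, and the scratch file are illustrative names.

# Ctrl-Z (octal 032), assumed never to occur in the address data.
SEP=$(printf '\032')

# select_country COUNTRY FILE: keep only addresses tagged with COUNTRY, sorted.
select_country() {
    awk -v RS="" -v sep="$SEP" '{ gsub("\n", sep); print }' "$2" |     # one record per line
    grep -F "# COUNTRY: $1" |                                          # select before sorting
    sort -f |                                                          # order records, ignoring case
    awk -v ORS="\n\n" -v sep="$SEP" '{ gsub(sep, "\n"); print }' |     # restore line breaks
    grep -v -e '# SORTKEY' -e '# COUNTRY'                              # drop both kinds of markup
}

# Demonstration on a scratch file with two tagged addresses.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# SORTKEY: Jones, Adrian
# COUNTRY: UK
Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

# SORTKEY: Brown, Kim
# COUNTRY: USA
Kim Brown
1841 S Main Street
Westchester, NY 10502
USA
EOF
select_country UK "$tmp"
rm -f "$tmp"
```

Filtering while each address is still a single line is what makes grep usable here: a match on the country tag keeps or discards the whole address at once.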

You could, of course, go overboard and use XML markup to identify the parts of the address in excruciating detail:

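<address>
  <personalname>Hans Jürgen</personalname>
  <familyname>Schloß</familyname>
  <streetname>Unter den Linden</streetname>
  <streetnumber>78</streetnumber>
  <postalcode>D-10117</postalcode>
  <city>Berlin</city>
  <country>Germany</country>
</address>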

With fancier data-processing filters, you could then please your post office by presorting your mail by country and postal code, but our minimal markup and simple pipeline are often good enough to get the job done.
