
Classic Shell Scripting - Arnold Robbins [65]

tr. A third possibility is a manual-page reference in the form tr(1).

The taglist program in Example 5-6 provides a solution. It finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files. Additionally, it flags with an arrow cases where the same word is marked up in more than one way. Here is a fragment of its output from just the file for a version of this chapter:

$ taglist ch05.xml
...
      2  cut          command      ch05.xml
      1  cut          emphasis     ch05.xml    <----
...
      2  uniq         command      ch05.xml
      1  uniq         emphasis     ch05.xml    <----
      1  vfstab       filename     ch05.xml
...

The tag listing task is reasonably complex, and would be quite hard to do in most conventional programming languages, even ones with large class libraries, such as C++ and Java, and even if you started with the Knuth or Hanson literate programs for the somewhat similar word-frequency problem. Yet, just nine steps in a Unix pipeline with by-now familiar tools suffice.

The word-frequency program did not deal with named files: it just assumed a single data stream. That is not a serious limitation because we can easily feed it multiple input files with cat. Here, however, we need a filename, since it does us no good to report a problem without telling where the problem is. The filename is taglist's single argument, available in the script as $1.

We feed the input file into the pipeline with cat. We could, of course, eliminate this step by redirecting the input of the next stage from $1, but we find in complex pipelines that it is clearer to separate data production from data processing. It also makes it slightly easier to insert yet another stage into the pipeline if the program later evolves:

cat "$1" | ...

We apply sed to simplify the otherwise-complex markup needed for web URLs:

... | sed -e 's#systemitem *role="url"#URL#g' \
        -e 's#/systemitem#/URL#' | ...

This converts tags such as <systemitem role="url"> and </systemitem> into simpler <URL> and </URL> tags, respectively.
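The effect can be checked on a single sample line (the URL here is purely illustrative):

```shell
# The two substitutions shorten the open and close tags of the URL markup.
echo '<systemitem role="url">http://www.gnu.org</systemitem>' |
  sed -e 's#systemitem *role="url"#URL#g' \
      -e 's#/systemitem#/URL#'
# -> <URL>http://www.gnu.org</URL>
```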

The next stage uses tr to replace spaces and the paired delimiters ( ), { }, and [ ] by newlines:

... | tr ' (){}[]' '\n\n\n\n\n\n\n' | ...
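The tr stage can be seen in isolation (the sample text is invented for the demonstration):

```shell
# Spaces and the delimiters ( ) { } [ ] each become a newline, leaving
# one "word" (or an empty line) per output line.
echo 'see <literal>tr</literal> (and [friends])' |
  tr ' (){}[]' '\n\n\n\n\n\n\n'
```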

At this point, the input consists of one "word" per line (or empty lines). Words are either actual text or SGML/XML tags. Using egrep, the next stage selects tag-enclosed words:

... | egrep '>[^<>]+</' | ...

This regular expression matches tag-enclosed words: a right angle bracket, followed by at least one character that is not an angle bracket, followed by a left angle bracket and a slash (the start of the closing tag).
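A small demonstration of the selection (the input lines are invented):

```shell
# Only lines containing >word</ survive; plain words fall through.
printf '%s\n' 'see' '<literal>tr</literal>' '<emphasis>cut</emphasis>' 'plain' |
  egrep '>[^<>]+</'
```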

At this point, the input consists of lines with tags. The first awk stage uses angle brackets as field separators, so an input line such as <literal>tr</literal> is split into four fields: an empty field, followed by literal, tr, and /literal. The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the print statement, which outputs the word, the tag, and the filename:

... | awk -F'[<>]' -v FILE="$1" \
        '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | ...
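The field splitting can be verified on one sample line (here the filename is passed literally instead of via $1):

```shell
# -F'[<>]' makes < and > field separators, so <literal>tr</literal>
# yields $1 = "" (before the first <), $2 = "literal", $3 = "tr",
# and $4 = "/literal"; FILE is set with -v, just as in the pipeline.
echo '<literal>tr</literal>' |
  awk -F'[<>]' -v FILE=ch05.xml \
    '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }'
```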

The sort stage sorts the lines into word order:

... | sort | ...

The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:

... | uniq -c | ...
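Since uniq -c counts only adjacent duplicate lines, the preceding sort is essential; a tiny example (records invented):

```shell
# Two identical adjacent records collapse to one with a count of 2.
printf '%s\n' 'cut command ch05.xml' 'cut command ch05.xml' 'cut emphasis ch05.xml' |
  uniq -c
```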

A second sort orders the output by word and tag (the second and third fields):

... | sort -k2,2 -k3,3 | ...

The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line. This arrow then clearly indicates instances where words have been marked up differently, and thus deserve closer inspection by the authors, the editors, or the book-production staff:

... | awk '{
          print ($2 == Last) ? ($0 " <----") : $0
          Last = $2
      }'
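The filter can be exercised on two already-sorted records for the same word (records invented):

```shell
# The second record repeats the word "cut", so it gets the arrow;
# Last starts out empty, so the first record never matches.
printf '%s\n' '2 cut command ch05.xml' '1 cut emphasis ch05.xml' |
  awk '{
    print ($2 == Last) ? ($0 " <----") : $0
    Last = $2
  }'
```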

The full program is provided in Example 5-6.

Example 5-6. Making an SGML tag list

#! /bin/sh -
# Read an HTML/SGML/XML file given on the command
# line containing markup like <tag>word</tag> and output on
# standard output
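The listing breaks off after the header comments; chaining together the nine stages described above yields a sketch of the whole script (a reconstruction from this walkthrough, not necessarily identical to the book's printed Example 5-6):

```shell
#! /bin/sh -
# Sketch of taglist: for each <tag>word</tag> pair found on one input
# line, report a count, the word, the tag, and the filename, flagging
# with an arrow any word marked up in more than one way.
cat "$1" |
  sed -e 's#systemitem *role="url"#URL#g' \
      -e 's#/systemitem#/URL#' |
  tr ' (){}[]' '\n\n\n\n\n\n\n' |
  egrep '>[^<>]+</' |
  awk -F'[<>]' -v FILE="$1" \
    '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
  sort |
  uniq -c |
  sort -k2,2 -k3,3 |
  awk '{
    print ($2 == Last) ? ($0 " <----") : $0
    Last = $2
  }'
```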
