Online Book Reader


Classic Shell Scripting - Arnold Robbins

print column two (lines unique to file2).

-3

Do not print column three (lines common to both files).

Behavior

Read the two files line by line. The input files must be sorted. Produce three columns of output: lines that are only in file1, lines that are only in file2, and lines that are in both files. Either filename can be -, in which case comm reads standard input.

Caveats

The options are not intuitive; it is hard to remember to add an option in order to remove an output column!
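As a small illustration of the option logic, the following sketch builds two sorted files and selects each column in turn by suppressing the other two. The file names and contents are made up for the example, and mktemp -d is assumed to be available (it is common, but not required by POSIX).

```shell
# Scratch directory for the two hypothetical input files.
tmpdir=$(mktemp -d)

# comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
printf '%s\n' apple banana cherry | LC_ALL=C sort > "$tmpdir/file1"
printf '%s\n' banana cherry date  | LC_ALL=C sort > "$tmpdir/file2"

# Suppress columns two and three: what remains is unique to file1.
only_in_file1=$(LC_ALL=C comm -23 "$tmpdir/file1" "$tmpdir/file2")

# Suppress columns one and three: what remains is unique to file2.
only_in_file2=$(LC_ALL=C comm -13 "$tmpdir/file1" "$tmpdir/file2")

# Suppress columns one and two: what remains is common to both files.
in_both=$(LC_ALL=C comm -12 "$tmpdir/file1" "$tmpdir/file2")

printf 'only in file1: %s\n' "$only_in_file1"
printf 'only in file2: %s\n' "$only_in_file2"
printf 'in both:\n%s\n' "$in_both"

rm -rf "$tmpdir"
```

Reading each option as "remove this column" rather than "show this column" makes combinations such as -12 easier to remember.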

* * *

Bentley then goes on to discuss a spellchecker developed by Doug McIlroy at Bell Labs in 1981—its design and implementation; how it stores the dictionary in minimal memory; and why checking spelling is hard, especially for a language as muddled as English.

The modern spell is written in C for efficiency. However, the original pipeline was in use at Bell Labs for quite a while.

* * *

[1] See http://www.math.utah.edu/pub/tex/bib/index-table-s.html#spell for an extensive bibliography.

[2] Jon Louis Bentley, Programming Pearls, Addison-Wesley, 1986, ISBN 0-201-10331-1.

[3] Brian W. Kernighan and P. J. Plauger, Software Tools in Pascal, Addison-Wesley, 1981, ISBN 0-201-10342-7.

Improving ispell and aspell

Unix spell supports several options, most of which are not helpful for day-to-day use. One exception is the -b option, which causes spell to prefer British spelling: "centre" instead of "center," "colour" instead of "color," and so on.[4] See the manual page for the other options.

One nice feature is that you can provide your own local spelling list of valid words. Words from a particular discipline are often spelled correctly but missing from spell's dictionary (for example, "POSIX"). You can create, and over time maintain, your own list of valid but unusual words, and then use this list when running spell. You indicate the pathname to the local spelling list by supplying it before the file to be checked, and by preceding it with a + character:

spell +/usr/local/lib/local.words myfile > myfile.errs
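One way to maintain such a list over time is sketched below. The add_words function is a hypothetical helper, not part of spell, and the choice of LC_ALL=C for ordering is an assumption made so that the result does not depend on the user's locale.

```shell
# Hypothetical helper: add_words LIST WORD...
# Merges the given words into LIST, keeping it sorted and free of duplicates.
add_words() {
    list=$1
    shift
    [ "$#" -gt 0 ] || return 0      # nothing to add
    touch "$list"                   # make sure the list exists
    # POSIX sort allows the -o output file to be one of the inputs.
    printf '%s\n' "$@" | LC_ALL=C sort -u -o "$list" - "$list"
}

# Example with a temporary list (a real list would live in a stable place).
words=$(mktemp)
add_words "$words" POSIX awk
add_words "$words" POSIX syzygy     # the repeated word is stored only once
cat "$words"

# The list would then be used as in the text:
#   spell +"$words" myfile > myfile.errs
```

Adding words through a single helper means the list never has to be sorted by hand.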

Private Spelling Dictionaries

We feel that it is an important Best Practice to have a private spelling dictionary for every document that you write. A common one shared by many documents is not useful, because the vocabulary becomes too big and errors are likely to be hidden: "syzygy" might be correct in a math paper, but in a novel, it perhaps ought to have been "soggy." We have found, based on a several-million-line corpus of technical text with associated spelling dictionaries, that there tends to be about one spelling exception every six lines. This tells us that spelling exceptions are common and are worth the trouble of managing along with the rest of a project.
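To check how a project of your own compares with that figure, one plausible measure is the number of document lines divided by the number of entries in its private dictionary. The sketch below assumes one word per dictionary line. The lines_per_exception function and the sample files are hypothetical, and this is not necessarily how the figure above was computed.

```shell
# Hypothetical helper: lines_per_exception DOCUMENT DICTIONARY
# Prints the integer number of document lines per dictionary entry.
lines_per_exception() {
    # $(( ... )) also strips the padding some wc implementations emit.
    doc_lines=$(( $(wc -l < "$1") ))
    dict_words=$(( $(wc -l < "$2") ))
    if [ "$dict_words" -eq 0 ]; then
        echo "lines_per_exception: empty dictionary" >&2
        return 1
    fi
    echo $(( doc_lines / dict_words ))
}

# Made-up sample: a 12-line document with a 2-word private dictionary.
doc=$(mktemp)
dict=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 12; i++) print "line " i }' > "$doc"
printf '%s\n' POSIX syzygy > "$dict"

ratio=$(lines_per_exception "$doc" "$dict")
echo "about one exception every $ratio lines"

rm -f "$doc" "$dict"
```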

There are some nuisances with spell: only one + option is permitted, and its dictionaries must be sorted in lexicographic order, which is poor design. It also means that most versions of spell break when the locale is changed. (While one might consider this to be bad design, it is really just an unanticipated consequence of the introduction of locales. The code for spell on these systems probably has not changed in more than 20 years, and when the underlying libraries were updated to do locale-based sorting, no one realized that this would be an effect.) Here is an example:

$ env LC_ALL=en_GB spell +ibmsysj.sok < ibmsysj.bib | wc -l

3674

$ env LC_ALL=en_US spell +ibmsysj.sok < ibmsysj.bib | wc -l

3685

$ env LC_ALL=C spell +ibmsysj.sok < ibmsysj.bib | wc -l

2163

However, if the sorting of the private dictionary matches that of the current locale, spell works properly:

$ env LC_ALL=en_GB sort ibmsysj.sok > /tmp/foo.en_GB

$ env LC_ALL=en_GB spell +/tmp/foo.en_GB < ibmsysj.bib | wc -l

2163
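The same idea can be sketched with sort alone, which makes it easy to see whether a dictionary is in order for the locale in use. The word list here is made up, and the C locale is chosen only because every POSIX system provides it; any locale should do, as long as the same one is used for sorting and for running spell.

```shell
# Pin one locale for everything that follows.
LC_ALL=C
export LC_ALL

# A made-up dictionary, in an order that many en_US locales would accept.
unsorted=$(mktemp)
dict=$(mktemp)
printf '%s\n' awk POSIX sed Unix > "$unsorted"

# sort -c exits with status 0 only if the input is already in order
# for the current locale. In the C locale, "awk" before "POSIX" is not.
if sort -c "$unsorted" 2>/dev/null; then
    unsorted_ok=yes
else
    unsorted_ok=no
fi

# Re-sort under the pinned locale and check again.
sort -u "$unsorted" > "$dict"
if sort -c "$dict" 2>/dev/null; then
    dict_ok=yes
else
    dict_ok=no
fi

echo "original order acceptable: $unsorted_ok"
echo "re-sorted order acceptable: $dict_ok"
cat "$dict"

# With the locale still pinned, spell would then be run as in the text:
#   spell +"$dict" < ibmsysj.bib | wc -l
```

In the C locale, uppercase letters sort before lowercase ones, so the re-sorted list begins with "POSIX" rather than "awk".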

The problem is that the default locale can change from one release of an operating system to the next. Thus, it is best to set the LC_ALL environment variable to a consistent value for private dictionary sorting, and for running spell. We provide a workaround for spell's sorted dictionary
