Classic Shell Scripting - Arnold Robbins [174]
Chapter 12. Spellchecking
This chapter uses the task of spellchecking to demonstrate several different dimensions of shell scripting. After introducing the spell program, we show how a simple but useful spellchecker can be constructed almost entirely out of stock Unix tools. We then proceed to show how simple shell scripts can be used to modify the output of two freely available spellchecking programs to produce results similar to those of the traditional Unix spell program. Finally, we present a powerful spellchecker written in awk, which nicely demonstrates the elegance of that language.
The spell Program
The spell program does what you think it does: it checks a file for spelling errors. It reads through all the files named on the command line, producing, on standard output, a sorted list of words that are not in its dictionary or that cannot be derived from such words by the application of standard English grammatical rules (e.g., "words" from "word"). Interestingly enough, POSIX does not standardize spell. The Rationale document has this to say:
This utility is not useful from shell scripts or typical application programs. The spell utility was considered, but was omitted because there is no known technology that can be used to make it recognize general language for user-specified input without providing a complete dictionary along with the input file.
We disagree with the first part of this statement. Consider a script for automated bug or trouble reporting: one might well want to have something along these lines:
#! /bin/sh -
# probreport --- simple problem reporting program
file=/tmp/report.$$
echo "Type in the problem, finish with Control-D."
cat > $file
while true
do
    printf "[E]dit, Spell [C]heck, [S]end, or [A]bort: "
    read choice
    case $choice in
    [Ee]*)  ${EDITOR:-vi} $file
            ;;
    [Cc]*)  spell $file
            ;;
    [Aa]*)  exit 0
            ;;
    [Ss]*)  break   # from loop
            ;;
    esac
done
... Send report
In this chapter, we examine spellchecking from several different angles, since it's an interesting problem, and it gives us an opportunity to solve the problem in several different ways.
The Original Unix Spellchecking Prototype
Spellchecking has been the subject of more than 300 research papers and books.[1] In his book Programming Pearls,[2] Jon Bentley reported: "Steve Johnson wrote the first version of spell in an afternoon in 1975." Bentley then sketched a reconstruction of that program, credited to Kernighan and Plauger,[3] as a Unix pipeline that we can rephrase in modern terms like this:
prepare filename |               Remove formatting commands
  tr A-Z a-z |                   Map uppercase to lowercase
    tr -c a-z '\n' |             Remove punctuation
      sort |                     Put words in alphabetical order
        uniq |                   Remove duplicate words
          comm -13 dictionary -  Report words not in dictionary
Here, prepare is a filter that strips whatever document markup is present; in the simplest case, it is just cat. We assume the argument syntax for the GNU version of the tr command.
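To make the pipeline concrete, here is a minimal sketch that wraps it in a shell function. The function name spell_sketch, its two-argument calling convention, the choice of cat as the prepare filter, and the LC_ALL=C setting are our assumptions for illustration; they are not part of the original prototype. The dictionary is assumed to be a sorted file of lowercase words, one per line.

```shell
# spell_sketch: illustrative rendering of the prototype pipeline.
# Hypothetical helper: the name and argument order are assumptions.
#   $1 = sorted dictionary of lowercase words, one per line
#   $2 = document to check
# The body runs in a subshell so that the locale setting does not
# leak into the caller's environment.
spell_sketch() (
    LC_ALL=C            # make sort and comm agree on collating order
    export LC_ALL
    cat "$2" |                  # "prepare": no markup to strip here
        tr A-Z a-z |            # Map uppercase to lowercase
        tr -c a-z '\n' |        # Replace every non-letter with a newline
        sort |                  # Put words in alphabetical order
        uniq |                  # Remove duplicate words
        comm -13 "$1" -         # Report words not in dictionary
)
```

It might be invoked as spell_sketch dictionary filename. Because every non-letter becomes a newline, runs of spaces and punctuation produce empty lines, so an empty line may head the report; adding the -s option to the second tr (tr -cs a-z '\n') squeezes those runs and reduces the effect.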
The only program in this pipeline that we have not seen before is comm: it compares two sorted files and selects, or rejects, lines common to both. Here, with the -13 option, it outputs only lines from the second file (the piped input) that are not in the first file (the dictionary). That output is the spelling-exception report.
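A small experiment shows the effect of -13. The file names and word lists below are made up for illustration, and we assume that mktemp -d is available to create a scratch directory.

```shell
# comm -13 demonstration; file names and word lists are hypothetical.
demo_dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$demo_dir/dictionary"
printf '%s\n' apple bananna cherry zebra > "$demo_dir/words"

# Both inputs are already sorted, as comm requires.
# -1 suppresses lines unique to the first file (the dictionary);
# -3 suppresses lines common to both files.
# What remains is column two: lines found only in the second file.
exceptions=$(LC_ALL=C comm -13 "$demo_dir/dictionary" "$demo_dir/words")
printf '%s\n' "$exceptions"     # prints bananna and zebra, one per line

rm -rf "$demo_dir"
```

The misspelled bananna and the unlisted zebra are reported, while apple and cherry, which appear in both files, are suppressed.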
* * *
comm
Usage
comm [ options ... ] file1 file2
Purpose
To indicate which lines in the two input files are unique or common.
Major options
-1
Do not print column one (lines unique to file1).
-2
Do not