Classic Shell Scripting - Arnold Robbins [177]
We make a strong distinction between checking and correcting spelling. The latter requires knowledge of the format of the text, and invariably requires human confirmation, making it completely unsuited to batch processing. The automatic spelling correction offered by some web browsers and word processors is even worse because it is frequently wrong, and its second-guessing your typing quickly becomes extremely annoying.
The emacs text editor offers three good forms of spelling assistance during text entry: dynamic word completion can be invoked on demand to expand a partial word, spelling verification of the current word can be requested by a single keystroke, and the flyspell library can be used to request unobtrusive colored highlighting of suspect words.
As long as you can recognize misspellings when they are pointed out to you, it is better to have a spellchecker that reports a list of suspect words, and that allows you to provide a private list of special words not normally present in its dictionary, to reduce the size of that report. You can then use the report to identify errors, repair them, regenerate the report (which should now contain only correct words), and then add its contents to your private dictionary. Because our writing deals with technical material, which is often full of unusual words, in practice we keep a private and document-specific supplemental dictionary for every document that we write.
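The report-driven workflow just described can be sketched in a few lines of shell. This is only an illustration under stated assumptions: suspects is a hypothetical stand-in for a real spellchecker, built on the classic tr, sort, and comm pipeline rather than on the program developed later in this section, and the file names and sample text are invented for the demonstration. It also relies on mktemp, which is widely available but not required by POSIX.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Usage: suspects TEXTFILE WORDLIST...
# Prints the sorted, unique, lowercased words of TEXTFILE that occur in
# none of the word lists, one word per line.
suspects() {
    text=$1
    shift
    known=$(mktemp) || return 1
    # Merge all word lists into one lowercased, sorted list for comm.
    cat "$@" | tr 'A-Z' 'a-z' | sort -u > "$known"
    tr -cs "A-Za-z'" '\n' < "$text" |   # one word per line
        tr 'A-Z' 'a-z' |                # ignore lettercase
        sed '/^$/d' |                   # drop any empty line
        sort -u |
        comm -23 - "$known"             # keep words absent from the lists
    rm -f "$known"
}

# Invented sample data in a scratch directory.
work=$(mktemp -d) || exit 1
printf '%s\n' 'The awk progrm reads teh dictionary.' > "$work/draft.txt"
printf '%s\n' the reads dictionary program > "$work/system.dict"
: > "$work/private.dict"

# 1. First report: real errors mixed with unusual but correct words
#    (here: awk, progrm, teh).
suspects "$work/draft.txt" "$work/system.dict" "$work/private.dict" > "$work/report.1"

# 2. Repair the real errors (normally done by hand in an editor).
sed -e 's/progrm/program/' -e 's/teh/the/' "$work/draft.txt" > "$work/fixed.txt"

# 3. Regenerate the report; it should now hold only correct words (awk).
suspects "$work/fixed.txt" "$work/system.dict" "$work/private.dict" > "$work/report.2"

# 4. Add those words to the private dictionary.
cat "$work/report.2" >> "$work/private.dict"

# 5. A final run should report nothing.
suspects "$work/fixed.txt" "$work/system.dict" "$work/private.dict" > "$work/report.3"
[ -s "$work/report.3" ] || echo "no suspect words remain"

rm -rf "$work"
```

Because comm needs sorted input, the helper sorts the merged word lists itself on every run. That is acceptable for a demonstration, but it is one of the costs that the design goals below are meant to avoid.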
To guide the programming, here are the desired design goals for our spellchecker. Following the practice of ISO standards, we use "shall" to indicate a requirement and "should" to mark a desire:
The program shall be able to read a text stream, isolate words, and report instances of words that are not in a list of known words, called the spelling dictionary.
There shall be a default word list, collected from one or more system dictionaries.
It shall be possible to replace the default word list.
It shall be possible to augment the standard word list with entries from one or more user-provided word lists. These lists are particularly necessary for technical documents, which contain acronyms, jargon, and proper nouns, most of which would not be found in the standard list.
Word lists shall not require sorting, unlike those for Unix spell, which behaves incorrectly when the locale is changed.
Although the default word lists are to be in English, with suitable alternate word lists, the program shall be capable of handling text in any language that can be represented by ASCII-based character sets encoded in streams of 8-bit bytes, and in which words are separated by whitespace. This eliminates the difficult case of languages, such as Lao and Thai, that lack interword spaces, and thus require extensive linguistic analysis to identify words.
Lettercase shall be ignored to keep the word-list sizes manageable, but exceptions shall be reported in their original lettercase.
Punctuation and digits shall be ignored, but the apostrophe shall be considered a letter.
The default report shall be a sorted list of unique words that are not found in the combined word lists, displayed one word per line. This is the spelling exception list.
There shall be an option to augment the exception-list report with location information, such as filename and line number, to facilitate finding and correcting misspelled words. The report shall be sorted by location and, when there are multiple exceptions at one location, sorted further by exception words.
User-specifiable suffix reduction should be supported to keep word-list sizes manageable.
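To make several of these goals concrete, here is a minimal sketch, not the complete program presented later in this section. The function name spell_sketch and its -v option are our own inventions for illustration. The sketch covers unsorted word lists, multiple lists, case-insensitive lookup with exceptions reported in their original lettercase, the apostrophe treated as a letter, and the two report formats. It makes simplifying assumptions: only the ASCII letters A-Z and a-z count as letters, apostrophes at the edges of a word are treated as quotation marks and stripped, filenames contain neither a colon nor an equals sign, and there is no default word list and no suffix reduction.

```shell
#!/bin/sh
# Hypothetical sketch, for illustration only.
# Usage: spell_sketch [-v] TEXTFILE WORDLIST...
#   -v   report filename:line:word instead of a plain word list
spell_sketch() {
    verbose=0
    if [ "$1" = "-v" ]; then
        verbose=1
        shift
    fi
    text=$1
    shift
    # The assignments isdict=1 and isdict=0 in the file list tell awk
    # whether it is reading a word list or the text to be checked.
    awk -v verbose="$verbose" -v q="'" '
        BEGIN {
            nonword = "[^A-Za-z" q "]+"     # runs of punctuation, digits, space
            edge = "^" q "+|" q "+$"        # apostrophes used as quotes
        }
        # Word lists: one word per line, in any order, stored in lowercase.
        isdict { known[tolower($1)] = 1; next }
        # Text: isolate words and report those missing from the lists.
        {
            line = $0
            gsub(nonword, " ", line)
            n = split(line, words, " ")
            for (i = 1; i <= n; i++) {
                w = words[i]
                gsub(edge, "", w)
                if (w == "" || (tolower(w) in known))
                    continue
                if (verbose)
                    print FILENAME ":" FNR ":" w
                else
                    print w                 # original lettercase
            }
        }
    ' isdict=1 "$@" isdict=0 "$text" |
    if [ "$verbose" -eq 1 ]; then
        # Sort by filename, then line number, then exception word.
        sort -t: -k1,1 -k2,2n -k3
    else
        sort -u
    fi
}
```

With hypothetical files chapter.txt, words.dict, and private.dict, the command spell_sketch chapter.txt words.dict private.dict would print the exception list, and adding -v as the first argument would prefix each exception with its location. Holding the known words in an awk associative array is what removes the need for sorted word lists: membership is tested by lookup rather than by merging two ordered streams.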
In Example 12-4 near the end of this section, we present a complete program that meets all of these goals, and more. This program does quite a lot, so in the rest of this section, we describe it in detail as a semiliterate program with explanatory prose and code fragments.
With a test input file containing