Classic Shell Scripting - Arnold Robbins [184]
    for (k = 1; k <= NF; k++)
    {
        word = $k
        sub("^'+", "", word)    # strip leading apostrophes
        sub("'+$", "", word)    # strip trailing apostrophes
        if (word != "")
            spell_check_word(word)
    }
}
function spell_check_word(word,    key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    if (lc_word in Dictionary)          # acceptable spelling
        return
    else                                # possible exception
    {
        if (Strip)
        {
            strip_suffixes(lc_word, wordlist)
            for (w in wordlist)
                if (w in Dictionary)
                    return
        }
        location = Verbose ? (FILENAME ":" FNR ":") : ""
        if (lc_word in Exception)
            Exception[lc_word] = Exception[lc_word] "\n" location word
        else
            Exception[lc_word] = location word
    }
}
function strip_suffixes(word, wordlist,    ending, k, n, regexp)
{
    split("", wordlist)                 # clear any previous candidates
    for (k = 1; k <= NOrderedSuffix; k++)
    {
        regexp = OrderedSuffix[k]
        if (match(word, regexp))
        {
            word = substr(word, 1, RSTART - 1)
            if (Replacement[regexp] == "")
                wordlist[word] = 1
            else
            {
                split(Replacement[regexp], ending)
                for (n in ending)
                {
                    if (ending[n] == "\"\"")
                        ending[n] = ""
                    wordlist[word ending[n]] = 1
                }
            }
            break
        }
    }
}
function swap(a, i, j,    temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}
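To see what strip_suffixes() produces, it can help to run it in isolation. The following is a minimal sketch, not part of the program above: the shell wrapper strip_demo, the three suffix rules, and the sample words are all illustrative assumptions, with the rule tables filled in by hand rather than loaded from a suffix file. It prints every candidate base word generated for each input word.

```shell
#!/bin/sh
# Sketch only: strip_demo and its three rules are hypothetical, chosen to
# exercise each branch of strip_suffixes(); they are not the real suffix file.
strip_demo() {
    awk '
    BEGIN {
        # Rules are tried in this order, and the first match wins.
        NOrderedSuffix = 3
        OrderedSuffix[1] = "ies$"; Replacement["ies$"] = "ie ies y"
        OrderedSuffix[2] = "ed$";  Replacement["ed$"]  = "\"\" e"
        OrderedSuffix[3] = "s$";   Replacement["s$"]   = ""
    }

    {
        strip_suffixes($1, wordlist)
        for (w in wordlist)
            print $1 " -> " w
    }

    # Same logic as the listing above.
    function strip_suffixes(word, wordlist,    ending, k, n, regexp)
    {
        split("", wordlist)
        for (k = 1; k <= NOrderedSuffix; k++)
        {
            regexp = OrderedSuffix[k]
            if (match(word, regexp))
            {
                word = substr(word, 1, RSTART - 1)
                if (Replacement[regexp] == "")
                    wordlist[word] = 1
                else
                {
                    split(Replacement[regexp], ending)
                    for (n in ending)
                    {
                        if (ending[n] == "\"\"")
                            ending[n] = ""
                        wordlist[word ending[n]] = 1
                    }
                }
                break
            }
        }
    }
    ' | LC_ALL=C sort    # for-in order is unspecified, so sort the output
}

printf '%s\n' tries jumped cats | strip_demo
```

Under these assumed rules, tries yields the candidates trie, tries, and try; jumped yields jump and jumpe; and cats yields cat. A word that matches no rule, such as dog, yields no candidates at all, so spell_check_word() would report it as an exception unless the unstripped word is already in the dictionary.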
Retrospective on Our Spellchecker
The first version of a Unix spellchecker was the pipeline that we presented at the beginning of the chapter. The first Unix spelling program in C that we could find in The Unix Heritage Society archives[7] is the 1975 Version 6 Unix typo command; it is about 350 lines of C code. spell first appeared in the 1979 Version 7 Unix release, and took about 700 lines of C code. It was accompanied by a 940-word common English dictionary, supplemented by another 320 words each of American and British spelling variations. spell was omitted from the 1995 4.4 BSD-Lite source code release, presumably because of trade secret or copyright issues.
The modern OpenBSD spell is about 1100 lines of C code, with about 30 more words in each of its three basic dictionaries.
GNU ispell version 3.2 is about 13,500 lines of C code, and GNU aspell version 0.60 is about 29,500 lines of C++ and C code. Both have been internationalized, with dictionaries for 10 to 40 languages. ispell has significantly enlarged English dictionaries, with about 80,000 common words, plus 3750 or so American and British variations. The aspell dictionaries are even bigger: 142,000 English words plus about 4200 variations for each of American, British, and Canadian.
Our spellchecker, spell.awk, is a truly remarkable program, and you will appreciate it even more and understand awk even better if you reimplement the program in another programming language. Like Johnson's original 1975 spell command, its design and implementation took less than an afternoon.
In about 190 lines of code, made up of three pattern/action one-liners and 11 functions, it does most of what traditional Unix spell does, and more:
With the -verbose option, it reports location information for the spelling exceptions.
User control of dictionaries allows it to be readily applied to complex technical documents, and to text written in languages other than English.
User-definable suffix lists assist in the internationalization of spelling checks, and provide user control over suffix reduction, something that few other spellcheckers on any platform provide.
All of the associated dictionary and suffix files are simple text files that can be processed with any text editor, and with most Unix text utilities. Some spellcheckers keep their dictionaries in binary form, making the word lists hard to inspect, maintain, and update, and nearly impossible to use for other purposes.
The major dependence on character sets is the assumption in the initialization of NonWordChars of ASCII ordering in the lower 128 slots. Although IBM mainframe EBCDIC is not supported, European 8-bit character sets pose no problem, and even the two-million-character Unicode set in the multibyte UTF-8 encoding can be handled reasonably, although proper recognition and removal of non-ASCII Unicode punctuation