Classic Shell Scripting - Arnold Robbins [182]
If suffix stripping is not requested, or if we did not find any replacement words in the dictionary, then the word is definitely a spelling exception. However, it is a bad idea to write a report at this point because we usually want to produce a sorted list of unique spelling exceptions. The word awk, for example, occurs more than 30 times in this chapter, but is not found in any of the standard Unix spelling dictionaries. Instead, we store the word in the global Exception array, and when verbose output is requested, we prefix the word with a location defined by a colon-terminated filename and line number. Reports of that form are common to many Unix tools and are readily understandable both to humans and smart text editors. Notice that the original lettercase is preserved in the report, even though it was ignored during the dictionary lookup:
function spell_check_word(word,        key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    if (lc_word in Dictionary)          # acceptable spelling
        return
    else                                # possible exception
    {
        if (Strip)
        {
            strip_suffixes(lc_word, wordlist)
            for (w in wordlist)
                if (w in Dictionary)
                    return
        }
        location = Verbose ? (FILENAME ":" FNR ":") : ""
        if (lc_word in Exception)
            Exception[lc_word] = Exception[lc_word] "\n" location word
        else
            Exception[lc_word] = location word
    }
}
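To see how the Exception array accumulates reports, here is a minimal sketch that wraps a cut-down version of the lookup in a shell function. The function name collect_exceptions, the two-word inline dictionary, and the splitting of input lines on whitespace are illustrative assumptions, not part of the original program; suffix stripping is omitted, and the final sort is left to the caller.

```shell
# Minimal sketch: gather spelling exceptions keyed by lowercase word.
# collect_exceptions and the tiny inline dictionary are hypothetical.
collect_exceptions() {
    awk '
    BEGIN {
        Dictionary["the"] = 1           # stand-in for a real word list
        Dictionary["uses"] = 1
        Verbose = 1                     # always prefix file:line: here
    }
    {
        for (k = 1; k <= NF; k++)
        {
            word = $k
            lc_word = tolower(word)     # lookup ignores lettercase
            if (lc_word in Dictionary)
                continue
            location = Verbose ? (FILENAME ":" FNR ":") : ""
            if (lc_word in Exception)   # repeated word: append a report
                Exception[lc_word] = Exception[lc_word] "\n" location word
            else
                Exception[lc_word] = location word
        }
    }
    END {
        for (key in Exception)          # traversal order is unspecified
            print Exception[key]
    }
    ' "$@"
}
```

Running collect_exceptions on a file and piping the result through sort yields one report line per occurrence. Each line keeps the word's original lettercase, while occurrences that differ only in case share a single Exception entry.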
strip_suffixes( )
When a word is not found in the dictionary and the -strip option has been specified, we call strip_suffixes( ) to apply the suffix rules. It loops over the suffix regular expressions in order of decreasing suffix length. If the word matches, the suffix is removed to obtain the root word. If there are no replacement suffixes, the root word is stored as an index of the wordlist array. Otherwise, we split the replacement list into its members and append each replacement in turn to the root word, storing each result as an index of the wordlist array. We need one special case in the inner loop, to check for the special two-character string "" (a pair of double quotes), which we replace with an empty string. After the first match has been processed, the break statement leaves the loop, and the function returns to the caller. Otherwise, the loop continues with the next suffix regular expression.
We could have made this function do a dictionary lookup for each candidate that we store in wordlist, and return a match indication. We chose not to because it mixes lookup with suffix processing and makes it harder to extend the program to display replacement candidates (Unix spell has the -x option to do that: for every input word that can take suffixes, it produces a list of correctly spelled words with the same root).
While suffix rules suffice for many Indo-European languages, others do not need them at all, and still others have more complex changes in spelling as words change in case, number, or tense. For such languages, the simplest solution seems to be a larger dictionary that incorporates all of the common word forms.
Here is the code:
function strip_suffixes(word, wordlist,        ending, k, n, regexp)
{
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++)
    {
        regexp = OrderedSuffix[k]
        if (match(word, regexp))
        {
            word = substr(word, 1, RSTART - 1)
            if (Replacement[regexp] == "")
                wordlist[word] = 1
            else
            {
                split(Replacement[regexp], ending)
                for (n in ending)
                {
                    if (ending[n] == "\"\"")
                        ending[n] = ""
                    wordlist[word ending[n]] = 1
                }
            }
            break
        }
    }
}
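The following sketch exercises the function on a few words so that the candidate lists can be inspected. The three suffix rules, the shell function name show_candidates, and the output format are assumptions made for illustration; the real program loads its rules elsewhere.

```shell
# Minimal sketch: print the candidates that strip_suffixes() generates.
# show_candidates and the three sample rules below are hypothetical.
show_candidates() {
    awk '
    function strip_suffixes(word, wordlist,        ending, k, n, regexp)
    {
        split("", wordlist)                     # clear previous candidates
        for (k = 1; k <= NOrderedSuffix; k++)
        {
            regexp = OrderedSuffix[k]
            if (match(word, regexp))
            {
                word = substr(word, 1, RSTART - 1)      # root word
                if (Replacement[regexp] == "")
                    wordlist[word] = 1
                else
                {
                    split(Replacement[regexp], ending)
                    for (n in ending)
                    {
                        if (ending[n] == "\"\"")        # "" means no ending
                            ending[n] = ""
                        wordlist[word ending[n]] = 1
                    }
                }
                break
            }
        }
    }
    BEGIN {
        # Sample rules, listed in order of decreasing suffix length
        OrderedSuffix[1] = "ies$"; Replacement["ies$"] = "ie ies y"
        OrderedSuffix[2] = "ed$";  Replacement["ed$"]  = "\"\" e"
        OrderedSuffix[3] = "s$";   Replacement["s$"]   = ""
        NOrderedSuffix = 3
    }
    {
        strip_suffixes(tolower($1), candidates)
        for (c in candidates)                   # order is unspecified
            print $1 ": " c
    }
    ' "$@"
}
```

Given one word per input line, show_candidates prints a line for each candidate. A word that matches no rule, such as awk, produces no output because wordlist is left empty. A caller that wants only correctly spelled candidates would still have to test each one against Dictionary, which is the separation of lookup from suffix processing described above.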
report_exceptions( )
The final job in our program is initiated by the last of the three pattern/action pairs. report_exceptions( ) sets up a pipeline to sort with command-line options that depend on whether the user requested a compact listing of unique exception words, or a verbose report with location information. In either case, we give sort the -f option