Classic Shell Scripting - Arnold Robbins [179]
}
get_dictionaries( )
get_dictionaries() fills in a list of default system dictionaries: we supply two convenient ones. The user can override that choice by providing a list of dictionaries as the value of the command-line variable Dictionaries, or the environment variable DICTIONARIES.
If Dictionaries is empty, we consult the environment array, ENVIRON, and use any value set there. If Dictionaries is still empty, we supply a built-in list. The selection of that list requires some care because there is considerable variation across Unix platforms and because, for small files, most of the runtime of this program is consumed by loading dictionaries. We chose the word list used by spell on some of our systems (about 25,000 entries), and a larger list prepared by Donald Knuth (about 110,000 words).[6] Otherwise, Dictionaries contains a whitespace-separated list of dictionary filenames, which we split and store in the global DictionaryFiles array.
Notice how the dictionary names are stored: they are array indices, rather than array values. There are two reasons for this design choice. First, it automatically handles the case of a dictionary that is supplied more than once: only one instance of the filename is saved. Second, it then makes it easy to iterate over the dictionary list with a for ( key in array ) loop. There is no need to maintain a variable with the count of the number of dictionaries.
Here is the code:
function get_dictionaries(        files, key)
{
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
        Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "")     # Use default dictionary list
    {
        DictionaryFiles["/usr/dict/words"]++
        DictionaryFiles["/usr/local/share/dict/words.knuth"]++
    }
    else                        # Use system dictionaries from command line
    {
        split(Dictionaries, files)
        for (key in files)
            DictionaryFiles[files[key]]++
    }
}
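The lookup order and the duplicate handling described above can be tried in isolation. The following is only a sketch: the function name count_unique, the array name seen, and the sample filenames are hypothetical, and the built-in default list is deliberately left out so that the result depends only on the input.

```shell
# Hypothetical helper: report how many distinct dictionary names would be kept.
# $1 plays the role of the command-line variable Dictionaries.
count_unique() {
    awk -v Dictionaries="$1" -- 'BEGIN {
        # Fall back to the environment only when nothing was given explicitly
        if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
            Dictionaries = ENVIRON["DICTIONARIES"]

        n = split(Dictionaries, files)      # files[1..n] holds the names
        for (k = 1; k <= n; k++)
            seen[files[k]]++                # name used as index: repeats collapse

        count = 0
        for (name in seen)                  # traversal order is unspecified
            count++
        print count
    }'
}

count_unique "/tmp/a.dict /tmp/b.dict /tmp/a.dict"    # prints 2
```

Because the names are indices, the repeated /tmp/a.dict occupies a single array element, and no separate counter has to be kept in step with the array.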
scan_options( )
scan_options( ) handles the command line. It expects to find options (-strip and/or -verbose), user dictionaries (indicated with a leading +, a Unix spell tradition), suffix-rule files (marked with a leading =), and files to be spellchecked. Any -v option to set the Dictionaries variable has already been handled by awk, and is not in the argument array, ARGV.
The last statement in scan_options( ) requires explanation. During testing, we found that nawk does not read standard input if empty arguments are left at the end of ARGV, whereas gawk and mawk do. We therefore reduce ARGC until we have a nonempty argument at the end of ARGV:
function scan_options(        k)
{
    for (k = 1; k < ARGC; k++)
    {
        if (ARGV[k] == "-strip")
        {
            ARGV[k] = ""
            Strip = 1
        }
        else if (ARGV[k] == "-verbose")
        {
            ARGV[k] = ""
            Verbose = 1
        }
        else if (ARGV[k] ~ /^=/)        # suffix file
        {
            NSuffixFiles++
            SuffixFiles[substr(ARGV[k], 2)]++
            ARGV[k] = ""
        }
        else if (ARGV[k] ~ /^[+]/)      # private dictionary
        {
            DictionaryFiles[substr(ARGV[k], 2)]++
            ARGV[k] = ""
        }
    }
    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
        ARGC--
}
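The effect of blanking an argument and then trimming ARGC can be observed with a small experiment. This is a sketch under stated assumptions: show_args is a hypothetical wrapper, only the -verbose option is handled, and the program consists of a BEGIN rule alone, so awk never tries to open the remaining arguments as files.

```shell
# Hypothetical demo: consume -verbose from ARGV, then report what remains.
# The leading "--" ends awk's own option parsing, so the option-like
# arguments reach the program through ARGV.
show_args() {
    awk -- 'BEGIN {
        for (k = 1; k < ARGC; k++)
            if (ARGV[k] == "-verbose")
            {
                ARGV[k] = ""                # awk skips empty arguments
                Verbose = 1
            }

        # Drop empty arguments left at the end of ARGV
        while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
            ARGC--

        print "ARGC=" ARGC, "Verbose=" (Verbose + 0)
    }' "$@"
}

show_args file1 -verbose    # prints: ARGC=2 Verbose=1
show_args -verbose file1    # prints: ARGC=3 Verbose=1
```

In the second call the emptied slot sits in the middle of ARGV rather than at the end, so ARGC is unchanged; only trailing empty arguments are removed.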
load_dictionaries( )
load_dictionaries() reads the word lists from all of the dictionaries. Notice how simple the code is: an outer loop over the DictionaryFiles array, and an inner loop that uses getline to read a line at a time. Each line contains exactly one word known to be spelled correctly. The dictionaries are created once, and then used repeatedly, so we assume that lines are free of whitespace, and we make no attempt to remove it. Each word is converted to lowercase and stored as an index of the global Dictionary array. No separate count of the number of entries in this array is needed because the array is used elsewhere only in membership tests. Among all of the data structures provided by various programming languages, associative arrays are the fastest and most concise way to handle such tests:
function load_dictionaries(        file, word)
{
    for (file in DictionaryFiles)
    {
        while ((getline word < file) > 0)
            Dictionary[tolower(word)]++
        close(file)
    }
}
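A membership test against an array loaded this way might look like the following sketch. The wrapper check_words, the temporary two-word dictionary, and the ok/unknown labels are all hypothetical, chosen only to keep the example self-contained.

```shell
# Hypothetical demo: load a tiny word list, then test each argument against it.
check_words() {
    dict=$(mktemp "${TMPDIR:-/tmp}/dict.XXXXXX") || return 1
    printf '%s\n' Apple banana > "$dict"    # stand-in for a real dictionary

    awk -v file="$dict" -- 'BEGIN {
        # Same loading loop as above, for a single file
        while ((getline word < file) > 0)
            Dictionary[tolower(word)]++
        close(file)

        # "in" tests membership without creating a new array element
        for (k = 1; k < ARGC; k++)
            print ARGV[k], ((tolower(ARGV[k]) in Dictionary) ? "ok" : "unknown")
    }' "$@"

    rm -f "$dict"
}

check_words APPLE cherry
# prints:
# APPLE ok
# cherry unknown
```

Using the in operator matters here: a plain reference such as Dictionary[w] would silently add w to the array, so later tests for the same misspelled word would appear to succeed.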
load_suffixes( )
In many languages,