Classic Shell Scripting - Arnold Robbins [180]
load_suffixes( ) handles the loading of suffix rules. Unlike dictionary loading, here we have the possibility of supplying built-in rules, instead of reading them from a file. Thus, we keep a global count of the number of entries in the array that holds the suffix-rule filenames.
The suffix rules bear some explanation, and to illustrate them, we show a typical rule set for English in Example 12-3. We match suffixes with regular expressions, each of which ends with $ to anchor it to the end of a word. When a suffix is stripped, it may be necessary to supply a replacement suffix, as for the reduction tr+ied to tr+y. Furthermore, there are often several possible replacements.
Example 12-3. Suffix rules for English: english.sfx
'$                      # Jones' -> Jones
's$                     # it's -> it
ably$       able        # affably -> affable
ed$         "" e        # breaded -> bread, flamed -> flame
edly$       ed          # ashamedly -> ashamed
es$         "" e        # arches -> arch, blues -> blue
gged$       g           # debugged -> debug
ied$        ie y        # died -> die, cried -> cry
ies$        ie ies y    # series -> series, ties -> tie, flies -> fly
ily$        y ily       # tidily -> tidy, wily -> wily
ing$                    # jumping -> jump
ingly$      "" ing      # alarmingly -> alarming or alarm
lled$       l           # annulled -> annul
ly$         ""          # acutely -> acute
nnily$      n           # funnily -> fun
pped$       p           # handicapped -> handicap
pping$      p           # dropping -> drop
rred$       r           # deferred -> defer
s$                      # cats -> cat
tted$       t           # committed -> commit
The simplest specification of a suffix rule is therefore a regular expression to match the suffix, followed by a whitespace-separated list of replacements. Since one of the possible replacements may be an empty string, we represent it by "". It can be omitted if it is the only replacement. English is both highly irregular and rich in loan words from other languages, so there are many suffix rules, certainly far more than we have listed in english.sfx. However, the suffix list serves only to reduce the incidence of false reports, because it effectively expands the dictionary size; an incomplete list does not affect the correct operation of the program.
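To make the rule format concrete, here is a minimal sketch of how a single rule might be applied to a word. The shell function apply_rule and its variable names are hypothetical; they are not part of the spellchecker described here, and the sketch assumes that replacements contain no whitespace.

```sh
# apply_rule is a hypothetical helper, not part of the program in the text.
# Usage: apply_rule WORD REGEXP [REPLACEMENT...]
# Prints one candidate word per replacement, or nothing if REGEXP does not match.
apply_rule() {
    word=$1
    regexp=$2
    shift 2
    # A rule with no replacements means "replace the suffix with nothing".
    if [ $# -eq 0 ]; then
        set -- '""'
    fi
    awk -v word="$word" -v regexp="$regexp" -v list="$*" '
        BEGIN {
            if (word !~ regexp)
                exit                        # suffix absent: no candidates
            n = split(list, repl, " ")
            for (k = 1; k <= n; k++) {
                if (repl[k] == "\"\"")      # "" stands for the empty string
                    repl[k] = ""
                candidate = word
                sub(regexp, repl[k], candidate)
                print candidate
            }
        }'
}
```

For example, apply_rule cried 'ied$' ie y prints the candidates crie and cry, of which only cry would be found in a dictionary, and apply_rule cats 's$' prints cat.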
In order to make suffix-rule files maintainable by humans, it is essential that the rules can be augmented with comments to give examples of their application. We follow common Unix practice with comments that run from sharp (#) to end-of-line. load_suffixes( ) therefore strips comments and leading and trailing whitespace, and then discards empty lines. What remains is a regular expression and a list of zero or more replacements that are used elsewhere in calls to the awk built-in string substitution function, sub( ). The replacement list is stored as a space-separated string to which we can later apply the split( ) built-in function.
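The cleanup steps just described can be sketched as a small standalone filter. This is an illustration under stated assumptions, not the code of load_suffixes( ): the function name parse_rules and the bracketed output format are hypothetical, and the sketch assumes that # never appears inside a regular expression or a replacement.

```sh
# parse_rules is a hypothetical helper: it reads suffix rules on standard
# input and prints the regexp and the space-separated replacement list.
parse_rules() {
    awk '
        {
            sub(/[ \t]*#.*$/, "")           # strip comment to end of line
            sub(/^[ \t]+/, "")              # strip leading whitespace
            sub(/[ \t]+$/, "")              # strip trailing whitespace
            if ($0 == "")
                next                        # discard empty lines
            n = split($0, parts)            # parts[1] is the regexp
            list = parts[2]                 # empty when no replacements are given
            for (k = 3; k <= n; k++)
                list = list " " parts[k]
            printf "regexp=[%s] replacements=[%s]\n", parts[1], list
        }'
}
```

Running parse_rules with english.sfx on standard input would print one line per rule, with an empty replacement list for rules such as ing$ that give no replacements.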
Suffix replacements can use & to represent matched text, although we have no examples of that feature in english.sfx.
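As a hedged illustration of that feature, the following sketch applies an invented rule, ily$ y &, which does not appear in english.sfx. Because awk's sub( ) expands & in the replacement to the matched text, the rule should behave like ily$ y ily. The function name ampersand_demo is hypothetical.

```sh
# ampersand_demo is a hypothetical helper; the rule "ily$ y &" is invented
# for illustration and does not appear in english.sfx.
ampersand_demo() {
    awk -v word="$1" '
        BEGIN {
            regexp = "ily$"
            n = split("y &", repl, " ")
            for (k = 1; k <= n; k++) {
                candidate = word
                sub(regexp, repl[k], candidate)   # & expands to the matched suffix
                print candidate
            }
        }'
}
```

With the argument tidily, the function prints tidy and then tidily: the second replacement puts the matched suffix back, leaving the word unchanged.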
We considered making load_suffixes( ) supply a missing $ anchor in the regular expression, but rejected that idea because it might limit the specification of suffix matching required for other languages. Suffix-rule files need to be prepared with considerable care anyway, and that job needs to be done only once for each language.
In the event that no suffix files are supplied, we load a default set of suffixes with empty replacement values. The split( ) built-in function helps to shorten the code for this initialization:
function load_suffixes(        file, k, line, n, parts)
{
    if (NSuffixFiles > 0)               # load suffix regexps from files
    {
        for (file