Classic Shell Scripting - Arnold Robbins

words can be reduced to shorter root words by stripping suffixes. For example, in English, jumped, jumper, jumpers, jumpier, jumpiness, jumping, jumps, and jumpy all have the root word jump. Suffixes sometimes change the final letters of a word: try is the root of triable, trial, tried, and trying. Thus, the set of base words that we need to store in a dictionary is several times smaller than the set of words that includes suffixes. Since I/O is relatively slow compared to computation, we suspect that it may pay to handle suffixes in our program, to reduce both the dictionary size and the number of false reports in the exception list.

load_suffixes( ) handles the loading of suffix rules. Unlike dictionary loading, here we have the possibility of supplying built-in rules, instead of reading them from a file. Thus, we keep a global count of the number of entries in the array that holds the suffix-rule filenames.

The suffix rules bear some explanation, and to illustrate them, we show a typical rule set for English in Example 12-3. We match suffixes with regular expressions, each of which ends with $ to anchor it to the end of a word. When a suffix is stripped, it may be necessary to supply a replacement suffix, as for the reduction tr+ied to tr+y. Furthermore, there are often several possible replacements.

Example 12-3. Suffix rules for English: english.sfx

    '$                      # Jones' -> Jones
    's$                     # it's -> it
    ably$       able        # affably -> affable
    ed$         "" e        # breaded -> bread, flamed -> flame
    edly$       ed          # ashamedly -> ashamed
    es$         "" e        # arches -> arch, blues -> blue
    gged$       g           # debugged -> debug
    ied$        ie y        # died -> die, cried -> cry
    ies$        ie ies y    # series -> series, ties -> tie, flies -> fly
    ily$        y ily       # tidily -> tidy, wily -> wily
    ing$                    # jumping -> jump
    ingly$      "" ing      # alarmingly -> alarming or alarm
    lled$       l           # annulled -> annul
    ly$         ""          # acutely -> acute
    nnily$      n           # funnily -> fun
    pped$       p           # handicapped -> handicap
    pping$      p           # dropping -> drop
    rred$       r           # deferred -> defer
    s$                      # cats -> cat
    tted$       t           # committed -> commit

The simplest specification of a suffix rule is therefore a regular expression to match the suffix, followed by a whitespace-separated list of replacements. Since one of the possible replacements may be an empty string, we represent it by "". It can be omitted if it is the only replacement. English is both highly irregular and rich in loan words from other languages, so there are many suffix rules, and certainly far more than we have listed in english.sfx. However, the suffix list only reduces the incidence of false reports because it effectively expands the dictionary size; it does not affect the correct operation of the program.
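To make the rule format concrete, here is a minimal sketch of how one rule might be applied to one word. It is not the book's code: the shell function candidates and the awk variables word, regexp, replacements, root, and repl are names invented for this illustration. The sketch assumes that "" stands for the empty replacement and that an omitted list means a single empty replacement.

```sh
#!/bin/sh
# candidates WORD REGEXP REPLACEMENTS
# Hypothetical helper: print one candidate root word per replacement.
# REPLACEMENTS is a space-separated list; it may be the empty string.
candidates() {
    awk -v word="$1" -v regexp="$2" -v replacements="$3" '
    BEGIN {
        if (word !~ regexp)
            exit                        # the rule does not apply to this word
        n = split(replacements, parts, " ")
        if (n == 0) {                   # omitted list: one empty replacement
            parts[1] = ""
            n = 1
        }
        for (k = 1; k <= n; k++) {
            root = word
            # a pair of double quotes denotes the empty string
            repl = (parts[k] == "\"\"") ? "" : parts[k]
            sub(regexp, repl, root)     # replace the matched suffix
            print root
        }
    }'
}

candidates cried 'ied$' 'ie y'
candidates flamed 'ed$' '"" e'
candidates jumping 'ing$' ''
```

The three calls print crie and cry, then flam and flame, then jump. Not every candidate is a real word (crie, flam); presumably each one would be looked up in the dictionary, and the word accepted if any candidate is found.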

In order to make suffix-rule files maintainable by humans, it is essential that the rules can be augmented with comments to give examples of their application. We follow common Unix practice with comments that run from sharp (#) to end-of-line. load_suffixes( ) therefore strips comments and leading and trailing whitespace, and then discards empty lines. What remains is a regular expression and a list of zero or more replacements that are used elsewhere in calls to the awk built-in string substitution function, sub( ). The replacement list is stored as a space-separated string to which we can later apply the split( ) built-in function.
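The cleanup steps just described can be sketched as a small awk filter. This is an illustration under stated assumptions, not the body of load_suffixes( ): the function name parse_rules and the regexp|replacements output format are invented here so that the result of parsing is visible.

```sh
#!/bin/sh
# parse_rules: hypothetical filter that reads suffix-rule lines on standard
# input and prints "regexp|replacements" for each rule that remains.
parse_rules() {
    awk '
    {
        sub(/[ \t]*#.*$/, "")           # strip comment and whitespace before it
        sub(/^[ \t]+/, "")              # strip leading whitespace
        sub(/[ \t]+$/, "")              # strip trailing whitespace
        if ($0 == "")
            next                        # discard empty lines
        n = split($0, parts)            # parts[1] is the regexp, rest are replacements
        repl = ""
        for (k = 2; k <= n; k++)        # rebuild a space-separated replacement list
            repl = repl (k > 2 ? " " : "") parts[k]
        printf "%s|%s\n", parts[1], repl
    }'
}

printf '%s\n' \
    'ied$   ie y     # died -> die, cried -> cry' \
    '' \
    '    # a comment-only line' \
    'ing$            # jumping -> jump' | parse_rules
```

The blank line and the comment-only line are dropped, leaving ied$|ie y and ing$| (an empty replacement list). Keeping the replacements as one space-separated string means a later split( ) call can recover the individual entries.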

Suffix replacements can use & to represent matched text, although we have no examples of that feature in english.sfx.
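As a small illustration of that feature, a replacement of & leaves the word unchanged, so a hypothetical rule written as ies$ ie & y would yield the same candidates as the ies$ rule listed in english.sfx. The shell function replace_suffix and its awk variables below are names assumed for this sketch only.

```sh
#!/bin/sh
# replace_suffix WORD REGEXP REPLACEMENT
# Hypothetical helper: apply a single replacement with awk sub(), in which
# an ampersand in the replacement stands for the text matched by REGEXP.
replace_suffix() {
    awk -v word="$1" -v regexp="$2" -v repl="$3" '
    BEGIN {
        sub(regexp, repl, word)
        print word
    }'
}

replace_suffix series 'ies$' '&'        # prints series
replace_suffix wily 'ily$' '&'          # prints wily
replace_suffix tidily 'ily$' 'y'        # prints tidy
```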

We considered making load_suffixes( ) supply a missing $ anchor in the regular expression, but rejected that idea because it might limit the specification of suffix matching required for other languages. Suffix-rule files need to be prepared with considerable care anyway, and that job needs to be done only once for each language.

In the event that no suffix files are supplied, we load a default set of suffixes with empty replacement values. The split( ) built-in function helps to shorten the code for this initialization:

    function load_suffixes(        file, k, line, n, parts)
    {
        if (NSuffixFiles > 0)           # load suffix regexps from files
        {
            for (file
