Classic Shell Scripting - Arnold Robbins [181]

in SuffixFiles)

        {
            while ((getline line < file) > 0)
            {
                sub(" *#.*$", "", line)         # strip comments
                sub("^[ \t]+", "", line)        # strip leading whitespace
                sub("[ \t]+$", "", line)        # strip trailing whitespace
                if (line == "")
                    continue
                n = split(line, parts)
                Suffixes[parts[1]]++
                Replacement[parts[1]] = parts[2]
                for (k = 3; k <= n; k++)
                    Replacement[parts[1]] = Replacement[parts[1]] " " \
                        parts[k]
            }
            close(file)
        }
    }
    else    # load default table of English suffix regexps
    {
        split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
        for (k in parts)
        {
            Suffixes[parts[k]] = 1
            Replacement[parts[k]] = ""
        }
    }
}
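The rule-parsing loop above can be tried in isolation. The following is a minimal sketch, not the book's code: the function name parse_suffix_rules and the sample rules are hypothetical, and the awk program reads rules from standard input rather than from the files named in SuffixFiles. It applies the same three sub() calls and then prints each pattern with its joined replacement list.

```shell
# Hypothetical helper (not from the book): read suffix rules on stdin and
# print "pattern -> [replacement list]" for each non-empty rule.
parse_suffix_rules() {
    awk '
    {
        line = $0
        sub(" *#.*$", "", line)      # strip comments
        sub("^[ \t]+", "", line)     # strip leading whitespace
        sub("[ \t]+$", "", line)     # strip trailing whitespace
        if (line == "")
            next                     # nothing left: skip this line
        n = split(line, parts)       # parts[1] is the suffix regexp
        replacement = parts[2]       # empty when the rule has no replacements
        for (k = 3; k <= n; k++)
            replacement = replacement " " parts[k]
        printf "%s -> [%s]\n", parts[1], replacement
    }'
}

# Sample input (made up for illustration): a comment line, a rule with
# three replacements and a trailing comment, and a rule with none.
printf '%s\n' '# sample rules' 'ies$ ie ies y   # flies' 'ly$' | parse_suffix_rules
```

A rule with no replacements ends up with an empty replacement string, which matches what the default table stores for every built-in suffix.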

order_suffixes( )

Suffix replacement needs to be handled carefully: in particular, it should be done with a longest-match-first algorithm. order_suffixes( ) takes the list of suffix rules saved in the global Suffixes array, and copies it into the OrderedSuffix array, indexing that array by an integer that runs from one to NOrderedSuffix.

order_suffixes( ) then uses a simple bubble sort to reorder the entries in OrderedSuffix by decreasing pattern length, using the swap( ) function in the innermost loop. swap( ) is simple: it exchanges elements i and j of its argument array. The complexity of this sorting technique is proportional to the square of the number of elements to be sorted, but NOrderedSuffix is not expected to be large, so this sort is unlikely to contribute significantly to the program's runtime:

function order_suffixes(    i, j, key)
{
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
        OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
        for (j = i + 1; j <= NOrderedSuffix; j++)
            if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
                swap(OrderedSuffix, i, j)
}

function swap(a, i, j,    temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}
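To see the ordering in action, here is a small sketch that runs the same double loop on a list of suffixes read from standard input. The wrapper name order_by_length and the local array name ordered are assumptions made for this illustration; the apostrophe suffixes from the default table are left out only to keep the shell quoting simple.

```shell
# Hypothetical wrapper (not from the book): sort whitespace-separated
# suffix patterns by decreasing length, one per output line.
order_by_length() {
    awk '
    function swap(a, i, j,    temp)
    {
        temp = a[i]
        a[i] = a[j]
        a[j] = temp
    }
    {
        n = split($0, ordered)
        # Same exchange loop as order_suffixes: after pass i, position i
        # holds a pattern at least as long as every later one.
        for (i = 1; i < n; i++)
            for (j = i + 1; j <= n; j++)
                if (length(ordered[i]) < length(ordered[j]))
                    swap(ordered, i, j)
        for (i = 1; i <= n; i++)
            print ordered[i]
    }'
}

printf '%s\n' 'ed$ edly$ es$ ing$ ingly$ ly$ s$' | order_by_length
```

With this input, ingly$ comes out first and s$ last, so a word such as "willingly" is tested against ingly$ before the shorter ly$ and s$ get a chance to match. The relative order of equal-length patterns is not significant.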

spell_check_line( )

We have now described all of the initialization code required for the program setup. The second pattern/action pair at the start of the program calls spell_check_line( ) for each line from the input stream.

The first task is to reduce the line to a list of words. The built-in function gsub( ) does the job for us in just one line of code, by replacing every character that matches NonWordChars with a space. Because that modifies the current record, awk resplits it into fields, so the resulting words are then available as $1, $2, ..., $NF, and it just takes a simple for loop to iterate over them, handing them off to spell_check_word( ) for individual treatment.

As a general awk programming convention, we avoid reference to anonymous numeric field names, like $1, in function bodies, preferring to restrict their use to short action-code blocks. We made an exception in this function: $k is the only such anonymous reference in the entire program. To avoid unnecessary record reassembly when it is modified, we copy it into a local variable and then strip outer apostrophes and send any nonempty result off to spell_check_word( ) for further processing:

function spell_check_line(    k, word)
{
    gsub(NonWordChars, " ")             # eliminate nonword chars
    for (k = 1; k <= NF; k++)
    {
        word = $k
        sub("^'+", "", word)            # strip leading apostrophes
        sub("'+$", "", word)            # strip trailing apostrophes
        if (word != "")
            spell_check_word(word)
    }
}

It is not particularly nice to have character-specific special handling once a word has been recognized. However, the apostrophe is an overloaded character: it serves both to indicate contractions in some languages and to provide outer quoting. Eliminating its quoting use reduces the number of false reports in the final spelling-exception list.

Apostrophe stripping poses a minor problem for Dutch, which uses it in the initial position in a small number of words: 'n for een, 's for des, and 't for het. Those cases are trivially handled by augmenting the exception dictionary.
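The word-splitting and apostrophe-stripping steps can be tried on their own with a sketch like the one below. It makes several assumptions: the wrapper name split_words is hypothetical, NonWordChars is set to a simplified ASCII-only value for this illustration, and the words are printed instead of being passed to spell_check_word( ). The apostrophe is written as the octal escape \047 only because the awk program sits inside a single-quoted shell string.

```shell
# Hypothetical wrapper (not from the book): print the words of each input
# line, one per line, after stripping outer apostrophes.
# NonWordChars here is a simplified, ASCII-only stand-in.
split_words() {
    awk -v NonWordChars="[^A-Za-z']" '
    {
        gsub(NonWordChars, " ")          # changes $0, so fields are resplit
        for (k = 1; k <= NF; k++)
        {
            word = $k                    # copy, so $k itself is never modified
            sub("^\047+", "", word)      # strip leading apostrophes
            sub("\047+$", "", word)      # strip trailing apostrophes
            if (word != "")
                print word
        }
    }'
}

printf '%s\n' "He said, 'don't panic' -- twice." | split_words
```

The inner apostrophe of don't survives, while the quoting apostrophes around the phrase are dropped. Feeding the sketch the Dutch word 't yields a bare t, which is the case the exception dictionary has to cover.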

spell_check_word( )

spell_check_word() is where the real work happens, but in most cases, the job is done quickly. If the lowercase word is found in the global Dictionary array, it is spelled correctly, and we can immediately return.

If the word is not in the word list,
