Classic Shell Scripting - Arnold Robbins [178]

the first few paragraphs of the manual page for spell, a typical run might look like this:

    $ awk -f spell.awk testfile
    deroff
    eqn
    ier
    nx
    tbl
    thier

or in verbose mode, like this:

    $ awk -f spell.awk -- -verbose testfile
    testfile:7:eqn
    testfile:7:tbl
    testfile:11:deroff
    testfile:12:nx
    testfile:19:ier
    testfile:19:thier

Introductory Comments

The program begins with an extensive commentary, of which we show only the introduction and usage parts here:

    # Implement a simple spellchecker, with user-specifiable exception
    # lists. The built-in dictionary is constructed from a list of
    # standard Unix spelling dictionaries, which can be overridden on the
    # command line.
    #
    ...
    #
    # Usage:
    #       awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
    #           [=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
    #           [-strip] [-verbose] [file(s)]
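
The usage comment implies that the program must sort its own arguments into suffix files, dictionaries, flags, and input files. The following is only a sketch of how that classification might be done in awk; it is not the book's scan_options() function, and the name classify_args and its output labels are invented for illustration. Because the program text is given inline here rather than with -f, the -- that ends awk's own options comes before the program text.

```shell
# Hypothetical sketch: classify spell.awk-style arguments by their
# leading character. Not the code from spell.awk itself.
classify_args() {
    awk -- '
    BEGIN {
        Verbose = Strip = 0
        for (k = 1; k < ARGC; k++) {
            arg = ARGV[k]
            first = substr(arg, 1, 1)
            if (arg == "-verbose")
                Verbose = 1
            else if (arg == "-strip")
                Strip = 1
            else if (first == "=")
                print "suffix:" substr(arg, 2)
            else if (first == "+")
                print "dict:" substr(arg, 2)
            else {
                print "file:" arg
                continue        # leave real input files in ARGV
            }
            ARGV[k] = ""        # awk skips empty ARGV entries as input files
        }
        print "verbose=" Verbose, "strip=" Strip
        exit                    # sketch only: stop before reading any input
    }' "$@"
}

# Example call (the file names are made up):
#   classify_args =en.sfx +local.dict -verbose chapter1.txt
```

Clearing ARGV[k] for each recognized option is what would keep awk from later trying to open -verbose or +local.dict as an input file.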

Main Body

The main body of the program is just three lines, typical of many awk programs that initialize, process, and report:

    BEGIN   { initialize() }
            { spell_check_line() }
    END     { report_exceptions() }

All of the details are relegated to functions stored in alphabetical order in the remainder of the program file, but described in logical order in the following sections.
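
For readers who have not seen the initialize, process, and report layout before, here is a minimal sketch of the same three-line structure applied to a much simpler task, counting word occurrences. The shell wrapper count_words and the awk names tally_line(), report(), and Count are our own inventions for this illustration and do not appear in spell.awk.

```shell
# Hypothetical word-frequency counter; the function and variable names
# here are invented for illustration and are not part of spell.awk.
count_words() {
    awk '
    BEGIN { initialize() }
          { tally_line() }
    END   { report() }

    # Startup: choose the output field separator.
    function initialize() {
        OFS = ":"
    }

    # Per-record work: count each whitespace-separated field.
    function tally_line(    k) {
        for (k = 1; k <= NF; k++)
            Count[$k]++
    }

    # Final report: one word:count pair per line, in arbitrary order.
    function report(    word) {
        for (word in Count)
            print word, Count[word]
    }
    ' "$@" | LC_ALL=C sort
}

# Example call:
#   printf 'a b\na\n' | count_words
```

As in the spellchecker, awk lets the functions be defined after the rules that call them, so the three-line main body can stay at the top of the file.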

initialize()

The initialize() function handles program startup tasks.

The variable NonWordChars holds a regular expression that is later used to eliminate unwanted characters. Along with the ASCII letters and apostrophe, characters in the range 161 to 255 are preserved as word characters so that files in ASCII, any of the ISO 8859-n character sets, and Unicode in UTF-8 encoding all can be handled without further concern for character sets.

Characters 128 to 160 are ignored because in all of those character sets, they serve as additional control characters and a nonbreaking space. Some of those character sets have a few nonalphabetic characters above 160, but dealing with them would add undesirable character-set dependence. The nonalphabetic ones are rare enough that their worst effect on our program may be an occasional false report of a spelling exception.

We assume that files to be spellchecked have the same character-set encoding as their associated dictionaries. If that is not the case, then use iconv to convert them to a consistent encoding.
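
As a sketch of that conversion step, the small wrapper below re-encodes standard input with iconv. The function name reencode is hypothetical, and the encoding names shown in the example call are assumptions about what your files use; iconv -l lists the names that your system accepts.

```shell
# Hypothetical helper: re-encode standard input from encoding $1 to
# encoding $2, writing the result on standard output.
reencode() {
    iconv -f "$1" -t "$2"
}

# Example call (the file names are made up):
#   reencode ISO-8859-1 UTF-8 < letter.latin1 > letter.utf8
```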

If all awk implementations were POSIX-conformant, we would set NonWordChars like this:

NonWordChars = "[^'[:alpha:]]"

The current locale would then determine exactly which characters could be ignored. However, that assignment is not portable because many awk implementations do not yet support POSIX character classes such as [:alpha:] in regular expressions.
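
A script that wanted to use the POSIX form where it is available could probe the local awk first. The check below is a sketch under our own assumptions, not something spell.awk does: an awk without character-class support reads [[:alpha:]] as an ordinary bracket set followed by a literal ], which cannot match the one-character string "z". The function name awk_has_posix_classes is invented.

```shell
# Hypothetical probe: return 0 if the local awk understands [:alpha:],
# 1 otherwise (including when awk rejects the pattern outright).
awk_has_posix_classes() {
    awk 'BEGIN { if ("z" ~ /[[:alpha:]]/) exit 0; exit 1 }' 2>/dev/null || return 1
}

# Example use:
#   if awk_has_posix_classes; then echo "POSIX classes OK"; fi
```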

Before locales were introduced to Unix, we could have assigned NonWordChars the negation of the set of word characters:

NonWordChars = "[^'A-Za-z\241-\377]"

However, in the presence of locales, character ranges in regular expressions are interpreted in a locale-dependent fashion, so that value would not give consistent results across platforms. The solution is to replace the ranges with explicit enumerations of characters, writing the assignment as a concatenation of strings, neatly aligned so that a human can readily identify the characters in the negated set. We use octal representation for values above 127, since that is clearer than a jumble of accented characters.
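
To make the role of such a negated set concrete, here is a sketch of how it might be applied to a line of text: every character outside the set is replaced by a space, and the surviving fields are the candidate words. This is our own illustration, not the book's spell_check_line() function; the wrapper name words_in is invented, and the set is cut down to ASCII letters and the apostrophe (written as the octal escape \047) to keep the example short.

```shell
# Hypothetical sketch: print the words of each input line, one per line,
# using an explicitly enumerated negated set (ASCII-only stand-in).
words_in() {
    LC_ALL=C awk '
    BEGIN {
        NonWordChars = "[^" \
            "\047" \
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
            "abcdefghijklmnopqrstuvwxyz" \
            "]"
    }
    {
        gsub(NonWordChars, " ")         # blank out everything outside the set
        for (k = 1; k <= NF; k++)       # fields are re-split after gsub on $0
            print $k
    }'
}

# Example call:
#   printf "It's 3 o'clock, thier nx.\n" | words_in
```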

initialize() then identifies and loads dictionaries, and processes command-line arguments and suffix rules.

    function initialize()
    {
        NonWordChars = "[^" \
            "'" \
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
            "abcdefghijklmnopqrstuvwxyz" \
            "\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
            "\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
            "\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
            "\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
            "\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
            "\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
            "]"
        get_dictionaries()
        scan_options()
        load_dictionaries()
        load_suffixes()
        order_suffixes(
