Classic Shell Scripting - Arnold Robbins [174]

170 lines of code, including comments! A program in C that solved the same problem would take at least an order of magnitude more code, and most likely considerably longer to write, test, and debug. Furthermore, our solution, by generating commands that are executed separately, provides extra safety, since there is the opportunity for human inspection before making the commitment of changing file ownership. We think it nicely demonstrates the power of the Unix toolset and the Software Tools approach to problem solving.

Chapter 12. Spellchecking

This chapter uses the task of spellchecking to demonstrate several different dimensions of shell scripting. After introducing the spell program, we show how a simple but useful spellchecker can be constructed almost entirely out of stock Unix tools. We then proceed to show how simple shell scripts can be used to modify the output of two freely available spellchecking programs to produce results similar to those of the traditional Unix spell program. Finally, we present a powerful spellchecker written in awk, which nicely demonstrates the elegance of that language.

The spell Program

The spell program does what you think it does: it checks a file for spelling errors. It reads through all the files named on the command line, producing, on standard output, a sorted list of words that are not in its dictionary or that cannot be derived from such words by the application of standard English grammatical rules (e.g., "words" from "word"). Interestingly enough, POSIX does not standardize spell. The Rationale document has this to say:

This utility is not useful from shell scripts or typical application programs. The spell utility was considered, but was omitted because there is no known technology that can be used to make it recognize general language for user-specified input without providing a complete dictionary along with the input file.

We disagree with the first part of this statement. Consider a script for automated bug or trouble reporting: one might well want to have something along these lines:

#! /bin/sh -

# probreport --- simple problem reporting program

file=/tmp/report.$$
echo "Type in the problem, finish with Control-D."
cat > $file

while true
do
    printf "[E]dit, Spell [C]heck, [S]end, or [A]bort: "
    read choice
    case $choice in
    [Ee]*)  ${EDITOR:-vi} $file
            ;;
    [Cc]*)  spell $file
            ;;
    [Aa]*)  exit 0
            ;;
    [Ss]*)  break    # from loop
            ;;
    esac
done
...                  Send report

In this chapter, we examine spellchecking from several different angles, since it's an interesting problem, and it gives us an opportunity to solve the problem in several different ways.

The Original Unix Spellchecking Prototype

Spellchecking has been the subject of more than 300 research papers and books.[1] In his book Programming Pearls,[2] Jon Bentley reported: Steve Johnson wrote the first version of spell in an afternoon in 1975. Bentley then sketched a reconstruction credited to Kernighan and Plauger[3] of that program as a Unix pipeline that we can rephrase in modern terms like this:

prepare filename |                  Remove formatting commands
  tr A-Z a-z |                      Map uppercase to lowercase
    tr -c a-z '\n' |                Remove punctuation
      sort |                        Put words in alphabetical order
        uniq |                      Remove duplicate words
          comm -13 dictionary -     Report words not in dictionary

Here, prepare is a filter that strips whatever document markup is present; in the simplest case, it is just cat. We assume the argument syntax for the GNU version of the tr command.
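As a rough, runnable sketch of that pipeline, the function below takes prepare to be cat (the simplest case mentioned above) by reading the document from standard input. The name spell_sketch, the extra sed stage that discards the empty lines left behind by tr -c, and the LC_ALL=C settings are our assumptions for illustration, not part of the original prototype. The dictionary is assumed to hold lowercase words, one per line, sorted in the C locale.

```shell
#! /bin/sh -

# spell_sketch --- report input words that are absent from a word list
#
# Usage: spell_sketch dictionary < document
#
# Hypothetical helper: the stages mirror the pipeline in the text, plus
# one sed stage (our addition) that drops empty lines.

spell_sketch() {
    LC_ALL=C tr 'A-Z' 'a-z' |              # Map uppercase to lowercase
        LC_ALL=C tr -c 'a-z' '\n' |        # Turn every nonletter into a newline
        sed '/^$/d' |                      # Drop empty lines (our addition)
        LC_ALL=C sort |                    # Put words in alphabetical order
        uniq |                             # Remove duplicate words
        LC_ALL=C comm -13 "$1" -           # Keep words not in the dictionary
}
```

Forcing the C locale for sort and comm keeps both programs in agreement about ordering; if the dictionary were sorted under a different locale, comm could misreport words.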

The only program in this pipeline that we have not seen before is comm: it compares two sorted files and selects, or rejects, lines common to both. Here, with the -13 option, it outputs only lines from the second file (the piped input) that are not in the first file (the dictionary). That output is the spelling-exception report.
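To make the column selection concrete, here is a small sketch with three thin wrappers around comm. The function names are hypothetical, chosen only for this illustration; each one suppresses two of comm's three output columns, and both input files are assumed to be sorted already in the C locale.

```shell
# Hypothetical wrappers around comm; both inputs must already be sorted.

only_in_first()  { LC_ALL=C comm -23 "$1" "$2"; }   # lines unique to file1
only_in_second() { LC_ALL=C comm -13 "$1" "$2"; }   # lines unique to file2
in_both()        { LC_ALL=C comm -12 "$1" "$2"; }   # lines common to both
```

With a dictionary file holding apple, banana, and cherry, and a word file holding apple, bananna, cherry, and zebra, only_in_second dictionary words would print bananna and zebra, which is exactly the role that comm -13 plays at the end of the spelling pipeline.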

* * *

comm


Usage

comm [ options ... ] file1 file2

Purpose

To indicate which lines in the two input files are unique or common.

Major options

-1

Do not print column one (lines unique to file1).

-2

Do not
