Classic Shell Scripting - Arnold Robbins [62]
[5] Available at http://www.math.utah.edu/pub/sgml/.
Cheating at Word Puzzles
Crossword puzzles give you clues about words, but most of us get stuck when we cannot think of, say, a ten-letter word that begins with a b and has either an x or a z in the seventh position.
Regular-expression pattern matching with awk or grep is clearly called for, but what files do we search? One good choice is the Unix spelling dictionary, available as /usr/dict/words on many systems. (Other popular locations for this file are /usr/share/dict/words and /usr/share/lib/dict/words.) This is a simple text file, with one word per line, sorted in lexicographic order. We can easily create similar word lists from any collection of text files, like this:
cat file(s) | tr A-Z a-z | tr -c a-z\' '\n' | sort -u
The second pipeline stage converts uppercase to lowercase, the third replaces nonletters by newlines, and the last sorts the result, keeping only unique lines. The third stage treats apostrophes as letters, since they are used in contractions. Every Unix system has collections of text that can be mined in this way—for example, the formatted manual pages in /usr/man/cat*/* and /usr/local/man/cat*/*. On one of our systems, they supplied more than 1 million lines of prose and produced a list of about 44,000 unique words. There are also word lists for dozens of languages in various Internet archives.[6]
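To make the stages concrete, here is a minimal sketch that runs the same pipeline over one line of sample text. The function name make_wordlist and the sample sentence are invented for this illustration, and setting LC_ALL=C is an added assumption that keeps the tr ranges and the sort order predictable.

```shell
#! /bin/sh
# Sketch only: make_wordlist and the sample text are hypothetical.
LC_ALL=C; export LC_ALL   # assumption: C locale for predictable ranges and ordering

make_wordlist() {
    # Stage 1: fold uppercase to lowercase.
    # Stage 2: turn every character that is not a letter or an apostrophe
    #          into a newline.
    # Stage 3: sort, keeping one copy of each line.
    tr A-Z a-z | tr -c "a-z'" '\n' | sort -u
}

words=$(printf '%s\n' "Don't panic. The cat sat; the CAT slept." | make_wordlist)
printf '%s\n' "$words"
```

Each nonletter becomes its own newline, so runs of punctuation and spaces produce empty lines. The sort -u stage collapses them to a single empty line, which is harmless when the list is later searched for letter patterns.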
Let us assume that we have built up a collection of word lists in this way, and stored them in a standard place that we can reference from a script. We can then write the program shown in Example 5-4.
Example 5-4. Word puzzle solution helper
#! /bin/sh
# Match an egrep(1)-like pattern against a collection of
# word lists.
#
# Usage:
#       puzzle-help egrep-pattern [word-list-files]

FILES="
        /usr/dict/words
        /usr/share/dict/words
        /usr/share/lib/dict/words
        /usr/local/share/dict/words.biology
        /usr/local/share/dict/words.chemistry
        /usr/local/share/dict/words.general
        /usr/local/share/dict/words.knuth
        /usr/local/share/dict/words.latin
        /usr/local/share/dict/words.manpages
        /usr/local/share/dict/words.mathematics
        /usr/local/share/dict/words.physics
        /usr/local/share/dict/words.roget
        /usr/local/share/dict/words.sciences
        /usr/local/share/dict/words.unix
        /usr/local/share/dict/words.webster
      "

pattern="$1"

egrep -h -i "$pattern" $FILES 2> /dev/null | sort -u -f
The FILES variable holds the built-in list of word-list files, customized to the local site. The egrep option -h suppresses filenames from the report, the -i option ignores lettercase, and we discard the standard error output with 2> /dev/null, in case any of the word-list files don't exist or lack the necessary read permission. (This kind of redirection is described in Section 7.3.2.) The final sort stage reduces the report to just a list of unique words, ignoring lettercase.
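The usage comment mentions optional word-list-files, but the script as shown reads only its first argument. One possible variant, offered as a sketch and not as the authors' code, lets files named on the command line replace the built-in list. The function name puzzle_help and the shortened DEFAULT_FILES list are hypothetical, and grep -E is used here as the POSIX spelling of egrep.

```shell
#! /bin/sh
# Hypothetical variant of puzzle-help: word-list files given as
# arguments, if any, replace the built-in list.

# Shortened stand-in for the site-specific FILES list.
DEFAULT_FILES="
    /usr/dict/words
    /usr/share/dict/words
    /usr/share/lib/dict/words
"

puzzle_help() {
    if [ $# -lt 1 ]; then
        echo "Usage: puzzle_help egrep-pattern [word-list-files]" >&2
        return 1
    fi
    pattern="$1"
    shift
    # With no file arguments left, fall back to the built-in list.
    if [ $# -eq 0 ]; then
        set -- $DEFAULT_FILES
    fi
    # Missing or unreadable files are silently skipped, as in the original.
    grep -E -h -i -e "$pattern" "$@" 2> /dev/null | sort -u -f
}
```

Called as puzzle_help '^b.....[xz]...$' it searches the default files; called with one or more filenames after the pattern, it searches only those.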
Now we can find the word that we were looking for:
$ puzzle-help '^b.....[xz]...$' | fmt
bamboozled Bamboozler bamboozles bdDenizens bdWheezing Belshazzar
botanizing Brontozoum Bucholzite bulldozing
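The pattern works by counting positions, which the following sketch spells out against a handful of candidate words. The candidates are chosen only for illustration, and grep -E stands in for egrep.

```shell
#! /bin/sh
# Anatomy of '^b.....[xz]...$':
#   ^b      position 1 is the letter b
#   .....   positions 2 through 6 may be any characters
#   [xz]    position 7 is x or z
#   ...$    positions 8 through 10, then end of line
# The candidate words below are illustrative only.

matches=$(printf '%s\n' bamboozled bulldozing bulldozer blizzard Belshazzar |
          grep -E -i '^b.....[xz]...$')
printf '%s\n' "$matches"
```

Here bulldozer and blizzard are rejected because they are too short for the anchored pattern, while Belshazzar is accepted because of -i. The same pattern can be written with interval expressions as '^b.{5}[xz].{3}$'.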
Can you think of an English word with six consonants in a row? Here's some help:
$ puzzle-help '[^aeiouy]{6}' /usr/dict/words
Knightsbridge
mightn't
oughtn't
If you don't count y as a vowel, many more turn up: encryption, klystron, porphyry, syzygy, and so on.
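The effect of treating y as a vowel or as a consonant can be checked on a small sample. This is a sketch with an invented word sample and hypothetical variable names strict and loose. Note that a negated bracket expression matches any character outside the set, not only consonants, which is why the apostrophe in mightn't and oughtn't counts toward the run of six.

```shell
#! /bin/sh
# Sketch only: the word sample is invented for this illustration.
words='Knightsbridge
encryption
syzygy
rhythm
strengths'

# y treated as a vowel: it breaks a run of consonants.
strict=$(printf '%s\n' "$words" | grep -E -i '[^aeiouy]{6}')

# y treated as a consonant: it extends a run.
loose=$(printf '%s\n' "$words" | grep -E -i '[^aeiou]{6}')

printf 'y as vowel:\n%s\n\ny as consonant:\n%s\n' "$strict" "$loose"
```

Only Knightsbridge survives the stricter pattern, and strengths matches neither because its longest run has five letters.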
We could readily exclude the contractions from the word lists by a final filter step—egrep -i '^[a-z]+$'—but there is little harm in leaving them in the word lists.
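As a small sketch of that final filter, applied here to the three words reported above (grep -E again stands in for egrep, and the C locale is an added assumption so that a-z means exactly the ASCII letters):

```shell
#! /bin/sh
# Sketch only: keep lines made up entirely of letters.
LC_ALL=C; export LC_ALL   # assumption: a-z covers only ASCII letters

filtered=$(printf '%s\n' "Knightsbridge" "mightn't" "oughtn't" |
           grep -E -i '^[a-z]+$')
printf '%s\n' "$filtered"
```

The anchors force the whole line to consist of letters, so any word containing an apostrophe is dropped.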
* * *
[6] Available at ftp://ftp.ox.ac.uk/pub/wordlists/, ftp://qiclab.scn.rain.com/pub/wordlists/, ftp://ibiblio.org/pub/docs/books/gutenberg/etext96/pgw*, and http://www.phreak.org/html/wordlists.shtml. A search for "word list" in any Internet search engine turns up many more.
Word Lists
From 1983 to 1987, Bell Labs researcher