Classic Shell Scripting - Arnold Robbins [64]

factor is small: for n about 1 million, it is about 20. Thus, in practice, we expect wf to take only a few times longer than simply copying its input stream with cat.

Here is an example of applying this script to the text of Shakespeare's most popular play, Hamlet,[9] reformatting the output with pr to a four-column display:

$ wf 12 < hamlet | pr -c4 -t -w80

1148 the            671 of              550 a               451 in
 970 and            635 i               514 my              419 it
 771 to             554 you             494 hamlet          407 that

The results are about as expected for English prose. More interesting, perhaps, is to ask how many unique words there are in the play:

$ wf 999999 < hamlet | wc -l

4548

and to look at some of the least-frequent words:

$ wf 999999 < hamlet | tail -n 12 | pr -c4 -t -w80

1 yaw               1 yesterday         1 yielding          1 younger
1 yawn              1 yesternight       1 yon               1 yourselves
1 yeoman            1 yesty             1 yond              1 zone

There is nothing magic about the argument 999999: it just needs to be a number larger than any expected count of unique words, and the keyboard repeat feature makes it easy to type.
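To see why any sufficiently large argument works, here is a minimal sketch of a wf-like pipeline. It is not the script from the preceding section, which is not reproduced on this page: the function name wf_sketch, the default of 25 lines, and the particular tr and sort options are assumptions made for illustration. The argument only tells the last stage how many lines to print before quitting, so a number larger than the vocabulary simply prints everything.

```shell
# wf_sketch [n]: print the n most frequent words found on standard input.
# Hypothetical stand-in for wf; the default n of 25 is an assumed value.
wf_sketch() {
    tr -cs "A-Za-z'" '\n' |     # replace each run of non-word characters by a newline
        tr A-Z a-z |            # fold uppercase to lowercase
        sort |                  # bring identical words together
        uniq -c |               # prefix each distinct word with its count
        sort -k1,1nr -k2 |      # highest count first, ties in alphabetical order
        sed "${1:-25}q"         # quit after printing n lines
}
```

With this definition, wf_sketch 12 < hamlet would list the twelve most frequent words, and wf_sketch 999999 < hamlet | wc -l would count all of them. The counts need not match those shown above exactly, because they depend on how the sketch defines a word.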

We can also ask how many of the 4548 unique words were used just once:

$ wf 999999 < hamlet | grep -c '^ *1·'

2634

The · following the digit 1 in the grep pattern represents a tab. This result is surprising, and probably atypical of most modern English prose: although the play's vocabulary is large, nearly 58 percent of its unique words occur only once. And yet, the core vocabulary of frequently occurring words is rather small:

$ wf 999999 < hamlet | awk '$1 >= 5' | wc -l

740

This is about the number of words that a student might be expected to learn in a semester course on a foreign language, or that a child learns before entering school.
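The three counts above can also be gathered in a single pass, which makes the percentage explicit. The following is a sketch only: the function name vocab_summary is hypothetical, and it assumes input lines of the form "count word", as wf prints them. Because awk splits fields on any whitespace, it does not matter whether a tab or a space follows the count.

```shell
# vocab_summary: read "count word" lines on standard input and report the
# number of unique words, those used once, and those used five or more times.
# The function name is hypothetical.
vocab_summary() {
    awk '
        { total++ }             # one input line per unique word
        $1 == 1 { once++ }      # words that occur exactly once
        $1 >= 5 { core++ }      # the frequently occurring core
        END {
            if (total > 0)
                printf "%d unique, %d once (%.1f%%), %d used 5+ times\n",
                       total, once, 100 * once / total, core
        }'
}
```

Run as wf 999999 < hamlet | vocab_summary, this should reproduce the figures quoted above: 2634 of 4548 is 57.9 percent, the "nearly 58 percent" of the text.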

Shakespeare didn't have computers to help analyze his writing,[10] but we can speculate that part of his genius was in making most of what he wrote understandable to the broadest possible audience of his time.

When we applied wf to the individual texts of Shakespeare's plays, we found that Hamlet has the largest vocabulary (4548), whereas Comedy of Errors has the smallest (2443). The total number of unique words in the Shakespeare corpus of plays and sonnets is nearly 23,700, which shows that you need exposure to several plays to enjoy the richness of his work. About 36 percent of those words are used only once, and only one word begins with x: Xanthippe, in Taming of the Shrew. Clearly, there is plenty of fodder in Shakespeare for word-puzzle enthusiasts and vocabulary analysts!
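A per-play comparison of this kind can be sketched with a short loop. This is an illustration under stated assumptions, not the procedure we actually used: the function names unique_words and vocab_by_file are hypothetical, the plays are assumed to be stored one per file, and a word is taken to be a run of letters and apostrophes, so the counts may differ somewhat from those quoted above.

```shell
# unique_words: count the distinct words on standard input.
# Assumed word definition: a run of letters and apostrophes, case-folded.
unique_words() {
    tr -cs "A-Za-z'" '\n' | tr A-Z a-z | grep . | sort -u | wc -l | tr -d ' '
}

# vocab_by_file file...: print "count<TAB>filename" for each file,
# sorted so that the smallest vocabulary comes first.
vocab_by_file() {
    for f in "$@"
    do
        printf '%s\t%s\n' "$(unique_words < "$f")" "$f"
    done | sort -n
}
```

Given a directory of play texts, vocab_by_file plays/* would then show the smallest vocabulary on the first line and the largest on the last. The directory name plays is only an example.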

* * *

[7] Programming Pearls: A Literate Program: A WEB program for common words, Comm. ACM 29(6), 471-483, June 1986, and Programming Pearls: Literate Programming: Printing Common Words, Comm. ACM 30(7), 594-599, July 1987. Knuth's paper is also reprinted in his book Literate Programming, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).

[8] Programming Pearls: Associative Arrays, Comm. ACM 28(6), 570-576, June 1985. This is an excellent introduction to the power of associative arrays (tables indexed by strings, rather than integers), a common feature of most scripting languages.

[9] Available in the wonderful Project Gutenberg archives at http://www.gutenberg.net/.

[10] Indeed, the only word related to the root of "computer" that Shakespeare used is "computation," just once in each of two plays, Comedy of Errors and King Richard III. "Arithmetic" occurs six times in his plays, "calculate" twice, and "mathematics" thrice.

Tag Lists

Use of the tr command to obtain lists of words, or more generally, to transform one set of characters to another set, as in Example 5-5 in the preceding section, is a handy Unix tool idiom to remember. It leads naturally to a solution of a problem that we had in writing this book: how do we ensure consistent markup through about 50K lines of manuscript files? For example, a command might be marked up with tr when we talk about it in the running text, but elsewhere, we might give an example of something that you type, indicated by the markup
