Classic Shell Scripting - Arnold Robbins [185]
Output sort order, which is a complex problem for some languages, is determined entirely by the sort command, which in turn is influenced by the locale set in the current environment. That way, a single tool localizes the sorting complexity so that other software, including our program, can remain oblivious to the difficulties. This is another example of the "Let someone else do the hard part" Software Tools principle discussed in Section 1.2.
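To see that locale dependence concretely, here is a minimal sketch using only standard POSIX tools. The locale name en_US.UTF-8 is an assumption; substitute any dictionary-order locale listed by locale -a on your system.

```shell
# The byte-order C locale ranks all uppercase letters before lowercase:
printf 'apple\nBanana\ncherry\n' | LC_ALL=C sort
# Banana, apple, cherry

# A dictionary-order locale interleaves the cases (locale name is an
# assumption; it may not be installed everywhere):
printf 'apple\nBanana\ncherry\n' | LC_ALL=en_US.UTF-8 sort
# apple, Banana, cherry (when that locale is available)
```

Our program never needs to know which ordering is in effect; sort alone consults the locale.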
Despite being written in an interpreted language, our program is reasonably fast. On a 2 GHz Pentium 4 workstation, with mawk, it took only about a second to check spelling in all of the files for this book: just 1.3 times longer than OpenBSD spell, and 2.0 times longer than GNU ispell.
An execution profile (see Section 12.4.14) showed that loading the dictionaries took about 5 percent of the total time, and about one word in 15 was not found in the dictionary. Adding the -strip option increased the runtime by about 25 percent, and reduced the output size by the same amount. Only about one word in 70 made it past the match() test inside strip_suffixes().
Suffix support accounts for about 90 of the 190 lines of code, so we could have written a usable multilingual spellchecker in about 100 lines of awk.
Notably absent from this attribute list, and our program, is the stripping of document markup, a feature that some spellcheckers provide. We have intentionally not done so because it is in complete violation of the Unix tradition of one (small) tool for one job. Markup removal is useful in many other contexts, and therefore deserves to reside in separate filters, such as dehtml, deroff, desgml, detex, and dexml. Of these, only deroff is commonly found on most Unix systems, but workable implementations of the others require only a few lines of awk.
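To make that "few lines of awk" claim concrete, here is one minimal sketch of a dehtml filter. This is hypothetical code, not part of the book's distribution: it handles only tags that open and close on a single line, and decodes just three entities.

```shell
# dehtml -- naive HTML stripper: drop one-line tags, decode a few entities
dehtml() {
  awk '{
    gsub(/<[^>]*>/, "")          # remove tags contained on one line
    gsub(/&lt;/, "<")
    gsub(/&gt;/, ">")
    gsub(/&amp;/, "\\&")         # "\\&" tells gsub to insert a literal &
    print
  }' "$@"
}

echo '<p>ls &amp; sort are <b>tools</b></p>' | dehtml
# ls & sort are tools
```

A production-quality filter would also have to cope with tags spanning lines, comments, CDATA sections, and the full entity table, but the pipeline structure would be the same.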
Also absent from our program, apart from three simple calls to substr(), is handling of individual characters. The necessity for such processing in C, and many other languages, is a major source of bugs.
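As an illustration of that style (a hypothetical fragment, not the program's actual calls), substr() lets awk test for and remove a whole suffix without ever looping over individual characters:

```shell
awk 'BEGIN {
  word = "walked"; suffix = "ed"
  n = length(word) - length(suffix)          # length of the stem
  if (n > 0 && substr(word, n + 1) == suffix)
    print substr(word, 1, n)                 # the stem: walk
}'
```

The comparison and the extraction each operate on the string as a unit, so there are no off-by-one loops to get wrong.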
All that remains to be done for this program is accumulation of a suitable set of dictionaries and suffix lists for other languages, provision of a shell script wrapper to make its user interface more like conventional Unix programs, and writing a manual page. Although we do not present them here, you can find the wrapper and manual page with this book's sample programs.
Efficiency of awk Programs
We close this section with some observations about awk program efficiency. Like other scripting languages, awk programs are compiled into a compact internal representation, and that representation is then interpreted at runtime by a small virtual machine. Built-in functions are written in the underlying implementation language, currently C in all publicly available versions, and run at native software speeds.
Program efficiency is not just a question of computer time: human time matters as well. If it takes an hour to write a program in awk that runs for a few seconds, compared to several hours to write and debug the same program in a compiled language to shave a few seconds off the runtime, then human time is the only thing that matters. For many software tools, awk wins by a large measure.
With conventional compiled languages like Fortran and C, most inline code is closely related to the underlying machine language, and experienced programmers soon develop a feel for what is cheap and what is expensive. The number of arithmetic and memory operations, and the depth of loop nesting, are important, easily counted, and relate directly to runtimes. With numerical programs, a common rule of thumb is that 90 percent of the runtime is spent in 10 percent of the code: that 10 percent is called the hot spots. Optimizations like pulling common expressions out of innermost loops, and ordering computations to match storage layout,