Beautiful Code [82]
A Regular Expression Matcher > The Practice of Programming
1. A Regular Expression Matcher
Brian Kernighan
Regular expressions are notations for describing patterns of text and, in effect, make up a special-purpose language for pattern matching. Although there are myriad variants, all share the idea that most characters in a pattern match literal occurrences of themselves, but some metacharacters have special meaning, such as * to indicate some kind of repetition or […] to mean any one character from the set within the brackets.
In practice, most searches in programs such as text editors are for literal words, so the regular expressions are often literal strings like print, which will match printf or sprint or printer paper anywhere. In so-called wildcards used to specify filenames in Unix and Windows, a * matches any number of characters, so the pattern *.c matches all filenames that end in .c. There are many, many variants of regular expressions, even in contexts where one would expect them to be the same. Jeffrey Friedl's Mastering Regular Expressions (O'Reilly) is an exhaustive study of the topic.
Stephen Kleene invented regular expressions in the mid-1950s as a notation for finite automata; in fact, they are equivalent to finite automata in what they represent. They first appeared in a program setting in Ken Thompson's version of the QED text editor in the mid-1960s. In 1967, Thompson applied for a patent on a mechanism for rapid text matching based on regular expressions. The patent was granted in 1971, one of the very first software patents [U.S. Patent 3,568,156, Text Matching Algorithm, March 2, 1971].
Regular expressions moved from QED to the Unix editor ed, and then to the quintessential Unix tool grep, which Thompson created by performing radical surgery on ed. These widely used programs helped regular expressions become familiar throughout the early Unix community.
Thompson's original matcher was very fast because it combined two independent ideas. One was to generate machine instructions on the fly during matching so that it ran at machine speed rather than by interpretation. The other was to carry forward all possible matches at each stage, so it did not have to backtrack to look for alternative potential matches. In later text editors that Thompson wrote, such as ed, the matching code used a simpler algorithm that backtracked when necessary. In theory, this is slower, but the patterns found in practice rarely involved backtracking, so the ed and grep algorithm and code were good enough for most purposes.
Subsequent regular expression matchers like egrep and fgrep added richer classes of regular expressions, and focused on fast execution no matter what the pattern. Ever-fancier regular expressions became popular and were included not only in C-based libraries, but also as part of the syntax of scripting languages such as Awk and Perl.
1.1. The Practice of Programming
In 1998, Rob Pike and I were writing The Practice of Programming (Addison-Wesley). The last chapter of the book, "Notation," collected a number of examples where good notation led to better programs and better programming. This included the use of simple data specifications (printf, for instance), and the generation of code from tables.
Because of our Unix backgrounds and nearly 30 years of experience with tools based on regular expression notation, we naturally wanted to include a discussion of regular expressions, and it seemed mandatory to include an implementation as well. Given our