Beautiful Code [85]
A Regular Expression Matcher > Alternatives
1.4. Alternatives
This is a very elegant and well-written piece of code, but it's not perfect. What might we do differently? I might rearrange matchhere to deal with $ before *. Although it makes no difference here, it feels a bit more natural, and a good rule is to do easy cases before difficult ones.
In general, however, the order of tests is critical. For instance, in this test from matchstar:
} while (*text != '\0' && (*text++ == c || c == '.'));
we must advance over one more character of the text string no matter what, so the increment in text++ must always be performed.
This code is careful about termination conditions. Generally, the success of a match is determined by whether the regular expression runs out at the same time as the text does. If they do run out together, that indicates a match; if one runs out before the other, there is no match. This is perhaps most obvious in a line like:
if (regexp[0] == '$' && regexp[1] == '\0')
return *text == '\0';
but subtle termination conditions show up in other cases as well.
The version of matchstar that implements leftmost longest matching begins by identifying a maximal sequence of occurrences of the input character c. Then it uses matchhere to try to extend the match to the rest of the pattern and the rest of the text. Each failure reduces the number of cs by one and tries again, including the case of zero occurrences:
/* matchstar: leftmost longest search for c*regexp */
int matchstar(int c, char *regexp, char *text)
{
char *t;
for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
;
do { /* * matches zero or more */
if (matchhere(regexp, t))
return 1;
} while (t-- > text);
return 0;
}
Consider the regular expression (.*), which matches arbitrary text within parentheses. Given the target text:
for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
a longest match from the beginning will identify the entire parenthesized expression, while a shortest match will stop at the first right parenthesis. (Of course a longest match beginning from the second left parenthesis will extend to the end of the text.)
A Regular Expression Matcher > Building on It
1.5. Building on It
The purpose of The Practice of Programming was to teach good programming. At the time the book was written, Rob and I were still at Bell Labs, so we did not have firsthand experience of how the book would be best used in a classroom. It has been gratifying to discover that some of the material does work well in classes. I have used this code since 2000 as a vehicle for teaching important points about programming.
First, it shows how recursion is useful and leads to clean code in a novel setting; it's not yet another version of Quicksort (or factorial!), nor is it some kind of tree walk.
It's also a good example for performance experiments. Its performance is not very different from the system versions of grep, which shows that the recursive technique is not too costly and that it's not worth trying to tune the code.
On the other hand, it is also a fine illustration of the importance of a good algorithm. If a pattern includes several .* sequences, the straightforward implementation requires a lot of backtracking, and, in some cases, will run very slowly indeed.
The standard Unix grep has the same backtracking properties. For example, the command:
grep 'a.*a.*a.*a.a'
takes about 20 seconds to process a 4 MB text file on a typical machine.
An implementation based on converting a nondeterministic finite automaton to a deterministic automaton, as in egrep, will have much better performance on hard cases; it can process the same pattern and the same input in less than one-tenth of a second, and running time in general is independent of the pattern.
Extensions to the regular expression class can form the basis of a variety of assignments. For example:
Add