Online Book Reader

Home Category

Beautiful Code [84]

By Root 5364 0
basic operation of match is straightforward. If the first character of the regular expression is ^ (an anchored match), any possible match must occur at the beginning of the string. That is, if the regular expression is ^xyz, it matches xyz only if xyz occurs at the beginning of the text, not somewhere in the middle. This is tested by matching the rest of the regular expression against the text starting at the beginning and nowhere else. Otherwise, the regular expression might match anywhere within the string. This is tested by matching the pattern against each character position of the text in turn. If there are multiple matches, only the first (leftmost) one will be identified. That is, if the regular expression is xyz, it will match the first occurrence of xyz regardless of where it occurs.

Notice that advancing over the input string is done with a do-while loop, a comparatively unusual construct in C programs. The occurrence of a do-while instead of a while should always raise a question: why isn't the loop termination condition being tested at the beginning of the loop, before it's too late, rather than at the end after something has been done? But the test is correct here: since the * operator permits zero-length matches, we first have to check whether a null match is possible.

The bulk of the work is done in the function matchhere(regexp, text), which tests whether the regular expression matches the text that begins right here. The function matchhere operates by attempting to match the first character of the regular expression with the first character of the text. If the match fails, there can be no match at this text position and matchhere returns 0. If the match succeeds, however, it's possible to advance to the next character of the regular expression and the next character of the text. This is done by calling matchhere recursively.

The situation is a bit more complicated because of some special cases, and of course the need to stop the recursion. The easiest case is that if the regular expression is at its end (regexp[0]=='\0'), all previous tests have succeeded, and thus the regular expression matches the text.

If the regular expression is a character followed by a *, matchstar is called to see whether the closure matches. The function matchstar(c, regexp, text) tries to match repetitions of the text character c, beginning with zero repetitions and counting up, until it either finds a match of the rest of the text, or it fails and thus concludes that there is no match. This algorithm identifies a "shortest match," which is fine for simple pattern matching as in grep, where all that matters is finding a match as quickly as possible. A "longest match" is more intuitive and almost certain to be better for a text editor where the matched text will be replaced. Most modern regular expression libraries provide both alternatives, and The Practice of Programming presents a simple variant of matchstar for this case, shown below.

If the regular expression consists of a $ at the end of the expression, the text matches only if it too is at its end:

if (regexp[0] == '$' && regexp[1] == '\0')

return *text == '\0';

Otherwise, if we are not at the end of the text string (that is, *text!='\0'), and if the first character of the text string matches the first character of the regular expression, so far so good; we go on to test whether the next character of the regular expression matches the next character of the text by making a recursive call to matchhere. This recursive call is the heart of the algorithm and the reason why the code is so compact and clean.

If all of these attempts to match fail, there can be no match at this point between the regular expression and the text, so matchhere returns 0.

This code uses C pointers intensively. At each stage of the recursion, if something matches, the recursive call that follows uses pointer arithmetic (e.g., regexp+1 and text+1) so that the subsequent function is called with the next character of the regular expression and of the text. The depth of recursion is no

Return Main Page Previous Page Next Page

®Online Book Reader