Classic Shell Scripting - Arnold Robbins [31]

states that the NUL character (numeric value zero) need not be matchable. This character is used in the C language to indicate the end of a string, and the POSIX standard wanted to make it straightforward to implement its features using regular C strings. In addition, individual utilities may disallow matching of the newline character by the . (dot) metacharacter or by bracket expressions.

Backreferences

BREs provide a mechanism, known as backreferences, for saying "match whatever an earlier part of the regular expression matched." There are two steps to using backreferences. The first step is to enclose a subexpression in \( and \). There may be up to nine enclosed subexpressions within a single pattern, and they may be nested.

The next step is to use \digit, where digit is a number between 1 and 9, in a later part of the same pattern. Its meaning there is "match whatever was matched by the nth earlier parenthesized subexpression," where n is that digit. Here are some examples:

Pattern                                Matches

\(ab\)\(cd\)[def]*\2\1                 abcdcdab, abcdeeecdab, abcdddeeffcdab, ...
\(why\).*\1                            A line with two occurrences of why
\([[:alpha:]_][[:alnum:]_]*\) = \1;    Simple C/C++ assignment of a variable to itself, such as x = x;
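To see the first of these patterns at work, here is a minimal sketch using grep, which interprets its pattern as a BRE by default. The first three input lines are the sample matches listed above; the fourth line (abcdabcd) and the variable name matched are made up for illustration, to show that the backreferences must appear in the order \2 then \1.

```shell
# \2 must repeat what the second group matched (cd) and \1 what the first
# group matched (ab). The last input line has them in the wrong order.
matched=$(printf '%s\n' abcdcdab abcdeeecdab abcdddeeffcdab abcdabcd |
    grep '\(ab\)\(cd\)[def]*\2\1')
printf '%s\n' "$matched"    # the first three lines only
```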

Backreferences are particularly useful for finding duplicated words and matching quotes:

\(["']\).*\1 Match single- or double-quoted words, like 'foo' or "bar"

This way, you don't have to worry about whether a single quote or double quote was found first.
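The quote-matching pattern can be tried out with grep along the following lines. This is only a sketch: the sample lines and the variable names pattern and quoted are invented for the example, and the odd-looking quoting on the first line is just what it takes to get both quote characters into one shell word.

```shell
# Build the BRE \(["']\).*\1 from three quoted pieces, since it contains
# both a single and a double quote.
pattern='\(["'"'"']\).*\1'

# Hypothetical sample lines: two properly quoted words, one with
# mismatched quotes, and one with no quotes at all.
quoted=$(printf '%s\n' "say 'foo' here" 'say "bar" here' \
    "say 'baz\" here" 'no quotes at all' | grep "$pattern")
printf '%s\n' "$quoted"    # only the 'foo' and "bar" lines
```

The line with mismatched quotes is rejected because \1 insists on the same quote character that the parenthesized subexpression matched.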

Matching multiple characters with one expression

The simplest way to match multiple characters is to list them one after the other (concatenation). Thus, the regular expression ab matches the characters ab, .. (dot dot) matches any two characters, and [[:upper:]][[:lower:]] matches any uppercase character followed by any lowercase one. However, listing characters out this way is good only for short regular expressions.
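Concatenation is easy to check from the command line. The following sketch assumes grep as the matching tool; the sample strings, the variable names, the -x option (which requires the whole line to match), and the LC_ALL=C setting are all choices made for the example rather than anything the text prescribes.

```shell
# .. matches any two characters; with -x, only two-character lines qualify.
two_chars=$(printf '%s\n' a xy abc | grep -x '..')
printf '%s\n' "$two_chars"    # xy

# An uppercase character followed by a lowercase one. LC_ALL=C keeps the
# character classes limited to ASCII letters.
pairs=$(printf '%s\n' Ab aB AB ab |
    LC_ALL=C grep -x '[[:upper:]][[:lower:]]')
printf '%s\n' "$pairs"        # Ab
```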

Although the . (dot) metacharacter and bracket expressions provide a nice way to match one character at a time, the real power of regular expressions comes into play when using the additional modifier metacharacters. These metacharacters come after a single-character regular expression, and they modify the meaning of the regular expression.

The most commonly used modifier is the asterisk or star (*), whose meaning is "match zero or more of the preceding single character." Thus, ab*c means "match an a, zero or more b characters, and a c." This regular expression matches ac, abc, abbc, abbbc, and so on.

* * *

Tip


It is important to understand that "match zero or more of one thing" does not mean "match one of something else." Thus, given the regular expression ab*c, the text aQc does not match, even though there are zero b characters in aQc. Instead, with the text ac, the b* in ab*c is said to match the null string (the string of zero width) in between the a and the c. (The idea of a zero-width string takes some getting used to if you've never seen it before. Nevertheless, it does come in handy, as will be shown later in the chapter.)

* * *
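The behavior of ab*c, including the aQc case from the tip, can be confirmed with a short grep experiment. This is a sketch only; the variable names star_matches and aqc are made up for the example.

```shell
# Zero or more b characters between a and c: aQc is the only line rejected.
star_matches=$(printf '%s\n' ac abc abbc abbbc aQc | grep 'ab*c')
printf '%s\n' "$star_matches"

# grep -q prints nothing; its exit status says whether a match was found.
if printf '%s\n' aQc | grep -q 'ab*c'; then
    aqc='match'
else
    aqc='no match'
fi
printf 'aQc: %s\n' "$aqc"
```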

The * modifier is useful, but it is unlimited. You can't use * to say "match three characters but not four," and it's tedious to have to type out a complicated bracket expression multiple times when you want an exact number of matches. Interval expressions solve this problem. Like *, they come after a single-character regular expression, and they let you control how many repetitions of that character will be matched. Interval expressions consist of one or two numbers enclosed between \{ and \}. There are three variants, as follows:

\{n\}      Exactly n occurrences of the preceding regular expression
\{n,\}     At least n occurrences of the preceding regular expression
\{n,m\}    Between n and m occurrences of the preceding regular expression

Given interval expressions, it becomes easy to express things like "exactly five occurrences of a," or "between 10 and 42 instances of q." To wit: a\{5\} and q\{10,42\}.
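Here is a small sketch of all three interval forms, using grep on made-up runs of the letter a. One assumption worth calling out is the -x option, which makes grep match the whole line: without it, a\{5\} would also be found inside a longer run such as aaaaaa, because the pattern is not anchored. The variable names are likewise just for the example.

```shell
# Exactly five: only the five-character line matches in full.
exactly=$(printf '%s\n' aaaa aaaaa aaaaaa | grep -x 'a\{5\}')
printf '%s\n' "$exactly"

# At least two.
at_least=$(printf '%s\n' a aa aaa | grep -x 'a\{2,\}')
printf '%s\n' "$at_least"

# Between two and three.
between=$(printf '%s\n' a aa aaa aaaa | grep -x 'a\{2,3\}')
printf '%s\n' "$between"
```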

The values for n and m must be between 0 and RE_DUP_MAX, inclusive.
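If you are curious what the limit is on a particular system, the standard getconf utility can usually report it. This is a sketch that assumes getconf is installed; the value printed varies from system to system, although POSIX requires it to be at least 255. The variable name re_dup_max is made up for the example.

```shell
# Ask the system for its repetition limit in interval expressions.
re_dup_max=$(getconf RE_DUP_MAX)
printf 'RE_DUP_MAX is %s on this system\n' "$re_dup_max"
```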
