Classic Shell Scripting - Arnold Robbins [34]

hard to follow. This is illustrated in Figure 3-1.

Figure 3-1. Reading a complicated regular expression

The upshot is that this single regular expression matches multiple successive occurrences of either read or write, possibly separated by whitespace characters.

The use of a * after the [[:space:]] is something of a judgment call. By using a * and not a +, the match gets words at the end of a line (or string). However, this opens up the possibility of matching words with no intervening whitespace at all. Crafting regular expressions often requires such judgment calls. How you build your regular expressions will depend on both your input data and what you need to do with that data.
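The trade-off between * and + can be tried out directly. The following is a minimal sketch: the expression itself appears only in Figure 3-1, so the two patterns below are assumed forms of it, and whole_line is a hypothetical helper written for this illustration. It uses grep -E -x so that the pattern must account for the entire line.

```shell
# Assumed patterns: the original expression is shown only in Figure 3-1.
star='((read|write)[[:space:]]*)+'   # whitespace optional after each word
plus='((read|write)[[:space:]]+)+'   # whitespace required after each word

# whole_line PATTERN TEXT: print "yes" if PATTERN matches all of TEXT.
# (Hypothetical helper for this illustration.)
whole_line() {
    if printf '%s\n' "$2" | grep -E -x -q -e "$1"; then
        echo yes
    else
        echo no
    fi
}

whole_line "$star" 'read write'   # yes: the final word needs no trailing space
whole_line "$plus" 'read write'   # no:  "write" is not followed by whitespace
whole_line "$star" 'readwrite'    # yes: the price of using *
whole_line "$plus" 'readwrite'    # no
```

With *, the last word on the line is picked up, but so is readwrite; with +, neither happens. Which behavior is right depends on the data.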

Finally, grouping is helpful when using alternation together with the ^ and $ anchor characters. Because | has the lowest precedence of all the operators, the regular expression ^abcd|efgh$ means "match abcd at the beginning of the string, or match efgh at the end of the string." This is different from ^(abcd|efgh)$, which means "match a string containing exactly abcd or exactly efgh."
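A quick way to see the difference is to count matching lines in a small sample. This is only a sketch: the five input lines and the count helper are made up for the illustration.

```shell
loose='^abcd|efgh$'      # abcd at the start, OR efgh at the end
strict='^(abcd|efgh)$'   # the whole line is exactly abcd or exactly efgh

# count PATTERN: number of sample lines that PATTERN matches.
# (Hypothetical helper; the sample lines are invented for this example.)
count() {
    printf '%s\n' 'abcd' 'efgh' 'abcdXYZ' 'XYZefgh' 'XYZ' |
        grep -E -c -e "$1"
}

count "$loose"    # 4: every line except XYZ
count "$strict"   # 2: only abcd and efgh
```

Without the parentheses, each anchor applies to just one alternative, so abcdXYZ and XYZefgh also match.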

Anchoring text matches

The ^ and $ have the same meaning as in BREs: anchor the regular expression to the beginning or end of the text string (or line). There is one significant difference, though. In EREs, ^ and $ are always metacharacters. Thus, regular expressions such as ab^cd and ef$gh are valid, but cannot match anything, since the text preceding the ^ and the text following the $ prevent them from matching "the beginning of the string" and "the end of the string," respectively. As with the other metacharacters, they do lose their special meaning inside bracket expressions.
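The contrast with BREs can be checked with grep, which uses BREs by default and EREs with -E. The sketch below assumes a grep that follows the behavior described above for anchors in mid-pattern (GNU grep does); the ere and bre helpers are hypothetical names used only here.

```shell
# ere PATTERN TEXT / bre PATTERN TEXT: report whether PATTERN matches TEXT
# as an ERE or as a BRE. (Hypothetical helpers for this illustration.)
ere() {
    if printf '%s\n' "$2" | grep -E -q -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}
bre() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

ere 'ab^cd'   'ab^cd'   # no match: ^ is an anchor even in mid-pattern
bre 'ab^cd'   'ab^cd'   # match:    in a BRE, a mid-pattern ^ is literal
ere 'ef[$]gh' 'ef$gh'   # match:    $ is literal inside a bracket expression
```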

ERE operator precedence

Operator precedence applies to EREs as it does to BREs. Table 3-6 provides the precedence for the ERE operators, from highest to lowest.

Table 3-6. ERE operator precedence from highest to lowest

Operator             Meaning
[. .] [= =] [: :]    Bracket symbols for character collation
\ metacharacter      Escaped metacharacters
[ ]                  Bracket expressions
( )                  Grouping
* + ? { }            Repetition of the preceding regular expression
no symbol            Concatenation
^ $                  Anchors
|                    Alternation
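Two rows of the table are easy to confirm by experiment: repetition binds more tightly than concatenation, and concatenation binds more tightly than alternation. The following sketch uses grep -E -x, so each pattern must match the whole line; the test strings and the whole_line helper are invented for the illustration.

```shell
# whole_line PATTERN TEXT: print "yes" if PATTERN matches all of TEXT.
# (Hypothetical helper for this illustration.)
whole_line() {
    if printf '%s\n' "$2" | grep -E -x -q -e "$1"; then
        echo yes
    else
        echo no
    fi
}

# Repetition before concatenation: + applies to b alone, not to ab.
whole_line 'ab+'   'abbb'   # yes
whole_line 'ab+'   'abab'   # no
whole_line '(ab)+' 'abab'   # yes: grouping overrides the default

# Concatenation before alternation: ab|cd means (ab)|(cd), not a(b|c)d.
whole_line 'ab|cd'   'cd'    # yes
whole_line 'ab|cd'   'acd'   # no
whole_line 'a(b|c)d' 'acd'   # yes
```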

Regular Expression Extensions

Many programs provide extensions to regular expression syntax. Typically, such extensions take the form of a backslash followed by an additional character, to create new operators. This is similar to the use of a backslash in \(...\) and \{...\} in POSIX BREs.

The most common extensions are the operators \< and \>, which match the beginning and end of a "word," respectively. Words are made up of letters, digits, and underscores. We call such characters word-constituent.

The beginning of a word occurs at either the beginning of a line or the first word-constituent character following a nonword-constituent character. Similarly, the end of a word occurs at the end of a line, or after the last word-constituent character before a nonword-constituent one.

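In practice, word matching is intuitive and straightforward. The regular expression \<chop matches "use chopsticks" but does not match "eat a lambchop". Similarly, the regular expression chop\> matches the second string, but does not match the first. Note that \<chop\> does not match either string.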
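Word matching is easy to try with grep. The sketch below assumes a grep that supports the \< and \> extensions (the GNU and BSD versions do); the sample strings are chosen for illustration, and hit is a hypothetical helper.

```shell
# hit PATTERN TEXT: report whether PATTERN matches somewhere in TEXT.
# Assumes grep supports the \< and \> word-boundary extensions.
hit() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then
        echo match
    else
        echo 'no match'
    fi
}

hit '\<chop'   'use chopsticks'   # match:    chop begins a word
hit '\<chop'   'eat a lambchop'   # no match: chop is in mid-word
hit 'chop\>'   'use chopsticks'   # no match: the word continues after chop
hit 'chop\>'   'eat a lambchop'   # match:    chop ends a word
hit '\<chop\>' 'use chopsticks'   # no match
hit '\<chop\>' 'eat a lambchop'   # no match
hit '\<chop\>' 'chop suey'        # match:    chop is a word by itself
```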

Although standardized by POSIX only for the ex editor, word matching is universally supported by the ed, ex, and vi editors that come standard with every commercial Unix system. Word matching is also supported on the "clone" versions of these programs that come with GNU/Linux and BSD systems, as well as in emacs, vim, and vile. Most GNU utilities support it as well. Additional Unix programs that support word matching often include grep and sed, but you should double-check the manpages for the commands on your system.

GNU versions of the standard utilities that deal with regular expressions typically support a number of additional operators. These operators are outlined in Table 3-7.

Table 3-7. Additional GNU regular expression operators

Operator
