Classic Shell Scripting - Arnold Robbins [35]
Operator    Meaning
\w          Matches any word-constituent character. Equivalent to [[:alnum:]_].
\W          Matches any nonword-constituent character. Equivalent to [^[:alnum:]_].
\< \>       Matches the beginning and end of a word, respectively, as described previously.
\b          Matches the null string found at either the beginning or the end of a word. This is a generalization of the \< and \> operators. Note: Because awk uses \b to represent the backspace character, GNU awk (gawk) uses \y.
\B          Matches the null string anywhere that is not at the beginning or end of a word, for example between two word-constituent characters.
\` \'       Matches the beginning and end of an emacs buffer, respectively. GNU programs (besides emacs) generally treat these as being equivalent to ^ and $.
Finally, although POSIX explicitly states that the NUL character need not be matchable, GNU programs have no such restriction. If a NUL character occurs in input data, it can be matched by the . metacharacter or a bracket expression.
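The word operators in the table above can be tried out with a short sketch. It assumes GNU grep, since \<, \>, \b, and \B are GNU extensions rather than POSIX features; the helper name count_matches and the four sample words are made up for illustration.

```shell
# Assumption: grep is GNU grep; these operators are not required by POSIX.
# count_matches is a hypothetical helper: it feeds four sample words,
# one per line, to grep and prints how many lines match the pattern.
count_matches() {
    printf '%s\n' cat concatenate tomcat cat_food | grep -c -- "$1"
}

count_matches '\<cat\>'    # "cat" as a whole word: only "cat"
count_matches '\bcat'      # "cat" at the start of a word: "cat", "cat_food"
count_matches 'cat\b'      # "cat" at the end of a word: "cat", "tomcat"
count_matches '\Bcat\B'    # "cat" strictly inside a word: "concatenate"
```

Note that cat_food does not match \<cat\>, because the underscore counts as a word-constituent character.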
Which Programs Use Which Regular Expressions?
It is a historical artifact that there are two different regular expression flavors. While the existence of egrep-style extended regular expressions was known during the early Unix development period, Ken Thompson didn't feel that it was necessary to implement such full-blown regular expressions for the ed editor. (Given the PDP-11's small address space, the complexity of extended regular expressions, and the fact that for most editing jobs basic regular expressions are enough, this decision made sense.)
The code for ed then served as the base for grep. (grep is an abbreviation for the ed command g/re/p: globally match re and print it.) ed's code also served as an initial base for sed.
Somewhere in the pre-V7 timeframe, egrep was created by Al Aho, a Bell Labs researcher who did groundbreaking work in regular expression matching and language parsing. The core matching code from egrep was later reused for regular expressions in awk.
The \< and \> operators originated in a version of ed that was modified at the University of Waterloo by Rob Pike, Tom Duff, Hugh Redelmeier, and David Tilbrook. (Rob Pike was the one who invented those operators.) Bill Joy at UCB adopted them for the ex and vi editors, whence they became widely used. Interval expressions originated in Programmer's Workbench Unix [6], and they filtered out into the commercial Unix world via System III and, later, System V. Table 3-8 lists the various Unix programs and which flavor of regular expression they use.
Table 3-8. Unix programs and their regular expression type
Type     grep  sed  ed  ex/vi  more  egrep  awk  lex
BRE      ·     ·    ·   ·      ·
ERE                                  ·      ·    ·
\< \>    ·     ·    ·   ·      ·
lex is a specialized tool, generally used for the construction of lexical analyzers for language processors. Even though it's included in POSIX, we don't discuss it further, since it's not relevant for shell scripting. The less and pg pagers, while not part of POSIX, also support regular expressions. Some systems have a page program, which is essentially the same as more, but clears the screen between each screenful of output.
As we mentioned at the beginning of the chapter, to (attempt to) mitigate the multiple grep problem, POSIX mandates a single grep program. By default, POSIX grep uses BREs. With the -E option, it uses EREs, and with the -F option, it uses the fgrep fixed-string matching algorithm. Thus, truly POSIX-conforming programs use grep -E ... instead of egrep .... However, since all Unix systems do have egrep, and are likely to for many years to come, we continue to use it in our scripts.
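As a rough illustration of the three modes, the following sketch runs the same pattern through grep, grep -E, and grep -F. The helper name sample_lines and its three input lines are invented for the example.

```shell
# sample_lines is a hypothetical helper that prints three test lines:
# a literal "a|b", a lone "a", and a lone "c".
sample_lines() {
    printf '%s\n' 'a|b' 'a' 'c'
}

# Default (BRE): "|" is an ordinary character, so only the literal
# line "a|b" matches.
bre_count=$(sample_lines | grep -c 'a|b')

# -E (ERE): "|" means alternation, so any line containing "a" or "b"
# matches: "a|b" and "a".
ere_count=$(sample_lines | grep -E -c 'a|b')

# -F (fixed string): no character is special; the pattern is the
# literal text "a|b".
fixed_count=$(sample_lines | grep -F -c 'a|b')

echo "BRE: $bre_count  ERE: $ere_count  fixed: $fixed_count"
```

With this input the counts come out as 1, 2, and 1, which shows how the same pattern text changes meaning with the option chosen.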
A final note is that traditionally, awk did not support interval expressions within its flavor of extended regular expressions. Even as of 2005, support for interval expressions is not universal among different vendor versions of awk. For maximal portability, if you need to match braces from an awk program, you should escape them with a backslash, or enclose them inside a bracket expression.
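A minimal sketch of the bracket-expression approach follows. The helper name match_literal_braces and the sample input are assumptions made for illustration. Written this way, the pattern means the same thing whether or not a particular awk supports interval expressions.

```shell
# match_literal_braces is a hypothetical helper. It prints input lines
# that contain the literal text "a{2}". Wrapping each brace in a bracket
# expression keeps awk from reading {2} as an interval expression.
match_literal_braces() {
    awk '/a[{]2[}]/ { print }'
}

# Only the first line contains literal braces. An unprotected /a{2}/
# could instead match "aa" in an awk that supports intervals.
printf '%s\n' 'a{2}' 'aa' | match_literal_braces
```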
Making Substitutions in Text Files