Online Book Reader


Classic Shell Scripting - Arnold Robbins [30]

anything that isn't a lowercase vowel, including the uppercase vowels, all consonants, digits, punctuation, and so on.

Matching lots of characters by listing them all gets tedious—for example, [0123456789] to match a digit or [0123456789abcdefABCDEF] to match a hexadecimal digit. For this reason, bracket expressions may include ranges of characters. The previous two expressions can be shortened to [0-9] and [0-9a-fA-F], respectively.
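As a small illustration of the two shortened expressions, the following sketch filters a few made-up input lines with grep. The sample data and variable names are our own, and the commands are run in the "C" locale on the assumption that the ranges should be read as plain ASCII ranges.

```shell
# Hypothetical sample input: one token per line.
sample='7
c0ffee
xyz
BEEF'

# Lines made up entirely of decimal digits ([0-9]).
# -x requires the whole line to match the pattern.
digits=$(printf '%s\n' "$sample" | LC_ALL=C grep -x '[0-9][0-9]*')

# Lines made up entirely of hexadecimal digits ([0-9a-fA-F]).
hex=$(printf '%s\n' "$sample" | LC_ALL=C grep -x '[0-9a-fA-F][0-9a-fA-F]*')

printf 'digits: %s\n' "$digits"
printf 'hex: %s\n' "$hex"
```

Here only 7 qualifies as a run of decimal digits, while 7, c0ffee, and BEEF all qualify as runs of hexadecimal digits.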

* * *

Warning


Originally, the range notation matched characters based on their numeric values in the machine's character set. Because of character set differences (ASCII versus EBCDIC), this notation was never 100 percent portable, although in practice it was "good enough," since almost all Unix systems used ASCII.

With POSIX locales, things have gotten worse. Ranges now work based on each character's defined position in the locale's collating sequence, which is unrelated to machine character-set numeric values. Therefore, the range notation is portable only for programs running in the "POSIX" locale. The POSIX character class notation, mentioned earlier in the chapter, provides a way to portably express concepts such as "all the digits," or "all alphabetic characters." Thus, ranges in bracket expressions are discouraged in new programs.

* * *
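When an existing script does rely on a range, one common workaround is to force the "POSIX" (also called "C") locale for just that command. The following is a minimal sketch with made-up input, not a recommendation over character classes.

```shell
# Force the POSIX ("C") locale for this one grep invocation so that
# [a-z] is interpreted by character code, not by collation order.
lower=$(printf '%s\n' a B z Z | LC_ALL=C grep '^[a-z]$')

printf '%s\n' "$lower"
```

With the locale pinned this way, only the lines a and z are selected. Without the LC_ALL=C prefix, the result depends on the collating sequence of whatever locale the user happens to be running in.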

Earlier, in Section 3.2.1, we briefly mentioned POSIX collating symbols, equivalence classes, and character classes. These are the final components that may appear inside the square brackets of a bracket expression. The following paragraphs explain each of these constructs.

In several non-English languages, certain pairs of characters must be treated, for comparison purposes, as if they were a single character. Such pairs have a defined way of sorting when compared with single letters in the language. For example, in Czech and Spanish, the two characters ch are kept together and are treated as a single unit for comparison purposes.

Collating is the act of giving an ordering to some group or set of items. A POSIX collating element consists of the name of the element in the current locale, enclosed by [. and .]. For the ch just discussed, the locale might use [.ch.]. (We say "might" because each locale defines its own collating elements.) Assuming the existence of [.ch.], the regular expression [ab[.ch.]de] matches any of the characters a, b, d, or e, or the pair ch. It does not match a standalone c or h character.

An equivalence class is used to represent different characters that should be treated the same when matching. Equivalence classes enclose the name of the class between [= and =]. For example, in a French locale, there might be an [=e=] equivalence class. If it exists, then the regular expression [a[=e=]iouy] would match all the lowercase English vowels, as well as the letters è, é, and so on.

As the last special component, character classes represent classes of characters, such as digits, lower- and uppercase letters, punctuation, whitespace, and so on. They are written by enclosing the name of the class in [: and :]. The full list was shown earlier, in Table 3-3. The pre-POSIX range expressions for decimal and hexadecimal digits can (and should) be expressed portably, by using character classes: [[:digit:]] and [[:xdigit:]].
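To make the last point concrete, here is a short sketch that uses the two character classes in place of ranges. The sample lines and variable names are invented for the example.

```shell
# Hypothetical sample input: one token per line.
sample='42
dead10cc
hello
3.14'

# Whole lines of decimal digits, written with a character class.
dec=$(printf '%s\n' "$sample" | grep -x '[[:digit:]][[:digit:]]*')

# Whole lines of hexadecimal digits, written with a character class.
hexa=$(printf '%s\n' "$sample" | grep -x '[[:xdigit:]][[:xdigit:]]*')

printf 'decimal: %s\n' "$dec"
printf 'hexadecimal: %s\n' "$hexa"
```

Only 42 is selected as decimal, and 42 and dead10cc are selected as hexadecimal. Note the doubled brackets: the outer pair delimits the bracket expression, and the inner [: and :] name the class.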

* * *

Tip


Collating elements, equivalence classes, and character classes are only recognized inside the square brackets of a bracket expression. Writing a standalone regular expression such as [:alpha:] matches the characters a, l, p, h, and :. The correct way to write it is [[:alpha:]].

* * *
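The difference between the two forms can be seen in a small experiment. Some tools (recent versions of GNU grep, for instance) diagnose the bare form as an error rather than matching with it, so this sketch uses shell case patterns instead, on the assumption that the shell applies the same bracket expression rules to its patterns. The function name classify is a hypothetical helper of our own.

```shell
# classify CHAR: report whether a single character matches the
# bare form [:alpha:] and the correct form [[:alpha:]].
# (classify is a hypothetical helper used only for this sketch.)
classify() {
    case $1 in
        [:alpha:]) bare=yes ;;    # really the set of :, a, l, p, h
        *)         bare=no ;;
    esac
    case $1 in
        [[:alpha:]]) class=yes ;; # any alphabetic character
        *)           class=no ;;
    esac
    printf '%s: bare=%s class=%s\n' "$1" "$bare" "$class"
}

classify p
classify z
classify :
```

The letter p matches both forms, z matches only the character class, and the colon matches only the bare form.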

Within bracket expressions, all other metacharacters lose their special meanings. Thus, [*\.] matches a literal asterisk, a literal backslash, or a literal period. To get a ] into the set, place it first in the list: []*\.] adds the ] to the list. To get a minus character into the set, place it first in the list: [-*\.]. If you need both a right bracket and a minus, make the right bracket the first character, and make the minus the last one in the list: []*\.-].
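The placement rules can be checked with a short sketch. The bracket expression below has the right bracket immediately after the opening bracket (with no space between them) and the minus last. The input lines are made up for the example.

```shell
# Select the lines that contain a ], *, \, ., or - character.
# The ] comes first and the - comes last, so both are literal.
matches=$(printf '%s\n' 'a]b' 'a*b' 'a\b' 'a.b' 'a-b' 'ab' |
    grep '[]*\.-]')

printf '%s\n' "$matches"
```

The first five lines are selected and the plain ab is not, which shows that the asterisk, backslash, and period are all being treated as ordinary members of the set.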

Finally, POSIX explicitly
