Professional C++ - Marc Gregoire [255]
Wildcards
The wildcard character . matches any single character except a newline character. For example, the regular expression a.c will match abc and a5c, but will not match ab5c, ac, and so on.
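The wildcard examples above can be checked with a short sketch. It assumes the default ECMAScript grammar of std::regex and uses std::regex_match(), which requires the whole string to match. The names matchesWhole() and checkWildcard() are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of text matches pattern.
bool matchesWhole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// Checks the wildcard examples from the text.
void checkWildcard()
{
    assert(matchesWhole("a.c", "abc"));
    assert(matchesWhole("a.c", "a5c"));
    assert(!matchesWhole("a.c", "ab5c"));  // . stands for exactly one character
    assert(!matchesWhole("a.c", "ac"));    // . must consume a character
    assert(!matchesWhole("a.c", "a\nc"));  // . does not match a newline
}
```

Calling checkWildcard() from main() runs all five checks. Note that assert() is compiled out when NDEBUG is defined.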
Repetition
Parts of a regular expression can be repeated by using one of four repeats:
* matches the preceding part zero or more times. For example: a*b will match b, ab, aab, aaaab, and so on.
+ matches the preceding part one or more times. For example: a+b will match ab, aab, aaaab, and so on, but not b.
? matches the preceding part zero or one time. For example: a?b will match b and ab, but nothing else.
{...} represents a bounded repeat. a{n} will match a repeated exactly n times; a{n,} will match a repeated n times or more, and a{n,m} will match a repeated between n and m times inclusive. For example, ^a{3,4}$ will match aaa and aaaa but not a, aa, aaaaa, and so on.
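As a rough illustration of the four repeats, the following sketch checks the examples from the list above. It assumes the default ECMAScript grammar. The helper fullMatch() and the function checkRepeats() are hypothetical names. The bounded repeat is tested with std::regex_search(), so the ^ and $ anchors are what force the whole string to match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if pattern matches the entire text.
bool fullMatch(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// Checks the repetition examples from the text.
void checkRepeats()
{
    // * : zero or more times
    assert(fullMatch("a*b", "b"));
    assert(fullMatch("a*b", "aaaab"));

    // + : one or more times
    assert(fullMatch("a+b", "ab"));
    assert(!fullMatch("a+b", "b"));

    // ? : zero or one time
    assert(fullMatch("a?b", "b"));
    assert(fullMatch("a?b", "ab"));
    assert(!fullMatch("a?b", "aab"));

    // {n,m} : bounded repeat, anchored with ^ and $
    const std::regex bounded("^a{3,4}$");
    assert(std::regex_search("aaa", bounded));
    assert(std::regex_search("aaaa", bounded));
    assert(!std::regex_search("aa", bounded));
    assert(!std::regex_search("aaaaa", bounded));
}
```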
The repeats described in the previous list are called greedy because they will find the longest match while still matching the remainder of the regular expression. To make them non-greedy, a ? can be added after the repeat, as in *?, +?, ?? and {...}?. A non-greedy repeat matches as few characters as possible while still matching the remainder of the regular expression. The following table gives an example. The first column is the string to which the regular expression is applied. The second column shows the match found by the regular expression a+ and the third column shows the match found by the non-greedy a+?.
SOURCE STRING    a+              a+?
""               no match        no match
a                matches a       matches a
aa               matches aa      matches a
aaa              matches aaa     matches a
aaaa             matches aaaa    matches a
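The difference between a+ and a+? can be reproduced with a small sketch, assuming the table reports the first match found by a search. The helper firstMatch() is a hypothetical name. It returns an empty string when nothing matches, which is unambiguous here because neither pattern can match an empty sequence.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of pattern in text,
// or an empty string if there is no match.
std::string firstMatch(const std::string& pattern, const std::string& text)
{
    const std::regex r(pattern);
    std::smatch m;
    if (std::regex_search(text, m, r)) {
        return m.str();
    }
    return "";
}

// Checks a few rows of the greedy versus non-greedy comparison.
void checkGreedyVersusNonGreedy()
{
    assert(firstMatch("a+", "aaaa") == "aaaa");  // greedy: longest match
    assert(firstMatch("a+?", "aaaa") == "a");    // non-greedy: shortest match
    assert(firstMatch("a+", "").empty());        // no match on an empty string
    assert(firstMatch("a+?", "").empty());
}
```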
Alternation
The | character can be used to specify the “or” relationship. For example, a|b will match a or b.
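A minimal sketch of alternation, again assuming the default ECMAScript grammar. The function names isAOrB() and checkAlternation() are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if text is exactly "a" or exactly "b".
bool isAOrB(const std::string& text)
{
    static const std::regex r("a|b");
    return std::regex_match(text, r);
}

// Checks the alternation example from the text.
void checkAlternation()
{
    assert(isAOrB("a"));
    assert(isAOrB("b"));
    assert(!isAOrB("c"));   // neither alternative matches
    assert(!isAOrB("ab"));  // regex_match requires the whole string to match
}
```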
Grouping
Parentheses () are used to mark sub-expressions, also called capture groups. Capture groups can be used for several purposes:
Capture groups can be used to identify individual sub-sequences of the original string; each marked sub-expression (capture group) will be returned in the result. For example, take the following regular expression: (.*)(ab|cd)(.*). It has three marked sub-expressions. Running a regex_search() with this regular expression on 123cd4 will result in a match with four entries. The first entry is the entire match 123cd4 followed by three entries for the three marked sub-expressions. These three entries are 123, cd and 4. The details on how to use the regex_search() algorithm are shown in a later section.
Capture groups can be used during matching for a purpose called back references (explained later).
Capture groups can be used to identify components during replace operations (explained later).
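As a preview of the first purpose, the following sketch runs std::regex_search() on the 123cd4 example and collects the entries of the resulting std::smatch. The helper searchGroups() is a hypothetical name. Entry 0 is the entire match, and the remaining entries are the capture groups in order.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the entire match followed by each capture
// group, or an empty vector if pattern is not found in text.
std::vector<std::string> searchGroups(const std::string& pattern,
                                      const std::string& text)
{
    std::vector<std::string> entries;
    const std::regex r(pattern);
    std::smatch m;
    if (std::regex_search(text, m, r)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            entries.push_back(m[i].str());
        }
    }
    return entries;
}

// Checks the capture group example from the text.
void checkCaptureGroups()
{
    const std::vector<std::string> e =
        searchGroups("(.*)(ab|cd)(.*)", "123cd4");
    assert(e.size() == 4);
    assert(e[0] == "123cd4");  // the entire match
    assert(e[1] == "123");     // first capture group
    assert(e[2] == "cd");      // second capture group
    assert(e[3] == "4");       // third capture group
}
```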
Precedence
Just as with mathematical formulas, it’s important to know the precedence of regular expression elements. Precedence, from highest to lowest, is as follows:
Elements, such as a, are the basic building blocks of a regular expression.
Quantifiers, such as +, *, ? and {...}, bind tightly to the element on their left, for example b+.
Concatenation, such as ab+c, binds after quantifiers.
Alternation, such as |, binds last.
For example, take the regular expression ab+c|d. This will match abc, abbc, abbbc, and so on, and also d. Parentheses can be used to change these precedence rules. For example, ab+(c|d) will match abc, abbc, abbbc, ..., abd, abbd, abbbd, and so on. However, by using parentheses you also mark the enclosed part as a sub-expression or capture group. It is possible to change the precedence rules without creating a new capture group by using (?:...). For example, ab+(?:c|d) matches the same strings as the preceding ab+(c|d) but does not create an additional capture group.
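The effect of parentheses on both precedence and capture groups can be sketched as follows. The function name checkPrecedence() is hypothetical. The sketch relies on std::regex::mark_count(), which reports the number of marked sub-expressions in a regular expression.

```cpp
#include <cassert>
#include <regex>

// Checks the precedence and grouping examples from the text.
void checkPrecedence()
{
    // Without parentheses: either ab+c, or d.
    const std::regex plain("ab+c|d");
    assert(std::regex_match("abbc", plain));
    assert(std::regex_match("d", plain));
    assert(!std::regex_match("abd", plain));
    assert(plain.mark_count() == 0u);

    // Parentheses change precedence, but also create a capture group.
    const std::regex capturing("ab+(c|d)");
    assert(std::regex_match("abd", capturing));
    assert(!std::regex_match("d", capturing));
    assert(capturing.mark_count() == 1u);

    // (?:...) changes precedence without creating a capture group.
    const std::regex nonCapturing("ab+(?:c|d)");
    assert(std::regex_match("abd", nonCapturing));
    assert(nonCapturing.mark_count() == 0u);
}
```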
Character Set Matches
Instead of having to write (a|b|c|...|z) which is clumsy and introduces a capture group, a special syntax for specifying sets of characters or ranges of characters is available. In addition, a “not” form of the match is also available. A character set is specified between square brackets, and allows you to write [c1c2c3] which will match any of the characters c1, c2 or c3. For example, [abc] will match any character