Classic Shell Scripting - Arnold Robbins [140]
String Searching
index( string, find ) searches the text in string for the string find. It returns the starting position of find in string, or 0 if find is not found in string. For example, index("abcdef", "de") returns 4.
Subject to the caveats noted in Section 9.9.2, you can make string searches ignore lettercase like this: index(tolower( string ), tolower( find )). Because case insensitivity is sometimes needed in an entire program, gawk provides a useful extension: set the built-in variable IGNORECASE to nonzero to ignore lettercase in string matches, searches, and comparisons.
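As a quick check of the behavior described above, the following minimal sketch runs index( ) from the shell. The variable names pos, miss, and ipos and the sample strings are illustrative assumptions, not taken from the text. It uses only the portable tolower( ) approach, since IGNORECASE is a gawk extension.

```shell
# index() counts from 1 and returns 0 when the substring is absent.
pos=$(awk 'BEGIN { print index("abcdef", "de") }')
miss=$(awk 'BEGIN { print index("abcdef", "xyz") }')

# Portable case-insensitive search: fold both operands to lowercase first.
ipos=$(awk 'BEGIN { print index(tolower("Hello World"), tolower("WORLD")) }')

echo "$pos $miss $ipos"    # prints: 4 0 7
```

Folding both arguments costs two extra function calls per search, but it works in any POSIX awk.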
index( ) finds the first occurrence of a substring, but sometimes you want the last occurrence instead. There is no standard function to do that, but we can easily write one, shown in Example 9-5.
Example 9-5. Reverse string search
function rindex(string, find,    k, ns, nf)
{
    # Return index of last occurrence of find in string,
    # or 0 if not found

    ns = length(string)
    nf = length(find)
    for (k = ns + 1 - nf; k >= 1; k--)
        if (substr(string, k, nf) == find)
            return k
    return 0
}
The loop starts at a k value that lines up the ends of the strings string and find, extracts a substring from string that is the same length as find, and compares that substring with find. If they match, then k is the desired index of the last occurrence, and the function returns that value. Otherwise, we back up one character, terminating the loop when k moves past the beginning of string. When that happens, find is known not to be found in string, and we return an index of 0.
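To see the function at work, here is a small sketch that compares index( ) and rindex( ) on a pathname. The function body repeats Example 9-5 so that the script is self-contained. The sample string and the shell variable name result are assumptions chosen for illustration.

```shell
result=$(awk '
function rindex(string, find,    k, ns, nf)
{
    # Return index of last occurrence of find in string, or 0 if not found
    ns = length(string)
    nf = length(find)
    for (k = ns + 1 - nf; k >= 1; k--)
        if (substr(string, k, nf) == find)
            return k
    return 0
}
BEGIN {
    s = "/usr/local/bin/awk"            # illustrative sample pathname
    # first slash, last slash, and a character that never occurs
    print index(s, "/"), rindex(s, "/"), rindex(s, "#")
}')

echo "$result"    # prints: 1 15 0
```

The last slash is at position 15, so substr(s, 16) would yield the filename awk.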
String Matching
match( string, regexp ) matches string against the regular expression regexp, and returns the index in string of the match, or 0 if there is no match. This provides more information than the expression ( string ~ regexp ), which evaluates to either 1 or 0. In addition, match( ) has a useful side effect: it sets the global variable RSTART to the index in string of the start of the match, and RLENGTH to the length of the match. The matching substring is then available as substr( string, RSTART, RLENGTH ). When there is no match, RSTART is set to 0 and RLENGTH to -1.
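The side effects of match( ) are easiest to understand from a short run. This sketch uses a made-up input string and the hypothetical shell variable names hit and nohit. It prints RSTART, RLENGTH, and the matched text for a successful match, then the values left behind by a failed one.

```shell
hit=$(awk 'BEGIN {
    s = "order 12345 shipped"           # illustrative sample text
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}')

nohit=$(awk 'BEGIN {
    r = match("abc", /[0-9]+/)          # no digits, so the match fails
    print r, RSTART, RLENGTH
}')

echo "$hit"      # prints: 7 5 12345
echo "$nohit"    # prints: 0 0 -1
```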
String Substitution
awk provides two functions for string substitution: sub( regexp, replacement, target ) and gsub( regexp, replacement, target ). sub( ) matches target against the regular expression regexp, and replaces the leftmost longest match by the string replacement. gsub( ) works similarly, but replaces all matches (the prefix g stands for global). Both functions return the number of substitutions. If the third argument is omitted, it defaults to the current record, $0. These functions are unusual in that they modify their scalar arguments: consequently, they cannot be written in the awk language itself. For example, a check-writing application might use gsub(/[^$-0-9.,]/, "*", amount) to replace with asterisks all characters other than those that can legally appear in the amount.
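The difference between the two functions, and their return values, can be sketched as follows. The sample strings and the shell variable names counts and record are assumptions made for this illustration only.

```shell
counts=$(awk 'BEGIN {
    s = "a-b-c-d"
    t = s
    n1 = sub(/-/, "+", s)      # replaces only the leftmost match
    n2 = gsub(/-/, "+", t)     # replaces every match
    print n1, s, n2, t
}')

# With the third argument omitted, the target is the current record, $0.
record=$(printf 'hello world\n' | awk '{ n = gsub(/o/, "0"); print n, $0 }')

echo "$counts"    # prints: 1 a+b-c-d 3 a+b+c+d
echo "$record"    # prints: 2 hell0 w0rld
```

Because the target is modified in place, the sketch copies s to t first so that both functions start from the same text.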
In a call to sub( regexp, replacement, target ) or gsub( regexp, replacement, target ), each instance of the character & in replacement is replaced in target by the text matched by regexp. Use \& to disable this feature, and remember to double the backslash if you use it in a quoted string. For example, gsub(/[aeiouyAEIOUY]/, "&&") doubles all vowels in the current record, $0, whereas gsub(/[aeiouyAEIOUY]/, "\\&\\&") replaces each vowel by a pair of ampersands.
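The two treatments of & in the replacement can be compared directly. In this sketch the input word and the shell variable names doubled and literal are illustrative assumptions. The awk programs are in single quotes, so the shell passes the backslashes through unchanged, and the quoted awk string doubles each backslash as the text advises.

```shell
# A bare & in the replacement stands for the matched text.
doubled=$(echo "awk" | awk '{ gsub(/[aeiouyAEIOUY]/, "&&"); print }')

# A doubled backslash in the awk string yields \&, a literal ampersand.
literal=$(echo "awk" | awk '{ gsub(/[aeiouyAEIOUY]/, "\\&\\&"); print }')

echo "$doubled"    # prints: aawk
echo "$literal"    # prints: &&wk
```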
gawk provides a more powerful generalized-substitution function, gensub( ); see the gawk(1) manual pages for details.
Substitution is often a better choice for data reduction than indexing and substring operations. Consider the problem of extracting the string value from an assignment in a file with text like this:
composer = "P. D. Q. Bach"
With substitution, we can use:
value = $0
sub(/^ *[a-z]+ *= *"/, "", value)
sub(/" *$/, "", value)
whereas with indexing using code like this:
start = index($0, "\"") + 1
end = start - 1 + index(substr($0,