Online Book Reader


Classic Shell Scripting - Arnold Robbins [124]

is permitted in the language, any number of whitespace characters may be used, so blank lines and indentation can be used for improved readability. However, single statements usually cannot be split across multiple lines, unless the line breaks are immediately preceded with a backslash.
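As a minimal sketch of the backslash continuation rule, the following splits one assignment across two lines. The variable name total is ours, chosen only for illustration.

```shell
total=$(awk 'BEGIN {
    # The backslash directly before the newline continues the statement.
    total = 1 + 2 + \
            3 + 4
    print total
}')
echo "$total"    # prints 10
```

Without the backslash, awk would treat the first line as an incomplete statement and report a syntax error.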

Strings and String Expressions

String constants in awk are delimited by quotation marks: "This is a string constant". Character strings may contain any 8-bit character except the control character NUL (character value 0), which serves as a string terminator in the underlying implementation language, C. The GNU implementation, gawk, removes that restriction, so gawk can safely process arbitrary binary files.

awk strings contain zero or more characters, and there is no limit, other than available memory, on the length of a string. Assignment of a string expression to a variable automatically creates a string, and the memory occupied by any previous string value of the variable is automatically reclaimed.

Backslash escape sequences allow representation of unprintable characters, just like those for the echo command shown in Section 2.5.3. "A\tZ" contains the characters A, tab, and Z, and "\001" and "\x01" each contain just the character Ctrl-A.

Hexadecimal escape sequences are not supported by echo, but were added to awk implementations after they were introduced in the 1989 ISO C Standard. Unlike octal escape sequences, which use at most three digits, the hexadecimal escape consumes all following hexadecimal digits. gawk and nawk follow the C Standard, but mawk does not: it collects at most two hexadecimal digits, reducing "\x404142" to "@4142" instead of to the 8-bit value 0x42 = 66, which is the position of "B" in the ASCII character set. POSIX awk does not support hexadecimal escapes at all.

awk provides several convenient built-in functions for operating on strings; we treat them in detail in Section 9.9. For now, we mention only the string-length function: length( string ) returns the number of characters in string.
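A short sketch of length( ) applied to the escape sequences described earlier: each escape counts as a single character. The shell variable names are illustrative only.

```shell
len_tab=$(awk 'BEGIN { print length("A\tZ") }')     # A, tab, Z
len_ctrl=$(awk 'BEGIN { print length("\001") }')    # the single character Ctrl-A
len_empty=$(awk 'BEGIN { print length("") }')       # the empty string
echo "$len_tab $len_ctrl $len_empty"                # prints 3 1 0
```

Only the octal escape is used here, because hexadecimal escapes are not portable across awk implementations.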

Strings are compared with the conventional relational operators: == (equality), != (inequality), < (less than), <= (less than or equal to), > (greater than), and >= (greater than or equal to). Comparison returns 0 for false and 1 for true. When strings of different lengths are compared and one string is an initial substring of the other, the shorter is defined to be less than the longer: thus, "A" < "AA" evaluates to true.
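The comparison rules can be checked with a few one-liners. This is a sketch with illustrative variable names. The comparisons are parenthesized because an unparenthesized > in a print statement would be taken as output redirection.

```shell
prefix=$(awk 'BEGIN { print ("A" < "AA") }')       # shorter initial substring is less
reverse=$(awk 'BEGIN { print ("AA" < "A") }')      # so the reverse comparison is false
equal=$(awk 'BEGIN { print ("abc" == "abc") }')
differ=$(awk 'BEGIN { print ("abc" != "abd") }')
echo "$prefix $reverse $equal $differ"             # prints 1 0 1 1
```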

Unlike most programming languages with string datatypes, awk has no special string concatenation operator. Instead, two strings in succession are automatically concatenated. Each of these assignments sets the scalar variable s to the same four-character string:

    s = "ABCD"
    s = "AB" "CD"
    s = "A" "BC" "D"
    s = "A" "B" "C" "D"

The strings need not be constants: if we follow the last assignment with:

t = s s s

then t has the value "ABCDABCDABCD".
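Run from the shell, the same concatenation might look like the following sketch, which reuses the variable names s and t from the text.

```shell
t=$(awk 'BEGIN {
    s = "A" "B" "C" "D"    # adjacent strings are concatenated
    t = s s s              # variables concatenate the same way
    print t
}')
echo "$t"    # prints ABCDABCDABCD
```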

Conversion of a number to a string is done implicitly by concatenating the number to an empty string: n = 123, followed by s = "" n, assigns the value "123" to s. Some caution is called for when the number is not exactly representable: we address that later when we show how to do formatted number-to-string conversions in Section 9.9.8.
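A minimal sketch of the implicit conversion, using an integer so that the representability caveat does not arise. The name converted is illustrative.

```shell
converted=$(awk 'BEGIN {
    n = 123
    s = "" n               # concatenation with "" yields the string "123"
    print s, length(s)
}')
echo "$converted"    # prints 123 3
```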

Much of the power of awk comes from its support of regular expressions. Two operators, ~ (matches) and !~ (does not match), make it easy to use regular expressions: "ABC" ~ "^[A-Z]+$" is true, because the left string contains only uppercase letters, and the right regular expression matches any string of (ASCII) uppercase letters. awk supports Extended Regular Expressions (EREs), as described in Section 3.2.3.
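The two matching operators can be tried directly, as in this sketch. The test string "AB1" and the shell variable names are our own additions.

```shell
upper=$(awk 'BEGIN { print ("ABC" ~ "^[A-Z]+$") }')      # all uppercase letters
mixed=$(awk 'BEGIN { print ("AB1" ~ "^[A-Z]+$") }')      # the digit prevents a match
negated=$(awk 'BEGIN { print ("AB1" !~ "^[A-Z]+$") }')   # so !~ is true
echo "$upper $mixed $negated"                            # prints 1 0 1
```

The anchors ^ and $ force the pattern to cover the whole string. Without them, "AB1" would match because it contains an uppercase letter.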

Regular expression constants can be delimited by either quotes or slashes: "ABC" ~ /^[A-Z]+$/ is equivalent to the last example. Which of them to use is largely a matter of programmer taste, although the slashed form is usually preferred, since it emphasizes that the enclosed material is a regular expression, rather than an arbitrary string. However, in the rare cases where a slash delimiter might be confused
