Classic Shell Scripting - Arnold Robbins [141]
value = substr($0, start, end - start)
we need to count characters rather carefully, we do not match the data pattern as precisely, and we have to create two substrings.
String Splitting
The convenient splitting into fields $1, $2, ..., $NF that awk automatically provides for the current input record, $0, is also available as a function: split( string, array, regexp ) breaks string into pieces stored in successive elements of array, where the pieces lie between substrings matched by the regular expression regexp. If regexp is omitted, then the current value of the built-in field-separator variable, FS, is used. The function return value is the number of elements in array. Example 9-6 demonstrates split( ).
Example 9-6. Test program for field splitting
{
    print "\nField separator = FS = \"" FS "\""
    n = split($0, parts)
    for (k = 1; k <= n; k++)
        print "parts[" k "] = \"" parts[k] "\""

    print "\nField separator = \"[ ]\""
    n = split($0, parts, "[ ]")
    for (k = 1; k <= n; k++)
        print "parts[" k "] = \"" parts[k] "\""

    print "\nField separator = \":\""
    n = split($0, parts, ":")
    for (k = 1; k <= n; k++)
        print "parts[" k "] = \"" parts[k] "\""

    print ""
}
If we put the test program shown in Example 9-6 into a file and run it interactively, we can see how split() works:
$ awk -f split.awk
  Harold  and Maude
Field separator = FS = " "
parts[1] = "Harold"
parts[2] = "and"
parts[3] = "Maude"
Field separator = "[ ]"
parts[1] = ""
parts[2] = ""
parts[3] = "Harold"
parts[4] = ""
parts[5] = "and"
parts[6] = "Maude"
Field separator = ":"
parts[1] = "  Harold  and Maude"
root:x:0:1:The Omnipotent Super User:/root:/sbin/sh
Field separator = FS = " "
parts[1] = "root:x:0:1:The"
parts[2] = "Omnipotent"
parts[3] = "Super"
parts[4] = "User:/root:/sbin/sh"
Field separator = "[ ]"
parts[1] = "root:x:0:1:The"
parts[2] = "Omnipotent"
parts[3] = "Super"
parts[4] = "User:/root:/sbin/sh"
Field separator = ":"
parts[1] = "root"
parts[2] = "x"
parts[3] = "0"
parts[4] = "1"
parts[5] = "The Omnipotent Super User"
parts[6] = "/root"
parts[7] = "/sbin/sh"
Notice the difference between the default field-separator value of " ", which causes leading and trailing whitespace to be ignored and runs of whitespace to be treated as a single space, and a field-separator value of "[ ]", which matches exactly one space. For most text processing applications, the first of these gives the desired behavior.
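The contrast between the two separators can be checked from the shell with a short sketch. The function name count_fields is a hypothetical helper introduced here for illustration, not part of the original example. It reports how many pieces split() produces for the same line under the default FS and under "[ ]".

```shell
# count_fields: hypothetical helper, not from the original text.
# Prints the number of pieces split() finds with the default FS
# and with the single-space regular expression "[ ]".
count_fields() {
    printf '%s\n' "$1" |
        awk '{
            n_default = split($0, a)          # uses FS, by default " "
            n_single  = split($0, b, "[ ]")   # every single space separates
            print n_default, n_single
        }'
}

count_fields '  Harold  and Maude'    # prints: 3 6
```

The second count is larger because each leading or doubled space yields an empty piece.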
The colon field-separator example shows that split( ) produces a one-element array when the field separator is not matched, and demonstrates splitting of a record from a typical Unix administrative file, /etc/passwd.
Recent awk implementations provide a useful generalization: split(string, chars, "") breaks string apart into one-character elements in chars[1], chars[2], ..., chars[length(string)]. Older implementations require less efficient code like this:
n = length(string)
for (k = 1; k <= n; k++)
    chars[k] = substr(string, k, 1)
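As a small illustration of the empty-separator form, the sketch below reverses a string one character at a time. It assumes an awk that supports split(string, chars, ""), which is an extension found in implementations such as gawk and mawk. The function name reverse_string is hypothetical.

```shell
# reverse_string: hypothetical helper, not from the original text.
# Assumes the awk in use accepts an empty separator in split().
reverse_string() {
    printf '%s\n' "$1" |
        awk '{
            n = split($0, chars, "")    # one character per array element
            out = ""
            for (k = n; k >= 1; k--)    # walk the characters backwards
                out = out chars[k]
            print out
        }'
}

reverse_string 'Maude'    # prints: eduaM
```

On an awk without this extension, the substr() loop fills chars in the same way, and the rest of the function is unchanged.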
The call split("", array) deletes all elements in array. When delete array is not supported by your awk implementation, it is a faster method for array element deletion than the loop:
for (key in array)
    delete array[key]
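The effect of split("", array) can be confirmed with a minimal sketch that counts the elements before and after the call. The function name clear_demo and the sample string are assumptions made for illustration.

```shell
# clear_demo: hypothetical helper, not from the original text.
# Counts array elements before and after clearing with split("", array).
clear_demo() {
    awk 'BEGIN {
        split("a b c", array)         # array now holds 3 elements
        before = 0
        for (key in array) before++

        split("", array)              # remove every element

        after = 0
        for (key in array) after++
        print before, after
    }'
}

clear_demo    # prints: 3 0
```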
split( ) is an essential function for iterating through multiply subscripted arrays in awk. Here is an example:
for (triple in maildrop)
{
    split(triple, parts, SUBSEP)
    house_number = parts[1]
    street = parts[2]
    postal_code = parts[3]
    ...
}
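A self-contained variant of this loop may make the role of SUBSEP clearer. The sketch below stores a single entry under a three-part subscript and then recovers the parts with split(). The function name maildrop_demo and the sample address values are invented for illustration.

```shell
# maildrop_demo: hypothetical helper with made-up sample data.
maildrop_demo() {
    awk 'BEGIN {
        # awk joins the three subscripts with SUBSEP to form one string key.
        maildrop[53, "Oak Lane", "T4Q 7XV"] = "Harold"

        for (triple in maildrop) {
            split(triple, parts, SUBSEP)    # recover the three subscripts
            print parts[1] "|" parts[2] "|" parts[3] "|" maildrop[triple]
        }
    }'
}

maildrop_demo    # prints: 53|Oak Lane|T4Q 7XV|Harold
```

Because the key is an ordinary string, this approach fails if a subscript itself contains the SUBSEP character.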
String Reconstruction
There is no standard built-in awk function that is the inverse of split( ), but it is easy to write one, as shown in Example 9-7. join( ) ensures that the argument array is not referenced unless the index is known to be in bounds. Otherwise, a call with a zero array length might create array[1], modifying the caller's array. The inserted field separator is an ordinary string, rather than a regular expression, so for general regular expressions passed to split( ), join( ) does not reconstruct the original string exactly.
Example 9-7. Joining array