Classic Shell Scripting - Arnold Robbins [40]
If sed is going to be replacing the text matched by a regular expression, it's important to be sure that the regular expression doesn't match too little or too much text. Here's a simple example:
$ echo Tolstoy writes well | sed 's/Tolstoy/Camus/'        Use fixed strings
Camus writes well
Of course, sed can use full regular expressions. This is where understanding the "longest leftmost" rule becomes important:
$ echo Tolstoy is worldly | sed 's/T.*y/Camus/'            Try a regular expression
Camus                                                      What happened?
The apparent intent was to match just Tolstoy. However, since the match extends over the longest possible amount of text, it went all the way to the y in worldly! What's needed is a more refined regular expression:
$ echo Tolstoy is worldly | sed 's/T[[:alpha:]]*y/Camus/'
Camus is worldly
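The two patterns can be compared side by side on the same input. The following is only a small sketch; the variable names (input, greedy, refined) are chosen here for illustration and do not come from the original text:

```shell
# Compare a greedy pattern with a more restrictive one on the same input.
# The variable names are illustrative only.
input='Tolstoy is worldly'

# T.*y: the match starts at the leftmost T and extends to the last y on the line.
greedy=$(printf '%s\n' "$input" | sed 's/T.*y/Camus/')

# T[[:alpha:]]*y: only letters may follow the T, so the match cannot cross a space.
refined=$(printf '%s\n' "$input" | sed 's/T[[:alpha:]]*y/Camus/')

printf 'greedy:  %s\n' "$greedy"
printf 'refined: %s\n' "$refined"
```

Running both substitutions against one fixed input like this is a cheap way to confirm that a pattern matches exactly the text intended before it goes into a larger script.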
In general, and especially if you're still learning the subtleties of regular expressions, when developing scripts that do lots of text slicing and dicing, you'll want to test things very carefully, and verify each step as you write it.
Finally, as we've seen, it's possible to match the null string when doing text searching. This is also true when doing text replacement, allowing you to insert text:
$ echo abc | sed 's/b*/1/'          Replace first match
1abc
$ echo abc | sed 's/b*/1/g'         Replace all matches
1a1c1
Note how b* matches the null string at the front and at the end of abc.
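If inserting text at null matches is not what you want, one possible approach is to require at least one occurrence of the character. The sketch below contrasts b* with bb*, which in a BRE means "one b followed by zero or more b's"; the variable names are illustrative only:

```shell
# b* can match the null string, so with the g flag it also matches
# at positions where no b is present.
with_null=$(echo abc | sed 's/b*/1/g')

# bb* requires at least one b, so only the actual b is replaced.
without_null=$(echo abc | sed 's/bb*/1/g')

printf 'b*  gives: %s\n' "$with_null"
printf 'bb* gives: %s\n' "$without_null"
```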
Lines Versus Strings
It is important to make a distinction between lines and strings. Most simple programs work on lines of input data. This includes grep and egrep, and 99 percent of the time, sed. In such a case, by definition there won't be any embedded newline characters in the data being matched, and ^ and $ represent the beginning and end of the line, respectively.
However, programming languages that work with regular expressions, such as awk, Perl, and Python, usually work on strings. It may be that each string represents a single input line, in which case ^ and $ still represent the beginning and end of the line. These languages also let you specify how input records are delimited, which opens up the possibility that a single input "line" (i.e., record) contains embedded newlines. In such a case, ^ and $ do not match at an embedded newline; they represent only the beginning and end of the string. This point is worth bearing in mind when you start using the more programmable software tools.
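As a rough illustration, awk can be asked to treat blank-line-separated paragraphs as records by setting RS to the empty string. In the sketch below (the sample data and variable names are made up for this example), the first record is the two-line string "one", newline, "two", so the anchors apply to that whole string rather than to each line within it:

```shell
# Two records separated by a blank line; the first one holds an embedded newline.
# The sample data and variable names are illustrative only.
data=$(printf 'one\ntwo\n\nthree\n')

# RS = "" selects paragraph mode: blank lines separate records.
# Each command counts the records that match the given pattern.

# "one" is at the start of the first record, so ^one matches once.
start_of_string=$(printf '%s\n' "$data" | awk 'BEGIN { RS = "" } /^one/ { n++ } END { print n + 0 }')

# "two" follows an embedded newline, not the start of the string: no match.
after_newline=$(printf '%s\n' "$data" | awk 'BEGIN { RS = "" } /^two/ { n++ } END { print n + 0 }')

# "one" precedes an embedded newline, not the end of the string: no match.
before_newline=$(printf '%s\n' "$data" | awk 'BEGIN { RS = "" } /one$/ { n++ } END { print n + 0 }')

printf '^one: %s  ^two: %s  one$: %s\n' "$start_of_string" "$after_newline" "$before_newline"
```

The same patterns given to grep on this input would match, because grep sees "one" and "two" as separate lines.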
* * *
[1] The original Unix version from 1992 is at ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z. A current version for Windows systems is at http://www.tgries.de/agrep/337/agrep337.zip. Unlike most downloadable software that we cite in this book, agrep is not freely usable for any arbitrary purpose; see the permissions files that come with the program.
[2] So named as a pun on more. See ftp://ftp.gnu.org/gnu/less/.
[3] The corresponding [^] is not a valid regular expression. Make sure you understand why.
[4] This reflects differences in the historical behavior of the grep and egrep commands, not a technical incapability of regular expression matchers. Such is life with Unix.
[5] An exception is that the meaning of a * as the first character of an ERE is "undefined," whereas in a BRE it means "match a literal *."
[6] Programmer's Workbench (PWB) Unix was a variant used within AT&T to support telephone switch software development. It was also made available for commercial use.
[7] This script does have a flaw: it can't handle directories whose names contain spaces. This can be solved using techniques we haven't seen yet; see Chapter 10.
Working with Fields
For many applications, it's helpful to view your data as consisting of records and fields. A record is a single collection of related