Online Book Reader


Classic Shell Scripting - Arnold Robbins [40]

"Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string." (Subpatterns are the parts enclosed in parentheses in an ERE. For this purpose, GNU programs often extend this feature to \(...\) in BREs too.)

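One way to watch the subpattern rule at work is to let two adjacent subpatterns compete for the same text. The following is a minimal sketch rather than an example from the book: the input aaa and the bracketed replacement text are made up for illustration, and it relies on the \(...\) grouping that sed accepts in BREs.

```shell
# Both subpatterns could match any number of a's.
# The leftmost one takes the longest possible string; the second gets what is left.
result=$(echo aaa | sed 's/\(a*\)\(a*\)/[\1][\2]/')
echo "$result"
```

The first pair of brackets in the output holds all three a's, and the second pair is empty.
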
If sed is going to be replacing the text matched by a regular expression, it's important to be sure that the regular expression doesn't match too little or too much text. Here's a simple example:

$ echo Tolstoy writes well | sed 's/Tolstoy/Camus/'      Use fixed strings
Camus writes well

Of course, sed can use full regular expressions. This is where understanding the "longest leftmost" rule becomes important:

$ echo Tolstoy is worldly | sed 's/T.*y/Camus/'          Try a regular expression
Camus                                                    What happened?

The apparent intent was to match just Tolstoy. However, since the match extends over the longest possible amount of text, it went all the way to the y in worldly! What's needed is a more refined regular expression:

$ echo Tolstoy is worldly | sed 's/T[[:alpha:]]*y/Camus/'
Camus is worldly

In general, and especially if you're still learning the subtleties of regular expressions, when developing scripts that do lots of text slicing and dicing, you'll want to test things very carefully, and verify each step as you write it.
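One possible way to verify a pattern before trusting it in a substitution is to mark what it matches instead of replacing it. The sketch below is an illustration, not the book's own procedure: it reuses the two patterns discussed here and wraps each match in square brackets, using & in the replacement to stand for the matched text. The variable names greedy and refined are arbitrary.

```shell
# Wrap whatever the pattern matches in brackets instead of replacing it.
# In a sed replacement, & stands for the entire matched text.
greedy=$(echo Tolstoy is worldly | sed 's/T.*y/[&]/')
refined=$(echo Tolstoy is worldly | sed 's/T[[:alpha:]]*y/[&]/')

echo "$greedy"     # brackets surround the whole line
echo "$refined"    # brackets surround only Tolstoy
```

Seeing the brackets land around the entire line makes it obvious that the first pattern matches too much, before any text has been lost.
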

Finally, as we've seen, it's possible to match the null string when doing text searching. This is also true when doing text replacement, allowing you to insert text:

$ echo abc | sed 's/b*/1/'                               Replace first match
1abc
$ echo abc | sed 's/b*/1/g'                              Replace all matches
1a1c1

Note how b* matches the null string at the front and at the end of abc.
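The null-string matches may be easier to see with a pattern that can never match a real character in the input. This is a small sketch under that assumption: x does not occur in abc, so x* can match only the null string, and the - used as replacement text is an arbitrary choice.

```shell
# x never occurs in abc, so x* can only match the null string.
# With the g flag, the replacement is inserted at every position where that happens.
result=$(echo abc | sed 's/x*/-/g')
echo "$result"
```

The replacement shows up before each character and once more at the end of the line, which are exactly the places where the null string matches.
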

Lines Versus Strings

It is important to make a distinction between lines and strings. Most simple programs work on lines of input data. This includes grep and egrep, and 99 percent of the time, sed. In such a case, by definition there won't be any embedded newline characters in the data being matched, and ^ and $ represent the beginning and end of the line, respectively.

However, programming languages that work with regular expressions, such as awk, Perl, and Python, usually work on strings. It may be that each string represents a single input line, in which case ^ and $ still represent the beginning and end of the line. However, these languages allow you to use different ways to specify how input records are delimited, opening up the possibility that a single input "line" (i.e., record) may indeed have embedded newlines. In such a case, ^ and $ do not match an embedded newline; they represent only the beginning and end of a string. This point is worth bearing in mind when you start using the more programmable software tools.
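The effect on ^ can be sketched with awk in paragraph mode, where setting RS to the empty string makes blank lines the record separator. The sample input and the variable names below are invented for illustration; the first record is a two-line string with an embedded newline.

```shell
# RS = "" puts awk in paragraph mode: records are separated by blank lines,
# so the first record here is the two-line string "alpha\nbeta".
input=$(printf 'alpha\nbeta\n\ngamma\n')

# ^ anchors at the start of the record, so this counts records starting with "alpha".
at_start=$(printf '%s\n' "$input" |
    awk 'BEGIN { RS = "" } /^alpha/ { n++ } END { print n + 0 }')

# "beta" follows an embedded newline, not the start of the record,
# so ^beta matches no record at all.
after_newline=$(printf '%s\n' "$input" |
    awk 'BEGIN { RS = "" } /^beta/ { n++ } END { print n + 0 }')

echo "$at_start $after_newline"
```

The first count is 1 and the second is 0: beta begins a line of the input, but it does not begin a record, so ^ does not match there.
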

* * *

[1] The original Unix version from 1992 is at ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z. A current version for Windows systems is at http://www.tgries.de/agrep/337/agrep337.zip. Unlike most downloadable software that we cite in this book, agrep is not freely usable for any arbitrary purpose; see the permissions files that come with the program.

[2] So named as a pun on more. See ftp://ftp.gnu.org/gnu/less/.

[3] The corresponding [^] is not a valid regular expression. Make sure you understand why.

[4] This reflects differences in the historical behavior of the grep and egrep commands, not a technical incapability of regular expression matchers. Such is life with Unix.

[5] An exception is that the meaning of a * as the first character of an ERE is "undefined," whereas in a BRE it means "match a literal *."

[6] Programmer's Workbench (PWB) Unix was a variant used within AT&T to support telephone switch software development. It was also made available for commercial use.

[7] This script does have a flaw: it can't handle directories whose names contain spaces. This can be solved using techniques we haven't seen yet; see Chapter 10.

Working with Fields

For many applications, it's helpful to view your data as consisting of records and fields. A record is a single collection of related
