Online Book Reader


Classic Shell Scripting - Arnold Robbins [45]

datafiles. POSIX mandates a single version with different options to provide the behavior traditionally obtained from the three grep variants: grep, egrep, and fgrep.
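As a minimal sketch of how one grep covers the three traditional variants, the following compares the default BRE behavior with the -F (fixed string) and -E (extended) options. The sample lines and variable names are made up for illustration.

```shell
# Hypothetical sample lines, invented for this illustration.
sample='a.c
abc
a+c'

# Plain grep (BRE): the dot is a metacharacter matching any one character,
# so all three lines match.
bre=$(printf '%s\n' "$sample" | grep 'a.c')

# grep -F (traditionally fgrep): the pattern is a fixed string,
# so the dot matches only a literal dot.
fixed=$(printf '%s\n' "$sample" | grep -F 'a.c')

# grep -E (traditionally egrep): ERE syntax, where | means alternation
# and \+ is a literal plus sign.
ere=$(printf '%s\n' "$sample" | grep -E 'abc|a\+c')

printf 'BRE:\n%s\nfixed:\n%s\nERE:\n%s\n' "$bre" "$fixed" "$ere"
```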

Although you can search for plain string constants, regular expressions provide a more powerful way to describe text to be matched. Most characters match themselves, whereas certain others act as metacharacters, specifying actions such as "match zero or more of," "match exactly 10 of," and so on.

POSIX regular expressions come in two flavors: Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs). Which programs use which regular expression flavor is based upon historical practice, with the POSIX specification reducing the number of regular expression flavors to just two. For the most part, EREs are a superset of BREs, but not completely.
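The difference between the two flavors can be sketched with grep, which uses BREs by default and EREs with -E. The example assumes a few invented input lines; it shows an interval expression ("match exactly 3 of") written both ways, and one case where the flavors disagree about what parentheses mean.

```shell
# Hypothetical input lines, invented for this illustration.
lines='aaa
aa
a(b)
ab'

# Interval expressions: BREs need backslashes before the braces, EREs do not.
bre_interval=$(printf '%s\n' "$lines" | grep 'a\{3\}')
ere_interval=$(printf '%s\n' "$lines" | grep -E 'a{3}')

# Parentheses: literal characters in a BRE, grouping operators in an ERE.
bre_paren=$(printf '%s\n' "$lines" | grep 'a(b)')
ere_paren=$(printf '%s\n' "$lines" | grep -E 'a(b)')

printf '%s\n' "$bre_interval" "$ere_interval" "$bre_paren" "$ere_paren"
```

The same pattern text, a(b), selects a different line under each flavor, which is one reason EREs are not a strict superset of BREs.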

Regular expressions are sensitive to the locale in which the program runs; in particular, ranges within a bracket expression should be avoided in favor of character classes such as [[:alnum:]]. Many GNU programs have additional metacharacters.
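A small sketch of the character-class advice, assuming three invented input lines: the pattern uses [[:alnum:]] rather than a range such as [a-zA-Z0-9], whose meaning can vary with the locale.

```shell
# Hypothetical input lines, invented for this illustration.
words='abc123
hello world
---'

# Select lines made up entirely of alphanumeric characters.
alnum_only=$(printf '%s\n' "$words" | grep -E '^[[:alnum:]]+$')

printf '%s\n' "$alnum_only"
```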

sed is the primary tool for making simple string substitutions. Since, in our experience, most shell scripts use sed only for substitutions, we have purposely not covered everything sed can do. The sed & awk book listed in Chapter 16 provides more information.
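A typical substitution looks something like the following sketch. The input line is invented; the two commands differ only in the g flag, which replaces every match on the line instead of just the first.

```shell
# Hypothetical input line, invented for this illustration.
line='the cat sat on the mat'

# Replace only the first occurrence of "the" on the line.
first=$(printf '%s\n' "$line" | sed 's/the/a/')

# Replace every occurrence on the line with the g (global) flag.
all=$(printf '%s\n' "$line" | sed 's/the/a/g')

printf '%s\n%s\n' "$first" "$all"
```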

The "longest leftmost" rule describes where text matches and for how long the match extends. This is important when doing text substitutions with sed, awk, or an interactive text editor. It is also important to understand when there is a distinction between a line and a string. In some programming languages, a single string may contain multiple lines, in which case ^ and $ usually apply to the beginning and end of the string.
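The rule can be seen in a short sed sketch, using invented input. In the first command the match starts at the leftmost quote and extends as far as possible, swallowing both quoted words. In the second, x* can match the empty string, and the leftmost place it can do so is the very start of the line.

```shell
# Hypothetical input, invented for this illustration.
text='say "one" and "two"'

# ".*" starts at the leftmost quote and runs to the last quote on the line.
longest=$(printf '%s\n' "$text" | sed 's/".*"/Q/')

# x* matches zero characters at the start of the line, so the
# replacement is inserted before the a.
leftmost=$(printf '%s\n' 'abc' | sed 's/x*/1/')

printf '%s\n%s\n' "$longest" "$leftmost"
```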

For many operations, it's useful to think of each line in a text file as an individual record, with data in the line consisting of fields. Fields are separated by either whitespace or a special delimiter character, and different Unix tools are available to work with both kinds of data. The cut command cuts out selected ranges of characters or fields, and join is handy for merging files where records share a common key field.
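The following sketch shows both tools on invented data. The file names sales and quotas are hypothetical, and mktemp is assumed to be available (it is common but not required by POSIX). Note that join expects both inputs to be sorted on the key field.

```shell
# Hypothetical colon-delimited records, invented for this illustration.
records='alice:x:1001:Alice Smith
bob:x:1002:Bob Jones'

# cut by field: keep fields 1 and 4, using : as the delimiter.
names=$(printf '%s\n' "$records" | cut -d: -f1,4)

# cut by character position: keep characters 2 through 4.
chars=$(printf '%s\n' 'abcdef' | cut -c2-4)

# join: merge two whitespace-delimited files on their first field.
tmpdir=$(mktemp -d)
printf '%s\n' 'alice 10' 'bob 20' > "$tmpdir/sales"
printf '%s\n' 'alice 5' 'bob 7' > "$tmpdir/quotas"
joined=$(join "$tmpdir/sales" "$tmpdir/quotas")
rm -rf "$tmpdir"

printf '%s\n' "$names" "$chars" "$joined"
```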

awk is often used for simple one-liners, where it's necessary to just print selected fields, or rearrange the order of fields within a line. Since it's a programming language, you have much more power, flexibility, and control, even in small programs.
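Two one-liners of this kind are sketched below on invented records: the first rearranges whitespace-separated fields, and the second uses -F to pick a field from a colon-delimited line.

```shell
# Hypothetical records, invented for this illustration.
record='alice 30 engineering'

# Print the third field followed by the first.
swapped=$(printf '%s\n' "$record" | awk '{ print $3, $1 }')

# Use : as the field separator and print only the first field.
login=$(printf '%s\n' 'alice:x:1001' | awk -F: '{ print $1 }')

printf '%s\n%s\n' "$swapped" "$login"
```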

Chapter 4. Text Processing Tools

Some operations on text files are so widely applicable that standard tools for those tasks were developed early in the Unix work at Bell Labs. In this chapter, we look at the most important ones.

Sorting Text

Text files that contain independent records of data are often candidates for sorting. A predictable record order makes life easier for human users: book indexes, dictionaries, parts catalogs, and telephone directories have little value if they are unordered. Sorted records can also make programming easier and more efficient, as we will illustrate with the construction of an office directory in Chapter 5.

Like awk, cut, and join, sort views its input as a stream of records made up of fields of variable width, with records delimited by newline characters and fields delimited by whitespace or a user-specifiable single character.
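As a rough sketch of this record-and-field view, the following sorts invented colon-delimited records, first as whole lines and then numerically on the second field. It relies on the standard -t (field delimiter) and -k (key field) options, and sets LC_ALL=C so that the ordering does not depend on the reader's locale.

```shell
# Hypothetical name:age records, invented for this illustration.
records='carol:30
alice:25
bob:35'

# Default: compare whole lines.
by_name=$(printf '%s\n' "$records" | LC_ALL=C sort)

# Use : as the field delimiter and sort numerically on field 2 only.
by_age=$(printf '%s\n' "$records" | LC_ALL=C sort -t: -k2,2n)

printf '%s\n' "$by_name" "$by_age"
```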

* * *

sort


Usage

sort [ options ] [ file(s) ]

Purpose

Sort input lines into an order determined by the key field and datatype options, and the locale.

Major options

-b

Ignore leading whitespace.

-c

Check that input is correctly sorted. There is no output, but the exit code is nonzero if the input is not sorted.

-d

Dictionary order: only alphanumerics and whitespace are significant.

-g

General numeric value: compare fields as floating-point numbers. This works like -n, except that numbers may have decimal points and exponents (e.g., 6.022e+23). GNU version only.

-f

Fold letters implicitly to a common lettercase so that sorting is case-insensitive.

-i

Ignore nonprintable characters.

-k
