Classic Shell Scripting - Arnold Robbins [61]

.... We temporarily assume that no accented characters are required in the directory, but we can easily allow for angle brackets and ampersands in the input stream by adding three initial sed steps. We collect the complete program in Example 5-2.

Example 5-2. Converting an office directory to HTML

#! /bin/sh
# Convert a tab-separated value file to grammar-conformant HTML.
#
# Usage:
#       tsv-to-html < infile > outfile

# Leading boilerplate
cat << EOFILE
<HTML>
    <HEAD>
        <TITLE>
            Office directory
        </TITLE>
    </HEAD>
    <BODY>
        <TABLE>
EOFILE

# Convert special characters to entities, and supply table markup
sed -e 's=&=\&amp;=g' \
    -e 's=<=\&lt;=g' \
    -e 's=>=\&gt;=g' \
    -e 's=\t=</TD><TD>=g' \
    -e 's=^.*$=            <TR><TD>&</TD></TR>='

# Trailing boilerplate
cat << EOFILE
        </TABLE>
    </BODY>
</HTML>
EOFILE

The << notation is called a here document. It is explained in more detail in Section 7.3.1. Briefly, the shell reads all lines up to the delimiter following the << (EOFILE in this case), does variable and command substitution on the contained lines, and feeds the results as standard input to the command.
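
To make the substitution behavior concrete, here is a minimal sketch. The variable owner and the two function names are our own illustrations, not part of tsv-to-html. It contrasts an unquoted delimiter, where the shell expands the body, with a quoted delimiter, which in POSIX shells passes the body through untouched:

```shell
#! /bin/sh
# Hypothetical demo: unquoted versus quoted here-document delimiters.

owner=jones    # illustrative variable, not from the original script

expanded() {
    # Unquoted delimiter: $owner and $(...) are expanded by the shell
    cat << EOFILE
owner=$owner sum=$(expr 2 + 3)
EOFILE
}

literal() {
    # Quoted delimiter: the body is fed to cat verbatim
    cat << 'EOFILE'
owner=$owner sum=$(expr 2 + 3)
EOFILE
}

expanded    # prints: owner=jones sum=5
literal     # prints: owner=$owner sum=$(expr 2 + 3)
```

The quoted form is the one to reach for when the boilerplate itself contains dollar signs or backquotes that must survive unchanged.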

There is an important point about the script in Example 5-2: it is independent of the number of columns in the table! This means that it can be used to convert any tab-separated value file to HTML. Spreadsheet programs can usually save data in such a format, so our simple tool can produce correct HTML from spreadsheet data.
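
The column independence is easy to check with a small experiment. The sketch below wraps sed steps of the kind used in tsv-to-html in a function that we call tsv_rows (a hypothetical name), and feeds it rows of two and three fields. We assume a literal tab built with printf, because not every sed implementation understands \t:

```shell
#! /bin/sh
# Sketch: the same edit steps applied to rows of different widths.
# tsv_rows is a hypothetical helper, not a program from the book.

tab=$(printf '\t')    # literal tab, for seds that lack \t

tsv_rows() {
    sed -e 's=&=\&amp;=g' \
        -e 's=<=\&lt;=g' \
        -e 's=>=\&gt;=g' \
        -e "s=$tab=</TD><TD>=g" \
        -e 's=^.*$=<TR><TD>&</TD></TR>='
}

# Two columns
printf 'Jones\t555-0123\n' | tsv_rows

# Three columns, with characters that need entities
printf 'a<b\tR&D\tRoom 5\n' | tsv_rows
```

The first command should print <TR><TD>Jones</TD><TD>555-0123</TD></TR>, and the second <TR><TD>a&lt;b</TD><TD>R&amp;D</TD><TD>Room 5</TD></TR>. Nothing in the edit steps counts fields: each tab simply becomes a cell boundary.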

We were careful in tsv-to-html to maintain the spacing structure of the original office directory, because that makes it easy to apply further filters downstream. Indeed, html-pretty was written precisely for that reason: standardization of HTML markup layout radically simplifies other HTML tools.

How would we handle conversion of accented characters to HTML entities? We could augment the sed command with extra edit steps like -e 's=é=\&eacute;=g', but there are about 100 entities to cater for, and we are likely to need similar substitutions as we convert other kinds of text files to HTML.

It therefore makes sense to delegate that task to a separate program that we can reuse, either as a pipeline stage following the sed command in Example 5-2, or as a filter applied later. (This is the "detour to build specialized tools" principle in action.) Such a program is just a tedious tabulation of substitution commands, and we need one for each of the local text encodings, such as the various ISO 8859-n code pages mentioned in Section B.2 in Appendix B. We don't show such a filter completely here, but a fragment of one in Example 5-3 gives the general flavor. For readers who need it, we include the complete program for handling the common case of Western European characters in the ISO 8859-1 encoding with this book's sample programs. HTML's entity repertoire isn't sufficient for other accented characters, but since the World Wide Web is moving in the direction of Unicode and XML in place of ASCII and HTML, this problem is being solved in a different way, by getting rid of character set limitations.
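
The position of such a filter in the pipeline matters. The following sketch uses two hypothetical functions, escape_markup and accents_to_entities, the latter covering only two ISO 8859-1 characters. It suggests why the accent filter belongs after the ampersand conversion rather than before it. We build the accented bytes with printf octal escapes and run sed under LC_ALL=C, on the assumption that the input really is ISO 8859-1:

```shell
#! /bin/sh
# Sketch of stage ordering; both function names are hypothetical, and
# the accent filter handles only two of the ISO 8859-1 characters.

eacute=$(printf '\351')    # byte 0xE9, e-acute in ISO 8859-1
oslash=$(printf '\370')    # byte 0xF8, o-slash in ISO 8859-1

escape_markup() {
    LC_ALL=C sed -e 's=&=\&amp;=g' -e 's=<=\&lt;=g' -e 's=>=\&gt;=g'
}

accents_to_entities() {
    # LC_ALL=C makes sed treat its input as single bytes
    LC_ALL=C sed -e "s=$eacute=\&eacute;=g" -e "s=$oslash=\&oslash;=g"
}

# Markup escaping first, accent filter second: entities stay intact
printf 'caf\351 & bar\n' | escape_markup | accents_to_entities

# Reversed order: the ampersand of the new entity is escaped again
printf 'caf\351 & bar\n' | accents_to_entities | escape_markup
```

The first pipeline should print caf&eacute; &amp; bar, whereas the second prints caf&amp;eacute; &amp; bar, which a browser would display as the literal text of the entity name.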

Example 5-3. Fragment of iso8859-1-to-html program

#! /bin/sh
# Convert an input stream containing characters in ISO 8859-1
# encoding from the range 128..255 to HTML equivalents in ASCII.
# Characters 0..127 are preserved as normal ASCII.
#
# Usage:
#       iso8859-1-to-html infile(s) >outfile

sed \
    -e 's= =\&nbsp;=g' \
    -e 's=¡=\&iexcl;=g' \
    -e 's=¢=\&cent;=g' \
    -e 's=£=\&pound;=g' \
...
    -e 's=ü=\&uuml;=g' \
    -e 's=ý=\&yacute;=g' \
    -e 's=þ=\&thorn;=g' \
    -e 's=ÿ=\&yuml;=g' \
    "$@"

Here is a sample of the use of this filter:

$ cat danish                        Show sample Danish text in ISO 8859-1 encoding
Øen med åen lå i læ af én halvø,
og én stor ø, langs den græske kyst.

$ iso8859-1-to-html danish          Convert text to HTML entities
&Oslash;en med &aring;en l&aring; i l&aelig; af &eacute;n halv&oslash;,
og &eacute;n stor &oslash;, langs den gr&aelig;ske kyst.

* * *

[4] In addition
