Classic Shell Scripting - Arnold Robbins [60]
An HTML document is structured as an HTML object containing one HEAD and one BODY object.
Inside the HEAD, a TITLE object defines the document title that web browsers display in the window titlebar and in bookmark lists. Also inside the HEAD, the LINK object generally carries information about the web-page maintainer.
The visible part of the document that browsers show is the contents of the BODY.
Whitespace is not significant outside of quoted strings, so we can use horizontal and vertical spacing liberally to emphasize the structure, as the HTML prettyprinter does.
Everything else is just printable ASCII text, with three exceptions. Literal angle brackets must be represented by special encodings, called entities, that consist of an ampersand, an identifier, and a semicolon: &lt; and &gt;. Since ampersand starts entities, it has its own literal entity name: &amp;. HTML supports a modest repertoire of entities for accented characters that cover most of the languages of Western Europe so that we can write, for example, caf&eacute; du bon go&ucirc;t to get café du bon goût.
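The three exceptions can be handled mechanically before any text is placed in an HTML file. The following is only a minimal sketch, not something taken from the original text: html_escape is a hypothetical helper name, and it assumes the input is plain ASCII text arriving on standard input.

```shell
#!/bin/sh
# html_escape: hypothetical helper (the name is an assumption) that rewrites
# the three special characters as entities. The ampersand rule must run
# first; otherwise the & in the freshly inserted &lt; and &gt; would itself
# be escaped a second time.
html_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g'
}

# Example: escape one line of text read from standard input.
printf '%s\n' 'if (a < b && b > c)' | html_escape
```

Run as written, the example line comes out as if (a &lt; b &amp;&amp; b &gt; c). Accented characters are left alone here; mapping them to named entities would need a longer table of substitutions.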
Although not shown in our minimal example, font style changes are accomplished in HTML with B (bold), EM (emphasis), I (italic), STRONG (extra bold), and TT (typewriter, that is, fixed-width characters) environments: write <B>bold phrase</B> to get bold phrase displayed in boldface.
To convert our office directory to proper HTML, we need only one more bit of information: how to format a table, since that is what our directory really is and we don't want to force the use of typewriter fonts to get everything to line up in the browser display.
In HTML 3.0 and later, a table consists of a TABLE environment, inside of which are rows, each of them a table row (TR) environment. Inside each row are cells, called table data, each a TD environment. Notice that columns of data receive no special markup: a data column is simply the set of cells taken from the same row position in all of the rows of the table. Happily, we don't need to declare the number of rows and columns in advance. The job of the browser or formatter is to collect all of the cells, determine the widest cell in each column, and then format the table with columns just wide enough to hold those widest cells.
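The width calculation that the browser performs can be imitated outside of HTML. The sketch below is an illustration under stated assumptions, not the browser's actual algorithm: column_widths is a made-up name, cells are assumed to be separated by tabs, and the second input row is invented sample data.

```shell
#!/bin/sh
# column_widths: hypothetical helper that reports the widest cell in each
# column of tab-separated input, printing one "column width" pair per line.
column_widths() {
    awk -F '\t' '
        {
            # Rows may have different numbers of cells; remember the maximum.
            if (NF > ncols) ncols = NF
            for (i = 1; i <= NF; i++)
                if (length($i) > width[i]) width[i] = length($i)
        }
        END {
            for (i = 1; i <= ncols; i++)
                printf "%d %d\n", i, width[i]
        }'
}

# Example: the second row is invented purely for illustration.
printf 'Jones, Adrian W.\t555-0123\tOSD211\nLee, Kim\t555-0100\tOSD2\n' |
    column_widths
```

Note that the number of rows and columns is never declared: as in the TABLE environment, both are discovered from the data itself.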
For our office directory example, we need just three columns, so our sample entry could be marked up like this:
    <TR>
      <TD> Jones, Adrian W. </TD>
      <TD> 555-0123 </TD>
      <TD> OSD211 </TD>
    </TR>
An equivalent, but compact and hard-to-read, encoding might look like this:
    <TR><TD>Jones, Adrian W.</TD><TD>555-0123</TD><TD>OSD211</TD></TR>
Because we chose to preserve special field separators in the text version of the office directory, we have sufficient information to identify the cells in each row. Also, because whitespace is mostly not significant in HTML files (except to humans), we need not be particularly careful about getting tags nicely lined up: if that is needed later, html-pretty can do it perfectly. Our conversion filter then has three steps:
1. Output the leading boilerplate down to the beginning of the document body.
2. Wrap each directory row in table markup.
3. Output the trailing boilerplate.
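The three steps can be sketched as a small shell function. This is one possible arrangement rather than a definitive script: directory_to_html is a hypothetical name, the title text is a placeholder, the DOCTYPE line is deliberately left out, and the input is assumed to be one directory entry per line with tab-separated fields.

```shell
#!/bin/sh
# directory_to_html: hypothetical filter (the name is an assumption). Reads
# tab-separated directory rows on standard input and writes an HTML document
# on standard output.
directory_to_html() {
    tab=$(printf '\t')    # a literal tab, kept in a variable for readability

    # Step 1: leading boilerplate. The DOCTYPE line is omitted from this
    # sketch, and the title text is only a placeholder.
    cat <<'EOF'
<HTML>
  <HEAD>
    <TITLE>Office directory</TITLE>
  </HEAD>
  <BODY>
    <TABLE>
EOF

    # Step 2: turn each tab into a cell boundary, then wrap the whole line
    # in row and cell tags.
    sed -e "s=${tab}=</TD><TD>=g" \
        -e 's=^.*$=      <TR><TD>&</TD></TR>='

    # Step 3: trailing boilerplate.
    cat <<'EOF'
    </TABLE>
  </BODY>
</HTML>
EOF
}

# Example: convert the sample entry.
printf 'Jones, Adrian W.\t555-0123\tOSD211\n' | directory_to_html
```

The sketch does not escape ampersands or angle brackets in the data, so entries containing those characters would need to be converted to entities before reaching the sed step.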
We have to make one small change from our minimal example: the DOCTYPE command has to be updated to a later grammar level so that it looks like this:
You don't have to memorize this: html-pretty has options to produce output in any of the standard HTML grammar levels, so you can just copy a suitable DOCTYPE command from its output.
Clearly, most of the work is just writing boilerplate, but that is simple since we can just copy text from the minimal HTML example. The only programmatic step required is the middle one, which we could do with only a couple of lines in awk. However, we can achieve it with even less work using a sed stream-editor substitution with two edit commands: one to substitute the embedded tab delimiters with