What's in a Unix File?
One of the tremendous successes of Unix has been its simple view of files: Unix files are just streams of zero or more anonymous bytes of data.
Most other operating systems have different types of files: binary versus text data, counted-length versus fixed-length versus variable-length records, indexed versus random versus sequential access, and so on. This rapidly produces the nightmarish situation that the conceptually simple job of copying a file must be done differently depending on the file type, and since virtually all software has to deal with files, the complexity is widespread.
A Unix file-copy operation is trivial:
try-to-get-a-byte
while (have-a-byte)
{
    put-a-byte
    try-to-get-a-byte
}
This sort of loop can be implemented in many programming languages, and its great beauty is that the program need not be aware of where the data is coming from: it could be from a file, or a magnetic tape device, or a pipe, or a network connection, or a kernel data structure, or any other data source that designers dream up in the future.
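To make the idea concrete, here is a minimal sketch of that loop in C, chosen because the pseudocode maps onto the standard I/O library almost one for one: getchar() is try-to-get-a-byte, the EOF test is have-a-byte, and putchar() is put-a-byte.

    #include <stdio.h>

    /* Copy standard input to standard output, one byte at a time. */
    int main(void)
    {
        int c;                   /* int, not char, so that EOF fits */

        c = getchar();           /* try-to-get-a-byte */
        while (c != EOF)         /* while (have-a-byte) */
        {
            putchar(c);          /* put-a-byte */
            c = getchar();       /* try-to-get-a-byte */
        }
        return 0;
    }

Because the program reads only standard input and writes only standard output, it neither knows nor cares whether those streams are attached to files, pipes, devices, or network connections. Idiomatic C usually fuses the read and the test into while ((c = getchar()) != EOF), but the longer form above mirrors the pseudocode exactly.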
Ahh, you say, but I need a special file that has a trailing directory of pointers into the earlier data, and that data is itself encrypted. In Unix the answer is: Go for it! Make your application program understand your fancy file format, but don't trouble the filesystem or operating system with that complexity. They do not need to know about it.
There is, however, a mild distinction between files that Unix does admit to. Files that are created by humans usually consist of lines of text, ended by a line break, and devoid of most of the unprintable ASCII control characters. Such files can be edited, displayed on the screen, printed, sent in electronic mail, and transmitted across networks to other computing systems with considerable assurance that the integrity of the data will be maintained. Programs that expect to deal with text files, including many of the software tools that we discuss in this book, may have been designed with large, but fixed-size, buffers to hold lines of text, and they may behave unpredictably if given an input file with unexpectedly long lines, or with nonprintable characters.[9] A good rule of thumb in dealing with text files is to limit line lengths to something that you can read comfortably—say, 50 to 70 characters.
Text files mark line boundaries with the ASCII linefeed (LF) character, decimal value 10 in the ASCII table. This character is referred to as the newline character. Several programming languages represent this character by \n in character strings. This is simpler than the carriage-return/linefeed pair used by some other systems. The widely used C and C++ programming languages, and several others developed later, take the view that text-file lines are terminated by a single newline character; they do so because of their Unix roots.
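If you want to verify the decimal value for yourself, a one-line C program suffices; note that this little sketch prints 10 only on an ASCII system, and the value would differ under a character set such as EBCDIC.

    #include <stdio.h>

    int main(void)
    {
        printf("%d\n", '\n');    /* prints 10 on ASCII systems */
        return 0;
    }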
In a mixed operating-system environment with shared filesystems, there is a frequent need to convert text files between the line-terminator conventions of the systems involved.
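As an illustration of how simple such a conversion can be, the following sketch of a C filter turns carriage-return/linefeed pairs into Unix newlines just by discarding every carriage return:

    #include <stdio.h>

    /* Filter: copy standard input to standard output,
     * discarding every carriage return (decimal 13). */
    int main(void)
    {
        int c;

        while ((c = getchar()) != EOF)
            if (c != '\r')
                putchar(c);
        return 0;
    }

Compiled and run as, say, crlf2lf < dosfile > unixfile (the name is purely illustrative), it converts in one direction; going the other way means inserting a carriage return before each newline.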