To illustrate the differences between these orders of structure, he wrote down—computed, really—a series of “approximations” of English text. He used an alphabet of twenty-seven characters, the letters plus a space between words, and generated strings of characters with the help of a table of random numbers. (These he drew from a book newly published for such purposes by Cambridge University Press: 100,000 digits for three shillings nine pence, and the authors “have furnished a guarantee of the random arrangement.”♦) Even with random numbers presupplied, working out the sequences was painstaking. The sample texts looked like this:
“Zero-order approximation”—that is, random characters, no structure or correlations.
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
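In modern terms this is a uniform draw over the twenty-seven symbols. A minimal sketch in Python, standing in for Shannon's printed random-number tables:

    import random
    import string

    # The twenty-seven-symbol alphabet from the text: the letters plus a space.
    ALPHABET = string.ascii_uppercase + " "

    def zero_order(n):
        # Zero order: every symbol equally likely, no structure or correlations.
        return "".join(random.choice(ALPHABET) for _ in range(n))

    print(zero_order(50))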
First order—each character is independent of the rest, but the frequencies are those expected in English: more e’s and t’s, fewer z’s and j’s, and the word lengths look realistic.
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA
TH EEI ALHENHTTPA OOBTTVA NAH BRL.
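At first order the draw is weighted instead of uniform. A sketch, using rough illustrative letter frequencies (truncated for brevity) rather than the tables Shannon worked from:

    import random

    # Illustrative frequencies in percent; rough textbook figures,
    # not Shannon's tables, and only a subset of the alphabet.
    FREQ = {" ": 18.0, "E": 10.4, "T": 7.4, "A": 6.7, "O": 6.2, "I": 5.7,
            "N": 5.5, "S": 5.2, "J": 0.1, "Z": 0.06}

    def first_order(n):
        # Symbols are still independent, but drawn in English proportions:
        # more e's and t's, few z's and j's, realistic word lengths.
        letters = list(FREQ)
        weights = list(FREQ.values())
        return "".join(random.choices(letters, weights=weights, k=n))

    print(first_order(50))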
Second order—the frequencies of each character match English and so also do the frequencies of each digram, or letter pair. (Shannon found the necessary statistics in tables constructed for use by code breakers.♦ The most common digram in English is th, with a frequency of 168 per thousand words, followed by he, an, re, and er. Quite a few digrams have zero frequency.)
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN
D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY
TOBE SEACE CTISBE.
Third order—trigram structure.
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE.
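In modern terms, the second- and third-order texts come from what we would now call an n-gram Markov chain: each character is drawn conditioned on the one or two characters before it. A minimal sketch, assuming the conditional counts are tabulated from some sample of English supplied as `text`, rather than taken from the code breakers' tables:

    import random
    from collections import defaultdict

    def build_table(text, order):
        # Count what follows each context of `order` characters.
        table = defaultdict(lambda: defaultdict(int))
        for i in range(len(text) - order):
            table[text[i:i + order]][text[i + order]] += 1
        return table

    def approximate(text, order, n):
        # order=1 conditions on one preceding character (digram structure,
        # Shannon's second order); order=2 conditions on two (trigram, third order).
        table = build_table(text, order)
        out = text[:order]
        for _ in range(n):
            counts = table[out[-order:]]
            if not counts:              # unseen context: reseed and continue
                out += text[:order]
                continue
            chars, weights = zip(*counts.items())
            out += random.choices(chars, weights=weights)[0]
        return out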
First-order word approximation.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN
DIFFERENT NATURAL HERE HE THE A IN CAME THE TO
OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD
BE THESE.
Second-order word approximation—now pairs of words appear in the expected frequency, so we do not see “a in” or “to of.”
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT
THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED.
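The word approximations apply the same machinery with words as the units: the first-order version draws words independently in proportion to their frequency, and the second-order version conditions each word on its predecessor. A sketch of the second-order case, with the pair table built from a stand-in corpus passed as `text`:

    import random
    from collections import defaultdict

    def second_order_words(text, n):
        # Each word is drawn conditioned on the word before it, so pairs
        # appear with their observed frequencies and "a in" never occurs.
        words = text.split()
        table = defaultdict(lambda: defaultdict(int))
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
        word = random.choice(words)
        out = [word]
        for _ in range(n - 1):
            followers = table[word]
            if not followers:           # word only seen at the very end: restart
                word = random.choice(words)
            else:
                ws, counts = zip(*followers.items())
                word = random.choices(ws, weights=counts)[0]
            out.append(word)
        return " ".join(out)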
These sequences increasingly “look” like English. Less subjectively, it turns out that touch typists can handle them with increasing speed—another indication of the ways people unconsciously internalize a language’s statistical structure.
Shannon could have produced further approximations, given enough time, but the labor involved was becoming enormous. The point was to represent a message as the outcome of a process that generated events with discrete probabilities. Then what could be said about the amount of information, or the rate at which information is generated? For each event, the possible choices each have a known probability (represented as p₁, p₂, p₃, and so on). Shannon wanted to define the measure of information (represented as H) as the measure of uncertainty: “of how much ‘choice’ is involved in the selection of the event or of how uncertain we are of the outcome.”♦ The probabilities might be the same or different, but generally more choices meant more uncertainty—more information. Choices might be broken down into successive choices, with their own probabilities, and the probabilities had to be additive; for example, the probability of a particular digram should be a weighted sum of the probabilities of the individual symbols. When those probabilities were equal, the amount of information conveyed by each symbol was simply the logarithm of the number of possible symbols.
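The measure Shannon arrived at is the now-familiar entropy, H = −Σᵢ pᵢ log₂ pᵢ, which does reduce to the logarithm of the number of symbols when the probabilities are all equal. A quick check of both cases in Python:

    import math

    def entropy(probs):
        # H = -sum of p * log2(p), in bits, over events with nonzero probability.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Eight equally likely symbols: H reduces to log2(8) = 3 bits per symbol.
    print(entropy([1/8] * 8))                  # 3.0
    # Unequal probabilities mean less uncertainty, hence less information.
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75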