To illustrate the differences between these orders of structure, he wrote down—computed, really—a series of “approximations” of English text. He used an alphabet of twenty-seven characters, the letters plus a space between words, and generated strings of characters with the help of a table of random numbers. (These he drew from a book newly published for such purposes by Cambridge University Press: 100,000 digits for three shillings nine pence, and the authors “have furnished a guarantee of the random arrangement.”♦) Even with random numbers presupplied, working out the sequences was painstaking. The sample texts looked like this:
“Zero-order approximation”—that is, random characters, no structure or correlations.
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
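In modern terms this is a uniform draw over the twenty-seven symbols. A minimal sketch in Python, standing in for Shannon's printed random-number tables:

    import random
    import string

    # The twenty-seven-symbol alphabet from the text: the letters plus a space.
    ALPHABET = string.ascii_uppercase + " "

    def zero_order(n):
        # Zero order: every symbol equally likely, no structure or correlations.
        return "".join(random.choice(ALPHABET) for _ in range(n))

    print(zero_order(50))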
First order—each character is independent of the rest, but the frequencies are those expected in English: more e’s and t’s, fewer z’s and j’s, and the word lengths look realistic.
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA
TH EEI ALHENHTTPA OOBTTVA NAH BRL.
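At first order the draw is weighted instead of uniform. A sketch, using rough illustrative letter frequencies (truncated for brevity) rather than the tables Shannon worked from:

    import random

    # Illustrative frequencies in percent; rough textbook figures,
    # not Shannon's tables, and only a subset of the alphabet.
    FREQ = {" ": 18.0, "E": 10.4, "T": 7.4, "A": 6.7, "O": 6.2, "I": 5.7,
            "N": 5.5, "S": 5.2, "J": 0.1, "Z": 0.06}

    def first_order(n):
        # Symbols are still independent, but drawn in English proportions:
        # more e's and t's, few z's and j's, realistic word lengths.
        letters = list(FREQ)
        weights = list(FREQ.values())
        return "".join(random.choices(letters, weights=weights, k=n))

    print(first_order(50))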
Second order—the frequencies of each character match English and so also do the frequencies of each digram, or letter pair. (Shannon found the necessary statistics in tables constructed for use by code breakers.♦ The most common digram in English is th, with a frequency of 168 per thousand words, followed by he, an, re, and er. Quite a few digrams have zero frequency.)
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN
D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY
TOBE SEACE CTISBE.
Third order—trigram structure.
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE.
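In modern terms, the second- and third-order texts come from what we would now call an n-gram Markov chain: each character is drawn conditioned on the one or two characters before it. A minimal sketch, assuming the conditional counts are tabulated from some sample of English supplied as `text`, rather than taken from the code breakers' tables:

    import random
    from collections import defaultdict

    def build_table(text, order):
        # Count what follows each context of `order` characters.
        table = defaultdict(lambda: defaultdict(int))
        for i in range(len(text) - order):
            table[text[i:i + order]][text[i + order]] += 1
        return table

    def approximate(text, order, n):
        # order=1 conditions on one preceding character (digram structure,
        # Shannon's second order); order=2 conditions on two (trigram, third order).
        table = build_table(text, order)
        out = text[:order]
        for _ in range(n):
            counts = table[out[-order:]]
            if not counts:              # unseen context: reseed and continue
                out += text[:order]
                continue
            chars, weights = zip(*counts.items())
            out += random.choices(chars, weights=weights)[0]
        return out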
First-order word approximation.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN
DIFFERENT NATURAL HERE HE THE A IN CAME THE TO
OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD
BE THESE.
Second-order word approximation—now pairs of words appear in the expected frequency, so we do not see “a in” or “to of.”
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT
THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED.
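The word approximations apply the same machinery with words as the units: the first-order version draws words independently in proportion to their frequency, and the second-order version conditions each word on its predecessor. A sketch of the second-order case, with the pair table built from a stand-in corpus passed as `text`:

    import random
    from collections import defaultdict

    def second_order_words(text, n):
        # Each word is drawn conditioned on the word before it, so pairs
        # appear with their observed frequencies and "a in" never occurs.
        words = text.split()
        table = defaultdict(lambda: defaultdict(int))
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
        word = random.choice(words)
        out = [word]
        for _ in range(n - 1):
            followers = table[word]
            if not followers:           # word only seen at the very end: restart
                word = random.choice(words)
            else:
                ws, counts = zip(*followers.items())
                word = random.choices(ws, weights=counts)[0]
            out.append(word)
        return " ".join(out)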
These sequences increasingly “look” like English. Less subjectively, it turns out that touch typists can handle them with increasing speed—another indication of the ways people unconsciously internalize a language’s statistical structure.
Shannon could have produced further approximations, given enough time, but the labor involved was becoming enormous. The point was to represent a message as the outcome of a process that generated events with discrete probabilities. Then what could be said about the amount of information, or the rate at which information is generated? For each event, the possible choices each have a known probability (represented as p₁, p₂, p₃, and so on). Shannon wanted to define the measure of information (represented as H) as the measure of uncertainty: “of how much ‘choice’ is involved in the selection of the event or of how uncertain we are of the outcome.”♦ The probabilities might be the same or different, but generally more choices meant more uncertainty—more information. Choices might be broken down into successive choices, with their own probabilities, and the probabilities had to be additive; for example, the probability of a particular digram should be a weighted sum of the probabilities of the individual symbols. When those probabilities were equal, the amount of information conveyed by each symbol was simply the logarithm of the number of possible symbols.
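The measure Shannon arrived at is the now-familiar entropy, H = −Σᵢ pᵢ log₂ pᵢ, which does reduce to the logarithm of the number of symbols when the probabilities are all equal. A quick check of both cases in Python:

    import math

    def entropy(probs):
        # H = -sum of p * log2(p), in bits, over events with nonzero probability.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Eight equally likely symbols: H reduces to log2(8) = 3 bits per symbol.
    print(entropy([1/8] * 8))                  # 3.0
    # Unequal probabilities mean less uncertainty, hence less information.
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75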