Drunkard's Walk - Leonard Mlodinow [84]
At his laboratory, Galton attracted subjects through advertisements and then subjected them to a series of measurements of height, weight, even the dimensions of certain bones. His goal was to find a method for predicting the measurements of children based on those of their parents. One of Galton’s plots showed parents’ heights versus the heights of their offspring. If, say, those heights were always equal, the graph would be a neat line rising at 45 degrees. If that relationship held on average but individual data points varied, then the data would show some scatter above and below that line. Galton’s graphs thus exhibited visually not just the general relationship between the heights of parent and offspring but also the degree to which the relationship holds. That was Galton’s other major contribution to statistics: defining a mathematical index describing the consistency of such relationships. He called it the coefficient of correlation.
The coefficient of correlation is a number between -1 and 1; if it is near ±1, it indicates that two variables are linearly related; a coefficient of 0 means there is no linear relation. For example, if data revealed that by eating the latest McDonald’s 1,000-calorie meal once a week, people gained 10 pounds a year and by eating it twice a week they gained 20 pounds, and so on, the correlation coefficient would be 1. If for some reason everyone were to instead lose those amounts of weight, the correlation coefficient would be -1. And if the weight gain and loss were all over the map and didn’t depend on meal consumption, the coefficient would be 0. Today correlation coefficients are among the most widely employed concepts in statistics. They are used to assess such relationships as those between the number of cigarettes smoked and the incidence of cancer, the distance of stars from Earth and the speed with which they are moving away from our planet, and the scores students achieve on standardized tests and the income of the students’ families.
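The three cases in the meal example can be checked numerically. What follows is a minimal sketch, assuming the modern Pearson product-moment formula rather than Galton's original procedure; the function name `pearson_r` and all the data values are hypothetical, chosen only to mirror the text (10 pounds gained per weekly meal, the same amounts lost, and a hand-picked set of weight changes that does not track meal count).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance, up to a factor of n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the variances, up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Illustrative (made-up) data: meals per week and pounds gained per year.
meals_per_week = [0, 1, 2, 3, 4]
gain = [10 * m for m in meals_per_week]  # 10 lb per weekly meal
loss = [-g for g in gain]                # same amounts, but lost
unrelated = [5, -5, 0, -5, 5]            # hand-picked so it does not track meals

print(pearson_r(meals_per_week, gain))
print(pearson_r(meals_per_week, loss))
print(pearson_r(meals_per_week, unrelated))
```

With these values the three results should come out at 1, -1 and 0. Note that the last data set is not random at all: it follows a symmetric pattern, which is a reminder that a coefficient of 0 rules out only a straight-line relationship, not every kind of dependence.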
Galton’s work was significant not just for its direct importance but because it inspired much of the statistical work done in the decades that followed, in which the field of statistics grew rapidly and matured. One of the most important of these advances was made by Karl Pearson, a disciple of Galton’s. Earlier in this chapter, I mentioned many types of data that are distributed according to the normal distribution. But with a finite set of data the fit is never perfect. In the early days of statistics, scientists sometimes determined whether data were normally distributed simply by graphing them and observing the shape of the resulting curve. But how do you quantify the accuracy of the fit? Pearson invented a method, called the chi-square test, by which you can determine whether a set of data actually conforms to the distribution you believe it conforms to. He demonstrated his test in Monte Carlo in July 1892, performing a kind of rigorous repeat of Jagger’s work. [31] In Pearson’s test, as in Jagger’s, the numbers that came up on a roulette wheel did not follow the distribution they would have followed if the wheel had produced random results. In another test, Pearson examined how many 5s and 6s came up in 26,306 tosses of twelve dice. He found that the distribution was not one you’d see in a chance experiment with fair dice, that is, in an experiment in which the probability of a 5 or a 6 on one roll were 1 in 3, or 0.3333. But it was consistent if the probability of a 5 or a 6 were 0.3377, that is, if the dice were skewed. In the case of the roulette wheel the game may