Proofiness - Charles Seife [76]
Imagine, for example, that some joker in Sitka, Alaska, fills out his census form to say that there are 300 million people living in his household. If the Census Bureau were to take him seriously, it would mean that Alaska would suddenly be the most populous state in the Union by a huge margin; indeed, half of the representatives in the House would be representing this gentleman’s household. Luckily, no census worker is stupid enough to believe him. It’s obvious that the guy is lying—he gave the census a bad piece of data. But what can the Census Bureau do about it? The only choice is to clean up the datum somehow—and doing this means that they must use a statistical technique known as imputation.
In an imputation, a Census Bureau statistician picks out a datum that looks wrong. (Anyone who says that he has seventy-seven children or is 175 years old, for example, is probably lying.) Then the statistician wipes out the questionable answer and replaces it with census data from similar-looking households. The replacement number is a guess, but an educated one—and it’s certainly closer to the truth than the phony datum. And in fact, there’s really no alternative. Wiping out the datum or, more drastically, tossing out the entire census form is also imputation. The act of wiping out a datum is a substitution: the worker is still replacing someone’s answer (seventy-seven children) with another answer (zero children); a null answer is still an answer. Similarly, tossing out a census form is equivalent to imputing that a dwelling is vacant. Instead of making bad imputations by simply wiping out dubious results, the Census Bureau prefers to make an educated guess from the freshest census data it has, a process known as “hot-deck imputation.”[76] It’s more likely to be approximately correct, so it does less violence to the validity of the census results. The only other option—the only way to avoid imputation entirely—is to take every single census form at face value. You have to duly record the responses of every 175-year-old woman, every man with seventy-seven children, and, yes, the gentleman in Sitka who has 300 million people in his household. Without imputation, the results of the census become worthless.
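For readers who think in code, the idea of replacing an implausible answer with the freshest valid answer from a similar household can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Census Bureau's actual procedure: the `Form` record, the `tract` and `dwelling` grouping fields, and the plausibility cutoff of 20 people are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Tuple

# Hypothetical cutoff, chosen only for illustration.
MAX_PLAUSIBLE_HOUSEHOLD = 20


@dataclass(frozen=True)
class Form:
    tract: str            # hypothetical stand-in for "similar-looking households"
    dwelling: str         # e.g. "house" or "apartment" (also hypothetical)
    household_size: int   # the reported answer
    imputed: bool = False # True once the answer has been replaced


def is_plausible(size: int) -> bool:
    """Flag answers like 300 million people in one household."""
    return 0 <= size <= MAX_PLAUSIBLE_HOUSEHOLD


def hot_deck_impute(forms: Iterable[Form]) -> List[Form]:
    """Sequential hot deck: borrow the most recent plausible answer
    seen so far from a household in the same group."""
    deck: Dict[Tuple[str, str], int] = {}  # freshest valid answer per group
    cleaned: List[Form] = []
    for form in forms:
        key = (form.tract, form.dwelling)
        if is_plausible(form.household_size):
            deck[key] = form.household_size  # refresh the donor value
            cleaned.append(form)
        elif key in deck:
            # Replace the dubious answer with the donor's answer.
            cleaned.append(
                replace(form, household_size=deck[key], imputed=True)
            )
        else:
            # No donor seen yet; this sketch leaves the form untouched.
            # A real procedure would need some fallback here.
            cleaned.append(form)
    return cleaned


forms = [
    Form("sitka-1", "house", 4),
    Form("sitka-1", "house", 300_000_000),  # the joker
    Form("sitka-1", "house", 2),
]
result = hot_deck_impute(forms)
```

In this sketch the joker's form inherits the household size of the neighbor processed just before him, and the `imputed` flag records that the number is an educated guess and not a reported answer.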
The Supreme Court decision about sampling was effectively a ban on using statistical mumbo-jumbo, but imputation is a form of statistical mumbo-jumbo that wasn’t addressed by the previous decision. So when the (sampling-free!) 2000 census results were released, the state of Utah, which was denied an extra representative in Congress, sued. They argued that the bureau’s use of imputation was illegal, and in 2002 this case worked its way up to the Supreme Court.
Utah’s case put the court in a bind. If the justices ruled that imputation was unconstitutional, it would have rendered the Census Bureau powerless to correct spurious data; the one joker from Sitka could theoretically render the entire count meaningless. However, if the Court decided that imputation was permissible, it had to split hairs to explain why one statistical technique—imputation—was acceptable while another—sampling—was illegal.
In another five-to-four decision—the liberals were joined by the usually conservative chief justice, William Rehnquist—the Court decided to take the latter course. In a shining example of how many justices it takes to split a hair, the Census Bureau was allowed to continue using imputation, at least for the moment. However, the minority lobbed grenades at the decision, accusing the bureau of using illegal, and perhaps unconstitutional, statistical witchcraft. Sandra Day O’Connor wrote that imputation was simply a form of sampling and should thus be banned. Clarence Thomas essentially repeated Scalia’s argument from the earlier sampling case (even recycling