Reinventing Discovery: The New Era of Networked Science - Michael Nielsen
Genome sequencing for humans (and other complex life-forms) works in a similar way. While we can’t directly sequence long strands of DNA, we can make many copies of those strands, then cut the copies up at random locations, and directly sequence the fragments. This can all be done using old-school chemistry, one scientist in their lab, etcetera. We then use our computers to figure out where different fragments overlap, and put everything back together again. (Incidentally, I’ve glossed over some subtleties, such as the repetition of certain DNA sequences throughout the human genome, which makes it harder to reassemble the full DNA sequence. These subtleties can be addressed using other tricks, but you now get the general idea.)
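To make the reassembly step concrete, here is a minimal sketch in Python of one simple way a computer could stitch overlapping fragments back together: repeatedly merge the two fragments that share the longest overlap. This is an illustration, not a description of how real sequencing software works. The function names (`overlap_length`, `drop_contained`, `greedy_assemble`), the `min_overlap` threshold, and the toy sequence are all assumptions made for the example. The sketch also assumes error-free fragments and ignores the repeated sequences mentioned above, which can mislead a greedy approach like this one.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Overlaps shorter than `min_overlap` are treated as coincidence (returns 0).
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def drop_contained(pieces: list[str]) -> list[str]:
    """Remove duplicates and any piece that lies wholly inside another piece."""
    unique = list(dict.fromkeys(pieces))
    return [p for p in unique if not any(p != q and p in q for q in unique)]


def greedy_assemble(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain.

    Returns the assembled pieces: one string if everything joined up,
    several if some fragments could not be connected.
    """
    pieces = drop_contained(fragments)
    while len(pieces) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left that overlaps enough to merge
        merged = pieces[best_i] + pieces[best_j][best_size:]
        rest = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces = drop_contained(rest + [merged])
    return pieces


# Toy example: three fragments, supplied out of order.
fragments = ["ACCTGAAT", "ATGGCGTA", "GCGTACCTG"]
print(greedy_assemble(fragments))  # ['ATGGCGTACCTGAAT']
```

Each step is simple pattern matching. The difficulty in practice comes from the sheer number of fragments, which is why this is a job for computers rather than for a scientist with pencil and paper.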
Now, imagine that we want to sequence someone’s DNA today. Perhaps it’s for a paternity test. Or maybe it’s as part of a criminal investigation. It doesn’t matter what the reason is. It turns out that we can actually simplify the above procedure for DNA sequencing, using the facts that (1) a reference human genome is already known, and (2) thanks to the haplotype map, we know where in the genome people may differ, and where, it seems, we’re always the same. To understand how the simplified process works, imagine now that you possess a complete copy of Harry Potter and the Philosopher’s Stone. Then, you’re given a cut-up copy of a book that’s similar, but that has been modified in a few locations. In fact, in real life the book really was changed between its initial release in the United Kingdom and its release in the United States. One change especially stands out, which is that the word Philosopher in the title was changed to Sorcerer, so the title became Harry Potter and the Sorcerer’s Stone. All through the book “philosopher” was replaced by “sorcerer”—presumably, the publisher believed the book would have greater appeal in the United States this way. It’s pretty obvious that having the complete text of the original book to refer to would make it much easier to figure out the text of the modified book. Instead of having to laboriously figure out which fragments matched with which, you could always figure out what part of the book the fragment you’re currently examining is from. In a similar way, the sequencing of a human genome can be done faster and more easily by constantly referring back to the reference genome and the haplotype map.
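Here is a minimal Python sketch of the reference-guided idea: slide each fragment along the reference, place it wherever it disagrees least, and then read off the new sequence together with the positions where it differs from the reference. The function names (`best_position`, `reconstruct`) and the toy sequences are assumptions made for illustration. The sketch only handles single-letter substitutions in fragments no longer than the reference. It does not handle insertions or deletions (so it would not cope with "philosopher" becoming the shorter "sorcerer"), and it does not use the haplotype map, which in the procedure described above tells you in advance where differences are likely.

```python
from collections import Counter


def best_position(reference: str, fragment: str) -> tuple[int, int]:
    """Return (offset, mismatches) for the placement with the fewest mismatches."""
    best = (0, len(fragment) + 1)
    for offset in range(len(reference) - len(fragment) + 1):
        window = reference[offset:offset + len(fragment)]
        mismatches = sum(a != b for a, b in zip(window, fragment))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


def reconstruct(
    reference: str, fragments: list[str]
) -> tuple[str, list[tuple[int, str, str]]]:
    """Overlay fragments on the reference and report where the result differs.

    Returns the reconstructed sequence and a list of
    (position, reference_letter, new_letter) differences.
    """
    votes = [Counter() for _ in reference]
    for fragment in fragments:
        offset, _ = best_position(reference, fragment)
        for k, letter in enumerate(fragment):
            votes[offset + k][letter] += 1
    # Take the most common letter at each position; where no fragment
    # covers a position, fall back on the reference.
    sequence = [
        counts.most_common(1)[0][0] if counts else ref_letter
        for ref_letter, counts in zip(reference, votes)
    ]
    differences = [
        (i, r, s) for i, (r, s) in enumerate(zip(reference, sequence)) if r != s
    ]
    return "".join(sequence), differences


# Toy example: the sample differs from the reference at one position.
reference = "ATGGCGTACCTGAAT"
fragments = ["ATGGCGTA", "GTACTTG", "CTTGAAT"]
sequence, differences = reconstruct(reference, fragments)
print(sequence)     # ATGGCGTACTTGAAT
print(differences)  # [(9, 'C', 'T')]
```

Notice that no fragment is ever compared with another fragment. Each one is compared only with the reference, which is what makes this version of the problem so much easier than assembling from scratch.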
Incidentally, while the Harry Potter example is fanciful, I can’t resist mentioning that a very similar technique really was used by the author Chuck Hansen to write his book U.S. Nuclear Weapons: The Secret History. Hansen based his history on tens of thousands of declassified documents that had been sanitized by physically cutting out classified information. He discovered that different copies of the same document were sometimes sanitized in different ways, and by comparing different versions he could sometimes reconstruct the deleted information!
The algorithms I’ve described for genome sequencing are good examples of data-driven intelligence. In no sense are these algorithms especially smart. They’re not doing much beyond simple pattern matching and rearrangement. But by combining these simple algorithms with enormous data processing power we can solve a problem that an unaided human being can’t solve at all. Furthermore, by combining data-driven intelligence with the open data in the human genome and the HapMap we can simplify the problem of genetic sequencing. This is the kind of thing we’ll see on a much grander scale when data-driven intelligence is combined with the data web.
Building the Data Web
Today, the data web is in its very early days. Most data is still locked up. To the extent data is shared, many different technologies are being used to do the sharing. The open data sets that are available mostly remain unconnected to one another, still living inside their separate silos. In short,