Second, PageRank is applicable only to document collections that are richly populated with links back and forth between the documents. At the moment, two document collections qualify: the World Wide Web and the corpus of peer-reviewed academic publications, where citation-based ranking ideas much like PageRank have been in use for decades.
The ability of large search engines to scale up with the size of data and number of users has been impressive. It is based on the massive application of parallelism: attacking big problems with large numbers of small computers, rather than a few big ones. One of the nice things about postings is that each posting is independent of all the others, so they naturally lend themselves to parallel approaches.
For example, an index based on doing binary search in arrays of postings is fairly straightforward to partition. In an index containing only English words, you could easily create 26 partitions (the industry term is shards), one for words beginning with each letter. You can then make as many copies of each shard as you need, and a huge volume of word-search queries can be farmed out across an arbitrarily large collection of cooperating search nodes.
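As a rough illustration of that idea, here is a minimal Java sketch of a letter-keyed shard layout with binary search inside the matching shard; the class and names are hypothetical, not taken from any real search engine's code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a dictionary of terms split into shards keyed by
// first letter, with binary search inside the matching shard.
public class ShardedIndex {
    private final Map<Character, String[]> shards = new HashMap<>();

    public ShardedIndex(Collection<String> terms) {
        Map<Character, List<String>> buckets = new HashMap<>();
        for (String term : terms) {
            buckets.computeIfAbsent(term.charAt(0), k -> new ArrayList<>()).add(term);
        }
        for (Map.Entry<Character, List<String>> entry : buckets.entrySet()) {
            String[] shard = entry.getValue().toArray(new String[0]);
            Arrays.sort(shard);                    // binary search needs sorted data
            shards.put(entry.getKey(), shard);
        }
    }

    // A lookup touches exactly one shard, so shards (and copies of shards)
    // can be spread across many machines and queried in parallel.
    public boolean contains(String term) {
        String[] shard = shards.get(term.charAt(0));
        return shard != null && Arrays.binarySearch(shard, term) >= 0;
    }
}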
That still leaves the problem of combining search results for multiword or phrase searches, which requires some real innovation, but it's easy to see how the basic word-search function could be parallelized.
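To make the combining step concrete, here is a minimal sketch, assuming each word's search has already produced a sorted array of document IDs, of the classic merge-style intersection; the names and sample data are invented for illustration.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: combine per-word results for a two-word query by
// intersecting two sorted postings lists of document IDs.
public class PostingsMerge {
    public static List<Integer> intersect(int[] postingsA, int[] postingsB) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < postingsA.length && j < postingsB.length) {
            if (postingsA[i] == postingsB[j]) {
                hits.add(postingsA[i]);            // document contains both words
                i++;
                j++;
            } else if (postingsA[i] < postingsB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] cat = {2, 5, 9, 14};
        int[] dog = {5, 8, 14, 21};
        System.out.println(intersect(cat, dog));   // prints [5, 14]
    }
}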
This discussion is a little unfair in that it glosses over a huge number of important issues, notably including fighting the Internet miscreants who continually try to outsmart search-engine algorithms for commercial gain.
4.5. Conclusion
It is hard to imagine any computer application that does not involve storing data and finding it based on its content. The world's single most popular computer application, web search, is a notable example.
This chapter has considered some of the issues, notably bypassing the traditional "database" domain and the world of search strategies that involve external storage. Whether operating at the level of a single line of text or billions of web documents, search is central. From the programmer's point of view, it also needs to be said that implementing searches of one kind or another is, among other things, fun.
5. Correct, Beautiful, Fast (in That Order): Lessons from Designing XML Verifiers
Elliotte Rusty Harold
This is the story of two routines that perform input verification for XML, the first in JDOM, and the second in XOM. I was intimately involved in the development of both, and while the two code bases are completely separate and share no common code, the ideas from the first clearly trickled into the second. The code, in my opinion, gradually became more beautiful. It certainly became faster.
Speed was the driving factor in each successive refinement, but in this case the improvements in speed were accompanied by improvements in beauty as well. I hope to dispel the myth that fast code must be illegible, ugly code. On the contrary, I believe that more often than not, improvements in beauty lead to improvements in execution speed, especially taking into account the impact of modern optimizing compilers, just-in-time compilers, RISC (reduced instruction set computer) architectures, and multi-core CPUs.
5.1. The Role of XML Validation
XML achieves interoperability by rigorously enforcing certain rules about what may and may not appear in an XML document. With a few very small exceptions, a conforming processor can process any well-formed XML document and can identify (and not attempt to process) malformed documents. This ensures a high degree of interoperability between platforms, parsers, and programming languages. You don't have to worry that your parser won't read my document because yours was written in C and runs on Unix, while mine was written in Java and runs on Windows.
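As a small illustration of that guarantee, the following sketch uses the standard Java SAX API (not the JDOM or XOM code discussed later in this chapter): a conforming parser either reads a well-formed document to the end or reports a fatal error for a malformed one.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch: check whether a file is well-formed XML by parsing it
// and treating any fatal parse error as "not well-formed".
public class WellFormednessCheck {
    public static boolean isWellFormed(File file) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(file, new DefaultHandler());  // events are discarded
            return true;                               // parsed without fatal errors
        } catch (SAXParseException ex) {
            return false;                              // malformed: the parser must not continue
        } catch (Exception ex) {                       // I/O or parser-configuration problems
            return false;
        }
    }
}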