Focus on the edge conditions
Part of the reason that young software engineers should cut their teeth debugging complicated systems is that it inculcates a lifelong skill: the ability to analyze a solution to a problem in terms of the ways that it won't work instead of the ways that it might—the ability to focus on the edge conditions. When conceiving of new software, we software engineers should not try to convince ourselves that our design will work; we should try to uncover the ways in which it will not. This is not to advocate overanalysis in lieu of writing code, but rather to suggest that the first code written on any project should be the code in which a bug could invalidate the larger design ideas.
If these principles are applied, one will naturally gravitate toward implementing the hardest problems at the earliest phase of any given project, and toward putting in place the infrastructure needed to validate that those implementations work (and remain working). This won't eliminate the sewage, but it will ensure that the most fetid spoonfuls are caught as early as possible, when design changes are still possible—and when the wine can still be saved.
23. Distributed Programming with MapReduce
Jeffrey Dean and Sanjay Ghemawat
This chapter describes the design and implementation of MapReduce, a programming system for large-scale data processing problems. MapReduce was developed as a way of simplifying the development of large-scale computations at Google. MapReduce programs are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required intermachine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
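To give a rough feel for what such a program looks like, here is a minimal, single-machine sketch in C++ of the two-function style of computation the system supports: a user-supplied map step that turns each input record into a key/value pair (here, counting accesses per URL in a request log) and a reduce step that merges all the values sharing a key. The names, types, and log format are illustrative assumptions, not the interface of the MapReduce library, and nothing here is parallelized or distributed; the runtime described in this chapter is what supplies those parts.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical user-supplied map step: from one request-log line such as
// "GET /index.html 200", emit the pair (URL, 1).
std::pair<std::string, int> MapLogLine(const std::string& line) {
  std::istringstream in(line);
  std::string method, url;
  in >> method >> url;
  return {url, 1};
}

// Hypothetical user-supplied reduce step: combine all values emitted under one key.
int ReduceCounts(const std::vector<int>& counts) {
  int total = 0;
  for (int c : counts) total += c;
  return total;
}

int main() {
  const std::vector<std::string> log = {
      "GET /index.html 200", "GET /about.html 200", "GET /index.html 304"};

  // A single-machine stand-in for the runtime: apply the map step to every record,
  // group the intermediate pairs by key, then apply the reduce step to each group.
  std::map<std::string, std::vector<int>> grouped;
  for (const std::string& line : log) {
    const auto kv = MapLogLine(line);
    grouped[kv.first].push_back(kv.second);
  }
  for (const auto& kv : grouped)
    std::cout << kv.first << " " << ReduceCounts(kv.second) << "\n";
  return 0;
}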
23.1. A Motivating Example
Suppose that you have 20 billion documents, and you want to generate a count of how often each unique word occurs in the documents. With an average document size of 20 KB, just reading through the 400 terabytes of data on one machine will take roughly four months. Assuming we were willing to wait that long and that we had a machine with sufficient memory, the code would be relatively simple. Example 23-1 (all the examples in this chapter are pseudocode) shows a possible algorithm.
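Before turning to that code, a quick back-of-the-envelope check shows where the four-month figure comes from (a sketch only; the 40 MB/s sustained read rate assumed below is an illustrative number, not one given in the chapter):

#include <cstdio>

int main() {
  const double documents = 20e9;       // 20 billion documents
  const double bytes_per_doc = 20e3;   // 20 KB average document size
  const double read_rate = 40e6;       // assumed sustained read rate: 40 MB/s

  const double total_bytes = documents * bytes_per_doc;  // 4e14 bytes = 400 TB
  const double seconds = total_bytes / read_rate;        // ~1e7 seconds
  const double months = seconds / (30.0 * 24 * 3600);    // ~3.9 months

  std::printf("total input: %.0f TB\n", total_bytes / 1e12);
  std::printf("read time on one machine: %.1f months\n", months);
  return 0;
}

At roughly 40 MB/s, a single machine spends about 10^7 seconds, i.e. a little under four months, just reading the input, before doing any counting at all.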
Example 23-1. Naïve, nonparallel word count program
map<string, int> word_count;

for each document d {
  for each word w in d {
    word_count[w]++;
  }
}

... save word_count to persistent storage ...

One way of speeding up this computation is to perform the same computation in parallel across each individual document, as shown in Example 23-2.

Example 23-2. Parallelized word count program

Mutex lock;  // Protects word_count
map<string, int> word_count;

for each document d in parallel {
  for each word w in d {
    lock.Lock();
    word_count[w]++;
    lock.Unlock();
  }
}

... save word_count to persistent storage ...

The preceding code nicely parallelizes the input side of the problem. In reality, the code to start up threads would be a bit more complex, since we've hidden a bunch of details by using pseudocode. One problem with Example 23-2 is that it uses a single global data structure for keeping track of the generated counts. As a result, there is likely to be significant lock contention with the