Beautiful Code
struct CountTable {
  Mutex lock;
  map<string, int> word_count;
};

const int kNumBuckets = 256;
CountTable tables[kNumBuckets];

for each document d in parallel {
  for each word w in d {
    int bucket = hash(w) % kNumBuckets;
    tables[bucket].lock.Lock();
    tables[bucket].word_count[w]++;
    tables[bucket].lock.Unlock();
  }
}
for (int b = 0; b < kNumBuckets; b++) {
  ... save tables[b].word_count to persistent storage ...
}
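The locking pattern above can be sketched as runnable C++. This is a minimal sketch, not the chapter's code: the word-splitting, the use of `std::hash`, and the `CountAll` driver that launches one thread per document are illustrative assumptions; only the per-bucket mutex protecting each shared table mirrors the pseudocode.

```cpp
// Minimal sketch of the bucketed-lock word count. Thread-per-document
// and the helpers are illustrative; the lock-per-bucket pattern is the
// point.
#include <map>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

const int kNumBuckets = 256;

struct CountTable {
  std::mutex lock;
  std::map<std::string, int> word_count;
};

CountTable tables[kNumBuckets];

// Count the words of one document into the shared bucket tables,
// taking the bucket's lock around each update.
void CountDocument(const std::string& doc) {
  std::istringstream in(doc);
  std::string w;
  while (in >> w) {
    int bucket = std::hash<std::string>{}(w) % kNumBuckets;
    std::lock_guard<std::mutex> guard(tables[bucket].lock);
    tables[bucket].word_count[w]++;
  }
}

// Process every document on its own thread, then merge the buckets
// into one table for inspection.
std::map<std::string, int> CountAll(const std::vector<std::string>& docs) {
  std::vector<std::thread> threads;
  for (const std::string& d : docs)
    threads.emplace_back(CountDocument, d);
  for (std::thread& t : threads) t.join();
  std::map<std::string, int> total;
  for (int b = 0; b < kNumBuckets; b++)
    for (const auto& [w, c] : tables[b].word_count) total[w] += c;
  return total;
}
```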

The program is still quite simple. However, it cannot scale beyond the number of processors in a single machine. Most affordable machines have eight or fewer processors, so even with perfect scaling, this approach will still require multiple weeks of processing to complete. Furthermore, we have been glossing over the problem of where the input data is stored and how fast it can be read by one machine.

Further scaling requires that we distribute the data and the computation across multiple machines. For the moment, let's assume that the machines do not fail. One way to increase scaling is to start many processes on a cluster of networked machines. We will have many input processes, each one responsible for reading and processing a subset of the documents. We will also have many output processes, each responsible for managing one of the word_count buckets. Example 23-4 shows the algorithm.

Example 23-4. Parallelized word count program with partitioned processors


const int M = 1000;  // Number of input processes
const int R = 256;   // Number of output processes

main() {
  // Compute the number of documents to assign to each process
  const int D = number of documents / M;
  for (int i = 0; i < M; i++) {
    fork InputProcess(i * D, (i + 1) * D);
  }
  for (int i = 0; i < R; i++) {
    fork OutputProcess(i);
  }
  ... wait for all processes to finish ...
}

void InputProcess(int start_doc, int end_doc) {
  map<string, int> word_count[R];  // Separate table per output process
  for each doc d in range [start_doc .. end_doc-1] do {
    for each word w in d {
      int b = hash(w) % R;
      word_count[b][w]++;
    }
  }
  for (int b = 0; b < R; b++) {
    string s = EncodeTable(word_count[b]);
    ... send s to output process b ...
  }
}

void OutputProcess(int bucket) {
  map<string, int> word_count;
  for each input process p {
    string s = ... read message from p ...
    map<string, int> partial = DecodeTable(s);
    for each <word, count> in partial do {
      word_count[word] += count;
    }
  }
  ... save word_count to persistent storage ...
}
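Example 23-4's structure can be simulated in a single C++ process. In this sketch the `fork` calls and the message passing are replaced by in-memory vectors (each "send" is just stored where the output task can read it), and `Table`, `InputTask`, `OutputTask`, and `WordCount` are names introduced here for illustration; the partitioning of work into M input tasks and R output tasks is the same as in the example.

```cpp
// Single-process simulation of Example 23-4. Messaging and process
// forking are elided; partitioning into M input and R output tasks
// is preserved.
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Table = std::map<std::string, int>;

// One input task: count the words of its document range into R
// partial tables, one per destination output task.
std::vector<Table> InputTask(const std::vector<std::string>& docs,
                             int start_doc, int end_doc, int R) {
  std::vector<Table> partials(R);
  for (int d = start_doc; d < end_doc; d++) {
    std::istringstream in(docs[d]);
    std::string w;
    while (in >> w)
      partials[std::hash<std::string>{}(w) % R][w]++;
  }
  return partials;
}

// One output task: merge the partial tables "sent" to bucket b by
// every input task.
Table OutputTask(const std::vector<std::vector<Table>>& sent, int b) {
  Table word_count;
  for (const auto& from_input : sent)
    for (const auto& [word, count] : from_input[b])
      word_count[word] += count;
  return word_count;
}

// Drive M input tasks and R output tasks; return the merged result.
Table WordCount(const std::vector<std::string>& docs, int M, int R) {
  int n = static_cast<int>(docs.size());
  int D = (n + M - 1) / M;  // documents per input task, rounded up
  std::vector<std::vector<Table>> sent;
  for (int i = 0; i < M; i++) {
    int start = i * D, end = std::min((i + 1) * D, n);
    if (start >= end) sent.push_back(std::vector<Table>(R));
    else sent.push_back(InputTask(docs, start, end, R));
  }
  Table total;
  for (int b = 0; b < R; b++)
    for (const auto& [w, c] : OutputTask(sent, b)) total[w] += c;
  return total;
}
```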

This approach scales nicely on a network of workstations, but it is significantly more complicated and harder to understand (even though we've hidden the details of marshaling and unmarshaling, as well as starting and synchronizing different processes). It also does not deal gracefully with machine failures. To deal with failures, we would extend Example 23-4 to re-execute processes that failed before completion. To avoid double-counting data when we re-execute an input process, we would mark each piece of intermediate data with a generation number of the input process and modify the output processing so that it uses these generation numbers to discard duplicates. As you can imagine, adding this failure-handling support would further complicate things.
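The duplicate-discard idea can be sketched as follows. This is one possible policy, not the book's implementation: each intermediate message carries the index and generation of the input task that produced it, and an output task merges at most one message per input index, discarding later arrivals from re-executed tasks. The `Message` and `DedupingOutput` types are names introduced here for illustration.

```cpp
// Hedged sketch of generation-number deduplication in an output task.
// The policy (first message per input index wins) is one illustrative
// choice among several that would avoid double counting.
#include <map>
#include <string>

struct Message {
  int input_index;                     // which input task sent this
  int generation;                      // incremented on re-execution
  std::map<std::string, int> partial;  // partial word counts
};

class DedupingOutput {
 public:
  // Merge m unless a message from the same input task was already
  // accepted (e.g. the task was re-executed after a suspected failure).
  bool Accept(const Message& m) {
    if (!accepted_.emplace(m.input_index, m.generation).second)
      return false;  // duplicate from a re-executed task: discard
    for (const auto& [word, count] : m.partial)
      word_count_[word] += count;
    return true;
  }

  int Count(const std::string& w) const {
    auto it = word_count_.find(w);
    return it == word_count_.end() ? 0 : it->second;
  }

 private:
  std::map<int, int> accepted_;  // input index -> accepted generation
  std::map<std::string, int> word_count_;
};
```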


23.2. The MapReduce Programming Model

If you compare Example 23-1 with Example 23-4, you'll find that the simple task of counting words has been buried under lots of details about managing parallelism. If we can somehow separate the details of the original problem from the details of parallelization, we may be able to produce a general parallelization library or system that can be applied not just to this word-counting problem, but also to other large-scale processing problems. The parallelization pattern that we are using is:

1. For each input record, extract a set of key/value pairs that we care about.

2. For each extracted key/value pair, combine it with other values that share the same key (perhaps filtering, aggregating, or transforming values in the process).
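The two steps above can be sketched as a pair of user-supplied functions wired together by a tiny sequential driver. The names (`MapWordCount`, `ReduceSum`, `Run`) and the driver itself are illustrative, not the library interface the chapter goes on to define; the point is only that the problem-specific logic separates cleanly from the generic group-by-key plumbing.

```cpp
// The extract/combine pattern as two user-supplied functions plus a
// generic sequential driver. Names are illustrative.
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using KV = std::pair<std::string, int>;

// Step 1: extract key/value pairs from one input record.
std::vector<KV> MapWordCount(const std::string& record) {
  std::vector<KV> out;
  std::istringstream in(record);
  std::string w;
  while (in >> w) out.push_back({w, 1});
  return out;
}

// Step 2: combine all values that share a key.
int ReduceSum(const std::vector<int>& values) {
  int total = 0;
  for (int v : values) total += v;
  return total;
}

// Generic driver: group the extracted pairs by key, then combine
// each group with the user's reduce function.
std::map<std::string, int> Run(
    const std::vector<std::string>& records,
    std::function<std::vector<KV>(const std::string&)> map_fn,
    std::function<int(const std::vector<int>&)> reduce_fn) {
  std::map<std::string, std::vector<int>> groups;
  for (const auto& r : records)
    for (const auto& [k, v] : map_fn(r)) groups[k].push_back(v);
  std::map<std::string, int> result;
  for (const auto& [k, vs] : groups) result[k] = reduce_fn(vs);
  return result;
}
```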
