Beautiful Code
word_count data structure as the bottleneck. This problem can be fixed by partitioning the word_count data structure into a number of buckets with a separate lock per bucket, as shown in Example 23-3.

Example 23-3. Parallelized word count program with partitioned storage

struct CountTable {
  Mutex lock;
  map<string, int> word_count;
};

const int kNumBuckets = 256;
CountTable tables[kNumBuckets];

for each document d in parallel {
  for each word w in d {
    int bucket = hash(w) % kNumBuckets;
    tables[bucket].lock.Lock();
    tables[bucket].word_count[w]++;
    tables[bucket].lock.Unlock();
  }
}

for (int b = 0; b < kNumBuckets; b++) {
  ... save tables[b].word_count to persistent storage ...
}

The program is still quite simple. However, it cannot scale beyond the number of processors in a single machine. Most affordable machines have eight or fewer processors, so even with perfect scaling, this approach will still require multiple weeks of processing to complete. Furthermore, we have been glossing over the problem of where the input data is stored and how fast it can be read by one machine.

Further scaling requires that we distribute the data and the computation across multiple machines. For the moment, let's assume that the machines do not fail. One way to increase scaling is to start many processes on a cluster of networked machines. We will have many input processes, each one responsible for reading and processing a subset of the documents. We will also have many output processes, each responsible for managing one of the word_count buckets. Example 23-4 shows the algorithm.

Example 23-4. Parallelized word count program with partitioned processors


const int M = 1000;  // Number of input processes
const int R = 256;   // Number of output processes

main() {
  // Compute the number of documents to assign to each process
  const int D = number of documents / M;
  for (int i = 0; i < M; i++) {
    fork InputProcess(i * D, (i + 1) * D);
  }
  for (int i = 0; i < R; i++) {
    fork OutputProcess(i);
  }
  ... wait for all processes to finish ...
}

void InputProcess(int start_doc, int end_doc) {
  map<string, int> word_count[R];  // Separate table per output process
  for each doc d in range [start_doc .. end_doc-1] do {
    for each word w in d {
      int b = hash(w) % R;
      word_count[b][w]++;
    }
  }
  for (int b = 0; b < R; b++) {
    string s = EncodeTable(word_count[b]);
    ... send s to output process b ...
  }
}

void OutputProcess(int bucket) {
  map<string, int> word_count;
  for each input process p {
    string s = ... read message from p ...
    map<string, int> partial = DecodeTable(s);
    for each <word, count> in partial do {
      word_count[word] += count;
    }
  }
  ... save word_count to persistent storage ...
}

This approach scales nicely on a network of workstations, but is significantly more complicated and hard to understand (even though we've hidden the details of marshaling and unmarshaling, as well as starting and synchronizing different processes). It also does not deal gracefully with machine failures. To deal with failures, we would extend Example 23-4 to re-execute processes that failed before completion. To avoid double-counting data when we re-execute an input process, we would mark each piece of intermediate data with a generation number of the input process and modify the output processing so that it uses these generation numbers to discard duplicates. As you can imagine, adding this failure-handling support would further complicate things.


25. Syntactic Abstraction: The syntax-case Expander

R. Kent Dybvig

When writing computer programs, certain patterns arise over and over again. For example, programs must often loop through the elements of arrays, increment or decrement the values of variables, and perform multiway conditionals based on numeric or character values. Programming language designers typically acknowledge this by including special-purpose syntactic constructs that handle the most
