Online Book Reader

Home Category

Beautiful Code [259]

By Root 5101 0

Let's rewrite our program to implement the application-specific logic of counting word frequencies for each document and summing these counts across documents in two functions that we'll call Map and Reduce. The result is Example 23-5.

Example 23-5. Division of word counting problem into Map and Reduce

void Map(string document) {

for each word w in document {

EmitIntermediate(w, "1");

}

}

void Reduce(string word, list values) {

int count = 0;

for each v in values {

count += StringToInt(v);

}

Emit(word, IntToString(count));

}

A simple driver program that uses these routines to accomplish the desired task on a single machine would look like Example 23-6.

Example 23-6. Driver for Map and Reduce

map > intermediate_data;

void EmitIntermediate(string key, string value) {

intermediate_data[key].append(value);

}

void Emit(string key, string value) {

... write key/value to final data file ...

}

void Driver(MapFunction mapper, ReduceFunction reducer) {

for each input item do {

mapper(item)

}

for each key k in intermediate_data {

reducer(k, intermediate_data[k]);

}

}

main() {

Driver(Map, Reduce);

}

The Map function is called once for each input record. Any intermediate key/value pairs emitted by the Map function are collected together by the driver code. Then, the Reduce function is called for each unique intermediate key, together with a list of intermediate values associated with that key.

We're now back to an implementation that runs on a single machine. However, with things separated in this manner, we can now change the implementation of the driver program to make it deal with distribution, automatic parallelization, and fault tolerance without affecting the application-specific logic in the Map and Reduce functions. Furthermore, the driver is independent of the particular application logic implemented by the Map and Reduce functions, and therefore the same driver program can be reused with other Map and Reduce functions to solve different problems. Finally, notice that the Map and Reduce functions that implement the application-specific logic are nearly as understandable as the simple sequential code shown in Example 23-1.

Distributed Programming with MapReduce > Other MapReduce Examples

23.3. Other MapReduce Examples

We'll examine the implementation of a much more sophisticated driver program that automatically runs MapReduce programs on large-scale clusters of machines in a moment, but first, let's consider a few other problems and how they can be solved using Map-Reduce:

Distributed grep

The Map function emits a line if it matches a supplied regular expression pattern. The Reduce function is an identity function that just copies the supplied intermediate data to the output.

Reverse web-link graph

A forward web-link graph is a graph that has an edge from node URL1 to node URL2 if the web page found at URL1 has a hyperlink to URL2. A reverse web-link graph is the same graph with the edges reversed. MapReduce can easily be used to construct a reverse web-link graph. The Map function outputs pairs for each link to a target URL found in a document named source. The Reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair .

Term vector per host

A term vector summarizes the most important words that occur in a document or a set of documents as a list of pairs. The Map function emits a pair for each input document (where the hostname is extracted from the URL of the document). The Reduce function is passed all per-document term vectors for a given host. It adds these term vectors, throwing away infrequent terms, and then emits a final pair.

Inverted index

An inverted index is a data structure that maps from each unique word to a list of documents that contain the word (where the documents are typically identified with a numeric identifier to keep the inverted

Return Main Page Previous Page Next Page

®Online Book Reader