Online Book Reader

Home Category

Beautiful Code [260]

By Root 5326 0
index data relatively compact). The Map function parses each document and emits a sequence of pairs. The Reduce function accepts all docids for a given word, sorts the corresponding document IDs, and emits a pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions within each document.

Distributed sort

MapReduce can also be used to sort data by a particular key. The Map function extracts the key from each record, and emits a pair. The Reduce function emits all pairs unchanged (i.e., the identity Reduce function). This computation depends on the partitioning facilities and ordering properties described later in this chapter.

There are many more examples of computations that can easily be expressed as a Map-Reduce computation. For more complex computations, it is often easy to express them as a sequence of MapReduce steps or as an iterative application of a MapReduce computation, where the output of one MapReduce step is the input to the next MapReduce step.

One you start thinking of data processing problems in terms of MapReduce, they are often relatively easy to express. As some testament to this, over the last four years, the number of MapReduce programs at Google has gone from a small handful of candidate problems in March 2003 (when we started to design MapReduce) to more than 6,000 distinct MapReduce programs in December 2006. These programs were written by more than a thousand different software developers, many of whom had never written a parallel or distributed program before using MapReduce.

Distributed Programming with MapReduce > A Distributed MapReduce Implementation

23.4. A Distributed MapReduce Implementation

Much of the benefit of the MapReduce programming model is that it nicely separates the expression of the desired computation from the underlying details of parallelization, failure handling, etc. Indeed, different implementations of the MapReduce programming model are possible for different kinds of computing platforms. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multiprocessor, and yet another for an even larger collection of networked machines.

A very simple single-machine implementation that supports the programming model was shown in the code fragment in Example 23-6. This section describes a more complex implementation that is targeted to running large-scale MapReduce jobs on the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet (see "Further Reading," at the end of this chapter). In this environment:

Machines are typically dual-processor x86 processors running Linux, with 2–4 GB of memory per machine.

Machines are connected using commodity-networking hardware (typically 1 gigabit/ second switched Ethernet). Machines are organized into racks of 40 or 80 machines. These racks are connected to a central switch for the whole cluster. The bandwidth available when talking to other machines in the same rack is 1 gigabit/second per machine, while the per-machine bandwidth available at the central switch is much smaller (usually 50 to 100 megabits/second per machine).

Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed filesystem called GFS (see the reference to "The Google File System" under "Further Reading," at the end of this chapter) is used to manage the data stored on these disks. GFS uses replication to provide availability and reliability on top of unreliable hardware by breaking files into chunks of 64 megabytes and storing (typically) 3 copies of each chunk on different machines.

Users submit jobs to a scheduling system. Each job consists of a set of tasks and is mapped by the scheduler to a set of available machines within a cluster.

23.4.1. Execution Overview

The Map function invocations are distributed across multiple

Return Main Page Previous Page Next Page

®Online Book Reader