21.6. Conclusion
The ERP5 team was able to implement a highly flexible tool, used for both "traditional" project management and for order planning and execution control, by making substantial reuse of already existing core concepts and code. Actually, reuse is a daily operation in ERP5 development, to the point where entire new modules are created just by changing GUI elements and adjusting workflows.
Because of this emphasis on reuse, queries on the object database can be done at the abstraction levels of portal types or meta classes. In the first case, the specific business domain concept is retrieved, such as a project task. In the second case, all objects related to the UBM generic concepts are retrieved, which is quite interesting for such requirements as statistics gathering.
In this chapter, we have edited some code snippets to make them more readable. All ERP5 code in its raw state is available at http://svn.erp5.org/erp5/trunk.
21.6.1. Acknowledgments
We would like to thank Jean-Paul Smets-Solanes, ERP5 creator and chief architect, and all the guys on the team, especially Romain Courteaud and Thierry Faucher. When the authors say "we" during the discussion of ERP5 design and implementation, they are referring to all those nice folks at Nexedi.
23. Distributed Programming with MapReduce
Jeffrey Dean and Sanjay Ghemawat
This chapter describes the design and implementation of MapReduce, a programming system for large-scale data processing problems. MapReduce was developed as a way of simplifying the development of large-scale computations at Google. MapReduce programs are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required intermachine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
23.1. A Motivating Example
Suppose that you have 20 billion documents, and you want to generate a count of how often each unique word occurs in the documents. With an average document size of 20 KB, just reading through the 400 terabytes of data on one machine will take roughly four months. Assuming we were willing to wait that long and that we had a machine with sufficient memory, the code would be relatively simple. Example 23-1 (all the examples in this chapter are pseudocode) shows a possible algorithm.
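Before turning to Example 23-1, the four-month estimate is easy to check with back-of-the-envelope arithmetic (the ~40 MB/s sequential read rate below is our assumption for a single machine of that era, not a figure from the text):

```python
# Sanity check of the "400 terabytes, roughly four months" claim.
# ASSUMPTION: one machine reads sequentially at ~40 MB/s (not stated in the text).
num_docs = 20_000_000_000           # 20 billion documents
avg_doc_bytes = 20 * 1000           # 20 KB average document size
total_bytes = num_docs * avg_doc_bytes
print(total_bytes / 1e12)           # total data in terabytes -> 400.0

read_rate = 40 * 1e6                # bytes per second
seconds = total_bytes / read_rate
months = seconds / (60 * 60 * 24 * 30)
print(round(months, 1))             # -> 3.9, i.e. roughly four months
```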
Example 23-1. Naïve, nonparallel word count program
map<string, int> word_count;
for each document d {
    for each word w in d {
        word_count[w]++;
    }
}
... save word_count to persistent storage ...

One way of speeding up this computation is to perform the same computation in parallel across each individual document, as shown in Example 23-2.

Example 23-2. Parallelized word count program

Mutex lock; // Protects word_count map
for each document d in parallel {
    for each word w in d {
        lock.Lock();
        word_count[w]++;
        lock.Unlock();
    }
}
... save word_count to persistent storage ...

The preceding code nicely parallelizes the input side of the problem. In reality, the code to start up threads would be a bit more complex, since we've hidden a bunch of details by using pseudocode. One problem with Example 23-2 is that it uses a single global data structure for keeping track of the generated counts. As a result, there is likely to be significant lock contention, with the word_count data structure as the bottleneck. This problem can be fixed by partitioning the word_count data structure into a number of buckets, with a separate lock per bucket, as shown in Example 23-3.

Example 23-3. Parallelized word count program with partitioned storage

struct CountTable
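The bucket-per-lock idea behind Example 23-3 can be sketched as runnable Python; the bucket count, hashing scheme, and sample documents here are illustrative choices of ours, not details from the chapter:

```python
import threading
from collections import defaultdict

NUM_BUCKETS = 16  # illustrative choice; more buckets means less lock contention

class CountTable:
    """Word counts partitioned into buckets, with a separate lock per bucket."""
    def __init__(self):
        self.locks = [threading.Lock() for _ in range(NUM_BUCKETS)]
        self.counts = [defaultdict(int) for _ in range(NUM_BUCKETS)]

    def _bucket(self, word):
        return hash(word) % NUM_BUCKETS   # route each word to a fixed bucket

    def increment(self, word):
        b = self._bucket(word)
        with self.locks[b]:               # only this word's bucket is locked
            self.counts[b][word] += 1

    def get(self, word):
        return self.counts[self._bucket(word)][word]

def count_words(documents):
    table = CountTable()
    # One thread per document, mirroring "for each document d in parallel".
    threads = [
        threading.Thread(target=lambda d=d: [table.increment(w) for w in d.split()])
        for d in documents
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table

docs = ["the quick brown fox", "the lazy dog", "the fox"]
table = count_words(docs)
print(table.get("the"), table.get("fox"))  # -> 3 2
```

Two updates only contend when their words hash to the same bucket, so the single global bottleneck of Example 23-2 is spread across NUM_BUCKETS independent locks.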