AB04 – Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.

Engineers at Google as early as 2003 encountered challenges in their efforts to deploy, operate, and sustain systems capable of ingesting, storing, and processing the large volumes of data required to produce and deliver Google’s services to its users, services such as the “Google Web search service” for which Google must create and maintain a “large-scale indexing” system, or the “Google Zeitgeist and Google Trends” services for which it must extract and analyze “data to produce reports of popular queries” (Dean & Ghemawat, 2008, pp. 107, 112).

As Dean and Ghemawat explain in the introduction to their article, even though many of the required “computations are conceptually straightforward,” the data volume is massive (terabytes or petabytes in 2003) and the “computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time” (Dean & Ghemawat, 2008, p. 107). At the time, even though Google had already “implemented hundreds of special-purpose computations” to “process large amounts of raw data” and the system worked, the authors describe how they sought to reduce the “complexity” introduced by a systems infrastructure requiring “parallelization, fault tolerance, data distribution and load balancing” (Dean & Ghemawat, 2008, p. 107).

Their solution involved creating “a new abstraction” that not only preserved their “simple computations,” but also provided a cost-effective, performance-optimized large cluster of machines that “hides the messy details” of systems infrastructure administration “in a library” while enabling “programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily” (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge their “abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” and acknowledge others “have provided restricted programming models and used the restrictions to parallelize the computation automatically.” In addition, however, Dean and Ghemawat assert their “MapReduce” implementation is a “simplification and distillation of some of these models” resulting from their “experience with large real-world computations” and their unique contribution may be their provision of “a fault-tolerant implementation that scales to thousands of processors” while other “parallel processing systems” were “implemented on smaller scales” while requiring the programmer to address machine failures (2008, pp. 107, 113).

In sections 2 and 3 of their paper, the authors provide greater detail of their “programming model,” their specific “implementation of the MapReduce interface” including the Google File System (GFS) – a “distributed file system” that “uses replication to provide availability and reliability on top of unreliable hardware” – and an “execution overview” with a diagram showing the logical relationships and progression of their MapReduce implementation’s components and data flow (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some “extensions” at times useful for augmenting “map and reduce functions” (Dean & Ghemawat, 2008, p. 110).

In section 5, the authors discuss their experience measuring “the performance of MapReduce on two computations running on a large cluster of machines” and describe the two computations or “programs” they run as “representative of a large subset of the real programs written by users of MapReduce,” that is computations for searching and for sorting (Dean & Ghemawat, 2008, p. 111). In other words, the authors describe the search function as a “class” of “program” that “extracts a small amount of interesting data from a large dataset” and the sort function as a “class” of “program” that “shuffles data from one representation to another” (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention “locality optimization,” a feature they describe further over the next few sections of their paper as one that “draws its inspiration from techniques such as active disks” and one that preserves “scarce” network bandwidth by reducing the distance between processors and disks and thereby limiting “the amount of data sent across I/O subsystems or the network” (Dean & Ghemawat, 2008, pp. 112-113).

In section 6, as mentioned previously, Dean and Ghemawat discuss some of the advantages of the “MapReduce programming model” as enabling programmers for the most part to avoid the infrastructure management normally involved in leveraging “large amounts of resources” and to write relatively simple programs that “run efficiently on a thousand machines in a half hour” (Dean & Ghemawat, 2008, p. 112).

Overall, the story of MapReduce and GFS told by Dean and Ghemawat in this paper, a paper written a few years after their original paper on this same topic, is a story of discovering more efficient ways to utilize resources.

Leave a Reply Cancel reply