AB04 – Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.

As early as 2003, engineers at Google encountered challenges deploying, operating, and sustaining systems capable of ingesting, storing, and processing the large volumes of data required to deliver Google’s services to its users — services such as the “Google Web search service,” for which Google must create and maintain a “large-scale indexing” system, or the “Google Zeitgeist and Google Trends” services, for which it must extract and analyze “data to produce reports of popular queries” (Dean & Ghemawat, 2008, pp. 107, 112).

As Dean and Ghemawat explain in their introduction, even though many of the required “computations are conceptually straightforward,” the data volume is massive (terabytes or petabytes in 2003) and the “computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time” (Dean & Ghemawat, 2008, p. 107). Although Google had already “implemented hundreds of special-purpose computations” to “process large amounts of raw data,” and although that approach worked, the authors sought to reduce the “complexity” introduced by a systems infrastructure requiring “parallelization, fault tolerance, data distribution and load balancing” (Dean & Ghemawat, 2008, p. 107).

Their solution was “a new abstraction” that preserved their “simple computations” while, running atop a cost-effective, performance-optimized large cluster of machines, it “hides the messy details” of systems infrastructure administration “in a library,” enabling “programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily” (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge that their “abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” and that others “have provided restricted programming models and used the restrictions to parallelize the computation automatically.” They assert, however, that their “MapReduce” implementation is a “simplification and distillation of some of these models” arising from their “experience with large real-world computations,” and that their distinctive contribution may be “a fault-tolerant implementation that scales to thousands of processors,” whereas other “parallel processing systems” were “implemented on smaller scales” and left machine failures for the programmer to handle (2008, pp. 107, 113).
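The division of labor the authors describe — user-supplied map and reduce functions, with the library handling everything else — can be sketched in a few lines. The following Python sketch is mine, not the authors’ (the paper presents its canonical word-count example in C++-style pseudocode), and it simulates the model sequentially on one machine: the map function emits intermediate (key, value) pairs, the reduce function merges all values sharing a key, and a simple grouping loop stands in for the library’s distributed machinery.

```python
from collections import defaultdict

def map_fn(document):
    # The paper's canonical example: emit (word, 1) for every word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Merge all counts emitted for the same word.
    return sum(values)

def run_mapreduce(documents, map_fn, reduce_fn):
    # Sequential stand-in for the library's distributed shuffle phase,
    # which groups every intermediate value by its key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

counts = run_mapreduce(["to be or not to be"], map_fn, reduce_fn)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In the real system, of course, the grouping loop is replaced by thousands of machines, and it is precisely that replacement the library hides from the programmer.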

In sections 2 and 3 of their paper, the authors describe in greater detail their “programming model” and their specific “implementation of the MapReduce interface,” including its reliance on the Google File System (GFS) — a “distributed file system” that “uses replication to provide availability and reliability on top of unreliable hardware” — and present an “execution overview” with a diagram showing the logical relationships among their MapReduce implementation’s components and the flow of data through them (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some “extensions” at times useful for augmenting “map and reduce functions” (Dean & Ghemawat, 2008, p. 110).
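One concrete detail from that execution overview is how intermediate keys are divided among the R reduce tasks: the paper’s default partitioning function is hashing the key modulo R. A minimal sketch (names and values here are illustrative, not the paper’s):

```python
def partition(key, R):
    # Default partitioning per the paper: hash(key) mod R, so every
    # occurrence of a given key is routed to the same reduce task.
    return hash(key) % R

R = 4  # illustrative number of reduce tasks
# Any worker routes a given key to the same one of the R reduce tasks.
assert partition("mapreduce", R) == partition("mapreduce", R)
assert 0 <= partition("gfs", R) < R
```

This routing is what guarantees the reduce function sees every value for a key in one place, which the grouping step of the programming model depends on.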

In section 5, the authors measure “the performance of MapReduce on two computations running on a large cluster of machines,” two “programs” they describe as “representative of a large subset of the real programs written by users of MapReduce”: one that searches and one that sorts (Dean & Ghemawat, 2008, p. 111). The search function represents a “class” of “program” that “extracts a small amount of interesting data from a large dataset,” while the sort function represents a “class” that “shuffles data from one representation to another” (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention “locality optimization,” a feature they describe further over the next few sections as one that “draws its inspiration from techniques such as active disks” and that preserves “scarce” network bandwidth by reducing the distance between processors and disks, thereby limiting “the amount of data sent across I/O subsystems or the network” (Dean & Ghemawat, 2008, pp. 112-113).
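In the programming model’s terms, the searching “class” amounts to a map function that filters. The sketch below is my own illustration of that class, not the paper’s benchmark (whose actual pattern and record format differ): matching records are emitted as-is, and nothing is emitted for the rest.

```python
import re

# Illustrative pattern and records, not the paper's benchmark inputs.
PATTERN = re.compile(r"abc")

def grep_map(record):
    # "Extracts a small amount of interesting data from a large dataset":
    # emit a record only when it matches, with a placeholder value.
    if PATTERN.search(record):
        yield (record, "")

records = ["xyzzy", "abcdef", "plugh", "zzabc"]
matches = [key for r in records for key, _ in grep_map(r)]
print(matches)  # ['abcdef', 'zzabc']
```

Because most records emit nothing, the intermediate data stays small — which is exactly why this class benefits so much from reading its inputs locally rather than over the network.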

In section 6, as mentioned previously, Dean and Ghemawat discuss the advantages of the “MapReduce programming model”: it enables programmers, for the most part, to avoid the infrastructure management normally involved in leveraging “large amounts of resources” and to write relatively simple programs that “run efficiently on a thousand machines in a half hour” (Dean & Ghemawat, 2008, p. 112).

Overall, the account of MapReduce and GFS that Dean and Ghemawat give in this paper, written a few years after their original paper on the same topic, is a story of discovering more efficient ways to utilize computing resources.

References

Baehr, C. (2013). Developing a sustainable content strategy for a technical communication body of knowledge. Technical Communication, 60, 293-306.

Bijker, W. E., & Pinch, T. J. (2003). The social construction of facts and artifacts. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology (pp. 221-231). West Sussex, UK: John Wiley & Sons. (Original work published 1987).

Boyd, D., & Crawford, K. (2012). Critical questions for Big Data. Information, Communication & Society, 15, 662-679.

Bunge, M. (2014). Philosophical inputs and outputs of technology. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology (2nd ed.) [Kindle edition]. West Sussex, UK: John Wiley & Sons. (Original work published 1979).

Dean, J. (2014). Big data, data mining, and machine learning: Value creation for business leaders and practitioners. John Wiley & Sons.

Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

Ellul, J. (2014). On the aims of a philosophy of technology. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology (2nd ed.) [Kindle edition]. West Sussex, UK: John Wiley & Sons. (Original work published 1954).

Fan, W., & Bifet, A. (2012). Mining big data: Current status, and forecast to the future. SIGKDD Explorations, 14(2), 1-5.

Gehlen, A. (2003). A philosophical-anthropological perspective on technology. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology. West Sussex, UK: John Wiley & Sons. (Original work published 1983).

Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5), 29-43. doi:10.1145/1165389.945450

Graham, S. S., Kim, S.-Y., Devasto, M. D., & Keith, W. (2015). Statistical genre analysis: Toward big data methodologies in technical communication. Technical Communication Quarterly, 24(1), 70-104. doi:10.1080/10572252.2015.975955

Heidegger, M. (2003). The question concerning technology. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology. West Sussex, UK: John Wiley & Sons. (Original work published 1954).

Jonas, H. (2014). Toward a philosophy of technology. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology (2nd ed.) [Kindle edition]. West Sussex, UK: John Wiley & Sons. (Original work published 1979).

Kline, S. J. (2003). What is technology? In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology. West Sussex, UK: John Wiley & Sons. (Original work published 1985).

Kurzweil, R. (2005). The singularity is near: When humans transcend biology [Kindle edition]. New York, NY: Penguin Books.

Mahrt, M., & Scharkow, M. (2013). The value of big data in digital media research. Journal of Broadcasting & Electronic Media, 57(1), 20-33.

McNely, B., Spinuzzi, C., & Teston, C. (2015). Contemporary research methodologies in technical communication. Technical Communication Quarterly, 24, 1-13.

Mumford, L. (2003). Tool-users vs. homo sapiens and the megamachine. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology. West Sussex, UK: John Wiley & Sons. (Original work published 1966).

Shrader-Frechette, K. (2003). Technology and ethics. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology. West Sussex, UK: John Wiley & Sons. (Original work published 1992).

Winner, L. (2003). Social constructivism: Opening the black box and finding it empty. In R. C. Scharff & V. Dusek (Eds.), Philosophy of technology: The technological condition: An anthology. West Sussex, UK: John Wiley & Sons. (Original work published 1993).

Wolfe, J. (2015). Teaching students to focus on the data in data visualization. Journal of Business and Technical Communication, 29, 344-359.