AB04 – Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.

Engineers at Google as early as 2003 encountered challenges in their efforts to deploy, operate, and sustain systems capable of ingesting, storing, and processing the large volumes of data required to produce and deliver Google’s services to its users, services such as the “Google Web search service” for which Google must create and maintain a “large-scale indexing” system, or the “Google Zeitgeist and Google Trends” services for which it must extract and analyze “data to produce reports of popular queries” (Dean & Ghemawat, 2008, pp. 107, 112).

As Dean and Ghemawat explain in the introduction to their article, even though many of the required “computations are conceptually straightforward,” the data volume is massive (terabytes or petabytes in 2003) and the “computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time” (Dean & Ghemawat, 2008, p. 107). At the time, even though Google had already “implemented hundreds of special-purpose computations” to “process large amounts of raw data” and the system worked, the authors describe how they sought to reduce the “complexity” introduced by a systems infrastructure requiring “parallelization, fault tolerance, data distribution and load balancing” (Dean & Ghemawat, 2008, p. 107).

Their solution involved creating “a new abstraction” that preserved their “simple computations” while running them cost-effectively and efficiently on a large cluster of machines. The abstraction “hides the messy details” of systems infrastructure administration “in a library” and enables “programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily” (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge that their “abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” and that others “have provided restricted programming models and used the restrictions to parallelize the computation automatically.” They assert, however, that their “MapReduce” implementation is a “simplification and distillation of some of these models” resulting from their “experience with large real-world computations,” and that their unique contribution may be their provision of “a fault-tolerant implementation that scales to thousands of processors,” whereas other “parallel processing systems” were “implemented on smaller scales” and left the programmer to address machine failures (Dean & Ghemawat, 2008, pp. 107, 113).

In sections 2 and 3 of their paper, the authors provide greater detail of their “programming model,” their specific “implementation of the MapReduce interface” including the Google File System (GFS) – a “distributed file system” that “uses replication to provide availability and reliability on top of unreliable hardware” – and an “execution overview” with a diagram showing the logical relationships and progression of their MapReduce implementation’s components and data flow (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some “extensions” at times useful for augmenting “map and reduce functions” (Dean & Ghemawat, 2008, p. 110).
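
To make the programming model summarized above more concrete, the following is a minimal single-process sketch in Python of word counting, the example the paper itself uses to introduce map and reduce. The function names (`map_word_count`, `reduce_word_count`, `run_mapreduce`) and the sample documents are my own hypothetical choices, not the authors' interface. The sketch deliberately omits everything that makes the real system interesting: distribution across machines, fault tolerance, GFS, and locality optimization.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit an intermediate (word, 1) pair for every word in a document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_word_count(word: str, counts: Iterable[int]) -> int:
    """Reduce step: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: apply map_fn to every input (key, value) pair and
    # group the emitted values by their intermediate key.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    # Reduce phase: call reduce_fn once per intermediate key, in sorted key order.
    return {key: reduce_fn(key, values) for key, values in sorted(intermediate.items())}


# Made-up input documents, for illustration only.
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(word_counts)
```

The point of the sketch is the division of labor: the programmer writes only the two small functions, while the driver (in the real system, the MapReduce library) owns grouping, scheduling, and everything else.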

In section 5, the authors discuss their experience measuring “the performance of MapReduce on two computations running on a large cluster of machines” and describe the two computations or “programs” they run as “representative of a large subset of the real programs written by users of MapReduce,” that is, computations for searching and for sorting (Dean & Ghemawat, 2008, p. 111). In other words, the authors describe the search function as a “class” of “program” that “extracts a small amount of interesting data from a large dataset” and the sort function as a “class” of “program” that “shuffles data from one representation to another” (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention “locality optimization,” a feature they describe further over the next few sections of their paper as one that “draws its inspiration from techniques such as active disks” and one that preserves “scarce” network bandwidth by reducing the distance between processors and disks, thereby limiting “the amount of data sent across I/O subsystems or the network” (Dean & Ghemawat, 2008, pp. 112-113).

In section 6, as mentioned previously, Dean and Ghemawat discuss some of the advantages of the “MapReduce programming model” as enabling programmers for the most part to avoid the infrastructure management normally involved in leveraging “large amounts of resources” and to write relatively simple programs that “run efficiently on a thousand machines in a half hour” (Dean & Ghemawat, 2008, p. 112).

Overall, the story of MapReduce and GFS told by Dean and Ghemawat in this paper, a paper written a few years after their original paper on this same topic, is a story of discovering more efficient ways to utilize resources.

AB03 – Fan, W. & Bifet, A. (2012). Mining big data: Current status, and forecast to the future.

Fan and Bifet (2012) state that the aim of their article, and of the particular issue of the publication it introduces, is to provide an overview of the “current status” and future course of the academic discipline and the business and industrial field involved in “mining big data.” Toward that aim, the authors say they will “introduce Big Data mining and its applications,” “summarize the papers presented in this issue,” note some of the field’s controversies and challenges, discuss the “importance of open-source software tools,” and draw a few conclusions regarding the field’s overall endeavor (Fan & Bifet, 2012, p. 1).

In their bulleted list of controversies surrounding the big data phenomenon, the authors begin by noting the controversy regarding whether there is any “need to distinguish Big Data analytics from data analytics” (Fan & Bifet, 2012, p. 3). From the perspectives of people who have been involved with data management, including knowledge discovery and data mining, since before “the term ‘Big Data’ appeared for the first time in 1998” (Fan & Bifet, 2012, p. 1), it seems reasonable to consider exactly how the big data of recent years are different from the data of past years.

Although Fan and Bifet acknowledge this controversy, in much of their article they proceed to explain how the big data analytics of today is different from the data analytics of past years. First, they say their conception of big data refers to datasets so large and complex that they have “outpaced our capability to process, analyze, store and understand” them with “our current methodologies or data mining software tools” (Fan & Bifet, 2012, p. 1). Next, they describe their conception of “Big Data mining” as “the capability of extracting useful information from these large datasets or streams of data,” something that, due to Laney’s “3 V’s in Big Data management” (volume, velocity, and variety), has thus far been extremely difficult or impossible to do (Fan & Bifet, 2012, pp. 1, 2). In addition to Laney’s 3 V’s, which the authors cite from a note Laney wrote or published in 2001, the authors cite Gartner as explaining two more V’s of big data in a definition of big data on Gartner’s website accessed in 2012 (Fan & Bifet, 2012, p. 2). One of the Gartner V’s cited by Fan and Bifet is “variability,” involving “changes in the structure of the data and how users want to interpret that data,” which seems to me related enough to Laney’s “variety” that one could combine them for simplicity and convenience. The other is “value,” which Fan and Bifet interpret as meaning “business value that gives organizations a compelling advantage, due to the ability of making decisions based in answering questions that were previously considered beyond reach,” and which seems to me distinct enough from Laney’s V’s that one should consider it a separate, fourth V that could be cited as a characteristic of big data (Fan & Bifet, 2012, p. 2).

In their discussion of how big data analytics can be applied to create value, the authors cite an Intel website accessed in 2012 to describe business applications such as customization of products or services for particular customers, technology applications that would improve “process time from hours to seconds,” healthcare applications for “mining the DNA of each person” in order “to discover, monitor and improve health aspects of everyone,” and public policy planning that could create “smart cities” “focused on sustainable economic development and high quality of life” (Fan & Bifet, 2012, p. 2). Continuing their discussion of the value or “usefulness” of big data, the authors describe the United Nations’ (UN) Global Pulse initiative as an effort begun in 2009 “to improve life in developing countries” by researching “innovative methods and techniques for analyzing real-time digital data,” by assembling a “free and open source” big data “technology toolkit,” and by establishing an “integrated, global network of Pulse Labs” in developing countries in order to enable them to utilize and apply big data (Fan & Bifet, 2012, p. 2).

Before Fan and Bifet mention Laney’s 3V’s of big data and cite Gartner’s fourth V – value – they describe some of the sources of data that have developed in “recent years” and that have contributed to “a dramatic increase in our ability to collect data from various sensors, devices, in different formats, from independent or connected applications” including both social media applications that enable end-users to easily generate content and an infrastructure of “mobile phones” that is “becoming the sensory gateway to get real-time data on people” (Fan & Bifet, 2012, p. 1). In addition, they mention the “Internet of things (IoT)” and predict it “will raise the scale of data to an unprecedented level” as “people and devices” in private and public environments “are all loosely connected” to create “trillions” of endpoints contributing “the data” from which “valuable information must be discovered” and used to “help improve quality of life and make our world a better place” (Fan & Bifet, 2012, p. 1).

Completing their introduction to the topic of big data and their discussion of some of its applications, Fan and Bifet turn in the third section of their paper to summarizing four selected articles from the December 2012 issue of Explorations, the newsletter of the Association for Computing Machinery’s (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (KDD), the issue of that newsletter which their article introduces. In their opinion, these four articles “together” represent “very significant state-of-the-art research in Big Data Mining” (Fan & Bifet, 2012, p. 2). Their summaries of the four articles, two articles from researchers in academia and two articles from researchers in industry, discuss big data mining infrastructure and technologies, methods, and objectives. They say the first article, from researchers at Twitter, Inc., “presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter” which illustrates the “current state of data mining tools” is such that “most of the time is consumed in preparatory work” and in revising “preliminary models into robust solutions” (Fan & Bifet, 2012, p. 2). They summarize the second article, from researchers in academia, as being about “mining heterogeneous information networks” of “interconnected, multi-typed data” that “leverage the rich semantics of typed nodes and links in a network” to discover knowledge “from interconnected data” (Fan & Bifet, 2012, p. 2). The third article, also from researchers in academia, they summarize as providing an “overview of mining big graphs” by using the “PEGASUS tool” and as indicating potentially fruitful “research directions for big graph mining” (Fan & Bifet, 2012, p. 2). 
They summarize the fourth article, from a researcher at Netflix, as being about Netflix’s “recommender and personalization techniques” and as including a substantial section on whether “we need more data or better models to improve our learning methodology” (Fan & Bifet, 2012, pp. 2-3).

In the next section of their paper, the authors provide a seven-bullet list of controversies surrounding the “new hot topic” of “Big Data” (Fan & Bifet, 2012, p. 3). The first controversy on their list, one I mentioned earlier, raises the issue of whether and how the recent and so-called “Big Data” phenomenon is any different from what has previously been referred to as simply data management or data analysis or data analytics, among other similar terms or concepts that have existed in various disciplines or fields or bodies of literature for quite some time. The second controversy mentioned by the authors concerns whether “Big Data” may be nothing more than hype resulting from efforts by “data management systems sellers” to profit from sales of systems capable of storing massive amounts of data to be processed and analyzed by Hadoop and related technologies when in reality smaller volumes of data and other strategies and methods may be more appropriate in some cases (Fan & Bifet, 2012, p. 3). The third controversy the authors note asserts that, at least in the case of “real time analytics,” the “recency” of the data is more important than the volume of data. As the fourth controversy, the authors mention how some of Big Data’s “claims to accuracy are misleading” and they cite Taleb’s argument that as “the number of variables grow, the number of fake correlations also grow” and can result in some rather absurd correlations such as the one in which Leinweber found “the S&P 500 stock index was correlated with butter production in Bangladesh” (Fan & Bifet, 2012, p. 3). The fifth controversy the authors address concerns data quality: they propose that “bigger data are not always better data” and state a couple of factors that can determine data quality, for example whether “the data is noisy or not,” and “if it is representative” (Fan & Bifet, 2012, p. 3).
The authors state the sixth controversy as an ethical issue, mainly whether “it is ethical that people can be analyzed without knowing it” (Fan & Bifet, 2012, p. 3). The final controversy addressed by Fan and Bifet concerns whether access to massive volumes of data and the capabilities to use it (including required infrastructure, knowledge, and skills) are unfairly or unjustly limited and could “create a division between the Big Data rich and poor” (Fan & Bifet, 2012, p. 3).
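
The fourth controversy above, that fake correlations multiply as variables are added, can be illustrated with a small simulation. The sketch below is my own and does not come from Fan and Bifet, Taleb, or Leinweber; the seed, the sample size of 20 observations, and the names `pearson` and `strongest_chance_correlation` are arbitrary assumptions. It correlates one random "target" series against growing pools of equally random, unrelated series and reports the strongest correlation found purely by chance.

```python
import random
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(
    target: Sequence[float], candidates: Sequence[Sequence[float]]
) -> float:
    """Largest absolute correlation between the target and any candidate series."""
    return max(abs(pearson(target, candidate)) for candidate in candidates)


rng = random.Random(0)  # fixed seed so repeated runs are comparable (arbitrary choice)
n_observations = 20     # few observations, as with short annual time series

# Every series is independent noise, so any correlation found is spurious.
target: List[float] = [rng.gauss(0, 1) for _ in range(n_observations)]
candidates: List[List[float]] = [
    [rng.gauss(0, 1) for _ in range(n_observations)] for _ in range(2000)
]

# Search ever larger pools of unrelated variables for the "best" match.
strongest: Dict[int, float] = {
    k: strongest_chance_correlation(target, candidates[:k]) for k in (10, 100, 2000)
}
for k, r in strongest.items():
    print(f"{k:>5} unrelated variables -> strongest |r| found: {r:.2f}")
```

Because each larger pool contains the smaller ones, the strongest chance correlation can only stay the same or rise as variables are added, which is the mechanism behind findings like the butter production example.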

Fan and Bifet devote the fifth section of their paper to discussing “tools” and focus on the close relationships between big data, “the open source revolution,” and companies including “Facebook, Yahoo!, Twitter,” and “LinkedIn” that both contribute to and benefit from their involvement with “open source projects” such as the Apache Hadoop project (Fan & Bifet, 2012, p. 3), which many consider the foundation of big data. After briefly introducing the “Hadoop Distributed File System (HDFS) and MapReduce” as the primary aspects of the Hadoop project that enable storage and processing of massive data sets, respectively, the authors mention a few other open source projects within the Hadoop ecosystem such as “Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper,” and “Apache Cassandra,” among others (Fan & Bifet, 2012, p. 3). Next, the authors discuss more of the “many open source initiatives” involved with big data (Fan & Bifet, 2012, p. 3). “Apache Mahout,” for example, is a “scalable machine learning and data mining open source software based mainly in Hadoop,” “R” is a “programming language and software environment,” “MOA” enables “stream data mining” or “data mining in real time,” and “Vowpal Wabbit” (VW) is a “parallel learning” algorithm known for speed and scalability (Fan & Bifet, 2012, p. 3). Regarding open-source “tools” for “Big Graph mining,” the authors mention “GraphLab” and “PEGASUS,” the latter of which they describe as a “big graph mining system built on top of MAPREDUCE” that enables discovery of “patterns and anomalies in massive real-world graphs” (Fan & Bifet, 2012, pp. 3-4).

The sixth section of their article provides a seven-bullet list of what the authors consider “future important challenges in Big Data management and analytics” given the nature of big data as “large, diverse, and evolving” (Fan & Bifet, 2012, p. 4). First, they discuss the need to continue exploring architectures in order to ascertain clearly what would be the “optimal architecture” for “analytic systems” “to deal with historic data and with real-time data” simultaneously (Fan & Bifet, 2012, p. 4). Next, they state the importance of ensuring accurate findings and making accurate claims – in other words, “to achieve significant statistical results” – in big data research, especially since “it is easy to go wrong with huge data sets and thousands of questions to answer at once” (Fan & Bifet, 2012, p. 4). Third, they mention the need to expand the number of “distributed mining” methods since some “techniques are not trivial to paralyze [sic]” (Fan & Bifet, 2012, p. 4). Fourth, the authors note the importance of improving capabilities in analyzing data streams that are continuously “evolving over time” and “in some cases to detect change first” (Fan & Bifet, 2012, p. 4). Fifth, the authors note the challenge of storing massive amounts of data and emphasize the need to continue exploring the trade-off between time and space given the “two main approaches” currently used to address the issue, namely either compressing (i.e., spending time on compression to reduce the space required for storage) or sampling (i.e., using a sample of the data – “coresets” – to represent much larger data volumes) (Fan & Bifet, 2012, p. 4). Sixth, the authors admit “it is very difficult to find user-friendly visualizations” and it will be necessary to develop innovative “techniques” and “frameworks” “to tell and show” the “stories” of data (Fan & Bifet, 2012, p. 4).
Last, the authors acknowledge massive amounts of potentially valuable data are being lost since much of the data being created today is “largely untagged file-based and unstructured data” (Fan & Bifet, 2012, p. 4). Quoting a “2012 IDC study on Big Data,” the authors say “currently only 3% of the potentially useful data is tagged, and even less is analyzed” (Fan & Bifet, 2012, p. 4).
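
The storage challenge in the list above, compressing versus sampling, can be sketched as follows. This is only an illustration under my own assumptions: I use Python's standard `zlib` module for compression and plain reservoir sampling (Algorithm R) for sampling, which is a simpler stand-in and not the coreset construction the authors have in mind. The record format and the helper names are hypothetical.

```python
import random
import zlib
from typing import Iterable, List


def compress_records(records: List[str]) -> bytes:
    """Compression: spend CPU time to store every record in less space."""
    return zlib.compress("\n".join(records).encode("utf-8"))


def decompress_records(blob: bytes) -> List[str]:
    """Recover the full, exact record list from its compressed form."""
    return zlib.decompress(blob).decode("utf-8").split("\n")


def reservoir_sample(stream: Iterable[str], k: int, rng: random.Random) -> List[str]:
    """Sampling: keep a uniform random sample of k items from a stream of unknown length."""
    sample: List[str] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)    # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # uniform index in [0, i], both ends inclusive
            if j < k:
                sample[j] = item   # item i replaces a kept item with probability k/(i+1)
    return sample


# Made-up, highly repetitive records standing in for a large dataset.
records = [f"sensor-{i % 7},reading={i * 3 % 101}" for i in range(10_000)]
raw_size = len("\n".join(records).encode("utf-8"))

blob = compress_records(records)
sample = reservoir_sample(records, 100, random.Random(42))

print(f"raw bytes: {raw_size}, compressed bytes: {len(blob)}")
print(f"records kept by sampling: {len(sample)} of {len(records)}")
```

Compression here is lossless, so it costs time but keeps every record, while sampling is cheap and fits any memory budget but discards most records and therefore supports only approximate answers.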

In the conclusion to their paper, Fan and Bifet predict “each data scientist will have to manage” increasing data volume, increasing data velocity, and increasing data variety in order to participate in “the new Final Frontier for scientific data research and for business applications” and to “help us discover knowledge that no one has discovered before” (Fan & Bifet, 2012, p. 4).