AB03 – Fan, W. & Bifet, A. (2012). Mining big data: Current status, and forecast to the future.

Fan and Bifet (2012) state the aim of their article, and of the particular issue of the publication it introduces, as to provide an overview of the “current status” and future course of the academic discipline and business and industrial field involved in “mining big data.” Toward that aim, the authors say they will “introduce Big Data mining and its applications,” “summarize the papers presented in this issue,” note some of the field’s controversies and challenges, discuss the “importance of open-source software tools,” and draw a few conclusions regarding the field’s overall endeavor (Fan & Bifet, 2012, p. 1).

In their bulleted list of controversies surrounding the big data phenomenon, the authors begin by noting the controversy regarding whether there is any “need to distinguish Big Data analytics from data analytics” (Fan & Bifet, 2012, p. 3). From the perspectives of people who have been involved with data management, including knowledge discovery and data mining, since before “the term ‘Big Data’ appeared for the first time in 1998” (Fan & Bifet, 2012, p. 1), it seems reasonable to consider exactly how the big data of recent years are different from the data of past years.

Although Fan and Bifet acknowledge this controversy, they spend much of their article explaining how the big data analytics of today differs from the data analytics of past years. First, they say their conception of big data refers to datasets so large and complex that they have “outpaced our capability to process, analyze, store and understand” them with “our current methodologies or data mining software tools” (Fan & Bifet, 2012, p. 1). Next, they describe their conception of “Big Data mining” as “the capability of extracting useful information from these large datasets or streams of data,” a task that Laney’s “3 V’s in Big Data management” (volume, velocity, and variety) have thus far made extremely difficult or impossible (Fan & Bifet, 2012, pp. 1, 2). In addition to Laney’s 3V’s, which the authors cite from a note Laney wrote or published in 2001, the authors cite Gartner as explaining two more V’s of big data in a definition of big data on Gartner’s website accessed in 2012 (Fan & Bifet, 2012, p. 2). One of the Gartner V’s cited by Fan and Bifet is “variability,” involving “changes in the structure of the data and how users want to interpret that data” (Fan & Bifet, 2012, p. 2). It seems to me closely enough related to Laney’s “variety” that one could combine the two for simplicity and convenience. The other Gartner V cited by the authors is “value,” which Fan and Bifet interpret as meaning “business value that gives organizations a compelling advantage, due to the ability of making decisions based in answering questions that were previously considered beyond reach” (Fan & Bifet, 2012, p. 2). This V seems to me distinct enough from Laney’s V’s that one should consider it a separate, fourth V that could be cited as a characteristic of big data.

In their discussion of how big data analytics can be applied to create value, the authors cite an Intel website accessed in 2012 to describe business applications such as customization of products or services for particular customers, technology applications that would improve “process time from hours to seconds,” healthcare applications for “mining the DNA of each person” in order “to discover, monitor and improve health aspects of everyone,” and public policy planning that could create “smart cities” “focused on sustainable economic development and high quality of life” (Fan & Bifet, 2012, p. 2). Continuing their discussion of the value or “usefulness” of big data, the authors describe the United Nations’ (UN) Global Pulse initiative as an effort begun in 2009 “to improve life in developing countries” by researching “innovative methods and techniques for analyzing real-time digital data,” by assembling a “free and open source” big data “technology toolkit,” and by establishing an “integrated, global network of Pulse Labs” in developing countries in order to enable them to utilize and apply big data (Fan & Bifet, 2012, p. 2).

Before Fan and Bifet mention Laney’s 3V’s of big data and cite Gartner’s fourth V (value), they describe some of the sources of data that have developed in “recent years” and that have contributed to “a dramatic increase in our ability to collect data from various sensors, devices, in different formats, from independent or connected applications” including both social media applications that enable end-users to easily generate content and an infrastructure of “mobile phones” that is “becoming the sensory gateway to get real-time data on people” (Fan & Bifet, 2012, p. 1). In addition, they mention the “Internet of things (IoT)” and predict it “will raise the scale of data to an unprecedented level” as “people and devices” in private and public environments “are all loosely connected” to create “trillions” of endpoints contributing “the data” from which “valuable information must be discovered” and used to “help improve quality of life and make our world a better place” (Fan & Bifet, 2012, p. 1).

Completing their introduction to the topic of big data and their discussion of some of its applications, Fan and Bifet turn in the third section of their paper to summarizing four selected articles from the December 2012 issue of Explorations, the newsletter of the Association for Computing Machinery’s (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (KDD), the issue of that newsletter which their article introduces. In their opinion, these four articles “together” represent “very significant state-of-the-art research in Big Data Mining” (Fan & Bifet, 2012, p. 2). Their summaries of the four articles, two articles from researchers in academia and two articles from researchers in industry, discuss big data mining infrastructure and technologies, methods, and objectives. They say the first article, from researchers at Twitter, Inc., “presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter” which illustrates the “current state of data mining tools” is such that “most of the time is consumed in preparatory work” and in revising “preliminary models into robust solutions” (Fan & Bifet, 2012, p. 2). They summarize the second article, from researchers in academia, as being about “mining heterogeneous information networks” of “interconnected, multi-typed data” that “leverage the rich semantics of typed nodes and links in a network” to discover knowledge “from interconnected data” (Fan & Bifet, 2012, p. 2). The third article, also from researchers in academia, they summarize as providing an “overview of mining big graphs” by using the “PEGASUS tool” and as indicating potentially fruitful “research directions for big graph mining” (Fan & Bifet, 2012, p. 2). 
They summarize the fourth article, from a researcher at Netflix, as being about Netflix’s “recommender and personalization techniques” and as including a substantial section on whether “we need more data or better models to improve our learning methodology” (Fan & Bifet, 2012, pp. 2-3).

In the next section of their paper, the authors provide a seven-bullet list of controversies surrounding the “new hot topic” of “Big Data” (Fan & Bifet, 2012, p. 3). The first controversy on their list, one I mentioned earlier in this article, raises the issue of whether and how the recent and so-called “Big Data” phenomenon is any different from what has previously been referred to as simply data management or data analysis or data analytics, among other similar terms or concepts that have existed in various disciplines or fields or bodies of literature for quite some time. The second controversy mentioned by the authors concerns whether “Big Data” may be nothing more than hype resulting from efforts by “data management systems sellers” to profit from sales of systems capable of storing massive amounts of data to be processed and analyzed by Hadoop and related technologies when in reality smaller volumes of data and other strategies and methods may be more appropriate in some cases (Fan & Bifet, 2012, p. 3). The third controversy the authors note asserts that, at least in the case of “real time analytics,” the “recency” of the data is more important than the volume of data (Fan & Bifet, 2012, p. 3). As the fourth controversy, the authors mention how some of Big Data’s “claims to accuracy are misleading” and they cite Taleb’s argument that as “the number of variables grow, the number of fake correlations also grow” and can result in some rather absurd correlations such as the one in which Leinweber found “the S&P 500 stock index was correlated with butter production in Bangladesh” (Fan & Bifet, 2012, p. 3). The fifth controversy the authors note addresses the issue of data quality by proposing “bigger data are not always better data” and stating a couple of factors that can determine data quality, for example whether “the data is noisy or not, and if it is representative” (Fan & Bifet, 2012, p. 3). 
The authors state the sixth controversy as an ethical issue, mainly whether “it is ethical that people can be analyzed without knowing it” (Fan & Bifet, 2012, p. 3). The final controversy addressed by Fan and Bifet concerns whether access to massive volumes of data and the capabilities to use it (including required infrastructure, knowledge, and skills) are unfairly or unjustly limited and could “create a division between the Big Data rich and poor” (Fan & Bifet, 2012, p. 3).
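
The fourth controversy above, Taleb’s point that fake correlations multiply as the number of variables grows, can be illustrated with a small simulation. The following is only a sketch of my own, not anything Fan and Bifet provide: the function names, the 20-observation series length, and the use of Gaussian noise are all assumptions chosen for illustration. It compares one random “target” series against many unrelated random series and reports the strongest correlation found purely by chance.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(num_variables, num_observations=20, seed=0):
    """Largest |r| between a random target and unrelated random variables.

    Hypothetical helper for illustration: every series is independent
    Gaussian noise, so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_observations)]
    best = 0.0
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_observations)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # With a fixed seed, each larger run includes the variables of the
    # smaller runs, so the strongest chance correlation can only grow.
    for k in (10, 100, 1000, 10000):
        print(k, round(strongest_chance_correlation(k), 3))
```

Because none of the series are related, every strong correlation the script reports is an artifact of searching across many variables, which is the sense in which large datasets can make “claims to accuracy” misleading.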

Fan and Bifet devote the fifth section of their paper to discussing “tools” and focus on the close relationships between big data, “the open source revolution,” and companies including “Facebook, Yahoo!, Twitter,” and “LinkedIn” that both contribute to and benefit from their involvement with “open source projects” such as the Apache Hadoop project (Fan & Bifet, 2012, p. 3), which many consider the foundation of big data. After briefly introducing the “Hadoop Distributed File System (HDFS) and MapReduce” as the primary aspects of the Hadoop project that enable storage and processing of massive data sets, respectively, the authors mention a few other open source projects within the Hadoop ecosystem such as “Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper,” and “Apache Cassandra,” among others (Fan & Bifet, 2012, p. 3). Next, the authors discuss more of the “many open source initiatives” involved with big data (Fan & Bifet, 2012, p. 3). “Apache Mahout,” for example, is a “scalable machine learning and data mining open source software based mainly in Hadoop,” “R” is a “programming language and software environment,” “MOA” enables “stream data mining” or “data mining in real time,” and “Vowpal Wabbit” (VW) is a “parallel learning” algorithm known for speed and scalability (Fan & Bifet, 2012, p. 3). Regarding open-source “tools” for “Big Graph mining,” the authors mention “GraphLab” and “PEGASUS,” the latter of which they describe as a “big graph mining system built on top of MAPREDUCE” that enables discovery of “patterns and anomalies in massive real-world graphs” (Fan & Bifet, 2012, pp. 3-4).
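
For readers unfamiliar with the MapReduce processing model mentioned above, the following sketch may help. It is my own single-machine, in-memory illustration of the map, shuffle, and reduce steps using a word count; it does not use Hadoop or any Hadoop API, and the function names are hypothetical. In a real Hadoop job these phases would run in parallel across many machines reading from HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data mining", "Big graph mining"]))
```

The point of the model is that the map and reduce functions each see only one document or one key at a time, so the work can be divided among machines without the functions needing to know about one another.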

The sixth section of their article provides a seven-bullet list of what the authors consider “future important challenges in Big Data management and analytics” given the nature of big data as “large, diverse, and evolving” (Fan & Bifet, 2012, p. 4). First, they discuss the need to continue exploring architectures in order to ascertain clearly what would be the “optimal architecture” for “analytic systems” “to deal with historic data and with real-time data” simultaneously (Fan & Bifet, 2012, p. 4). Next, they state the importance of ensuring accurate findings and making accurate claims (in other words, “to achieve significant statistical results”) in big data research, especially since “it is easy to go wrong with huge data sets and thousands of questions to answer at once” (Fan & Bifet, 2012, p. 4). Third, they mention the need to expand the number of “distributed mining” methods since some “techniques are not trivial to paralyze” (Fan & Bifet, 2012, p. 4). Fourth, the authors note the importance of improving capabilities in analyzing data streams that are continuously “evolving over time” and “in some cases to detect change first” (Fan & Bifet, 2012, p. 4). Fifth, the authors note the challenge of storing massive amounts of data and emphasize the need to continue exploring the balance between gaining or sacrificing time or space given the “two main approaches” currently used to address the issue, namely either compressing (i.e., spending time on compression to reduce the space required for storage) or sampling (i.e., using a sample of the data, such as “coresets,” to represent much larger data volumes) (Fan & Bifet, 2012, p. 4). Sixth, the authors admit “it is very difficult to find user-friendly visualizations” and it will be necessary to develop innovative “techniques” and “frameworks” “to tell and show” the “stories” of data (Fan & Bifet, 2012, p. 4). 
Last, the authors acknowledge massive amounts of potentially valuable data are being lost since much data being created today are “largely untagged file-based and unstructured data” (Fan & Bifet, 2012, p. 4). Quoting a “2012 IDC study on Big Data,” the authors say “currently only 3% of the potentially useful data is tagged, and even less is analyzed” (Fan & Bifet, 2012, p. 4).
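
The sampling approach to the storage challenge above can be made concrete with a short sketch. Note that what follows is reservoir sampling (often called Algorithm R), a simple uniform-sampling technique I have swapped in for illustration; it is not a coreset construction and is not a method described by Fan and Bifet. The function name and parameters are my own assumptions. It keeps a fixed-size random sample of a stream whose total length is not known in advance, so storage stays constant no matter how much data arrives.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration. Memory use is bounded by k,
    regardless of how many items the stream eventually yields.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # evicting a uniformly chosen current member.
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=5, seed=42)
    print(sample)
```

The trade-off is the one the authors describe: the sample occupies far less space than the full stream, at the cost of answering later questions only approximately.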

In the conclusion to their paper, Fan and Bifet predict “each data scientist will have to manage” increasing data volume, increasing data velocity, and increasing data variety in order to participate in “the new Final Frontier for scientific data research and for business applications” and to “help us discover knowledge that no one has discovered before” (Fan & Bifet, 2012, p. 4).