Introduction
In recent years, it has become commonplace to hear of massive data breaches, of huge volumes of data collected on customers and citizens, of monstrous data leaks, of enormous value in big data, of the paramount importance of protecting personal data, and of the new world in which those with the big data will inherit the earth (and perhaps the universe, too). Jared Dean, a business person, has likened data to a natural resource, calling it the "new oil" (Dean, 2014, p. 12). And the lead refiners of the data, known professionally as data scientists, not only have been designated as holding the "sexiest job in the 21st century" (Davenport & Patil, 2012, p. 70), but also are known to earn six-figure salaries at high-profile technology companies such as Facebook, Apple, Microsoft, and IBM, among others (Smith, 2016, paras. 3-5).
With all this discourse about data and big data prominent in the public media, in the business media, and in academic publications, one might imagine big data would be easy to define, but in reality there are diverse views about all aspects of big data - and it is the aim of this annotated bibliography to explore and summarize some of those diverse views. To guide my exploration, I developed and considered the research questions (topics, problems) in the bulleted list in the next section. After that, the annotations are organized in the order I read the works and wrote the summaries.
Ultimately, my analyses of the authors' works in relation to these questions and my extrapolations of the authors' attitudes and positions toward big data and technology will provide the framework for my next paper, which will explore technology and how the big data phenomenon can help us define technology, understand how technology is applied, and comprehend technology's significance. In addition, I will consider how the perspectives and special interests of particular disciplines or professional fields of practice affect their definitions of technology and big data and their attitudes toward them.
Topic/problem
- What is big data (and closely related concepts, e.g. data mining, text mining, data analytics, business intelligence, artificial intelligence, predictive modeling, machine learning, deep learning, computational linguistics, natural language processing, statistics, statistical analytics, and information technology)?
- How is big data done? What is the required "technology stack," so to speak, i.e. hardware, software, professional services, and other? What knowledge, skills, and technologies are required to do it?
- What are the authors' explicit or implicit attitudes or positions (optimistic, pessimistic, and/or other) toward big data and/or toward big data's relationships to other concepts such as technology, science, knowledge (truth, probable truth), human identity, x-intelligence (e.g. business intelligence and artificial intelligence), communication (rhetoric, discourse), technological superstructure (megamachine), freedom/determinism (progressive technology, system of necessity), economics, privacy (freedom, democracy) and security (safety, totalitarianism), and evolution?
- Why are what big data is and how it is done significant for (a) humanity in general, (b) practitioners/scholars in technical communication, rhetoric, and composition (or practitioners/scholars in philosophy of technology, sociology of technology, history of technology, and science and technology studies), and (c) any selected discipline/field/issue?
- How do the authors' definitions of the big data phenomenon, their attitudes toward it, and their positions on issues surrounding it relate to the authors' disciplinary or professional perspectives (points-of-view) and special interests?
Annotations
Graham, S. S., Kim, S.-Y., Devasto, M. D., & Keith, W. (2015). Statistical genre analysis: Toward big data methodologies in technical communication. Technical Communication Quarterly, 24(1), 70-104.
A team of researchers determines to bring the power of "big data" into the toolkit of technical communication scholars by piloting a research method they "dub statistical genre analysis (SGA)" and by describing and explaining the method in an article published in the journal Technical Communication Quarterly (Graham, Kim, Devasto, & Keith, 2015, pp. 70-71).
Acknowledging the value academic markets have begun assigning to findings, conclusions, and theories founded upon rigorous analysis of massive data sets, this team deconstructs the amorphous "big data" phenomenon and demonstrates how their SGA methodology can be used to quantitatively describe and visually represent the generic content (e.g. types of evidence and modes of reasoning) of rhetorical situations (e.g. committee meetings) and to discover input variables (e.g. conflicts of interest) that have statistically significant effects upon output variables (e.g. recommendations) of important policy-influencing entities such as the Food and Drug Administration's (FDA) Oncologic Drugs Advisory Committee (ODAC) (Graham et al., 2015, pp. 86-89).
The authors believe there is much to gain by integrating the "humanistic and qualitative study of discourse with statistical methods," and although they respect the "craft character of rhetorical inquiry" (Graham et al., 2015, pp. 71-72) and utilize "the inductive and qualitative nature of rhetorical analysis as a necessary" initial step in their hybrid method (Graham et al., 2015, p. 77), they conclude their mixed-method SGA approach can increase the "range and power" (Graham et al., 2015, p. 92) of "traditional, inductive approaches to genre analysis" (Graham et al., 2015, p. 86) by offering the advantages "of statistical insights" while avoiding the disadvantages of statistical sterility that can emerge when the qualitative humanist element is absent (Graham et al., 2015, p. 91).
In the conclusion of their article, the researchers identify two main benefits of their hybrid SGA method. The first benefit is that communication genres "can be defined with more precision": SGA documents the actual frequency of generic conventions as they exist within a large sample of the corpus, whereas traditional rhetorical methods may document only the opinions experts hold about the "typical" frequency of generic conventions as they perceive them within a limited set of "exemplars" selected from a small sample of the corpus. In addition, the authors argue analysis of a massive number of texts may reveal generic conventions that do not appear in the limited sample of exemplars studied by practitioners of the traditional rhetorical approach involving only "critical analysis and close reading." The second benefit is that communication scholars are enabled to move beyond critical opinion and to claim statistically significant correlations between "situational inputs and outputs" and "genre characteristics that have been empirically established" (Graham et al., 2015, p. 92).
Befitting the subject of their study, the authors devote a considerable portion of their article to describing their research methodology. In the third section, titled "Statistical Genre Analysis," they begin by noting they conducted the "current pilot study" on a "relatively small subset" of the available data in order to "demonstrate the potential of SGA." Further, they outline their research questions, the answers to two of which indeed seem to attest to the strength SGA can contribute to both the evidence and the inferences communication scholars use in their own arguments about the communications they study. As in the introduction, the authors note in this section the intellectual lineage of SGA in various disciplines, including "rhetorical studies, linguistics," "health communication," psychology, and "applied statistics" (Graham et al., 2015, pp. 71, 76).
As explained earlier, the communication artifacts studied by these researchers are selected from among the various artifacts arising from the FDA's ODAC meetings, specifically the textual transcriptions of presentations (essentially opening statements) given by the sponsors (pharmaceutical manufacturing companies) of the drugs under review during meetings which usually last one or two days (Graham et al., 2015, pp. 75-76). Not only in the arenas of technical communication and rhetoric, but also in the arenas of Science and Technology Studies (STS) and of Science, Technology, Engineering, and Math (STEM) public policy, managing conflicts of interest among ODAC participants and encouraging inclusion of all relevant stakeholders in ODAC meetings are prominent issues (Graham et al., 2015, p. 72). At the conclusion of ODAC meetings, voting participants vote either for or against the issue under consideration, generally "applications to market new drugs, new indications for already approved drugs, and appropriate research/study endpoints" (Graham et al., 2015, pp. 74-76).
It is within this context the authors attempted to answer the following two research questions, among others, regarding all ODAC meetings and sponsor presentations given at those meetings between 2009 and 2012: "1. How does the distribution of stakeholders affect the distribution of votes?" and "3. How does the distribution of evidence and forms of reasoning in sponsor presentations affect the distribution of votes?" (Graham et al., 2015, pp. 75-76). Notice both of these research questions ask whether certain input variables affect certain output variables. And in this case, the output variables are votes either for or against an action that will have serious consequences for people and organizations. Put another way, this is a political (or deliberative rhetoric) situation, and the ability to predict with a high degree of certainty which inputs produce which outputs could be quite valuable, given those inputs and outputs could determine substantial budget allocations, consulting fees, and pharmaceutical sales - essentially, success or failure - among other things.
Toward the aim of asking and answering research questions with such potentially high stakes, the authors applied their SGA mixed-methods approach, which they explain included four phases of research conducted over approximately six months to one year and involved at least four researchers. The authors explain SGA "requires first an extensive data preparation phase," after which the researchers "subjected" the data "to various statistical tests to directly address the research questions." They describe the four phases of their SGA method as "(a) coding schema development, (b) directed content analysis, (c) meeting data and participant demographics extraction, and (d) statistical analyses." Before moving into a deeper discussion of their own "coding schema" development, as well as the other phases of their SGA approach, the authors cite numerous influences from scholars in "behavioral research," "multivariate statistics," "corpus linguistics," and "quantitative work in English for specific purposes," while explaining the specific statistical "techniques" they apply "can be found in canonical works of multivariate statistics such as Keppel's (1991) Design and Analysis and Johnson and Wichern's (2007) Applied Multivariate Statistical Analysis" (Graham et al., 2015, pp. 75-77). One important distinction the authors make between their method and these other methods is that while the other methods operate at the more granular "word and sentence level" that facilitates formulation of "coding schema amenable to automated content analysis," the authors operate at the less granular paragraph level, which requires human intervention in order to formulate coding schema reflecting nuances only discernible at higher cognitive levels, for example whether particular evidentiary artifacts (transcripts) are based on randomized controlled trials (RCTs) addressing issues of "efficacy" or RCTs addressing issues of "safety and treatment-related hazards" (Graham et al., 2015, pp. 77-78). Choosing the longer, more complex paragraph as their unit of analysis requires the research method to depend upon "the inductive and qualitative nature of rhetorical analysis as a necessary precursor to both qualitative coding and statistical testing" (Graham et al., 2015, p. 77).
In the final section of their explanation of SGA, their research methodology, the authors summarize their statistical methods, including both "descriptive statistics" and "inferential statistics," and how they applied these two types of statistical methods, respectively, to "provide a quantitative representation of the data set" (e.g. "mean, median, and standard deviation") and to "estimate the relationship between variables" (e.g. "statistically significant impacts") (Graham et al., 2015, pp. 81-83).
Returning to the point of the authors' research - namely, demonstrating how SGA empowers scholars to provide confident answers to research questions and therefore to create and assert knowledge clearly valued by societal interests - their SGA enables them to state their "multiple regression analysis" found "RCT-efficacy data and conflict of interest remained as the only significant predictors of approval rates. Oddly, the use of efficacy data seems to lower the chance of approval, whereas a greater presence of conflict of interest increases the probability of approval" (Graham et al., 2015, p. 89). Obviously, this finding encourages entities aiming to increase the probability of approval to allocate resources toward increasing the presence of conflicts of interest, since that is the only input variable demonstrated to contribute to achieving their aim. On the other hand, entities claiming conflicts of interest illegally (or at least undesirably) affect ODAC participants' votes can use this finding as evidence to bolster their arguments that "stricter controls on conflicts of interests should be deployed" (Graham et al., 2015, p. 92).
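To make the statistical half of SGA more concrete, the sketch below pairs descriptive statistics with a multiple regression in the general spirit of the analysis Graham et al. report. The column names and toy values are my own invented illustrations, not the authors' coding schema or data, and the regression is only a minimal stand-in for their actual tests.

```python
# A minimal sketch of the statistical phase of an SGA-style analysis.
# Column names and values are invented for illustration; they are NOT
# the coding schema or data used by Graham et al. (2015).
import pandas as pd
import statsmodels.api as sm

# Each row stands in for one coded sponsor presentation / meeting.
coded = pd.DataFrame({
    "rct_efficacy_paragraphs": [12, 5, 9, 14, 3, 8, 11, 6],
    "rct_safety_paragraphs":   [4, 7, 2, 5, 6, 3, 4, 5],
    "conflict_of_interest":    [2, 0, 1, 3, 0, 1, 2, 0],   # hypothetical count of conflicted voters
    "approval_rate":           [0.4, 0.7, 0.5, 0.3, 0.8, 0.6, 0.45, 0.75],
})

# Descriptive statistics: a quantitative representation of the coded corpus.
print(coded.describe())

# Inferential statistics: ordinary least squares regression estimating how
# the input variables relate to the output variable (approval rate).
X = sm.add_constant(coded[["rct_efficacy_paragraphs",
                           "rct_safety_paragraphs",
                           "conflict_of_interest"]])
model = sm.OLS(coded["approval_rate"], X).fit()
print(model.summary())   # coefficients and p-values indicate which predictors are significant
```

In an SGA-style workflow, the coded frequencies would come out of the directed content analysis phase, and the resulting coefficients and p-values would underwrite claims like the authors' finding about significant predictors of approval.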
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15, 662-679.
As "social scientists and media studies scholars," Boyd and Crawford (2012) consider it their responsibility to encourage and focus the public discussion regarding "Big Data" by asserting six claims they imply help define the many and important potential issues the "era of Big Data" has already presented to humanity and the diverse and competing interests that comprise it (Boyd & Crawford, 2012, pp. 662-663). Before asserting and explaining their claims, however, the authors define Big Data "as a cultural, technological, and scholarly phenomenon" that "is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets," a phenomenon that has three primary components (fields or forces) interacting within it: 1) technology, 2) analysis, and 3) mythology (Boyd & Crawford, 2012, p. 663). Precisely because Big Data, as well as some "other socio-technical phenomenon," elicits both "utopian and dystopian rhetoric" and visions of the future of humanity, Boyd and Crawford think it is "necessary to ask critical questions" about "what all this data means, who gets access to what data, how data analysis is deployed, and to what ends" (Boyd & Crawford, 2012, p. 664).
The authors' first two claims are concerned essentially with epistemological issues regarding the nature of knowledge and truth (Boyd & Crawford, 2012, pp. 665-667). In explaining their first claim, "1. Big Data changes the definition of knowledge," the authors draw parallels between Big Data as a "system of knowledge" and "Fordism" as a "manufacturing system of mass production." According to the authors, both of these systems influence people's "understanding" in certain ways. Fordism "produced a new understanding of labor, the human relationship to work, and society at large." And Big Data "is already changing the objects of knowledge" and suggesting new concepts that may "inform how we understand human networks and community" (Boyd & Crawford, 2012, p. 665). In addition, the authors cite Burkholder, Latour, and others in describing how Big Data refers not only to the quantity of data, but also to the "tools and procedures" that enable people to process and analyze "large data sets," and to the general "computational turn in thought and research" that accompanies these new instruments and methods (Boyd & Crawford, 2012, p. 665). Further, the authors state "Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and categorization of reality" (Boyd & Crawford, 2012, p. 665). Finally, as counterpoint to the many potential benefits and positive aspects of Big Data they have emphasized thus far, the authors cite Anderson as one who has revealed the at times prejudicial and arrogant beliefs and attitudes of some quantitative proponents who summarily dismiss as inferior all qualitative or humanistic approaches to gathering evidence and formulating theories (Boyd & Crawford, 2012, pp. 665-666).
In explaining their second claim, "2. Claims to objectivity and accuracy are misleading," the authors continue considering some of the biases and misconceptions inherent in epistemologies that privilege "quantitative science and objective method" as the paths to knowledge and absolute truth. According to the authors, Big Data "is still subjective," and even when research subjects or variables are quantified, those quantifications do "not necessarily have a closer claim on objective truth." In the view of the authors, the obsession of social science and the "humanistic disciplines" with attaining "the status of quantitative science and objective method" is at least to some extent misdirected (Boyd & Crawford, 2012, pp. 666-667), even if understandable given the apparent value society assigns to quantitative evidence. Citing Gitelman and Bollier, among others, the authors believe "all researchers are interpreters of data," not only when they draw conclusions based on their research findings, but also when they design their research and decide what will - and what will not - be measured. Overall, the authors argue against too eagerly embracing the positivistic perspective on knowledge and truth and argue in favor of critically examining research philosophies and methods and considering the limitations inherent within them (Boyd & Crawford, 2012, pp. 667-668).
The third and fourth claims the authors make could be considered to address research quality. Their third claim, "3. Big data are not always better data," emphasizes the importance of quality control in research and highlights how "understanding sample, for example, is more important than ever." Since "the public discourse around" massive and easily collected data streams such as Twitter "tends to focus on the raw number of tweets available," and since "raw numbers" would not be a "representative sample" of most populations about which researchers seek to make claims, public perceptions and opinion could be skewed either by mainstream media's misleading reporting about valid research or by unprofessional researchers' erroneous claims based upon invalid research methods and evidence (Boyd & Crawford, 2012, pp. 668-669). In addition to these issues of research design, the authors highlight how additional "methodological challenges" can arise "when researchers combine multiple large data sets," challenges involving "not only the limits of the data set, but also the limits of which questions they can ask of a data set and what interpretations are appropriate" (Boyd & Crawford, 2012, pp. 669-670).
The authors' fourth claim continues addressing research quality, but at the broader level of context. Their fourth claim, "4. Taken out of context, Big Data loses its meaning," emphasizes the importance of considering how the research context affects research methods, findings, and conclusions. The authors imply attitudes toward mathematical modeling and data collection methods may cause researchers to select data more for their suitability to large-scale, computational, automated, quantitative data collection and analysis than for their suitability to discovering patterns or to answering research questions. As an example, the authors consider the evolution of the concept of human networks in sociology and focus on different ways of measuring "tie strength," a concept understood by many sociologists to indicate "the importance of individual relationships" (Boyd & Crawford, 2012, p. 670). Although recently developed concepts such as "articulated networks" and "behavioral networks" may appear at times to indicate tie strength equivalent to more traditional concepts such as "kinship networks," the authors explain how the tie strength of kinship networks is based on more in-depth, context-sensitive data collection such as "surveys, interviews," and even "observation," while the tie strength of articulated networks or behavioral networks may rely on nothing more than interaction frequency analysis; and "measuring tie strength through frequency or public articulation is a common mistake" (Boyd & Crawford, 2012, p. 671). In general, the authors urge caution against considering Big Data the panacea that will objectively and definitively answer all research questions. In their view, "the size of the data should fit the research question being asked; in some cases, small is best" (Boyd & Crawford, 2012, p. 670).
The authors' final two claims address ethical issues related to Big Data, some of which seem to have arisen in parallel with its ascent. In their fifth claim, "5. Just because it is accessible does not make it ethical," the authors focus primarily on whether "social media users" implicitly give permission to anyone to use publicly available data related to the user in all contexts, even contexts the user may not have imagined, such as in research studies or in the collectors' data or information products and services (Boyd & Crawford, 2012, pp. 672-673). Citing Ess and others, the authors emphasize researchers and scholars have "accountability" for their actions, including those actions related to "the serious issues involved in the ethics of online data collections and analysis." The authors encourage researchers and scholars to consider privacy issues and to proactively assess whether they should assume users have provided "informed consent" for researchers to collect and analyze users' publicly available data just because that data is publicly available (Boyd & Crawford, 2012, pp. 672-673). In their sixth claim, "6. Limited access to Big Data creates new digital divides," the authors note that although there is a prevalent perception Big Data "offers easy access to massive amounts of data," the reality is access to Big Data and the ability to manage and analyze Big Data require resources unavailable to much of the population - and this "creates a new kind of digital divide: the Big Data rich and the Big Data poor" (Boyd & Crawford, 2012, pp. 673-674). "Whenever inequalities are explicitly written into the system," the authors assert further, "they produce class-based structures" (Boyd & Crawford, 2012, p. 675).
In their article overall, Boyd and Crawford maintain an optimistic tone while enumerating the myriad issues emanating from the Big Data phenomenon. In concluding, the authors encourage scholars, researchers, and society to "start questioning the underlying assumptions, values, and biases of this new wave of research" (Boyd & Crawford, 2012, p. 675).
Fan, W., & Bifet, A. (2012). Mining big data: Current status, and forecast to the future. SIGKDD Explorations, 14(2), 1-5.
Fan and Bifet (2012) state the aim of their article, and of the particular issue of the publication it introduces, as providing an overview of the "current status" and future course of the academic discipline and the business and industrial field involved in "mining big data." Toward that aim, the authors say they will "introduce Big Data mining and its applications," "summarize the papers presented in this issue," note some of the field's controversies and challenges, discuss the "importance of open-source software tools," and draw a few conclusions regarding the field's overall endeavor (Fan & Bifet, 2012, p. 1).
In their bulleted list of controversies surrounding the big data phenomenon, the authors begin by noting the controversy regarding whether there is any "need to distinguish Big Data analytics from data analytics" (Fan & Bifet, 2012, p. 3). From the perspectives of people who have been involved with data management, including knowledge discovery and data mining, since before "the term 'Big Data' appeared for the first time in 1998" (Fan & Bifet, 2012, p. 1), it seems reasonable to consider exactly how the big data of recent years are different from the data of past years.
Although Fan and Bifet acknowledge this controversy, in much of their article they proceed to explain how the big data analytics of today differs from the data analytics of past years. First, they say their conception of big data refers to datasets so large and complex they have "outpaced our capability to process, analyze, store and understand" them with "our current methodologies or data mining software tools" (Fan & Bifet, 2012, p. 1). Next, they describe their conception of "Big Data mining" as "the capability of extracting useful information from these large datasets or streams of data" - something that, due to Laney's "3 V's in Big Data management" (volume, velocity, and variety), has thus far been extremely difficult or impossible to do (Fan & Bifet, 2012, pp. 1-2). In addition to Laney's 3 V's, which the authors cite from a note Laney wrote or published in 2001, the authors cite Gartner as explaining two more V's of big data in a definition on Gartner's website accessed in 2012 (Fan & Bifet, 2012, p. 2). One of the Gartner V's cited by Fan and Bifet, "variability," which involves "changes in the structure of the data and how users want to interpret that data," seems to me close enough to Laney's "variety" that one could combine them for simplicity and convenience. The other Gartner V they cite, "value," which Fan and Bifet interpret as meaning "business value that gives organizations a compelling advantage, due to the ability of making decisions based in answering questions that were previously considered beyond reach," seems to me distinct enough from Laney's V's that one should consider it a separate, fourth V that could be cited as a characteristic of big data (Fan & Bifet, 2012, p. 2).
In their discussion of how big data analytics can be applied to create value, the authors cite an Intel website accessed in 2012 to describe business applications such as customization of products or services for particular customers, technology applications that would improve "process time from hours to seconds," healthcare applications for "mining the DNA of each person" in order "to discover, monitor and improve health aspects of everyone," and public policy planning that could create "smart cities" "focused on sustainable economic development and high quality of life" (Fan & Bifet, 2012, p. 2). Continuing their discussion of the value or "usefulness" of big data, the authors describe the United Nations' (UN) Global Pulse initiative as an effort begun in 2009 "to improve life in developing countries" by researching "innovative methods and techniques for analyzing real-time digital data," by assembling a "free and open source" big data "technology toolkit," and by establishing an "integrated, global network of Pulse Labs" in developing countries in order to enable them to utilize and apply big data (Fan & Bifet, 2012, p. 2).
Before Fan and Bifet mention Laney's 3 V's of big data and cite Gartner's fourth V - value - they describe some of the sources of data that have developed in "recent years" and that have contributed to "a dramatic increase in our ability to collect data from various sensors, devices, in different formats, from independent or connected applications," including both social media applications that enable end users to easily generate content and an infrastructure of "mobile phones" that is "becoming the sensory gateway to get real-time data on people" (Fan & Bifet, 2012, p. 1). In addition, they mention the "Internet of things (IoT)" and predict it "will raise the scale of data to an unprecedented level" as "people and devices" in private and public environments "are all loosely connected" to create "trillions" of endpoints contributing "the data" from which "valuable information must be discovered" and used to "help improve quality of life and make our world a better place" (Fan & Bifet, 2012, p. 1).
Completing their introduction to the topic of big data and their discussion of some of its applications, Fan and Bifet turn in the third section of their paper to summarizing four selected articles from the December 2012 issue of Explorations, the newsletter of the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (KDD), the issue of that newsletter which their article introduces. In their opinion, these four articles "together" represent "very significant state-of-the-art research in Big Data Mining" (Fan & Bifet, 2012, p. 2). Their summaries of the four articles, two articles from researchers in academia and two articles from researchers in industry, discuss big data mining infrastructure and technologies, methods, and objectives. They say the first article, from researchers at Twitter, Inc., "presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter," which illustrates the "current state of data mining tools" is such that "most of the time is consumed in preparatory work" and in revising "preliminary models into robust solutions" (Fan & Bifet, 2012, p. 2). They summarize the second article, from researchers in academia, as being about "mining heterogeneous information networks" of "interconnected, multi-typed data" that "leverage the rich semantics of typed nodes and links in a network" to discover knowledge "from interconnected data" (Fan & Bifet, 2012, p. 2). The third article, also from researchers in academia, they summarize as providing an "overview of mining big graphs" by using the "PEGASUS tool" and as indicating potentially fruitful "research directions for big graph mining" (Fan & Bifet, 2012, p. 2). They summarize the fourth article, from a researcher at Netflix, as being about Netflix's "recommender and personalization techniques" and as including a substantial section on whether "we need more data or better models to improve our learning methodology" (Fan & Bifet, 2012, pp. 2-3).
In the next section of their paper, the authors provide a seven-bullet list of controversies surrounding the "new hot topic" of "Big Data" (Fan & Bifet, 2012, p. 3). The first controversy on their list, one I mentioned earlier in this annotation, raises the issue of whether and how the recent and so-called "Big Data" phenomenon is any different from what has previously been referred to as simply data management or data analysis or data analytics, among other similar terms or concepts that have existed in various disciplines, fields, and bodies of literature for quite some time. The second controversy mentioned by the authors concerns whether "Big Data" may be nothing more than hype resulting from efforts by "data management systems sellers" to profit from sales of systems capable of storing massive amounts of data to be processed and analyzed by Hadoop and related technologies, when in reality smaller volumes of data and other strategies and methods may be more appropriate in some cases (Fan & Bifet, 2012, p. 3). The third controversy the authors note asserts that, in the case at least of "real time analytics," the "recency" of the data is more important than the volume of data. As the fourth controversy, the authors mention how some of Big Data's "claims to accuracy are misleading," and they cite Taleb's argument that as "the number of variables grow, the number of fake correlations also grow" and can result in some rather absurd correlations, such as the one in which Leinweber found "the S&P 500 stock index was correlated with butter production in Bangladesh" (Fan & Bifet, 2012, p. 3). The fifth controversy the authors address concerns data quality: they propose "bigger data are not always better data" and state a couple of factors that can determine data quality, for example whether "the data is noisy or not, and if it is representative" (Fan & Bifet, 2012, p. 3). The authors state the sixth controversy as an ethical issue, mainly whether "it is ethical that people can be analyzed without knowing it" (Fan & Bifet, 2012, p. 3). The final controversy addressed by Fan and Bifet concerns whether access to massive volumes of data and the capabilities to use it (including required infrastructure, knowledge, and skills) are unfairly or unjustly limited and could "create a division between the Big Data rich and poor" (Fan & Bifet, 2012, p. 3).
Fan and Bifet devote the fifth section of their paper to discussing "tools" and focus on the close relationships between big data, "the open source revolution," and companies including "Facebook, Yahoo!, Twitter," and "LinkedIn" that both contribute to and benefit from their involvement with "open source projects" such as the Apache Hadoop project (Fan & Bifet, 2012, p. 3), which many consider the foundation of big data. After briefly introducing the "Hadoop Distributed File System (HDFS) and MapReduce" as the primary aspects of the Hadoop project that enable storage and processing of massive data sets, respectively, the authors mention a few other open source projects within the Hadoop ecosystem such as "Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper," and "Apache Cassandra," among others (Fan & Bifet, 2012, p. 3). Next, the authors discuss more of the "many open source initiatives" involved with big data (Fan & Bifet, 2012, p. 3). "Apache Mahout," for example, is a "scalable machine learning and data mining open source software based mainly in Hadoop," "R" is a "programming language and software environment," "MOA" enables "stream data mining" or "data mining in real time," and "Vowpal Wabbit" (VW) is a "parallel learning" algorithm known for speed and scalability (Fan & Bifet, 2012, p. 3). Regarding open-source "tools" for "Big Graph mining," the authors mention "GraphLab" and "PEGASUS," the latter of which they describe as a "big graph mining system built on top of MAPREDUCE" that enables discovery of "patterns and anomalies in massive real-world graphs" (Fan & Bifet, 2012, pp. 3-4).
The sixth section of their article provides a seven-bullet list of what the authors consider "future important challenges in Big Data management and analytics," given the nature of big data as "large, diverse, and evolving" (Fan & Bifet, 2012, p. 4). First, they discuss the need to continue exploring architectures in order to ascertain clearly what the "optimal architecture" would be for "analytic systems" "to deal with historic data and with real-time data" simultaneously (Fan & Bifet, 2012, p. 4). Next, they state the importance of ensuring accurate findings and making accurate claims - in other words, "to achieve significant statistical results" - in big data research, especially since "it is easy to go wrong with huge data sets and thousands of questions to answer at once" (Fan & Bifet, 2012, p. 4). Third, they mention the need to expand the number of "distributed mining" methods, since some "techniques are not trivial to parallelize" (Fan & Bifet, 2012, p. 4). Fourth, the authors note the importance of improving capabilities in analyzing data streams that are continuously "evolving over time" and "in some cases to detect change first" (Fan & Bifet, 2012, p. 4). Fifth, the authors note the challenge of storing massive amounts of data and emphasize the need to continue exploring the balance between gaining or sacrificing time or space, given the "two main approaches" currently used to address the issue, namely either compressing (i.e. sacrificing time spent compressing in order to reduce the space required for storage) or sampling (i.e. using samples of data - "coresets" - to represent much larger data volumes) (Fan & Bifet, 2012, p. 4). Sixth, the authors admit "it is very difficult to find user-friendly visualizations" and it will be necessary to develop innovative "techniques" and "frameworks" "to tell and show" the "stories" of data (Fan & Bifet, 2012, p. 4). Last, the authors acknowledge massive amounts of potentially valuable data are being lost, since much of the data being created today are "largely untagged file-based and unstructured data" (Fan & Bifet, 2012, p. 4). Quoting a "2012 IDC study on Big Data," the authors say "currently only 3% of the potentially useful data is tagged, and even less is analyzed" (Fan & Bifet, 2012, p. 4).
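For the storage challenge above, the sampling side of the time-versus-space trade-off can be illustrated with reservoir sampling, a standard streaming technique I am supplying for illustration; Fan and Bifet mention "coresets" rather than this particular method, so the sketch below is only an assumed, minimal example of representing a large stream with a small, fixed-size sample.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Illustrative only: one possible instance of the sampling-versus-compression
    trade-off discussed above, not the specific 'coresets' approach the authors mention.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)           # replace an existing item with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: keep 5 representative values from a simulated stream of one million readings.
print(reservoir_sample(range(1_000_000), k=5, seed=42))
```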
In the conclusion to their paper, Fan and Bifet predict "each data scientist will have to manage" increasing data volume, increasing data velocity, and increasing data variety in order to participate in "the new Final Frontier for scientific data research and for business applications" and to "help us discover knowledge that no one has discovered before" (Fan & Bifet, 2012, p. 4).
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Engineers at Google as early as 2003 encountered challenges in their efforts to deploy, operate, and sustain systems capable of ingesting, storing, and processing the large volumes of data required to produce and deliver Google's services to its users, services such as the "Google Web search service," for which Google must create and maintain a "large-scale indexing" system, or the "Google Zeitgeist and Google Trends" services, for which it must extract and analyze "data to produce reports of popular queries" (Dean & Ghemawat, 2008, pp. 107, 112).
As Dean and Ghemawat explain in the introduction to their article, even though many of the required "computations are conceptually straightforward," the data volume is massive (terabytes or petabytes in 2003) and the "computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time" (Dean & Ghemawat, 2008, p. 107). At the time, even though Google had already "implemented hundreds of special-purpose computations" to "process large amounts of raw data" and the system worked, the authors describe how they sought to reduce the "complexity" introduced by a systems infrastructure requiring "parallelization, fault tolerance, data distribution and load balancing" (Dean & Ghemawat, 2008, p. 107).
Their solution involved creating "a new abstraction" that not only preserved their "simple computations," but also provided a cost-effective, performance-optimized large cluster of machines that "hides the messy details" of systems infrastructure administration "in a library" while enabling "programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily" (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge their "abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages" and acknowledge others "have provided restricted programming models and used the restrictions to parallelize the computation automatically." In addition, however, Dean and Ghemawat assert their "MapReduce" implementation is a "simplification and distillation of some of these models" resulting from their "experience with large real-world computations," and that their unique contribution may be their provision of "a fault-tolerant implementation that scales to thousands of processors," whereas other "parallel processing systems" were "implemented on smaller scales" and left the programmer to address machine failures (2008, pp. 107, 113).
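The programming model itself is easiest to see in miniature. The sketch below imitates the canonical word-count shape of a MapReduce job in plain Python, running in a single process; the function names and the toy driver are my own illustrative stand-ins, not Google's actual library interface, and all of the parallelization, partitioning, and fault tolerance the paper describes is deliberately omitted.

```python
from collections import defaultdict

# Word count expressed in the MapReduce style: the user supplies only map and
# reduce functions; a real MapReduce library would run these across thousands
# of machines, partition the intermediate data, and recover from failures.
# This toy driver runs everything in one process purely to illustrate the model.

def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the input document."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    intermediate = defaultdict(list)            # "shuffle": group emitted values by key
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(intermediate.items())]

docs = {"a.txt": "big data is big", "b.txt": "data about data"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('about', 1), ('big', 2), ('data', 3), ('is', 1)]
```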
In sections 2 and 3 of their paper, the authors provide greater detail about their "programming model" and their specific "implementation of the MapReduce interface," including the Google File System (GFS) - a "distributed file system" that "uses replication to provide availability and reliability on top of unreliable hardware" - and an "execution overview" with a diagram showing the logical relationships and progression of their MapReduce implementation's components and data flow (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some "extensions" at times useful for augmenting "map and reduce functions" (Dean & Ghemawat, 2008, p. 110).
In section 5, the authors discuss their experience measuring "the performance of MapReduce on two computations running on a large cluster of machines" and describe the two computations or "programs" they run as "representative of a large subset of the real programs written by users of MapReduce," that is, computations for searching and for sorting (Dean & Ghemawat, 2008, p. 111). In other words, the authors describe the search function as a "class" of "program" that "extracts a small amount of interesting data from a large dataset" and the sort function as a "class" of "program" that "shuffles data from one representation to another" (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention "locality optimization," a feature they describe further over the next few sections of their paper as one that "draws its inspiration from techniques such as active disks" and one that preserves "scarce" network bandwidth by reducing the distance between processors and disks and thereby limiting "the amount of data sent across I/O subsystems or the network" (Dean & Ghemawat, 2008, pp. 112-113).
In section 6, as mentioned previously, Dean and Ghemawat discuss some of the advantages of the "MapReduce programming model," namely that it enables programmers for the most part to avoid the infrastructure management normally involved in leveraging "large amounts of resources" and to write relatively simple programs that "run efficiently on a thousand machines in a half hour" (Dean & Ghemawat, 2008, p. 112).
Overall, the story of MapReduce and GFS told by Dean and Ghemawat in this paper, a paper written a few years after their original paper on this same topic, is a story of discovering more efficient ways to utilize resources.
Baehr, C. (2013). Developing a sustainable content strategy for a technical communication body of knowledge. Technical Communication, 60, 293-306.
People responsible for planning, creating, and managing information and information systems in the world today identify with various academic disciplines and business and industrial fields. As Craig Baehr explains, this can make it difficult to find or to develop and sustain a body of knowledge that represents the "interdisciplinary nature" of the technical communication field (Baehr, 2013, p. 294). In his article, Baehr describes his experience working with a variety of other experts to develop and produce a "large-scale knowledge base" for those who identify with the "technical communication" field and to ensure that knowledge base embodies a "systematic approach" to formulating an "integrated or hybrid" "content strategy" that considers the "complex set of factors" involved in such long-term projects, factors such as the "human user," "content assets," "technology," and "sustainable practices" (Baehr, 2013, pp. 293, 295, 305).
Baehr defines a "body of knowledge" as representing the "breadth and depth of knowledge in the field with overarching connections to other disciplines and industry-wide practices" (Baehr, 2013, p. 294). As the author discusses, the digital age presents a unique set of challenges for those collecting and presenting knowledge that will attract and help scholars and practitioners. One important consideration Baehr discusses is the "two dominant, perhaps philosophical, approaches that characterize how tacit knowledge evolves into a more concrete product," an information and information systems product such as a website with an extensive content database and perhaps some embedded web applications. The two approaches Baehr describes are the "folksonomy" or "user-driven approach" and the "taxonomy" or "content-driven approach" (Baehr, 2013, p. 294). These two approaches affect aspects of the knowledge base such as the "findability" of its content and whether users are allowed to "tag" content to create a kind of "bottom-up classification" in addition to the top-down taxonomy created by the site's navigation categories (Baehr, 2013, p. 294). In regard to this particular project, Baehr explains how the development team used both a user survey and topics created through user-generated content to create "three-tiered Topic Lists" for the site's home page. While some of the highest-level topics such as "consulting" and "research" were taken from the user survey, second-level topics such as "big data," and third-level topics such as "application development," were taken from user-generated topics on discussion boards and from topics the development team gleaned from current technical communication research (Baehr, 2013, p. 304).
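To illustrate the contrast Baehr draws, here is a minimal sketch of how a single knowledge-base item might carry both a top-down taxonomy path and bottom-up user tags; the topic tiers and tags shown are hypothetical examples of mine, not categories from the actual technical communication body of knowledge.

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    title: str
    # Top-down, content-driven taxonomy: a fixed, tiered path chosen by curators.
    taxonomy_path: tuple
    # Bottom-up, user-driven folksonomy: free-form tags added by users over time.
    user_tags: set = field(default_factory=set)

item = ContentItem(
    title="Getting started with content analytics",
    taxonomy_path=("Research", "Big Data", "Application Development"),  # hypothetical tiers
)
item.user_tags.update({"analytics", "content strategy", "how-to"})      # hypothetical tags

# Findability can draw on either classification: browse by taxonomy tier...
assert "Big Data" in item.taxonomy_path
# ...or search by user-generated tag.
assert "analytics" in item.user_tags
```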
In this article, Baehr's primary concern is with providing an overview of the issues involved in developing digital knowledge bases in general and of his experience in developing a digital knowledge base for the technical communication field in particular. As mentioned, he concludes by using "an integrated or hybrid" approach involving various methods to develop and organize the information content based upon a "sustainable content strategy" (Baehr, 2013, p. 293).
Mahrt, M., & Scharkow, M. (2013). The value of big data in digital media research. Journal of Broadcasting & Electronic Media, 57(1), 20-33.
In their effort to espouse "theory-driven" research strategies and to caution against the blind embrace of "data-driven" research strategies that seems to have culminated recently in a veritable "'data rush' promising new insights" regarding everything, the authors of this paper "review" a "diverse selection of literature on" digital media research methodologies and the Big Data phenomenon as they provide "an overview of ongoing debates" in this realm, while arguing ultimately for a pragmatic approach based on "established principles of empirical research" and "the importance of methodological rigor and careful research design" (Mahrt & Scharkow, 2013, pp. 20, 21, 26, 30).
Mahrt and Scharkow acknowledge the advent of the Internet and other technologies has enticed "social scientists from various fields" to utilize "the massive amounts of publicly available data about Internet users," and some scholars have enjoyed success in "giving insight into previously inaccessible subject matters" (Mahrt & Scharkow, 2013, p. 21). Still, the authors note, there are some "inherent disadvantages" to sourcing data from the Internet in general and also from particular sites such as social media sites or gaming platforms (Mahrt & Scharkow, 2013, pp. 21, 25). One of the most commonly cited problems with sourcing publicly available data from social media sites, gaming platforms, or Internet usage is that "the problem of random sampling, on which all statistical inference is based, remains largely unsolved" (Mahrt & Scharkow, 2013, p. 25). The data in Big Data essentially are "huge" amounts of data "'naturally' created by Internet users," "not indexed in any meaningful way," and with no "comprehensive overview" available (Mahrt & Scharkow, 2013, p. 21).
While Mahrt and Scharkow mention the positive attitude of "commercial researchers" toward a "golden future" for big data, they also mention the cautious attitude of academic researchers and explain how the "term Big Data has a relative meaning" (Mahrt & Scharkow, 2013, pp. 22, 25), contingent perhaps in part on these different attitudes. And although Mahrt and Scharkow imply most professionals would agree the big data concept "denotes bigger and bigger data sets over time," they explain also how "in computer science" researchers emphasize the concept "refers to data sets that are too big" to manage with "regular storage and processing infrastructures" (Mahrt & Scharkow, 2013, p. 22). This emphasis on data volume and data management infrastructure familiar to computer scientists may seem to some researchers in "the social sciences and humanities as well as applied fields in business" too narrowly focused on computational or quantitative methods, and this focus may seem exclusive and controversial in additional ways (Mahrt & Scharkow, 2013, pp. 22-23). Some of these additional controversies revolve around issues such as, for example, whether a "data analysis divide" may be developing that favors those with "the necessary analytical training and tools" over those without them (Mahrt & Scharkow, 2013, pp. 22-23), or whether an overemphasis on "data analysis" may have contributed to the "assumption that advanced analytical techniques make theories obsolete in the research process," as if the numbers, the "observed data," no longer require human interpretation to clarify meaning or to identify contextual or other confounding factors that may undermine the quality of the research and raise "concerns about the validity and generalizability of the results" (Mahrt & Scharkow, 2013, pp. 23-25).
Although Mahrt and Scharkow grant that advances in "computer-mediated communication," "social media," and other types of "digital media" may be "fueling methodological innovation" such as analysis of large-scale data sets - or so-called Big Data - and that the opportunity to participate is alluring to "social scientists" in many fields, the authors conclude their paper by citing Herring and others in urging researchers to commit to "methodological training," "to learn to ask meaningful questions," and to continually "assess" whether collection and analysis of massive amounts of data is truly valuable in any specific research endeavor (Mahrt & Scharkow, 2013, pp. 20, 29-30). The advantages of automated, big data research are numerous, as Mahrt and Scharkow concede, for instance "convenience" and "efficiency," the elimination of research obstacles such as "artificial settings" and "observation effects," or the "visualization" of massive "patterns in human behavior" previously impossible to discover and render (Mahrt & Scharkow, 2013, pp. 24-25). With those advantages understood and granted, the authors' argument seems a reasonable reminder of the "established principles of empirical research" and of the occasional need to reaffirm the value of that tradition (Mahrt & Scharkow, 2013, p. 21).
Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, December). The Google file system. ACM SIGOPS Operating Systems Review, 37(5), 29-43.
When they published their paper in 2003, engineers at Google had already designed, developed, and implemented the Google File System (GFS) in an effort to sustain performance and control costs while providing the infrastructure, platform, and applications required to deliver Google's services to users (Ghemawat, Gobioff, & Leung, 2003, p. 29). Although the authors acknowledge GFS shares many of the same aims as existing distributed file systems, aims such as "performance, scalability, reliability, and availability," they state GFS has dissimilar "design assumptions" arising from their "observations" of Google's "application workloads and technological environment" (Ghemawat et al., 2003, p. 29). In general, the authors describe GFS as "the storage platform for the generation and processing of data used by our service" and by their "research and development efforts that require large data sets" (Ghemawat et al., 2003, p. 29). In addition, they state that GFS is suitable for "large distributed data-intensive applications," that it is capable of providing "high aggregate performance to a large number of clients," and that it "is an important tool" that allows Google "to innovate and attack problems on the scale of the entire web" (Ghemawat et al., 2003, pp. 29, 43).
In the introduction to their paper, the authors state the four primary characteristics of their "workloads and technological environment" as 1) "component failures are the norm rather than the exception," 2) "files are large by traditional standards," 3) "most files are mutated by appending new data rather than overwriting existing data," and 4) "co-designing the applications and the file system API benefits the overall system by increasing flexibility" (Ghemawat et al., 2003, p. 29). Each of these observations aligns with (results in) what the authors call their "radically different points in the design space" (Ghemawat, Gobioff, & Leung, 2003, p. 29), which they elaborate in some detail both in the numbered list in the paper's introduction and in the bulleted list in the second section, "Assumptions," of the paper's second part, "Design Overview" (Ghemawat et al., 2003, p. 30). Considering the authors' first observation, for example, that the "quantity and quality of the components virtually guarantee" parts of the system will fail and "will not recover," it is reasonable to assert the design premises (assumptions) that the system should be built from "inexpensive commodity components" and that it "must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis" (Ghemawat et al., 2003, pp. 29-30). Considering the authors' second observation, for example, that "files are huge by traditional standards," meaning "multi-GB files are common" and "the system stores a modest number of large files," it is reasonable to assert the design premises (assumptions) that system parameters "such as I/O operation and block sizes" need "to be revisited" and redefined in order to optimize the system for managing large files while maintaining support for managing small files (Ghemawat et al., 2003, pp. 29-30). These two examples demonstrate the type of arguments and evidence the authors provide to support their claim that GFS responds to fundamental differences between the data, workloads (software), and infrastructure (hardware) of traditional information technology and the data, workloads, and infrastructure Google needs to sustain its operations in contemporary and future information technology (Ghemawat et al., 2003, pp. 29-33, 42-43). In the remaining passages of their paper's introduction and in the first section of their design overview, the authors continue discussing Google's technological environment by describing the third and fourth primary characteristics of the environment they have observed and by explaining the corollary design premises and assumptions, arising from those observations, that they applied to designing and developing GFS (Ghemawat et al., 2003, pp. 29-30).
With the rationale for their work thus established, the authors move on in the remaining sections of their design overview to discuss the overall architecture of GFS. First, they introduce some features the authors imply are shared with other distributed file systems - for example, an API supporting "the usual operations to create, delete, open, close, read, and write files" - and some features the authors imply are unique to GFS - for example, "snapshot and record append operations" (Ghemawat et al., 2003, p. 30). Next, they describe the main software components (functions or roles) included in a GFS implementation on a given "cluster" (set) of machines, namely the "GFS clients," the "GFS master," and the "GFS chunkservers." The GFS clients enable communication between the applications requiring data and the GFS master and GFS chunkservers providing data. The GFS master "maintains all file system metadata" and "controls system-wide activities." The GFS chunkservers store the actual data (Ghemawat et al., 2003, p. 31).
At this point in their paper, the authors begin providing fairly detailed technical explanations of how these various GFS components interact, so I will mention only a few points the authors emphasize as crucial to the success of GFS. First of all, in contrast with some other distributed file systems, GFS is a "single master" architecture, which has both advantages and disadvantages (Ghemawat et al., 2003, pp. 30-31). According to the authors, one advantage of "having a single master" is that it "vastly simplifies" the "design" of GFS and "enables the master to make sophisticated chunk placement and replication decisions using global knowledge" (Ghemawat et al., 2003, pp. 30-31). A disadvantage of having only one master, however, is that its resources could be overwhelmed and it could become a "bottleneck" (Ghemawat et al., 2003, p. 31). In order to overcome this potential disadvantage of the single master architecture, the authors explain how communication and data flow through the GFS architecture, namely that GFS clients "interact with the master for metadata operations," but interact with the chunkservers for actual data operations (i.e. operations requiring alteration or movement of data), thereby relieving the GFS master from performing "common operations" that could overwhelm it (Ghemawat et al., 2003, pp. 31, 43). Other important points include GFS's relatively large data "chunk size," its "relaxed consistency model," its elimination of the need for substantial client cache, and its use of replication instead of RAID to solve fault tolerance issues (Ghemawat et al., 2003, pp. 31-32, 42).
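A rough, illustrative model of the read path the authors describe may help: the client asks the single master only for metadata (which chunkserver holds which chunk) and then reads the data directly from a chunkserver, which is how the design keeps the master from becoming a bottleneck. All class and method names below are invented for the sketch; this is not the real GFS interface.

```python
# Toy model of the GFS read path described by Ghemawat et al. (2003).
# All names are invented for illustration; this is not the actual GFS API.

CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses a comparatively large 64 MB chunk size

class Master:
    """Holds metadata only: which chunk of which file lives on which chunkservers."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations    # {(filename, chunk_index): [chunkserver, ...]}

    def lookup(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

class Chunkserver:
    """Stores the actual chunk data; data traffic goes here, not through the master."""
    def __init__(self, chunks):
        self.chunks = chunks                      # {(filename, chunk_index): bytes}

    def read(self, filename, chunk_index, offset, length):
        return self.chunks[(filename, chunk_index)][offset:offset + length]

def client_read(master, filename, byte_offset, length):
    chunk_index = byte_offset // CHUNK_SIZE               # client maps the offset to a chunk
    replicas = master.lookup(filename, chunk_index)       # metadata-only request to the master
    server = replicas[0]                                  # pick one replica (e.g. the nearest)
    return server.read(filename, chunk_index, byte_offset % CHUNK_SIZE, length)

cs = Chunkserver({("log.txt", 0): b"hello, google file system"})
m = Master({("log.txt", 0): [cs]})
print(client_read(m, "log.txt", byte_offset=7, length=6))   # b'google'
```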
McNely, B., Spinuzzi, C., & Teston, C. (2015). Contemporary research methodologies in technical communication. Technical Communication Quarterly, 24, 1-13.
In Technical Communication Quarterly's most recent special issue on research methods and methodologies, the issue's guest editors assert "methodological approaches" are important "markers for disciplinary identity" and thereby agree with a previous guest editor, Goubil-Gambrell, who in the 1998 special issue "argued that 'defining research methods is a part of disciplinary development'" (McNely, Spinuzzi, & Teston, 2015, p. 2). Furthermore, the authors of the 2015 special issue regard the 1998 special issue as a "landmark issue" including ideas that "informed a generation of technical communication scholars as they defined their own objects of study, enacted their research ethics, and thought through their metrics" (McNely, et al., 2015, p. 9).
It is in this tradition that the authors of the 2015 special issue seek both to review the "key methodological developments" and associated theories forming the technical communication "field's current research identity" and to preview and "map future methodological approaches" and relevant theories (McNely, et al., 2015, p. 2). The editors argue the approaches and theories discussed in this special issue of the journal "not only respond to" what they view as substantial changes in "tools, technologies, spaces, and practices" in the field over the past two decades, but also "innovate" by describing and modeling how these changes are informing technical communicators' emerging research methodologies and theories as those methodologies and theories relate to the "field's objects of study, research ethics, and metrics" (i.e. "methodo-communicative issues") (McNely, et al., 2015, pp. 1-2, 6-7).
Reviewing what they see as the fundamental theories and research methodologies of the field, the authors explore how a broad set of factors (e.g. assumptions, values, agency, tools, technology, and contexts) manifest in work produced along three vectors of theory and practice they identify as "sociocultural theories of writing and communication," "associative theories and methodologies," and "the new material turn" (McNely, et al., 2015, p. 2). The authors describe the sociocultural vector as developing from theoretical traditions in "social psychology, symbolic interactionism," "learning theory," and "activity theory," among others, and as essentially involving "purposeful human actors," "material surroundings," "heterogeneous artifacts and tools," and even "cognitive constructs" combining in "concrete interactions" (that is, situations) arising from synchronic and diachronic contextual variables scholars may identify, describe, measure, and use to explain phenomena and theorize about them (McNely, et al., 2015, pp. 2-4). The authors describe the associative vector as developing from theoretical traditions in "articulation theory," "rhizomatics," "distributed cognition," and "actor-network theory (ANT)," and as essentially involving "symmetry," described as "a methodological stance that ascribes agency to a network of human and nonhuman actors rather than to specific human actors," which therefore leads researchers to "focus on associations among nodes" as the objects at the methodological nexus (McNely, et al., 2015, p. 4). The authors describe the new material vector as developing from theoretical traditions in "science and technology studies, political science, rhetoric, and philosophy" (with the overlapping traditions from political science and philosophy often "collected under the umbrella known as 'object-oriented ontology'") and as essentially involving a "radically symmetrical perspective on relationships between humans and nonhumans - between people and things, whether those things are animal, vegetable, or mineral" and how these human and non-human entities integrate into "collectives" or "assemblages" that have "agency" one could view as "distributed and interdependent," a phenomenon the authors cite Latour as labeling "interagentivity" (McNely, et al., 2015, p. 5).
Previewing the articles in this special issue, the editors acknowledge how technical communication methodologies have been "influenced by new materialisms and associative theories" and argue these methodologies "broaden the scope of social and rhetorical aspects" of the field and "encourage us to consider tools, technologies, and environs as potentially interagentive elements of practice" that enrich the field (McNely, et al., 2015, p. 6). At the same time, the editors mention how approaches such as "action research" and "participatory design" are advancing "traditional qualitative approaches" (McNely, et al., 2015, p. 6). In addition, the authors state "given the increasing importance of so-called 'big data' in a variety of knowledge work fields, mixed methods and statistical approaches to technical communication are likely to become more prominent" (McNely, et al., 2015, p. 6). Amidst these developments, the editors state their view that adopting "innovative methods" in order to "explore increasingly large data sets" while "remaining grounded in the values and aims that have guided technical communication methodologies over the previous three decades" may be one of the field's greatest challenges (McNely, et al., 2015, p. 6).
In the final section of their paper, the authors explicitly return to what they seem to view as primary disciplinary characteristics (i.e. markers, identifiers), which they call "methodo-communicative issues," and use those characteristics to compare the articles in the 1998 special issue with those in the 2015 special issue and to identify what they see as new or significant in the 2015 articles. The "methodo-communicative issues" or disciplinary characteristics they use are "objects of study, research ethics, and metrics" (McNely, et al., 2015, pp. 6-7). Regarding objects of study, the authors note how in the 1998 special issue Longo focuses on the "contextual nature of technical communication" while in the 2015 special issue Read and Swarts focus on "networks and knowledge work" (McNely, et al., 2015, p. 7). Regarding ethics, the authors cite Blyler in the 1998 special issue as applying "critical" methods rather than "descriptive/explanatory methods" while in the 2015 special issue Walton, Zraly, and Mugengana apply "visual methods" to create "ethically sound cross-cultural, community-based research" (McNely, et al., 2015, p. 7). Regarding metrics or "measurement," the authors cite Charney in the 1998 special issue as contrasting the affordances of "empiricism" with "romanticism" while in the 2015 special issue Graham, Kim, DeVasto, and Keith explore the affordances of "statistical genre analysis of larger data sets" (McNely, et al., 2015, p. 7). In their discussion of what is new or significant in the articles in the 2015 special issue, the editors highlight how some articles address particular methodo-communicative issues. Regarding metrics or "measurement," for example, they highlight how Graham, Kim, DeVasto, and Keith apply Statistical Genre Analysis (SGA), a hybrid research method combining rhetorical analysis with statistical analysis, to answer research questions such as which "specific genre features can be correlated with specific outcomes" across an "entire data set" rather than across selected exemplars (McNely, et al., 2015, p. 8).
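To suggest the flavor of the corpus-wide correlation the editors attribute to SGA, the short Python sketch below runs a chi-square test of association between a coded genre feature and an outcome across an entire (invented) set of documents. The counts, the feature, and the outcome categories are all hypothetical illustrations of the general statistical move, not data or methods taken from the 2015 special issue.

    # Hypothetical illustration of correlating a coded genre feature with an
    # outcome across a whole corpus; the counts below are invented.
    from scipy.stats import chi2_contingency

    # Rows: documents with / without the coded feature (e.g., an explicit risk statement).
    # Columns: documents with a favorable vs. unfavorable outcome.
    contingency = [[42, 18],   # feature present
                   [25, 35]]   # feature absent

    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")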
In summary, the guest editors of this 2015 special issue on contemporary research methodologies both review the theoretical and methodological traditions of technical communication and preview the probable future direction of the field as portrayed in the articles included in this special issue.
Wolfe, J. (2015). Teaching students to focus on the data in data visualization. Journal of Business and Technical Communication, 29, 344-359.
This "pedagogical reflection" by Joanna Wolfe utilizes "Perelman and Olbrechts-Tyteca's concept of interpretative level" to elucidate the rhetorical decisions people make when selecting and presenting data and to provide a theoretical foundation for two exercises and a formal assignment Wolfe designed to teach "data visualization" and "writing about data" in communication courses (Wolfe, 2015, pp. 344-345, 348).
According to Wolfe, "interpretative level" is used by Perelman and Olbrechts-Tyteca "to describe the act of choosing between competing, valid interpretations" (Wolfe, 2015, p. 345). In relation to data specifically, Wolfe states the concept can be applied to describe "the choice we make to summarize data on variable x versus variable y" and explains further by emphasizing how people decide whether data are presented as, for example, "averages versus percentages or raw counts" and how those choices have "dramatic consequences for the stories we might tell about data" (Wolfe, 2015, pp. 345-346).
By focusing on interpretative level, Wolfe hopes to address what she perceives as a failure of technical communication textbooks to address strategic concerns that would encourage authors to "return to the data to reconsider what data are selected, how they are summarized, and whether they should be synthesized with other data for a more compelling argument" (Wolfe, 2015, p. 345). Although Wolfe praises the communication literature and technical communication textbooks for addressing tactical concerns such as aligning visualization designs with types of data, adjusting visualizations for specific audiences, and considering "how to ethically represent data," she proposes greater involvement with the data to address strategic concerns such as what the rhetorical purpose and context are and which tactics should be used to advance the overall rhetorical strategy (Wolfe, 2015, p. 348).
In the main body of her paper, Wolfe explains the two exercises and formal assignment she designed to teach students the interpretative level concept and to enable them to practice using it by creating data visualizations from actual data sets (Wolfe, 2015, pp. 348-356). In the first exercise, she demonstrates how deciding which variable to sort a data table by determines which "story or narrative" is immediately perceived by most viewers (Wolfe, 2015, p. 349), and she explains how to have students practice creating data visualizations to present the "fairest" view of Olympics medal data (Wolfe, 2015, pp. 348-351). In the second exercise and in the formal assignment, Wolfe continues adding complexity by increasing the number of analytical points and potential visualization methods students should consider (Wolfe, 2015, pp. 351-355). This increased complexity allows Wolfe to discuss additional methods for visualizing data. She explains, for example, how to consolidate variables using point systems to provide an index score that better summarizes data and how to use stylistic and organizational choices in visualizations to reveal patterns in the data that enable viewers to "derive conclusions" aligned with the authors' decisions regarding rhetorical strategy (Wolfe, 2015, pp. 353-356).
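A small pandas sketch can make Wolfe's point about interpretative level tangible: the same medal counts "tell" a different story depending on which variable the table is ranked by. The countries, counts, and per-capita variable below are my own invented illustration, not Wolfe's Olympic data set or her exercise materials.

    # Invented medal counts; the point is only that sort order changes the apparent "story."
    import pandas as pd

    medals = pd.DataFrame(
        {"country": ["A", "B", "C"],
         "gold": [10, 6, 3],
         "total": [18, 32, 30],
         "population_millions": [300, 60, 5]},
    )
    medals["medals_per_million"] = medals["total"] / medals["population_millions"]

    # Three defensible rankings of the same data, each favoring a different country.
    print(medals.sort_values("gold", ascending=False))                # story 1: country A leads
    print(medals.sort_values("total", ascending=False))               # story 2: country B leads
    print(medals.sort_values("medals_per_million", ascending=False))  # story 3: country C leads

Each ranking is a legitimate summary of the same underlying data, which is exactly the kind of choice among "competing, valid interpretations" Wolfe wants students to recognize as rhetorical.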
In conclusion, Wolfe proposes again that communication instruction regarding data visualization should go beyond teaching optimal data visualization tactics by introducing concepts such as interpretative level that encourage students to create rhetorical strategies, to revisit the data, the analysis, and the rhetorical purpose and context, and thereby to invent "narratives" that realize those strategies (Wolfe, 2015, p. 357). This, according to Wolfe, will enable students to see data not "as pure, unmodifiable fact," but "as a series of rhetorical choices" (Wolfe, 2015, p. 357).
Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. John Wiley & Sons.
In the introduction to his book on the big data phenomenon, Jared Dean notes recent examples of big data's impact, provides an extended definition of big data, and discusses some prominent issues currently debated in the field (Dean, 2014, pp. 1-12). In part one, Dean describes what he calls "the computing environment," including elements such as hardware, systems architectures, programming languages, and software used in big data projects, as well as how these elements interact (Dean, 2014, pp. 23-25). In part two, Dean explains a broad set of tactics for "turning data into business value" through the "methodology, algorithms, and approaches that can be applied to your data mining activities" (Dean, 2014, pp. 53-54). In part three, Dean examines cases of "large multinational corporations" that completed big data projects and "overcame major challenges in using their data effectively" (Dean, 2014, p. 194). In the final chapter of his book, Dean notes some of the trends, "opportunities," and "serious challenges" he sees in the future of "big data, data mining, and machine learning" (Dean, 2014, p. 233).
Including data mining and machine learning in the title of his book, Dean highlights two fields of practice that managed and processed relatively large volumes of data long before the popular term big data was first used and that, according to Dean, became "co-opted for self-promotion by many people and organizations with little or no ties to storing and processing large amounts of data or data that requires large amounts of computation" (Dean, 2014, pp. 9-10). Although Dean does not yet explain why so much attention has become focused on data mining and related fields, he says data is the "new oil," the natural resource that is "plentiful but difficult and sometimes messy to extract," and he says this natural resource requires "infrastructure" to "transport, refine, and distribute it" (Dean, 2014, p. 12). While noting "'Big Data' once meant petabyte scale" and was generally used in reference to "unstructured chunks of data mined or generated from the internet," Dean proposes the usage of the term big data has "expanded to mean a situation where" organizations "have too much data to store effectively or compute efficiently using traditional methods" (Dean, 2014, p. 10). Furthermore, Dean proposes what he calls the current "big data era" is differentiated both by a) notable changes in perception, attitude, and behavior by those who have realized "organizations that use data to make decisions over time in fact do make better decisions" and therefore attain competitive advantage warranting "investment in collecting and storing data for its potential future value," and by b) the "rapid development, creation, and maturity of technologies to store, manipulate, and analyze this data in new and efficient ways" (Dean, 2014, pp. 4-5). The main example of big data Dean cites in his book's introduction illustrates big data's impact by highlighting how it enabled "scientists" "to identify genetic markers" that enabled them to discover the drug tamoxifen used to treat breast cancer "is not 80% effective in patients but 100% effective in 80% of patients and ineffective in the rest" (Dean, 2014, p. 2). In commenting on this example of big data's impact, Dean states "this type of analysis was not possible before" the "era of big data" because the "volume and granularity of the data was missing," the "computational resources" required for the analysis were too scarce and expensive, and the "algorithms or modeling techniques" were too immature (Dean, 2014, p. 2). These types of discoveries, Dean states, have revealed to organizations the "potential future value" of data and have resulted in a "virtuous circle" where the realization of the value of data leads to increased allocation of resources to collect, store, and analyze data, which leads to more valuable discoveries (Dean, 2014, p. 4). Although Dean mentions "credit" for first using the term big data is generally given to John Mashey, who "in the late 1990s" "gave a series of talks to small groups about this big data tidal wave that was coming," he also notes the "first academic paper was presented in 2000, and published in 2003, by Francis X. Diebold" (Dean, 2014, p. 3). With the broad parameters of Dean's extended definition of big data thus outlined, Dean completes the introduction of his book with a discussion of some prominent issues currently debated in the field, such as when sampling data may continue to be preferable to using all available data or when the converse is true (Dean, 2014, pp. 13-21), when new sources of data should be incorporated into existing processes (Dean, 2014, p. 13), and, perhaps most importantly, when the benefit of the information produced by a big data process clearly outweighs the cost of the big data process and thereby value is created (Dean, 2014, p. 11).
Dean's use of the term "data mining" in the first sentences of his initial remarks both to part one and to part two of his book emphasizes Dean's awareness of big data's lineage in previously existing academic disciplines and professional fields of practice, a lineage that can seem lost in recent mainstream explanations of the big data phenomenon that often invoke a few terms all beginning with the letter "v" (Dean, 2014, pp. 24, 54). In fact, Dean himself refers to these "v's" of big data, although he adds the term "value" to the other three commonly used terms "volume," "velocity," and "variety" (Dean, 2014, p. 24). Dean states "data mining is going through a significant shift with the volume, variety, value, and velocity of data increasing significantly each year," and he discusses in a fair amount of detail throughout parts one and two of his book the resources and methodologies available to and applied by those capitalizing on the big data phenomenon (Dean, 2014, pp. 23-25).
Dean separates the data mining endeavor into two sets of elements corresponding to the first two parts of his book. Part one, which Dean calls "the computing environment," includes elements such as hardware, systems architectures, programming languages, and software used in big data projects, as well as how these elements interact. Part two, which Dean calls "turning data into business value," includes elements such as the "methodology, algorithms, and approaches that can be applied to" big data projects (Dean, 2014, pp. 23-25, 54).
Although Dean does not explicitly identify as such what many in the information technology (IT) industry call the workload - meaning the primary and supporting software applications and data required to accomplish some information technology objective, e.g. a data mining project - he does discuss at various points throughout his book how the software (applications) used in data mining has characteristics that determine how well the software will run on particular types of hardware and solution architectures (designs). Dean's introductory remarks to part one describe this as the "interaction between hardware and software," and he notes specifically how "traditional data mining software was implemented by loading data into memory and running a single thread of execution over the data" and how this traditional implementation form determined the "process was constrained by the amount of memory available and the speed of the processor" (Dean, 2014, p. 24). This would mean in cases where the data volume was greater than the available RAM on the system, "the process would fail" (Dean, 2014, p. 24). In addition, Dean notes how software implemented with a "single thread of execution" cannot utilize the advantages of "multicore" CPUs and therefore contributes to imbalances in system utilization and thereby to project waste impeding performance/price optimization (Dean, 2014, pp. 24-25). Further emphasizing his point, Dean says "all software packages cannot take advantage of current hardware capacity" and he notes how "this is especially true of the distributed computing model," a model he acknowledges as important by encouraging decision makers "to ensure that algorithms are distributed and effectively leveraging" currently available computing resources (Dean, 2014, p. 25).
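The contrast Dean draws between loading an entire data set into memory on a single thread and more scalable approaches can be sketched in a few lines of Python. The synthetic CSV, chunk size, and aggregation below are my own illustrative assumptions; the sketch only depicts the general pattern Dean is pointing at, not any particular product's implementation.

    # Contrast the "load everything into memory" pattern with a chunked aggregation
    # that never holds the full data set in RAM. A tiny synthetic CSV stands in
    # for a file too large to load at once.
    import io
    import pandas as pd

    csv_text = "customer,amount\n" + "\n".join(f"{i % 100},{i * 0.5}" for i in range(10_000))

    # Traditional pattern: one pass, whole table in memory (fails if data > RAM).
    whole = pd.read_csv(io.StringIO(csv_text))
    print(whole["amount"].sum())

    # Chunked pattern: stream fixed-size pieces and combine partial results;
    # each chunk could also be handed to a separate worker core.
    total = 0.0
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1_000):
        total += chunk["amount"].sum()
    print(total)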
In the first chapter of part one, Dean begins discussing the hardware involved in big data, and he focuses on five primary hardware components: the storage, the central processing unit (CPU), the graphical processing unit (GPU), the memory (RAM), and the network (Dean, 2014, pp. 27-34). Regarding the storage hardware, Dean notes how although the price to performance and price to capacity ratios are improving, they may not be improving fast enough to offset increasing data volumes, volumes he says are "doubling every few years" (Dean, 2014, p. 28). Dean also draws attention to a few hardware innovations important to large-scale data storage and processing, such as how external storage subsystems (disk arrays) and solid-state drives (SSDs) provide CPUs with faster access to larger volumes of data by improving "throughput" rates, which in turn decreases the amount of time required to analyze vast quantities of data (Dean, 2014, pp. 28-29). Still, even though these innovations in data storage and access are improving data processing and analysis capabilities, Dean emphasizes the importance of the overall systems and their inter-relationships and how big data analytics teams must ensure they choose "analytical software" that can take advantage of improvements in data storage, access, and processing technologies - for example, software that can "augment memory by writing intermediate results to disk storage," since disk storage is less expensive than memory, and that can utilize multi-core processors by executing multi-threaded computations in parallel, since that improves utilization rates of CPUs, and utilization rates are a primary measure of efficiency used in analyzing costs relative to benefits (Dean, 2014, pp. 24, 28-29). Regarding CPU hardware, Dean notes how although the "famous Moore's law" described the rapid improvements in processing power that "continued into the 1990s," the law did not ensure sustainment of those same kinds of improvements (Dean, 2014, pp. 29-30). In fact, Dean states "in the early 2000s, the Moore's law free lunch was over, at least in terms of processing speed," for various reasons - for example, the heat generated by processors running at ultra-high frequencies is excessive - and therefore CPU manufacturers tried other means of improving CPU performance, for example by "adding extra threads of execution to their chips" (Dean, 2014, p. 30). Ultimately, even though the innovation in CPUs is different than it was in the Moore's law years of the 1980s and 1990s, innovation in CPUs continues, and it remains true that CPU utilization rates are low relative to other system components such as mechanical disks, SSDs, and memory; therefore, in Dean's view it remains true the "mismatch that exists among disk, memory, and CPU" often is the primary problem constraining performance (Dean, 2014, pp. 29-30). Regarding GPUs, Dean discusses how they have recently begun to be used to augment system processing power, and he focuses on how some aspects of graphics problems and data mining problems are similar in that they require or benefit from performing "a huge number of very similar calculations" to solve "hard problems remarkably fast" (Dean, 2014, p. 31). Furthermore, Dean notes how in the past "the ability to develop code to run on the GPU was restrictive and costly," but recent improvements in "programming interfaces for developing software" to exploit GPU resources are overcoming those barriers (Dean, 2014, p. 31).
Regarding memory (RAM), Dean emphasizes its importance for data mining workloads due to its function as the "intermediary between the storage of data and the processing of mathematical operations that are performed by the CPU" (Dean, 2014, p. 32). In discussing RAM, Dean provides some background by mentioning a few milestones in the development of RAM and related components, for example how previous 32-bit CPUs and operating systems (OSes) limited addressable memory to 4GB and how Intel's and AMD's introductions of 64-bit CPUs at commodity prices in the early 2000s, along with the release of 64-bit OSes to support their widespread adoption, expanded addressable RAM to 8TB at that time and thereby supported "data mining platforms that could store the entire data mining problem in memory" (Dean, 2014, p. 32). These types of advancements in technology, coupled with improvements in the ratios of price to performance and price to capacity, including the "dramatic drop in the price of memory" during this same time period, "created an opportunity to solve many data mining problems that previously were not feasible" (Dean, 2014, p. 32). Even with this optimistic view of overall advancements, Dean reiterates that the advancement pace in RAM and in hard drives remains slow compared to the advancement pace in CPUs - RAM speeds "have increased by 10 times" while CPU speeds "have increased 10,000 times," and disk storage advancements have been slower than those in RAM - therefore it remains important to continue seeking higher degrees of optimization, for example by using distributed systems, since "it is much less expensive to deploy a set of commodity systems" with high capacities of RAM than it is to use "expensive high-speed disk storage systems" (Dean, 2014, pp. 32-33). One of the disadvantages of distributed systems, however, is the network "bottleneck" existing between individual nodes in the cluster, even when high-speed proprietary technologies such as InfiniBand are used (Dean, 2014, pp. 33-34). In the case of less expensive, standard technologies, Dean notes the "standard network connection for an analytical computing cluster is 10 gigabit Ethernet (10 GbE), which has an upper-bound data transfer rate of 4 gigabytes per second (GB/sec)" (Dean, 2014, p. 34). Since Dean identifies the inter-node network as the slowest component of distributed computing systems, he emphasizes the importance of considering the "network component when evaluating data mining software" and notes skillful design and selection of the "software infrastructure" and "algorithms" are required to ensure efficient "parallelization" is possible while minimizing data movement and "communication between computers" (Dean, 2014, p. 34).
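A back-of-envelope calculation suggests why Dean treats the inter-node network as the component to watch: moving even a modest working set across a cluster link takes far longer than scanning it in local memory. The data size and both bandwidth figures in the sketch below are illustrative assumptions of mine, not measurements or values from Dean's book.

    # Rough arithmetic: time to move a data set across a cluster link versus
    # scanning it from local RAM. All rates are illustrative assumptions.
    def transfer_seconds(data_gb, rate_gb_per_sec):
        return data_gb / rate_gb_per_sec

    data_gb = 500                 # hypothetical working set
    network_rate_gb_per_sec = 1.0  # assumed effective cluster-network rate
    memory_rate_gb_per_sec = 20.0  # assumed effective local RAM scan rate

    print(f"over the network: {transfer_seconds(data_gb, network_rate_gb_per_sec):8.1f} s")
    print(f"from local RAM:   {transfer_seconds(data_gb, memory_rate_gb_per_sec):8.1f} s")
    # The gap is why distributed data mining software tries to move computation
    # to the data and minimize communication between nodes, as Dean recommends.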
In the second chapter of part one, Dean discusses the advantages and disadvantages of different types of distributed systems and notes how the decreasing price of standard hardware - including "high-core/large-memory (massively parallel processing [MPP] systems) and clusters of moderate systems" - has improved the cost to benefit ratio of solving "harder problems," which he defines as "problems that consume much larger volumes of data, with much higher numbers of variables" (Dean, 2014, p. 36). The crucial development enabling this advance in "big data analytics," according to Dean, is the capability and practice of moving "the analytics to the data" rather than moving the "data to the analytics" (Dean, 2014, p. 36). Cluster computing that effectively moves "the analytics to the data," according to Dean, "can be divided into two main groups of distributed computing systems," that is, "database computing" systems based on the traditional relational database management system (RDBMS) and "file system computing" systems based on distributed file systems (Dean, 2014, pp. 36-37). Regarding database computing, Dean describes "MPP databases" that "began to evolve from traditional DBMS technologies in the 1980s" as "positioned as the most direct update for the organizational enterprise data warehouses (EDWs)" and explains "the technology behind" them as involving "commodity or specialized servers that hold data on multiple hard disks" (Dean, 2014, p. 37). In addition to MPP databases, Dean describes "in-memory databases (IMDBs)" as an evolution begun "in the 1990s" that has become a currently "popular solution used to accelerate mission-critical data transactions" for various industries willing to absorb the higher costs of high-capacity RAM in order to attain the increased performance possible when all data is stored in RAM (Dean, 2014, p. 37). Regarding file system computing, Dean notes that while there are many available "platforms," the "market is rapidly consolidating on Hadoop" due to the number of "distributions and tools that are compatible with its file system" (Dean, 2014, p. 37). Dean attributes the "initial development" of Hadoop to Doug Cutting and Mike Cafarella, who "created" it "based" upon development they did on the "Apache open source web crawling project" called Nutch and upon "a paper published by Google that introduced the MapReduce paradigm for processing data on large clusters" (Dean, 2014, pp. 37-38). Summarizing the advantages of Hadoop, Dean explains it is "attractive because it can store and manage very large volumes of data on commodity hardware and can expand easily by adding hardware resources with incremental cost" (Dean, 2014, p. 38). This coupling of high-capacity data storage with incremental capital expenditures makes the cost to benefit ratio appear more attractive to organizations and enables them to rationalize storing "all available data in Hadoop," wagering on its potential future value even while understanding the large volume of data stored in Hadoop "is rarely (actually probably never) in an appropriate form for data mining" without additional resource expenditure to cleanse it, transform it, and even "augment" it with data stored in other repositories (Dean, 2014, pp. 38-39). At this point in chapter two of his book, Dean has completed an initial overview of the hardware components and solution architectures commonly considered by those responsible for purchasing and implementing big data management projects.
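Because Dean traces Hadoop to the Google paper that "introduced the MapReduce paradigm," a minimal word-count sketch in plain Python may help show the shape of that paradigm: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group independently. This is only a toy model of the programming pattern under my own assumptions, not Hadoop's implementation.

    # Toy MapReduce-style word count: map -> shuffle (group by key) -> reduce.
    from collections import defaultdict

    documents = ["big data is big", "data mining finds value in data"]

    # Map: emit (word, 1) for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the emitted pairs by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: aggregate each group independently (the step a cluster parallelizes).
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)   # {'big': 2, 'data': 3, ...}

In a real cluster the map and reduce steps run on many nodes near the data, which is exactly the "move the analytics to the data" strategy Dean credits for the advance in big data analytics.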
Members of the IT organization, according to Dean, are often responsible for these decisions, although they may be expected to collaborate with other organizational stakeholders to understand organizational needs and objectives and to explain the advantages and disadvantages of various technological options and the potential "trade-offs" that should be considered (Dean, 2014, pp. 39-40). Near the end of chapter two, Dean illustrates some of these factors (criteria) and "big data technologies" in a comparative table which ranks some of the solutions he has discussed (e.g. IMDBs, MPPDBs, and Hadoop) according to the degree (high, medium, low) to which they possess features or capabilities often required in big data solutions (e.g. maintaining data integrity, providing high availability to the data, and handling "unstructured data") (Dean, 2014, p. 40). Concluding chapter two, Dean notes selection of the optimal "computing platform" for big data projects depends "on many dimensions, primarily the volume of data (initial, working set, and output data volumes), the pattern of access to the data, and the algorithm for analysis" (Dean, 2014, p. 41). Furthermore, he states these dimensions will "vary" at different phases of the data analysis (Dean, 2014, p. 41).
In the third chapter of part one, Dean departs from the preparation of data and the systems used to store data and moves on to the "analytical tools" that enable people to "create value" from the data (Dean, 2014, p. 43). Dean's discussion of analytical tools focuses on five software applications and programming languages commonly used for large-scale data processing and analysis, and he notes some of the strengths and weaknesses of each one. Beginning with the open source data mining software "Weka (Waikato Environment for Knowledge Analysis)," Dean describes it as "fully implemented in Java" and "notable for its broad range of extremely advanced training algorithms, its work flow graphical user interface (GUI)," and its "data visualization" capabilities (Dean, 2014, pp. 43-44). A weakness of Weka in Dean's view is that it "does not scale well for big data analytics" since it "is limited to available RAM resources" and therefore its "documentation directs users to its data preprocessing and filtering algorithms to sample big data before analysis" (Dean, 2014, p. 44). Even with this weakness, however, Dean states many of Weka's "most powerful algorithms" are only available in Weka and that its GUI makes it "a good option" for those without Java programming experience who need "to prove value" quickly (Dean, 2014, p. 44). When organizations need to "design custom analytics platforms," Dean cites "Java and JVM languages" as "common choices" because of Java's "considerable development advantages over lower-level languages" such as FORTRAN and C that "execute directly on native hardware," especially since "technological advances in the Java platform" have improved its performance "for input/output and network-bound processes like those at the core of many open source big data applications" (Dean, 2014, pp. 44-45). As evidence of these improvements, Dean notes how Apache Hadoop, the popular "Java-based big data environment," won the "2008 and 2009 TeraByte Sort Benchmark" (Dean, 2014, p. 45). Overall, Dean's perspective is that the performance improvements in Java and the increasing "scale and complexity" of "analytic applications" have converged such that Java's advantages in "development efficiency," along with "its rich libraries, many application frameworks, inherent support for concurrency and network communications, and a preexisting open source code base for data mining functionality," now outweigh some of its known weaknesses such as "memory and CPU-bound performance" issues (Dean, 2014, p. 45). In addition to Java, Dean mentions "Scala and Clojure" as "newer languages that also run on the JVM and are used for data mining applications" (Dean, 2014, p. 45). Dean describes "R" as an "open source fourth-generation programming language designed for statistical analysis" that is gaining in "prominence" and popularity in the rapidly expanding "data science community" as well as in academia and "in the private sector" (Dean, 2014, p. 47). Evidence of R's growing popularity is its ranking in the "2013 TIOBE general survey of programming languages," in which it ranked "in 18th place in overall development language popularity," alongside "commercial solutions like SAS (at 21st) and MATLAB (at 19th)" (Dean, 2014, p. 47).
Among the advantages of R, there are "thousands of extension packages" enabling customization to include "everything from speech analysis, to genomic science, to text mining," in addition to its "impressive graphics, free and polished integrated development environments (IDEs), programmatic access to and from many general-purpose languages, and interfaces with popular proprietary analytics solutions including MATLAB and SAS" (Dean, 2014, p. 47). Python is described by Dean as "designed to be an extensible, high-level language with a large standard library and simple, expressive syntax" that "can be used interactively or programmatically" and that is often "deployed for scripting, numerical analysis, and OO general-purpose and Web application development" (Dean, 2014, p. 49). Dean highlights Python's "general programming strengths" and "many database, mathematical, and graphics libraries" as particularly beneficial "in the data exploration and data mining problem domains" (Dean, 2014, p. 49). Although Dean provides a long list of advantageous features of Python, he asserts "the maturity" of its "scikit-learn toolkit" is a primary factor in its recently higher adoption rates "in the data mining and data science communities" (Dean, 2014, p. 49). Last, Dean describes SAS as "the leading analytical software on the market" and cites reports by IDC, Forrester, and Gartner as evidence (Dean, 2014, p. 50). In their most recent reports at the time Dean published this book, both Forrester and Gartner had named SAS "as the leading vendor in predictive modeling and data mining" (Dean, 2014, p. 50). Dean describes "the SAS System" as composed of "a number of product areas including statistics, operations research, data management, engines for accessing data, and business intelligence (BI)," although he states the products "SAS/STAT, SAS Enterprise Miner, and the SAS text analytics suite" are most "relevant" in the context of this book (Dean, 2014, p. 50). Dean explains "the SAS system" can be "divided into two main areas: procedures to perform an analysis and the fourth-generation language that allows users to manipulate data" (Dean, 2014, p. 50). Dean illustrates one of the advantages of SAS by providing an example of how a SAS proprietary "procedure, or PROC," simplifies the code required to perform specific analyses such as "building regression models" or "doing descriptive statistics" (Dean, 2014, p. 51). Another great advantage "of SAS over other software packages is the documentation," which includes "over 9,300 pages" for the "SAS/STAT product alone" and over "2,000 pages" for the Enterprise Miner product (Dean, 2014, p. 51). As additional evidence of the advantages of SAS products, Dean states he has never encountered an "analytical challenge" he has "not been able to accomplish with SAS" and notes that recent "major changes" in the SAS "architecture" have enabled it "to take better advantage of the processing power and falling price per FLOP (floating point operations per second) of modern computing clusters" (Dean, 2014, pp. 50, 52). With his explanations of the big data computing environment (i.e. hardware, systems architectures, software, and programming languages) and some aspects of the big data preparation phase completed in part one, Dean turns to part two, in which he addresses in depth exactly how big data in general and predictive modeling in particular enable "value creation for business leaders and practitioners," a phrase he uses as the subtitle of his book.
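Since Dean singles out the maturity of Python's scikit-learn toolkit as a driver of its adoption, a minimal example of the kind of model-fitting workflow that toolkit supports may be useful here. The synthetic data, the choice of logistic regression, and the parameter values are illustrative assumptions of mine, not examples drawn from Dean's book.

    # Minimal scikit-learn workflow: synthetic data, train/validation split, fit, assess.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))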
Dean introduces part two of his book by stating he will address over the next seven chapters the "methodology, algorithms, and approaches that can be applied to" data mining projects, including a general four-step "process of building models" he says "has been developed and refined by many practitioners over many years," and also including the "SEMMA approach," a "data mining methodology, created by SAS, that focuses on logical organization of the model development phase of data mining projects" (Dean, 2014, pp. 54, 58, 61). The SEMMA approach, according to Dean, "has been in place for over a decade and proven useful for thousands and thousands of users" (Dean, 2014, p. 54). Dean says his objective is to "explain a methodology for predictive modeling" since accurately "predicting future behavior" provides people and organizations "a distinct advantage regardless of the venue" (Dean, 2014, p. 54). In his explanation of data mining methodologies, Dean notes he will discuss "the types of target models, their characteristics," and their business applications (Dean, 2014, p. 54). In addition, Dean states he will discuss "a number of predictive modeling techniques," including the "fundamental ideas behind" them, "their origins, how they differ, and some of their drawbacks" (Dean, 2014, p. 54). Finally, Dean says he will explain some "more modern methods for analysis or analysis for specific types of data" (Dean, 2014, p. 54).
Dean begins chapter four by stating predictive modeling is one of the primary data mining endeavors, and he defines it as a process in which collected "historical data (the past)" are explored to "identify patterns in the data that are seen through some methodology (the model), and then using the model" to predict "what will happen in the future (scoring new data)" (Dean, 2014, p. 55). Next, Dean discusses the multi-disciplinary nature of the field using a Venn diagram from SAS Enterprise Miner training documentation that includes the following contributing disciplines: data mining, knowledge discovery and data mining (KDD), statistics, machine learning, databases, data science, pattern recognition, computational neuroscience, and artificial intelligence (AI) (Dean, 2014, p. 56). Dean notes his tendency to use "algorithms that come primarily from statistics and machine learning," and he explains how these two disciplines, residing in different "university departments" as they do, produce graduates with different knowledge and skills. Graduates in statistics, according to Dean, tend to understand "a great deal of theory" but have "limited programming skills," while graduates in computer science tend to "be great programmers" who understand "how computer languages interact with computer hardware, but have limited training in how to analyze data" (Dean, 2014, p. 56). The result of this, Dean explains, is "job applicants will likely know only half the algorithms commonly used in modeling," with the statisticians knowing "regression, General Linear Models (GLMs), and decision trees" and the computer scientists knowing "neural networks, support vector machines, and Bayesian methods" (Dean, 2014, p. 56). Before moving into a deeper discussion of the process of building predictive models, Dean notes a few "key points about predictive modeling," namely that a) "sometimes models are wrong," b) "the farther your time horizon, the more uncertainty there is," and c) "averages (or averaging techniques) do not predict extreme values" (Dean, 2014, p. 57). Elaborating further, Dean says even though models may be wrong (i.e. there is a known margin of error), the models can still be useful for making decisions (Dean, 2014, p. 57). And finally, Dean emphasizes "logic and reason should not be ignored because of a model result" (Dean, 2014, p. 58).
In the next sections of chapter four, Dean explains in detail "a methodology for building models" (Dean, 2014, p. 58). First, he discusses the general "process of building models" as a "simple, proven approach to building successful and profitable models," and he explains the general process in four phases: "1. Prepare the data," "2. Perform exploratory data analysis," "3. Build your first model," and "4. Iteratively build models" (Dean, 2014, pp. 58-60). The first phase of preparing the data, Dean says, is likely completed by "a separate team" and requires understanding "the data preparation process within" an organization, namely "what data exists" in the organization (i.e. the sources of data) and how data from various sources "can be combined" to "provide insight that was previously not possible" (Dean, 2014, p. 58). Dean emphasizes the importance of having access to "increasingly larger and more granular data" and points directly to the "IT organization" as the entity that should be "keeping more data for longer and at finer levels" to ensure their organizations are not "behind the trend" and not "at risk for becoming irrelevant" in the competitive landscape (Dean, 2014, pp. 58-59). The second phase of the model building process focuses on exploring the data to begin to "understand" it and "to gain intuition about relationships between variables" (Dean, 2014, p. 59). Dean emphasizes "domain expertise" is important at this phase in order to ensure "thorough analysis" and recommendations that avoid unwarranted focus on patterns or discoveries that may seem important to the data miner who, though skilled in data analysis, may lack the domain knowledge needed to realize that apparently significant patterns are in reality insignificant because they are already widely known by domain experts (Dean, 2014, p. 59). Dean notes recent advances in "graphical tools" for data mining have simplified the data exploration process - a process that was once much slower and often required "programming skills" - to the degree that products from large and small companies such as "SAS, IBM, and SAP," "QlikTech," and "Tableau" enable users to easily "load data for visual exploration" and "have been proven to work with" projects involving "billions of observations" when "sufficient hardware resources are available" (Dean, 2014, p. 59). To ensure efficiency in the data exploration phase, Dean asserts the importance of adhering to the "principle of sufficiency" and to the "law of diminishing returns" so that exploration stops while the cost to benefit ratio is optimal (Dean, 2014, pp. 59-60). The third phase of the model-building process is to build the first model while acknowledging a "successful model-building process will involve many iterations" (Dean, 2014, p. 60). Dean recommends working rapidly with a familiar method to build the first model and states he often prefers to "use a decision tree" because he is comfortable with it (Dean, 2014, p. 60). This first model created is used as the "champion model" (benchmark) against which the next model iteration will be evaluated (Dean, 2014, p. 60). The fourth phase of the model-building process is where most time should be devoted and where the data miner will need to use "some objective criteria that defines the best model" in order to determine whether the most recent model iteration is "better than the champion model" (Dean, 2014, p. 60).
Dean describes this step as "a feedback loop" since the data miner will continue comparing the "best" model built thus far with the next model iteration until either "the project objectives are met" or some other constraint, such as a deadline, requires stopping (Dean, 2014, p. 60). With this summary of the general model-building process finished, Dean next explains SAS's SEMMA approach, which "focuses on the model development phase" of the general model-building process and "makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes and confirm a model's accuracy" (Dean, 2014, p. 61). "The acronym SEMMA," Dean states, "refers to the core process of conducting data mining" and stands for "sample, explore, modify, model," and "assess" (Dean, 2014, p. 61). To explain SEMMA further, Dean addresses "a common misconception" by emphasizing "SEMMA is not a data mining methodology but rather a logical organization of the functional tool set" of the SAS Enterprise Miner product, which can be used appropriately "as part of any iterative data mining methodology adopted by the client" (Dean, 2014, p. 61). Once the best model has been found under the given constraints, this "champion model" can "be deployed to score" new data; this is "the end result of data mining" and the point at which "return on investment" is realized (Dean, 2014, p. 63). At this point, Dean explains some of the advantages of using SAS's Enterprise Miner, advantages such as automation of "the deployment phase by supplying scoring code in SAS, C, Java, and PMML" and capture of "code for pre-processing activities" (Dean, 2014, p. 63).
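The iterative "feedback loop" Dean describes - keep a champion model and replace it only when a challenger beats it on an objective criterion computed on held-out data - can be sketched as a simple loop. The candidate algorithms, the accuracy criterion, and the synthetic data below are my own illustrative choices, not Dean's prescription or SAS's tooling.

    # Champion/challenger loop in the spirit of Dean's iterative model building:
    # the champion is replaced only when a challenger scores better on validation data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2_000, n_features=15, random_state=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

    # A fast, familiar first model becomes the initial champion (Dean often starts with a decision tree).
    champion = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
    champion_score = accuracy_score(y_valid, champion.predict(X_valid))

    challengers = [
        DecisionTreeClassifier(max_depth=8, random_state=1),
        LogisticRegression(max_iter=1_000),
    ]
    for challenger in challengers:
        score = accuracy_score(y_valid, challenger.fit(X_train, y_train).predict(X_valid))
        if score > champion_score:        # the objective criterion for "better"
            champion, champion_score = challenger, score

    print(type(champion).__name__, round(champion_score, 3))

In practice the loop would also respect the stopping conditions Dean names, ending when project objectives are met or a deadline arrives rather than when the list of challengers is exhausted.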
Following his overview of the model-building process, Dean identifies the three "types of target models," discusses "their characteristics," provides "information about their specific uses in business," and then explains some common ways of "assessing" (evaluating) predictive models (Dean, 2014, pp. 54, 64-70). The first target model Dean explains is "binary classification," which in his experience "is the most common type of predictive model" (Dean, 2014, p. 64). Binary classification often is used to provide "decision makers" with a "system to arrive at a yes/no decision with confidence," and to do so fast - for example, to approve or disapprove a credit application or to launch or not launch a spacecraft (Dean, 2014, p. 64). Dean explains further, however, there are cases when a prediction of "the probability that an event will or will not occur" is "much more useful than the binary prediction itself" (Dean, 2014, p. 64). In the case of weather forecasts, for example, most people would prefer to know the "confidence estimate" ("degree of confidence," confidence level, probability) expressed as a percentage probability the weather forecaster assigns to each possible outcome; it is easier to decide whether to carry an umbrella if one knows the weather forecaster predicts a ninety-five percent chance of rain than if one knows only that the weather forecaster predicts it will rain or it will not rain (Dean, 2014, p. 64). The second target model Dean explains is "multilevel or nominal classification," which he describes as useful when one is interested in creating "more than two levels" of classification (Dean, 2014, p. 64). As an example, Dean describes how preventing credit card fraud while facilitating timely transactions could mean the initial decision regarding a transaction includes not only the binary classifications of accept or decline, but also an exception classification of requiring further review before a decision can be made to accept or decline (Dean, 2014, p. 64). Although beneficial in some cases, Dean notes nominal classification "poses some additional complications from a computational and also reporting perspective" since it requires finding the probability of all events prior to computing the probability of the "last level," adds the "challenge in computing the misclassification rate," and requires the "report value be calibrated" for easier interpretation by readers of the report (Dean, 2014, p. 64). The final target model Dean explains is "interval prediction," which he describes as "used when the largest level is continuous on the number line" (Dean, 2014, p. 66). This model, according to Dean, is often used in the insurance industry, which he states generally determines premium prices based on "three different types of interval predictive models including claim frequency, severity, and pure premium" (Dean, 2014, p. 67). Since insurance companies will implement the models differently based on each insurance company's "historical data" and each customer's "specific information" (including, in the automotive sector as an example, the customer's car type, yearly driving distance, and driving record), the insurance companies will arrive at different premium prices for certain classifications of customers (Dean, 2014, p. 67).
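Dean's point that a predicted probability is often more useful than the bare yes/no label can be shown with a simple probability threshold: the same model output supports both a hard binary prediction and a confidence-weighted decision. The forecast probabilities and the umbrella threshold below are invented for illustration, extending Dean's weather example rather than reproducing it.

    # A predicted probability supports richer decisions than a bare yes/no label.
    # The forecast probabilities and the decision threshold are invented.
    forecasts = {"Monday": 0.95, "Tuesday": 0.40, "Wednesday": 0.10}
    carry_umbrella_threshold = 0.30   # personal tolerance for getting caught in the rain

    for day, p_rain in forecasts.items():
        hard_label = "rain" if p_rain >= 0.5 else "no rain"                    # binary prediction
        decision = "carry umbrella" if p_rain >= carry_umbrella_threshold else "leave it"
        print(f"{day}: P(rain) = {p_rain:.0%} -> label: {hard_label:7s} -> {decision}")

Note that Tuesday's hard label says "no rain" while the probability still justifies carrying an umbrella for a risk-averse decision maker, which is the extra value Dean attributes to probability estimates.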
Having explained the three types of target models, Dean finishes chapter four by discussing how to evaluate "which model is best" given a particular predictive modeling problem and the available data set (Dean, 2014, p. 67). He establishes the components of a "model" as "all the transformations, imputations, variable selection, variable binning, and so on manipulations that are applied to the data in addition to the chosen algorithm and its associated parameters" (Dean, 2014, p. 67). Noting the inherent subjectivity of determining the "best" model, Dean asserts the massive "number of options and combinations makes a brute-force" approach "infeasible," and therefore a "common set of assessment measures" has arisen; and partitioning a data set into a larger "training partition" and a smaller "validation partition" to use for "assessment" has become "best practice" in order to understand the degree to which a "model will generalize to new incoming data" (Dean, 2014, pp. 67, 70). As the foundation for his explanation of "assessment measures," Dean identifies "a set" of them "based on the 2×2 decision matrix," which he illustrates in a table with the potential outcomes of "nonevent" and "event" as the row headings and the potential predictions of "predicted nonevent" and "predicted event" as the column headings (Dean, 2014, p. 68). The values in the four table cells logically follow as "true negative," "false negative," "false positive," and "true positive" (Dean, 2014, p. 68). This classification method is widely accepted, according to Dean, because it "closely aligns with what most people associate as the 'best' model, and it measures the model fit across all values" (Dean, 2014, p. 68). In addition, Dean notes the "proportion of events to nonevents" when using the classification method should be "approximately equal" or "the values need to be adjusted for making proper decisions" (Dean, 2014, p. 68). Once all observations are classified according to the 2×2 decision matrix, the "receiver operating characteristics (ROC) are calculated for all points and displayed graphically for interpretation" (Dean, 2014, p. 68). In another table in this section, Dean provides the "formulas to calculate different classification measures" such as the "classification rate (accuracy)," "sensitivity (true positive rate)," and "1-specificity (false positive rate)," among others (Dean, 2014, p. 68). Other assessment measures explained by Dean are "lift" and "gain" and the statistical measures Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Kolmogorov-Smirnov (KS) statistic (Dean, 2014, pp. 69-70).
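The measures Dean builds on the 2×2 decision matrix are easy to reproduce directly; the sketch below computes the four cells and a few of the derived rates from a small set of invented actual and predicted labels, which is the arithmetic behind the table of formulas Dean presents.

    # Confusion-matrix cells and derived rates from the 2x2 decision matrix.
    # The label vectors are invented; 1 = event, 0 = nonevent.
    actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

    accuracy            = (tp + tn) / len(actual)   # classification rate
    sensitivity         = tp / (tp + fn)            # true positive rate
    false_positive_rate = fp / (fp + tn)            # 1 - specificity
    print(tp, tn, fp, fn, accuracy, sensitivity, false_positive_rate)

Sweeping a probability threshold and recording the (false positive rate, true positive rate) pairs at each step is what produces the ROC curve Dean describes being "displayed graphically for interpretation."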
In chapter five, Dean begins discussing in greater detail what he calls "the key part of data mining," work that follows and is founded upon the deployment and implementation phases that established the hardware infrastructure, supporting software platform, and data mining software to be used, as well as upon the data preparation phase - all phases that in "larger organizations" will likely be performed by cross-functional teams including specialists from various business units and the information technology division (Dean, 2014, pp. 54, 58, 71). While Dean provides in chapter four an overview of predictive modeling processes and methodologies, including the foundational target model types and how to evaluate (assess) the effectiveness of those model types given particular modeling problems and data sets, it is in chapters five through ten that Dean thoroughly discusses and explains the work of the "data scientists" who are responsible for creating models that will predict the future and provide return on investment and competitive advantage to their organizations (Dean, 2014, pp. 55-70, 71-191). Data scientists, according to Dean, are responsible for executing best practice in trying "a number of different modeling techniques or algorithms and a number of attempts within a particular algorithm using different settings or parameters" to find the best model for accomplishing the data mining objective (Dean, 2014, p. 71). Dean explains data scientists will need to conduct "many trials" in a "brute force" effort to "arrive at the best answer" (Dean, 2014, p. 71). Although Dean notes he focuses primarily on "predictive modeling or supervised learning," which "has a target variable," the "techniques can be used" also "in unsupervised approaches to identify the hidden structure of a set of data" (Dean, 2014, p. 72). Without covering Dean's rather exhaustive explanation throughout chapter five of the most "common predictive modeling techniques" and throughout chapters six through ten of "a set of methods" that "address more modern methods for analysis or analysis for specific type of data," let it suffice to say he presents in each section pertaining to a particular modeling method that method's "history," an "example or story to illustrate how the method can be used," "a high-level mathematical approach to the method," and "a reference section" pointing to more in-depth materials on each method (Dean, 2014, pp. 54, 71-72). In chapter five, Dean explains modeling techniques such as recency, frequency, and monetary (RFM) modeling, regression (originally known as "least squares"), generalized linear models (GLMs), neural networks, decision and regression trees, support vector machines (SVMs), Bayesian methods network classification, and "ensemble methods" that combine models (Dean, 2014, pp. 71-126). In chapters six through ten, Dean explains modeling techniques such as segmentation, incremental response modeling, time series data mining, recommendation systems, and text analytics (Dean, 2014, pp. 127-180).
Sharing his industry experience in part three, Dean provides "a collection of cases that illustrate companies that have been able to" collect big data, apply data "analytics" to "well-stored and well-prepared data," and "find business value" to "improve the business" (Dean, 2014, p. 194). Dean's case study of a "large U.S.-based financial services" company demonstrates how it attained its "primary objective" to improve the accuracy of the predictive model used in its marketing campaigns, to "move the model lift from 1.6 to 2.5," and thereby significantly increase the number of customers who responded to its marketing campaigns (Dean, 2014, pp. 198, 202-203). Additionally, the bank attained its second objective, which "improved operational processing efficiency and responsiveness" and thereby increased "productivity for employees" (Dean, 2014, pp. 198, 203). Another case study, of "a technology manufacturer," shows how it "used the distributed file system of Hadoop along with in-memory computational methods" to reduce the time required to "compute a correlation matrix" identifying sources of product quality issues "from hours to just a few minutes" and thereby enabled the manufacturer to detect and correct the source of product quality issues quickly enough to prevent shipping defective products, remedy the manufacturing problem, and resume production of quality product as soon as possible (Dean, 2014, pp. 216-219). Dean's other case studies describe how the big data phenomenon created value for companies in health care, "online brand management," and targeted marketing of "smartphone applications" (Dean, 2014, pp. 205-208, 225).
Dean concludes his book by describing what he views as some of the “opportunities” and “challenges” in the future of “big data, data mining, and machine learning” (Dean, 2014, p. 233). Regarding the challenges, Dean first discusses the recent focus on how difficult it seems to be to reproduce the results of published research studies; he advocates for “tighter controls and accountability” in order to ensure “people and organizations are held accountable for their published research findings” and thereby create a “firm foundation” of knowledge from which to advance the public good (Dean, 2014, pp. 233-234). Second, Dean discusses issues of “privacy with public data sets” and focuses on how it is possible to “deanonymize” large, publicly available data sets by combining those sets with “microdata,” i.e., data sets about “specific people” (Dean, 2014, pp. 234-235). Together, these two challenges raise the question of how to strike an ethical “balance between data privacy and reproducible research,” a question that “includes questions of legality as well as technology” and their “competing interests” (Dean, 2014, pp. 233-235). Regarding the opportunities, Dean first discusses the “internet of things” (IoT), notes the great contribution of machine-to-machine (M2M) communication to the big data era, and states that as IoT continues to “develop and mature, data volumes will continue to proliferate,” with M2M data growing to the extent that “data generated by humans will fall to a small percentage in the next ten years” (Dean, 2014, p. 236). Organizations capable of “capturing machine data and using it effectively,” according to Dean, will have a great “competitive advantage in the data mining space” in the near future (Dean, 2014, pp. 236-237). The next opportunity Dean explains is the trend toward greater standardization on “open source” software, which gives professionals greater freedom in transferring “their skills” across organizations and which will require organizations to integrate open source software with proprietary software so as to benefit both from less-expensive open source standards and from the “optimization,” “routine updates, technical support, and quality control assurance” offered by traditional “commercial vendors” (Dean, 2014, pp. 237-238). Finally, Dean discusses opportunities in the “future development of algorithms,” and while he acknowledges “new algorithms will be developed,” he also states “big data practitioners will need to” understand and apply “traditional methods,” since true advancements in algorithms will be “incremental” and slower than some will claim (Dean, 2014, pp. 238-239). Dean states his “personal research interest is in the ensemble tree and deep learning areas” (Dean, 2014, p. 239). Additionally, he notes interesting developments made by the Defense Advanced Research Projects Agency (DARPA) on “a new programming paradigm for managing uncertain information” called Probabilistic Programming for Advanced Machine Learning (PPAML) (Dean, 2014, p. 239). Dean closes his discussion of the future of algorithms by citing, as testaments to the “success” of “data mining” and “analytics,” the recent advances made in the “science of predictive algorithms,” in the ability of “machines” to better explore and find “patterns in data,” and in the IBM Watson system’s capabilities in “information recall,” in comprehending “nuance and context,” and in applying algorithms to analyze natural language and “deduce meaning” (Dean, 2014, pp. 239-241).
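The re-identification risk Dean raises can be illustrated with a small, hypothetical sketch: joining an “anonymized” public release with a microdata set that names specific people on shared quasi-identifiers re-attaches identities to sensitive fields. The records and field names below are fabricated for illustration only and do not come from Dean’s examples.

```python
# Hypothetical sketch of re-identification by record linkage: an "anonymized"
# public data set is joined to microdata about named people on shared
# quasi-identifiers (ZIP code, birth year, sex). All records are made up.
import pandas as pd

public = pd.DataFrame({          # "anonymized" release: names stripped, sensitive field kept
    "zip": ["53703", "53703", "60614"],
    "birth_year": [1980, 1975, 1980],
    "sex": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

microdata = pd.DataFrame({       # separate data set that names specific people
    "name": ["A. Jones", "B. Smith"],
    "zip": ["53703", "60614"],
    "birth_year": [1980, 1980],
    "sex": ["F", "F"],
})

# Joining on the quasi-identifiers re-attaches names to the sensitive field.
reidentified = microdata.merge(public, on=["zip", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```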
Regarding “the term ‘big data,’” Dean concludes that even though it “may become so overused that it loses its meaning,” the existence and “evolution” of its primary elements, namely “hardware, software and data mining techniques and the demand for working on large, complex analytical problems,” are guaranteed (Dean, 2014, p. 241).
References
Baehr, C. (2013). Developing a sustainable content strategy for a technical communication body of knowledge. Technical Communication, 60, 293-306.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15, 662-679.
Davenport, T. H., & Patil, D. J. (2012). Data scientist. Harvard Business Review, 90(5), 70-76.
Dean, J. (2014). Big data, data mining, and machine learning: Value creation for business leaders and practitioners. John Wiley & Sons.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Fan, W., & Bifet, A. (2012). Mining big data: Current status, and forecast to the future. SIGKDD Explorations, 14(2), 1-5.
Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, December). The Google file system. ACM SIGOPS Operating Systems Review, 37(5), 29-43.
Graham, S. S., Kim, S.-Y., Devasto, M. D., & Keith, W. (2015). Statistical genre analysis: Toward big data methodologies in technical communication. Technical Communication Quarterly, 24(1), 70-104. doi:10.1080/10572252.2015.975955
Mahrt, M., & Scharkow, M. (2013). The value of big data in digital media research. Journal of Broadcasting & Electronic Media, 57(1), 20-33.
McNely, B., Spinuzzi, C., & Teston, C. (2015). Contemporary research methodologies in technical communication. Technical Communication Quarterly, 24, 1-13.
Smith, J. (2016, February). Here’s how much money you make in the ‘sexiest job of the 21st century’. Business Insider. Retrieved from http://www.businessinsider.com/how-much-money-you-earn-in-the-sexiest-job-of-the-21st-century-2016-2
Wolfe, J. (2015). Teaching students to focus on the data in data visualization. Journal of Business and Technical Communication, 29, 344-359.