AB10 – Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners.

In the introduction to his book on the big data phenomenon, Jared Dean notes recent examples of big data’s impact, provides an extended definition of big data, and discusses some prominent issues debated in the field currently (Dean, 2014, pp. 1-12). In part one, Dean describes what he calls “the computing environment” including elements such as hardware, systems architectures, programming languages, and software used in big data projects, as well as how these elements interact (Dean, 2014, pp. 23-25). In part two, Dean explains a broad set of tactics for “turning data into business value” through the “methodology, algorithms, and approaches that can be applied to your data mining activities” (Dean, 2014, pp. 53-54). In part three, Dean examines cases of “large multinational corporations” that completed big data projects and “overcame major challenges in using their data effectively” (Dean, 2014, p. 194). In the final chapter of his book, Dean notes some of the trends, “opportunities,” and “serious challenges” he sees in the future of “big data, data mining, and machine learning” (Dean, 2014, p. 233).

Including data mining and machine learning in the title of his book, Dean highlights two fields of practice that managed and processed relatively large volumes of data long before the popular term big data was first used and, according to Dean, became “co-opted for self-promotion by many people and organizations with little or no ties to storing and processing large amounts of data or data that requires large amounts of computation” (Dean, 2014, pp. 9-10). Although Dean does not yet explain why so much attention has become focused on data mining and related fields, he says data is the “new oil,” the natural resource that is “plentiful but difficult and sometimes messy to extract,” and he says this natural resource requires “infrastructure” to “transport, refine, and distribute it” (Dean, 2014, p. 12). While noting “‘Big Data’ once meant petabyte scale” and was generally used in reference to “unstructured chunks of data mined or generated from the internet,” Dean proposes the usage of the term big data has “expanded to mean a situation where” organizations “have too much data to store effectively or compute efficiently using traditional methods” (Dean, 2014, p. 10). Furthermore, Dean proposes what he calls the current “big data era” is differentiated by both a) notable changes in perception, attitude, and behavior by those who have realized “organizations that use data to make decisions over time in fact do make better decisions” – and therefore attain competitive advantage warranting “investment in collecting and storing data for its potential future value” – and by b) the “rapid development, creation, and maturity of technologies to store, manipulate, and analyze this data in new and efficient ways” (Dean, 2014, pp. 4-5).
The main example of big data Dean cites in his book’s introduction illustrates big data’s impact by highlighting how it enabled “scientists” “to identify genetic markers” that allowed them to discover the drug tamoxifen used to treat breast cancer “is not 80% effective in patients but 100% effective in 80% of patients and ineffective in the rest” (Dean, 2014, p. 2). In commenting on this example of big data’s impact, Dean states “this type of analysis was not possible before” the “era of big data” because the “volume and granularity of the data was missing,” the “computational resources” required for the analysis were too scarce and expensive, and the “algorithms or modeling techniques” were too immature (Dean, 2014, p. 2). These types of discoveries, Dean states, have revealed to organizations the “potential future value” of data and have resulted in a “virtuous circle” where the realization of the value of data leads to increased allocation of resources to collect, store, and analyze data which leads to more valuable discoveries (Dean, 2014, p. 4). Although Dean mentions “credit” for first using the term big data is generally given to John Mashey who “in the late 1990s” “gave a series of talks to small groups about this big data tidal wave that was coming,” he also notes the “first academic paper was presented in 2000, and published in 2003, by Francis X. Diebold” (Dean, 2014, p. 3). With the broad parameters of Dean’s extended definition of big data thus outlined, Dean completes the introduction of his book with a discussion of some prominent issues currently debated in the field, such as when sampling data may continue to be preferable to using all available data or when the converse is true (Dean, 2014, pp. 13-21), when new sources of data should be incorporated into existing processes (Dean, 2014, p.
13), and, perhaps most importantly, when the benefit of the information produced by a big data process clearly outweighs the cost of the big data process and thereby value is created (Dean, 2014, p. 11).

Dean’s use of the term “data mining” in the first sentences of his initial remarks both to part one and to part two of his book emphasizes Dean’s awareness of big data’s lineage in previously existing academic disciplines and professional fields of practice, a lineage that can seem lost in recent mainstream explanations of the big data phenomenon that often invoke a few terms all beginning with the letter “v” (Dean, 2014, pp. 24, 54). In fact, Dean himself refers to these “v’s” of big data, although he adds the term “value” to the other three commonly used terms “volume,” “velocity,” and “variety” (Dean, 2014, p. 24). Dean states “data mining is going through a significant shift with the volume, variety, value, and velocity of data increasing significantly each year” and he discusses in a fair amount of detail throughout parts one and two of his book the resources and methodologies available to and applied by those capitalizing on the big data phenomenon (Dean, 2014, pp. 23-25).

Dean separates the data mining endeavor into two sets of elements corresponding to the first two parts of his book. Part one, which Dean calls “the computing environment,” includes elements such as hardware, systems architectures, programming languages, and software used in big data projects, as well as how these elements interact. Part two, which Dean calls “turning data into business value,” includes elements such as the “methodology, algorithms, and approaches that can be applied to” big data projects (Dean, 2014, pp. 23-25, 54).

Although Dean does not explicitly identify as such what many in the information technology (IT) industry call the workload – meaning the primary and supporting software applications and data required to accomplish some information technology objective, e.g. a data mining project – he does discuss at various points throughout his book how the software (applications) used in data mining has characteristics which determine how well the software will run on particular types of hardware and solution architectures (designs). Dean’s introductory remarks to part one describe this as the “interaction between hardware and software” and he notes specifically how “traditional data mining software was implemented by loading data into memory and running a single thread of execution over the data” and how this traditional implementation form meant the “process was constrained by the amount of memory available and the speed of the processor” (Dean, 2014, p. 24). This would mean in cases where the data volume was greater than the available RAM on the system, “the process would fail” (Dean, 2014, p. 24). In addition, Dean notes how software implemented with a “single thread of execution” cannot utilize the advantages of “multicore” CPUs and therefore contributes to imbalances in system utilization and thereby to project waste impeding performance/price optimization (Dean, 2014, pp. 24-25). Further emphasizing his point, Dean says “all software packages cannot take advantage of current hardware capacity” and he notes how “this is especially true of the distributed computing model,” a model he acknowledges as important by encouraging decision makers “to ensure that algorithms are distributed and effectively leveraging” currently available computing resources (Dean, 2014, p. 25).
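Dean's contrast between a single-threaded, fully in-memory pass and software built for modern hardware can be made concrete with a short sketch. The Python below is illustrative only (it is not code from Dean's book): it streams data in fixed-size chunks so the working set never has to fit in RAM at once, and because each chunk's partial sums are independent, a multithreaded or distributed engine could compute them on separate cores and combine the results.

```python
# Illustrative sketch (not from Dean's book): a traditional implementation
# would do sum(list(stream)), which fails once the data exceeds RAM.
# Chunked streaming keeps memory use bounded, and each chunk's partial
# result is independent work a multicore engine could parallelize.

def chunked(stream, size):
    """Yield successive fixed-size chunks from an iterable data stream."""
    chunk = []
    for record in stream:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def mean_chunked(stream, size=1000):
    """Compute a mean one chunk at a time; only `size` records are
    ever held in memory, regardless of total data volume."""
    total, count = 0.0, 0
    for chunk in chunked(stream, size):
        total += sum(chunk)   # partial result per chunk
        count += len(chunk)
    return total / count

print(mean_chunked(range(10_000)))  # -> 4999.5
```

A single-threaded engine performs these chunk sums serially; the point of Dean's "multicore" remark is that nothing in the arithmetic forces that ordering.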

In the first chapter of part one, Dean begins discussing the hardware involved in big data and he focuses on five primary hardware components: the storage, the central processing unit (CPU), the graphical processing unit (GPU), the memory (RAM), and the network (Dean, 2014, pp. 27-34). Regarding the storage hardware, Dean notes how although the price to performance and price to capacity ratios are improving, they may not be improving fast enough to offset increasing data volumes, volumes he says are “doubling every few years” (Dean, 2014, p. 28). Dean also draws attention to a few hardware innovations important to large-scale data storage and processing such as how external storage subsystems (disk arrays) and solid-state drives (SSDs) provide CPUs with faster access to larger volumes of data by improving “throughput” rates that in turn decrease the amount of time required to analyze vast quantities of data (Dean, 2014, pp. 28-29). Still, even though these innovations in data storage and access are improving data processing and analysis capabilities, Dean emphasizes the importance of the overall systems and their inter-relationships and how big data analytics teams must ensure they choose “analytical software” that can take advantage of improvements in data storage, access, and processing technologies, for example by choosing software that can “augment memory by writing intermediate results to disk storage” since disk storage is less expensive than memory, and that can utilize multi-core processors by executing multi-thread computations in parallel since that improves utilization rates of CPUs and utilization rates are a primary measure of efficiency used in analyzing costs relative to benefits (Dean, 2014, pp. 24, 28-29). Regarding CPU hardware, Dean notes how although the “famous Moore’s law” described the rapid improvements in processing power that “continued into the 1990s,” the law did not ensure sustainment of those same kinds of improvements (Dean, 2014, pp. 
29-30). In fact, Dean states “in the early 2000s, the Moore’s law free lunch was over, at least in terms of processing speed,” for various reasons, for example the heat generated by processors running at ultra-high frequencies is excessive, and therefore CPU manufacturers tried other means of improving CPU performance, for example by “adding extra threads of execution to their chips” (Dean, 2014, p. 30). Ultimately, even though the innovation in CPUs is different from what it was in the Moore’s law years of the 1980s and 1990s, innovation in CPUs continues and it remains true that CPU utilization rates are low relative to other system components such as mechanical disks, SSDs, and memory; therefore in Dean’s view it remains true the “mismatch that exists among disk, memory, and CPU” often is the primary problem constraining performance (Dean, 2014, pp. 29-30). Regarding GPUs, Dean discusses how they have recently begun to be used to augment system processing power, and he focuses on how some aspects of graphics problems and data mining problems are similar in that they require or benefit from performing “a huge number of very similar calculations” to solve “hard problems remarkably fast” (Dean, 2014, p. 31). Furthermore, Dean notes how in the past “the ability to develop code to run on the GPU was restrictive and costly,” but recent improvements in “programming interfaces for developing software” to exploit GPU resources are overcoming those barriers (Dean, 2014, p. 31). Regarding memory (RAM), Dean emphasizes its importance for data mining workloads due to its function as the “intermediary between the storage of data and the processing of mathematical operations that are performed by the CPU” (Dean, 2014, p. 32).
In discussing RAM, Dean provides some background by mentioning a few milestones in the development of RAM and related components, for example how previous 32-bit CPUs and operating systems (OSes) limited addressable memory to 4GB and how Intel’s and AMD’s introductions of 64-bit CPUs at commodity prices in the early 2000s, along with the release of 64-bit OSes to support their widespread adoption, expanded addressable RAM to 8TB at that time and thereby supported “data mining platforms that could store the entire data mining problem in memory” (Dean, 2014, p. 32). These types of advancements in technology coupled with improvements in the ratios of price to performance and price to capacity, including the “dramatic drop in the price of memory” during this same time period “created an opportunity to solve many data mining problems that previously were not feasible” (Dean, 2014, p. 32). Even with this optimistic view of overall advancements, Dean reiterates the advancement pace in RAM and in hard drives remains slow compared to the advancement pace in CPUs – RAM speeds “have increased by 10 times” while CPU speeds “have increased 10,000 times” and disk storage advancements have been slower than those in RAM – therefore it remains important to continue seeking higher degrees of optimization, for example by using distributed systems since “it is much less expensive to deploy a set of commodity systems” with high capacities of RAM than it is to use “expensive high-speed disk storage systems” (Dean, 2014, pp. 32-33). One of the disadvantages of distributed systems, however, is the network “bottleneck” existing between individual nodes in the cluster even when high-speed proprietary technologies such as InfiniBand are used (Dean, 2014, pp. 33-34).
In the case of less expensive, standard technologies, Dean notes the “standard network connection for an analytical computing cluster is 10 gigabit Ethernet (10 GbE), which has an upper-bound data transfer rate of 4 gigabytes per second (GB/sec)” (Dean, 2014, p. 34). Since Dean identifies the inter-node network as the slowest component of distributed computing systems, he emphasizes the importance of considering the “network component when evaluating data mining software” and notes skillful design and selection of the “software infrastructure” and “algorithms” is required to ensure efficient “parallelization” is possible while minimizing data movement and “communication between computers” (Dean, 2014, p. 34).
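Dean's identification of the inter-node network as the slowest component can be illustrated with back-of-envelope arithmetic. The figures below (a 10 TB dataset, and a ~1.25 GB/sec payload ceiling for a 10 Gb/sec link) are assumptions for illustration, not numbers from the book; they show why minimizing data movement, rather than speeding up computation, is often the first design concern in a cluster.

```python
# Illustrative arithmetic (assumed figures, not from Dean's book):
# how long does it take just to MOVE data across a cluster network?

def transfer_hours(dataset_bytes, rate_bytes_per_sec):
    """Hours needed purely to ship a dataset over the network."""
    return dataset_bytes / rate_bytes_per_sec / 3600

TB = 1024**4
# A 10 Gb/sec link carries at most ~1.25 GB/sec of payload.
rate = 1.25 * 1024**3

print(f"{transfer_hours(10 * TB, rate):.1f} hours to ship 10 TB")  # -> 2.3 hours to ship 10 TB
```

Hours of pure transfer time before any analysis begins is the cost that the "move the analytics to the data" strategy, discussed next, is designed to avoid.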

In the second chapter of part one, Dean discusses the advantages and disadvantages of different types of distributed systems and notes how the decreasing price of standard hardware – including “high-core/large-memory (massively parallel processing [MPP] systems) and clusters of moderate systems” – has improved the cost to benefit ratio of solving “harder problems” which he defines as “problems that consume much larger volumes of data, with much higher numbers of variables” (Dean, 2014, p. 36). The crucial development enabling this advance in “big data analytics,” according to Dean, is the capability and practice of moving “the analytics to the data” rather than moving the “data to the analytics” (Dean, 2014, p. 36). Cluster computing that effectively moves “the analytics to the data,” according to Dean, “can be divided into two main groups of distributed computing systems,” that is “database computing” systems based on the traditional relational database management system (RDBMS) and “file system computing” systems based on distributed file systems (Dean, 2014, pp. 36-37). Regarding database computing, Dean describes “MPP databases” that “began to evolve from traditional DBMS technologies in the 1980s” as “positioned as the most direct update for the organizational enterprise data warehouses (EDWs)” and explains “the technology behind” them as involving “commodity or specialized servers that hold data on multiple hard disks” (Dean, 2014, p. 37). In addition to MPP databases, Dean describes “in-memory databases (IMDBs)” as an evolution begun “in the 1990s” that has become a currently “popular solution used to accelerate mission-critical data transactions” for various industries willing to absorb the higher costs of high-capacity RAM in order to attain the increased performance possible when all data is stored in RAM (Dean, 2014, p. 37). 
Regarding file system computing, Dean notes while there are many available “platforms,” the “market is rapidly consolidating on Hadoop” due to the number of “distributions and tools that are compatible with its file system” (Dean, 2014, p. 37). Dean attributes the “initial development” of Hadoop to Doug Cutting and Mike Cafarella who “created” it “based” upon development they did on the “Apache open source web crawling project” called Nutch and upon “a paper published by Google that introduced the MapReduce paradigm for processing data on large clusters” (Dean, 2014, pp. 37-38). Summarizing the advantages of Hadoop, Dean explains it is “attractive because it can store and manage very large volumes of data on commodity hardware and can expand easily by adding hardware resources with incremental cost” (Dean, 2014, p. 38). This coupling of high-capacity data storage with incremental capital expenditures makes the cost to benefit ratio appear more attractive to organizations and enables them to rationalize storing “all available data in Hadoop” wagering on its potential future value even while understanding the large volume of data stored in Hadoop “is rarely (actually probably never) in an appropriate form for data mining” and will require additional resource expenditure to cleanse it, transform it, and even “augment” it with data stored in other repositories (Dean, 2014, pp. 38-39). At this point in chapter two of his book, Dean has completed an initial overview of hardware components and solution architectures commonly considered by those responsible for purchasing and implementing big data management projects.
Members of the IT organization, according to Dean, are often responsible for these decisions, although they may be expected to collaborate with other organizational stakeholders to understand organizational needs and objectives and to explain the advantages and disadvantages of various technological options and potential “trade-offs” that should be considered (Dean, 2014, pp. 39-40). Near the end of chapter two, Dean illustrates some of these factors (criteria) and “big data technologies” in a comparative table which ranks some of the solutions he has discussed (e.g. IMDBs, MPPDBs, and Hadoop) according to the degree (high, medium, low) to which they possess some features or capabilities often required in big data solutions (e.g. maintaining data integrity, providing high-availability to the data, and handling “unstructured data”) (Dean, 2014, p. 40). Concluding chapter two, Dean notes selection of the optimal “computing platform” for big data projects depends “on many dimensions, primarily the volume of data (initial, working set, and output data volumes), the pattern of access to the data, and the algorithm for analysis” (Dean, 2014, p. 41). Furthermore, he states these dimensions will “vary” at different phases of the data analysis (Dean, 2014, p. 41).
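The MapReduce paradigm that Dean credits as the basis of Hadoop can be sketched as three small functions. The single-process toy below (hypothetical word-count data, no Hadoop involved) only illustrates the map, shuffle, and reduce phases that Hadoop distributes across cluster nodes:

```python
from collections import defaultdict
from itertools import chain

# Toy, single-process sketch of the MapReduce paradigm behind Hadoop.
# Hadoop runs the same three phases, but spread across many machines.

def map_phase(document):
    """Map: emit a (key, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "data mining and big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts["data"])  # -> 3
```

Because each `map_phase` call and each key's `reduce_phase` are independent, the cluster can assign them to whichever node already holds the data, which is what "moving the analytics to the data" means in practice.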

In the third chapter of part one, Dean departs from the preparation of data and the systems used to store data and moves on to the “analytical tools” that enable people to “create value” from the data (Dean, 2014, p. 43). Dean’s discussion of analytical tools focuses on five software applications and programming languages commonly used for large-scale data processing and analysis and he notes some of the strengths and weaknesses of each one. Beginning with the open source data mining software “Weka (Waikato Environment for Knowledge Analysis),” Dean describes it as “fully implemented in Java” and “notable for its broad range of extremely advanced training algorithms, its work flow graphical user interface (GUI)” and its “data visualization” capabilities (Dean, 2014, pp. 43-44). A weakness of Weka in Dean’s view is that it “does not scale well for big data analytics” since it “is limited to available RAM resources” and therefore its “documentation directs users to its data preprocessing and filtering algorithms to sample big data before analysis” (Dean, 2014, p. 44). Even with this weakness, however, Dean states many of Weka’s “most powerful algorithms” are only available in Weka and that its GUI makes it “a good option” for those without Java programming experience who need “to prove value” quickly (Dean, 2014, p. 44). When organizations need to “design custom analytics platforms,” Dean cites “Java and JVM languages” as “common choices” because of Java’s “considerable development advantages over lower-level languages” such as FORTRAN and C that “execute directly on native hardware” especially since “technological advances in the Java platform” have improved its performance “for input/output and network-bound processes like those at the core of many open source big data applications” (Dean, 2014, pp. 44-45). 
As evidence of these improvements, Dean notes how Apache Hadoop, the popular “Java-based big data environment,” won the “2008 and 2009 TeraByte Sort Benchmark” (Dean, 2014, p. 45). Overall, Dean’s perspective is the performance improvements in Java and the increasing “scale and complexity” of “analytic applications” have converged such that Java’s advantages in “development efficiency” along with “its rich libraries, many application frameworks, inherent support for concurrency and network communications, and a preexisting open source code base for data mining functionality” now outweigh some of its known weaknesses such as “memory and CPU-bound  performance” issues (Dean, 2014, p. 45). In addition to Java, Dean mentions “Scala and Clojure” as “newer languages that also run on the JVM and are used for data mining applications” (Dean, 2014, p. 45). Dean describes “R” as an “open source fourth-generation programming language designed for statistical analysis” that is gaining in “prominence” and popularity in the rapidly expanding “data science community” as well as in academia and “in the private sector” (Dean, 2014, p. 47). Evidence of R’s growing popularity is its ranking in the “2013 TIOBE general survey of programming languages,” in which it ranked “in 18th place in overall development language popularity” alongside “commercial solutions like SAS (at 21st) and MATLAB (at 19th)” (Dean, 2014, p. 47). Among the advantages of R, there are “thousands of extension packages” enabling customization to include “everything from speech analysis, to genomic science, to text mining,” in addition to its “impressive graphics, free and polished integrated development environments (IDEs), programmatic access to and from many general-purpose languages, and interfaces with popular proprietary analytics solutions including MATLAB and SAS” (Dean, 2014, p. 47).  
Python is described by Dean as “designed to be an extensible, high-level language with a large standard library and simple, expressive syntax” that “can be used interactively or programmatically” and that is often “deployed for scripting, numerical analysis, and OO general-purpose and Web application development” (Dean, 2014, p. 49). Dean highlights Python’s “general programming strengths” and “many database, mathematical, and graphics libraries” as particularly beneficial “in the data exploration and data mining problem domains” (Dean, 2014, p. 49). Although Dean provides a long list of advantageous features of Python, he asserts “the maturity” of its “scikit-learn toolkit” as a primary factor in its recently higher adoption rates “in the data mining and data science communities” (Dean, 2014, p. 49). Last, Dean describes SAS as “the leading analytical software on the market” and cites reports by IDC, Forrester, and Gartner as evidence (Dean, 2014, p. 50). In their most recent reports at the time Dean published this book, both Forrester and Gartner had named SAS “as the leading vendor in predictive modeling and data mining” (Dean, 2014, p. 50). Dean describes “the SAS System” as composed of “a number of product areas including statistics, operations research, data management, engines for accessing data, and business intelligence (BI),” although he states the products “SAS/STAT, SAS Enterprise Miner, and the SAS text analytics suite” are most “relevant” in the context of this book (Dean, 2014, p. 50). Dean explains “the SAS system” can be “divided into two main areas: procedures to perform an analysis and the fourth-generation language that allows users to manipulate data” (Dean, 2014, p. 50). Dean illustrates one of the advantages of SAS by providing an example of how a SAS proprietary “procedure, or PROC” simplifies the code required to perform specific analyses such as “building regression models” or “doing descriptive statistics” (Dean, 2014, p. 51).
Another great advantage “of SAS over other software packages is the documentation” that includes “over 9,300 pages” for the “SAS/STAT product alone” and over “2,000 pages” for the Enterprise Miner product (Dean, 2014, p. 51). As additional evidence of the advantages of SAS products, Dean states he has never encountered an “analytical challenge” he has “not been able to accomplish with SAS” and notes that recent, “major changes” in the SAS “architecture” have enabled it “to take better advantage of the processing power and falling price per FLOP (floating point operations per second) of modern computing clusters” (Dean, 2014, pp. 50, 52). With his explanations of the big data computing environment (i.e. hardware, systems architectures, software, and programming languages) and some aspects of the big data preparation phase completed in part one, Dean turns to part two in which he addresses in depth exactly how big data in general and predictive modeling in particular enable “value creation for business leaders and practitioners,” a phrase he uses as the subtitle of his book.

Dean introduces part two of his book by stating he will address over the next seven chapters the “methodology, algorithms, and approaches that can be applied to” data mining projects, including a general four-step “process of building models” he says “has been developed and refined by many practitioners over many years,” and also including the “SEMMA approach,” a “data mining methodology, created by SAS, that focuses on logical organization of the model development phase of data mining projects” (Dean, 2014, pp. 54, 58, 61). The SEMMA approach, according to Dean, “has been in place for over a decade and proven useful for thousands and thousands of users” (Dean, 2014, p. 54). Dean says his objective is to “explain a methodology for predictive modeling” since accurately “predicting future behavior” provides people and organizations “a distinct advantage regardless of the venue” (Dean, 2014, p. 54). In his explanation of data mining methodologies, Dean notes he will discuss “the types of target models, their characteristics” and their business applications (Dean, 2014, p. 54). In addition, Dean states he will discuss “a number of predictive modeling techniques” including the “fundamental ideas behind” them, “their origins, how they differ, and some of their drawbacks” (Dean, 2014, p. 54). Finally, Dean says he will explain some “more modern methods for analysis or analysis for specific types of data” (Dean, 2014, p. 54).

Dean begins chapter four by stating predictive modeling is one of the primary data mining endeavors and he defines it as a process in which collected “historical data (the past)” are explored to “identify patterns in the data that are seen through some methodology (the model), and then using the model” to predict “what will happen in the future (scoring new data)” (Dean, 2014, p. 55). Next, Dean discusses the multi-disciplinary nature of the field utilizing a Venn diagram from SAS Enterprise Miner training documentation that includes the following contributing disciplines: data mining, knowledge discovery and data mining (KDD), statistics, machine learning, databases, data science, pattern recognition, computational neuroscience, and artificial intelligence (AI) (Dean, 2014, p. 56). Dean notes his tendency to use “algorithms that come primarily from statistics and machine learning” and he explains how these two disciplines, residing in different “university departments” as they do, produce graduates with different knowledge and skills. Graduates in statistics, according to Dean, tend to understand “a great deal of theory” but have “limited programming skills,” while graduates in computer science tend to “be great programmers” who understand “how computer languages interact with computer hardware, but have limited training in how to analyze data” (Dean, 2014, p. 56). The result of this, Dean explains, is “job applicants will likely know only half the algorithms commonly used in modeling,” with the statisticians knowing “regression, General Linear Models (GLMs), and decision trees” and the computer scientists knowing “neural networks, support vector machines, and Bayesian methods” (Dean, 2014, p. 56).
Before moving into a deeper discussion of the process of building predictive models, Dean notes a few “key points about predictive modeling,” namely that a) “sometimes models are wrong,” b) “the farther your time horizon, the more uncertainty there is,” and c) “averages (or averaging techniques) do not predict extreme values” (Dean, 2014, p. 57). Elaborating further, Dean says even though models may be wrong (i.e. there is a known margin of error), the models can still be useful for making decisions (Dean, 2014, p. 57). And finally, Dean emphasizes “logic and reason should not be ignored because of a model result” (Dean, 2014, p. 58).

In the next sections of chapter four, Dean explains in detail “a methodology for building models” (Dean, 2014, p. 58). First, he discusses the general “process of building models” as a “simple, proven approach to building successful and profitable models” and he explains the general process in four phases: “1. Prepare the data,” “2. Perform exploratory data analysis,” “3. Build your first model,” and “4. Iteratively build models” (Dean, 2014, pp. 58-60). The first phase of preparing the data, Dean says, is likely completed by “a separate team” and requires understanding “the data preparation process within” an organization, namely “what data exists” in the organization (i.e. the sources of data) and how data from various sources “can be combined” to “provide insight that was previously not possible” (Dean, 2014, p. 58). Dean emphasizes the importance of having access to “increasingly larger and more granular data” and points directly to the “IT organization” as the entity that should be “keeping more data for longer and at finer levels” to ensure their organizations are not “behind the trend” and not “at risk for becoming irrelevant” in the competitive landscape (Dean, 2014, pp. 58-59). The second phase of the model building process focuses on exploring the data to begin to “understand” it and “to gain intuition about relationships between variables” (Dean, 2014, p. 59). Dean emphasizes “domain expertise” is important at this phase to ensure “thorough analysis” and recommendations that avoid unwarranted focus on patterns that may seem significant to the data miner but are in fact already widely known to domain experts – a distinction the data miner, though skilled in data analysis, may lack the domain knowledge to recognize (Dean, 2014, p. 59).
Dean notes recent advances in “graphical tools” for data mining have simplified the data exploration process – a process that was once much slower and often required “programming skills” – to the degree that products from large and small companies such as “SAS, IBM, and SAP,” “QlikTech,” and “Tableau” enable users to easily “load data for visual exploration” and “have been proven to work with” projects involving “billions of observations” when “sufficient hardware resources are available” (Dean, 2014, p. 59). To ensure efficiency in the data exploration phase, Dean asserts the importance of adhering to the “principle of sufficiency” and to the “law of diminishing returns” so that exploration stops while the cost to benefit ratio is optimal (Dean, 2014, pp. 59-60). The third phase of the model-building process is to build the first model while acknowledging a “successful model-building process will involve many iterations” (Dean, 2014, p. 60). Dean recommends working rapidly with a familiar method to build the first model and states he often prefers to “use a decision tree” because he is comfortable with it (Dean, 2014, p. 60). This first model created is used as the “champion model” (benchmark) against which the next model iteration will be evaluated (Dean, 2014, p. 60). The fourth phase of the model-building process is where most time should be devoted and where the data miner will need to use “some objective criteria that defines the best model” in order to determine if the most recent model iteration is “better than the champion model” (Dean, 2014, p. 60). Dean describes this step as “a feedback loop” since the data miner will continue comparing the “best” model built thus far with the next model iteration until either “the project objectives are met” or some other constraint such as a deadline requires stopping (Dean, 2014, p. 60).
With this summary of the general model-building process finished, Dean next explains SAS’s sEMMA approach that “focuses on the model development phase” of the general model-building process and “makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes and confirm a model’s accuracy” (Dean, 2014, p. 61). “The acronym sEMMA,” Dean states, “refers to the core process of conducting data mining” and stands for “sample, explore, modify, model,” and “assess” (Dean, 2014, p. 61). To explain sEMMA further, Dean addresses “a common misconception” by emphasizing “sEMMA is not a data mining methodology but rather a logical organization of the functional tool set” of the SAS Enterprise Miner product that can be used appropriately “as part of any iterative data mining methodology adopted by the client” (Dean, 2014, p. 61). Once the best model has been found under given constraints, this “champion model” can “be deployed to score” new data – this is “the end result of data mining” and the point at which “return on investment” is realized (Dean, 2014, p. 63). At this point, Dean explains some of the advantages of using SAS’s Enterprise Miner, such as automation of “the deployment phase by supplying scoring code in SAS, C, Java, and PMML” and capture of “code for pre-processing activities” (Dean, 2014, p. 63).

Following his overview of the model-building process, Dean identifies the three “types of target models,” discusses “their characteristics,” provides “information about their specific uses in business,” and then explains some common ways of “assessing” (evaluating) predictive models (Dean, 2014, pp. 54, 64-70). The first target model Dean explains is “binary classification,” which in his experience “is the most common type of predictive model” (Dean, 2014, p. 64). Binary classification is often used to give “decision makers” a “system to arrive at a yes/no decision with confidence” and to do so fast – for example, to approve or disapprove a credit application or to launch or not launch a spacecraft (Dean, 2014, p. 64). Dean explains further, however, there are cases when a prediction of “the probability that an event will or will not occur” is “much more useful than the binary prediction itself” (Dean, 2014, p. 64). In the case of weather forecasts, for example, most people would prefer to know the “confidence estimate” (“degree of confidence,” confidence level, probability) the forecaster assigns to each possible outcome, expressed as a percentage – it is easier to decide whether to carry an umbrella if one knows the forecaster predicts a ninety-five percent chance of rain than if one knows only that the forecaster predicts it will rain or it will not rain (Dean, 2014, p. 64). The second target model Dean explains is “multilevel or nominal classification,” which he describes as useful when one is interested in creating “more than two levels” of classification (Dean, 2014, p. 64). 
As an example, Dean describes how preventing credit card fraud while facilitating timely transactions could mean the initial decision regarding a transaction includes not only the binary classifications of accept or decline, but also an exception classification of requires further review before a decision can be made to accept or decline (Dean, 2014, p. 64). Although beneficial in some cases, Dean notes nominal classification “poses some additional complications from a computational and also reporting perspective” since it requires finding the probability of all events prior to computing the probability of the “last level,” adds the “challenge in computing the misclassification rate,” and requires the “report value be calibrated” for easier interpretation by readers of the report (Dean, 2014, p. 64). The final target model Dean explains is “interval prediction,” which he describes as “used when the target level is continuous on the number line” (Dean, 2014, p. 66). This model, according to Dean, is often used in the insurance industry, which generally determines premium prices based on “three different types of interval predictive models including claim frequency, severity, and pure premium” (Dean, 2014, p. 67). Since insurance companies will implement the models differently based on each company’s “historical data” and each customer’s “specific information” – including, in the automotive sector as an example, the customer’s car type, yearly driving distance, and driving record – the insurance companies will arrive at different premium prices for certain classifications of customers (Dean, 2014, p. 67).
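Dean’s point that nominal classification requires scoring every other level before the last level’s probability can be computed reduces to simple arithmetic, since the level probabilities must sum to one. The transaction scores below are hypothetical numbers chosen for illustration.

```python
def last_level_probability(other_level_probs):
    """Probability of the final level, given probabilities for all other levels.

    In a nominal (multilevel) classification, the level probabilities must sum
    to 1, so the last level is whatever probability mass remains.
    """
    total = sum(other_level_probs)
    assert 0.0 <= total <= 1.0, "level probabilities cannot exceed 1"
    return 1.0 - total

# Hypothetical card-transaction scores for accept and decline;
# "requires further review" is the last level, computed from the rest.
p_accept, p_decline = 0.70, 0.25
p_review = last_level_probability([p_accept, p_decline])
```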

Having explained the three types of target models, Dean finishes chapter four by discussing how to evaluate “which model is best” given a particular predictive modeling problem and the available data set (Dean, 2014, p. 67). He establishes the components of a “model” as “all the transformations, imputations, variable selection, variable binning, and so on manipulations that are applied to the data in addition to the chosen algorithm and its associated parameters” (Dean, 2014, p. 67). Noting the inherent subjectivity of determining the “best” model, Dean asserts the massive “number of options and combinations makes a brute-force” approach “infeasible,” and therefore a “common set of assessment measures” has arisen; moreover, partitioning a data set into a larger “training partition” and a smaller “validation partition” to use for “assessment” has become “best practice” for understanding the degree to which a “model will generalize to new incoming data” (Dean, 2014, pp. 67, 70). As the foundation for his explanation of “assessment measures,” Dean identifies “a set” of them “based on the 2×2 decision matrix” he illustrates in a table with the potential outcomes of “nonevent” and “event” as the row headings and the potential predictions of “predicted nonevent” and “predicted event” as the column headings (Dean, 2014, p. 68). The values in the four table cells logically follow as “true negative,” “false negative,” “false positive,” and “true positive” (Dean, 2014, p. 68). This classification method is widely accepted, according to Dean, because it “closely aligns with what most people associate as the ‘best’ model, and it measures the model fit across all values” (Dean, 2014, p. 68). In addition, Dean notes the “proportion of events to nonevents” when using the classification method should be “approximately equal” or “the values need to be adjusted for making proper decisions” (Dean, 2014, p. 68). 
Once all observations are classified according to the 2×2 decision matrix, the “receiver operating characteristics (ROC) are calculated for all points and displayed graphically for interpretation” (Dean, 2014, p. 68). In another table in this section, Dean provides the “formulas to calculate different classification measures” such as the “classification rate (accuracy),” “sensitivity (true positive rate),” and “1-specificity (false positive rate),” among others (Dean, 2014, p. 68). Other assessment measures explained by Dean are “lift” and “gain” and the statistical measures Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Kolmogorov-Smirnov (KS) statistic (Dean, 2014, pp. 69-70).
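The measures Dean tabulates follow directly from the four cells of the 2×2 decision matrix. This is a minimal sketch with hypothetical cell counts; the formulas are the standard definitions of accuracy, sensitivity, and 1-specificity, not reproductions of Dean’s table.

```python
def classification_measures(tp, fp, tn, fn):
    """Standard measures computed from the four cells of a 2x2 decision matrix.

    tp/fp/tn/fn: counts of true positives, false positives, true negatives,
    and false negatives.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # classification rate
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "one_minus_specificity": fp / (fp + tn),       # false positive rate
    }

# Hypothetical counts for 100 scored observations.
m = classification_measures(tp=40, fp=10, tn=45, fn=5)
```

Sweeping the model’s probability cutoff and plotting sensitivity against 1-specificity at each cutoff yields the ROC curve Dean describes.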

In chapter five, Dean begins discussing in greater detail what he calls “the key part of data mining.” This work follows and is founded upon the deployment and implementation phases that established the hardware infrastructure, supporting software platform, and data mining software to be used, as well as upon the data preparation phase – all phases that in “larger organizations” will likely be performed by cross-functional teams including specialists from various business units and the information technology division (Dean, 2014, pp. 54, 58, 71). While Dean provides in chapter four an overview of predictive modeling processes and methodologies, including the foundational target model types and how to evaluate (assess) the effectiveness of those model types given particular modeling problems and data sets, it is in chapters five through ten that Dean thoroughly discusses and explains the work of the “data scientists” who are responsible for creating models that will predict the future and provide return on investment and competitive advantage to their organizations (Dean, 2014, pp. 55-70, 71-191). Data scientists, according to Dean, are responsible for executing best practice in trying “a number of different modeling techniques or algorithms and a number of attempts within a particular algorithm using different settings or parameters” to find the best model for accomplishing the data mining objective (Dean, 2014, p. 71). Dean explains data scientists will need to conduct “many trials” in a “brute force” effort to “arrive at the best answer” (Dean, 2014, p. 71). Although Dean notes he focuses primarily on “predictive modeling or supervised learning,” which “has a target variable,” the “techniques can be used” also “in unsupervised approaches to identify the hidden structure of a set of data” (Dean, 2014, p. 72). 
Without covering Dean’s rather exhaustive explanation throughout chapter five of the most “common predictive modeling techniques” and throughout chapters six through ten of “a set of methods” that “address more modern methods for analysis or analysis for specific type of data,” let it suffice to say he presents in each section pertaining to a particular modeling method that method’s “history,” an “example or story to illustrate how the method can be used,” “a high-level mathematical approach to the method,” and “a reference section” pointing to more in-depth materials on each method (Dean, 2014, pp. 54, 71-72). In chapter five, Dean explains modeling techniques such as recency, frequency, and monetary (RFM) modeling, regression (originally known as “‘least squares’”), generalized linear models (GLMs), neural networks, decision and regression trees, support vector machines (SVMs), Bayesian network classification, and “ensemble methods” that combine models (Dean, 2014, pp. 71-126). In chapters six through ten, Dean explains modeling techniques such as segmentation, incremental response modeling, time series data mining, recommendation systems, and text analytics (Dean, 2014, pp. 127-180).

Sharing his industry experience in part three, Dean provides “a collection of cases that illustrate companies that have been able to” collect big data, apply data “analytics” to “well-stored and well-prepared data,” “find business value,” and “improve the business” (Dean, 2014, p. 194). Dean’s case study of a “large U.S.-based financial services” company demonstrates how it attained its “primary objective” to improve the accuracy of its predictive model used in marketing campaigns, to “move the model lift from 1.6 to 2.5,” and thereby to significantly increase the number of customers who responded to its marketing campaigns (Dean, 2014, pp. 198, 202-203). Additionally, the bank attained its second objective, which “improved operational processing efficiency and responsiveness” and thereby increased “productivity for employees” (Dean, 2014, pp. 198, 203). Another case study, of “a technology manufacturer,” shows how it “used the distributed file system of Hadoop along with in-memory computational methods” to reduce the time required to “compute a correlation matrix” identifying sources of product quality issues “from hours to just a few minutes.” This speed enabled the manufacturer to detect and correct the source of quality issues quickly enough to prevent shipping defective products and to resume production of quality product as soon as possible (Dean, 2014, pp. 216-219). Dean’s other case studies describe how the big data phenomenon created value for companies in health care, “online brand management,” and targeted marketing of “smartphone applications” (Dean, 2014, pp. 205-208, 225).

Dean concludes his book by describing what he views as some of the “opportunities” and “challenges” in the future of “big data, data mining, and machine learning” (Dean, 2014, p. 233). Regarding the challenges, Dean first discusses the focus in recent years on how difficult it seems to be to reproduce the results of published research studies, and he advocates for “tighter controls and accountability” to ensure “people and organizations are held accountable for their published research findings” and thereby create a “firm foundation” of knowledge from which to advance the public good (Dean, 2014, pp. 233-234). Second, Dean discusses issues of “privacy with public data sets” and focuses on how it is possible to “deanonymize” large, publicly available data sets by combining those sets with “microdata” sets, i.e., data sets about “specific people” (Dean, 2014, pp. 234-235). These two challenges combined raise issues concerning how to strike an ethical “balance between data privacy and reproducible research” that “includes questions of legality as well as technology” and their “competing interests” (Dean, 2014, pp. 233-235). Regarding the opportunities, Dean first discusses the “internet of things” (IoT), notes the great contribution of machine-to-machine (M2M) communication to the big data era, and states that as IoT technologies “develop and mature, data volumes will continue to proliferate,” and M2M data will grow to the extent that “data generated by humans will fall to a small percentage in the next ten years” (Dean, 2014, p. 236). Organizations capable of “capturing machine data and using it effectively,” according to Dean, will have great “competitive advantage in the data mining space” in the near future (Dean, 2014, pp. 236-237). 
The next opportunity Dean explains is the trend toward greater standardization upon “open source” software, which gives professionals greater freedom in transferring “their skills” across organizations and which will require organizations to integrate open source software with proprietary software to benefit both from less-expensive, open source standards and from the “optimization,” “routine updates, technical support, and quality control assurance” offered by traditional “commercial vendors” (Dean, 2014, pp. 237-238). Finally, Dean discusses opportunities in the “future development of algorithms,” and while he acknowledges “new algorithms will be developed,” he also states “big data practitioners will need to” understand and apply “traditional methods” since true advancements in algorithms will be “incremental” and slower than some will claim (Dean, 2014, pp. 238-239). Dean states his “personal research interest is in the ensemble tree and deep learning areas” (Dean, 2014, p. 239). Additionally, he notes interesting developments made by the Defense Advanced Research Projects Agency (DARPA) on “a new programming paradigm for managing uncertain information” called Probabilistic Programming for Advanced Machine Learning (PPAML) (Dean, 2014, p. 239). The end of Dean’s discussion of the future of algorithms cites as testaments to the “success” of “data mining” and “analytics” the recent advances made in the “science of predictive algorithms,” in the ability of “machines” to better explore and find “patterns in data,” and in the IBM Watson system’s capabilities in “information recall,” in comprehending “nuance and context,” and in applying algorithms to analyze natural language and “deduce meaning” (Dean, 2014, pp. 239-241). 
Regarding “the term ‘big data,’” Dean concludes that even though it “may become so overused that it loses its meaning,” the existence and “evolution” of its primary elements – “hardware, software and data mining techniques and the demand for working on large, complex analytical problems” – is guaranteed (Dean, 2014, p. 241).

AB09 – Wolfe, J. (2015). Teaching students to focus on the data in data visualization.

This “pedagogical reflection” by Joanna Wolfe utilizes “Perelman and Olbrechts-Tyteca’s concept of interpretative level” to elucidate the rhetorical decisions people make when selecting and presenting data and to provide a theoretical foundation for two exercises and a formal assignment Wolfe designed to teach “data visualization” and “writing about data” in communication courses (Wolfe, 2015, pp. 344-345, 348).

According to Wolfe, “interpretative level” is used by Perelman and Olbrechts-Tyteca “to describe the act of choosing between competing, valid interpretations” (Wolfe, 2015, p. 345). In relation to data specifically, Wolfe states the concept can be applied to describe “the choice we make to summarize data on variable x versus variable y” and explains further by emphasizing how people decide whether data are presented as, for example, “averages versus percentages or raw counts” and how those choices have “dramatic consequences for the stories we might tell about data” (Wolfe, 2015, pp. 345-346).

By focusing on interpretative level, Wolfe hopes to address what she perceives as a deficiency in technical communication textbooks: their failure to address strategic concerns that would encourage authors to “return to the data to reconsider what data are selected, how they are summarized, and whether they should be synthesized with other data for a more compelling argument” (Wolfe, 2015, p. 345). Although Wolfe praises the communication literature and technical communication textbooks for addressing tactical concerns such as aligning visualization designs with the type of data, adjusting visualizations for specific audiences, or considering “how to ethically represent data,” she proposes greater involvement with the data to address strategic concerns such as what the rhetorical purpose and context are and which tactics should be used to advance the overall rhetorical strategy (Wolfe, 2015, p. 348).

In the main body of her paper, Wolfe explains the two exercises and formal assignment she designed to teach students the interpretative level concept and to enable them to practice using it by creating data visualizations from actual data sets (Wolfe, 2015, pp. 348-356). In the first exercise, she demonstrates how deciding which variable to sort a data table on will determine which “‘story or narrative’” is immediately perceived by most viewers (Wolfe, 2015, p. 349), and she explains how to have students practice creating data visualizations to present the “‘fairest’” view of Olympics medal data (Wolfe, 2015, pp. 348-351). In the second exercise and in the formal assignment, Wolfe continues adding complexity by increasing the number of analytical points and potential visualization methods students should consider (Wolfe, 2015, pp. 351-355). This increased complexity allows Wolfe to discuss additional methods for visualizing data. She explains, for example, how to consolidate variables using point systems to provide an index score that better summarizes data, and how to use stylistic and organizational choices in visualizations to reveal patterns in the data that enable viewers to “derive conclusions” aligned with the authors’ decisions regarding rhetorical strategy (Wolfe, 2015, pp. 353-356).
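Wolfe’s first-exercise point, that sort order alone changes the perceived “story,” can be shown in a few lines of code; the medal counts and country names below are invented for illustration and are not Wolfe’s Olympic data.

```python
# The same (hypothetical) medal counts foreground different countries
# depending on which variable the table is sorted by.
medals = [
    {"country": "A", "gold": 10, "total": 20},
    {"country": "B", "gold": 8,  "total": 30},
    {"country": "C", "gold": 12, "total": 15},
]

# Sort descending by gold medals, then descending by total medals.
by_gold  = [row["country"] for row in sorted(medals, key=lambda r: -r["gold"])]
by_total = [row["country"] for row in sorted(medals, key=lambda r: -r["total"])]

# by_gold puts C on top; by_total puts B on top: two narratives, one data set.
```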

In conclusion, Wolfe proposes again that communication instruction regarding data visualization should go beyond teaching optimal data visualization tactics by introducing concepts such as interpretative level that encourage students to create rhetorical strategies – and to revisit the data and the analysis and the rhetorical purpose and context – and thereby invent “narratives” that attain those strategies (Wolfe, 2015, p. 357). This, according to Wolfe, will enable students to see data not “as pure, unmodifiable fact,” but “as a series of rhetorical choices” (Wolfe, 2015, p. 357).

AB08 – McNely, B., Spinuzzi, C., & Teston, C. (2015). Contemporary research methodologies in technical communication.

In Technical Communication Quarterly’s most recent special issue on research methods and methodologies, the issue’s guest editors assert “methodological approaches” are important “markers for disciplinary identity” and thereby agree with previous guest editor, Goubil-Gambrell, who in the 1998 special issue “argued that ‘defining research methods is a part of disciplinary development’” (McNely, Spinuzzi, & Teston, 2015, p. 2). Furthermore, the authors of the 2015 special issue revere the 1998 special issue as a “landmark issue” including ideas that “informed a generation of technical communication scholars as they defined their own objects of study, enacted their research ethics, and thought through their metrics” (McNely, et al., 2015, p. 9).

It is in this tradition the authors of the 2015 special issue both desire to review “key methodological developments” and associated theories forming the technical communication “field’s current research identity” and to preview and “map future methodological approaches” and relevant theories (McNely, et al., 2015, p. 2). The editors argue the approaches and theories discussed in this special edition of the journal “not only respond to” what they view as substantial changes in “tools, technologies, spaces, and practices” in the field over the past two decades, but also “innovate” by describing and modeling how these changes are informing technical communicators’ emerging research methodologies and theories as those methodologies and theories relate to the “field’s objects of study, research ethics, and metrics” (i.e. “methodo-communicative issues”) (McNely, et al., 2015, pp. 1-2, 6-7).

Reviewing what they see as the fundamental theories and research methodologies of the field, the authors explore how a broad set of factors (e.g. assumptions, values, agency, tools, technology, and contexts) manifest in work produced along three vectors of theory and practice they identify as “sociocultural theories of writing and communication,” “associative theories and methodologies,” and “the new material turn” (McNely, et al., 2015, p. 2). The authors describe the sociocultural vector as developing from theoretical traditions in “social psychology, symbolic interactionism,” “learning theory,” and “activity theory,” among others, and as essentially involving “purposeful human actors,” “material surroundings,” “heterogeneous artifacts and tools,” and even “cognitive constructs” combining in “concrete interactions” – that is, situations – arising from synchronic and diachronic contextual variables scholars may identify, describe, measure, and use to explain phenomena and theorize about them (McNely, et al., 2015, pp. 2-4). The authors describe the associative vector as developing from theoretical traditions in “articulation theory,” “rhizomatics,” “distributed cognition,” and “actor-network theory (ANT)” (McNely, et al., 2015, p. 4) and as essentially involving “symmetry—a methodological stance that ascribes agency to a network of human and nonhuman actors rather than to specific human actors” and therefore leading researchers to “focus on associations among nodes” as objects at the methodological nexus (McNely, et al., 2015, p. 4). 
The authors describe the new material vector as developing from theoretical traditions in “science and technology studies, political science, rhetoric, and philosophy” (with the overlap of the specific traditions from political science and philosophy often “collected under the umbrella known as ‘object-oriented ontology’”) and as essentially involving a “radically symmetrical perspective on relationships between humans and nonhumans—between people and things, whether those things are animal, vegetable, or mineral” and how these human and non-human entities integrate into “collectives” or “assemblages” that have “agency” one could view as “distributed and interdependent,” a phenomenon the authors cite Latour as labeling “interagentivity” (McNely, et al., 2015, p. 5).

Previewing the articles in this special issue, the editors acknowledge how technical communication methodologies have been “influenced by new materialisms and associative theories” and argue these methodologies “broaden the scope of social and rhetorical aspects” of the field and “encourage us to consider tools, technologies, and environs as potentially interagentive elements of practice” that enrich the field (McNely, et al., 2015, p. 6). At the same time, the editors mention how approaches such as “action research” and “participatory design” are advancing “traditional qualitative approaches” (McNely, et al., 2015, p. 6). In addition, the authors state “given the increasing importance of so-called ‘big data’ in a variety of knowledge work fields, mixed methods and statistical approaches to technical communication are likely to become more prominent” (McNely, et al., 2015, p. 6). Amidst these developments, the editors state their view that adopting “innovative methods” to “explore increasingly large data sets” while “remaining grounded in the values and aims that have guided technical communication methodologies over the previous three decades” may be one of the field’s greatest challenges (McNely, et al., 2015, p. 6).

In the final section of their paper, the authors explicitly return to what they seem to view as primary disciplinary characteristics (i.e. markers, identifiers), which they call “methodo-communicative issues,” and use those characteristics to compare the articles in the 1998 special issue with those in the 2015 special issue and to identify what they see as new or significant in the 2015 articles. The “methodo-communicative issues” or disciplinary characteristics they use are: “objects of study, research ethics, and metrics” (McNely, et al., 2015, pp. 6-7). Regarding objects of study, the authors note how in the 1998 special issue, Longo focuses on the “contextual nature of technical communication” while in the 2015 special issue, Read and Swarts focus on “networks and knowledge work” (McNely, et al., 2015, p. 7). Regarding ethics, the authors cite Blyer in the 1998 special issue as applying “critical” methods rather than “descriptive/explanatory methods” while in the 2015 special issue, Walton, Zraly, and Mugengana apply “visual methods” to create “ethically sound cross-cultural, community-based research” (McNely, et al., 2015, p. 7). Regarding metrics or “measurement,” the authors cite Charney in the 1998 special issue as contrasting the affordances of “empiricism” with “romanticism” while in the 2015 special issue, Graham, Kim, DeVasto, and Keith explore the affordances of “statistical genre analysis of larger data sets” (McNely, et al., 2015, p. 7). In their discussion of what is new or significant in the articles in the 2015 special issue, the editors highlight how some articles address particular methodo-communicative issues. 
Regarding metrics or “measurement,” for example, they highlight how Graham, Kim, DeVasto, and Keith apply Statistical Genre Analysis (SGA) – a hybrid research method combining rhetorical analysis with statistical analysis – to answer research questions such as which “specific genre features can be correlated with specific outcomes” across an “entire data set” rather than across selected exemplars (McNely, et al., 2015, p. 8).

In summary, the guest editors of this 2015 special issue on contemporary research methodologies both review the theoretical and methodological traditions of technical communication and preview the probable future direction of the field as portrayed in the articles included in this special issue.

AB07 – Ghemawat, S., Gobioff, H., & Leung, S. T. (2003). The Google File System.

When they published their paper in 2003, engineers at Google had already designed, developed, and implemented the Google File System (GFS) in an effort to sustain performance and control costs while providing the infrastructure, platform, and applications required to deliver Google’s services to users (Ghemawat, Gobioff, & Leung, 2003, p. 29). Although the authors acknowledge GFS has similar aims as existing distributed file systems, aims such as “performance, scalability, reliability, and availability,” they state GFS has dissimilar “design assumptions” arising from their “observations” of Google’s “application workloads and technological environment” (Ghemawat, et al., 2003, p. 29). In general, the authors describe GFS as “the storage platform for the generation and processing of data used by our service” and used by our “research and development efforts that require large data sets” (Ghemawat, et al., 2003, p. 29). In addition, they state that GFS is suitable for “large distributed data-intensive applications,” that it is capable of providing “high aggregate performance to a large number of clients,” and that it “is an important tool” that allows Google “to innovate and attack problems on the scale of the entire web” (Ghemawat, et al., 2003, pp. 29, 43).

In the introduction to their paper, the authors state the four primary characteristics of their “workloads and technological environment” as 1) “component failures are the norm rather than the exception,” 2) “files are huge by traditional standards,” 3) “most files are mutated by appending new data rather than overwriting existing data,” and 4) “co-designing the applications and the file system API benefits the overall system by increasing flexibility” (Ghemawat, et al., 2003, p. 29). Each of these observations aligns with (results in) what the authors call their “radically different points in the design space” (Ghemawat, Gobioff, & Leung, 2003, p. 29), which they elaborate in some detail both in the numbered list in the paper’s introduction and in the bulleted list in the second section, “Assumptions,” of the paper’s second part, “Design Overview” (Ghemawat, et al., 2003, p. 30). Considering the authors’ first observation, for example, that the “quantity and quality of the components virtually guarantee” parts of the system will fail and “will not recover,” it is reasonable to assert the design premises (assumptions) that the system is made of “inexpensive commodity components” and “must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis” (Ghemawat, et al., 2003, pp. 29-30). Considering the authors’ second observation, for example, “files are huge by traditional standards,” meaning “multi-GB files are common” and “the system stores a modest number of large files,” it is reasonable to assert the design premises (assumptions) that system parameters “such as I/O operation and block sizes” need “to be revisited” and re-defined in order to optimize the system for managing large files while maintaining support for managing small files (Ghemawat, et al., 2003, pp. 29-30). 
These two examples demonstrate the type of arguments and evidence the authors provide to support their claim GFS responds to fundamental differences between the data, workloads (software), and infrastructure (hardware) of traditional information technology and the data, workloads, and infrastructure Google needs to sustain its operations in contemporary and future information technology (Ghemawat, et al., 2003, pp. 29-33, 42-43). In the remaining passages of their paper’s introduction and in the first section of their design overview, the authors continue discussing Google’s technological environment by describing the third and fourth primary characteristics of the environment they have observed and by explaining corollary design premises and assumptions arising from those observations they applied to designing and developing GFS (Ghemawat, et al., 2003, pp. 29-30).

With the rationale for their work thus established, the authors move on in the remaining sections of their design overview to discuss the overall architecture of GFS. First, they introduce some features the authors imply are shared with other distributed file systems – for example, an API supporting “the usual operations to create, delete, open, close, read, and write files” – and some features the authors imply are unique to GFS – for example, “snapshot and record append operations” (Ghemawat, et al., 2003, p. 30). Next, they describe the main software components (functions or roles) included in a GFS implementation on a given “cluster” (set) of machines, namely the “GFS clients,” the “GFS master,” and the “GFS chunkservers.” The GFS clients enable communication between applications requiring data and the GFS master and GFS chunkservers providing data. The GFS master “maintains all file system metadata” and “controls system-wide activities.” The GFS chunkservers store the actual data (Ghemawat, et al., 2003, p. 31).

At this point in their paper the authors begin providing fairly detailed technical explanations of how these various GFS components interact, but I will mention only a few points the authors emphasize as crucial to the success of GFS. First, in contrast with some other distributed file systems, GFS is a "single master" architecture, which has both advantages and disadvantages (Ghemawat, et al., 2003, pp. 30-31). According to the authors, one advantage of "having a single master" is that it "vastly simplifies" the design of GFS and "enables the master to make sophisticated chunk placement and replication decisions using global knowledge" (Ghemawat, et al., 2003, pp. 30-31). A disadvantage of having only one master, however, is that its resources could be overwhelmed and it could become a "bottleneck" (Ghemawat, et al., 2003, p. 31). To overcome this potential disadvantage of the single master architecture, the authors explain how communication and data flow through the GFS architecture: GFS clients "interact with the master for metadata operations" but interact with the chunkservers for actual data operations (i.e., operations requiring alteration or movement of data), thereby relieving the GFS master from performing "common operations" that could overwhelm it (Ghemawat, et al., 2003, pp. 31, 43). Other important points include GFS's relatively large data "chunk size," its "relaxed consistency model," its elimination of the need for substantial client cache, and its use of replication instead of RAID to solve fault tolerance issues (Ghemawat, et al., 2003, pp. 31-32, 42).
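The last point, replication in place of RAID, can also be sketched in miniature: each chunk is written to several chunkservers, and a reader simply falls back to another replica when one server has failed. The function names and the failover loop are assumptions of this sketch; only the replication factor of three matches the paper:

```python
# Sketch of replication-based fault tolerance: three replicas per chunk,
# with readers falling back past failed servers. Illustrative only.

REPLICATION = 3  # GFS stores three replicas of each chunk by default

class ChunkServer:
    def __init__(self):
        self.chunks = {}   # chunk handle -> bytes
        self.alive = True

    def read(self, handle):
        if not self.alive:
            raise IOError("chunkserver down")
        return self.chunks[handle]

def write_chunk(servers, handle, data):
    """Place the chunk on REPLICATION distinct servers; return that list."""
    replicas = servers[:REPLICATION]
    for s in replicas:
        s.chunks[handle] = data
    return replicas

def read_chunk(replicas, handle):
    """Try replicas in turn, tolerating individual component failures."""
    for s in replicas:
        try:
            return s.read(handle)
        except IOError:
            continue  # routine failure: move on to the next replica
    raise IOError("all replicas lost")

servers = [ChunkServer() for _ in range(5)]
replicas = write_chunk(servers, handle=7, data=b"64MB-chunk-payload")
replicas[0].alive = False  # a commodity component fails, as expected
print(read_chunk(replicas, 7))  # b'64MB-chunk-payload'
```

This captures the design premise from the paper's assumptions: component failure is routine, so recovery is an ordinary code path rather than an exceptional hardware event.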

AB06 – Mahrt, M. & Scharkow, M. (2013). The value of big data in digital media research.

In their effort to promote "theory-driven" research strategies and to caution against the naïve embrace of "data-driven" research strategies – an embrace that seems to have culminated recently in a veritable "'data rush' promising new insights" into almost anything – the authors of this paper "review" a "diverse selection of literature on" digital media research methodologies and the Big Data phenomenon. They provide "an overview of ongoing debates" in this realm while arguing ultimately for a pragmatic approach based on "established principles of empirical research" and "the importance of methodological rigor and careful research design" (Mahrt & Scharkow, 2013, pp. 20-21, 26, 30).

Mahrt and Scharkow acknowledge the advent of the Internet and other technologies has enticed "social scientists from various fields" to utilize "the massive amounts of publicly available data about Internet users," and some scholars have enjoyed success in "giving insight into previously inaccessible subject matters" (Mahrt & Scharkow, 2013, p. 21). Still, the authors note, there are some "inherent disadvantages" to sourcing data from the Internet in general and from particular sites such as social media sites or gaming platforms (Mahrt & Scharkow, 2013, pp. 21, 25). One of the most commonly cited problems with such publicly available data is that "the problem of random sampling on which all statistical inference is based, remains largely unsolved" (Mahrt & Scharkow, 2013, p. 25). The data in Big Data essentially are "huge" amounts of data "'naturally' created by Internet users," "not indexed in any meaningful way," and with no "comprehensive overview" available (Mahrt & Scharkow, 2013, p. 21).

While Mahrt and Scharkow mention the positive attitude of "commercial researchers" toward a "golden future" for big data, they also mention the cautious attitude of academic researchers and explain how the "term Big Data has a relative meaning" (Mahrt & Scharkow, 2013, pp. 22, 25), contingent perhaps in part on these different attitudes. And although Mahrt and Scharkow imply most professionals would agree the big data concept "denotes bigger and bigger data sets over time," they also explain how "in computer science" researchers emphasize the concept "refers to data sets that are too big" to manage with "regular storage and processing infrastructures" (Mahrt & Scharkow, 2013, p. 22). This emphasis on data volume and data management infrastructure, familiar to computer scientists, may seem to some researchers in "the social sciences and humanities as well as applied fields in business" too narrowly focused on computational or quantitative methods, and this focus may seem exclusive and controversial in additional ways (Mahrt & Scharkow, 2013, pp. 22-23). Some of these additional controversies revolve around whether a "data analysis divide" may be developing that favors those with "the necessary analytical training and tools" over those without them (Mahrt & Scharkow, 2013, pp. 22-23). Others concern whether an overemphasis on "data analysis" may have contributed to the "assumption that advanced analytical techniques make theories obsolete in the research process" – as if the numbers, the "observed data," no longer require human interpretation to clarify meaning or to identify contextual or other confounding factors that may undermine the quality of the research and raise "concerns about the validity and generalizability of the results" (Mahrt & Scharkow, 2013, pp. 23-25).

Although Mahrt and Scharkow grant that advances in "computer-mediated communication," "social media," and other types of "digital media" may be "fueling methodological innovation" such as the analysis of large-scale data sets – so-called Big Data – and that the opportunity to participate is alluring to "social scientists" in many fields, the authors conclude their paper by citing Herring and others urging researchers to commit to "methodological training," "to learn to ask meaningful questions," and to continually "assess" whether collecting and analyzing massive amounts of data is truly valuable in any specific research endeavor (Mahrt & Scharkow, 2013, pp. 20, 29-30). The advantages of automated, big data research are numerous, as Mahrt and Scharkow concede: for instance, "convenience" and "efficiency," the elimination of research obstacles such as "artificial settings" and "observation effects," and the "visualization" of massive "patterns in human behavior" previously impossible to discover and render (Mahrt & Scharkow, 2013, pp. 24-25). With those advantages understood and granted, the authors' argument seems a reasonable reminder of the "established principles of empirical research" and of the occasional need to reaffirm the value of that tradition (Mahrt & Scharkow, 2013, p. 21).

AB05 – Baehr, Craig. (2013). Developing a sustainable content strategy for a technical communication body of knowledge.

People responsible for planning, creating, and managing information and information systems in the world today identify with various academic disciplines and business and industrial fields. As Craig Baehr explains, this can make it difficult to find or to develop and sustain a body of knowledge that represents the “interdisciplinary nature” of the technical communication field (Baehr, 2013, p. 294). In his article, Baehr describes his experience working with a variety of other experts to develop and produce a “large-scale knowledge base” for those who identify with the “technical communication” field and to ensure that knowledge base embodies a “systematic approach” to formulating an “integrated or hybrid” “content strategy” that considers the “complex set of factors” involved in such long-term projects, factors such as the “human user,” “content assets,” “technology,” and “sustainable practices” (Baehr, 2013, pp. 293, 295, 305).

Baehr defines a "body of knowledge" as representing the "breadth and depth of knowledge in the field with overarching connections to other disciplines and industry-wide practices" (Baehr, 2013, p. 294). As the author discusses, the digital age presents a unique set of challenges for those collecting and presenting knowledge that will attract and help scholars and practitioners. One important consideration Baehr discusses is the "two dominant, perhaps philosophical, approaches that characterize how tacit knowledge evolves into a more concrete product," an information and information systems product such as a website with an extensive content database and perhaps some embedded web applications. The two approaches Baehr describes are the "folksonomy" or "user-driven approach" and the "taxonomy" or "content-driven approach" (Baehr, 2013, p. 294). These two approaches affect aspects of the knowledge base such as the "findability" of its content and whether users are allowed to "tag" content to create a kind of "bottom-up classification" in addition to the top-down taxonomy created by the site's navigation categories (Baehr, 2013, p. 294). In regard to this particular project, Baehr explains how the development team used both a user survey and topics created through user-generated content to create "three-tiered Topic Lists" for the site's home page. While some of the highest-level topics such as "consulting" and "research" were taken from the user survey, second-level topics such as "big data" and third-level topics such as "application development" were taken from user-generated topics on discussion boards and from topics the development team gleaned from current technical communication research (Baehr, 2013, p. 304).
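The hybrid of top-down taxonomy and bottom-up folksonomy that Baehr describes can be pictured as a simple content model in which each item carries both an editor-assigned category and free-form user tags, and is findable by either route. The sample records and function names below are illustrative assumptions, not the knowledge base's actual content model:

```python
# Toy model of a hybrid content strategy: each article carries one
# editor-assigned category (top-down taxonomy) plus free-form user tags
# (bottom-up folksonomy), so it is findable by either route.

articles = [
    {"title": "Getting started in consulting",
     "category": "consulting",               # taxonomy (survey-derived)
     "tags": {"freelancing", "contracts"}},  # folksonomy (user-generated)
    {"title": "Visualizing large data sets",
     "category": "research",
     "tags": {"big data", "application development"}},
]

def by_category(articles, category):
    """Top-down findability: navigate the fixed category tree."""
    return [a["title"] for a in articles if a["category"] == category]

def by_tag(articles, tag):
    """Bottom-up findability: follow user-generated tags."""
    return [a["title"] for a in articles if tag in a["tags"]]

print(by_category(articles, "research"))  # ['Visualizing large data sets']
print(by_tag(articles, "big data"))       # ['Visualizing large data sets']
```

The design point is that neither classification replaces the other: the taxonomy gives stable navigation while the tags capture vocabulary the editors did not anticipate.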

In this article, Baehr’s primary concern is with providing an overview of the issues involved in developing digital knowledge bases in general and of his experience in developing a digital knowledge base for the technical communication field in particular. As mentioned, he concludes using “an integrated or hybrid” approach involving various methods to develop and organize the information content based upon a “sustainable content strategy” (Baehr, 2013, p. 293).

AB04 – Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.

Engineers at Google as early as 2003 encountered challenges in their efforts to deploy, operate, and sustain systems capable of ingesting, storing, and processing the large volumes of data required to produce and deliver Google’s services to its users, services such as the “Google Web search service” for which Google must create and maintain a “large-scale indexing” system, or the “Google Zeitgeist and Google Trends” services for which it must extract and analyze “data to produce reports of popular queries” (Dean & Ghemawat, 2008, pp. 107, 112).

As Dean and Ghemawat explain in the introduction to their article, even though many of the required “computations are conceptually straightforward,” the data volume is massive (terabytes or petabytes in 2003) and the “computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time” (Dean & Ghemawat, 2008, p. 107). At the time, even though Google had already “implemented hundreds of special-purpose computations” to “process large amounts of raw data” and the system worked, the authors describe how they sought to reduce the “complexity” introduced by a systems infrastructure requiring “parallelization, fault tolerance, data distribution and load balancing” (Dean & Ghemawat, 2008, p. 107).

Their solution involved creating "a new abstraction" that preserved their "simple computations" while hiding "the messy details" of systems infrastructure administration "in a library," enabling "programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily" on a cost-effective, performance-optimized large cluster of machines (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge their "abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages" and that others "have provided restricted programming models and used the restrictions to parallelize the computation automatically." They assert, however, that their "MapReduce" implementation is a "simplification and distillation of some of these models" resulting from their "experience with large real-world computations," and that their unique contribution may be "a fault-tolerant implementation that scales to thousands of processors," whereas other "parallel processing systems" were "implemented on smaller scales" and left the programmer to address machine failures (2008, pp. 107, 113).
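The map and reduce primitives the authors borrow from functional languages can be shown with the classic word-count example on a single machine. This is a sketch of the programming model only, not of Google's distributed implementation; the runner function below stands in for the "library" that, in the real system, also handles parallelization, fault tolerance, data distribution, and load balancing:

```python
from collections import defaultdict

# Single-machine sketch of the MapReduce programming model: the
# programmer writes only map_fn and reduce_fn; run_mapreduce plays the
# role of the library, here doing just the group-by-key (shuffle) step.

def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1                 # emit intermediate (key, value)

def reduce_fn(word, counts):
    yield word, sum(counts)           # combine all values for one key

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for name, text in inputs:
        for key, value in map_fn(name, text):
            groups[key].append(value)  # shuffle: group values by key
    output = {}
    for key, values in groups.items():
        for k, v in reduce_fn(key, values):
            output[k] = v
    return output

docs = [("d1", "big data big clusters"), ("d2", "big clusters")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 3, 'data': 1, 'clusters': 2}
```

The "simple computation" is entirely contained in the two small user functions; everything the authors describe as "messy" lives behind the runner's interface, which is precisely the separation their abstraction formalizes.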

In sections 2 and 3 of their paper, the authors provide greater detail of their “programming model,” their specific “implementation of the MapReduce interface” including the Google File System (GFS) – a “distributed file system” that “uses replication to provide availability and reliability on top of unreliable hardware” – and an “execution overview” with a diagram showing the logical relationships and progression of their MapReduce implementation’s components and data flow (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some “extensions” at times useful for augmenting “map and reduce functions” (Dean & Ghemawat, 2008, p. 110).

In section 5, the authors discuss their experience measuring "the performance of MapReduce on two computations running on a large cluster of machines" and describe the two "programs" they run as "representative of a large subset of the real programs written by users of MapReduce," that is, computations for searching and for sorting (Dean & Ghemawat, 2008, p. 111). In other words, the search function represents a "class" of "program" that "extracts a small amount of interesting data from a large dataset," while the sort function represents a "class" of "program" that "shuffles data from one representation to another" (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention "locality optimization," a feature they describe further over the next few sections as one that "draws its inspiration from techniques such as active disks" and that preserves "scarce" network bandwidth by reducing the distance between processors and disks, thereby limiting "the amount of data sent across I/O subsystems or the network" (Dean & Ghemawat, 2008, pp. 112-113).
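The shape of these two program classes can be suggested with minimal map functions. These are assumptions about what such jobs typically look like, not the paper's benchmark code; in the real system the sorted order falls out of the shuffle's ordered intermediate keys, which the explicit `sorted()` call imitates here:

```python
# Illustrative map functions for the two program classes: a grep-like
# job extracts a small amount of interesting data, while a sort-like
# job re-keys every record so the shuffle puts it in a new order.

def grep_map(record, pattern="error"):
    """Grep class: emit only the rare records matching the pattern."""
    if pattern in record:
        yield pattern, record

def sort_map(record):
    """Sort class: emit every record keyed for ordering."""
    key = record.split(",")[0]        # e.g. order log lines by timestamp
    yield key, record

records = ["09:02,ok", "09:01,error disk", "09:03,ok"]

# Grep class: output is much smaller than the input.
matches = [r for rec in records for _, r in grep_map(rec)]
print(matches)  # ['09:01,error disk']

# Sort class: output is the same data in a different representation.
ordered = [r for _, r in sorted(kv for rec in records for kv in sort_map(rec))]
print(ordered)  # ['09:01,error disk', '09:02,ok', '09:03,ok']
```

The contrast mirrors the authors' distinction: the grep class stresses reading and filtering (little data moves), while the sort class stresses the shuffle itself (all data moves), which is why the two together exercise such a large share of real workloads.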

In section 6, as mentioned previously, Dean and Ghemawat discuss some of the advantages of the “MapReduce programming model” as enabling programmers for the most part to avoid the infrastructure management normally involved in leveraging “large amounts of resources” and to write relatively simple programs that “run efficiently on a thousand machines in a half hour” (Dean & Ghemawat, 2008, p. 112).

Overall, the story of MapReduce and GFS told by Dean and Ghemawat in this paper, written a few years after their original paper on the same topic, is a story of discovering more efficient ways to utilize resources.

AB03 – Fan, W. & Bifet, A. (2012). Mining big data: Current status, and forecast to the future.

Fan and Bifet (2012) state the aim of their article, and of the particular issue of the publication it introduces, as providing an overview of the "current status" and future course of the academic discipline and the business and industrial fields involved in "mining big data." Toward that aim, the authors say they will "introduce Big Data mining and its applications," "summarize the papers presented in this issue," note some of the field's controversies and challenges, discuss the "importance of open-source software tools," and draw a few conclusions regarding the field's overall endeavor (Fan & Bifet, 2012, p. 1).

In their bulleted list of controversies surrounding the big data phenomenon, the authors begin by noting the controversy regarding whether there is any “need to distinguish Big Data analytics from data analytics” (Fan & Bifet, 2012, p. 3). From the perspectives of people who have been involved with data management, including knowledge discovery and data mining, since before “the term ‘Big Data’ appeared for the first time in 1998” (Fan & Bifet, 2012, p. 1), it seems reasonable to consider exactly how the big data of recent years are different from the data of past years.

Although Fan and Bifet acknowledge this controversy, in much of their article they proceed to explain how the big data analytics of today differs from the data analytics of past years. First, they say their conception of big data refers to datasets so large and complex that they have "outpaced our capability to process, analyze, store and understand" them with "our current methodologies or data mining software tools" (Fan & Bifet, 2012, p. 1). Next, they describe their conception of "Big Data mining" as "the capability of extracting useful information from these large datasets or streams of data" that, owing to Laney's "3 V's in Big Data management" – volume, velocity, and variety – has thus far been extremely difficult or impossible to extract (Fan & Bifet, 2012, pp. 1-2). In addition to Laney's 3 V's, which the authors cite from a note Laney published in 2001, the authors cite Gartner as explaining two more V's of big data in a definition on Gartner's website accessed in 2012 (Fan & Bifet, 2012, p. 2). One of the Gartner V's cited by Fan and Bifet is "variability," involving "changes in the structure of the data and how users want to interpret that data"; this seems to me related enough to Laney's "variety" that one could combine the two for simplicity and convenience. The other Gartner V is "value," which Fan and Bifet interpret as meaning "business value that gives organizations a compelling advantage, due to the ability of making decisions based in answering questions that were previously considered beyond reach"; this seems to me distinct enough from Laney's V's that one should consider it a separate, fourth V characteristic of big data (Fan & Bifet, 2012, p. 2).

In their discussion of how big data analytics can be applied to create value, the authors cite an Intel website accessed in 2012 to describe business applications such as customization of products or services for particular customers, technology applications that would improve “process time from hours to seconds,” healthcare applications for “mining the DNA of each person” in order “to discover, monitor and improve health aspects of everyone,” and public policy planning that could create “smart cities” “focused on sustainable economic development and high quality of life” (Fan & Bifet, 2012, p. 2). Continuing their discussion of the value or “usefulness” of big data, the authors describe the United Nations’ (UN) Global Pulse initiative as an effort begun in 2009 “to improve life in developing countries” by researching “innovative methods and techniques for analyzing real-time digital data,” by assembling a “free and open source” big data “technology toolkit,” and by establishing an “integrated, global network of Pulse Labs” in developing countries in order to enable them to utilize and apply big data (Fan & Bifet, 2012, p. 2).

Before Fan and Bifet mention Laney's 3 V's of big data and cite Gartner's fourth V – value – they describe some of the sources of data that have developed in "recent years" and contributed to "a dramatic increase in our ability to collect data from various sensors, devices, in different formats, from independent or connected applications," including both social media applications that enable end-users to easily generate content and an infrastructure of "mobile phones" that is "becoming the sensory gateway to get real-time data on people" (Fan & Bifet, 2012, p. 1). In addition, they mention the "Internet of things (IoT)" and predict it "will raise the scale of data to an unprecedented level" as "people and devices" in private and public environments "are all loosely connected" to create "trillions" of endpoints contributing "the data" from which "valuable information must be discovered" and used to "help improve quality of life and make our world a better place" (Fan & Bifet, 2012, p. 1).

Completing their introduction to the topic of big data and their discussion of some of its applications, Fan and Bifet turn in the third section of their paper to summarizing four selected articles from the December 2012 issue of Explorations, the newsletter of the Association for Computing Machinery’s (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (KDD), the issue of that newsletter which their article introduces. In their opinion, these four articles “together” represent “very significant state-of-the-art research in Big Data Mining” (Fan & Bifet, 2012, p. 2). Their summaries of the four articles, two articles from researchers in academia and two articles from researchers in industry, discuss big data mining infrastructure and technologies, methods, and objectives. They say the first article, from researchers at Twitter, Inc., “presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter” which illustrates the “current state of data mining tools” is such that “most of the time is consumed in preparatory work” and in revising “preliminary models into robust solutions” (Fan & Bifet, 2012, p. 2). They summarize the second article, from researchers in academia, as being about “mining heterogeneous information networks” of “interconnected, multi-typed data” that “leverage the rich semantics of typed nodes and links in a network” to discover knowledge “from interconnected data” (Fan & Bifet, 2012, p. 2). The third article, also from researchers in academia, they summarize as providing an “overview of mining big graphs” by using the “PEGASUS tool” and as indicating potentially fruitful “research directions for big graph mining” (Fan & Bifet, 2012, p. 2). 
They summarize the fourth article, from a researcher at Netflix, as being about Netflix’s “recommender and personalization techniques” and as including a substantial section on whether “we need more data or better models to improve our learning methodology” (Fan & Bifet, 2012, pp. 2-3).

In the next section of their paper, the authors provide a seven-bullet list of controversies surrounding the "new hot topic" of "Big Data" (Fan & Bifet, 2012, p. 3). The first controversy on their list, one mentioned earlier, raises the issue of whether and how the recent and so-called "Big Data" phenomenon differs from what has previously been referred to simply as data management, data analysis, or data analytics, among other similar terms and concepts that have existed in various disciplines, fields, and bodies of literature for quite some time. The second controversy concerns whether "Big Data" may be nothing more than hype resulting from efforts by "data management systems sellers" to profit from sales of systems capable of storing massive amounts of data to be processed and analyzed by Hadoop and related technologies, when in reality smaller volumes of data and other strategies and methods may be more appropriate in some cases (Fan & Bifet, 2012, p. 3). The third controversy asserts that, at least in the case of "real time analytics," the "recency" of the data is more important than its volume. As the fourth controversy, the authors mention how some of Big Data's "claims to accuracy are misleading," and they cite Taleb's argument that as "the number of variables grow, the number of fake correlations also grow," which can produce some rather absurd correlations, such as Leinweber's finding that "the S&P 500 stock index was correlated with butter production in Bangladesh" (Fan & Bifet, 2012, p. 3). The fifth controversy the authors address concerns data quality: they propose "bigger data are not always better data" and note a couple of factors that determine quality, for example whether "the data is noisy or not, and if it is representative" (Fan & Bifet, 2012, p. 3).
The authors state the sixth controversy as an ethical issue, namely whether "it is ethical that people can be analyzed without knowing it" (Fan & Bifet, 2012, p. 3). The final controversy addressed by Fan and Bifet concerns whether access to massive volumes of data and the capabilities to use it (including the required infrastructure, knowledge, and skills) are unfairly or unjustly limited and could "create a division between the Big Data rich and poor" (Fan & Bifet, 2012, p. 3).

Fan and Bifet devote the fifth section of their paper to discussing "tools," focusing on the close relationships between big data, "the open source revolution," and companies including "Facebook, Yahoo!, Twitter," and "LinkedIn" that both contribute to and benefit from their involvement with "open source projects" such as the Apache Hadoop project (Fan & Bifet, 2012, p. 3), which many consider the foundation of big data. After briefly introducing the "Hadoop Distributed File System (HDFS) and MapReduce" as the primary aspects of the Hadoop project enabling storage and processing of massive data sets, respectively, the authors mention a few other open source projects within the Hadoop ecosystem, such as "Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper," and "Apache Cassandra," among others (Fan & Bifet, 2012, p. 3). Next, the authors discuss more of the "many open source initiatives" involved with big data (Fan & Bifet, 2012, p. 3). "Apache Mahout," for example, is a "scalable machine learning and data mining open source software based mainly in Hadoop," "R" is a "programming language and software environment," "MOA" enables "stream data mining" or "data mining in real time," and "Vowpal Wabbit" (VW) is a "parallel learning" algorithm known for speed and scalability (Fan & Bifet, 2012, p. 3). Regarding open-source "tools" for "Big Graph mining," the authors mention "GraphLab" and "PEGASUS," the latter of which they describe as a "big graph mining system built on top of MAPREDUCE" that enables discovery of "patterns and anomalies in massive real-world graphs" (Fan & Bifet, 2012, pp. 3-4).

The sixth section of their article provides a seven-bullet list of what the authors consider "future important challenges in Big Data management and analytics," given the nature of big data as "large, diverse, and evolving" (Fan & Bifet, 2012, p. 4). First, they discuss the need to continue exploring architectures in order to ascertain what would be the "optimal architecture" for "analytic systems" "to deal with historic data and with real-time data" simultaneously (Fan & Bifet, 2012, p. 4). Next, they state the importance of ensuring accurate findings and making accurate claims – in other words, "to achieve significant statistical results" – in big data research, especially since "it is easy to go wrong with huge data sets and thousands of questions to answer at once" (Fan & Bifet, 2012, p. 4). Third, they mention the need to expand the number of "distributed mining" methods, since some techniques are not trivial to parallelize (Fan & Bifet, 2012, p. 4). Fourth, the authors note the importance of improving capabilities for analyzing data streams that are continuously "evolving over time," including "in some cases to detect change first" (Fan & Bifet, 2012, p. 4). Fifth, the authors note the challenge of storing massive amounts of data and emphasize the need to continue exploring the balance between gaining or sacrificing time or space, given the "two main approaches" currently used to address the issue: either compression (sacrificing time to reduce the space required for storage) or sampling (using a sample of data – "coresets" – to represent much larger data volumes) (Fan & Bifet, 2012, p. 4). Sixth, the authors admit "it is very difficult to find user-friendly visualizations" and that it will be necessary to develop innovative "techniques" and "frameworks" "to tell and show" the "stories" of data (Fan & Bifet, 2012, p. 4).
Last, the authors acknowledge massive amounts of potentially valuable data are being lost, since much of the data created today are "largely untagged file-based and unstructured data" (Fan & Bifet, 2012, p. 4). Citing a "2012 IDC study on Big Data," the authors say "currently only 3% of the potentially useful data is tagged, and even less is analyzed" (Fan & Bifet, 2012, p. 4).

In the conclusion to their paper, Fan and Bifet predict “each data scientist will have to manage” increasing data volume, increasing data velocity, and increasing data variety in order to participate in “the new Final Frontier for scientific data research and for business applications” and to “help us discover knowledge that no one has discovered before” (Fan & Bifet, 2012, p. 4).

AB02 – Boyd, D., & Crawford, K. (2012). Critical questions for Big Data

As “social scientists and media studies scholars,” Boyd and Crawford (2012) consider it their responsibility to encourage and focus the public discussion regarding “Big Data” by asserting six claims they imply help define the many and important potential issues the “era of Big Data” has already presented to humanity and the diverse and competing interests that comprise it (Boyd & Crawford, 2012, pp. 662-663). Before asserting and explaining their claims, however, the authors define Big Data “as a cultural, technological, and scholarly phenomenon” that “is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets,” a phenomenon that has three primary components (fields or forces) interacting within it: 1) technology, 2) analysis, and 3) mythology (Boyd & Crawford, 2012, p. 663). Precisely because Big Data, as well as some “other socio-technical phenomenon,” elicit both “utopian and dystopian rhetoric” and visions of the future of humanity, Boyd and Crawford think it is “necessary to ask critical questions” about “what all this data means, who gets access to what data, how data analysis is deployed, and to what ends” (Boyd & Crawford, 2012, p. 664).

The authors’ first two claims are concerned essentially with epistemological issues regarding the nature of knowledge and truth (Boyd & Crawford, 2012, pp. 665-667. In explaining their first claim, “1. Big Data changes the definition of knowledge,” the authors draw parallels between Big Data as a “system of knowledge” and “’Fordism’” as a “manufacturing system of mass production.” According to the authors, both of these systems influence peoples’ “understanding” in certain ways. Fordism “produced a new understanding of labor, the human relationship to work, and society at large.” And Big Data “is already changing the objects of knowledge” and suggesting new concepts that may “inform how we understand human networks and community” (Boyd & Crawford, 2012, p. 665). In addition, the authors cite Burkholder, Latour, and others in describing how Big Data refers not only to the quantity of data, but also to the “tools and procedures” that enable people to process and analyze “large data sets,” and to the general “computational turn in thought and research” that accompanies these new instruments and methods (Boyd & Crawford, 2012, p. 665). In addition, the authors state “Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and categorization of reality” (Boyd & Crawford, 2012, p. 665). Finally, as counterpoint to the many potential benefits and positive aspects of Big Data they have emphasized thus far, the authors cite Anderson as one who has revealed the at times prejudicial and arrogant beliefs and attitudes of some quantitative proponents who summarily dismiss all qualitative or humanistic approaches to gathering evidence and formulating theories (Boyd & Crawford, 2012, pp. 665-666) as inferior.

In explaining their second claim, “2. Claims to objectivity and accuracy are misleading,” the authors continue considering some of the biases and misconceptions inherent in epistemologies that privilege “quantitative science and objective method” as the paths to knowledge and absolute truth. According to the authors, Big Data “is still subjective” and even when research subjects or variables are quantified, those quantifications do “not necessarily have a closer claim on objective truth.” In the view of the authors, the obsession of social science and the “humanistic disciplines” with attaining “the status of quantitative science and objective method” is at least to some extent misdirected (Boyd & Crawford, 2012, pp. 666-667), even if understandable given the apparent value society assigns to quantitative evidence. Citing Gitelman and Bollier, among others, the authors believe “all researchers are interpreters of data” not only when they draw conclusions based on their research findings, but also when they design their research and decide what will – and what will not – be measured. Overall, the authors argue against too eagerly embracing the positivistic perspective on knowledge and truth and argue in favor of critically examining research philosophies and methods and considering the limitations inherent within them (Boyd & Crawford, 2012, pp. 667-668).

The third and fourth claims the authors make could be considered to address research quality. Their third claim, “3. Big data are not always better data,” emphasizes the importance of quality control in research and highlights how “understanding sample, for example, is more important than ever.” Since “the public discourse around” massive and easily collected data streams such as Twitter “tends to focus on the raw number of tweets available” and since “raw numbers” would not be a “representative sample” of most populations about which researchers seek to make claims, public perceptions and opinion could be skewed by either mainstream media’s misleading reporting about valid research or by unprofessional researchers’ erroneous claims based upon invalid research methods and evidence (Boyd & Crawford, 2012, pp. 668-669). In addition to these issues of research design, the authors highlight how additional “methodological challenges” can arise “when researchers combine multiple large data sets,” challenges involving “not only the limits of the data set, but also the limits of which questions they can ask of a data set and what interpretations are appropriate” (Boyd & Crawford, 2012, pp. 669-670).

The authors’ fourth claim continues addressing research quality, but at the broader level of context. Their fourth claim, “4. Taken out of context, Big Data loses its meaning,” emphasizes the importance of considering how the research context affects research methods, findings, and conclusions. The authors imply attitudes toward mathematical modeling and data collection methods may cause researchers to select data more for their suitability to large-scale, computational, automated, quantitative data collection and analysis than for their suitability to discovering patterns or to answering research questions. As an example, the authors consider the evolution of the concept of human networks in sociology and focus on different ways of measuring “‘tie strength,’” a concept understood by many sociologists to indicate “the importance of individual relationships” (Boyd & Crawford, 2012, p. 670). Although recently developed concepts such as “articulated networks” and “behavioral networks” may appear at times to indicate tie strength equivalent to more traditional concepts such as “kinship networks,” the authors explain how the tie strength of kinship networks is based on more in-depth, context-sensitive data collection such as “surveys, interviews,” and even “observation,” while the tie strength of articulated or behavioral networks may rely on nothing more than interaction frequency analysis; indeed, “measuring tie strength through frequency or public articulation is a common mistake” (Boyd & Crawford, 2012, p. 671). In general, the authors caution against considering Big Data a panacea that will objectively and definitively answer all research questions. In their view, “the size of the data should fit the research question being asked; in some cases, small is best” (Boyd & Crawford, 2012, p. 670).

The authors’ final two claims address ethical issues related to Big Data, some of which seem to have arisen in parallel with its ascent. In their fifth claim, “5. Just because it is accessible does not make it ethical,” the authors focus primarily on whether “social media users” implicitly give permission to anyone to use publicly available data related to the user in all contexts, even contexts the user may not have imagined, such as in research studies or in the collectors’ data or information products and services (Boyd & Crawford, 2012, pp. 672-673). Citing Ess and others, the authors emphasize researchers and scholars have “accountability” for their actions, including those actions related to “the serious issues involved in the ethics of online data collections and analysis.” The authors encourage researchers and scholars to consider privacy issues and to proactively assess whether they should assume users have provided “informed consent” for researchers to collect and analyze users’ data just because the data is publicly available (Boyd & Crawford, 2012, pp. 672-673). In their sixth claim, “6. Limited access to Big Data creates new digital divides,” the authors note that although there is a prevalent perception Big Data “offers easy access to massive amounts of data,” in reality access to Big Data and the ability to manage and analyze it require resources unavailable to much of the population – and this “creates a new kind of digital divide: the Big Data rich and the Big Data poor” (Boyd & Crawford, 2012, pp. 673-674). “Whenever inequalities are explicitly written into the system,” the authors assert further, “they produce class-based structures” (Boyd & Crawford, 2012, p. 675).

In their article overall, Boyd & Crawford maintain an optimistic tone while enumerating the myriad issues emanating from the Big Data phenomenon. In concluding, the authors encourage scholars, researchers, and society to “start questioning the underlying assumptions, values, and biases of this new wave of research” (Boyd & Crawford, 2012, p. 675).

AB01 – Graham, S. S., Kim, S.-Y., Devasto, M. D., & Keith, W. (2015). Statistical genre analysis: Toward big data methodologies in technical communication.

A team of researchers sets out to bring the power of “big data” into the toolkit of technical communication scholars by piloting a research method they “dub statistical genre analysis (SGA)” and by describing and explaining the method in an article published in the journal Technical Communication Quarterly (Graham, Kim, Devasto, & Keith, 2015, pp. 70-71).

Acknowledging the value academic markets have begun assigning to findings, conclusions, and theories founded upon rigorous analysis of massive data sets, this team deconstructs the amorphous “big data” phenomenon. They demonstrate how their SGA methodology can be used to quantitatively describe and visually represent the generic content (e.g., types of evidence and modes of reasoning) of rhetorical situations (e.g., committee meetings) and to discover input variables (e.g., conflicts of interest) that have statistically significant effects upon output variables (e.g., recommendations) of important policy-influencing entities such as the Food and Drug Administration’s (FDA) Oncologic Drugs Advisory Committee (ODAC) (Graham et al., 2015, pp. 86-89).

The authors believe there is much to gain by integrating the “humanistic and qualitative study of discourse with statistical methods.” Although they respect the “craft character of rhetorical inquiry” (Graham et al., 2015, pp. 71-72) and utilize “the inductive and qualitative nature of rhetorical analysis as a necessary” initial step in their hybrid method (Graham et al., 2015, p. 77), they conclude their mixed-method SGA approach can increase the “range and power” (Graham et al., 2015, p. 92) of “traditional, inductive approaches to genre analysis” (Graham et al., 2015, p. 86) by offering the advantages “of statistical insights” while avoiding the statistical sterility that can emerge when the qualitative, humanist element is absent (Graham et al., 2015, p. 91).

In the conclusion of their article, the researchers identify two main benefits of their hybrid SGA method. The first benefit is that communication genres “can be defined with more precision,” since SGA documents the actual frequency of generic conventions within a large sample of the corpus; traditional rhetorical methods, by contrast, may document only the opinions experts hold about the “typical” frequency of generic conventions within a limited sample of “exemplars” selected from a small sample of the corpus. In addition, the authors argue analysis of a massive number of texts may reveal generic conventions that do not appear in the limited sample of exemplars studied by practitioners of the traditional rhetorical approach involving only “critical analysis and close reading.” The second benefit is that communication scholars are enabled to move beyond critical opinion and to claim statistically significant correlations between “situational inputs and outputs” and “genre characteristics that have been empirically established” (Graham et al., 2015, p. 92).

Befitting the subject of their study, the authors devote a considerable portion of their article to describing their research methodology. In the third section, titled “Statistical Genre Analysis,” they begin by noting they conducted the “current pilot study” on a “relatively small subset” of the available data in order to “demonstrate the potential of SGA.” They then outline their research questions, the answers to two of which seem to attest to the strength SGA can add to both the evidence and the inferences communication scholars use in their own arguments about the communications they study. As in the introduction, the authors also note in this section the intellectual lineage of SGA in various disciplines, including “rhetorical studies, linguistics,” “health communication,” psychology, and “applied statistics” (Graham et al., 2015, pp. 71, 76).

As explained earlier, the communication artifacts studied by these researchers are selected from among the various artifacts arising from the FDA’s ODAC meetings: specifically, the textual transcriptions of presentations (essentially opening statements) given by the sponsors (pharmaceutical manufacturing companies) of the drugs under review during meetings, which usually last one or two days (Graham et al., 2015, pp. 75-76). Not only in the arenas of technical communication and rhetoric, but also in the arenas of Science and Technology Studies (STS) and of Science, Technology, Engineering, and Math (STEM) public policy, managing conflicts of interest among ODAC participants and encouraging inclusion of all relevant stakeholders in ODAC meetings are prominent issues (Graham et al., 2015, p. 72). At the conclusion of ODAC meetings, voting participants vote either for or against the issue under consideration, generally “applications to market new drugs, new indications for already approved drugs, and appropriate research/study endpoints” (Graham et al., 2015, pp. 74-76).

It is within this context the authors attempted to answer the following two research questions, among others, regarding all ODAC meetings and sponsor presentations given at those meetings between 2009 and 2012: “1. How does the distribution of stakeholders affect the distribution of votes?” and “3. How does the distribution of evidence and forms of reasoning in sponsor presentations affect the distribution of votes?” (Graham et al., 2015, pp. 75-76). Both of these research questions ask whether certain input variables affect certain output variables, and in this case the output variables are votes for or against an action that will have serious consequences for people and organizations. Put another way, this is a political (or deliberative rhetoric) situation, and the ability to predict with a high degree of certainty which inputs produce which outputs could be quite valuable, given those inputs and outputs could determine substantial budget allocations, consulting fees, and pharmaceutical sales – essentially, success or failure – among other things.

Toward the aim of asking and answering research questions with such potentially high stakes, the authors applied their mixed-methods SGA approach, which they explain comprised four phases of research conducted over approximately six months to one year by at least four researchers. The authors explain SGA “requires first an extensive data preparation phase,” after which the researchers “subjected” the data “to various statistical tests to directly address the research questions.” They describe the four phases of their SGA method as “(a) coding schema development, (b) directed content analysis, (c) meeting data and participant demographics extraction, and (d) statistical analyses.” Before moving into a deeper discussion of their own “coding schema” development, as well as the other phases of their SGA approach, the authors cite numerous influences from scholars in “behavioral research,” “multivariate statistics,” “corpus linguistics,” and “quantitative work in English for specific purposes,” explaining that the specific statistical “techniques” they apply “can be found in canonical works of multivariate statistics such as Keppel’s (1991) Design and Analysis and Johnson and Wichern’s (2007) Applied Multivariate Statistical Analysis” (Graham et al., 2015, pp. 75-77). One important distinction the authors make between their method and these others is that while the other methods operate at the more granular “word and sentence level” that facilitates formulation of “coding schema amenable to automated content analysis,” the authors operate at the less granular paragraph level, which requires human intervention to formulate coding schema reflecting nuances discernible only at higher cognitive levels – for example, whether particular evidentiary artifacts (transcripts) are based on randomized controlled trials (RCTs) addressing issues of “efficacy” or RCTs addressing issues of “safety and treatment-related hazards” (Graham et al., 2015, pp. 77-78). Choosing the longer, more complex paragraph as their unit of analysis requires the research method to depend upon “the inductive and qualitative nature of rhetorical analysis as a necessary precursor to both qualitative coding and statistical testing” (Graham et al., 2015, p. 77).

In the final section of their explanation of SGA, their research methodology, the authors summarize their statistical methods including both “descriptive statistics” and “inferential statistics” and how they applied these two types of statistical methods, respectively, to “provide a quantitative representation of the data set” (e.g. “mean, median, and standard deviation”) and to “estimate the relationship between variables” (e.g. “statistically significant impacts”) (Graham et al., 2015, pp. 81-83).
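The distinction the authors draw between descriptive and inferential statistics can be illustrated with a small sketch in Python. Every number below is invented for illustration only and has no connection to the actual ODAC data set; the group labels and variable names are likewise hypothetical.

```python
# Hypothetical sketch of the two statistical modes described above:
# descriptive statistics give a quantitative representation of a data
# set; inferential statistics estimate the relationship between
# variables. All numbers are invented, not drawn from ODAC data.
import statistics
import random

# Invented per-presentation counts of coded paragraphs, split by a
# hypothetical meeting outcome (approved vs. not approved).
approved     = [12, 9, 14, 11, 10, 13]
not_approved = [15, 17, 13, 16, 18, 14]

# Descriptive statistics: summarize the data set itself.
print("mean:", statistics.mean(approved))            # 11.5
print("median:", statistics.median(approved))        # 11.5
print("stdev:", round(statistics.stdev(approved), 2))

# Inferential statistics: a simple permutation test estimating whether
# the observed difference in group means could plausibly be chance.
random.seed(0)
observed = statistics.mean(not_approved) - statistics.mean(approved)
pooled = approved + not_approved
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[6:]) - statistics.mean(pooled[:6])
    if diff >= observed:
        extreme += 1
p_value = extreme / trials
print("observed difference:", round(observed, 2))
print("permutation p-value:", p_value)
```

The permutation test stands in for the inferential techniques the authors actually use (multivariate methods such as multiple regression); it is chosen here only because it needs no libraries beyond the Python standard library.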

Returning to the point of the authors’ research – namely, demonstrating how SGA empowers scholars to provide confident answers to research questions and therefore to create and assert knowledge clearly valued by societal interests – their SGA enables them to state their “multiple regression analysis” found “RCT-efficacy data and conflict of interest remained as the only significant predictors of approval rates. Oddly, the use of efficacy data seems to lower the chance of approval, whereas a greater presence of conflict of interest increases the probability of approval” (Graham et al., 2015, p. 89). On one hand, this finding could encourage entities aiming to increase the probability of approval to allocate resources toward increasing the presence of conflicts of interest, since that is the only input variable demonstrated to contribute to achieving their aim. On the other hand, entities claiming conflicts of interest illegally (or at least undesirably) affect ODAC participants’ votes can use this finding as evidence to bolster their arguments that “stricter controls on conflicts of interests should be deployed” (Graham et al., 2015, p. 92).
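A multiple regression of the general kind the authors report can be sketched as follows. The variable names and all data below are hypothetical, constructed only to mirror the direction of the reported finding (efficacy emphasis lowering, and conflict of interest raising, the approval rate); this is not the authors’ actual model, data, or coefficient values.

```python
# Minimal multiple-regression sketch with two predictors and an
# intercept, fit by ordinary least squares. The "data" are generated
# from an invented linear relationship so the regression simply
# recovers the invented coefficients (-0.3 and +0.5).
import numpy as np

# Hypothetical per-meeting measures (all values invented).
efficacy = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
conflict = np.array([0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.7, 0.8])
approval = 0.5 - 0.3 * efficacy + 0.5 * conflict  # invented relationship

# Design matrix with an intercept column; least-squares fit.
X = np.column_stack([np.ones_like(efficacy), efficacy, conflict])
coef, *_ = np.linalg.lstsq(X, approval, rcond=None)
intercept, b_efficacy, b_conflict = coef
print(f"efficacy coefficient: {b_efficacy:.2f}")   # negative, as in the finding's direction
print(f"conflict coefficient: {b_conflict:.2f}")   # positive, as in the finding's direction
```

In a real analysis, the regression would of course be fit to observed data rather than a constructed relationship, and the significance of each coefficient would be tested before any claim like the authors’ could be made.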