AB02 – Boyd, D., & Crawford, K. (2012). Critical questions for Big Data

As “social scientists and media studies scholars,” Boyd and Crawford (2012) consider it their responsibility to encourage and focus the public discussion of “Big Data,” and they do so by asserting six claims that frame the many important issues the “era of Big Data” has already presented to humanity and to the diverse, competing interests that comprise it (Boyd & Crawford, 2012, pp. 662-663). Before asserting and explaining their claims, however, the authors define Big Data “as a cultural, technological, and scholarly phenomenon” that “is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets,” a phenomenon with three primary components (fields or forces) interacting within it: 1) technology, 2) analysis, and 3) mythology (Boyd & Crawford, 2012, p. 663). Precisely because Big Data, like some “other socio-technical phenomenon,” elicits both “utopian and dystopian rhetoric” and visions of the future of humanity, Boyd and Crawford think it “necessary to ask critical questions” about “what all this data means, who gets access to what data, how data analysis is deployed, and to what ends” (Boyd & Crawford, 2012, p. 664).

The authors’ first two claims concern essentially epistemological issues regarding the nature of knowledge and truth (Boyd & Crawford, 2012, pp. 665-667). In explaining their first claim, “1. Big Data changes the definition of knowledge,” the authors draw parallels between Big Data as a “system of knowledge” and “’Fordism’” as a “manufacturing system of mass production.” According to the authors, both systems shape people’s “understanding” in certain ways. Fordism “produced a new understanding of labor, the human relationship to work, and society at large,” and Big Data “is already changing the objects of knowledge” while suggesting new concepts that may “inform how we understand human networks and community” (Boyd & Crawford, 2012, p. 665). Citing Burkholder, Latour, and others, the authors describe how Big Data refers not only to the quantity of data but also to the “tools and procedures” that enable people to process and analyze “large data sets,” and to the general “computational turn in thought and research” that accompanies these new instruments and methods (Boyd & Crawford, 2012, p. 665). They further state that “Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and categorization of reality” (Boyd & Crawford, 2012, p. 665). Finally, as a counterpoint to the many potential benefits of Big Data emphasized thus far, the authors cite Anderson as one who has revealed the at times prejudicial and arrogant attitudes of some quantitative proponents who summarily dismiss all qualitative or humanistic approaches to gathering evidence and formulating theories as inferior (Boyd & Crawford, 2012, pp. 665-666).

In explaining their second claim, “2. Claims to objectivity and accuracy are misleading,” the authors continue examining the biases and misconceptions inherent in epistemologies that privilege “quantitative science and objective method” as the paths to knowledge and absolute truth. According to the authors, Big Data “is still subjective,” and even when research subjects or variables are quantified, those quantifications do “not necessarily have a closer claim on objective truth.” In the authors’ view, the preoccupation of social science and the “humanistic disciplines” with attaining “the status of quantitative science and objective method” is at least partly misdirected, even if understandable given the apparent value society assigns to quantitative evidence (Boyd & Crawford, 2012, pp. 666-667). Citing Gitelman and Bollier, among others, the authors hold that “all researchers are interpreters of data,” not only when they draw conclusions from their findings but also when they design their research and decide what will – and what will not – be measured. Overall, the authors argue against too eagerly embracing the positivistic perspective on knowledge and truth, and in favor of critically examining research philosophies and methods and the limitations inherent within them (Boyd & Crawford, 2012, pp. 667-668).

The authors’ third and fourth claims address research quality. Their third claim, “3. Big data are not always better data,” emphasizes the importance of quality control in research and highlights how “understanding sample, for example, is more important than ever.” Because “the public discourse around” massive, easily collected data streams such as Twitter “tends to focus on the raw number of tweets available,” and because “raw numbers” would not be a “representative sample” of most populations about which researchers seek to make claims, public perception and opinion can be skewed either by mainstream media’s misleading reporting on valid research or by unprofessional researchers’ erroneous claims based on invalid methods and evidence (Boyd & Crawford, 2012, pp. 668-669). Beyond these issues of research design, the authors highlight the additional “methodological challenges” that arise “when researchers combine multiple large data sets,” challenges involving “not only the limits of the data set, but also the limits of which questions they can ask of a data set and what interpretations are appropriate” (Boyd & Crawford, 2012, pp. 669-670).
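
The sampling concern can be made concrete with a short simulation. The minimal sketch below (all population sizes, posting rates, and opinion shares are invented for illustration, not drawn from the article) shows how the “raw number of tweets” over-represents heavy posters, so that per-tweet statistics diverge from per-user statistics:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 10,000 users, of whom ~5% are heavy posters.
n_users = 10_000
heavy = rng.random(n_users) < 0.05

# Assumed posting rates: heavy users tweet ~50x per period, others ~1x.
tweets_per_user = np.where(heavy,
                           rng.poisson(50, n_users),
                           rng.poisson(1, n_users))

# A hypothetical opinion that differs between the two groups.
supports = np.where(heavy,
                    rng.random(n_users) < 0.8,
                    rng.random(n_users) < 0.3)

# Per-user estimate: each person counted once (representative of users).
print("share of users who support:", supports.mean())

# Per-tweet estimate: the "raw number of tweets" view, which
# over-weights heavy posters and skews toward their opinion.
per_tweet = np.repeat(supports, tweets_per_user)
print("share of tweets that support:", per_tweet.mean())
```

Because the heavy posters hold a different opinion in this toy population, the per-tweet share lands far from the per-user share, illustrating why raw tweet counts are not a representative sample of users.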

The authors’ fourth claim continues to address research quality, but at the broader level of context. That claim, “4. Taken out of context, Big Data loses its meaning,” emphasizes how the research context shapes research methods, findings, and conclusions. The authors imply that attitudes toward mathematical modeling and data collection may lead researchers to select data more for its suitability to large-scale, computational, automated, quantitative collection and analysis than for its suitability to discovering patterns or answering research questions. As an example, the authors trace the evolution of the concept of human networks in sociology and focus on different ways of measuring “‘tie strength,’” a concept understood by many sociologists to indicate “the importance of individual relationships” (Boyd & Crawford, 2012, p. 670). Although recently developed concepts such as “articulated networks” and “behavioral networks” may at times appear to indicate tie strength equivalent to that of more traditional concepts such as “kinship networks,” the authors explain that the tie strength of kinship networks rests on more in-depth, context-sensitive data collection such as “surveys, interviews” and even “observation,” whereas the tie strength of articulated or behavioral networks may rely on nothing more than interaction frequency analysis; indeed, “measuring tie strength through frequency or public articulation is a common mistake” (Boyd & Crawford, 2012, p. 671). In general, the authors caution against treating Big Data as a panacea that will objectively and definitively answer all research questions. In their view, “the size of the data should fit the research question being asked; in some cases, small is best” (Boyd & Crawford, 2012, p. 670).
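
To illustrate the measurement mistake the authors flag, consider a minimal sketch (hypothetical contacts and numbers, not the authors’ data) that scores tie strength two ways – by raw interaction frequency, as a behavioral-network analysis might, and by a survey-style closeness rating:

```python
from collections import Counter

# Hypothetical message log: (sender, recipient) pairs.
messages = [
    ("ana", "coworker"), ("ana", "coworker"), ("ana", "coworker"),
    ("ana", "coworker"), ("ana", "sister"), ("ana", "sister"),
]

# Frequency-based "tie strength": count interactions per contact.
frequency = Counter(recipient for _, recipient in messages)

# Survey-based closeness (e.g., a self-reported 1-5 scale).
survey_closeness = {"coworker": 2, "sister": 5}

for contact, closeness in survey_closeness.items():
    print(contact, "| frequency:", frequency[contact],
          "| reported closeness:", closeness)
# The coworker dominates by frequency, yet the sister is the stronger
# tie by the survey measure; frequency alone misses this.
```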

The authors’ final two claims address ethical issues related to Big Data, some of which seem to have arisen in parallel with its ascent. In their fifth claim, “5. Just because it is accessible does not make it ethical,” the authors focus primarily on whether “social media users” implicitly give everyone permission to use their publicly available data in all contexts, even contexts the user may never have imagined, such as research studies or the collectors’ data and information products and services (Boyd & Crawford, 2012, pp. 672-673). Citing Ess and others, the authors emphasize that researchers and scholars have “accountability” for their actions, including those related to “the serious issues involved in the ethics of online data collections and analysis.” They encourage researchers and scholars to consider privacy and to assess proactively whether users have truly provided “informed consent” for the collection and analysis of their data merely because that data is publicly available (Boyd & Crawford, 2012, pp. 672-673). In their sixth claim, “6. Limited access to Big Data creates new digital divides,” the authors note that despite the prevalent perception that Big Data “offers easy access to massive amounts of data,” in reality access to Big Data and the ability to manage and analyze it require resources unavailable to much of the population – and this “creates a new kind of digital divide: the Big Data rich and the Big Data poor” (Boyd & Crawford, 2012, pp. 673-674). “Whenever inequalities are explicitly written into the system,” the authors assert further, “they produce class-based structures” (Boyd & Crawford, 2012, p. 675).

In their article overall, Boyd and Crawford maintain an optimistic tone while enumerating the myriad issues emanating from the Big Data phenomenon. In concluding, the authors encourage scholars, researchers, and society at large to “start questioning the underlying assumptions, values, and biases of this new wave of research” (Boyd & Crawford, 2012, p. 675).

AB01 – Graham, S. S., Kim, S.-Y., DeVasto, D. M., & Keith, W. (2015). Statistical genre analysis: Toward big data methodologies in technical communication.

A team of researchers sets out to bring the power of “big data” into the toolkit of technical communication scholars by piloting a research method they “dub statistical genre analysis (SGA)” and by describing and explaining that method in an article published in the journal Technical Communication Quarterly (Graham, Kim, DeVasto, & Keith, 2015, pp. 70-71).

Acknowledging the value academic markets have begun assigning to findings, conclusions, and theories founded upon rigorous analysis of massive data sets, the team deconstructs the amorphous “big data” phenomenon. They demonstrate how SGA can be used to quantitatively describe and visually represent the generic content (e.g., types of evidence and modes of reasoning) of rhetorical situations (e.g., committee meetings), and to discover input variables (e.g., conflicts of interest) that have statistically significant effects upon output variables (e.g., recommendations) of important policy-influencing entities such as the Food and Drug Administration’s (FDA) Oncologic Drugs Advisory Committee (ODAC) (Graham et al., 2015, pp. 86-89).

The authors believe there is much to gain by integrating the “humanistic and qualitative study of discourse with statistical methods.” Although they respect the “craft character of rhetorical inquiry” (Graham et al., 2015, pp. 71-72) and rely on “the inductive and qualitative nature of rhetorical analysis as a necessary” initial step in their hybrid method (Graham et al., 2015, p. 77), they conclude that their mixed-methods SGA approach can increase the “range and power” (Graham et al., 2015, p. 92) of “traditional, inductive approaches to genre analysis” (Graham et al., 2015, p. 86) by offering the advantages “of statistical insights” while avoiding the statistical sterility that can emerge when the qualitative, humanist element is absent (Graham et al., 2015, p. 91).

In the conclusion of their article, the researchers identify two main benefits of their hybrid SGA method. The first is that communication genres “can be defined with more precision”: SGA documents the actual frequency of generic conventions across a large sample of the corpus, whereas traditional rhetorical methods document the opinions experts hold about the “typical” frequency of generic conventions within a limited set of “exemplars” drawn from a small sample of the corpus. Moreover, the authors argue, analyzing a massive number of texts may reveal generic conventions that never appear in the limited exemplars studied through the traditional approach of “critical analysis and close reading” alone. The second benefit is that communication scholars can move beyond critical opinion and claim statistically significant correlations between “situational inputs and outputs” and “genre characteristics that have been empirically established” (Graham et al., 2015, p. 92).

Befitting the subject of their study, the authors devote a considerable portion of their article to describing their research methodology. In the third section, titled “Statistical Genre Analysis,” they begin by noting that they conducted the “current pilot study” on a “relatively small subset” of the available data in order to “demonstrate the potential of SGA.” They then outline their research questions, the answers to two of which attest to the strength SGA can contribute to both the evidence and the inferences communication scholars use in their own arguments about the communications they study. As in the introduction, the authors note here the intellectual lineage of SGA in various disciplines, including “rhetorical studies, linguistics,” “health communication,” psychology, and “applied statistics” (Graham et al., 2015, pp. 71, 76).

As explained earlier, the communication artifacts studied by these researchers are drawn from the FDA’s ODAC meetings – specifically, the textual transcriptions of presentations (essentially opening statements) given by the sponsors (pharmaceutical manufacturing companies) of the drugs under review during meetings that usually last one or two days (Graham et al., 2015, pp. 75-76). Not only in technical communication and rhetoric, but also in Science and Technology Studies (STS) and in Science, Technology, Engineering, and Math (STEM) public policy, managing conflicts of interest among ODAC participants and encouraging inclusion of all relevant stakeholders in ODAC meetings are prominent issues (Graham et al., 2015, p. 72). At the conclusion of ODAC meetings, voting participants vote either for or against the issue under consideration, generally “applications to market new drugs, new indications for already approved drugs, and appropriate research/study endpoints” (Graham et al., 2015, pp. 74-76).

It is within this context that the authors attempt to answer the following two research questions, among others, regarding all ODAC meetings and sponsor presentations given between 2009 and 2012: “1. How does the distribution of stakeholders affect the distribution of votes?” and “3. How does the distribution of evidence and forms of reasoning in sponsor presentations affect the distribution of votes?” (Graham et al., 2015, pp. 75-76). Notice that both research questions ask whether certain input variables affect certain output variables – and here the output variables are votes for or against an action with serious consequences for people and organizations. Put another way, this is a political (or deliberative rhetoric) situation, and the ability to predict with a high degree of certainty which inputs produce which outputs could be quite valuable, given that those inputs and outputs could determine substantial budget allocations, consulting fees, and pharmaceutical sales – essentially, success or failure – among other things.
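
Although the article details its own statistical procedures later, the shape of such a question can be illustrated with a standard test of independence between an input distribution and an output distribution. The sketch below runs a chi-square test on an invented contingency table; the counts and categories are assumptions for illustration, not the authors’ data or necessarily the test they employed:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows split meetings by stakeholder
# mix; columns are vote outcomes (for, against). All counts invented.
table = [
    [30, 10],  # meetings with patient representatives present
    [18, 22],  # meetings without patient representatives
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# A small p-value would suggest the distribution of votes is not
# independent of the distribution of stakeholders.
```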

Toward the aim of asking and answering research questions with such potentially high stakes, the authors applied their mixed-methods SGA approach, which involved four phases of research conducted over approximately six months to one year by at least four researchers. The authors explain that SGA “requires first an extensive data preparation phase,” after which the researchers “subjected” the data “to various statistical tests to directly address the research questions.” They describe the four phases of their method as “(a) coding schema development, (b) directed content analysis, (c) meeting data and participant demographics extraction, and (d) statistical analyses.” Before discussing the development of their own “coding schema” and the other phases of their approach in more depth, the authors cite numerous influences from scholars in “behavioral research,” “multivariate statistics,” “corpus linguistics,” and “quantitative work in English for specific purposes,” explaining that the specific statistical “techniques” they apply “can be found in canonical works of multivariate statistics such as Keppel’s (1991) Design and Analysis and Johnson and Wichern’s (2007) Applied Multivariate Statistical Analysis” (Graham et al., 2015, pp. 75-77). One important distinction the authors draw between their method and these others is granularity: whereas the other methods operate at the “word and sentence level,” which facilitates “coding schema amenable to automated content analysis,” the authors operate at the paragraph level, which requires human intervention to formulate coding schema reflecting nuances discernible only at higher cognitive levels – for example, whether particular evidentiary artifacts (transcripts) are based on randomized controlled trials (RCTs) addressing “efficacy” or RCTs addressing “safety and treatment-related hazards” (Graham et al., 2015, pp. 77-78). Choosing the longer, more complex paragraph as the unit of analysis makes the method depend upon “the inductive and qualitative nature of rhetorical analysis as a necessary precursor to both qualitative coding and statistical testing” (Graham et al., 2015, p. 77).
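
The paragraph-level, human-assigned coding the authors describe can be pictured as qualitative labels that are only later reduced to counts for statistical testing. Here is a minimal sketch of that reduction step (the category names loosely echo the article’s examples; the data structure and counts are hypothetical):

```python
from collections import Counter

# Hypothetical hand-coded paragraphs from one sponsor presentation.
# A human coder assigns each paragraph an evidence type, the kind of
# distinction (RCT-efficacy vs. RCT-safety) that word-level automated
# coding would struggle to capture.
coded_paragraphs = [
    {"para": 1, "evidence": "rct_efficacy"},
    {"para": 2, "evidence": "rct_safety"},
    {"para": 3, "evidence": "rct_efficacy"},
    {"para": 4, "evidence": "anecdotal"},
]

# Reduce the qualitative codes to frequencies, the raw material for
# the later statistical phase.
counts = Counter(p["evidence"] for p in coded_paragraphs)
total = sum(counts.values())
for code, n in counts.items():
    print(f"{code}: {n} paragraphs ({n / total:.0%})")
```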

In the final section of their explanation of SGA, the authors summarize their statistical methods, which include both “descriptive statistics” and “inferential statistics.” They applied these two types of methods, respectively, to “provide a quantitative representation of the data set” (e.g., “mean, median, and standard deviation”) and to “estimate the relationship between variables” (e.g., “statistically significant impacts”) (Graham et al., 2015, pp. 81-83).
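
The two modes map onto familiar operations. The sketch below (with invented values, not the study’s data) computes descriptive summaries and then a simple inferential estimate of the relationship between two variables; linregress stands in here for the more elaborate multivariate techniques the authors cite:

```python
import statistics
from scipy.stats import linregress

# Hypothetical per-presentation counts of efficacy-evidence paragraphs.
efficacy_counts = [3, 7, 2, 9, 5, 4, 6, 8]
# Hypothetical approval-vote share for each corresponding meeting.
approval_share = [0.9, 0.5, 0.8, 0.4, 0.7, 0.75, 0.6, 0.5]

# Descriptive statistics: a quantitative representation of the data set.
print("mean:", statistics.mean(efficacy_counts))
print("median:", statistics.median(efficacy_counts))
print("std dev:", statistics.stdev(efficacy_counts))

# Inferential statistics: estimate the relationship between variables.
result = linregress(efficacy_counts, approval_share)
print(f"slope = {result.slope:.3f}, p = {result.pvalue:.4f}")
```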

Returning to the point of the authors’ research – namely, demonstrating how SGA empowers scholars to provide confident answers to research questions and thereby to create and assert knowledge clearly valued by societal interests – their SGA enables them to report that their “multiple regression analysis” found “RCT-efficacy data and conflict of interest remained as the only significant predictors of approval rates. Oddly, the use of efficacy data seems to lower the chance of approval, whereas a greater presence of conflict of interest increases the probability of approval” (Graham et al., 2015, p. 89). On one hand, this finding could encourage entities aiming to increase the probability of approval to allocate resources toward increasing the presence of conflicts of interest, since that is the only input variable demonstrated to contribute to that aim. On the other hand, the finding provides evidence for entities arguing that conflicts of interest illegally (or at least undesirably) affect ODAC participants’ votes and that “stricter controls on conflicts of interests should be deployed” (Graham et al., 2015, p. 92).
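
A multiple regression of the sort reported can be sketched as follows. The synthetic data below is constructed only to echo the direction of the reported effects (efficacy evidence lowering approval, conflict of interest raising it); the variable names, model specification, and numbers are all assumptions for illustration, not the authors’ actual analysis:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 50  # hypothetical number of meetings

# Synthetic predictors: volume of RCT-efficacy evidence and a
# conflict-of-interest measure for each meeting.
efficacy = rng.normal(5, 2, n)
conflict = rng.normal(3, 1, n)

# Synthetic outcome built to mirror the reported signs: efficacy
# evidence lowers approval share, conflict of interest raises it.
approval = 0.5 - 0.03 * efficacy + 0.08 * conflict + rng.normal(0, 0.05, n)

# Ordinary least squares with an intercept term.
X = sm.add_constant(np.column_stack([efficacy, conflict]))
model = sm.OLS(approval, X).fit()
print(model.params)   # intercept, efficacy, conflict coefficients
print(model.pvalues)  # which predictors reach significance
```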