Big Data – Wabitechnology

AB08 – McNely, B., Spinuzzi, C., & Teston, C. (2015). Contemporary research methodologies in technical communication.

In Technical Communication Quarterly’s most recent special issue on research methods and methodologies, the issue’s guest editors assert “methodological approaches” are important “markers for disciplinary identity” and thereby agree with previous guest editor, Goubil-Gambrell, who in the 1998 special issue “argued that ‘defining research methods is a part of disciplinary development’” (McNely, Spinuzzi, & Teston, 2015, p. 2). Furthermore, the authors of the 2015 special issue revere the 1998 special issue as a “landmark issue” including ideas that “informed a generation of technical communication scholars as they defined their own objects of study, enacted their research ethics, and thought through their metrics” (McNely, et al., 2015, p. 9).

It is in this tradition the authors of the 2015 special issue both desire to review “key methodological developments” and associated theories forming the technical communication “field’s current research identity” and to preview and “map future methodological approaches” and relevant theories (McNely, et al., 2015, p. 2). The editors argue the approaches and theories discussed in this special edition of the journal “not only respond to” what they view as substantial changes in “tools, technologies, spaces, and practices” in the field over the past two decades, but also “innovate” by describing and modeling how these changes are informing technical communicators’ emerging research methodologies and theories as those methodologies and theories relate to the “field’s objects of study, research ethics, and metrics” (i.e. “methodo-communicative issues”) (McNely, et al., 2015, pp. 1-2, 6-7).

Reviewing what they see as the fundamental theories and research methodologies of the field, the authors explore how a broad set of factors (e.g. assumptions, values, agency, tools, technology, and contexts) manifest in work produced along three vectors of theory and practice they identify as “sociocultural theories of writing and communication,” “associative theories and methodologies,” and “the new material turn” (McNely, et al., 2015, p. 2). The authors describe the sociocultural vector as developing from theoretical traditions in “social psychology, symbolic interactionism,” “learning theory,” and “activity theory,” among others, and as essentially involving “purposeful human actors,” “material surroundings,” “heterogeneous artifacts and tools,” and even “cognitive constructs” combining in “concrete interactions” – that is, situations – arising from synchronic and diachronic contextual variables scholars may identify, describe, measure, and use to explain phenomena and theorize about them (McNely, et al., 2015, pp. 2-4). The authors describe the associative vector as developing from theoretical traditions in “articulation theory,” “rhizomatics,” “distributed cognition,” and “actor-network theory (ANT)” (McNely, et al., 2015, p. 4) and as essentially involving “symmetry—a methodological stance that ascribes agency to a network of human and nonhuman actors rather than to specific human actors” and therefore leading researchers to “focus on associations among nodes” as objects at the methodological nexus (McNely, et al., 2015, p. 4). 
The authors describe the new material vector as developing from theoretical traditions in “science and technology studies, political science, rhetoric, and philosophy” (with the overlap of the specific traditions from political science and philosophy often “collected under the umbrella known as ‘object-oriented ontology’”) and as essentially involving a “radically symmetrical perspective on relationships between humans and nonhumans—between people and things, whether those things are animal, vegetable, or mineral” and how these human and non-human entities integrate into “collectives” or “assemblages” that have “agency” one could view as “distributed and interdependent,” a phenomenon the authors cite Latour as labeling “interagentivity” (McNely, et al., 2015, p. 5).

Previewing the articles in this special issue, the editors acknowledge how technical communication methodologies have been “influenced by new materialisms and associative theories” and argue these methodologies “broaden the scope of social and rhetorical aspects” of the field and “encourage us to consider tools, technologies, and environs as potentially interagentive elements of practice” that enrich the field (McNely, et al., 2015, p. 6). At the same time, the editors mention how approaches such as “action research” and “participatory design” are advancing “traditional qualitative approaches” (McNely, et al., 2015, p. 6). In addition, the authors state “given the increasing importance of so-called ‘big data’ in a variety of knowledge work fields, mixed methods and statistical approaches to technical communication are likely to become more prominent” (McNely, et al., 2015, p. 6). Amidst these developments, the editors state their view that adopting “innovative methods” in order to “explore increasingly large data sets” while “remaining grounded in the values and aims that have guided technical communication methodologies over the previous three decades” may be one of the field’s greatest challenges (McNely, et al., 2015, p. 6).

In the final section of their paper, the authors explicitly return to what they seem to view as primary disciplinary characteristics (i.e. markers, identifiers), which they call “methodo-communicative issues,” and use those characteristics to compare the articles in the 1998 special issue with those in the 2015 special issue and to identify what they see as new or significant in the 2015 articles. The “methodo-communicative issues” or disciplinary characteristics they use are: “objects of study, research ethics, and metrics” (McNely, et al., 2015, pp. 6-7). Regarding objects of study, the authors note how in the 1998 special issue, Longo focuses on the “contextual nature of technical communication” while in the 2015 special issue, Read and Swarts focus on “networks and knowledge work” (McNely, et al., 2015, p. 7). Regarding ethics, the authors cite Blyler in the 1998 special issue as applying “critical” methods rather than “descriptive/explanatory methods” while in the 2015 special issue, Walton, Zraly, and Mugengana apply “visual methods” to create “ethically sound cross-cultural, community-based research” (McNely, et al., 2015, p. 7). Regarding metrics or “measurement,” the authors cite Charney in the 1998 special issue as contrasting the affordances of “empiricism” with “romanticism” while in the 2015 special issue, Graham, Kim, DeVasto, and Keith explore the affordances of “statistical genre analysis of larger data sets” (McNely, et al., 2015, p. 7). In their discussion of what is new or significant in the articles in the 2015 special issue, the editors highlight how some articles address particular methodo-communicative issues. 
Regarding metrics or “measurement,” for example, they highlight how Graham, Kim, DeVasto, and Keith apply Statistical Genre Analysis (SGA) – a hybrid research method combining rhetorical analysis with statistical analysis – to answer research questions such as which “specific genre features can be correlated with specific outcomes” across an “entire data set” rather than across selected exemplars (McNely, et al., 2015, p. 8).

In summary, the guest editors of this 2015 special issue on contemporary research methodologies both review the theoretical and methodological traditions of technical communication and preview the probable future direction of the field as portrayed in the articles included in this special issue.

AB07 – Ghemawat, S., Gobioff, H., & Leung, S. T. (2003). The Google File System.

When they published their paper in 2003, engineers at Google had already designed, developed, and implemented the Google File System (GFS) in an effort to sustain performance and control costs while providing the infrastructure, platform, and applications required to deliver Google’s services to users (Ghemawat, Gobioff, & Leung, 2003, p. 29). Although the authors acknowledge GFS has similar aims as existing distributed file systems, aims such as “performance, scalability, reliability, and availability,” they state GFS has dissimilar “design assumptions” arising from their “observations” of Google’s “application workloads and technological environment” (Ghemawat, et al., 2003, p. 29). In general, the authors describe GFS as “the storage platform for the generation and processing of data used by our service” and used by our “research and development efforts that require large data sets” (Ghemawat, et al., 2003, p. 29). In addition, they state that GFS is suitable for “large distributed data-intensive applications,” that it is capable of providing “high aggregate performance to a large number of clients,” and that it “is an important tool” that allows Google “to innovate and attack problems on the scale of the entire web” (Ghemawat, et al., 2003, pp. 29, 43).

In the introduction to their paper, the authors state the four primary characteristics of their “workloads and technological environment” as 1) “component failures are the norm rather than the exception,” 2) “files are huge by traditional standards,” 3) “most files are mutated by appending new data rather than overwriting existing data,” and 4) “co-designing the applications and the file system API benefits the overall system by increasing flexibility” (Ghemawat, et al., 2003, p. 29). Each of these observations aligns with (results in) what the authors call their “radically different points in the design space” (Ghemawat, Gobioff, & Leung, 2003, p. 29) which they elaborate in some detail both in the numbered list in the paper’s introduction and in the bulleted list in the first section, “Assumptions,” of the paper’s second part, “Design Overview” (Ghemawat, et al., 2003, p. 30). Considering the authors’ first observation, for example, that the “quantity and quality of the components virtually guarantee” parts of the system will fail and “will not recover,” it is reasonable to assert the design premises (assumptions) that the system is made of “inexpensive commodity components” and that it “must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis” (Ghemawat, et al., 2003, pp. 29-30). Considering the authors’ second observation, that “files are huge by traditional standards,” meaning “multi-GB files are common” and “the system stores a modest number of large files,” it is reasonable to assert the design premises (assumptions) that system parameters “such as I/O operation and block sizes” need “to be revisited” and re-defined in order to optimize the system for managing large files while maintaining support for managing small files (Ghemawat, et al., 2003, pp. 29-30). 
These two examples demonstrate the type of arguments and evidence the authors provide to support their claim GFS responds to fundamental differences between the data, workloads (software), and infrastructure (hardware) of traditional information technology and the data, workloads, and infrastructure Google needs to sustain its operations in contemporary and future information technology (Ghemawat, et al., 2003, pp. 29-33, 42-43). In the remaining passages of their paper’s introduction and in the first section of their design overview, the authors continue discussing Google’s technological environment by describing the third and fourth primary characteristics of the environment they have observed and by explaining corollary design premises and assumptions arising from those observations they applied to designing and developing GFS (Ghemawat, et al., 2003, pp. 29-30).

With the rationale for their work thus established, the authors move on in the remaining sections of their design overview to discuss the overall architecture of GFS. First, they introduce some features the authors imply are shared with other distributed file systems – for example an API supporting “the usual operations to create, delete, open, close, read, and write files” – and some features the authors imply are unique to GFS – for example “snapshot and record append operations” (Ghemawat, et al., 2003, p. 30). Next, they describe the main software components (functions or roles) included in a GFS implementation on a given “cluster” (set) of machines, namely the “GFS clients,” the “GFS master,” and the “GFS chunkservers.” The GFS clients enable communication between the applications requiring data and the GFS master and GFS chunkservers providing data. The GFS master “maintains all file system metadata” and “controls system-wide activities.” The GFS chunkservers store the actual data (Ghemawat, et al., 2003, p. 31).

At this point in their paper, although the authors begin providing fairly detailed technical explanations for how these various GFS components interact, I will mention only a few points the authors emphasize as crucial to the success of GFS. First of all, in contrast with some other distributed file systems, GFS has a “single master” architecture, which brings both advantages and disadvantages (Ghemawat, et al., 2003, pp. 30-31). According to the authors, one advantage of “having a single master” is it “vastly simplifies” the “design” of GFS and “enables the master to make sophisticated chunk placement and replication decisions using global knowledge” (Ghemawat, et al., 2003, pp. 30-31). A disadvantage of having only one master, however, is its resources could be overwhelmed and it could become a “bottleneck” (Ghemawat, et al., 2003, p. 31). In order to overcome this potential disadvantage of the single master architecture, the authors explain how communication and data flow through the GFS architecture, namely that GFS clients “interact with the master for metadata operations,” but interact with the chunkservers for actual data operations (i.e. operations requiring alteration or movement of data) and thereby relieve the GFS master from performing “common operations” that could overwhelm it (Ghemawat, et al., 2003, pp. 31, 43). Other important points include GFS’s relatively large data “chunk size,” its “relaxed consistency model,” its elimination of the need for substantial client cache, and its use of replication instead of RAID to solve fault tolerance issues (Ghemawat, et al., 2003, pp. 31-32, 42).
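To make the single-master point concrete, here is a minimal Python sketch of the read path as I understand it from the paper: the client asks the master only where a chunk lives, then fetches the bytes from a chunkserver. It is an illustration under stated assumptions, not the GFS API. The class names, method names, chunk handles, and file path are all hypothetical; the 64 MB default is the chunk size the paper reports; and the sketch ignores reads that cross a chunk boundary, replication, and failures.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size reported in the paper


@dataclass
class Chunkserver:
    """Stores chunk contents keyed by a chunk handle (hypothetical strings here)."""
    chunks: dict = field(default_factory=dict)

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


@dataclass
class Master:
    """Holds metadata only: (file name, chunk index) -> (handle, replicas)."""
    table: dict = field(default_factory=dict)
    metadata_requests: int = 0

    def lookup(self, filename, chunk_index):
        self.metadata_requests += 1
        return self.table[(filename, chunk_index)]


class Client:
    def __init__(self, master, chunk_size=CHUNK_SIZE):
        self.master = master
        self.chunk_size = chunk_size
        self.locations = {}  # cached chunk locations (metadata, not file data)

    def read(self, filename, offset, length):
        # Simplification: assumes the read does not cross a chunk boundary.
        key = (filename, offset // self.chunk_size)
        if key not in self.locations:
            self.locations[key] = self.master.lookup(*key)
        handle, replicas = self.locations[key]
        # The bytes come from a chunkserver; the master never touches them.
        return replicas[0].read(handle, offset % self.chunk_size, length)


# Tiny demo with an 8-byte chunk size so that two chunks are easy to see.
data = b"hello world, gfs"
a, b = Chunkserver(), Chunkserver()
for server in (a, b):
    server.chunks.update({"h0": data[:8], "h1": data[8:]})
master = Master({("/logs/x", 0): ("h0", [a, b]), ("/logs/x", 1): ("h1", [b, a])})
client = Client(master, chunk_size=8)
print(client.read("/logs/x", 0, 5), client.read("/logs/x", 9, 3))
print(client.read("/logs/x", 2, 3), master.metadata_requests)
```

In the demo, three reads cost the master only two metadata lookups, because the third read reuses a cached chunk location. That is the sense in which keeping the master off the data path protects it from becoming a bottleneck.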

AB06 – Mahrt, M. & Scharkow, M. (2013). The value of big data in digital media research.

In their effort to promote “theory-driven” research strategies and to caution against the naïve embrace of “data-driven” research strategies that seems to have culminated recently in a veritable “’data rush’ promising new insights” into almost anything, the authors of this paper “review” a “diverse selection of literature on” digital media research methodologies and the Big Data phenomenon as they provide “an overview of ongoing debates” in this realm while arguing ultimately for a pragmatic approach based on “established principles of empirical research” and “the importance of methodological rigor and careful research design” (Mahrt & Scharkow, 2013, pp. 26, 20, 21, 30).

Mahrt and Scharkow acknowledge the advent of the Internet and other technologies has enticed “social scientists from various fields” to utilize “the massive amounts of publicly available data about Internet users” and some scholars have enjoyed success in “giving insight into previously inaccessible subject matters” (Mahrt & Scharkow, 2013, p. 21). Still, the authors note, there are some “inherent disadvantages” with sourcing data from the Internet in general and also from particular sites such as social media sites or gaming platforms (Mahrt & Scharkow, 2013, pp. 21, 25). One of the most commonly cited problems with sourcing publicly available data from social media sites, gaming platforms, or Internet usage is sampling: “the problem of random sampling on which all statistical inference is based, remains largely unsolved” (Mahrt & Scharkow, 2013, p. 25). The data in Big Data essentially are “huge” amounts of data “’naturally’ created by Internet users,” “not indexed in any meaningful way,” and with no “comprehensive overview” available (Mahrt & Scharkow, 2013, p. 21).

While Mahrt and Scharkow mention the positive attitude of “commercial researchers” toward a “golden future” for big data, they also mention the cautious attitude of academic researchers and explain how the “term Big Data has a relative meaning” (Mahrt & Scharkow, 2013, pp. 22, 25) contingent perhaps in part on these different attitudes. And although Mahrt and Scharkow imply most professionals would agree the big data concept “denotes bigger and bigger data sets over time,” they explain also how “in computer science” researchers emphasize the concept “refers to data sets that are too big” to manage with “regular storage and processing infrastructures” (Mahrt & Scharkow, 2013, p. 22). This emphasis on data volume and data management infrastructure familiar to computer scientists may seem to some researchers in “the social sciences and humanities as well as applied fields in business” too narrowly focused on computational or quantitative methods and this focus may seem exclusive and controversial in additional ways (Mahrt & Scharkow, 2013, pp. 22-23). Some of these additional controversies revolve around issues such as, for example, whether a “data analysis divide” may be developing that favors those with “the necessary analytical training and tools” over those without them (Mahrt & Scharkow, 2013, pp. 22-23), or whether an overemphasis on “data analysis” may have contributed to the “assumption that advanced analytical techniques make theories obsolete in the research process,” as if the numbers, the “observed data,” no longer require human interpretation to clarify meaning or to identify contextual or other confounding factors that may undermine the quality of the research and raise “concerns about the validity and generalizability of the results” (Mahrt & Scharkow, 2013, pp. 23-25).

Although Mahrt and Scharkow grant advances in “computer-mediated communication,” “social media,” and other types of “digital media” may be “fueling methodological innovation” such as analysis of large-scale data sets – or so-called Big Data – and that the opportunity to participate is alluring to “social scientists” in many fields, the authors conclude their paper by citing Herring and others urging researchers to commit to “methodological training,” “to learn to ask meaningful questions,” and to continually “assess” whether collection and analysis of massive amounts of data is truly valuable in any specific research endeavor (Mahrt & Scharkow, 2013, pp. 20, 29-30). The advantages of automated, big data research are numerous, as Mahrt and Scharkow concede, for instance “convenience” and “efficiency,” or the elimination of research obstacles such as “artificial settings” and “observation effects,” or the “visualization” of massive “patterns in human behavior” previously impossible to discover and render (Mahrt & Scharkow, 2013, pp. 24-25). With those advantages understood and granted, the authors’ argument seems a reasonable reminder of the “established principles of empirical research” and of the occasional need to reaffirm the value of the tradition (Mahrt & Scharkow, 2013, p. 21).

AB04 – Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.

Engineers at Google as early as 2003 encountered challenges in their efforts to deploy, operate, and sustain systems capable of ingesting, storing, and processing the large volumes of data required to produce and deliver Google’s services to its users, services such as the “Google Web search service” for which Google must create and maintain a “large-scale indexing” system, or the “Google Zeitgeist and Google Trends” services for which it must extract and analyze “data to produce reports of popular queries” (Dean & Ghemawat, 2008, pp. 107, 112).

As Dean and Ghemawat explain in the introduction to their article, even though many of the required “computations are conceptually straightforward,” the data volume is massive (terabytes or petabytes in 2003) and the “computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time” (Dean & Ghemawat, 2008, p. 107). At the time, even though Google had already “implemented hundreds of special-purpose computations” to “process large amounts of raw data” and the system worked, the authors describe how they sought to reduce the “complexity” introduced by a systems infrastructure requiring “parallelization, fault tolerance, data distribution and load balancing” (Dean & Ghemawat, 2008, p. 107).

Their solution involved creating “a new abstraction” that not only preserved their “simple computations,” but also provided a cost-effective, performance-optimized large cluster of machines that “hides the messy details” of systems infrastructure administration “in a library” while enabling “programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily” (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge their “abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” and acknowledge others “have provided restricted programming models and used the restrictions to parallelize the computation automatically.” In addition, however, Dean and Ghemawat assert their “MapReduce” implementation is a “simplification and distillation of some of these models” resulting from their “experience with large real-world computations” and their unique contribution may be their provision of “a fault-tolerant implementation that scales to thousands of processors” while other “parallel processing systems” were “implemented on smaller scales” while requiring the programmer to address machine failures (2008, pp. 107, 113).

In sections 2 and 3 of their paper, the authors provide greater detail of their “programming model,” their specific “implementation of the MapReduce interface” including the Google File System (GFS) – a “distributed file system” that “uses replication to provide availability and reliability on top of unreliable hardware” – and an “execution overview” with a diagram showing the logical relationships and progression of their MapReduce implementation’s components and data flow (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some “extensions” at times useful for augmenting “map and reduce functions” (Dean & Ghemawat, 2008, p. 110).
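The programming model is easier to see in code than in prose. The sketch below uses word counting, the example the paper itself works through, but everything about the runner is my own assumption: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the runner executes in a single process, so it shows only the shape of the abstraction and none of the parallelization, data distribution, or fault tolerance the real library hides.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all values emitted for one intermediate key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the library (hypothetical, illustration only)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # group values by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [("a.txt", "the cat the"), ("b.txt", "The dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
```

The programmer writes only `map_fn` and `reduce_fn`; in the system the authors describe, the equivalent of `run_mapreduce` is what spreads the work across a cluster.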

In section 5, the authors discuss their experience measuring “the performance of MapReduce on two computations running on a large cluster of machines” and describe the two computations or “programs” they run as “representative of a large subset of the real programs written by users of MapReduce,” that is computations for searching and for sorting (Dean & Ghemawat, 2008, p. 111). In other words, the authors describe the search function as a “class” of “program” that “extracts a small amount of interesting data from a large dataset” and the sort function as a “class” of “program” that “shuffles data from one representation to another” (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention “locality optimization,” a feature they describe further over the next few sections of their paper as one that “draws its inspiration from techniques such as active disks” and one that preserves “scarce” network bandwidth by reducing the distance between processors and disks and thereby limiting “the amount of data sent across I/O subsystems or the network” (Dean & Ghemawat, 2008, pp. 112-113).
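The two classes of program can be sketched as a pair of map functions that share an identity reduce. This is a hedged, single-process illustration, not the authors' benchmark code: the helper names (`run`, `grep_mapper`, `sort_mapper`, `identity_reducer`) and the sample log lines are hypothetical, and the sort relies on my toy runner ordering the intermediate keys, which stands in for the ordering the real implementation provides.

```python
import re
from itertools import groupby


def run(records, mapper, reducer):
    """Toy runner: map every record, order by intermediate key, then reduce."""
    pairs = sorted((kv for rec in records for kv in mapper(rec)),
                   key=lambda kv: kv[0])
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(reducer(key, [value for _, value in group]))
    return out


def identity_reducer(key, values):
    """Pass the intermediate values through unchanged."""
    yield from values


def grep_mapper(pattern):
    """Search: extract a small amount of interesting data from a large input."""
    rx = re.compile(pattern)

    def mapper(line):
        if rx.search(line):
            yield line, line
    return mapper


def sort_mapper(line):
    """Sort: key each record by its leading integer (an assumed record layout)."""
    yield int(line.split()[0]), line


lines = ["3 error disk", "1 ok", "2 error net"]
print(run(lines, grep_mapper("error"), identity_reducer))
print(run(lines, sort_mapper, identity_reducer))
```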

In section 6, as mentioned previously, Dean and Ghemawat discuss some of the advantages of the “MapReduce programming model” as enabling programmers for the most part to avoid the infrastructure management normally involved in leveraging “large amounts of resources” and to write relatively simple programs that “run efficiently on a thousand machines in a half hour” (Dean & Ghemawat, 2008, p. 112).

Overall, the story of MapReduce and GFS told by Dean and Ghemawat in this paper, a paper written a few years after their original paper on this same topic, is a story of discovering more efficient ways to utilize resources.

AB03 – Fan, W. & Bifet, A. (2012). Mining big data: Current status, and forecast to the future.

Fan and Bifet (2012) state the aim of their article, and of the particular issue of the publication it introduces, as to provide an overview of the “current status” and future course of the academic discipline and business and industrial field involved in “mining big data.” Toward that aim, the authors say they will “introduce Big Data mining and its applications,” “summarize the papers presented in this issue,” note some of the field’s controversies and challenges, discuss the “importance of open-source software tools,” and draw a few conclusions regarding the field’s overall endeavor (Fan & Bifet, 2012, p. 1).

In their bulleted list of controversies surrounding the big data phenomenon, the authors begin by noting the controversy regarding whether there is any “need to distinguish Big Data analytics from data analytics” (Fan & Bifet, 2012, p. 3). From the perspectives of people who have been involved with data management, including knowledge discovery and data mining, since before “the term ‘Big Data’ appeared for the first time in 1998” (Fan & Bifet, 2012, p. 1), it seems reasonable to consider exactly how the big data of recent years are different from the data of past years.

Although Fan and Bifet acknowledge this controversy, in much of their article they proceed to explain how the big data analytics of today is different from the data analytics of past years. First, they say their conception of big data refers to datasets so large and complex those data sets have “outpaced our capability to process, analyze, store and understand” them with “our current methodologies or data mining software tools” (Fan & Bifet, 2012, p. 1). Next, they describe their conception of “Big Data mining” as “the capability of extracting useful information from these large datasets or streams of data,” a task that, because of Laney’s “3 V’s in Big Data management” (volume, velocity, and variety), has thus far been extremely difficult or impossible (Fan & Bifet, 2012, pp. 1, 2). In addition to Laney’s 3V’s, which the authors cite from a note Laney wrote or published in 2001, the authors cite Gartner as explaining two more V’s of big data in a definition of big data on Gartner’s website accessed in 2012 (Fan & Bifet, 2012, p. 2). One of the Gartner V’s cited by Fan and Bifet is “variability,” involving “changes in the structure of the data and how users want to interpret that data,” which seems to me related enough to Laney’s “variety” that one could combine them for simplicity and convenience. The other Gartner V cited by the authors is “value,” which Fan and Bifet interpret as meaning “business value that gives organizations a compelling advantage, due to the ability of making decisions based in answering questions that were previously considered beyond reach,” and which seems to me distinct enough from Laney’s V’s that one should consider it a separate, fourth V that could be cited as a characteristic of big data (Fan & Bifet, 2012, p. 2).

In their discussion of how big data analytics can be applied to create value, the authors cite an Intel website accessed in 2012 to describe business applications such as customization of products or services for particular customers, technology applications that would improve “process time from hours to seconds,” healthcare applications for “mining the DNA of each person” in order “to discover, monitor and improve health aspects of everyone,” and public policy planning that could create “smart cities” “focused on sustainable economic development and high quality of life” (Fan & Bifet, 2012, p. 2). Continuing their discussion of the value or “usefulness” of big data, the authors describe the United Nations’ (UN) Global Pulse initiative as an effort begun in 2009 “to improve life in developing countries” by researching “innovative methods and techniques for analyzing real-time digital data,” by assembling a “free and open source” big data “technology toolkit,” and by establishing an “integrated, global network of Pulse Labs” in developing countries in order to enable them to utilize and apply big data (Fan & Bifet, 2012, p. 2).

Before Fan and Bifet mention Laney’s 3V’s of big data and cite Gartner’s fourth V – value – they describe some of the sources of data that have developed in “recent years” and that have contributed to “a dramatic increase in our ability to collect data from various sensors, devices, in different formats, from independent or connected applications” including both social media applications that enable end-users to easily generate content and an infrastructure of “mobile phones” that is “becoming the sensory gateway to get real-time data on people” (Fan & Bifet, 2012, p. 1). In addition, they mention the “Internet of things (IoT)” and predict it “will raise the scale of data to an unprecedented level” as “people and devices” in private and public environments “are all loosely connected” to create “trillions” of endpoints contributing “the data” from which “valuable information must be discovered” and used to “help improve quality of life and make our world a better place” (Fan & Bifet, 2012, p. 1).

Completing their introduction to the topic of big data and their discussion of some of its applications, Fan and Bifet turn in the third section of their paper to summarizing four selected articles from the December 2012 issue of Explorations, the newsletter of the Association for Computing Machinery’s (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (KDD), the issue of that newsletter which their article introduces. In their opinion, these four articles “together” represent “very significant state-of-the-art research in Big Data Mining” (Fan & Bifet, 2012, p. 2). Their summaries of the four articles, two articles from researchers in academia and two articles from researchers in industry, discuss big data mining infrastructure and technologies, methods, and objectives. They say the first article, from researchers at Twitter, Inc., “presents insights about Big Data mining infrastructures, and the experience of doing analytics at Twitter” which illustrates the “current state of data mining tools” is such that “most of the time is consumed in preparatory work” and in revising “preliminary models into robust solutions” (Fan & Bifet, 2012, p. 2). They summarize the second article, from researchers in academia, as being about “mining heterogeneous information networks” of “interconnected, multi-typed data” that “leverage the rich semantics of typed nodes and links in a network” to discover knowledge “from interconnected data” (Fan & Bifet, 2012, p. 2). The third article, also from researchers in academia, they summarize as providing an “overview of mining big graphs” by using the “PEGASUS tool” and as indicating potentially fruitful “research directions for big graph mining” (Fan & Bifet, 2012, p. 2). 
They summarize the fourth article, from a researcher at Netflix, as being about Netflix’s “recommender and personalization techniques” and as including a substantial section on whether “we need more data or better models to improve our learning methodology” (Fan & Bifet, 2012, pp. 2-3).

In the next section of their paper, the authors provide a seven-bullet list of controversies surrounding the “new hot topic” of “Big Data” (Fan & Bifet, 2012, p. 3). The first controversy on their list, one I mentioned earlier in this article, raises the issue of whether and how the recent and so-called “Big Data” phenomenon is any different from what has previously been referred to as simply data management or data analysis or data analytics, among other similar terms or concepts that have existed in various disciplines or fields or bodies of literature for quite some time. The second controversy mentioned by the authors concerns whether “Big Data” may be nothing more than hype resulting from efforts by “data management systems sellers” to profit from sales of systems capable of storing massive amounts of data to be processed and analyzed by Hadoop and related technologies when in reality smaller volumes of data and other strategies and methods may be more appropriate in some cases (Fan & Bifet, 2012, p. 3). The third controversy the authors note asserts that in the case at least of “real time analytics,” the “recency” of the data is more important than the volume of data. As the fourth controversy, the authors mention how some of Big Data’s “claims to accuracy are misleading” and they cite Taleb’s argument that as “the number of variables grow, the number of fake correlations also grow” and can result in some rather absurd correlations such as the one in which Leinweber found “the S&P 500 stock index was correlated with butter production in Bangladesh” (Fan & Bifet, 2012, p. 3). The fifth controversy the authors note addresses the issue of data quality by proposing “bigger data are not always better data” and stating a couple of factors that can determine data quality, for example whether “the data is noisy or not,” and “if it is representative” (Fan & Bifet, 2012, p. 3). 
The authors state the sixth controversy as an ethical issue, mainly whether “it is ethical that people can be analyzed without knowing it” (Fan & Bifet, 2012, p. 3). The final controversy addressed by Fan and Bifet concerns whether access to massive volumes of data and the capabilities to use it (including required infrastructure, knowledge, and skills) are unfairly or unjustly limited and could “create a division between the Big Data rich and poor” (Fan & Bifet, 2012, p. 3).
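The fourth controversy, that fake correlations multiply as variables multiply, can be illustrated with a small simulation. The following is a minimal sketch of my own and does not come from Fan and Bifet's paper: the variable counts, the 20 observations per variable, the 0.5 threshold, and the function names are all assumptions chosen for illustration. Every variable is pure random noise, so any strong correlation it finds is spurious by construction.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_spurious(variables, threshold=0.5):
    """Count pairs of variables whose |correlation| exceeds the threshold."""
    return sum(
        1
        for x, y in combinations(variables, 2)
        if abs(pearson(x, y)) > threshold
    )


# Hypothetical setup: 200 unrelated noise variables, 20 observations each.
rng = random.Random(0)
N_OBS = 20
variables = [[rng.gauss(0, 1) for _ in range(N_OBS)] for _ in range(200)]

# Count "strong" correlations among the first 10, 50, and 200 variables.
counts = {k: count_spurious(variables[:k]) for k in (10, 50, 200)}
for k, c in counts.items():
    print(f"{k:>3} variables -> {c} pairs with |r| > 0.5")
```

Because the number of pairs grows roughly with the square of the number of variables, the count of chance correlations grows quickly even though no real relationship exists anywhere in the data.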

Fan and Bifet devote the fifth section of their paper to discussing “tools” and focus on the close relationships between big data, “the open source revolution,” and companies including “Facebook, Yahoo!, Twitter,” and “LinkedIn” that both contribute to and benefit from their involvement with “open source projects” such as the Apache Hadoop project (Fan & Bifet, 2012, p. 3), which many consider the foundation of big data. After briefly introducing the “Hadoop Distributed File System (HDFS) and MapReduce” as the primary aspects of the Hadoop project that enable storage and processing of massive data sets, respectively, the authors mention a few other open source projects within the Hadoop ecosystem such as “Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper,” and “Apache Cassandra,” among others (Fan & Bifet, 2012, p. 3). Next, the authors discuss more of the “many open source initiatives” involved with big data (Fan & Bifet, 2012, p. 3). “Apache Mahout,” for example, is a “scalable machine learning and data mining open source software based mainly in Hadoop,” “R” is a “programming language and software environment,” “MOA” enables “stream data mining” or “data mining in real time,” and “Vowpal Wabbit” (VW) is a “parallel learning” algorithm known for speed and scalability (Fan & Bifet, 2012, p. 3). Regarding open-source “tools” for “Big Graph mining,” the authors mention “GraphLab” and “PEGASUS,” the latter of which they describe as a “big graph mining system built on top of MAPREDUCE” that enables discovery of “patterns and anomalies in massive real-world graphs” (Fan & Bifet, 2012, pp. 3-4).
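For readers unfamiliar with the MapReduce idea mentioned above, the following minimal sketch shows the map, shuffle, and reduce steps as a word count. It is a single-process illustration written for this summary, not Hadoop's actual API; the function names and the two sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one file stored on a cluster.
docs = ["big data is big", "data about data"]
result = word_count(docs)
print(result)
```

In a real cluster the map calls would run in parallel on the machines holding each block of data, and the shuffle would move pairs across the network; here everything runs in one loop only to show the shape of the computation.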

The sixth section of their article provides a seven-bullet list of what the authors consider “future important challenges in Big Data management and analytics” given the nature of big data as “large, diverse, and evolving” (Fan & Bifet, 2012, p. 4). First, they discuss the need to continue exploring architectures in order to ascertain clearly what would be the “optimal architecture” for “analytic systems” “to deal with historic data and with real-time data” simultaneously (Fan & Bifet, 2012, p. 4). Next, they state the importance of ensuring accurate findings and making accurate claims – in other words, “to achieve significant statistical results” – in big data research, especially since “it is easy to go wrong with huge data sets and thousands of questions to answer at once” (Fan & Bifet, 2012, p. 4). Third, they mention the need to expand the number of “distributed mining” methods since some “techniques are not trivial to paralyze [sic]” (Fan & Bifet, 2012, p. 4), that is, to parallelize. Fourth, the authors note the importance of improving capabilities in analyzing data streams that are continuously “evolving over time” and “in some cases to detect change first” (Fan & Bifet, 2012, p. 4). Fifth, the authors note the challenge of storing massive amounts of data and emphasize the need to continue exploring the balance between gaining or sacrificing time or space given the “two main approaches” currently used to address the issue, namely either compressing (i.e. sacrificing time compressing to reduce required space to store) or sampling (i.e. using a sample of data – “coresets” – in order to represent much larger data volumes) (Fan & Bifet, 2012, p. 4). Sixth, the authors admit “it is very difficult to find user-friendly visualizations” and it will be necessary to develop innovative “techniques” and “frameworks” “to tell and show” the “stories” of data (Fan & Bifet, 2012, p. 4).
Last, the authors acknowledge massive amounts of potentially valuable data are being lost since much data being created today are “largely untagged file-based and unstructured data” (Fan & Bifet, 2012, p. 4). Quoting a “2012 IDC study on Big Data,” the authors say “currently only 3% of the potentially useful data is tagged, and even less is analyzed” (Fan & Bifet, 2012, p. 4).
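The sampling approach in the fifth challenge can be sketched with reservoir sampling, which keeps a fixed-size uniform random sample of a stream too large to store. Note that this is a simpler stand-in I have chosen for illustration and not a coreset construction, which typically involves weighted points and approximation guarantees; the stream size, sample size, and seed are assumptions.

```python
import random


def reservoir_sample(stream, k, rng):
    """Return k items chosen uniformly at random from a stream of unknown length.

    Only k items are held in memory at any time (Algorithm R).
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream: 100,000 record ids that we pretend cannot all be stored.
sample = reservoir_sample(range(100_000), 10, random.Random(42))
print(sample)
```

The trade-off matches the one the authors describe: the sample costs almost no space, but any analysis run on it yields an estimate of the full data rather than an exact answer.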

In the conclusion to their paper, Fan and Bifet predict “each data scientist will have to manage” increasing data volume, increasing data velocity, and increasing data variety in order to participate in “the new Final Frontier for scientific data research and for business applications” and to “help us discover knowledge that no one has discovered before” (Fan & Bifet, 2012, p. 4).

AB02 – Boyd, D., & Crawford, K. (2012). Critical questions for Big Data

As “social scientists and media studies scholars,” Boyd and Crawford (2012) consider it their responsibility to encourage and focus the public discussion regarding “Big Data” by asserting six claims they imply help define the many and important potential issues the “era of Big Data” has already presented to humanity and the diverse and competing interests that comprise it (Boyd & Crawford, 2012, pp. 662-663). Before asserting and explaining their claims, however, the authors define Big Data “as a cultural, technological, and scholarly phenomenon” that “is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets,” a phenomenon that has three primary components (fields or forces) interacting within it: 1) technology, 2) analysis, and 3) mythology (Boyd & Crawford, 2012, p. 663). Precisely because Big Data, as well as some “other socio-technical phenomenon,” elicit both “utopian and dystopian rhetoric” and visions of the future of humanity, Boyd and Crawford think it is “necessary to ask critical questions” about “what all this data means, who gets access to what data, how data analysis is deployed, and to what ends” (Boyd & Crawford, 2012, p. 664).

The authors’ first two claims are concerned essentially with epistemological issues regarding the nature of knowledge and truth (Boyd & Crawford, 2012, pp. 665-667). In explaining their first claim, “1. Big Data changes the definition of knowledge,” the authors draw parallels between Big Data as a “system of knowledge” and “‘Fordism’” as a “manufacturing system of mass production.” According to the authors, both of these systems influence people’s “understanding” in certain ways. Fordism “produced a new understanding of labor, the human relationship to work, and society at large.” And Big Data “is already changing the objects of knowledge” and suggesting new concepts that may “inform how we understand human networks and community” (Boyd & Crawford, 2012, p. 665). In addition, the authors cite Burkholder, Latour, and others in describing how Big Data refers not only to the quantity of data, but also to the “tools and procedures” that enable people to process and analyze “large data sets,” and to the general “computational turn in thought and research” that accompanies these new instruments and methods (Boyd & Crawford, 2012, p. 665). Further, the authors state “Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and categorization of reality” (Boyd & Crawford, 2012, p. 665). Finally, as counterpoint to the many potential benefits and positive aspects of Big Data they have emphasized thus far, the authors cite Anderson as one who has revealed the at times prejudicial and arrogant beliefs and attitudes of some quantitative proponents who summarily dismiss as inferior all qualitative or humanistic approaches to gathering evidence and formulating theories (Boyd & Crawford, 2012, pp. 665-666).

In explaining their second claim, “2. Claims to objectivity and accuracy are misleading,” the authors continue considering some of the biases and misconceptions inherent in epistemologies that privilege “quantitative science and objective method” as the paths to knowledge and absolute truth. According to the authors, Big Data “is still subjective” and even when research subjects or variables are quantified, those quantifications do “not necessarily have a closer claim on objective truth.” In the view of the authors, the obsession of social science and the “humanistic disciplines” with attaining “the status of quantitative science and objective method” is at least to some extent misdirected (Boyd & Crawford, 2012, pp. 666-667), even if understandable given the apparent value society assigns to quantitative evidence. Citing Gitelman and Bollier, among others, the authors believe “all researchers are interpreters of data” not only when they draw conclusions based on their research findings, but also when they design their research and decide what will – and what will not – be measured. Overall, the authors argue against too eagerly embracing the positivistic perspective on knowledge and truth and argue in favor of critically examining research philosophies and methods and considering the limitations inherent within them (Boyd & Crawford, 2012, pp. 667-668).

The third and fourth claims the authors make could be considered to address research quality. Their third claim, “3. Big data are not always better data,” emphasizes the importance of quality control in research and highlights how “understanding sample, for example, is more important than ever.” Since “the public discourse around” massive and easily collected data streams such as Twitter “tends to focus on the raw number of tweets available” and since “raw numbers” would not be a “representative sample” of most populations about which researchers seek to make claims, public perceptions and opinion could be skewed by either mainstream media’s misleading reporting about valid research or by unprofessional researchers’ erroneous claims based upon invalid research methods and evidence (Boyd & Crawford, 2012, pp. 668-669). In addition to these issues of research design, the authors highlight how additional “methodological challenges” can arise “when researchers combine multiple large data sets,” challenges involving “not only the limits of the data set, but also the limits of which questions they can ask of a data set and what interpretations are appropriate” (Boyd & Crawford, 2012, pp. 669-670).

The authors’ fourth claim continues addressing research quality, but at the broader level of context. This claim, “4. Taken out of context, Big Data loses its meaning,” emphasizes the importance of considering how the research context affects research methods and research findings and conclusions. The authors imply attitudes toward mathematical modeling and data collection methods may cause researchers to select data more for their suitability to large-scale, computational, automated, quantitative data collection and analysis than for their suitability to discovering patterns or to answering research questions. As an example, the authors consider the evolution of the concept of human networks in sociology and focus on different ways of measuring “‘tie strength,’” a concept understood by many sociologists to indicate “the importance of individual relationships” (Boyd & Crawford, 2012, p. 670). Although recently developed concepts such as “articulated networks” and “behavioral networks” may appear at times to indicate tie strength equivalent to more traditional concepts such as “kinship networks,” the authors explain how the tie strength of kinship networks is based on more in-depth, context-sensitive data collection such as “surveys, interviews” and even “observation,” while the tie strength of articulated networks or behavioral networks may rely on nothing more than interaction frequency analysis; and “measuring tie strength through frequency or public articulation is a common mistake” (Boyd & Crawford, 2012, p. 671). In general, the authors urge caution against considering Big Data the panacea that will objectively and definitively answer all research questions. In their view, “the size of the data should fit the research question being asked; in some cases, small is best” (Boyd & Crawford, 2012, p. 670).

The authors’ final two claims address ethical issues related to Big Data, some of which seem to have arisen in parallel with its ascent. In their fifth claim, “5. Just because it is accessible does not make it ethical,” the authors focus primarily on whether “social media users” implicitly give permission to anyone to use publicly available data related to the user in all contexts, even contexts the user may not have imagined, such as in research studies or in the collectors’ data or information products and services (Boyd & Crawford, 2012, pp. 672-673). Citing Ess and others, the authors emphasize researchers and scholars have “accountability” for their actions, including those actions related to “the serious issues involved in the ethics of online data collections and analysis.” The authors encourage researchers and scholars to consider privacy issues and to proactively assess whether they should assume users have provided “informed consent” for the researchers to collect and analyze users’ publicly available data just because the data is publicly available (Boyd & Crawford, 2012, pp. 672-673). In their sixth claim, “6. Limited access to Big Data creates new digital divides,” the authors note that although there is a prevalent perception Big Data “offers easy access to massive amounts of data,” the reality is access to Big Data and the ability to manage and analyze Big Data require resources unavailable to much of the population – and this “creates a new kind of digital divide: the Big Data rich and the Big Data poor” (Boyd & Crawford, 2012, pp. 673-674). “Whenever inequalities are explicitly written into the system,” the authors assert further, “they produce class-based structures” (Boyd & Crawford, 2012, p. 675).

In their article overall, Boyd and Crawford maintain an optimistic tone while enumerating the many issues emanating from the Big Data phenomenon. In concluding, the authors encourage scholars, researchers, and society to “start questioning the underlying assumptions, values, and biases of this new wave of research” (Boyd & Crawford, 2012, p. 675).

AB01 – Graham, S. S., Kim, S.-Y., Devasto, M. D., & Keith, W. (2015). Statistical genre analysis: Toward big data methodologies in technical communication.

A team of researchers determines to bring the power of “big data” into the toolkit of technical communication scholars by piloting a research method they “dub statistical genre analysis (SGA)” and describing and explaining the method in an article published in the journal Technical Communication Quarterly (Graham, Kim, Devasto, & Keith, 2015, pp. 70-71).

Acknowledging the value academic markets have begun assigning to findings, conclusions, and theories founded upon rigorous analysis of massive data sets, this team deconstructs the amorphous “big data” phenomenon and demonstrates how their SGA methodology can be used to quantitatively describe and visually represent the generic content (e.g. types of evidence and modes of reasoning) of rhetorical situations (e.g. committee meetings) and to discover input variables (e.g. conflicts of interest) that have statistically significant effects upon output variables (e.g. recommendations) of important policy-influencing entities such as the Food and Drug Administration’s (FDA) Oncologic Drugs Advisory Committee (ODAC) (Graham et al., 2015, pp. 86-89).

The authors believe there is much to gain by integrating the “humanistic and qualitative study of discourse with statistical methods” and although they respect the “craft character of rhetorical inquiry” (Graham et al., 2015, pp. 71-72) and utilize “the inductive and qualitative nature of rhetorical analysis as a necessary” initial step in their hybrid method (Graham et al., 2015, p. 77), they conclude their mixed-method SGA approach can increase the “range and power” (Graham et al., 2015, p. 92) of “traditional, inductive approaches to genre analysis” (Graham et al., 2015, p. 86) by offering the advantages “of statistical insights” while avoiding the disadvantages of statistical sterility that can emerge when the qualitative humanist element is absent (Graham et al., 2015, p. 91).

In the conclusion of their article, the researchers identify two main benefits of their hybrid SGA method. The first benefit is communication genres “can be defined with more precision” since SGA documents the actual frequency of generic conventions as they exist within a large sample of the corpus, rather than being defined more generally since traditional rhetorical methods may document the opinions experts have of the “typical” frequency of generic conventions as they perceive them to exist within a limited sample of “exemplars” selected from a small sample of the corpus. In addition, the authors argue analysis of a massive number of texts may reveal generic conventions that do not appear in the limited sample of exemplars that may be studied by practitioners of the traditional rhetorical approach involving only “critical analysis and close reading.” The second benefit is communications scholars are enabled to move beyond critical opinion and to claim statistically significant correlations between “situational inputs and outputs” and “genre characteristics that have been empirically established” (Graham et al., 2015, p. 92).

Befitting the subject of their study, the authors devote a considerable portion of their article to describing their research methodology. In the third section titled “Statistical Genre Analysis,” they begin by noting they conducted the “current pilot study” on a “relatively small subset” of the available data in order to “demonstrate the potential of SGA.” Further, they outline their research questions, the answers to two of which indeed seem to attest to the strength SGA can contribute to both the evidence and the inferences used by communication scholars in their own arguments about the communications they study. As they do in the introduction, in this section also, the authors note the intellectual lineage of SGA in various disciplines, including “rhetorical studies, linguistics,” “health communication,” psychology, and “applied statistics” (Graham et al., 2015, pp. 71, 76).

As explained earlier, the communication artifacts studied by these researchers are selected from among the various artifacts arising from the FDA’s ODAC meetings, specifically the textual transcriptions of presentations (essentially opening statements) given by the sponsors (pharmaceutical manufacturing companies) of the drugs under review during meetings which usually last one or two days (Graham et al., 2015, pp. 75-76). Not only in the arenas of technical communication and rhetoric, but also in the arenas of Science and Technology Studies (STS) and of Science, Technology, Engineering, and Math (STEM) public policy, managing conflicts of interests among ODAC participants and encouraging inclusion of all relevant stakeholders in ODAC meetings are prominent issues (Graham et al., 2015, p. 72). At the conclusion of ODAC meetings, voting participants vote either for or against the issue under consideration, generally “applications to market new drugs, new indications for already approved drugs, and appropriate research/study endpoints” (Graham et al., 2015, pp. 74-76).

It is within this context the authors attempted to answer the following two research questions, among others, regarding all ODAC meetings and sponsor presentations given at those meetings between 2009 and 2012: “1. How does the distribution of stakeholders affect the distribution of votes?” and “3. How does the distribution of evidence and forms of reasoning in sponsor presentations affect the distribution of votes?” (Graham et al., 2015, pp. 75-76). Notice both of these research questions ask whether certain input variables affect certain output variables. And in this case, the output variables are votes either for or against an action that will have serious consequences for people and organizations. Put another way, this is a political (or deliberative rhetoric) situation and the ability to predict with a high degree of certainty which inputs produce which outputs could be quite valuable, given those inputs and outputs could determine substantial budget allocations, consulting fees, and pharmaceutical sales – essentially, success or failure – among other things.

Toward the aim of asking and answering research questions with such potentially high stakes, the authors applied their SGA mixed-methods approach, which they explain included four phases of research conducted over approximately six months to one year and included at least four researchers. The authors explain SGA “requires first an extensive data preparation phase” after which the researchers “subjected” the data “to various statistical tests to directly address the research questions.” They describe the four phases of their SGA method as “(a) coding schema development, (b) directed content analysis, (c) meeting data and participant demographics extraction, and (d) statistical analyses.” Before moving into a deeper discussion of their own “coding schema” development, as well as the other phases of their SGA approach, the authors cite numerous influences from scholars in “behavioral research,” “multivariate statistics,” “corpus linguistics,” and “quantitative work in English for specific purposes,” while explaining the specific statistical “techniques” they apply “can be found in canonical works of multivariate statistics such as Keppel’s (1991) Design and Analysis and Johnson and Wichern’s (2007) Applied Multivariate Statistical Analysis” (Graham et al., 2015, pp. 75-77). One important distinction the authors make between their method and these other methods is while the other methods operate at the more granular “word and sentence level” that facilitates formulation of “coding schema amenable to automated content analysis,” the authors operate at the less granular paragraph level that requires human intervention in order to formulate coding schema reflecting nuances only discernable at higher cognitive levels, for example whether particular evidentiary artifacts (transcripts) are based on randomized controlled trials (RCTs) addressing issues of “efficacy” or RCTs addressing issues of “safety and treatment-related hazards” (Graham et al., 2015, pp. 77-78). 
Choosing the longer, more complex paragraph as their unit of analysis requires the research method to depend upon “the inductive and qualitative nature of rhetorical analysis as a necessary precursor to both qualitative coding and statistical testing” (Graham et al., 2015, p. 77).

In the final section of their explanation of SGA, their research methodology, the authors summarize their statistical methods including both “descriptive statistics” and “inferential statistics” and how they applied these two types of statistical methods, respectively, to “provide a quantitative representation of the data set” (e.g. “mean, median, and standard deviation”) and to “estimate the relationship between variables” (e.g. “statistically significant impacts”) (Graham et al., 2015, pp. 81-83).
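To make the distinction between the two types of statistics concrete, the following minimal sketch computes descriptive statistics and then fits a one-predictor least-squares line. The numbers are invented for illustration and are not taken from Graham et al.'s data set; the variable names are hypothetical, and the single-predictor fit is a simplification of the multiple regression and significance testing the authors actually report.

```python
import statistics

# Hypothetical data: a made-up predictor and a made-up outcome for 5 meetings.
predictor = [0, 1, 2, 3, 4]
outcome = [8, 10, 6, 12, 9]

# Descriptive statistics: summarize the outcome variable itself.
mean_outcome = statistics.mean(outcome)
median_outcome = statistics.median(outcome)
stdev_outcome = statistics.stdev(outcome)  # sample standard deviation

# Inferential step (simplified): estimate the relationship between variables
# with an ordinary least-squares line, outcome = intercept + slope * predictor.
mean_predictor = statistics.mean(predictor)
slope = sum(
    (x - mean_predictor) * (y - mean_outcome)
    for x, y in zip(predictor, outcome)
) / sum((x - mean_predictor) ** 2 for x in predictor)
intercept = mean_outcome - slope * mean_predictor

print(f"mean={mean_outcome}, median={median_outcome}, stdev={stdev_outcome:.3f}")
print(f"fitted line: outcome = {intercept:.2f} + {slope:.2f} * predictor")
```

A full inferential analysis would also test whether the slope differs significantly from zero, which this sketch omits.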

Returning to the point of the authors’ research – namely demonstrating how SGA empowers scholars to provide confident answers to research questions and therefore to create and assert knowledge clearly valued by societal interests – their SGA enables them to state their “multiple regression analysis” found “RCT-efficacy data and conflict of interest remained as the only significant predictors of approval rates. Oddly, the use of efficacy data seems to lower the chance of approval, whereas a greater presence of conflict of interest increases the probability of approval” (Graham et al., 2015, p. 89). Obviously, this finding encourages entities aiming to increase the probability of approval to allocate resources toward increasing the presence of conflicts of interests since that is the only input variable demonstrated to contribute to achieving their aim. On the other hand, this finding provides evidence that entities claiming conflicts of interests illegally (or at least undesirably) affect ODAC participants’ votes can use to bolster their arguments that “stricter controls on conflicts of interests should be deployed” (Graham et al., 2015, p. 92).