AB10 – Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners.

In the introduction to his book on the big data phenomenon, Jared Dean notes recent examples of big data’s impact, provides an extended definition of big data, and discusses some prominent issues currently debated in the field (Dean, 2014, pp. 1-12). In part one, Dean describes what he calls “the computing environment,” including elements such as hardware, systems architectures, programming languages, and software used in big data projects, as well as how these elements interact (Dean, 2014, pp. 23-25). In part two, Dean explains a broad set of tactics for “turning data into business value” through the “methodology, algorithms, and approaches that can be applied to your data mining activities” (Dean, 2014, pp. 53-54). In part three, Dean examines cases of “large multinational corporations” that completed big data projects and “overcame major challenges in using their data effectively” (Dean, 2014, p. 194). In the final chapter of his book, Dean notes some of the trends, “opportunities,” and “serious challenges” he sees in the future of “big data, data mining, and machine learning” (Dean, 2014, p. 233).

Including data mining and machine learning in the title of his book, Dean highlights two fields of practice that managed and processed relatively large volumes of data long before the popular term big data was first used and, according to Dean, became “co-opted for self-promotion by many people and organizations with little or no ties to storing and processing large amounts of data or data that requires large amounts of computation” (Dean, 2014, pp. 9-10). Although Dean does not yet explain why so much attention has become focused on data mining and related fields, he says data is the “new oil,” the natural resource that is “plentiful but difficult and sometimes messy to extract,” and he says this natural resource requires “infrastructure” to “transport, refine, and distribute it” (Dean, 2014, p. 12). While noting “’Big Data’ once meant petabyte scale” and was generally used in reference to “unstructured chunks of data mined or generated from the internet,” Dean proposes the usage of the term big data has “expanded to mean a situation where” organizations “have too much data to store effectively or compute efficiently using traditional methods” (Dean, 2014, p. 10). Furthermore, Dean proposes that what he calls the current “big data era” is differentiated by both a) notable changes in perception, attitude, and behavior by those who have realized “organizations that use data to make decisions over time in fact do make better decisions” – and therefore attain competitive advantage warranting “investment in collecting and storing data for its potential future value” – and by b) the “rapid development, creation, and maturity of technologies to store, manipulate, and analyze this data in new and efficient ways” (Dean, 2014, pp. 4-5).
The main example of big data Dean cites in his book’s introduction illustrates big data’s impact by highlighting how it enabled “scientists” “to identify genetic markers” that allowed them to discover the drug tamoxifen used to treat breast cancer “is not 80% effective in patients but 100% effective in 80% of patients and ineffective in the rest” (Dean, 2014, p. 2). In commenting on this example of big data’s impact, Dean states “this type of analysis was not possible before” the “era of big data” because the “volume and granularity of the data was missing,” the “computational resources” required for the analysis were too scarce and expensive, and the “algorithms or modeling techniques” were too immature (Dean, 2014, p. 2). These types of discoveries, Dean states, have revealed to organizations the “potential future value” of data and have resulted in a “virtuous circle” where the realization of the value of data leads to increased allocation of resources to collect, store, and analyze data, which leads to more valuable discoveries (Dean, 2014, p. 4). Although Dean mentions “credit” for first using the term big data is generally given to John Mashey, who “in the late 1990s” “gave a series of talks to small groups about this big data tidal wave that was coming,” he also notes the “first academic paper was presented in 2000, and published in 2003, by Francis X. Diebold” (Dean, 2014, p. 3). With the broad parameters of Dean’s extended definition of big data thus outlined, Dean completes the introduction of his book with a discussion of some prominent issues currently debated in the field, such as when sampling data may continue to be preferable to using all available data or when the converse is true (Dean, 2014, pp. 13-21), when new sources of data should be incorporated into existing processes (Dean, 2014, p. 13), and, perhaps most importantly, when the benefit of the information produced by a big data process clearly outweighs the cost of that process and thereby value is created (Dean, 2014, p. 11).

Dean’s use of the term “data mining” in the first sentences of his initial remarks both to part one and to part two of his book emphasizes Dean’s awareness of big data’s lineage in previously existing academic disciplines and professional fields of practice, a lineage that can seem lost in recent mainstream explanations of the big data phenomenon that often invoke a few terms all beginning with the letter “v” (Dean, 2014, pp. 24, 54). In fact, Dean himself refers to these “v’s” of big data, although he adds the term “value” to the other three commonly used terms “volume,” “velocity,” and “variety” (Dean, 2014, p. 24). Dean states “data mining is going through a significant shift with the volume, variety, value, and velocity of data increasing significantly each year” and he discusses in a fair amount of detail throughout parts one and two of his book the resources and methodologies available to and applied by those capitalizing on the big data phenomenon (Dean, 2014, pp. 23-25).

Dean separates the data mining endeavor into two sets of elements corresponding to the first two parts of his book. Part one, which Dean calls “the computing environment,” includes elements such as hardware, systems architectures, programming languages, and software used in big data projects, as well as how these elements interact. Part two, which Dean calls “turning data into business value,” includes elements such as the “methodology, algorithms, and approaches that can be applied to” big data projects (Dean, 2014, pp. 23-25, 54).

Although Dean does not explicitly identify as such what many in the information technology (IT) industry call the workload – meaning the primary and supporting software applications and data required to accomplish some information technology objective, e.g. a data mining project – he does discuss at various points throughout his book how the software (applications) used in data mining has characteristics which determine how well the software will run on particular types of hardware and solution architectures (designs). Dean’s introductory remarks to part one describe this as the “interaction between hardware and software,” and he notes specifically how “traditional data mining software was implemented by loading data into memory and running a single thread of execution over the data” and how this traditional implementation form meant the “process was constrained by the amount of memory available and the speed of the processor” (Dean, 2014, p. 24). This would mean in cases where the data volume was greater than the available RAM on the system, “the process would fail” (Dean, 2014, p. 24). In addition, Dean notes how software implemented with a “single thread of execution” cannot utilize the advantages of “multicore” CPUs and therefore contributes to imbalances in system utilization and thereby to project waste impeding performance/price optimization (Dean, 2014, pp. 24-25). Further emphasizing his point, Dean says “all software packages cannot take advantage of current hardware capacity,” and he notes how “this is especially true of the distributed computing model,” a model he acknowledges as important by encouraging decision makers “to ensure that algorithms are distributed and effectively leveraging” currently available computing resources (Dean, 2014, p. 25).
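The contrast Dean draws between memory-bound and streaming approaches can be sketched in a few lines of Python. This is a hypothetical illustration, not code from Dean’s book: the in-memory function mirrors the “traditional” pattern whose data volume is capped by RAM, while the streaming function reads one record at a time.

```python
import csv
import os
import tempfile

# Hypothetical sketch (not from Dean's book): an in-memory pass holds
# the whole dataset in RAM before computing, as traditional data mining
# software did; a streaming pass reads one record at a time.

def sum_in_memory(path):
    # Loads every row first -- memory use grows with file size, so the
    # process fails once data volume exceeds available RAM.
    with open(path) as f:
        rows = list(csv.reader(f))   # entire dataset held in RAM
    return sum(float(r[0]) for r in rows)

def sum_streaming(path):
    # Processes each row as it is read -- a constant memory footprint,
    # so data volume is bounded by disk capacity rather than RAM.
    total = 0.0
    with open(path) as f:
        for row in csv.reader(f):
            total += float(row[0])
    return total

# Tiny demonstration file.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("1.5\n2.5\n4.0\n")

in_memory_total = sum_in_memory(path)
streaming_total = sum_streaming(path)
os.remove(path)
print(in_memory_total, streaming_total)   # 8.0 8.0
```

Both functions produce the same answer on a small file; the difference Dean emphasizes only appears when the dataset no longer fits in memory.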

In the first chapter of part one, Dean begins discussing the hardware involved in big data, focusing on five primary hardware components: the storage, the central processing unit (CPU), the graphical processing unit (GPU), the memory (RAM), and the network (Dean, 2014, pp. 27-34). Regarding the storage hardware, Dean notes how although the price to performance and price to capacity ratios are improving, they may not be improving fast enough to offset increasing data volumes, volumes he says are “doubling every few years” (Dean, 2014, p. 28). Dean also draws attention to a few hardware innovations important to large-scale data storage and processing, such as how external storage subsystems (disk arrays) and solid-state drives (SSDs) provide CPUs with faster access to larger volumes of data by improving “throughput” rates, which in turn decreases the amount of time required to analyze vast quantities of data (Dean, 2014, pp. 28-29). Still, even though these innovations in data storage and access are improving data processing and analysis capabilities, Dean emphasizes the importance of the overall systems and their inter-relationships: big data analytics teams must choose “analytical software” that can take advantage of improvements in data storage, access, and processing technologies. For example, software that can “augment memory by writing intermediate results to disk storage” economizes because disk storage is less expensive than memory, and software that executes multi-threaded computations in parallel improves the utilization rates of multicore CPUs, utilization rates being a primary measure of efficiency used in analyzing costs relative to benefits (Dean, 2014, pp. 24, 28-29). Regarding CPU hardware, Dean notes how although the “famous Moore’s law” described the rapid improvements in processing power that “continued into the 1990s,” the law did not ensure sustainment of those same kinds of improvements (Dean, 2014, pp. 29-30). In fact, Dean states “in the early 2000s, the Moore’s law free lunch was over, at least in terms of processing speed,” for various reasons; for example, the heat generated by processors running at ultra-high frequencies is excessive. CPU manufacturers therefore tried other means of improving CPU performance, for example by “adding extra threads of execution to their chips” (Dean, 2014, p. 30). Ultimately, even though innovation in CPUs is different from what it was in the Moore’s law years of the 1980s and 1990s, it continues, and it remains true that CPU utilization rates are low relative to other system components such as mechanical disks, SSDs, and memory; therefore, in Dean’s view, the “mismatch that exists among disk, memory, and CPU” often remains the primary problem constraining performance (Dean, 2014, pp. 29-30). Regarding GPUs, Dean discusses how they have recently begun to be used to augment system processing power, and he focuses on how some aspects of graphics problems and data mining problems are similar in that they require or benefit from performing “a huge number of very similar calculations” to solve “hard problems remarkably fast” (Dean, 2014, p. 31). Furthermore, Dean notes how in the past “the ability to develop code to run on the GPU was restrictive and costly,” but recent improvements in “programming interfaces for developing software” to exploit GPU resources are overcoming those barriers (Dean, 2014, p. 31). Regarding memory (RAM), Dean emphasizes its importance for data mining workloads due to its function as the “intermediary between the storage of data and the processing of mathematical operations that are performed by the CPU” (Dean, 2014, p. 32).
In discussing RAM, Dean provides some background by mentioning a few milestones in the development of RAM and related components, for example how previous 32-bit CPUs and operating systems (OSes) limited addressable memory to 4GB and how Intel’s and AMD’s introductions of 64-bit CPUs at commodity prices in the early 2000s, along with the release of 64-bit OSes to support their widespread adoption, expanded addressable RAM to 8TB at that time and thereby supported “data mining platforms that could store the entire data mining problem in memory” (Dean, 2014, p. 32). These types of advancements in technology, coupled with improvements in the ratios of price to performance and price to capacity, including the “dramatic drop in the price of memory” during this same time period, “created an opportunity to solve many data mining problems that previously were not feasible” (Dean, 2014, p. 32). Even with this optimistic view of overall advancements, Dean reiterates that the pace of advancement in RAM and in hard drives remains slow compared to that in CPUs – RAM speeds “have increased by 10 times” while CPU speeds “have increased 10,000 times,” and disk storage advancements have been slower than those in RAM – therefore it remains important to continue seeking higher degrees of optimization, for example by using distributed systems, since “it is much less expensive to deploy a set of commodity systems” with high capacities of RAM than it is to use “expensive high-speed disk storage systems” (Dean, 2014, pp. 32-33). One of the disadvantages of distributed systems, however, is the network “bottleneck” existing between individual nodes in the cluster even when high-speed proprietary technologies such as Infiniband are used (Dean, 2014, pp. 33-34).
In the case of less expensive, standard technologies, Dean notes the “standard network connection for an analytical computing cluster is 10 gigabit Ethernet (10 GbE), which has an upper-bound data transfer rate of 4 gigabytes per second (GB/sec)” (Dean, 2014, p. 34). Since Dean identifies the inter-node network as the slowest component of distributed computing systems, he emphasizes the importance of considering the “network component when evaluating data mining software” and notes skillful design and selection of the “software infrastructure” and “algorithms” is required to ensure efficient “parallelization” is possible while minimizing data movement and “communication between computers” (Dean, 2014, p. 34).
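Two of the figures above lend themselves to quick back-of-envelope checks. The following sketch (illustrative only; the 1 TB working set is an assumed example, not from the book) reproduces the 4 GB ceiling of a 32-bit address space and shows why, at Dean’s quoted upper bound of 4 GB/sec for a cluster link, minimizing data movement matters.

```python
# Back-of-envelope checks on figures discussed above (illustrative only).

# A 32-bit address space can reference 2**32 bytes -- the 4 GB ceiling
# Dean associates with older CPUs and operating systems.
addressable_bytes = 2 ** 32
addressable_gb = addressable_bytes / 2 ** 30
print(addressable_gb)          # 4.0

# At the quoted upper bound of 4 GB/sec for an analytical cluster's
# network link, moving a hypothetical 1 TB working set between nodes
# takes minutes, not seconds -- one reason Dean stresses minimizing
# data movement and communication between computers.
terabyte = 2 ** 40
seconds = terabyte / (4 * 2 ** 30)
print(seconds)                 # 256.0
```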

In the second chapter of part one, Dean discusses the advantages and disadvantages of different types of distributed systems and notes how the decreasing price of standard hardware – including “high-core/large-memory (massively parallel processing [MPP] systems) and clusters of moderate systems” – has improved the cost to benefit ratio of solving “harder problems” which he defines as “problems that consume much larger volumes of data, with much higher numbers of variables” (Dean, 2014, p. 36). The crucial development enabling this advance in “big data analytics,” according to Dean, is the capability and practice of moving “the analytics to the data” rather than moving the “data to the analytics” (Dean, 2014, p. 36). Cluster computing that effectively moves “the analytics to the data,” according to Dean, “can be divided into two main groups of distributed computing systems,” that is “database computing” systems based on the traditional relational database management system (RDBMS) and “file system computing” systems based on distributed file systems (Dean, 2014, pp. 36-37). Regarding database computing, Dean describes “MPP databases” that “began to evolve from traditional DBMS technologies in the 1980s” as “positioned as the most direct update for the organizational enterprise data warehouses (EDWs)” and explains “the technology behind” them as involving “commodity or specialized servers that hold data on multiple hard disks” (Dean, 2014, p. 37). In addition to MPP databases, Dean describes “in-memory databases (IMDBs)” as an evolution begun “in the 1990s” that has become a currently “popular solution used to accelerate mission-critical data transactions” for various industries willing to absorb the higher costs of high-capacity RAM in order to attain the increased performance possible when all data is stored in RAM (Dean, 2014, p. 37). 
Regarding file system computing, Dean notes while there are many available “platforms,” the “market is rapidly consolidating on Hadoop” due to the number of “distributions and tools that are compatible with its file system” (Dean, 2014, p. 37). Dean attributes the “initial development” of Hadoop to Doug Cutting and Mike Cafarella, who “created” it “based” upon development they did on the “Apache open source web crawling project” called Nutch and upon “a paper published by Google that introduced the MapReduce paradigm for processing data on large clusters” (Dean, 2014, pp. 37-38). Summarizing the advantages of Hadoop, Dean explains it is “attractive because it can store and manage very large volumes of data on commodity hardware and can expand easily by adding hardware resources with incremental cost” (Dean, 2014, p. 38). This coupling of high-capacity data storage with incremental capital expenditures makes the cost to benefit ratio appear more attractive to organizations and enables them to rationalize storing “all available data in Hadoop,” wagering on its potential future value even while understanding the large volume of data stored in Hadoop “is rarely (actually probably never) in an appropriate form for data mining” without additional resource expenditures to cleanse it, transform it, and even “augment” it with data stored in other repositories (Dean, 2014, pp. 38-39). At this point in chapter two of his book, Dean has completed an initial overview of hardware components and solution architectures commonly considered by those responsible for purchasing and implementing big data management projects.
Members of the IT organization, according to Dean, are often responsible for these decisions, although they may be expected to collaborate with other organizational stakeholders to understand organizational needs and objectives and to explain the advantages and disadvantages of various technological options and potential “trade-offs” that should be considered (Dean, 2014, pp. 39-40). Near the end of chapter two, Dean illustrates some of these factors (criteria) and “big data technologies” in a comparative table which ranks some of the solutions he has discussed (e.g. IMDBs, MPPDBs, and Hadoop) according to the degree (high, medium, low) to which they possess some features or capabilities often required in big data solutions (e.g. maintaining data integrity, providing high-availability to the data, and handling “unstructured data”) (Dean, 2014, p. 40). Concluding chapter two, Dean notes selection of the optimal “computing platform” for big data projects depends “on many dimensions, primarily the volume of data (initial, working set, and output data volumes), the pattern of access to the data, and the algorithm for analysis” (Dean, 2014, p. 41). Furthermore, he states these dimensions will “vary” at different phases of the data analysis (Dean, 2014, p. 41).
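The MapReduce paradigm that Dean credits as a foundation of Hadoop can be sketched on a single machine. This is a minimal, hypothetical illustration of the three phases (map, shuffle, reduce) using the classic word-count example, not the Hadoop API itself, which distributes these phases across a cluster.

```python
from collections import defaultdict

# A single-machine sketch of the MapReduce paradigm (illustrative only --
# real Hadoop runs these phases distributed across cluster nodes).

def map_phase(records):
    # Map: emit (key, value) pairs -- here, (word, 1) for a word count.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently -- the
    # independence is what makes the work parallelizable.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big compute"])))
print(counts)   # {'big': 2, 'data': 1, 'compute': 1}
```

Because each reducer touches only one key’s values, the reduce work can run on whichever node holds that key’s data, which is the sense in which such systems move “the analytics to the data.”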

In the third chapter of part one, Dean departs from the preparation of data and the systems used to store data and moves on to the “analytical tools” that enable people to “create value” from the data (Dean, 2014, p. 43). Dean’s discussion of analytical tools focuses on five software applications and programming languages commonly used for large-scale data processing and analysis and he notes some of the strengths and weaknesses of each one. Beginning with the open source data mining software “Weka (Waikato Environment for Knowledge Analysis),” Dean describes it as “fully implemented in Java” and “notable for its broad range of extremely advanced training algorithms, its work flow graphical user interface (GUI)” and its “data visualization” capabilities (Dean, 2014, pp. 43-44). A weakness of Weka in Dean’s view is that it “does not scale well for big data analytics” since it “is limited to available RAM resources” and therefore its “documentation directs users to its data preprocessing and filtering algorithms to sample big data before analysis” (Dean, 2014, p. 44). Even with this weakness, however, Dean states many of Weka’s “most powerful algorithms” are only available in Weka and that its GUI makes it “a good option” for those without Java programming experience who need “to prove value” quickly (Dean, 2014, p. 44). When organizations need to “design custom analytics platforms,” Dean cites “Java and JVM languages” as “common choices” because of Java’s “considerable development advantages over lower-level languages” such as FORTRAN and C that “execute directly on native hardware” especially since “technological advances in the Java platform” have improved its performance “for input/output and network-bound processes like those at the core of many open source big data applications” (Dean, 2014, pp. 44-45). 
As evidence of these improvements, Dean notes how Apache Hadoop, the popular “Java-based big data environment,” won the “2008 and 2009 TeraByte Sort Benchmark” (Dean, 2014, p. 45). Overall, Dean’s perspective is that the performance improvements in Java and the increasing “scale and complexity” of “analytic applications” have converged such that Java’s advantages in “development efficiency,” along with “its rich libraries, many application frameworks, inherent support for concurrency and network communications, and a preexisting open source code base for data mining functionality,” now outweigh some of its known weaknesses such as “memory and CPU-bound performance” issues (Dean, 2014, p. 45). In addition to Java, Dean mentions “Scala and Clojure” as “newer languages that also run on the JVM and are used for data mining applications” (Dean, 2014, p. 45). Dean describes “R” as an “open source fourth-generation programming language designed for statistical analysis” that is gaining in “prominence” and popularity in the rapidly expanding “data science community” as well as in academia and “in the private sector” (Dean, 2014, p. 47). Evidence of R’s growing popularity is its ranking in the “2013 TIOBE general survey of programming languages,” in which it ranked “in 18th place in overall development language popularity” alongside “commercial solutions like SAS (at 21st) and MATLAB (at 19th)” (Dean, 2014, p. 47). Among the advantages of R, there are “thousands of extension packages” enabling customization to include “everything from speech analysis, to genomic science, to text mining,” in addition to its “impressive graphics, free and polished integrated development environments (IDEs), programmatic access to and from many general-purpose languages, and interfaces with popular proprietary analytics solutions including MATLAB and SAS” (Dean, 2014, p. 47).
Python is described by Dean as “designed to be an extensible, high-level language with a large standard library and simple, expressive syntax” that “can be used interactively or programmatically” and that is often “deployed for scripting, numerical analysis, and OO general-purpose and Web application development” (Dean, 2014, p. 49). Dean highlights Python’s “general programming strengths” and “many database, mathematical, and graphics libraries” as particularly beneficial “in the data exploration and data mining problem domains” (Dean, 2014, p. 49). Although Dean provides a long list of advantageous features of Python, he asserts “the maturity” of its “scikit-learn toolkit” as a primary factor in its recently higher adoption rates “in the data mining and data science communities” (Dean, 2014, p. 49). Last, Dean describes SAS as “the leading analytical software on the market” and cites reports by IDC, Forrester, and Gartner as evidence (Dean, 2014, p. 50). In their most recent reports at the time Dean published this book, both Forrester and Gartner had named SAS “as the leading vendor in predictive modeling and data mining” (Dean, 2014, p. 50). Dean describes “the SAS System” as composed of “a number of product areas including statistics, operations research, data management, engines for accessing data, and business intelligence (BI),” although he states the products “SAS/STAT, SAS Enterprise Miner, and the SAS text analytics suite” are most “relevant” in the context of this book (Dean, 2014, p. 50). Dean explains “the SAS system” can be “divided into two main areas: procedures to perform an analysis and the fourth-generation language that allows users to manipulate data” (Dean, 2014, p. 50). Dean illustrates one of the advantages of SAS by providing an example of how a SAS proprietary “procedure, or PROC” simplifies the code required to perform specific analyses such as “building regression models” or “doing descriptive statistics” (Dean, 2014, p. 51).
Another great advantage “of SAS over other software packages is the documentation,” which includes “over 9,300 pages” for the “SAS/STAT product alone” and over “2,000 pages” for the Enterprise Miner product (Dean, 2014, p. 51). As additional evidence of the advantages of SAS products, Dean states he has never encountered an “analytical challenge” he has “not been able to accomplish with SAS” and notes that recent, “major changes” in the SAS “architecture” have enabled it “to take better advantage of the processing power and falling price per FLOP (floating point operations per second) of modern computing clusters” (Dean, 2014, pp. 50, 52). With his explanations of the big data computing environment (i.e. hardware, systems architectures, software, and programming languages) and some aspects of the big data preparation phase completed in part one, Dean turns to part two in which he addresses in depth exactly how big data in general and predictive modeling in particular enable “value creation for business leaders and practitioners,” a phrase he uses as the subtitle of his book.
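Dean’s SAS PROC example is not reproduced here, but by way of analogy the same kind of condensation appears in other high-level tools: Python’s standard statistics module reduces “doing descriptive statistics” to a few calls (a rough analogy chosen for illustration, not a claim about SAS).

```python
import statistics

# Analogy only: like the SAS PROC example Dean describes, a high-level
# library condenses routine descriptive statistics into a few calls.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.pstdev(values)   # population standard deviation

print(mean, median, stdev)   # 5.0 4.5 2.0
```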

Dean introduces part two of his book by stating he will address over the next seven chapters the “methodology, algorithms, and approaches that can be applied to” data mining projects, including a general four-step “process of building models” he says “has been developed and refined by many practitioners over many years,” and also including the “sEMMA approach,” a “data mining methodology, created by SAS, that focuses on logical organization of the model development phase of data mining projects” (Dean, 2014, pp. 54, 58, 61). The sEMMA approach, according to Dean, “has been in place for over a decade and proven useful for thousands and thousands of users” (Dean, 2014, p. 54). Dean says his objective is to “explain a methodology for predictive modeling” since accurately “predicting future behavior” provides people and organizations “a distinct advantage regardless of the venue” (Dean, 2014, p. 54). In his explanation of data mining methodologies, Dean notes he will discuss “the types of target models, their characteristics” and their business applications (Dean, 2014, p. 54). In addition, Dean states he will discuss “a number of predictive modeling techniques” including the “fundamental ideas behind” them, “their origins, how they differ, and some of their drawbacks” (Dean, 2014, p. 54). Finally, Dean says he will explain some “more modern methods for analysis or analysis for specific types of data” (Dean, 2014, p. 54).

Dean begins chapter four by stating predictive modeling is one of the primary data mining endeavors, and he defines it as a process in which collected “historical data (the past)” are explored to “identify patterns in the data that are seen through some methodology (the model), and then using the model” to predict “what will happen in the future (scoring new data)” (Dean, 2014, p. 55). Next, Dean discusses the multi-disciplinary nature of the field using a Venn diagram from SAS Enterprise Miner training documentation that includes the following contributing disciplines: data mining, knowledge discovery and data mining (KDD), statistics, machine learning, databases, data science, pattern recognition, computational neuroscience, and artificial intelligence (AI) (Dean, 2014, p. 56). Dean notes his tendency to use “algorithms that come primarily from statistics and machine learning,” and he explains how these two disciplines, residing in different “university departments” as they do, produce graduates with different knowledge and skills. Graduates in statistics, according to Dean, tend to understand “a great deal of theory” but have “limited programming skills,” while graduates in computer science tend to “be great programmers” who understand “how computer languages interact with computer hardware, but have limited training in how to analyze data” (Dean, 2014, p. 56). The result of this, Dean explains, is “job applicants will likely know only half the algorithms commonly used in modeling,” with the statisticians knowing “regression, General Linear Models (GLMs), and decision trees” and the computer scientists knowing “neural networks, support vector machines, and Bayesian methods” (Dean, 2014, p. 56).
Before moving into a deeper discussion of the process of building predictive models, Dean notes a few “key points about predictive modeling,” namely that a) “sometimes models are wrong,” b) “the farther your time horizon, the more uncertainty there is,” and c) “averages (or averaging techniques) do not predict extreme values” (Dean, 2014, p. 57). Elaborating further, Dean says even though models may be wrong (i.e. there is a known margin of error), the models can still be useful for making decisions (Dean, 2014, p. 57). And finally, Dean emphasizes “logic and reason should not be ignored because of a model result” (Dean, 2014, p. 58).
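Dean’s third key point, that averaging techniques do not predict extreme values, is easy to demonstrate with a toy series (the numbers below are invented for illustration): a mean-based forecast is pulled toward the center of the history and cannot reproduce its tails.

```python
import statistics

# Illustrating Dean's caution that averages do not predict extremes
# (invented numbers): a mean-based "forecast" misses the tail entirely.
history = [10, 12, 11, 9, 10, 11, 95]   # one extreme observation

forecast = round(statistics.mean(history), 1)   # pulled toward the center
extreme = max(history)                          # the value averaging misses

print(forecast)   # 22.6
print(extreme)    # 95
```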

In the next sections of chapter four, Dean explains in detail “a methodology for building models” (Dean, 2014, p. 58). First, he discusses the general “process of building models” as a “simple, proven approach to building successful and profitable models,” and he explains the general process in four phases: “1. Prepare the data,” “2. Perform exploratory data analysis,” “3. Build your first model,” and “4. Iteratively build models” (Dean, 2014, pp. 58-60). The first phase of preparing the data, Dean says, is likely completed by “a separate team” and requires understanding “the data preparation process within” an organization, namely “what data exists” in the organization (i.e. the sources of data) and how data from various sources “can be combined” to “provide insight that was previously not possible” (Dean, 2014, p. 58). Dean emphasizes the importance of having access to “increasingly larger and more granular data” and points directly to the “IT organization” as the entity that should be “keeping more data for longer and at finer levels” to ensure their organizations are not “behind the trend” and not “at risk for becoming irrelevant” in the competitive landscape (Dean, 2014, pp. 58-59). The second phase of the model building process focuses on exploring the data to begin to “understand” it and “to gain intuition about relationships between variables” (Dean, 2014, p. 59). Dean emphasizes “domain expertise” is important at this phase in order to ensure “thorough analysis” and recommendations that avoid unwarranted focus on patterns that may seem significant to the data miner but are insignificant in reality; though skilled in data analysis, the data miner may lack the domain knowledge required to recognize that such patterns are already widely known by domain experts (Dean, 2014, p. 59).
Dean notes recent advances in “graphical tools” for data mining have simplified the data exploration process – a process that was once much slower and often required “programming skills” – to the degree that products from large and small companies such as “SAS, IBM, and SAP,” “QlikTech,” and “Tableau” enable users to easily “load data for visual exploration” and “have been proven to work with” projects involving “billions of observations” when “sufficient hardware resources are available” (Dean, 2014, p. 59). To ensure efficiency in the data exploration phase, Dean asserts the importance of adhering to the “principle of sufficiency” and to the “law of diminishing returns” so that exploration stops while the cost to benefit ratio is optimal (Dean, 2014, pp. 59-60). The third phase of the model-building process is to build the first model while acknowledging a “successful model-building process will involve many iterations” (Dean, 2014, p. 60). Dean recommends working rapidly with a familiar method to build the first model and states he often prefers to “use a decision tree” because he is comfortable with it (Dean, 2014, p. 60). This first model created is used as the “champion model” (benchmark) against which the next model iteration will be evaluated (Dean, 2014, p. 60). The fourth phase of the model-building process is where most time should be devoted and where the data miner will need to use “some objective criteria that defines the best model” in order to determine if the most recent model iteration is “better than the champion model” (Dean, 2014, p. 60). Dean describes this step as “a feedback loop” since the data miner will continue comparing the “best” model built thus far with the next model iteration until either “the project objectives are met” or some other constraint such as a deadline requires stopping (Dean, 2014, p. 60).
With this summary of the general model-building process finished, Dean next explains SAS’s sEMMA approach, which “focuses on the model development phase” of the general model-building process and “makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes and confirm a model’s accuracy” (Dean, 2014, p. 61). “The acronym sEMMA,” Dean states, “refers to the core process of conducting data mining” and stands for “sample, explore, modify, model,” and “assess” (Dean, 2014, p. 61). To explain sEMMA further, Dean addresses “a common misconception” by emphasizing “sEMMA is not a data mining methodology but rather a logical organization of the functional tool set” of the SAS Enterprise Miner product that can be used appropriately “as part of any iterative data mining methodology adopted by the client” (Dean, 2014, p. 61). Once the best model has been found under given constraints, this “champion model” can “be deployed to score” new data – this is “the end result of data mining” and the point at which “return on investment” is realized (Dean, 2014, p. 63). At this point, Dean explains some of the advantages of SAS’s Enterprise Miner, such as automation of “the deployment phase by supplying scoring code in SAS, C, Java, and PMML” and capture of “code for pre-processing activities” (Dean, 2014, p. 63).

Following his overview of the model-building process, Dean identifies the three “types of target models,” discusses “their characteristics,” provides “information about their specific uses in business,” and then explains some common ways of “assessing” (evaluating) predictive models (Dean, 2014, pp. 54, 64-70). The first target model Dean explains is “binary classification,” which in his experience “is the most common type of predictive model” (Dean, 2014, p. 64). Binary classification is often used to give “decision makers” a “system to arrive at a yes/no decision with confidence,” and to do so quickly – for example, to approve or disapprove a credit application or to launch or not launch a spacecraft (Dean, 2014, p. 64). Dean explains further, however, that there are cases when a prediction of “the probability that an event will or will not occur” is “much more useful than the binary prediction itself” (Dean, 2014, p. 64). In the case of weather forecasts, for example, most people would prefer to know the “confidence estimate” (“degree of confidence,” confidence level, probability) the forecaster assigns to each possible outcome, expressed as a percentage probability – it is easier to decide whether to carry an umbrella if one knows the forecaster predicts a ninety-five percent chance of rain than if one knows only that the forecaster predicts it will rain or it will not rain (Dean, 2014, p. 64). The second target model Dean explains is “multilevel or nominal classification,” which he describes as useful when one is interested in creating “more than two levels” of classification (Dean, 2014, p. 64).
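Dean’s umbrella example – preferring a probability estimate to a bare yes/no forecast – can be made concrete with a small expected-cost calculation. The function name and the cost figures here are hypothetical illustrations, not drawn from the book:

```python
def decide_umbrella(rain_probability, cost_carry=1.0, cost_soaked=10.0):
    """Expected-cost decision: carry the umbrella whenever the forecast
    probability makes getting soaked costlier than carrying it."""
    return rain_probability * cost_soaked > cost_carry

# A bare yes/no forecast hides this trade-off; a probability exposes it.
print(decide_umbrella(0.95))  # -> True  (95% chance of rain: carry it)
print(decide_umbrella(0.05))  # -> False (5% chance: leave it home)
```

The binary prediction alone would force the same action for a 51% and a 99% chance of rain; the probability lets each decision maker apply their own costs.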
As an example, Dean describes how preventing credit card fraud while facilitating timely transactions could mean the initial decision regarding a transaction includes not only the binary classifications of accept or decline, but also an exception classification of requires further review before an accept-or-decline decision can be made (Dean, 2014, p. 64). Although beneficial in some cases, Dean notes nominal classification “poses some additional complications from a computational and also reporting perspective” since it requires finding the probability of all events prior to computing the probability of the “last level,” adds the “challenge in computing the misclassification rate,” and requires the “report value be calibrated” for easier interpretation by readers of the report (Dean, 2014, p. 64). The final target model Dean explains is “interval prediction,” which he describes as “used when the largest level is continuous on the number line” (Dean, 2014, p. 66). This model, according to Dean, is often used in the insurance industry, which he states generally determines premium prices based on “three different types of interval predictive models including claim frequency, severity, and pure premium” (Dean, 2014, p. 67). Because insurance companies implement the models differently based on their own “historical data” and each customer’s “specific information” – including, in the automotive sector as an example, the customer’s car type, yearly driving distance, and driving record – they will arrive at different premium prices for certain classifications of customers (Dean, 2014, p. 67).
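The computational wrinkle Dean mentions – that the probability of the “last level” follows from the probabilities of all the other levels – can be shown in a few lines. The function name and the fraud-screening probabilities below are illustrative assumptions:

```python
def last_level_probability(other_level_probs):
    """In nominal (multilevel) classification the predicted probabilities
    over all levels must sum to 1, so the last level's probability is
    determined by the probabilities of every other level."""
    p = 1.0 - sum(other_level_probs)
    if not 0.0 <= p <= 1.0:
        raise ValueError("other levels' probabilities must sum to at most 1")
    return p

# Three-level fraud decision: accept, decline, or refer for further review.
p_accept, p_decline = 0.70, 0.05
p_review = last_level_probability([p_accept, p_decline])
print(round(p_review, 2))  # -> 0.25
```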

Having explained the three types of target models, Dean finishes chapter four by discussing how to evaluate “which model is best” given a particular predictive modeling problem and the available data set (Dean, 2014, p. 67). He establishes the components of a “model” as “all the transformations, imputations, variable selection, variable binning, and so on manipulations that are applied to the data in addition to the chosen algorithm and its associated parameters” (Dean, 2014, p. 67). Noting the inherent subjectivity of determining the “best” model, Dean asserts the massive “number of options and combinations makes a brute-force” approach “infeasible,” and therefore a “common set of assessment measures” has arisen; partitioning a data set into a larger “training partition” and a smaller “validation partition” used for “assessment” has likewise become “best practice” in order to understand the degree to which a “model will generalize to new incoming data” (Dean, 2014, pp. 67, 70). As the foundation for his explanation of “assessment measures,” Dean identifies “a set” of them “based on the 2×2 decision matrix” he illustrates in a table, with the potential outcomes of “nonevent” and “event” as the row headings and the potential predictions of “predicted nonevent” and “predicted event” as the column headings (Dean, 2014, p. 68). The values in the four table cells logically follow as “true negative,” “false negative,” “false positive,” and “true positive” (Dean, 2014, p. 68). This classification method is widely accepted, according to Dean, because it “closely aligns with what most people associate as the ‘best’ model, and it measures the model fit across all values” (Dean, 2014, p. 68). In addition, Dean notes the “proportion of events to nonevents” when using the classification method should be “approximately equal,” or “the values need to be adjusted for making proper decisions” (Dean, 2014, p. 68).
Once all observations are classified according to the 2×2 decision matrix, the “receiver operating characteristics (ROC) are calculated for all points and displayed graphically for interpretation” (Dean, 2014, p. 68). In another table in this section, Dean provides the “formulas to calculate different classification measures” such as the “classification rate (accuracy),” “sensitivity (true positive rate),” and “1-specificity (false positive rate),” among others (Dean, 2014, p. 68). Other assessment measures explained by Dean are “lift” and “gain” and the statistical measures Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Kolmogorov-Smirnov (KS) statistic (Dean, 2014, pp. 69-70).
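The measures derived from the 2×2 decision matrix can be sketched directly; the cell counts below are hypothetical, and each (sensitivity, 1-specificity) pair produced at a given probability cutoff is one point on the ROC curve Dean describes:

```python
def classification_measures(tp, fp, tn, fn):
    """Measures computed from the 2x2 decision matrix of actual outcomes
    (event/nonevent) versus predicted outcomes."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # classification rate
        "sensitivity": tp / (tp + fn),                # true positive rate
        "one_minus_specificity": fp / (fp + tn),      # false positive rate
    }

# Hypothetical counts: 50 true positives, 10 false positives,
# 80 true negatives, 20 false negatives.
m = classification_measures(tp=50, fp=10, tn=80, fn=20)
print(m)
```

Sweeping the decision cutoff from 0 to 1 regenerates these counts at each cutoff, and plotting sensitivity against 1-specificity across the sweep traces the ROC curve.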

In chapter five, Dean begins discussing in greater detail what he calls “the key part of data mining” (Dean, 2014, pp. 54, 71). This work follows and is founded upon the deployment and implementation phases that established the hardware infrastructure, the supporting software platform, and the data mining software to be used, as well as upon the data preparation phase – all phases that in “larger organizations” will likely be performed by cross-functional teams including specialists from various business units and the information technology division (Dean, 2014, pp. 54, 58, 71). While chapter four provides an overview of predictive modeling processes and methodologies, including the foundational target model types and how to evaluate (assess) their effectiveness given particular modeling problems and data sets, it is in chapters five through ten that Dean thoroughly discusses the work of the “data scientists” who are responsible for creating models that will predict the future and provide return on investment and competitive advantage to their organizations (Dean, 2014, pp. 55-70, 71-191). Data scientists, according to Dean, are responsible for executing best practice in trying “a number of different modeling techniques or algorithms and a number of attempts within a particular algorithm using different settings or parameters” to find the best model for accomplishing the data mining objective (Dean, 2014, p. 71). Dean explains data scientists will need to conduct “many trials” in a “brute force” effort to “arrive at the best answer” (Dean, 2014, p. 71). Although Dean notes he focuses primarily on “predictive modeling or supervised learning,” which “has a target variable,” the “techniques can be used” also “in unsupervised approaches to identify the hidden structure of a set of data” (Dean, 2014, p. 72).
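The “many trials” search over techniques and parameter settings that Dean describes can be sketched as a simple grid search. The technique names and toy validation errors below are placeholders for real model fits, not anything from the book:

```python
from itertools import product

def validation_error(technique, setting):
    """Hypothetical stand-in for fitting a model with the given technique
    and parameter setting, then scoring it on a validation partition."""
    toy_scores = {("tree", 2): 0.28, ("tree", 5): 0.22,
                  ("svm", 2): 0.25, ("svm", 5): 0.27}
    return toy_scores[(technique, setting)]

# Brute force: try every technique with every candidate setting,
# keeping whichever combination scores best on validation data.
best = None
for technique, setting in product(["tree", "svm"], [2, 5]):
    err = validation_error(technique, setting)
    if best is None or err < best[2]:
        best = (technique, setting, err)

print(best)  # -> ('tree', 5, 0.22)
```

Real grids grow multiplicatively with each added technique and parameter, which is exactly why Dean frames the search as many trials rather than a single clever choice.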
Without covering Dean’s rather exhaustive explanation throughout chapter five of the most “common predictive modeling techniques” and throughout chapters six through ten of “a set of methods” that “address more modern methods for analysis or analysis for specific type of data,” let it suffice to say he presents in each section pertaining to a particular modeling method that method’s “history,” an “example or story to illustrate how the method can be used,” “a high-level mathematical approach to the method,” and “a reference section” pointing to more in-depth materials on each method (Dean, 2014, pp. 54, 71-72). In chapter five, Dean explains modeling techniques such as recency, frequency, and monetary (RFM) modeling, regression (originally known as “least squares”), generalized linear models (GLMs), neural networks, decision and regression trees, support vector machines (SVMs), Bayesian network classification, and “ensemble methods” that combine models (Dean, 2014, pp. 71-126). In chapters six through ten, Dean explains modeling techniques such as segmentation, incremental response modeling, time series data mining, recommendation systems, and text analytics (Dean, 2014, pp. 127-180).
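Of the chapter-five techniques listed, RFM modeling is simple enough to sketch directly: each customer is scored on how recently they purchased, how often, and how much they spent. The customer names, dates, and amounts below are invented for illustration:

```python
from datetime import date

# Toy purchase history: customer -> list of (purchase_date, amount).
purchases = {
    "alice": [(date(2014, 5, 1), 120.0), (date(2014, 5, 20), 80.0)],
    "bob":   [(date(2013, 11, 3), 40.0)],
}
today = date(2014, 6, 1)

def rfm(history):
    """Raw recency/frequency/monetary values for one customer."""
    recency = (today - max(d for d, _ in history)).days  # days since last buy
    frequency = len(history)                             # number of purchases
    monetary = sum(a for _, a in history)                # total spend
    return recency, frequency, monetary

for customer, history in purchases.items():
    print(customer, rfm(history))
```

In practice the raw values are then binned into quantile scores (e.g. 1-5 on each dimension) so customers can be ranked and targeted.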

Sharing his industry experience in part three, Dean provides “a collection of cases that illustrate companies that have been able to” collect big data, apply “analytics” to “well-stored and well-prepared data,” and “find business value” to “improve the business” (Dean, 2014, p. 194). Dean’s case study of a “large U.S.-based financial services” company demonstrates how it attained its “primary objective” to improve the accuracy of the predictive model used in its marketing campaigns, to “move the model lift from 1.6 to 2.5,” and thereby to significantly increase the number of customers who responded to those campaigns (Dean, 2014, pp. 198, 202-203). Additionally, the bank attained its second objective, which “improved operational processing efficiency and responsiveness” and thereby increased “productivity for employees” (Dean, 2014, pp. 198, 203). Another case study, of “a technology manufacturer,” shows how it “used the distributed file system of Hadoop along with in-memory computational methods” to reduce the time required to “compute a correlation matrix” identifying sources of product quality issues “from hours to just a few minutes.” This speed enabled the manufacturer to detect and correct the source of quality problems quickly enough to prevent shipping defective products, remedy the manufacturing problem, and resume production of quality product as soon as possible (Dean, 2014, pp. 216-219). Dean’s other case studies describe how the big data phenomenon created value for companies in health care, “online brand management,” and targeted marketing of “smartphone applications” (Dean, 2014, pp. 205-208, 225).
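The correlation matrix at the heart of the manufacturer’s case can be illustrated at toy scale. The stage names and sensor readings below are hypothetical; the point of Hadoop plus in-memory methods was performing this same pairwise computation over vastly more measurement columns and rows:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length measurement series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy sensor readings from three manufacturing stages.
readings = {
    "stage_a": [1.0, 2.0, 3.0, 4.0],
    "stage_b": [2.1, 3.9, 6.2, 8.0],
    "stage_c": [5.0, 4.0, 5.1, 4.2],
}
names = list(readings)
# The correlation matrix: every stage paired with every other stage.
matrix = {(a, b): pearson(readings[a], readings[b])
          for a in names for b in names}
print(round(matrix[("stage_a", "stage_b")], 3))  # near-perfect correlation
```

A highly correlated pair flags two process stages whose measurements move together, which is how such a matrix helps trace a quality issue back to its source.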

Dean concludes his book by describing what he views as some of the “opportunities” and “challenges” in the future of “big data, data mining, and machine learning” (Dean, 2014, p. 233). Regarding the challenges, Dean first discusses the focus in recent years on how difficult it seems to be to reproduce the results of published research studies, and he advocates “tighter controls and accountability” to ensure “people and organizations are held accountable for their published research findings” and thereby create a “firm foundation” of knowledge from which to advance the public good (Dean, 2014, pp. 233-234). Second, Dean discusses issues of “privacy with public data sets” and focuses on how it is possible to “deanonymize” large, publicly available data sets by combining those sets with “microdata” sets, i.e., data sets about “specific people” (Dean, 2014, pp. 234-235). These two challenges combined raise issues concerning how to strike an ethical “balance between data privacy and reproducible research,” a balance that “includes questions of legality as well as technology” and their “competing interests” (Dean, 2014, pp. 233-235). Regarding the opportunities, Dean first discusses the “internet of things” (IoT), notes the great contribution of machine-to-machine (M2M) communication to the big data era, and states that as IoT technologies “develop and mature, data volumes will continue to proliferate,” with M2M data growing to the extent that “data generated by humans will fall to a small percentage in the next ten years” (Dean, 2014, p. 236). Organizations capable of “capturing machine data and using it effectively,” according to Dean, will have great “competitive advantage in the data mining space” in the near future (Dean, 2014, pp. 236-237).
The next opportunity Dean explains is the trend toward greater standardization on “open source” software, which gives professionals greater freedom in transferring “their skills” across organizations and which will require organizations to integrate open source software with proprietary software, benefiting both from less-expensive open source standards and from the “optimization,” “routine updates, technical support, and quality control assurance” offered by traditional “commercial vendors” (Dean, 2014, pp. 237-238). Finally, Dean discusses opportunities in the “future development of algorithms”; while he acknowledges “new algorithms will be developed,” he also states “big data practitioners will need to” understand and apply “traditional methods,” since true advancements in algorithms will be “incremental” and slower than some will claim (Dean, 2014, pp. 238-239). Dean states his “personal research interest is in the ensemble tree and deep learning areas” (Dean, 2014, p. 239). Additionally, he notes interesting developments made by the Defense Advanced Research Projects Agency (DARPA) on “a new programming paradigm for managing uncertain information” called Probabilistic Programming for Advanced Machine Learning (PPAML) (Dean, 2014, p. 239). Dean closes his discussion of the future of algorithms by citing, as testaments to the “success” of “data mining” and “analytics,” the recent advances made in the “science of predictive algorithms,” in the ability of “machines” to better explore and find “patterns in data,” and in the IBM Watson system’s capabilities in “information recall,” in comprehending “nuance and context,” and in applying algorithms to analyze natural language and “deduce meaning” (Dean, 2014, pp. 239-241).
Regarding “the term ‘big data,’” Dean concludes that even though it “may become so overused that it loses its meaning,” the existence and “evolution” of its primary elements – “hardware, software and data mining techniques and the demand for working on large, complex analytical problems” – is guaranteed (Dean, 2014, p. 241).
