
Machine-Learning Techniques for Predictive Analytics

LEARNING OBJECTIVES

■■ Understand the basic concepts and definitions of artificial neural networks (ANN)

■■ Learn the different types of ANN architectures

■■ Understand the concept and structure of support vector machines (SVM)

■■ Learn the advantages and disadvantages of SVM compared to ANN

■■ Understand the concept and formulation of the k-nearest neighbor (kNN) algorithm

■■ Learn the advantages and disadvantages of kNN compared to ANN and SVM

■■ Understand the basic principles of Bayesian learning and Naïve Bayes algorithm

■■ Learn the basics of Bayesian Belief Networks and how they are used in predictive analytics

■■ Understand different types of ensemble models and their pros and cons in predictive analytics

Predictive modeling is perhaps the most commonly practiced branch in data science and business analytics. It allows decision makers to estimate what the future holds by means of learning from the past (i.e., historical data). In this chapter, we study the internal structures, capabilities/limitations, and applications of the most popular predictive modeling techniques, such as artificial neural networks, support vector machines, k-nearest neighbor, Bayesian learning, and ensemble models. Most of these techniques are capable of addressing both classification- and regression-type prediction problems. Often, they are applied to complex prediction problems where other, more traditional techniques are not capable of producing satisfactory results. In addition to the ones covered in this chapter, other notable prediction modeling techniques include regression (linear or nonlinear), logistic regression (for classification-type prediction problems), and different types of decision trees (covered in Chapter 4).

5.1 Opening Vignette: Predictive Modeling Helps Better Understand and Manage Complex Medical Procedures 252

5.2 Basic Concepts of Neural Networks 255


5.3 Neural Network Architectures 259

5.4 Support Vector Machines 263

5.5 Process-Based Approach to the Use of SVM 271

5.6 Nearest Neighbor Method for Prediction 274

5.7 Naïve Bayes Method for Classification 278

5.8 Bayesian Networks 287

5.9 Ensemble Modeling 293

5.1 OPENING VIGNETTE: Predictive Modeling Helps Better Understand and Manage Complex Medical Procedures

Healthcare has become one of the most important issues to have a direct impact on the quality of life in the United States and around the world. While the demand for healthcare services is increasing because of the aging population, the supply side is having problems keeping up with the level and quality of service. To close the gap, healthcare systems ought to significantly improve their operational effectiveness and efficiency. Effectiveness (doing the right thing, such as diagnosing and treating accurately) and efficiency (doing it the right way, such as using the least amount of resources and time) are the two fundamental pillars upon which the healthcare system can be revived. A promising way to improve healthcare is to take advantage of predictive modeling techniques along with large and feature-rich data sources (true reflections of medical and healthcare experiences) to support accurate and timely decision making.

According to the American Heart Association, cardiovascular disease (CVD) is the underlying cause for over 20 percent of deaths in the United States. Since 1900, CVD has been the number-one killer every year except 1918, which was the year of the great flu pandemic. CVD kills more people than the next four leading causes of deaths combined: cancer, chronic lower respiratory disease, accidents, and diabetes mellitus. Of all CVD deaths, more than half are attributed to coronary diseases. Not only does CVD take a huge toll on the personal health and well-being of the population, but also it is a great drain on the healthcare resources in the United States and elsewhere in the world. The direct and indirect costs associated with CVD for a year are estimated to be in excess of $500 billion. A common surgical procedure to cure a large variant of CVD is called coronary artery bypass grafting (CABG). Even though the cost of a CABG surgery depends on the patient and service provider–related factors, the average rate is between $50,000 and $100,000 in the United States. As an illustrative example, Delen et al. (2012) carried out an analytics study using various predictive modeling methods to predict the outcome of a CABG and applied an information fusion–based sensitivity analysis on the trained models to better understand the importance of the prognostic factors. The main goal was to illustrate that predictive and explanatory analysis of large and feature-rich data sets provides invaluable information to make more efficient and effective decisions in healthcare.

THE RESEARCH METHOD

Figure 5.1 shows the model development and testing process used by Delen et al. (2012). They employed four different types of prediction models (artificial neural networks, support vector machines, and two types of decision trees—C5 and CART) and went through a large number of experimental runs to calibrate the modeling parameters for each model type. Once the models were developed, the researchers evaluated them on the test data set. Finally, the trained models were exposed to a sensitivity analysis procedure that measured the contribution of the variables. Table 5.1 shows the test results for the four different types of prediction models.

FIGURE 5.1 Process Map for Training and Testing of the Four Predictive Models. (The figure shows preprocessed data in Excel format being partitioned into training, testing, and validation sets; each of the four model types (ANN, SVM, DT/C5, and DT/CART) is trained and calibrated, tested, and subjected to a sensitivity analysis, producing tabulated model testing results (accuracy, sensitivity, and specificity) and integrated (fused) sensitivity analysis results.)


THE RESULTS

In this study, Delen et al. (2012) showed the power of data mining in predicting the outcome and in analyzing the prognostic factors of complex medical procedures such as CABG surgery. The researchers showed that using a number of prediction methods (as opposed to only one) in a competitive experimental setting has the potential to produce better predictive as well as explanatory results. Among the four methods that they used, SVM produced the best results, with a prediction accuracy of 88 percent on the test data sample. The information fusion–based sensitivity analysis revealed the ranked importance of the independent variables. The fact that some of the top variables identified in this analysis overlap with the most important variables identified in previously conducted clinical and biological studies confirms the validity and effectiveness of the proposed data mining methodology.

From the managerial standpoint, clinical decision support systems that use the outcome of data mining studies (such as the ones presented in this case study) are not meant to replace healthcare managers and/or medical professionals. Rather, they intend to support them in making accurate and timely decisions to optimally allocate resources to increase the quantity and quality of medical services. There still is a long way to go before we can see these decision aids being used extensively in healthcare practices. Among others, there are behavioral, ethical, and political reasons for this resistance to adoption. Maybe the need and government incentives for better healthcare systems will expedite the adoption.

QUESTIONS FOR THE OPENING VIGNETTE

1. Why is it important to study medical procedures? What is the value in predicting outcomes?

2. What factors do you think are the most important in better understanding and managing healthcare? Consider both managerial and clinical aspects of healthcare.

3. What would be the impact of predictive modeling on healthcare and medicine? Can predictive modeling replace medical or managerial personnel?

TABLE 5.1 Prediction Accuracy Results for All Four Model Types Based on the Test Data Set

Model Type¹              Confusion Matrix²          Accuracy³   Sensitivity³   Specificity³
                         Pos (1)     Neg (0)
ANN       Pos (1)        749         230            74.72%      76.51%         72.93%
          Neg (0)        265         714
SVM       Pos (1)        876         103            87.74%      89.48%         86.01%
          Neg (0)        137         842
C5        Pos (1)        876         103            79.62%      80.29%         78.96%
          Neg (0)        137         842
CART      Pos (1)        660         319            71.15%      67.42%         74.87%
          Neg (0)        246         733

¹ Acronyms for model types: artificial neural networks (ANN), support vector machines (SVM), popular decision tree algorithm (C5), classification and regression trees (CART).
² Prediction results for the test data samples are shown in a confusion matrix, where the rows represent the actuals and the columns represent the predicted cases.
³ Accuracy, sensitivity, and specificity are the three performance measures that were used in comparing the four prediction models.
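
To illustrate how the accuracy, sensitivity, and specificity figures in Table 5.1 follow from a confusion matrix, here is a brief Python sketch (the formulas are standard; the code itself is not from the original study) applied to the ANN row of the table.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from a confusion matrix (rows = actual, columns = predicted)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# ANN results from Table 5.1: actual Pos (1): 749 predicted Pos, 230 predicted Neg;
# actual Neg (0): 265 predicted Pos, 714 predicted Neg
acc, sens, spec = classification_metrics(tp=749, fn=230, fp=265, tn=714)
print(f"Accuracy={acc:.2%}  Sensitivity={sens:.2%}  Specificity={spec:.2%}")
# -> Accuracy=74.72%  Sensitivity=76.51%  Specificity=72.93%
```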


4. What were the outcomes of the study? Who can use these results? How can the results be implemented?

5. Search the Internet to locate two additional cases that used predictive modeling to understand and manage complex medical procedures.

WHAT WE CAN LEARN FROM THIS VIGNETTE

As you will see in this chapter, predictive modeling techniques can be applied to a wide range of problem areas, from standard business problems of assessing customer needs to understanding and enhancing the efficiency of production processes to improving healthcare and medicine. This vignette illustrates an innovative application of predictive modeling to better predict, understand, and manage CABG procedures. As the results indicate, these sophisticated analytics techniques are capable of predicting and explaining such complex phenomena. Evidence-based medicine is a relatively new term coined in the healthcare arena where the main idea is to dig deeply into past experiences to discover new and useful knowledge to improve medical and managerial procedures in healthcare. As we all know, healthcare needs all the help that it can get. Compared to traditional research, which is clinical and biological in nature, data-driven studies provide an out-of-the-box view to medicine and management of medical systems.

Sources: D. Delen, A. Oztekin, and L. Tomak, “An Analytic Approach to Better Understanding and Management of Coronary Surgeries,” Decision Support Systems, Vol. 52, No. 3, 2012, pp. 698–705; and American Heart Association, “Heart Disease and Stroke Statistics,” heart.org (accessed May 2018).

5.2 BASIC CONCEPTS OF NEURAL NETWORKS

Neural networks represent a brain metaphor for information processing. These models are biologically inspired rather than an exact replica of how the brain actually functions. Neural networks have been shown to be very promising systems in many forecasting and business classification applications due to their ability to “learn” from the data, their nonparametric nature (i.e., no rigid assumptions), and their ability to generalize. Neural computing refers to a pattern-recognition methodology for machine learning. The resulting model from neural computing is often called an artificial neural network (ANN) or a neural network. Neural networks have been used in many business applications for pattern recognition, forecasting, prediction, and classification. Neural network computing is a key component of any data science and business analytics toolkit. Applications of neural networks abound in finance, marketing, manufacturing, operations, information systems, and so on.

Because we cover neural networks, especially the feed-forward, multi-layer, perceptron-type prediction modeling–specific neural network architecture, in Chapter 6 (which is dedicated to deep learning and cognitive computing) as a primer to understanding deep learning and deep neural networks, in this section, we provide only a brief introduction to the vast variety of neural network models, methods, and applications.

The human brain possesses bewildering capabilities for information processing and problem solving that modern computers cannot compete with in many aspects. It has been postulated that a model or a system that is enlightened and supported by results from brain research and has a structure similar to that of biological neural networks could exhibit similar intelligent functionality. Based on this bottom-up approach, ANN (also known as connectionist models, parallel distributed processing models, neuromorphic systems, or simply neural networks) have been developed as biologically inspired and plausible models for various tasks.

Biological neural networks are composed of many massively interconnected neurons. Each neuron possesses axons and dendrites, fingerlike projections that enable a neuron to communicate with its neighboring neurons by transmitting and receiving electrical and chemical signals. More or less resembling the structure of their biological counterparts, ANN are composed of interconnected, simple processing elements called artificial neurons. When processing information, the processing elements in ANN operate concurrently and collectively, similar to biological neurons. ANN possess some desirable traits similar to those of biological neural networks, such as the abilities to learn, self-organize, and support fault tolerance.

Coming along a winding journey, ANN have been investigated by researchers for more than half a century. The formal study of ANN began with the pioneering work of McCulloch and Pitts in 1943. Inspired by the results of biological experiments and observations, McCulloch and Pitts (1943) introduced a simple model of a binary artificial neuron that captured some of the functions of biological neurons. Using information-processing machines to model the brain, McCulloch and Pitts built their neural network model using a large number of interconnected artificial binary neurons. From these beginnings, neural network research became quite popular in the late 1950s and early 1960s. After a thorough analysis of an early neural network model (called the perceptron, which used no hidden layer) as well as a pessimistic evaluation of the research potential by Minsky and Papert in 1969, interest in neural networks diminished.

During the past two decades, there has been an exciting resurgence in ANN studies due to the introduction of new network topologies, new activation functions, and new learning algorithms as well as progress in neuroscience and cognitive science. Advances in theory and methodology have overcome many of the obstacles that hindered neural network research a few decades ago. Evidenced by the appealing results of numerous studies, neural networks are gaining in acceptance and popularity. In addition, the desirable features in neural information processing make neural networks attractive for solving complex problems. ANN have been applied to numerous complex problems in a variety of application settings. The successful use of neural network applications has inspired renewed interest from industry and business. With the emergence of deep neural networks (as part of the rather recent deep learning phenomenon), the popularity of neural networks (with a “deeper” architectural representation and much-enhanced analytics capabilities) hit an unprecedented high, creating mile-high expectations from this new generation of neural networks. Deep neural networks are covered in detail in Chapter 6.

Biological versus Artificial Neural Networks

The human brain is composed of special cells called neurons. These cells do not die and are not replaced when a person is injured (all other cells reproduce to replace themselves and then die). This phenomenon might explain why humans retain information for an extended period of time and start to lose it when they get old—as the brain cells gradually start to die. Information storage spans sets of neurons. The brain has anywhere from 50 billion to 150 billion neurons, of which there are more than 100 different kinds. Neurons are partitioned into groups called networks. Each network contains several thousand highly interconnected neurons. Thus, the brain can be viewed as a collection of neural networks.

The ability to learn and to react to changes in our environment requires intelligence. The brain and the central nervous system control thinking and intelligent behavior. People who suffer brain damage have difficulty learning and reacting to changing environments. Even so, undamaged parts of the brain can often compensate with new learning.


A portion of a network composed of two cells is shown in Figure 5.2. The cell itself includes a nucleus (the central processing portion of the neuron). To the left of cell 1, the dendrites provide input signals to the cell. To the right, the axon sends output signals to cell 2 via the axon terminals. These axon terminals merge with the dendrites of cell 2. Signals can be transmitted unchanged, or they can be altered by synapses. A synapse is able to increase or decrease the strength of the connection between neurons and cause excitation or inhibition of a subsequent neuron. This is how information is stored in the neural networks.

An ANN emulates a biological neural network. Neural computing actually uses a very limited set of concepts from biological neural systems (see Technology Insights 5.1). It is more of an analogy to the human brain than an accurate model of it. Neural concepts usually are implemented as software simulations of the massively parallel processes involved in processing interconnected elements (also called artificial neurons, or neurodes) in a network architecture. The artificial neuron receives inputs analogous to the electrochemical impulses that dendrites of biological neurons receive from other neurons. The output of the artificial neuron corresponds to signals sent from a biological neuron over its axon. These artificial signals can be changed by weights in a manner similar to the physical changes that occur in the synapses (see Figure 5.3).

Several ANN paradigms have been proposed for applications in a variety of problem domains. Perhaps the easiest way to differentiate among the various neural models is on the basis of the way they structurally emulate the human brain, process information, and learn to perform their designated tasks.

FIGURE 5.2 Portion of a Biological Neural Network: Two Interconnected Cells/Neurons. (The drawing labels the dendrites, soma, axon, axon terminals, and synapses of two connected neurons, cells 1 and 2.)

FIGURE 5.3 Processing Information in an Artificial Neuron. (The neuron, or processing element (PE), receives inputs X1, X2, ..., Xn through weighted connections W1, W2, ..., Wn, computes the weighted sum S = ΣXiWi, applies a transfer function f(S), and passes the resulting value Y to the outputs.)
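
To make the computation depicted in Figure 5.3 concrete, the short Python sketch below computes the weighted sum S and the transformed output Y for a single artificial neuron. The input values, connection weights, and the choice of a sigmoid transfer function are illustrative assumptions, not values from the text.

```python
import numpy as np

def artificial_neuron(inputs, weights, transfer=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """Compute a single neuron's output: weighted sum S followed by a transfer function f(S)."""
    s = np.dot(inputs, weights)   # summation: S = sum_i X_i * W_i
    return transfer(s)            # transformed output: Y = f(S)

# Illustrative values (not from the chapter): three inputs and their connection weights
x = np.array([0.8, 0.2, 0.5])
w = np.array([0.4, -0.7, 0.9])
print(artificial_neuron(x, w))    # about 0.65 with the sigmoid transfer function
```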


Application Case 5.1 provides an interesting example of the use of neural networks as a prediction tool in the mining industry.

Application Case 5.1 Neural Networks Are Helping to Save Lives in the Mining Industry

In the mining industry, most of the underground injuries and fatalities are due to rock falls (i.e., fall of hanging wall/roof). The method that has been used for many years in the mines when determining the integrity of the hanging wall is to tap the hanging wall with a sounding bar and listen to the sound emitted. An experienced miner can differentiate an intact/solid hanging wall from a detached/loose hanging wall by the sound that is emitted. This method is subjective. The Council for Scientific and Industrial Research (CSIR) in South Africa has developed a device that assists any miner in making an objective decision when determining the integrity of the hanging wall. A trained neural network model is embedded into the device. The device then records the sound emitted when a hanging wall is tapped. The sound is then preprocessed before being input into a trained neural network model, which classifies the hanging wall as either intact or detached.

Teboho Nyareli, who holds a master’s degree in electronic engineering from the University of Cape Town in South Africa and works as a research engineer at CSIR, used NeuroSolutions, a popular artificial neural network modeling software developed by NeuroDimensions, Inc., to develop the classification-type prediction models. The multilayer perceptron-type ANN architecture that he built achieved better than 70 percent prediction accuracy on the hold-out sample. In 2018, the prototype system was undergoing a final set of tests before being deployed as a decision aid, to be followed by the commercialization phase. The following figure shows a snapshot of NeuroSolutions’ model-building workspace, called the breadboard.

Source: Used with permission from NeuroSolutions, customer success story, neurosolutions.com/resources/nyareli.html (accessed May 2018).

Questions for Case 5.1

1. How did neural networks help save lives in the mining industry?

2. What were the challenges, the proposed solution, and the results?

TECHNOLOGY INSIGHTS 5.1 The Relationship between Biological and Artificial Neural Networks

The following list shows some of the relationships between biological and artificial networks.

Biological                        Artificial
Soma                              Node
Dendrites                         Input
Axon                              Output
Synapse                           Weight
Slow                              Fast
Many neurons (10⁹)                Few neurons (a dozen to hundreds of thousands)

Sources: L. Medsker and J. Liebowitz, Design and Development of Expert Systems and Neural Networks, Macmillan, New York, 1994, p. 163; and F. Zahedi, Intelligent Systems for Business: Expert Systems with Neural Networks, Wadsworth, Belmont, CA, 1993.

Because they are biologically inspired, the main processing elements of a neural network are individual neurons, analogous to the brain’s neurons. These artificial neurons receive the information from other neurons or external input stimuli, perform a transformation on the inputs, and then pass on the transformed information to other neurons or external outputs. This is similar to how it is currently thought that the human brain works. Passing information from neuron to neuron can be thought of as a way to activate, or trigger, a response from certain neurons based on the information or stimulus received.

How information is processed by a neural network is inherently a function of its structure. Neural networks can have one or more layers of neurons. These neurons can be highly or fully interconnected, or only certain layers can be connected. Connections between neurons have an associated weight. In essence, the “knowledge” possessed by the network is encapsulated in these interconnection weights. Each neuron calculates a weighted sum of the incoming neuron values, transforms this input, and passes on its neural value as the input to subsequent neurons. Typically, although not always, this input/output transformation process at the individual neuron level is performed in a nonlinear fashion.
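
As a sketch of how such a layered, weighted structure processes information, the following Python example propagates an input vector through a small fully connected network, one layer at a time. The layer sizes, random weights, and sigmoid transfer function are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward_pass(x, layer_weights):
    """Propagate an input vector through successive layers of weighted connections."""
    activation = x
    for w in layer_weights:
        # each neuron in the next layer takes a weighted sum of the previous layer's
        # values (the "knowledge" stored in the weights) and applies a nonlinear transform
        activation = sigmoid(w @ activation)
    return activation

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)),   # hidden layer: 4 neurons, 3 inputs
           rng.normal(size=(2, 4))]   # output layer: 2 neurons
print(forward_pass(np.array([0.2, 0.9, 0.4]), weights))
```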


SECTION 5.2 REVIEW QUESTIONS

1. What is an ANN?

2. What are the commonalities and differences between biological and artificial neural networks?

3. What types of business problems can be solved with ANN?

5.3 NEURAL NETWORK ARCHITECTURES

There are several neural network architectures designed to solve different types of problems (Haykin, 2009). The most common ones include feedforward (multilayer perceptron with backpropagation), associative memory, recurrent networks, Kohonen’s self-organizing feature maps, and Hopfield networks. The feedforward multi-layer perceptron-type network architecture activates the neurons (and learns the relationship between input variables and the output variable) in one direction (from input layer to the output layer, going through one or more middle/hidden layers). This neural network architecture will be covered in detail in Chapter 6; hence, the details will be skipped in this section. In contrast to feedforward neural network architecture, Figure 5.4 shows a pictorial representation of a recurrent neural network architecture, where the connections between the layers are not unidirectional; rather, there are many connections in every direction between the layers and neurons, creating a complex connection structure. Many experts believe that this multidirectional connectedness better mimics the way biological neurons are structured in the human brain.

Kohonen’s Self-Organizing Feature Maps

First introduced by the Finnish professor Teuvo Kohonen, Kohonen’s self-organizing feature map (Kohonen networks, or SOM in short) provides a way to represent multidimensional data in much lower dimensional spaces, usually one or two dimensions.



One of the most interesting aspects of SOM is that they learn to classify data without supervision (i.e., there is no output vector). Remember that in supervised learning techniques, such as backpropagation, the training data consist of vector pairs—an input vector and a target vector. Because of its self-organizing capability, SOM are commonly used for clustering tasks where a group of cases is assigned to an arbitrary number of natural groups. Figure 5.5a illustrates a very small Kohonen network of 4 × 4 nodes connected to the input layer (with three inputs), representing a two-dimensional vector.
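
To make the idea of unsupervised, self-organizing learning more tangible, the following Python sketch implements a tiny from-scratch SOM on a 4 × 4 grid with three inputs, mirroring the small network described for Figure 5.5a. This is an illustrative implementation only (the data, grid size, learning rate, and neighborhood settings are assumptions); practical work would normally rely on a dedicated SOM library.

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=200, lr=0.5, sigma=1.0, seed=0):
    """Train a tiny Kohonen SOM: find the best-matching unit (BMU) for each input
    and pull the BMU and its grid neighbors toward that input (no target outputs)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid[0], grid[1], data.shape[1]))   # one weight vector per map node
    rows, cols = np.indices(grid)
    for t in range(epochs):
        decay = np.exp(-t / epochs)                           # shrink learning rate and neighborhood
        for x in data:
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), grid)    # best-matching unit on the map
            grid_dist = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
            h = np.exp(-grid_dist / (2 * (sigma * decay) ** 2))   # neighborhood function
            weights += (lr * decay) * h[..., None] * (x - weights)
    return weights

# Illustrative three-dimensional inputs (not data from the chapter)
data = np.random.default_rng(1).random((50, 3))
som_weights = train_som(data)
print(som_weights.shape)   # (4, 4, 3): each of the 16 map nodes holds a 3-dimensional prototype
```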

Hopfield Networks

The Hopfield network is another interesting neural network architecture, first introduced by John Hopfield (1982). Hopfield demonstrated in a series of research articles in the early 1980s how highly interconnected networks of nonlinear neurons can be extremely effective in solving complex computational problems. These networks were shown to provide novel and quick solutions to a family of problems stated in terms of a desired objective subject to a number of constraints (i.e., constraint optimization problems). One of the major advantages of Hopfield neural networks is the fact that their structure can be realized on an electronic circuit board, possibly on a very large-scale integration (VLSI) circuit, to be used as an online solver with a parallel-distributed process. Architecturally, a general Hopfield network is represented as a single large layer of neurons with total interconnectivity; that is, each neuron is connected to every other neuron within the network (see Figure 5.5b).

FIGURE 5.4 Recurrent Neural Network Architecture. (Inputs 1 through n feed a densely interconnected network that includes hidden neurons, marked H, which have no target output, and produces Outputs 1 and 2.)

FIGURE 5.5 Graphical Depiction of Kohonen and Hopfield ANN Structures. (Panel (a) shows a Kohonen network (SOM) with three inputs connected to a small grid of output nodes; panel (b) shows a Hopfield network as a single, fully interconnected layer of neurons between the input and the output.)
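
The following Python sketch shows the classical Hopfield behavior in its simplest, associative-memory form: patterns are stored in a fully interconnected, symmetric weight matrix using a Hebbian rule, and a corrupted pattern is updated repeatedly until the network settles into a stable state. This is an illustrative, from-scratch example (the bipolar patterns and update schedule are assumptions), not the constraint-optimization formulation emphasized in the text, but it demonstrates the same convergence-to-a-stable-state idea.

```python
import numpy as np

def store_patterns(patterns):
    """Build the symmetric Hopfield weight matrix with a Hebbian (outer-product) rule."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for p in patterns:
        w += np.outer(p, p)
    np.fill_diagonal(w, 0)              # no self-connections
    return w / patterns.shape[0]

def recall(w, state, steps=10):
    """Asynchronously update neurons until the network converges to a stable state."""
    state = state.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(state)):
            state[i] = 1 if w[i] @ state >= 0 else -1
    return state

# Illustrative bipolar (+1/-1) patterns, not data from the chapter
patterns = np.array([[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]])
w = store_patterns(patterns)
noisy = np.array([1, -1, 1, -1, 1, 1])   # first pattern with one flipped bit
print(recall(w, noisy))                  # expected to settle back to [ 1 -1  1 -1  1 -1]
```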

Ultimately, the architecture of a neural network model is driven by the task it is intended to carry out. For instance, neural network models have been used as classifiers, as forecasting tools, as customer segmentation mechanisms, and as general optimizers. As shown later in this chapter, neural network classifiers are typically multilayer models in which information is passed from one layer to the next, with the ultimate goal of mapping an input to the network to a specific category, as identified by an output of the network. A neural model used as an optimizer, in contrast, can be a single layer of neurons, can be highly interconnected, and can compute neuron values iteratively until the model converges to a stable state. This stable state represents an optimal solution to the problem under analysis.

Application Case 5.2 summarizes the use of predictive modeling (e.g., neural networks) in addressing emerging problems in the electric power industry.

Application Case 5.2 Predictive Modeling Is Powering the Power Generators

The electrical power industry produces and delivers electric energy (electricity or power) to both residential and business customers wherever and whenever they need it. Electricity can be generated from a multitude of sources. Most often, electricity is produced at a power station using electromechanical generators that are driven by heat engines fueled by chemical combustion (by burning coal, petroleum, or natural gas) or nuclear fission (by a nuclear reactor). Generation of electricity can also be accomplished by other means, such as kinetic energy (through falling/flowing water or wind that activates turbines), solar energy (through the energy emitted by the sun, either light or heat), or geothermal energy (through the steam or hot water coming from deep layers of the earth). Once generated, electric energy is distributed through a power grid infrastructure.

Even though some energy-generation methods are favored over others, all forms of electricity generation have positive and negative aspects. Some are environmentally favored but are economically unjustifiable; others are economically superior but environmentally prohibitive. In a market economy, the options with fewer overall costs are generally chosen above all other sources. It is not clear yet which form can best meet the necessary demand for electricity without permanently damaging the environment. Current trends indicate that increasing the shares of renewable energy and distributed generation from mixed sources has the promise of reducing/balancing environmental and economic risks.

The electrical power industry is a highly regulated, complex business endeavor. There are four distinct roles that companies choose to participate in: power producers, transmitters, distributors, and retailers. Connecting all of the producers to all of the customers is accomplished through a complex structure, called the power grid. Although all aspects of the electricity industry are witnessing stiff competition, power generators are perhaps the ones getting the lion’s share of it. To be competitive, producers of power need to maximize the use of their variety of resources by making the right decisions at the right time.

StatSoft, one of the fastest-growing providers of customized analytics solutions, developed integrated decision support tools for power generators. Leveraging the data that come from the production process, these data mining–driven software tools help technicians and managers rapidly optimize the process parameters to maximize the power output while minimizing the risk of adverse effects. Following are a few examples of what these advanced analytics tools, which include ANN and SVM, can accomplish for power generators.

• Optimize Operation Parameters
Problem: A coal-burning 300 MW multicyclone unit required optimization for consistent high flame temperatures to avoid forming slag and burning excess fuel oil.
Solution: Using StatSoft’s predictive modeling tools (along with 12 months of three-minute historical data), optimized control parameter settings for stoichiometric ratios, coal flows, primary air, tertiary air, and split secondary air damper flows were identified and implemented.
Results: After optimizing the control parameters, flame temperatures showed strong responses, resulting in cleaner combustion for higher and more stable flame temperatures.

• Predict Problems Before They Happen
Problem: A 400 MW coal-fired DRB-4Z burner required optimization for consistent and robust low-NOx operations to avoid excursions and expensive downtime. Identify root causes of ammonia slip in a selective non-catalytic reduction process for NOx reduction.
Solution: Apply predictive analytics methodologies (along with historical process data) to predict and control variability; then target processes for better performance, thereby reducing both average NOx and variability.
Results: Optimized settings for combinations of control parameters resulted in consistently lower NOx emissions with less variability (and no excursions) over continued operations at low load, including predicting failures or unexpected maintenance issues.

• Reduce Emission (NOx, CO)
Problem: While NOx emissions for higher loads were within acceptable ranges, a 400 MW coal-fired DRB-4Z burner was not optimized for low-NOx operations under low load (50–175 MW).
Solution: Using data-driven predictive modeling technologies with historical data, optimized parameter settings for changes to airflow were identified, resulting in a set of specific, achievable input parameter ranges that were easily implemented into the existing DCS (digital control system).
Results: After optimization, NOx emissions under low-load operations were comparable to NOx emissions under higher loads.

As these specific examples illustrate, there are numerous opportunities for advanced analytics to make a significant contribution to the power industry. Using data and predictive models could help decision makers get the best efficiency from their production system while minimizing the impact on the environment.

Questions for Case 5.2

1. What are the key environmental concerns in the electric power industry?

2. What are the main application areas for predic- tive modeling in the electric power industry?

3. How was predictive modeling used to address a variety of problems in the electric power industry?

Source: Based on StatSoft success stories, statsoft.com/Portals/0/Downloads/EPRI.pdf (accessed June 2018) and statsoft.fr/pdf/QualityDigest_Dec2008.pdf (accessed February 2018).

SECTION 5.3 REVIEW QUESTIONS

1. What are the most popular neural network architectures?

2. What types of problems are solved with the Kohonen SOM ANN architecture?

3. How does the Hopfield ANN architecture work? To what type of problems can it be applied?



5.4 SUPPORT VECTOR MACHINES

Support vector machines are among the popular machine-learning techniques, mostly because of their superior predictive power and their theoretical foundation. SVM are among the supervised learning techniques that produce input-output functions from a set of labeled training data. The function between the input and output vectors can be either a classification (used to assign cases into predefined classes) or a regression (used to estimate the continuous numerical value of the desired output). For classification, nonlinear kernel functions are often used to transform input data (naturally representing highly complex nonlinear relationships) to a high-dimensional feature space in which the input data become linearly separable. Then, the maximum-margin hyperplanes are constructed to optimally separate the output classes from each other in the training data.

Given a classification-type prediction problem, generally speaking, many linear classifiers (hyperplanes) can separate the data into multiple subsections, each representing one of the classes (see Figure 5.6a, where the two classes are represented with circles and squares). However, only one hyperplane achieves the maximum separation between the classes (see Figure 5.6b, where the hyperplane and the two maximum-margin hyperplanes are separating the two classes).

Data used in SVM can have more than two dimensions. In that case, we are interested in separating the data using an (n − 1)-dimensional hyperplane, where n is the number of input dimensions. This can be seen as a typical form of linear classifier where we are interested in finding the (n − 1)-dimensional hyperplane so that the distance from the hyperplane to the nearest data points is maximized. The assumption is that the larger the margin or distance between these parallel hyperplanes, the better the generalization power of the classifier (i.e., the prediction power of the SVM model). If such hyperplanes exist, they can be mathematically represented using quadratic optimization modeling. These hyperplanes are known as the maximum-margin hyperplane, and such a linear classifier is known as a maximum-margin classifier.
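
The margin geometry described above can be checked numerically. The following sketch, which assumes scikit-learn is available and uses made-up, linearly separable toy data, fits a linear SVM, reads off the hyperplane parameters, and computes the margin width 2/‖w‖.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable two-class data (illustrative values, not from the chapter)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2],    # class 0 ("circles")
              [4.0, 4.2], [4.5, 5.0], [5.0, 4.0]])   # class 1 ("squares")
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # a very large C approximates the hard margin

w = clf.coef_[0]                      # normal vector of the separating hyperplane
b = clf.intercept_[0]                 # hyperplane offset (scikit-learn uses w . x + b = 0)
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
print("predicted class for (3, 3):", clf.predict([[3.0, 3.0]]))
```

Note that scikit-learn parameterizes the hyperplane as w · x + b = 0, so the sign convention for the offset differs slightly from the w · x − b = 0 form used later in this chapter.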

FIGURE 5.6 Separation of the Two Classes Using Hyperplanes. (Panel (a) shows several candidate separating hyperplanes, L1, L2, and L3; panel (b) shows the maximum-margin hyperplane w · x − b = 0 together with the two parallel margin hyperplanes w · x − b = 1 and w · x − b = −1 and a margin width of 2/‖w‖.)

In addition to their solid mathematical foundation in statistical learning theory, SVM have also demonstrated highly competitive performance in numerous real-world prediction problems, such as medical diagnosis, bioinformatics, face/voice recognition, demand forecasting, image processing, and text mining, which has established SVM as among the most popular analytics tools for knowledge discovery and data mining. Similar to artificial neural networks, SVM possess the well-known ability of being universal approximators of any multivariate function to any desired degree of accuracy. Therefore, SVM are of particular interest for modeling highly nonlinear, complex problems, systems, and processes. In the research study summarized in Application Case 5.3, SVM were better than other machine-learning methods in predicting and characterizing injury severity risk factors in automobile crashes.

Application Case 5.3 Identifying Injury Severity Risk Factors in Vehicle Crashes with Predictive Analytics

As technology keeps advancing, new and improved safety measures are being developed and incorporated into vehicles and roads to prevent crashes from happening and/or reduce the impact of the injury sustained by passengers caused by such incidents. Despite the extent of these efforts, the number of vehicle crashes and the resulting injuries are increasing worldwide. For instance, according to the National Highway Traffic Safety Administration (NHTSA), in the United States more than 6 million traffic accidents claim over 30,000 lives and injure more than 2 million people each year (NHTSA, 2014). The latest NHTSA report presented to the U.S. Congress in April 2014 stated that in 2012, highway fatalities in the United States reached 33,561, which is an increase of 1,082 over the previous year (Friedman, 2014). In the same year, an estimated 2.36 million people were injured in motor vehicle traffic crashes compared to 2.22 million in 2011. As a result, an average of nearly four lives were lost and nearly 270 people were injured on America’s roadways every hour in 2012. In addition to the staggering number of fatalities and injuries, these traffic accidents also cost the taxpayers more than $230 billion. Hence, addressing road safety is a major problem in the United States.

Root causes of traffic accidents and crash-related injury severity are of special concern to the general public and to researchers (in academia, government, and industry) because such investigation is aimed not only at prevention of crashes but also at reduction of their severe outcomes, potentially saving many lives and much money. In addition to laboratory- and experimentation-based engineering research methods, another way to address the issue is to identify the most probable factors that affect injury severity by mining the historical data on vehicle crashes. Thorough understanding of the complex circumstances in which drivers and/or passengers are more likely to sustain severe injuries or even be killed in a vehicle crash can mitigate the risks involved to a great extent, thereby saving lives due to crashes. Many factors were found to have an impact on the severity of injury sustained by occupants in the event of a vehicle accident. These factors include behavioral or demographic features of the occupants (e.g., drug and/or alcohol levels, seatbelt or other restraining system usage, gender and age of the driver), crash-related situational characteristics (e.g., road surface/type/situation, direction of impact, strike versus struck, number of cars and/or other objects involved), environmental factors at the time of the accident (weather conditions, visibility and/or light conditions, time of the day, etc.), and the technical characteristics of the vehicle itself (age, weight, body type, etc.).

The main goal of this analytic study was to determine the most prevailing risk factors and their relative importance/significance in influencing the likelihood of increasing severity of injury caused by vehicle crashes. The crashes examined in this study included a collection of geographically well-represented samples. To have a consistent sample, the data set comprised only collisions of specific types: single or multi-vehicle head-on collisions, single or multi-vehicle angled collisions, and single-vehicle fixed-object collisions. To obtain reliable and accurate results, this investigative study employed the most prevalent machine-learning techniques to identify the significance of crash-related factors as they relate to the changing levels of injury severity in vehicle crashes and compared the different machine-learning techniques.


The Research Method

The methodology employed in this study follows a very well-known standardized analytics process, namely cross-industry standard process for data mining (CRISP-DM). As is the case in any analytics project, a significant portion of the project time was devoted to the acquisition, integration, and preprocessing of data. Then, the preprocessed, analytics-ready data were used to build several different prediction models. Using a set of standard metrics, researchers assessed the outcomes of these models and compared them. In the final stage, sensitivity analyses were used to identify the most prevailing injury-severity-related risk factors.

To effectively and efficiently perform the individual tasks in the proposed methodology, several statistical and data mining software tools were used. Specifically, JMP (a statistical and data mining software tool developed by SAS Institute), Microsoft Excel, and Tableau were used for inspecting, understanding, and preprocessing the data; IBM SPSS Modeler and KNIME were used for data merging, predictive model building, and sensitivity analysis.

The National Automotive Sampling System General Estimates System (NASS GES) data set, covering accidents that occurred in 2011 and 2012, was used. The complete data set was obtained in the form of three separate flat/text files—accident, vehicle, and person. The accident files contained specific characteristics about road conditions, environmental conditions, and crash-related settings. The vehicle files included a large number of variables about the specific features of the vehicle involved in the crash. The person files provided detailed demographics, injury, and situational information about the occupants (i.e., driver and the passengers) impacted in the crash. To consolidate the data into a single database, the two years of data were merged within each file type (i.e., accident, person, vehicle), and the resulting files were combined using unique accident, vehicle, and person identifiers to create a single data set. After the data consolidation/aggregation, the resulting data set included person-level records—one record per person involved in a reported vehicle crash. At this point in the process (before the data cleaning, preprocessing, and slicing/dicing), the complete data set included 279,470 unique records (i.e., persons/occupants involved in crashes) and more than 150 variables (a combination of accident, person, and vehicle related characteristics). Figure 5.7 graphically illustrates the individual steps involved in the processing of data.

Of all the variables—directly obtained from the GES databases and the ones that were derived/recalculated using the existing GES variables—29 were selected as relevant and potentially influential in determining the varying levels of injury severity involved in vehicle crashes. This set of variables was expected to provide a rich description of the people and the vehicle involved in the accident: the specifics of the environmental conditions at the time of the crash, the settings surrounding the crash itself, and the time and place of the crash. Table 5.2 lists and briefly describes the variables created and used for this study.

FIGURE 5.7 Data Acquisition/Merging/Preparation Process. (The diagram shows the 2011 and 2012 accident, vehicle, and person databases being selected, characterized, and aggregated into a combined database of 279,470 rows and 152 columns, which is then preprocessed into the analytics-ready data set.)

Source: Microsoft Excel 2010, Microsoft Corporation.


TABLE 5.2 List of Variables Included in the Study

Variable        Description                             Data Type   Descriptive Statistics¹       Missing (%)
AIR_BAG         Airbag deployed                         Binary      Yes: 52, no: 26               5.2
ALC_RES         Alcohol test results                    Numeric     12.68 (15.05)                 0.4
BDYTYP_IMN      Vehicle body type                       Nominal     Sedan: 34, Sm-SUV: 13         3.2
DEFORMED        Extent of damage                        Nominal     Major: 43, minor: 22          3.7
DRINKING        Alcohol involvement                     Binary      Yes: 4, no: 67                28.8
AGE             Age of person                           Numeric     36.45 (18.49)                 6.9
DRUGRES1        Drug test results                       Binary      Yes: 2, no: 72                25.5
EJECT_IM        Ejection                                Binary      Yes: 2, no: 93                4.9
FIRE_EXP        Fire occurred                           Binary      Yes: 3, no: 97                0.0
GVWR            Vehicle weight category                 Nominal     Small: 92, large: 5           2.9
HAZ_INV         Hazmat involved                         Binary      Yes: 1, no: 99                0.0
HOUR_IMN        Hour of day                             Nominal     Evening: 39, noon: 32         1.2
INT_HWY         Interstate highway                      Binary      Yes: 13, no: 86               0.7
J_KNIFE         Jackknife                               Binary      Yes: 4, no: 95                0.2
LGTCON_IM       Light conditions                        Nominal     Daylight: 70, dark: 25        0.3
MANCOL_IM       Manner of collision                     Nominal     Front: 34, angle: 28          0.0
MONTH           Month of year                           Nominal     Oct: 10, Dec: 9               0.0
NUMINJ_IM       Number of injured                       Numeric     1.23 (4.13)                   0.0
PCRASH1_IMN     Precrash movement                       Nominal     Going str.: 52, stopped: 14   1.3
REGION          Geographic region                       Nominal     South: 42, Midwest: 24        0.0
REL_ROAD        Relation to traffic way                 Nominal     Roadway: 85, median: 9        0.1
RELJCT1_IM      At a junction                           Binary      Yes: 4, no: 96                0.0
REST_USE_N      Restraint system used                   Nominal     Yes: 76, no: 4                7.4
SEX_IMN         Gender of driver                        Binary      Male: 54, female: 43          3.1
TOWED_N         Car towed                               Binary      Yes: 49, no: 51               0.0
VEH_AGE         Age of vehicle                          Numeric     8.96 (4.18)                   0.0
WEATHR_IM       Weather condition                       Nominal     Clear: 73, cloudy: 14         0.0
WKDY_IM         Weekday                                 Nominal     Friday: 17, Thursday: 15      0.0
WRK_ZONE        Work zone                               Binary      Yes: 2, no: 98                0.0
INJ_SEV         Injury severity (dependent variable)    Binary      Low: 79, high: 21             0.0

¹ For numeric variables: mean (st. dev.); for binary or nominal variables: % frequency of the top two classes.




Table 5.3 shows the predictive accuracies of all four model types. It shows the confusion matrices, overall accuracy, sensitivity, specificity, and area under the receiver operating characteristics (ROC) curve measures obtained using 10-fold cross-validation for all four model types. As the results indicate, SVM was the most accurate classification technique, with better than 90 percent overall accuracy, comparably high sensitivity and specificity, and an area under the curve (AUC) value of 0.928 (of a maximum of 1.000). The next best model type was the C5 decision tree algorithm, with slightly better accuracy than ANN. The last in the accuracy ranking was LR, also with fairly good accuracy measures but not as good as the machine-learning methods.

Even though the accuracy measures obtained from all four model types were high enough to validate the proposed methodology, the main goal of this study was to identify and prioritize the significant risk factors influencing the level of injury severity sustained by drivers during a vehicle crash. To achieve this goal, a sensitivity analysis on all of the developed prediction models was conducted. Focusing on each model type individually, the variable importance measures for each fold were calculated using the leave-one-out method, and then the results obtained were summed for each model type. To properly fuse (i.e., ensemble) the sensitivity analysis results for all four model types, the models’ contributions to the fused/combined variable importance values were determined based on their cross-validation accuracy. That is, the best-performing model type had the largest weight/contribution while the worst-performing model type had the smallest weight/contribution. The fused variable importance values were tabulated, normalized, and then graphically presented in Figure 5.8.

Examination of the sensitivity analysis results revealed four somewhat distinct risk groups, each comprising four to eight variables. The top group, in order from most to least important, included REST_USE_N (whether the seat belt or any other restraining system was used), MANCOL_IM (manner of collision), EJECT_IM (whether the driver was ejected from the car), and DRUGRES1 (results of the drug test). According to the combined sensitivity analysis results of all prediction models, these four risk factors seemed to be significantly more important than the rest.
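
A minimal sketch of the accuracy-weighted fusion step described above is shown below in Python. The variable-importance numbers are hypothetical placeholders (the study’s actual values are not reproduced here); only the cross-validation accuracies from Table 5.3 are used as the weighting inputs.

```python
import numpy as np

def fuse_importance(importances, accuracies):
    """Combine per-model variable-importance vectors, weighting each model by its
    cross-validation accuracy, then normalize so the top variable scores 100."""
    weights = np.array(list(accuracies.values()))
    weights = weights / weights.sum()
    fused = sum(w * np.array(importances[m]) for m, w in zip(accuracies, weights))
    return 100 * fused / fused.max()

# Hypothetical sensitivity-analysis outputs for three variables (illustrative values only)
importances = {"ANN": [0.20, 0.50, 0.30], "SVM": [0.15, 0.60, 0.25], "C5": [0.25, 0.45, 0.30]}
accuracies  = {"ANN": 85.77, "SVM": 90.41, "C5": 86.61}    # cross-validation accuracies (weights)
print(fuse_importance(importances, accuracies))            # fused, normalized importance per variable
```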

Questions for Case 5.3

1. What are the most important motivations behind analytically investigating car crashes?

2. How were the data in the Application Case acquired, merged, and preprocessed?

TABLE 5.3 Tabulation of All Prediction Results Based on 10-fold Cross-Validation

Model Type                                Confusion Matrices        Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
                                          Low         High
Artificial neural networks (ANN)   Low    12,864      1,464         85.77          81.31             89.78             0.865
                                   High   2,409       10,477
Support vector machines (SVM)      Low    13,192      1,136         90.41          88.55             92.07             0.928
                                   High   1,475       11,411
Decision trees (DT/C5)             Low    12,675      1,653         86.61          84.55             88.46             0.879
                                   High   1,991       10,895
Logistic regression (LR)           Low    8,961       2,742         76.97          77.27             76.57             0.827
                                   High   3,525       11,986



3. What were the results of this study? How can these findings be used for practical purposes?

Sources: D. Delen, L. Tomak, K. Topuz, & E. Eryarsoy, “Investigating Injury Severity Risk Factors in Automobile Crashes with Predictive Analytics and Sensitivity Analysis Methods,” Journal of Transport & Health, 4, 2017, pp. 118–131; D. Friedman, “Oral Testimony Before the House Committee on Energy and Commerce, by the Subcommittee on Oversight and Investigations,” April 1, 2014, www.nhtsa.gov/Testimony (accessed October 2017); National Highway Traffic Safety Administration (NHTSA), General Estimates System (GES), 2018, www.nhtsa.gov (accessed January 20, 2018).

FIGURE 5.8 Variable Importance Values. (The bar chart ranks the 29 independent variables by their normalized, fused importance measure and marks the four risk groups discussed in the text; REST_USE_N, MANCOL_IM, EJECT_IM, and DRUGRES1 form the top group, with REST_USE_N scaled to 100, while variables such as GVWR, WRK_ZONE, and TOWED_N fall near zero.)


Mathematical Formulation of SVM

Consider data points in the training data set of the form:

$$\{(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)\}$$

where $c_i$ is the class label, taking a value of either 1 (i.e., “yes”) or −1 (i.e., “no”), while x is the input variable vector. That is, each data point is an m-dimensional real vector, usually of scaled [0, 1] or [−1, 1] values. The normalization and/or scaling are important steps to guard against variables/attributes with larger variance that might otherwise dominate the classification formulae. We can view this as training data, which denote the correct classification (something that we would like the SVM to eventually achieve) by means of a dividing hyperplane, which takes the mathematical form

$$w \cdot x - b = 0$$

The vector w points perpendicularly to the separating hyperplane. Adding the offset parameter b allows us to increase the margin. In its absence, the hyperplane is forced to pass through the origin, restricting the solution. Because we are interested in the maximum margin, we also are interested in the support vectors and the parallel hyperplanes (to the optimal hyperplane) closest to these support vectors in either class. It can be shown that these parallel hyperplanes can be described by the equations

$$w \cdot x - b = 1, \qquad w \cdot x - b = -1.$$

If the training data are linearly separable, we can select these hyperplanes so that there are no points between them and then try to maximize their distance (see Figure 5.6b). By using geometry, we find the distance between the hyperplanes is 2/‖w‖, so we want to minimize ‖w‖. To exclude data points, we need to ensure that for all i either

$$w \cdot x_i - b \geq 1 \quad \text{or} \quad w \cdot x_i - b \leq -1.$$

This can be rewritten as:

$$c_i\,(w \cdot x_i - b) \geq 1, \quad 1 \leq i \leq n$$

Primal Form

The problem now is to minimize ‖w‖ subject to the constraint $c_i(w \cdot x_i - b) \geq 1$, $1 \leq i \leq n$. This is a quadratic programming (QP) optimization problem. More clearly:

$$\text{Minimize } \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to } c_i\,(w \cdot x_i - b) \geq 1, \; 1 \leq i \leq n$$

The factor of 1/2 is used for mathematical convenience.

Dual Form

Writing the classification rule in its dual form reveals that classification is only a function of the support vectors, that is, the training data that lie on the margin. The dual of the SVM can be shown to be:

$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i,j} \alpha_i\,\alpha_j\,c_i\,c_j\,x_i^{T} x_j$$

270 Part II • Predictive Analytics/Machine Learning

where the a terms constitute a dual representation for the weight vector in terms of the training set:

w = a i ai ci xi

Soft Margin

In 1995, Cortes and Vapnik suggested a modified maximum margin idea that allows for mislabeled examples. If there exists no hyperplane that can split the “yes” and “no” examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible while still maximizing the distance to the nearest cleanly split examples. This work popularized the expression support vector machine or SVM. The method introduces slack variables, $\xi_i$, which measure the degree of misclassification of the datum:

$$c_i\,(w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n$$

The objective function is then increased by a function that penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem then transforms to

$$\min\; \lVert w \rVert^{2} + C \sum_i \xi_i \quad \text{such that} \quad c_i\,(w \cdot x_i - b) \geq 1 - \xi_i, \; 1 \leq i \leq n$$

This constraint along with the objective of minimizing ‖w‖ can be solved using Lagrange multipliers. The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant C appearing only as an additional constraint on the Lagrange multipliers. Nonlinear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken, the problem becomes nonconvex, and thus it is considerably more difficult to find a global solution.

Nonlinear Classification

The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 while he was a doctoral student at the Institute of Control Science in Moscow was a linear classifier. However, in 1992, Boser, Guyon, and Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al., 1964) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation can be nonlinear and the transformed space high dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it can be nonlinear in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum margin classifiers are well regular- ized, so the infinite dimension does not spoil the results. Some common kernels include:

Polynomial (homogeneous): k(x, x′) = (x · x′)^d
Polynomial (inhomogeneous): k(x, x′) = (x · x′ + 1)^d
Radial basis function: k(x, x′) = exp(−γ‖x − x′‖²), for γ > 0
Gaussian radial basis function: k(x, x′) = exp(−‖x − x′‖² / (2σ²))
Sigmoid: k(x, x′) = tanh(κ x · x′ + c), for some κ > 0 and c < 0
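The kernel functions listed above translate directly into code. The following NumPy sketch implements each of them; the polynomial degree d and the parameter values (γ, σ, κ, c) are illustrative assumptions, not values prescribed by the text.

```python
# Direct NumPy translations of the kernels listed above (illustrative sketch).
import numpy as np

def poly_homogeneous(x, xp, d=2):           # k(x, x') = (x . x')^d
    return np.dot(x, xp) ** d

def poly_inhomogeneous(x, xp, d=2):         # k(x, x') = (x . x' + 1)^d
    return (np.dot(x, xp) + 1) ** d

def rbf(x, xp, gamma=0.5):                  # k(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def gaussian_rbf(x, xp, sigma=1.0):         # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def sigmoid(x, xp, kappa=0.01, c=-1.0):     # k(x, x') = tanh(kappa * x . x' + c)
    return np.tanh(kappa * np.dot(x, xp) + c)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_homogeneous(x, xp), rbf(x, xp), gaussian_rbf(x, xp))
```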


Kernel Trick

In machine learning, the kernel trick is a method for converting a linear classifier algorithm into a nonlinear one by using a nonlinear function to map the original observations into a higher-dimensional space; this makes a linear classification in the new space equivalent to nonlinear classification in the original space.

This is done using Mercer's theorem, which states that any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space. More specifically, if the arguments to the kernel are in a measurable space X and if the kernel is positive semi-definite—that is,

Σi,j K(xi, xj) ci cj ≥ 0

for any finite subset {x1, . . . , xn} of X and subset {c1, . . . , cn} of objects (typically real numbers or even molecules)—then there exists a function φ(x) whose range is in an inner product space of possibly high dimension, such that

K(x, y) = φ(x) · φ(y)

The kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a nonlinear algorithm. This nonlinear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable because the high-dimensional space could be infinite-dimensional (as is the case when the kernel is a Gaussian).
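A short numerical check can make the equivalence tangible. The sketch below, a worked example under the assumption of a degree-2 homogeneous polynomial kernel in two dimensions, shows that computing the dot product after an explicit mapping φ gives the same value as evaluating the kernel in the original input space.

```python
# Sketch: kernel trick for a degree-2 homogeneous polynomial kernel in 2-D.
# For phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), the dot product phi(x).phi(x')
# equals k(x, x') = (x . x')^2 computed directly in input space.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 3.0]), np.array([2.0, -1.0])
print(np.dot(phi(x), phi(xp)))   # 1.0  (explicit feature space)
print(k(x, xp))                  # 1.0  (kernel in input space, no mapping needed)
```

The two printed values agree, which is exactly the property the kernel trick exploits: the algorithm never needs to construct φ explicitly.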

Although the origin of the term kernel trick is not known, it was first published by Aizerman et al. (1964). It has been applied to several kinds of algorithms in machine learning and statistics, including:

• Perceptrons • Support vector machines • Principal components analysis • Fisher’s linear discriminant analysis • Clustering

u SECTION 5.4 REVIEW QUESTIONS

1. How do SVM work?
2. What are the advantages and disadvantages of SVM?
3. What is the meaning of "maximum-margin hyperplanes"? Why are they important in SVM?

4. What is the “kernel trick”? How is it used in SVM?

5.5 PROCESS-BASED APPROACH TO THE USE OF SVM

Due largely to their better classification results, SVM have recently become a popular technique for classification-type problems. Even though people consider them easier to use than artificial neural networks, users who are not familiar with the intricacies of SVM often obtain unsatisfactory results. In this section, we provide a process-based approach to the use of SVM, which is more likely to produce better results. A pictorial representation of the three-step process is given in Figure 5.9.


NUMERICIZING THE DATA SVM require that each data instance be represented as a vector of real numbers. Hence, if there are categorical attributes, we first have to convert them into numeric data. A common recommendation is to use m pseudo-binary variables to represent an m-class attribute (where m ≥ 3). In practice, only one of the m variables assumes the value of 1 and the others assume the value of 0, based on the actual class of the case (this is also called 1-of-m representation). For example, a three-category attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).

NORMALIZING THE DATA As was the case for artificial neural networks, SVM also require normalization and/or scaling of numerical values. The main advantage of normalization is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is that it helps with the numerical calculations during the iterative process of model building. Because kernel values usually depend on the inner products of feature vectors (e.g., the linear kernel and the polynomial kernel), large attribute values might slow the training process. The usual recommendation is to normalize each attribute to the range [-1, +1] or [0, 1]. Of course, we have to use the same normalization method to scale the testing data before testing.
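The two preprocessing steps just described (1-of-m encoding of categorical attributes and rescaling numeric attributes) can be sketched as follows with scikit-learn; the attribute names and values here are illustrative assumptions only.

```python
# Sketch of the preprocessing steps described above: 1-of-m (one-hot) encoding
# of a categorical attribute and scaling a numeric attribute to [-1, +1].
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()
print(onehot)            # each row has a single 1 marking its category

ages = np.array([[18.0], [35.0], [52.0], [70.0]])
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)
print(scaled.ravel())    # ages rescaled into the range [-1, +1]
```

Whatever scaler is fit on the training data should also be applied, unchanged, to the testing data, as noted above.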

SELECT THE KERNEL TYPE AND KERNEL PARAMETERS Even though there are only four common kernels mentioned in the previous section, one must decide which one to use

[Figure 5.9 depicts a three-step flow starting from a training database: (1) Preprocess the data—scrub the data (identify and handle missing, incorrect, and noisy values) and transform the data (numericize, normalize, and standardize); (2) Develop the model—select the kernel type (RBF, sigmoid, or polynomial) and determine the kernel values (using v-fold cross-validation or grid search), experimenting with training/testing until a validated SVM model is obtained; (3) Deploy the model—extract the model coefficients, code the trained model into the decision support system, and monitor and maintain the resulting prediction model.]

FIGURE 5.9 Simple Process Description for Developing SVM Models.


(or whether to try them all, one at a time, using a simple experimental design approach). Once the kernel type is selected, one needs to select the value of the penalty parameter C and the kernel parameters. Generally speaking, the radial basis function (RBF) is a reasonable first choice for the kernel type. The RBF kernel nonlinearly maps the data into a higher-dimensional space; by doing so (unlike a linear kernel), it handles cases in which the relation between the input and output vectors is highly nonlinear. Besides, one should note that the linear kernel is just a special case of the RBF kernel. There are two parameters to choose for RBF kernels: C and γ. It is not known beforehand which C and γ are best for a given prediction problem; therefore, some kind of parameter search method needs to be used. The goal of the search is to identify optimal values for C and γ so that the classifier can accurately predict unknown data (i.e., testing data). The two most commonly used search methods are cross-validation and grid search.
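The following minimal sketch combines the two search ideas mentioned above, running a grid search over candidate C and γ values with cross-validation on an RBF-kernel SVM; the data set and the parameter grid are illustrative assumptions.

```python
# Sketch of the C / gamma search described above: grid search with 5-fold
# cross-validation over an RBF-kernel SVM (illustrative data and grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```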

DEPLOY THE MODEL Once an "optimal" SVM prediction model has been developed, the next step is to integrate it into the decision support system. For that, there are two options: (1) converting the model into a computational object (e.g., a Web service, Java Bean, or COM object) that takes the input parameter values and provides output prediction and (2) extracting the model coefficients and integrating them directly into the decision support system. The SVM models are useful (i.e., accurate, actionable) only if the behavior of the underlying domain stays the same. If it changes for some reason, so does the accuracy of the model. Therefore, one should continuously assess the performance of the models, decide when they are no longer accurate, and, hence, when they need to be retrained.

Support Vector Machines versus Artificial Neural Networks

Even though some people characterize SVM as a special case of ANN, most recognize them as two competing machine-learning techniques with different qualities. Here are a few points that help SVM stand out against ANN. Historically, the development of ANN followed a heuristic path, with applications and extensive experimentation preceding theory. In contrast, the development of SVM involved sound statistical learning theory first, followed by implementation and experiments. A significant advantage of SVM is that while ANN can suffer from multiple local minima, the solutions to SVM are global and unique. Two more advantages of SVM are that they have a simple geometric interpretation and give a sparse solution. The reason that SVM often outperform ANN in practice is that they successfully deal with the "overfitting" problem, which is a big issue with ANN.

Although SVM have these advantages (from a practical point of view), they have some limitations. An important issue that is not entirely solved is the selection of the kernel type and kernel function parameters. A second and perhaps more important limitation of SVM involves its speed and size, both in the training and testing cycles. Model building in SVM involves complex and time-demanding calculations. From the practical point of view, perhaps the most serious problem with SVM is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks. Despite these limitations, because SVM are based on a sound theoretical foundation and the solutions they produce are global and unique in nature (as opposed to getting stuck in a suboptimal alternative such as a local minimum), today they are arguably among the most popular prediction modeling techniques in the data mining arena. Their use and popularity will only increase as the popular commercial data mining tools start to incorporate them into their modeling arsenal.

u SECTION 5.5 REVIEW QUESTIONS

1. What are the main steps and decision points in developing an SVM model?
2. How do you determine the optimal kernel type and kernel parameters?


3. Compared to ANN, what are the advantages of SVM?
4. What are the common application areas for SVM? Search the Internet to identify popular application areas and specific SVM software tools used in those applications.

5.6 NEAREST NEIGHBOR METHOD FOR PREDICTION

Data mining algorithms tend to be highly mathematical and computationally intensive. The two popular ones covered in the previous sections (i.e., ANN and SVM) involve time-demanding, computationally intensive iterative mathematical derivations. In contrast, the k-nearest neighbor algorithm (kNN for short) seems overly simplistic for a competitive prediction method. What it does and how it does it are easy to understand (and to explain to others). kNN is a prediction method for classification- as well as regression-type prediction problems. kNN is a type of instance-based learning (or lazy learning) because the function is approximated only locally and all computations are deferred until the actual prediction.

The k-nearest neighbor algorithm is among the simplest of all machine-learning algorithms: for instance, in classification-type prediction, a case is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (where k is a positive integer). If k = 1, the case is simply assigned to the class of its nearest neighbor. To illustrate the concept with an example, let us look at Figure 5.10, where a simple two-dimensional space represents the values of the two variables (x, y); the star represents a new case (or object); and circles and squares represent known cases (or examples). The task is to assign the new case to either circles or squares based on its closeness (similarity) to one or the other. If you set the value of k to 1 (k = 1), the assignment should be made to square because the closest example to the star is a square. If you set the value of k to 3 (k = 3), the assignment should be made to circle because there are two circles and one square; hence, by the simple majority vote rule, the circle class gets the assignment of the new case. Similarly, if you set the value of k to 5 (k = 5), the assignment should be made to the square class. This overly simplified example is meant to illustrate the importance of the value that one assigns to k.
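The same k-sensitivity can be reproduced in a few lines of code. In the sketch below, the toy data points and the query location are constructed (as an assumption, not taken from Figure 5.10) so that the predicted class flips as k changes, mirroring the behavior described above.

```python
# Sketch of the k-sensitivity illustrated above: the same query point can be
# assigned to different classes as k changes (toy data chosen for that effect).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.1, 0.0], [0.0, 0.2], [0.25, 0.0], [0.0, -0.3],
              [0.35, 0.0], [2.0, 2.0], [-2.0, 2.0], [2.0, -2.0]])
y = np.array([1, 0, 0, 1, 1, 0, 0, 0])      # two classes, e.g., square=1, circle=0
query = np.array([[0.0, 0.0]])              # the new case ("star")

for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: predicted class {knn.predict(query)[0]}")
# prints class 1 for k=1, class 0 for k=3, and class 1 again for k=5
```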

The same method can be used for regression-type prediction tasks by simply averaging the values of its k nearest neighbors and assigning this result to the case being predicted. It can be useful to weight the contributions of the neighbors so that the nearer

[Figure 5.10 plots the known cases (circles and squares) and the new case (star) at point (xi, yi) on the x–y plane, with neighborhoods drawn for k = 3 and k = 5.]

FIGURE 5.10 The Importance of the Value of k in kNN Algorithm.


neighbors contribute more to the average than the more distant ones. A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. This scheme is essentially a generalization of linear interpolation.
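The 1/d weighting scheme is available directly in scikit-learn; the sketch below contrasts a plain average with a distance-weighted average on a few made-up one-dimensional points (the data values are illustrative assumptions).

```python
# Sketch of distance-weighted kNN regression: each neighbor contributes with
# weight 1/d, so nearer examples pull the prediction toward their values.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)

print(uniform.predict([[2.2]]))    # plain average of the 3 nearest targets
print(weighted.predict([[2.2]]))   # 1/d-weighted average, pulled toward x = 2
```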

The neighbors are taken from a set of cases for which the correct classification (or, in the case of regression, the numerical value of the output value) is known. This can be thought of as the training set for the algorithm even though no explicit training step is required. The k-nearest neighbor algorithm is sensitive to the local structure of the data.

Similarity Measure: The Distance Metric

One of the two critical decisions that an analyst has to make when using kNN is the similarity measure (the other is the value of k, which is explained next). In the kNN algorithm, the similarity measure is a mathematically calculable distance metric. Given a new case, kNN makes predictions based on the outcome of the k neighbors closest in distance to that point. Therefore, to make predictions with kNN, we need to define a metric for measuring the distance between the new case and the cases from the examples. One of the most popular choices for measuring this distance is the Euclidean distance (Eq. 3), which is simply the linear distance between two points in a multidimensional space; the other popular one is the rectilinear (a.k.a. city-block or Manhattan) distance (Eq. 2). Both of these distance measures are special cases of the Minkowski distance (Eq. 1).

Minkowski distance:

d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + . . . + |xip − xjp|^q)^(1/q)   (Eq. 1)

where i = (xi1, xi2, . . . , xip) and j = (xj1, xj2, . . . , xjp) are two p-dimensional data objects (e.g., a new case and an example in the data set), and q is a positive integer.

If q = 1, then d is called the Manhattan distance:

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + . . . + |xip − xjp|   (Eq. 2)

If q = 2, then d is called the Euclidean distance:

d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + . . . + |xip − xjp|²)   (Eq. 3)

Obviously, these measures apply only to numerically represented data. What about nominal data? There are ways to measure distance for non-numerical data as well. In the simplest case, for a multi-value nominal variable, if the value of that variable for the new case and that for the example case are the same, the distance is 0; otherwise, it is 1. In cases such as text classification, more sophisticated metrics exist, such as the overlap metric (or Hamming distance). Often, the classification accuracy of kNN can be improved significantly if the distance metric is determined through an experimental design in which different metrics are tried and tested to identify the best one for the given problem.
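The distance measures in Eqs. 1–3, along with the simple 0/1 rule for a nominal attribute, translate directly into a few NumPy functions; the example vectors below are illustrative only.

```python
# Direct translations of Eqs. 1-3 (sketch). Minkowski with q=1 gives Manhattan,
# q=2 gives Euclidean; the simple 0/1 rule is shown for a nominal attribute.
import numpy as np

def minkowski(i, j, q):
    return np.sum(np.abs(np.asarray(i) - np.asarray(j)) ** q) ** (1.0 / q)

def manhattan(i, j):
    return minkowski(i, j, q=1)

def euclidean(i, j):
    return minkowski(i, j, q=2)

def nominal_distance(a, b):
    return 0 if a == b else 1

new_case, example = [3.0, 5.0, 1.0], [1.0, 2.0, 2.0]
print(manhattan(new_case, example))      # 2 + 3 + 1 = 6.0
print(euclidean(new_case, example))      # sqrt(4 + 9 + 1) ≈ 3.74
print(nominal_distance("red", "blue"))   # 1
```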

Parameter Selection

The best choice of k depends upon the data. Generally, larger values of k reduce the effect of noise on the classification (or regression) but also make boundaries between classes less distinct. An "optimal" value of k can be found by some heuristic techniques, for instance, cross-validation. The special case in which the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbor algorithm.

CROSS-VALIDATION Cross-validation is a well-established experimentation technique that can be used to determine optimal values for a set of unknown model parameters.


It applies to most, if not all, of the machine-learning techniques that have a number of model parameters to be determined. The general idea of this experimentation method is to divide the data sample into a number of randomly drawn, disjoint subsamples (i.e., v folds). For each potential value of k, the kNN model is used to make predictions on the vth fold while using the remaining v − 1 folds as the examples, and the error is evaluated. The common choice for this error is the root mean squared error (RMSE) for regression-type predictions and the percentage of correctly classified instances (i.e., hit rate) for classification-type predictions. This process of testing each fold against the remaining examples is repeated v times. At the end of the v cycles, the computed errors are accumulated to yield a goodness measure of the model (i.e., how well the model predicts with the current value of k). Finally, the k value that produces the smallest overall error is chosen as the optimal value for that problem. Figure 5.11 shows a simple process that uses the training data to determine optimal values for k and the distance metric, which are then used to predict new incoming cases.
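A minimal sketch of this search, assuming scikit-learn and its built-in iris data set purely for illustration, evaluates several candidate k values with 10-fold cross-validation and keeps the one with the best average hit rate.

```python
# Sketch of choosing k by v-fold cross-validation: evaluate each candidate k on
# held-out folds and keep the value with the best average classification hit rate.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores_by_k = {}
for k in range(1, 16, 2):                       # candidate (odd) values of k
    knn = KNeighborsClassifier(n_neighbors=k)
    scores_by_k[k] = cross_val_score(knn, X, y, cv=10).mean()   # 10-fold hit rate

best_k = max(scores_by_k, key=scores_by_k.get)
print("mean accuracy by k:", {k: round(s, 3) for k, s in scores_by_k.items()})
print("best k:", best_k)
```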

As we observed in the simple example given earlier, the accuracy of the kNN algorithm can be significantly different with different values of k. Furthermore, the predictive power of the kNN algorithm degrades in the presence of noisy, inaccurate, or irrelevant features. Much research effort has been put into feature selection and normalization/scaling to ensure reliable prediction results. A particularly popular approach is the use of evolutionary algorithms (e.g., genetic algorithms) to optimize the set of features included in the kNN prediction system. In binary (two-class) classification problems, it is helpful to choose k to be an odd number because this avoids tied votes.

A drawback of the basic majority voting classification in kNN is that the classes with the more frequent examples tend to dominate the prediction of the new vector because, due to their large number, they tend to come up among the k nearest neighbors. One way to overcome this problem is to weight the classification by taking into account the distance from the test point to each of its k nearest neighbors. Another way to overcome this drawback is to use one level of abstraction in data representation.

The naïve version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows. Many nearest neighbor search algorithms have

[Figure 5.11 shows historical data split into training and validation sets that are used in the parameter-setting step (choosing the distance metric and the value of k); the resulting parameters are then used in the predicting step to classify (or forecast) new cases using the k most similar cases.]

FIGURE 5.11 Process of Determining the Optimal Values for Distance Metric and k.


been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes kNN computationally tractable even for large data sets. Refer to Application Case 5.4 about the superior capabilities of kNN in image recognition and categorization.

Image recognition is an emerging data mining application field involved in processing, analyzing, and categorizing visual objects such as pictures. In the process of recognition (or categorization), images are first transformed into a multidimensional feature space and then, using machine-learning techniques, are categorized into a finite number of classes. Application areas of image recognition and categorization range from agriculture to homeland security, personalized marketing to environmental protection. Image recognition is an integral part of an artificial intelligence field called computer vision. As a technological discipline, computer vision seeks to develop computer systems that are capable of "seeing" and reacting to their environment. Examples of applications of computer vision include systems for process automation (industrial robots), navigation (autonomous vehicles), monitoring/detecting (visual surveillance), searching and sorting visuals (indexing databases of images and image sequences), engaging (computer–human interaction), and inspection (manufacturing processes).

While the field of visual recognition and category recognition has been progressing rapidly, much remains to be done to reach human-level performance. Current approaches are capable of dealing with only a limited number of categories (100 or so) and are computationally expensive. Many machine-learning techniques (including ANN, SVM, and kNN) are used to develop computer systems for visual recognition and categorization. Although commendable results have been obtained, generally speaking, none of these tools in their current form is capable of developing systems that can compete with humans.

Several researchers from the Computer Science Division of the Electrical Engineering and Computer Science Department at the University of California–Berkeley used an innovative ensemble approach to image categorization (Zhang et al., 2006). They considered visual category recognition in the framework of measuring similarities, or perceptual distances, to develop examples of categories. The recognition and categorization approach the researchers used was quite flexible, permitting recognition based on color, texture, and particularly shape. While nearest neighbor classifiers (i.e., kNN) are natural in this setting, they suffered from the problem of high variance (in the bias–variance decomposition) in the case of limited sampling. Alternatively, one could choose to use SVM, but they also involve time-consuming optimization and computations. The researchers proposed a hybrid of these two methods, which deals naturally with the multiclass setting, has reasonable computational complexity both in training and at run time, and yields excellent results in practice. The basic idea was to find close neighbors to a query sample and train a local support vector machine that preserves the distance function on the collection of neighbors.

The researchers' method can be applied to large, multiclass data sets, where it outperforms nearest neighbor and SVM and remains efficient when the problem becomes intractable. A wide variety of distance functions were used, and their experiments showed state-of-the-art performance on a number of benchmark data sets for shape and texture classification (MNIST, USPS, CUReT) and object recognition (Caltech-101).

Another group of researchers (Boiman and Irani, 2008) argued that two practices commonly used in image classification methods (namely, SVM- and ANN-type model-driven approaches and kNN-type nonparametric approaches) have led to less-than-desired performance outcomes. These researchers also claimed that a hybrid method can improve the performance of image recognition and categorization. They proposed a trivial Naïve Bayes

Application Case 5.4 Efficient Image Recognition and Categorization with kNN



u SECTION 5.6 REVIEW QUESTIONS

1. What is special about the kNN algorithm?
2. What are the advantages and disadvantages of kNN as compared to ANN and SVM?
3. What are the critical success factors for a kNN implementation?
4. What is a similarity (or distance) measure? How can it be applied to both numerical and nominal valued variables?

5. What are the common applications of kNN?

5.7 NAÏVE BAYES METHOD FOR CLASSIFICATION

Naïve Bayes is a simple probability-based classification method (a machine-learning technique that is applied to classification-type prediction problems) derived from the well-known Bayes theorem. The method requires the output variable to have nominal values. Although the input variables can be a mix of numeric and nominal types, a numeric output variable needs to be discretized via some type of binning method before it can be used in a Bayes classifier. The word "Naïve" comes from its strong, somewhat unrealistic, assumption of independence among the input variables. Simply put, a Naïve Bayes classifier assumes that the input variables do not depend on each other, and the presence (or absence) of a particular variable in the mix of predictors does not have anything to do with the presence or absence of any other variables.

Naïve Bayes classification models can be developed very efficiently (rather rapidly, with very little computational effort) and effectively (quite accurately) in a supervised machine-learning environment. That is, by using a set of training data (not necessarily very large), the parameters for Naïve Bayes classification models can be obtained using the maximum likelihood method. In other words, because of the independence assumption, we can develop Naïve Bayes models without strictly complying with all of the rules and requirements of Bayes theorem. First, let us review the Bayes theorem.

kNN-based classifier, which employs kNN distances in the space of the local image descriptors (not in the space of images). The researchers claimed that, although the modified kNN method is extremely simple and efficient and requires no learning/training phase, its performance ranks among the top leading learning-based parametric image classifiers. Empirical comparisons of their method were shown on several challenging image categorization databases (Caltech-101, Caltech-256, and Graz-01).

In addition to image recognition and categorization, kNN has been successfully applied to complex classification problems such as content retrieval (handwriting detection, video content analysis), body and sign language (where communication is done using body or hand gestures), gene expression (another area in which kNN tends to perform better than other state-of-the-art techniques; in fact, a

combination of kNN-SVM is one of the most popular techniques used here), and protein-to-protein interaction and 3D structure prediction (graph-based kNN is often used for interaction structure prediction).

Questions for Case 5.4

1. Why is image recognition/classification a worthy but difficult problem?

2. How can kNN be effectively used for image recognition/classification applications?

Sources: H. Zhang, A. C. Berg, M. Maire, & J. Malik, “SVM- KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition,” Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, 2006, pp. 2126–2136; O. Boiman, E. Shechtman, & M. Irani, “In Defense of Nearest-Neighbor Based Image Classification,” IEEE Conference on Computer Vision and Pattern Recognition, 2008 (CVPR), 2008, pp. 1–8.



Bayes Theorem

To appreciate the Naïve Bayes classification method, one needs to understand the basic definition of the Bayes theorem and the exact Bayes classifier (the one without the strong "Naïve" independence assumption). The Bayes theorem (also called Bayes Rule), named after the British mathematician Thomas Bayes (1701–1761), is a mathematical formula for determining conditional probabilities (the formula follows). In this formula, Y denotes the hypothesis and X denotes the data/evidence. This vastly popular theorem/rule provides a way to revise/improve prediction probabilities by using additional evidence.

The following formula shows the relationship between the probabilities of two events, Y and X. P(Y) is the prior probability of Y. It is "prior" in the sense that it does not take into account any information about X. P(Y | X) is the conditional probability of Y, given X. It is also called the posterior probability because it is derived from (or depends upon) the specified value of X. P(X | Y) is the conditional probability of X given Y. It is also called the likelihood. P(X) is the prior probability of X, which is also called the evidence and acts as the normalizing constant.

P(Y | X) = P(X | Y) P(Y) / P(X)   →   Posterior = (Likelihood × Prior) / Evidence

P(Y | X): Posterior probability of Y given X
P(X | Y): Conditional probability of X given Y (likelihood)
P(Y): Prior probability of Y
P(X): Prior probability of X (evidence, or unconditional probability of X)

To numerically illustrate these formulas, let us look at a simple example. Based on the weather report, we know that there is a 40 percent chance of rain on Saturday. From the historical data, we also know that if it rains on Saturday, there is a 10 percent chance it will rain on Sunday; and if it doesn't rain on Saturday, there is an 80 percent chance it will rain on Sunday. Let us say that "Raining on Saturday" is event Y, and "Raining on Sunday" is event X. Based on the description, we can write the following:

P(Y) = Probability of raining on Saturday = 0.40
P(X | Y) = Probability of raining on Sunday if it rained on Saturday = 0.10
P(X) = Probability of raining on Sunday = sum of the probabilities of "raining on Saturday and raining on Sunday" and "not raining on Saturday and raining on Sunday" = 0.40 × 0.10 + 0.60 × 0.80 = 0.52

Now, if we were to calculate the probability of "it rained on Saturday" given that it "rained on Sunday," we would use the Bayes theorem. It allows us to calculate the probability of an earlier event given the result of a later event.

P(Y | X) = P(X | Y) P(Y) / P(X) = (0.10 × 0.40) / 0.52 = 0.0769

Therefore, in this example, if it rained on Sunday, there’s a 7.69 percent chance that it rained on Saturday.
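The same arithmetic can be checked with a few lines of code; the sketch below simply reproduces the numbers from the rain example above.

```python
# The rain example above, computed directly from Bayes' rule (sketch).
p_sat = 0.40                      # P(Y): rain on Saturday
p_sun_given_sat = 0.10            # P(X|Y): rain on Sunday given rain on Saturday
p_sun_given_not_sat = 0.80        # rain on Sunday given no rain on Saturday

p_sun = p_sat * p_sun_given_sat + (1 - p_sat) * p_sun_given_not_sat   # P(X) = 0.52
p_sat_given_sun = p_sun_given_sat * p_sat / p_sun                     # Bayes' rule

print(round(p_sun, 2), round(p_sat_given_sun, 4))   # 0.52, 0.0769
```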

Naïve Bayes Classifier

The Bayes classifier uses the Bayes theorem without the simplifying strong independence assumption. In a classification-type prediction problem, the Bayes classifier works as follows: given a new sample to classify, it finds all other samples exactly like it (i.e., all predictor variables having the same values as the sample being classified); determines the class labels that they all belong to; and classifies the new sample into the most representative class. If none of the samples has an exact value match with the new sample, then the classifier


will fail to assign the new sample to a class label (because the classifier could not find any strong evidence to do so). Here is a very simple example. Using the Bayes classifier, we are to decide whether to play golf (Yes or No) for the following situation (Outlook is Sunny, Temperature is Hot, Humidity is High, and Windy is No). Table 5.4 presents historical samples that will be used to illustrate the specifics of our classification process.

Based on the historical data, three samples seem to match the situation (sample numbers 1, 6, and 7, as highlighted in Table 5.4). Of the three, two of the samples have the class label "No" and one has the label "Yes." Because the majority of the matched samples indicated "No," the new sample/situation is classified as "No."

Now let us consider a situation in which Outlook is Sunny, Temperature is Hot, Humidity is High, and Windy is Yes. Because there is no sample matching this value set, the Bayes classifier will not return a result. To find exact matches, there needs to be a very big data set. Even for the big data sets, as the number of predictor variables increases, the possibility of not finding an exact match increases significantly. When the data set and the number of predictor variables get larger, so does the time it takes to search for an exact match. All of these are the reasons why the Naïve Bayes classifier, a derivative of the Bayes classifier, is often used in predictive analytics and data mining practices. In the Naïve Bayes classifier, the exact match requirement is no longer needed. The Naïve Bayes classifier treats each predictor variable as an independent contributor to the prediction of the output variable and, hence, significantly increases its practicality and usefulness as a classification-type prediction tool.

Process of Developing a Naïve Bayes Classifier

Similar to other machine-learning methods, Naïve Bayes employs a two-phase model development and scoring/deployment process: (1) training, in which the model parameters are estimated, and (2) testing, in which the classification/prediction is performed on new cases. The process can be described as follows.

Training phase

Step 1. Obtain the data, clean the data, and organize them in a flat file format (i.e., columns as variables and rows as cases).

TABLE 5.4 Sample Data Set for the Classification-Type Prediction Methods

Input Variables (X) Output

Variable (Y)

Sample No. Outlook Temperature Humidity Windy Play Golf

1 Sunny Hot High No No

2 Overcast Hot High No Yes

3 Rainy Cool Normal No Yes

4 Rainy Cool Normal Yes No

5 Overcast Cool Normal Yes No

6 Sunny Hot High No No

7 Sunny Hot High No Yes

8 Rainy Mild Normal No Yes

9 Sunny Mild Normal Yes Yes

Chapter 5 • Machine-Learning Techniques for Predictive Analytics 281

Step 2. Make sure that the variables are all nominal; if not (i.e., if any one of the variables is numeric/continuous), the numeric variables need to go through a data transformation (i.e., converting the numerical variable into a nominal type by using discretization, such as binning).

Step 3. Calculate the prior probability of all class labels for the dependent variable.
Step 4. Calculate the likelihood for all predictor variables and their possible values with respect to the dependent variable. In the case of mixed variable types (categorical and continuous), each variable's likelihood (conditional probability) is estimated with the proper method that applies to the specific variable type. Likelihoods for nominal and numeric predictor variables are calculated as follows:

• For categorical variables, the likelihood (the conditional probability) is estimated as the simple fraction of the training samples for the variable value with respect to the dependent variable.

• For numerical variables, the likelihood is calculated by (1) calculating the mean and variance of each predictor variable for each dependent variable value (i.e., class) and then (2) calculating the likelihood using the following formula (a small computational sketch follows the training-phase steps):

P(x = v | c) = (1 / √(2πσc²)) · exp(−(v − μc)² / (2σc²))

Quite often, the continuous/numerical independent/input variables are discretized (using an appropriate binning method), and then the categorical variable estimation method is used to calculate the conditional probabilities (likelihood parameters). If performed properly, this method tends to produce better predicting Naïve Bayes models.
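As mentioned above, the Step 4 likelihoods are either simple frequency ratios (for categorical values) or Gaussian densities (for numeric values). The sketch below implements both; the counts, mean, and variance used in the calls are purely illustrative placeholders.

```python
# Sketch of the Step 4 likelihood estimates: a simple frequency ratio for a
# categorical value and the Gaussian formula above for a numeric value.
import math

def categorical_likelihood(value_count_in_class, class_count):
    # P(X = v | C = c) estimated as a fraction of the class's training samples
    return value_count_in_class / class_count

def gaussian_likelihood(v, mean_c, var_c):
    # P(x = v | c) from the Gaussian density with class mean and variance
    return (1.0 / math.sqrt(2 * math.pi * var_c)) * math.exp(-(v - mean_c) ** 2 / (2 * var_c))

print(categorical_likelihood(2, 4))           # e.g., 2 of 4 class-"No" cases have the value
print(gaussian_likelihood(72.0, 70.0, 25.0))  # numeric attribute, class mean 70, variance 25
```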

Testing Phase

Using the two sets of parameters produced in steps 3 and 4 in the training phase, any new sample can be put into a class label using the following formula:

Posterior = (Prior × Likelihood) / Evidence

P(C | F1, . . . , Fn) = P(C) P(F1, . . . , Fn | C) / P(F1, . . . , Fn)

Because the denominator is constant (the same for all class labels), we can remove it from the formula, leading to the following simpler formula, which is essentially nothing but the joint probability:

classify(f1, . . . , fn) = argmax over c of  P(C = c) ∏i=1..n P(Fi = fi | C = c)

Here is a simple example to illustrate these calculations. In this example, we use the same data as shown in Table 5.4. The goal is to classify the following case: given that Outlook is Sunny, Temperature is Hot, Humidity is High, and Windy is No, what would be the class for the dependent variable (Play = Yes or No)?

From the data, we can observe that Prior(Yes) = 5/9 and Prior(No) = 4/9. For the Outlook variable, the likelihoods are Likelihood(No/Sunny) = 2/3, Likelihood(No/Overcast) = 1/2, and Likelihood(No/Rainy) = 1/3. The likelihood values of the other variables (Temperature, Humidity, and Windy) can be determined/calculated similarly. Again, the case we are trying to classify is Outlook is Sunny, Temperature is Hot, Humidity is High, and Windy is No. The results are shown in Table 5.5.



Based on the results shown in Table 5.5, the answer would be Play Golf = Yes because it produces a larger value, 0.031 (compared to 0.025 for "No"), as per the joint probabilities (the simplified calculation without the inclusion of the denominator). If we were to use the full posterior formula for the two class labels, which requires the inclusion of the denominator in the calculations, we observe 0.07 for "Yes" and 0.056 for "No." Because the denominator is common to all class labels, it will change the numerical output but not the class assignment.
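The joint-probability calculation behind Table 5.5 can be reproduced with a short sketch. The priors and likelihood fractions below are copied from Table 5.5; everything else (function and variable names) is an illustrative assumption.

```python
# Minimal sketch of the scoring rule: multiply the class prior by the
# per-attribute likelihoods and pick the class with the largest product.
priors = {"Yes": 5 / 9, "No": 4 / 9}
likelihoods = {                       # P(attribute = value | class), from Table 5.5
    "Yes": {"Outlook=Sunny": 1 / 3, "Temperature=Hot": 2 / 4,
            "Humidity=High": 2 / 4, "Windy=No": 4 / 6},
    "No":  {"Outlook=Sunny": 2 / 3, "Temperature=Hot": 2 / 4,
            "Humidity=High": 2 / 4, "Windy=No": 2 / 6},
}

def classify(case):
    scores = {}
    for c in priors:
        score = priors[c]
        for attr_value in case:
            score *= likelihoods[c][attr_value]   # joint probability (no denominator)
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = classify(["Outlook=Sunny", "Temperature=Hot", "Humidity=High", "Windy=No"])
print(label, {c: round(s, 3) for c, s in scores.items()})   # Yes, {Yes: 0.031, No: 0.025}
```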

Although Naïve Bayes is not very commonly used in predictive analytics projects today (because of its relatively poor prediction performance in a wide variety of application domains), one of its extensions, namely the Bayesian network (see the next section), is gaining surprisingly rapid popularity among data scientists in the analytics world.

Application Case 5.5 provides an interesting example in which many predictive analytics techniques are used to determine the changing condition of Crohn's disease patients in order to better manage this debilitating chronic disease. Along with Naïve Bayes, several statistical and machine-learning methods were developed, tested, and compared. The best-performing model was then used to explain the rank-ordered importance (i.e., relative contribution) of all independent variables used in predicting the disease progress.

TABLE 5.5 Naïve Bayes Classification Calculations

                                       Ratio                 Fraction (%/100)
                               Play: Yes   Play: No      Play = Yes   Play = No
Likelihood
  Outlook = Sunny                 1/3         2/3            0.33         0.67
  Temperature = Hot               2/4         2/4            0.50         0.50
  Humidity = High                 2/4         2/4            0.50         0.50
  Wind = No                       4/6         2/6            0.67         0.33
Prior                             5/9         4/9            0.56         0.44
Product (multiply all)¹                                      0.031        0.025
Divide by the evidence²                                      0.070        0.056

¹ Does not include the denominator/evidence in the calculations; hence, it is a partial calculation.
² Includes the denominator/evidence. Because the evidence is the same for all class labels (i.e., Yes and No), it does not make a difference in the classification result; both measures indicate the class label Yes.

Introduction and Motivation

Inflammatory bowel disease (IBD), which includes Crohn's disease and ulcerative colitis (UC), impacts 1.6 million Americans, according to the Crohn's and Colitis Foundation (crohnscolitisfoundation.org). Crohn's disease causes chronic inflammation and damages the gastrointestinal tract. It can impact any part of the gastrointestinal tract. The cause of the disease is not entirely known, but some knowledge from research suggests that it could be caused by a combination of factors that include genetic makeup, the immune system, and environmental settings. Systems that can detect disease progression or early disease onset can help in optimal utilization of healthcare resources and can result in better patient outcomes. The goal of this case study was to use

Application Case 5.5 Predicting Disease Progress in Crohn’s Disease Patients: A Comparison of Analytics Methods

Chapter 5 • Machine-Learning Techniques for Predictive Analytics 283

electronic medical records (EMRs) to predict and explain inflammation in Crohn’s disease patients.

The Methodology

The data used in this study were from one of the nation's largest EMR databases, Cerner Health Facts EMR. It houses rich and varied information related to patients, healthcare settings, costs, reimbursement type, and prescription order data from multiple healthcare providers and hospitals in the United States. Data stored in the EMR database consist of patient-level data that were captured when a patient visited hospitals, urgent care centers, specialty clinics, general clinics, and nursing homes. The Health Facts database contains patient-level, de-identified, longitudinal data that were time stamped. The database was organized in the data tables shown in Table 5.6.

A high-level process flow of the research methodology is shown in Figure 5.12. Although the process flow diagram did not provide the details of each step, it gave a high-level view of the sequence of the steps performed in the predictive modeling study using EMR data. The three model types shown in the diagram were selected based on their comparatively better performance over other machine-learning methods such as Naïve Bayes, nearest neighbor, and neural networks. Detailed steps of data balancing and data standardization are explained in the paper by Reddy, Delen, and Agrawal (2018).

The Results

Prediction results were generated on the test set using 10 repetitions of the 10-fold cross-validation method. The performance of each model was assessed by the AUC metric, which was preferred over prediction accuracy because the ROC curve, from which the AUC is generated, compares classifier performance across the entire range of class distributions and error costs and, hence, is widely accepted as a performance measure for machine-learning applications. The mean AUC from the 10 repetitions of the 10-fold cross-validation was generated (and is shown in Table 5.7) for the three final model types: logistic regression, regularized regression, and gradient boosting machines (GBM).

Upon generation of the AUC for the 100 models, the researchers performed a post hoc analysis of variance (ANOVA) and applied Tukey's Honest Significant Difference (HSD) test for multiple comparisons to determine which classifier method's performance differed from the others based on the AUC. The test results showed that the mean AUC for regularized regression and logistic regression did not differ significantly. However, the AUCs from regularized regression and logistic regression were significantly different from that of the GBM model, as seen in Table 5.8.

The relative importance of the independent variables was computed by adding the total decrease in the Gini index from the splits over a given predictor, averaged across all trees specified in the GBM tuning parameter (1,000 trees in this research). This average decrease in Gini was normalized to a 0–100 scale, on which a higher number indicates a stronger predictor. The variable importance results are shown in Figure 5.13.


TABLE 5.6 Metadata of the Tables Extracted from EMR Database

Data Set (table) Description

Encounter Encounters including demographics, billing, healthcare setting, payer type, etc.

Medication Medication orders sent by the healthcare provider

Laboratory Laboratory data including blood chemistry, hematology, and urinalysis

Clinical Event Clinical events data containing information about various metrics including body mass index, smoking status, pain score, etc.

Procedure Clinical procedures performed on the patient


284 Part II • Predictive Analytics/Machine Learning

The model results in Figure 5.13 showed that there was not one single predictor but a combination of predictors driving the predictions. Crohn's disease location at diagnosis, such as small intestine and large intestine, and lab parameters at baseline, such as white blood cell (WBC) count, mean corpuscular hemoglobin (MCH), mean corpuscular volume, sodium, red blood cell (RBC), distribution of platelet count, creatinine, hematocrit, and hemoglobin, were the strongest predictors. One of

[Figure 5.12 depicts the high-level data mining process: the Encounter, Procedure, Medication, Lab, and Clinical Event tables are selected/filtered, aggregated, and transformed into a combined patient-level data set; after data preprocessing and variable selection, logistic regression, regularized regression, and gradient boosting machine models are trained and tested with 10 replications of 10-fold cross-validation, and their results (mean AUC) are compared in the comparative analyses step.]

FIGURE 5.12 Process Flow Diagram of the High-Level Steps Involved in the Data Mining Research.


Chapter 5 • Machine-Learning Techniques for Predictive Analytics 285

the strongest demographic predictors of the inflammation severity doubling was age. Other healthcare-setting and encounter-related variables, such as hospital bed size, diagnosis priority, and region (whether south or not), also had some ability to predict whether inflammation severity doubled. The majority of Crohn's disease researchers have identified the location of the disease, age at diagnosis, smoking status, biologic markers, and tumor necrosis factor (TNF) levels to predict the response to treatment; these are some of the identifiers that also predicted the inflammation severity.

Logistic regression and regularized regression cannot produce a similar relative variable importance plot. However, the odds ratios and standardized coefficients they generate were used to identify the stronger predictors of inflammation severity.

This study was able to show that the disease can be managed in real time by using decision support tools that rely on advanced analytics to predict the future inflammation state, which would then allow for prospective medical intervention. With this information, healthcare providers can improve patient outcomes by intervening early and making

TABLE 5.7 AUC for Each Repeated Run Across Three Models

Repeated Run   Logistic Regression   Regularized Regression   Gradient Boosting Machines (GBM)

1 0.7929 0.8267 0.9393

2 0.7878 0.8078 0.9262

3 0.8080 0.8145 0.9369

4 0.8461 0.8487 0.9124

5 0.8243 0.8281 0.9414

6 0.7681 0.8543 0.8878

7 0.8167 0.8154 0.9356

8 0.8174 0.8176 0.9330

9 0.8452 0.8281 0.9467

10 0.8050 0.8294 0.9230

Mean AUC 0.8131 0.8271 0.9282

Median AUC 0.8167 0.8274 0.9343

TABLE 5.8 ANOVA with Multiple Comparisons Using Tukey’s Test

Tukey Grouping Mean AUC No. of Observations Model Type

A 0.928 100 GBM

B 0.827 100 Regularized regression

B 0.812 100 Logistic regression

Means with the Same Letter Are Not Significantly Different



necessary therapeutic adjustments that would work for the specific patient.

Questions for Case 5.5

1. What is Crohn’s disease and why is it important?

2. Based on the findings of this Application Case, what can you tell about the use of analytics in chronic disease management?

3. What other methods and data sets might be used to better predict the outcomes of this chronic disease?

Source: B. K. Reddy, D. Delen, & R. K. Agrawal, “Predicting and Explaining Inflammation in Crohn’s Disease Patients Using Predictive Analytics Methods and Electronic Medical Record Data,” Health Informatics Journal, 2018.

[Figure 5.13 is a bar chart of relative variable importance (normalized to a 0–100% scale) for the GBM model; the strongest predictors include WBC, MCH, MCV, age, sodium, RBC, platelets, creatinine, chloride, blood urea nitrogen, hematocrit, hemoglobin, diagnosis priority, hospital bed size ranges, marital status categories, census region, gender, diagnosis site, and race categories.]

FIGURE 5.13 Relative Variable Importance for GBM Model.

u SECTION 5.7 REVIEW QUESTIONS

1. What is special about the Naïve Bayes algorithm? What is the meaning of “Naïve” in this algorithm?

2. What are the advantages and disadvantages of Naïve Bayes compared to other machine-learning methods?

3. What type of data can be used in Naïve Bayes algorithm? What type of predictions can be obtained from it?

4. What is the process of developing and testing a Naïve Bayes classifier?



5.8 BAYESIAN NETWORKS

Bayesian belief networks or Bayesian networks (BN) were first defined in an early paper of Judea Pearl as “supportive of self-activated, multidirectional propagation of evidence that converges rapidly to a globally-consistent equilibrium” (Pearl, 1985). Later on, with his continuing work in this area, Pearl won the prestigious ACM’s A.M. Turing Award for his contributions to the field of artificial intelligence and the development of BN. With this success, BN has received more public recognition than ever before, establishing it as a new paradigm in artificial intelligence, predictive analytics, and data science.

BN is a powerful tool for representing dependency structure in a graphical, explicit, and intuitive way. It reflects the various states of a multivariate model and their probabilistic relationships. Theoretically, any system can be modeled with BN. In a given model, some states will occur more frequently when others are also present; for example, if a freshman student is not registered for next fall (a presumed freshman student dropout case), the chances of the student's having financial aid are lower, indicating a relationship between the two variables. This is where conditional probabilities (the basic theory that underlies BN) come into play to analyze and characterize the situation.

BNs have become popular among probabilistic graphical models because they have been shown to be able to capture and reason with complex, nonlinear, and partially uncertain situations and interactions (Koller and Friedman, 2009). While their solid, probability-based theoretical properties made Bayesian networks immediately attractive for academic research, especially for studying causality, their use in practical data science and business analytics domains is relatively new. For instance, researchers recently have developed data analytics–driven BN models in domains that include predicting and understanding graft survival in kidney transplantations (Topuz et al., 2018), predicting failures in the rail industry caused by weather-related issues (Wang et al., 2017), predicting food fraud type (Bouzembrak et al., 2016), and detecting diseases (Meyfroidt et al., 2009).

Essentially, the BN model is a directed acyclic graph whose nodes correspond to the variables and whose arcs signify conditional dependencies between variables and their possible values (Pearl, 2009). Here is a simple example, which was previously used as Application Case 3.2; for details of the example, please reread that Application Case. Let us say that the goal is to predict whether a freshman student will stay or drop out of college (represented in the graph as SecondFallRegistered) using some data/information about the student, such as (1) the declared college type (a number of states/options exist for potential colleges) and (2) whether the student received financial aid in the first fall semester (two states exist, Yes or No), both of which can be characterized probabilistically using the historical data. One might think that there exist some causal links among the three variables, with both college type and financial aid relating to whether the student comes back for the second fall semester, and it is reasonable to think that some colleges historically have more financial support than others (see Figure 5.14 for the presumed causal relationships).

The direction of the links in BN graphs corresponds to the probabilistic or conditional dependencies between any two variables. Calculating the actual conditional probabilities using historical data would help predict and understand student retention (SecondFallRegistered) using the two variables, financial aid and college type. Such a network can then be used to answer questions such as these:

• Is the college type “engineering”? • What are the chances the student will register next fall? • How will financial aid affect the outcome?

How Does BN Work?

Building probabilistic models such as BN of complex real-world situations/problems using historical data can help in predicting what is likely to happen when something else has happened. Essentially, BN tries to represent the interrelationships


among the variables (both input and output variables) using a probabilistic structure that is often called the joint distribution. Joint distributions can be presented as a table consisting of all possible combinations of states (variable values) in a given model. For complex models, such a table can easily become rather large because it stores one probability value for every combination of states. To mitigate the situation, BN does not connect all of the nodes in the model to each other; rather, it connects only the nodes that are probabilistically related by some sort of conditional and/or logical dependency, resulting in significant savings in computation.

The naturally complex probability distributions can be represented in a relatively compact way using BNs' conditional independence formula. In the following formula, each xi represents a variable and Pa(xi) represents the parents of that variable; using these representations, the BN chain rule can be expressed as follows (Koller and Friedman, 2009):

P(x1, . . . , xn) = ∏i=1..n P(xi | Pa(xi))
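A small sketch can show how this factorization is used for the three-node student retention network: the joint probability of a full configuration is simply the product of each node's conditional probability given its parents. The arc structure follows the description in the text, but all probability values below are hypothetical placeholders, not numbers from the book.

```python
# Sketch of the chain-rule factorization for the student-retention network:
# P(College, Aid, Registered) = P(College) * P(Aid | College) * P(Registered | Aid, College).
# All probability values are hypothetical placeholders for illustration only.
p_college = {"Engineering": 0.3, "Business": 0.7}
p_aid_given_college = {"Engineering": {"Yes": 0.6, "No": 0.4},
                       "Business":    {"Yes": 0.4, "No": 0.6}}
p_reg_given_aid_college = {("Yes", "Engineering"): 0.85, ("No", "Engineering"): 0.60,
                           ("Yes", "Business"):    0.75, ("No", "Business"):    0.50}

def p_joint(college, aid, registered):
    p_reg = p_reg_given_aid_college[(aid, college)]
    if registered == "No":
        p_reg = 1 - p_reg
    return p_college[college] * p_aid_given_college[college][aid] * p_reg

print(p_joint("Engineering", "Yes", "Yes"))   # 0.3 * 0.6 * 0.85 = 0.153
```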

Let's look at an example by building a simple network for the student retention prediction problem. Remember that our problem is to predict whether a freshman student will stay for the second fall semester or drop out of college by using some data/information from student records (i.e., declared college type and whether the student received financial aid for the first semester). The constructed BN graphical model shown in Figure 5.15 exhibits the relationships and conditional probabilities among all three nodes.

How Can BN Be Constructed?

There are two common methods available to construct the network: (1) manually, with the help of a domain expert, and (2) analytically, by learning the structure of the network from historical data using advanced mathematical methods. Building a network manually, even for a modest-size network, requires a skilled knowledge engineer spending several hours with the domain expert. As the size of the network gets larger, the time spent by the engineer and the domain expert increases exponentially. In some cases, it is very difficult to find a knowledgeable expert for a particular domain. Even if such a domain expert exists, he or she might not have the time to devote to the model-building effort and/or might not be explicit and articulate enough (i.e., explaining tacit knowledge is always a difficult task) to be of much use as a source of knowledge. Therefore, most of the previous studies developed and offered various techniques that can be used to learn the structure of the network automatically from the data.

One of the earlier methods used to learn the structure of the network automatically from the data is the Naïve Bayes method. The Naïve Bayes classification method is a simple probabilistic model that assumes conditional independence between all predictor

[Figure 5.14 depicts the three nodes CollegeType (nominal), FinancialAid (binary), and SecondFallRegistered (binary) and the presumed causal links among them.]

FIGURE 5.14 Simple Illustration of the Partial Causality Relationships in Student Retention.

Chapter 5 • Machine-Learning Techniques for Predictive Analytics 289

variables and the given class/target variable to learn the structure. The classification algorithm is based on the Bayes rule: the probability of the class/target value is computed for each given attribute variable, and then the highest prediction is chosen for the structure.

A more recent and popular method for learning the structure of the network is called Tree Augmented Naïve (TAN) Bayes. The TAN method is an updated version of the Naïve Bayes classifier that uses a tree structure to approximate the interactions between the predictor variables and the target variable (Friedman, Geiger, and Goldszmidt, 1997). In the TAN model structure, the class variable has no parent, and each and every predictor variable has the class variable as its parent along with at most one other predictor variable (i.e., attribute), as shown in Figure 5.16. Thus, an arc between two variables indicates a directional and causal relationship between them. Formally, the parents of a variable xi can be represented by:

Pa_{x_i} = \{C, x_{d(i)}\}

where d(\cdot) is a tree-defining function with d(i) > 0, and Pa_{x_i} is the set of parents of each x_i. The class variable C has no parents, namely Pa_C = \emptyset. It has been shown both empirically and theoretically that TAN performs better than Naïve Bayes while maintaining computational simplicity because it does not require a search process (Friedman et al., 1997).

FIGURE 5.15 Conditional Probability Tables for the Two Predictor and One Target Variables.

The procedure for constructing a TAN uses Chow and Liu's tree-Bayesian concept. Finding a maximally weighted spanning tree in a graph is an optimization problem whose objective is to maximize the log-likelihood of d(i) (Chow and Liu, 1968). The TAN construction steps can then be described as follows (Friedman et al., 1997):

Step 1. Compute the conditional mutual information function for each (i, j) pair:

I_P(x_i ; x_j \mid C) = \sum_{x_i, x_j, C} P(x_i, x_j, C) \log \frac{P(x_i, x_j \mid C)}{P(x_i \mid C)\, P(x_j \mid C)}, \quad i \neq j

This function indicates how much information x_j provides about x_i when the value of the class variable C is known.

Step 2. Build a complete undirected graph and use the conditional mutual information function to annotate the weight of the edge connecting x_i to x_j.

Step 3. Build a maximum weighted spanning tree.

Step 4. Convert the undirected graph into a directed one by choosing a root variable and setting the direction of all edges to be outward from it.

Step 5. Construct a TAN model by adding a vertex labeled by C and an arc from C to each x_i.
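The following is a compact sketch of Steps 1 through 3 only (not the full TAN procedure): it estimates the conditional mutual information of predictor pairs from a tiny, hypothetical discrete data set and builds the maximum weighted spanning tree using networkx; the column names and records are assumptions for illustration.

# Sketch of TAN Steps 1-3: conditional mutual information + maximum spanning tree.
import math
from collections import Counter
import networkx as nx

def cond_mutual_info(rows, i, j, c):
    """Empirical I(x_i; x_j | C) from a list of dict-like records."""
    n = len(rows)
    n_xyc = Counter((r[i], r[j], r[c]) for r in rows)
    n_xc = Counter((r[i], r[c]) for r in rows)
    n_yc = Counter((r[j], r[c]) for r in rows)
    n_c = Counter(r[c] for r in rows)
    cmi = 0.0
    for (x, y, cv), cnt in n_xyc.items():
        # P(x,y,c) * log[ P(x,y|c) / (P(x|c) P(y|c)) ] with empirical counts
        cmi += (cnt / n) * math.log((cnt * n_c[cv]) / (n_xc[(x, cv)] * n_yc[(y, cv)]))
    return cmi

# Hypothetical student records: three predictors and the class variable.
data = [
    {"FinancialAid": "Y", "CollegeType": "BUS", "FallGPA": "A", "SecondFallRegistered": "Y"},
    {"FinancialAid": "Y", "CollegeType": "A&S", "FallGPA": "B", "SecondFallRegistered": "Y"},
    {"FinancialAid": "N", "CollegeType": "BUS", "FallGPA": "C", "SecondFallRegistered": "N"},
    {"FinancialAid": "N", "CollegeType": "A&S", "FallGPA": "F", "SecondFallRegistered": "N"},
    {"FinancialAid": "Y", "CollegeType": "BUS", "FallGPA": "B", "SecondFallRegistered": "Y"},
    {"FinancialAid": "N", "CollegeType": "A&S", "FallGPA": "C", "SecondFallRegistered": "Y"},
]

predictors = ["FinancialAid", "CollegeType", "FallGPA"]
G = nx.Graph()
for a in range(len(predictors)):
    for b in range(a + 1, len(predictors)):
        w = cond_mutual_info(data, predictors[a], predictors[b], "SecondFallRegistered")
        G.add_edge(predictors[a], predictors[b], weight=w)   # Step 2: weighted complete graph

tree = nx.maximum_spanning_tree(G)                            # Step 3: maximum weighted spanning tree
print(sorted(tree.edges(data="weight")))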

One of the superior features of BN is its ease of adaptability. While building a BN, one can start with a small network based on limited knowledge and then expand it as new information becomes available. In such situations, having missing values in the data set might not be a major issue because the available portion of the data/values/knowledge can be used to create the probabilities. Figure 5.17 shows a fully developed, data-driven BN example for the student retention project.

From an applicability perspective, such a fully constructed BN can be of great use to practitioners (i.e., administrators and managers in educational institutions) because it offers a holistic view of all relationships and provides the means to explore detailed information using a variety of "what-if" analyses. In fact, with this network model, it is possible to calculate the student-specific risk of attrition, that is, the posterior probability that a given student will drop out, by systematically selecting and changing the value of a predictor variable within its value domain (assessing how much the dropout risk of a student changes as the value of a given predictor variable, such as FallGPA, changes).
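To make this what-if idea concrete, the following is a minimal, self-contained sketch (not from the study): it reuses hypothetical conditional probability tables for the three-node retention network and computes the posterior dropout probability by enumerating the joint distribution, once at baseline and once for each value of a chosen predictor.

# Hypothetical CPTs; the what-if sweep changes one predictor's value at a time.
p_college = {"Business": 0.4, "ArtsSciences": 0.6}
p_aid = {"Yes": 0.7, "No": 0.3}
p_reg_yes = {("Business", "Yes"): 0.85, ("Business", "No"): 0.70,
             ("ArtsSciences", "Yes"): 0.80, ("ArtsSciences", "No"): 0.60}

def joint(college, aid, registered):
    p_yes = p_reg_yes[(college, aid)]
    return p_college[college] * p_aid[aid] * (p_yes if registered == "Yes" else 1.0 - p_yes)

def posterior_dropout(**evidence):
    """P(SecondFallRegistered='No' | evidence) by summing the consistent joint entries."""
    num = den = 0.0
    for c in p_college:
        for a in p_aid:
            for r in ("Yes", "No"):
                state = {"CollegeType": c, "FinancialAid": a, "SecondFallRegistered": r}
                if any(state[k] != v for k, v in evidence.items()):
                    continue
                p = joint(c, a, r)
                den += p
                if r == "No":
                    num += p
    return num / den

print("baseline dropout risk:", round(posterior_dropout(), 3))
for aid in p_aid:                       # what-if: sweep FinancialAid over its domain
    print("FinancialAid =", aid, "->", round(posterior_dropout(FinancialAid=aid), 3))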

FIGURE 5.16 Tree Augmented Naïve Bayes Network Structure.

FIGURE 5.17 Bayesian Belief Network for Predicting Freshmen Student Attrition.

When interpreting the BN model shown in Figure 5.17, one should consider the arcs, the directions of the arrows on those arcs, the direct interactions, and the indirect relationships. For example, the fall grant/tuition waiver/scholarship category (i.e., FallGrantTuitionWaiverScholarship) and all the nodes linked to FallGrantTuitionWaiverScholarship are related to student attrition (i.e., SecondFallRegistered). Moreover, while FallGrantTuitionWaiverScholarship interacts with college (College) and spring grant/tuition waiver/scholarship (i.e., SpringGrantTuitionWaiverScholarship) directly, it also interacts with admission type (AdmissionType) indirectly through College. According to the BN model, one of the most interactive predictors is the student's earned-to-registered credit hours ratio (i.e., PersistanceFall), which contributes to both the student's fall GPA (FallGPA) and student attrition. As such, if the PersistanceFall of the student is less than 0.8, then the College type has an effect on student attrition. However, if the PersistanceFall of the student is 1.0, the College type does not affect student attrition in a noteworthy manner.

As a collective view of the what-if scenarios, Figure 5.18 summarizes the most positive and most negative levels within each predictor along with their posterior probabilities. For instance, getting an A for the fall GPA decreases the posterior probability of student attrition to 7.3 percent, whereas getting an F increases the probability of attrition to 87.8 percent, where the baseline is 21.2 percent.

Some people have had doubts about using BN because they thought that BN does not work well if the probabilities upon which it is constructed are not exact. However, it turns out that in most cases, approximate probabilities derived from data, and even subjective ones guessed by domain experts, provide reasonably good results. BNs have been shown to be quite robust toward imperfect, inaccurate, and incomplete knowledge. Often the combination of several strands of imperfect knowledge allows BN to make surprisingly good predictions and conclusions. Studies have shown that people are better at estimating probabilities "in the forward direction." For example, managers are quite good at providing probability estimates for "If the student has dropped out of college, what are the chances his or her college type is Art & Sciences?" rather than the reverse, "If the

student goes to Art & Sciences college, what are the chances that this student will not register the next fall?"

FIGURE 5.18 Probability of Student Attrition for Risk Factors—What-If Analysis on Individual Factors.

SECTION 5.8 REVIEW QUESTIONS

1. What are Bayesian networks? What is special about them?

2. What is the relationship between Naïve Bayes and Bayesian networks?

3. What is the process of developing a Bayesian network model?

4. What are the advantages and disadvantages of Bayesian networks compared to other machine-learning methods?

5. What is Tree Augmented Naïve (TAN) Bayes and how does it relate to Bayesian networks?

5.9 ENSEMBLE MODELING

Ensembles (or, more appropriately, model ensembles or ensemble modeling) are combinations of the outcomes produced by two or more analytics models into a compound output. Ensembles are primarily used for prediction modeling, when the scores of two or more models are combined to produce a better prediction. The prediction can be either classification or regression/estimation type (i.e., the former predicting a class label and the latter estimating a numerical output variable). Although the use of ensembles has been dominated by prediction-type modeling, they can also be used for other analytics tasks such as clustering and association rule mining. That is, model ensembles can be used for supervised as well as unsupervised machine-learning tasks. Traditionally, these machine-learning procedures focused on identifying and building the best possible model (often the most accurate predictor on the holdout data) from a large number of alternative model types. To do so, analysts and scientists used an elaborate experimental process that mainly relied on trial and error to improve each single model's performance (defined by some predetermined metric, e.g., prediction accuracy) to its best possible level so that the best of the models could be used/deployed for the task at hand. The ensemble approach turns this thinking around. Rather than building models and selecting the single best model to use/deploy, it proposes to build many models and use them all for the task they are intended to perform (e.g., prediction).

Motivation—Why Do We Need to Use Ensembles?

Usually researchers and practitioners build ensembles for two main reasons: better accuracy and more stable/robust/consistent/reliable outcomes. Numerous research studies and publications over the past two decades have shown that ensembles almost always improve predictive accuracy for the given problem and rarely predict worse than the single models (Abbott, 2014). Ensembles began to appear in the data mining/analytics literature in the 1990s, motivated by the limited success of earlier work on combining forecasts that dates back two or more decades. By the early to mid-2000s, ensembles had become popular and almost essential to winning data mining and predictive modeling competitions. Perhaps the most famous ensemble-driven competition is the Netflix Prize, an open competition that solicited researchers and practitioners to predict user ratings of films based on historical ratings. The prize was US$1 million for a team that could reduce the RMSE of the then-existing Netflix internal prediction algorithm by the largest margin, but by no less than 10 percent. The winner, the runner-up, and nearly all the teams at the top of the leaderboard used model ensembles in their submissions. As a result, the winning submission was the result of an ensemble containing hundreds of predictive models.


When it comes to justifying the use of ensembles, Vorhies (2016) put it best: if you want to win a predictive analytics competition (at Kaggle or anywhere else), or at least get a respectable place on the leaderboard, you need to embrace and intelligently use model ensembles. Kaggle has become the premier platform for data scientists to showcase their talents. According to Vorhies, the Kaggle competitions are like Formula One racing for data science. Winners edge out competitors at the fourth decimal place and, like Formula One race cars, not many of us would mistake them for daily drivers. The amount of time devoted and the extreme techniques used would not always be appropriate for an ordinary data science production project, but like paddle shifters and exotic suspensions, some of those improvements and advanced features find their way into the day-to-day life and practice of analytics professionals. In addition to Kaggle competitions, reputable organizations such as the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) regularly organize competitions (often called "cups") for the community of data scientists to demonstrate their competence, sometimes for monetary rewards but most often for simple bragging rights. Some of the popular analytics companies, like the SAS Institute and Teradata Corporation, organize similar competitions for (and extend a variety of relatively modest awards to) both graduate and undergraduate students in universities all over the world, usually in concert with their regular analytics conferences.

It is not just accuracy that makes model ensembles popular and, in many cases, necessary. It has been shown time and time again that ensembles can improve model accuracy, but they can also improve model robustness, stability, and, hence, reliability. This advantage of model ensembles is as important as (and in some situations more important than) accuracy when reliable prediction is of the essence. In ensemble models, by combining (some form of averaging) multiple models into a single prediction outcome, no single model dominates the final predicted value, which, in turn, reduces the likelihood of making a way-off-target "wacky" prediction. Figure 5.19 shows a graphical illustration of model ensembles for classification-type prediction problems. Although some varieties exist, most ensemble modeling methods follow this generalized process. From left to right, Figure 5.19 illustrates the general tasks of data acquisition and data preparation, followed by cross-validation and model building and testing, and finally assembling/combining the individual model outcomes and assessing the resultant predictions.

FIGURE 5.19 Graphical Depiction of Model Ensembles for Prediction Modeling.

Another way to look at ensembles is from the perspective of "collective wisdom" or "crowdsourcing." In the popular book The Wisdom of Crowds (Surowiecki, 2005), the author proposes that better decisions can be made if, rather than relying on a single expert, many (even uninformed) opinions (obtained by a process called crowdsourcing) are aggregated into a decision that is superior to the best expert's opinion. In his book, Surowiecki describes four characteristics necessary for the group opinion to work well and not degenerate into the opposite effect of poor decisions as evidenced by the "madness of crowds": diversity of opinion, independence, decentralization, and aggregation. The first three characteristics relate to how individual decisions are made—they must have information that differs from that of others in the group and is not affected by the others in the group. The last characteristic merely states that the decisions must be combined. These four principles/characteristics seem to lay the foundation for building better model ensembles as well. Each predictive model has a voice in the final decision. The diversity of opinion can be measured by the correlation of the predicted values themselves—if all of the predictions are highly correlated, or, in other words, if the models nearly all agree, there is no foreseeable advantage in combining them. The decentralization characteristic can be achieved by resampling data or case weights; each model uses either different records from a common data set or at least uses the records with weights that differ from those of the other models (Abbott, 2014).

One of the prevalent concepts in statistics and predictive modeling that is highly relevant to model ensembles is the bias-variance trade-off. Therefore, before delving into the different types of model ensembles, it is necessary to review and understand the bias-variance trade-off principle (as it applies to the field of statistics or machine learning). In predictive analytics, bias refers to the error and variance refers to the consistency (or lack thereof) in predictive accuracy of models applied to other data sets. The best models are expected to have low bias (low error, high accuracy) and low variance (consistency of accuracy from data set to data set). Unfortunately, there is always a trade-off between these two metrics in building predictive models—improving one results in worsening the other. You can achieve low bias on training data, but the model could suffer from high variance on hold-out/validation data because the models could have been overtrained/overfit. For instance, the kNN algorithm with k = 1 is an example of a low-bias model (perfect on the training data set) but is susceptible to high variance on a test/validation data set. Use of cross-validation along with proper model ensembles seems to be the current best practice in handling such trade-offs between bias and variance in predictive modeling.
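A small sketch (assuming scikit-learn is available and using synthetic data) of the behavior described above: a 1-nearest-neighbor model is essentially perfect on its own training data (low bias) but its holdout accuracy is lower and less stable than that of a larger-k model.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(f"k={k:2d}  train acc={knn.score(X_train, y_train):.3f}  "
          f"test acc={knn.score(X_test, y_test):.3f}  "
          f"10-fold acc std={scores.std():.3f}")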

Different Types of Ensembles

Ensembles, or teams of predictive models working together, have been a fundamental strategy for developing accurate and robust analytics models. Although ensembles have been around for quite a while, their popularity and effectiveness have surfaced in a significant way only within the last decade as they continually improved in parallel with rapidly improving software and hardware capabilities. When we refer to model ensembles, many of us immediately think of decision tree ensembles like random forest and boosted trees; however, generally speaking, model ensembles can be classified into four groups along two dimensions, as shown in Figure 5.20. The first dimension is the method type (the x-axis in Figure 5.20), in which ensembles can be grouped into bagging or boosting types. The second dimension is the model type (the y-axis in Figure 5.20), in which ensembles can be grouped into homogeneous or heterogeneous types (Abbott, 2014).


As the name implies, homogeneous-type ensembles combine the outcomes of two or more of the same type of model, such as decision trees. In fact, a vast majority of homogeneous model ensembles are developed using a combination of decision tree structures. The two most common categories of homogeneous-type ensembles that use decision trees are bagging and boosting (more information on these is given in subsequent sections). Heterogeneous model ensembles combine the outcomes of two or more different types of models, such as decision trees, artificial neural networks, logistic regression, SVM, and others. As mentioned in the context of "the wisdom of crowds," one of the key success factors in ensemble modeling is to use models that are fundamentally different from one another, ones that look at the data from a different perspective. Because of the way they combine the outcomes of different model types, heterogeneous model ensembles are also called information fusion models (Delen and Sharda, 2010) or stacking (more information on these is given later in this chapter).

Bagging

Bagging is the simplest and most common ensemble method. Leo Breiman, a very well-respected scholar in the world of statistics and analytics, is known to have first published a description of the bagging (i.e., bootstrap aggregating) algorithm at the University of California–Berkeley in 1996 (Breiman, 1996). The idea behind bagging is quite simple yet powerful: build multiple decision trees from resampled data and combine the predicted values through averaging or voting. The resampling method Breiman used was bootstrap sampling (sampling with replacement), which creates replicates of some records in the training data. With this selection method, on average, about 37 percent of the records will not be included at all in a given bootstrap training sample (Abbott, 2014).

Although bagging was first developed for decision trees, the idea can be applied to any predictive modeling algorithm that produces outcomes with sufficient variation in the predicted values. Although rare in practice, other predictive modeling algorithms that are potential candidates for bagging-type model ensembles include neural networks, Naïve Bayes, k-nearest neighbor (for low values of k), and, to a lesser degree, even logistic regression. k-nearest neighbor is not a good candidate for bagging if the value of k is already large; the algorithm already votes or averages predictions, and with larger values of k the predictions are already very stable, with low variance.

FIGURE 5.20 Simple Taxonomy for Model Ensembles (method type: bagging versus boosting; model type: homogeneous versus heterogeneous).

Bagging can be used for both classification- and regression/estimation–type prediction problems. In classification-type prediction problems, all of the participant models' outcomes (class assignments) are combined using either a simple or a complex/weighted majority voting mechanism. The class label that gets the most/highest votes becomes the aggregated/ensemble prediction for that sample/record. In regression/estimation–type prediction problems, when the output/target variable is a number, all of the participant models' outcomes (numerical estimations) are combined using either a simple or a complex/weighted averaging mechanism. Figure 5.21 illustrates the graphical depiction of a decision tree–type bagging algorithm.

One of the key questions in bagging is, "How many bootstrap samples, also called replicates, should be created?" Breiman stated, "My sense of it is that fewer are required when y [the dependent variable] is numerical and more are required with an increasing number of classes [for classification-type prediction problems]." He typically used 10–25 bootstrap replicates, with significant improvements occurring with as few as 10 replicates. Overfitting the individual models is an important requirement for building good bagged ensembles. By overfitting each model, the bias is low, but each decision tree generally has worse accuracy on held-out data. Bagging, however, is a variance reduction technique; the averaging of predictions smooths the predictions so that they behave in a more stable way on new data.
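A hedged sketch of the bagging idea described above, using scikit-learn's BaggingClassifier on synthetic data: bootstrap samples of the training records, one (deliberately unpruned) decision tree per sample, and majority voting at prediction time. The data set and parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

single_tree = DecisionTreeClassifier(random_state=1)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=1),  # base learner: a fully grown (overfit) tree
    n_estimators=25,                         # Breiman often used 10-25 bootstrap replicates
    bootstrap=True,                          # sampling with replacement
    random_state=1)

print("single tree :", cross_val_score(single_tree, X, y, cv=10).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=10).mean())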

FIGURE 5.21 Bagging-Type Decision Tree Ensembles.

As mentioned before, the diversity of model predictions is a key factor in creating effective ensembles. One way to measure the diversity of predictions is to examine the correlation of the predicted values. If the correlations between model predictions are always very high, say more than 0.95, each model brings little additional predictive information to the ensemble and therefore little improvement in accuracy is achievable. Generally, it is best to have correlations of less than 0.9. The correlations should be computed from the model propensities or predicted probabilities rather than from the {0, 1} classification values themselves. Bootstrap sampling in bagging is the key to introducing diversity in the models. One can think of the bootstrap sampling methodology as creating case weights for each record—some records are included multiple times in the training data (their weights are 1, 2, 3, or more), and other records are not included at all (their weights are equal to 0) (Abbott, 2014).
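A quick sketch of the diversity check just described, under the assumption that scikit-learn and NumPy are available: correlate the out-of-fold predicted probabilities of two models; correlations above roughly 0.9 suggest the second model would add little to an ensemble.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Out-of-fold predicted probabilities (propensities) for the positive class.
p_tree = cross_val_predict(DecisionTreeClassifier(random_state=1), X, y,
                           cv=10, method="predict_proba")[:, 1]
p_lr = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                         cv=10, method="predict_proba")[:, 1]

print("correlation of predicted probabilities:", np.corrcoef(p_tree, p_lr)[0, 1])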

Boosting

Boosting is perhaps the second most common ensemble method after bagging. Yoav Freund and Robert E. Schapire are known to have first introduced the boosting algorithm in separate publications in the early 1990s and then in a 1996 joint publication (Freund and Schapire, 1996), which introduced the well-known boosting algorithm called AdaBoost. As with bagging, the idea behind boosting is quite straightforward. First, build a rather simple classification model; it needs to be only slightly better than random chance, so for a binary classification problem, it needs to be only slightly better than a 50 percent correct classification. In this first step, each record is used in the algorithm with equal case weights, as one would normally do in building a predictive model. The errors in the predicted values for each case are noted. The case weights of correctly classified records/cases/samples stay the same or are perhaps reduced, the case weights of the records that are incorrectly classified are increased, and then a second simple model is built on these weighted cases (i.e., the transformed/weighted training data set). In other words, for the second model, records that were incorrectly classified are "boosted" through case weights to be considered more strongly or seriously in the construction of the new prediction model. In each iteration, the records that are incorrectly predicted (the ones that are difficult to classify) keep having their case weights increased, communicating to the algorithm to pay more attention to these records until, hopefully, they are finally classified correctly.

This process of boosting is often repeated tens or even hundreds of times. After the tens or hundreds of iterations, the final predictions are made based on a weighted average of the predictions from all the models. Figure 5.22 illustrates the simple process of boosting in building decision tree–type ensemble models. As shown, each tree takes the most current data set (one of equal size, but with the most recently boosted case weights) to build another tree. The feedback of incorrectly predicted cases is used as an indicator to determine which cases, and to what extent (direction and magnitude), to boost (update the weights of) the training samples/cases.

FIGURE 5.22 Boosting-Type Ensembles for Decision Trees.

Although they look quite similar in structure and purpose, bagging and boosting employ slightly different strategies to utilize the training data set and to achieve the goal of building the best possible prediction model ensemble. The two key differences between bagging and boosting are as follows. Bagging uses a bootstrap sample of cases to build decision trees, whereas boosting uses the complete training data set. Whereas bagging creates independent, simple trees to ensemble, boosting creates dependent trees (each tree "learning" from the previous one to pay more attention to the incorrectly predicted cases) that collectively contribute to the final ensemble.

Boosting methods are designed to work with weak learners, that is, simple models; the component models in a boosted ensemble are simple models with high bias but low variance. The improvement with boosting is greater, as with bagging, when algorithms that are unstable predictors are used. Decision trees are most often used in boosted models. Naïve Bayes is also used, but with smaller improvements over a single model. Empirically speaking, boosting typically produces better model accuracy than single decision trees or even bagging-type ensembles.
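A brief sketch of the AdaBoost idea described above, via scikit-learn's AdaBoostClassifier on synthetic data: many weak learners (single-split decision stumps) trained sequentially on re-weighted cases and combined by a weighted vote. The parameter choices are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

boosted_stumps = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a weak learner: a single-split stump
    n_estimators=200,                     # boosting is often repeated hundreds of times
    random_state=1)

print("boosted stumps:", cross_val_score(boosted_stumps, X, y, cv=10).mean())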

Variants of Bagging and Boosting

Bagging and boosting were the first ensemble methods to appear in predictive analytics software, primarily with decision tree algorithms. Since their introduction, many other approaches to building ensembles have been developed and made available, particularly in open source software (both as part of open analytics platforms like KNIME and Orange and as class libraries in R and Python). The most popular and successful [advanced] variants of bagging and boosting are random forest and stochastic gradient boosting, respectively.

RANDOM FOREST The random forest (RF) model was first introduced by Breiman (2001) as a modification to the simple bagging algorithm. As with bagging, the RF algorithm begins with a bootstrap-sampled data set and builds one decision tree from each bootstrap sample. Compared to simple bagging, there is, however, an important twist to the RF algorithm: at each split in the tree, starting from the very first split, rather than considering all input variables as candidates, only a random subset of variables is considered. Hence, in RF, the bootstrap sampling technique applies to both a random selection of cases and a random selection of features (i.e., input variables).

The number of cases and the number of variables to consider, along with how many trees to construct, are all parameters to decide on in building RF models. Common practice suggests that the default number of variables to consider as candidates at each split point should be the square root of the total number of candidate inputs. For example, if there were 100 candidate inputs for the model, a random 10 inputs are candidates for each split. This also means that it is unlikely that the same inputs will be available for splits at parent and children nodes in a given tree, forcing the tree to find alternate ways to maximize the accuracy of subsequent splits. Therefore, there is an intentionally created twofold diversity mechanism built into the tree construction process—random selection of cases and random selection of variables. RF models produce prediction outcomes that are usually more accurate than simple bagging and are often more accurate than simple boosting (i.e., AdaBoost).
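A short sketch of the random forest recipe just described, using scikit-learn's RandomForestClassifier on synthetic data: bootstrap sampling of cases plus a random subset of candidate variables (here the square root of the number of inputs) at every split. All parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=100, n_informative=15,
                           random_state=1)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_features="sqrt",   # ~10 random candidate inputs per split when there are 100
    bootstrap=True,        # each tree sees a bootstrap sample of the cases
    random_state=1)

print("random forest:", cross_val_score(rf, X, y, cv=10).mean())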

STOCHASTIC GRADIENT BOOSTING The simple boosting algorithm AdaBoost is only one of many boosting algorithms currently documented in the literature. In commercial software, AdaBoost is still the most commonly used boosting technique; however, dozens of boosting variants can be found in open source software packages. One interesting boosting algorithm that has recently gained popularity due to its superior performance is the stochastic gradient boosting (SGB) algorithm created by Jerry Friedman at Stanford University. Friedman later developed an advanced version of this algorithm (Friedman, 2001) called multiple additive regression trees (MART), which was subsequently branded as TreeNet by Salford Systems in its software tool. Like other boosting algorithms, the MART algorithm builds successive, simple trees and combines them additively. Typically, the simple trees are more than stumps and contain up to six terminal nodes. Procedurally, after building the first tree, the errors (also called residuals) are computed. The second tree and all subsequent trees then use the residuals as the target variable. Subsequent trees identify patterns that relate the inputs to small and large errors. Poor prediction of the errors results in large errors to predict in the next tree, and good prediction of the errors results in small errors to predict in the next tree. Typically, hundreds of trees are built, and the final predictions are additive combinations of the individual predictions, which are, interestingly, piecewise constant models because each tree is itself a piecewise constant model. However, one rarely notices these intricacies about the individual trees because typically hundreds of trees are included in the ensemble (Abbott, 2014). The TreeNet algorithm, an example of stochastic gradient boosting, has won multiple data mining modeling competitions since its introduction and has proven to be an accurate predictor with the benefit that very little data cleanup is needed for the trees before modeling.
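A sketch of stochastic gradient boosting using scikit-learn's GradientBoostingClassifier (an open source implementation of Friedman's algorithm, not the commercial TreeNet tool): shallow trees are fit additively to the residual errors, and a subsample fraction below 1.0 provides the "stochastic" part. Parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

sgb = GradientBoostingClassifier(
    n_estimators=300,     # hundreds of small trees are typical
    max_depth=3,          # simple trees, a bit larger than stumps
    learning_rate=0.05,   # shrinks each tree's additive contribution
    subsample=0.8,        # random subsampling of cases per tree (the stochastic part)
    random_state=1)

print("stochastic gradient boosting:", cross_val_score(sgb, X, y, cv=10).mean())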

Stacking

Stacking (a.k.a. stacked generalization or super learner) is part of the heterogeneous ensemble methods. To some analytics professionals, it could be the optimum ensemble technique, but it is also the least understood (and the most difficult to explain). Due to its two-step model training procedure, some think of it as an overly complicated form of ensemble modeling. Simply put, stacking creates an ensemble from a diverse group of strong learners. In the process, it interjects a metadata step involving what is called a super learner or meta learner. These intermediate meta classifiers forecast how accurate the primary classifiers have become and are used as the basis for adjustments and corrections (Vorhies, 2016). The process of stacking is figuratively illustrated in Figure 5.23.

As shown in Figure 5.23, in constructing a stacking-type model ensemble, a number of diverse strong classifiers are first trained using bootstrapped samples of the training data, creating tier 1 classifiers (each optimized to its full potential for the best possible prediction outcomes). The outputs of the tier 1 classifiers are then used to train a tier 2 classifier (i.e., a metaclassifier) (Wolpert, 1992). The underlying idea is to learn whether the training data have been properly learned. For example, if a particular classifier incorrectly learned a certain region of the feature space and hence consistently misclassifies instances coming from that region, the tier 2 classifier might be able to learn this behavior and, along with the learned behaviors of the other classifiers, correct such improper training. Cross-validation–type selection is typically used for training the tier 1 classifiers—the entire training data set is divided into k mutually exclusive subsets, and each tier 1 classifier is first trained on (a different set of) k - 1 subsets of the training data. Each classifier is then evaluated on the kth subset, which was not seen during training. The outputs of these classifiers on their pseudo-training blocks, along with the actual correct labels for those blocks, constitute the training data set for the tier 2 classifier.
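A compact sketch of this two-tier procedure with scikit-learn's StackingClassifier on synthetic data: diverse, strong tier 1 learners whose cross-validated (out-of-fold) predictions train a tier 2 meta-learner, here a logistic regression. The chosen learners and parameters are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

tier1 = [("rf", RandomForestClassifier(n_estimators=200, random_state=1)),
         ("svm", SVC(probability=True, random_state=1)),
         ("knn", KNeighborsClassifier(n_neighbors=7))]

stack = StackingClassifier(
    estimators=tier1,
    final_estimator=LogisticRegression(),  # the tier 2 meta-learner
    cv=5)                                  # out-of-fold predictions train the meta-learner

print("stacked ensemble:", cross_val_score(stack, X, y, cv=10).mean())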

Information Fusion

As part of heterogeneous model ensembles, information fusion combines (fuses) the output (i.e., predictions) of different types of models, such as decision trees, artificial neural networks, logistic regression, SVM, Naïve Bayes, and k-nearest neighbor, among others, and their variants. The difference between stacking and information fusion is the fact that information fusion has no "meta modeling" or "super learner." It simply combines the outcomes of heterogeneous strong classifiers using simple or weighted voting (for classification) or simple or weighted averaging (for regression). Therefore, it is simpler and less computationally demanding than stacking. In the process of combining the outcomes of multiple models, either a simple voting (each model contributes equally one vote) or a weighted combination of votes (each model contributes based on its prediction accuracy—more accurate models have higher weight values) can be used. Regardless of the combination method, this type of heterogeneous ensemble has been shown to be an invaluable addition to any data mining and predictive modeling project. Figure 5.24 graphically illustrates the process of building information fusion–type model ensembles.
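A minimal sketch of information fusion as described above, using scikit-learn's VotingClassifier on synthetic data: heterogeneous models combined by (optionally weighted) voting, with no meta-model. The weights shown are illustrative assumptions only.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

fusion = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=1)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier(n_neighbors=7))],
    voting="soft",          # average the predicted probabilities
    weights=[2, 1, 1])      # give the (usually more accurate) forest a larger say

print("information fusion:", cross_val_score(fusion, X, y, cv=10).mean())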

Summary—Ensembles are not Perfect!

As a prospective data scientist, if you are asked to build a prediction model (or any other analytics model, for that matter), you are expected to develop some of the popular model ensembles along with the standard individual models. If done properly, you will realize that ensembles are often more accurate and almost always more robust and reliable than the individual models. Although they seem like silver bullets, model ensembles are not without shortcomings; the following are the two most common ones.

FIGURE 5.23 Stacking-Type Model Ensembles.

COMPLEXITY Model ensembles are more complex than individual models. Occam's razor is a core principle that many data scientists follow; the idea is that simpler models are more likely to generalize better, so it is better to reduce/regularize complexity, or, in other words, to simplify the models so that the inclusion of each term, coefficient, or split in a model is justified by its reducing the error by a sufficient amount. One way to quantify the relationship between accuracy and complexity comes from information theory in the form of information theoretic criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and minimum description length (MDL). Traditionally, statisticians—and more recently, data scientists—have used these criteria to select variables in predictive modeling. Information theoretic criteria require a reduction in model error to justify additional model complexity. So the question is, "Do model ensembles violate Occam's razor?" Ensembles, after all, are much more complex than single models. According to Abbott (2014), if the ensemble accuracy is better on held-out data than that of single models, then the answer is "no," as long as we think about the complexity of a model in different terms—not just computational complexity but also behavioral complexity. Therefore, we should not fear that adding computational complexity (more terms, splits, or weights) will necessarily increase the complexity of models, because sometimes the ensemble will significantly reduce the behavioral complexity.

FIGURE 5.24 Illustration of Building Process for Information Fusion–Type Model Ensemble.

TRANSPARENCY The interpretation of ensembles can become quite difficult. If you build an RF ensemble containing 200 trees, how do you describe why a prediction has a particular value? You can examine each of the trees individually, although this is clearly not practical. For this reason, ensembles are often considered black-box models, meaning that what they do is not transparent to the modeler or domain expert. Although you can look at the split statistics (which variables are picked more often to split early in those 200 trees) to roughly judge the level of contribution (a pseudo variable-importance measure), every variable contributes to the trained model ensemble. Compared to a single decision tree, such an investigation of 200 trees is too difficult and is not an intuitive way to interpret how the model arrives at a specific prediction. Another way to determine which inputs to the model are most important is to perform a sensitivity analysis.

In addition to complexity and transparency, model ensembles are also more difficult and computationally more expensive to build and much more difficult to deploy. Table 5.9 shows the pros and cons of ensemble models compared with individual models.

In summary, model ensembles are the new frontier for predictive modelers who are interested in accuracy, by reducing either the errors in the models or the risk that the models behave erratically. Evidence for this is clear from the dominance of ensembles in predictive analytics and data mining competitions: ensembles always win.

The good news for predictive modelers is that many techniques for building ensembles are already built into software. The most popular ensemble algorithms (bagging, boosting, stacking, and their variants) are available in nearly every commercial or open source software tool. Building customized ensembles is also supported in many software products, whether based on a single algorithm or through a heterogeneous ensemble.

Ensembles are not appropriate for every solution—their applicability is determined by the modeling objectives defined during business understanding and problem definition—but they should be part of every predictive modeler's and data scientist's modeling arsenal.

TABLE 5.9 Brief List of Pros and Cons of Model Ensembles Compared to Individual Models

PROS (Advantages)
✓ Accuracy: Model ensembles usually result in more accurate models than individual models.
✓ Robustness: Model ensembles tend to be more robust against outliers and noise in the data set than individual models.
✓ Reliability (stable): Because of the variance reduction, model ensembles tend to produce more stable, reliable, and believable results than individual models.
✓ Coverage: Model ensembles tend to have better coverage of the hidden complex patterns in the data set than individual models.

CONS (Shortcomings)
✓ Complexity: Model ensembles are much more complex than individual models.
✓ Computationally expensive: Compared to individual models, ensembles require more time and computational power to build.
✓ Lack of transparency (explainability): Because of their complexity, it is more difficult to understand the inner structure of model ensembles (how they do what they do) than that of individual models.
✓ Harder to deploy: Model ensembles are much more difficult to deploy in an analytics-based managerial decision-support system than single models.

Application Case 5.6 To Imprison or Not to Imprison: A Predictive Analytics-Based Decision Support System for Drug Courts

Introduction and Motivation

Analytics has been used by many businesses, organizations, and government agencies to learn from past experiences to more effectively and efficiently use their limited resources to achieve their goals and objectives. Despite all the promises of analytics, however, its multidimensional and multidisciplinary nature can sometimes hinder its proper, full-fledged application. This is particularly true for the use of predictive analytics in several social science disciplines because these domains are traditionally dominated by descriptive analytics (causal-explanatory statistical modeling) and might not have easy access to the set of skills required to build predictive analytics models. A review of the extant literature shows that drug court is one such area. While many researchers have studied this social phenomenon, its characteristics, its requirements, and its outcomes from a descriptive analytics perspective, there currently is a dearth of predictive analytics models that can accurately and appropriately predict who would (or would not) graduate from intervention and treatment programs. To fill this gap, to help authorities better manage their resources, and to improve the outcomes, this study sought to develop and compare several predictive analytics models (both single models and ensembles) to identify who would graduate from these treatment programs.

Ten years after President Richard Nixon first declared a "war on drugs," President Ronald Reagan signed an executive order leading to stricter drug enforcement, stating, "We're taking down the surrender flag that has flown over so many drug efforts; we are running up a battle flag." The reinforcement of the war on drugs resulted in an unprecedented 10-fold surge in the number of citizens incarcerated for drug offences during the following two decades. The skyrocketing number of drug cases inundated court dockets, overloaded the criminal justice system, and overcrowded prisons. The abundance of drug-related caseloads, aggravated by a longer processing time than that for most other felonies, imposed tremendous costs on state and federal departments of justice. In response to the increased demand, court systems started to look for innovative ways to accelerate the inquest of drug-related cases. Perhaps analytics-driven decision support systems are the solution to the problem.

To support this claim, the current study’s goal was to build and compare several predictive models that use a large sample of data from drug courts across different locations to predict who is more likely to complete the treatment successfully. The researchers believed that this endeavor might reduce the costs to the criminal justice system and local communities.

Methodology

The methodology used in this research effort was a multi-step process that employed predictive analytics methods in a social science context. The first step of this process, which focused on understanding the problem domain and the need to conduct this study, was presented in the previous section. For the remaining steps of the process, the researchers employed a structured and systematic approach to develop and evaluate a set of predictive models using a large and feature-rich real-world data set. These steps included data understanding, data preprocessing, model building, and model evaluation; they are reviewed in this section. The approach also involved multiple iterations of experimentation and numerous modifications to improve individual tasks and to optimize the modeling parameters to achieve the best possible outcomes. A pictorial depiction of the methodology is given in Figure 5.25.

The Results

A summary of the models' performances based on accuracy, sensitivity, specificity, and AUC is presented in Table 5.10. As the results show, RF has the best classification accuracy and the greatest AUC among the models. The heterogeneous ensemble (HE) model closely follows RF, and SVM, ANN, and LR rank third to last based on their classification performances. RF also has the highest specificity and the second highest sensitivity. Sensitivity in the context of this study is an indicator of a model's ability to correctly predict the outcome for successfully graduated participants. Specificity, on the other hand, determines how well a model performs in predicting the end results for those who do not successfully complete the treatment. Consequently, it can be concluded that RF outperforms the other models for the drug courts data set used in this study.


FIGURE 5.25 Research Methodology Depicted as a Workflow.

TABLE 5.10 Performance of Predictive Models Using 10-Fold Cross-Validation on the Balanced Data Set

Model Type                Confusion Matrix (G, T)    Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC

Individual Models
  ANN                 G      6,831    1,072              86.63          86.76             86.49         0.909
                      T      1,042    6,861
  SVM                 G      6,911      992              88.67          89.63             87.75         0.917
                      T        799    7,104
  LR                  G      6,321    1,582              85.13          86.16             81.85         0.859
                      T        768    7,135

Ensembles
  RF                  G      6,998      905              91.16          93.44             89.12         0.927
                      T        491    7,412
  HE                  G      6,885    1,018              90.61          93.66             87.96         0.916
                      T        466    7,437

ANN: artificial neural networks; DT: decision trees; LR: logistic regression; RF: random forest; HE: heterogeneous ensemble; AUC: area under the curve; G: graduated; T: terminated



SECTION 5.9 REVIEW QUESTIONS

1. What is a model ensemble, and where can it be used analytically?

2. What are the different types of model ensembles?

3. Why are ensembles gaining popularity over all other machine-learning trends?

4. What is the difference between bagging- and boosting-type ensemble models?

5. What are the advantages and disadvantages of ensemble models?

Although the RF model performs better than the other models in general, it falls second to the HE model in the number of false negative predictions. Similarly, the HE model has a slightly better performance in true negative predictions. False positive predictions represent participants who were terminated from the treatment but whom the models mistakenly classified as successful graduates. False negatives pertain to individuals who graduated but whom the models predicted to be dropouts. False positive predictions are synonymous with increased costs and opportunity losses, whereas false negatives carry social impacts. Spending resources on those offenders who would recidivate at some point during the treatment and, hence, be terminated from the program prevented a number of (potentially successful) prospective offenders from participating in the treatment. Conspicuously, depriving potentially successful offenders of the treatment is against the initial objective of drug courts in reintegrating nonviolent offenders into their communities.

In summary, traditional causal-explanatory statistical modeling, or descriptive analytics, uses statistical inference and significance levels to test and evaluate the explanatory power of hypothesized underlying models or to investigate the association between variables retrospectively. Although a legitimate approach for understanding the relationships within the data used to build the model, descriptive analytics falls short in predicting outcomes for prospective observations. In other words, partial explanatory power does not imply predictive power, and predictive analytics is a must for building empirical models that predict well. Therefore, relying on the findings of this study, the application of predictive analytics (rather than the sole use of descriptive analytics) to predict the outcomes of drug courts is well grounded.

Questions for Case 5.6

1. What are drug courts and what do they do for society?

2. What are the commonalities and differences between traditional (theoretical) and modern (machine-learning) based methods in studying drug courts?

3. Can you think of other social situations and systems for which predictive analytics can be used?

Source: Zolbanin, H., and Delen, D. (2018). To Imprison or Not to Imprison: An Analytics-Based Decision Support System for Drug Courts. The Journal of Business Analytics (forthcoming).

Chapter Highlights

• Neural computing involves a set of methods that emulates the way the human brain works. The basic processing unit is a neuron. Multiple neurons are grouped into layers and linked together.

• There are differences between biological and artificial neural networks.

• In an artificial neural network, knowledge is stored in the weight associated with each connection between two neurons.

• Neural network applications abound in almost all business disciplines as well as in virtually all other functional areas.

• Business applications of neural networks include finance, bankruptcy prediction, time-series forecasting, and so on.

• There are various neural network architectures for different types of problems.

• Neural network architectures can be applied not only to prediction (classification or estimation) but also to clustering and optimization-type problems.

• SVM are among popular machine-learning techniques, mostly because of their superior predictive performance and their theoretical foundation.

• Although SVM can use a radial-basis function as a kernel, they are not very similar to neural networks.

• SVM can be used for both classification- and estimation/regression-type prediction problems.

• SVM use only numerical variables and follow the supervised machine-learning approach.

• Plenty of SVM applications exist, and new ones are emerging in a variety of domains including healthcare, finance, security, and energy.

• The nearest neighbor (or k-nearest neighbor) algorithm is a simple machine-learning technique that is used for both classification- and estimation/regression-type prediction problems.

• The nearest neighbor algorithm is a type of instance-based learning (or lazy learning) algorithm in which all computations are deferred until the actual prediction.

• The parameter k signifies the number of neighbors to use in a given prediction problem.

• Determining the "optimal" value of k requires a cross-validation–type experimentation.

• The nearest neighbor algorithm uses a distance measure to identify close-by/appropriate neighbors.

• The input variables to the nearest neighbor algorithm must be in numeric format; all non-numeric/nominal variables need to be converted to pseudo-binary numeric variables.

• Bayesian classifiers are built on the foundation of the Bayes theorem (i.e., conditional probabilities).

• Naïve Bayes is a simple probability-based classification method that is applied to classification-type prediction problems.

• The Naïve Bayes method requires input and output variables to have nominal values; numeric ones need to be discretized.

• Naïve keyword refers to the somewhat unrealistic yet practical assumption of independence (of the predictor/input variables).

• The Bayesian network (or Bayesian belief network) is a relatively new machine-learning technique that is gaining popularity among data scientists, academics, and theorists.

• The Bayesian network is a powerful tool for representing dependency structure in a graphical, explicit, and intuitive way.

• The Bayesian network can be used for prediction and explanation (of the interrelationships among the variables).

• Bayesian networks can be constructed manually (based on a domain expert's knowledge) or automatically using historical data.

• While constructing a Bayesian network automatically, one can use regular Naïve Bayes or the tree-augmented Naïve (TAN) Bayes.

• Bayesian networks provide an excellent model for conducting what-if analyses for a variety of hypothetical scenarios.

• Ensembles (or, more appropriately, model ensembles or ensemble modeling) are combinations of the outcomes produced by two or more analytics models into a compound output.

• Although ensembles are primarily used for prediction modeling when the scores of two or more models are combined to produce a better prediction, they can also be used for clustering and association.

• Ensembles can be applied to both classification (via voting) and estimation/regression-type (via averaging) prediction problems.

• Ensembles are used mainly for two reasons: to obtain better accuracy and to achieve more stable/reliable outcomes.

• Recent history in data science has shown that ensembles win competitions.

• There are homogeneous and heterogeneous ensembles; if the combined models are of the same type (e.g., decision trees), the ensemble is homogeneous; if not, it is heterogeneous.

• There are three methods in ensemble modeling: bagging, boosting, and stacking.

• Random forest is a bagging-type, homogeneous, decision tree–based ensemble method.

• Stochastic gradient boosting is a boosting-type, homogeneous, decision tree–based ensemble method.

• Information fusion and stacking are heterogeneous ensembles in which different types of models are combined.

• The disadvantages of ensembles include complexity and lack of transparency.


Key Terms

AdaBoost, artificial neural network (ANN), attrition, axon, backpropagation, bagging, Bayesian belief network (BBN), Bayesian network (BN), Bayes theorem, boosting, conditional probability, cross-validation, dendrites, distance metric, Euclidean distance, heterogeneous ensemble, hidden layer, Hopfield network, hyperplane, information fusion, k-fold cross-validation, k-nearest neighbor (kNN), kernel trick, Kohonen's self-organizing feature map, Manhattan distance, maximum margin, Minkowski distance, multi-layer perceptron, Naïve Bayes, neural computing, neural network, neuron, nucleus, pattern recognition, perceptron, processing element (PE), radial basis function (RBF), random forest, retention, stacking, supervised learning, stochastic gradient boosting, synapse, transformation (transfer) function, voting, weights, what-if scenario

Questions for Discussion

1. What is an artificial neural network and for what types of problems can it be used?
2. Compare artificial and biological neural networks. What aspects of biological networks are not mimicked by artificial ones? What aspects are similar?
3. What are the most common ANN architectures? For what types of problems can they be used?
4. ANN can be used for both supervised and unsupervised learning. Explain how they learn in a supervised mode and in an unsupervised mode.
5. What are SVM? How do they work?
6. What are the types of problems that can be solved by SVM?
7. What is the meaning of “maximum-margin hyperplanes”? Why are they important in SVM?
8. What is the kernel trick, and how does it relate to SVM?
9. What are the specific steps to follow in developing an SVM model?
10. How can the optimal kernel type and kernel parameters be determined?
11. What are the common application areas for SVM? Conduct a search on the Internet to identify popular application areas and specific SVM software tools used in those applications.
12. What are the commonalities and differences, advantages and disadvantages between ANN and SVM?
13. Explain the difference between a training and a testing data set in ANN and SVM. Why do we need to differentiate them? Can the same set be used for both purposes? Why or why not?
14. Everyone would like to make a great deal of money on the stock market. Only a few are very successful. Why is using an SVM or ANN a promising approach? What can they do that other decision support technologies cannot do? How could SVM or ANN fail?
15. What is special about the kNN algorithm?
16. What are the advantages and disadvantages of kNN as compared to ANN and SVM?
17. What are the critical success factors for a kNN implementation?
18. What is a similarity (or distance) measure? How can it be applied to both numerical and nominal valued variables?
19. What are the common (business and scientific) applications of kNN? Conduct a Web search to find three real-world applications that use kNN to solve the problem.
20. What is special about the Naïve Bayes algorithm? What is the meaning of “Naïve” in this algorithm?
21. What are the advantages and disadvantages of Naïve Bayes compared to other machine-learning methods?
22. What type of data can be used in a Naïve Bayes algorithm? What type of predictions can be obtained from it?
23. What is the process of developing and testing a Naïve Bayes classifier?
24. What are Bayesian networks? What is special about them?
25. What is the relationship between Naïve Bayes and Bayesian networks?
26. What is the process of developing a Bayesian network model?
27. What are the advantages and disadvantages of Bayesian networks compared to other machine-learning methods?
28. What is Tree Augmented Naïve (TAN) Bayes, and how does it relate to Bayesian networks?
29. What is a model ensemble, and where, analytically, can it be used?
30. What are the different types of model ensembles?
31. Why are ensembles gaining popularity over all other machine-learning trends?
32. What is the difference between bagging- and boosting-type ensemble models?
33. What are the advantages and disadvantages of ensemble models?


Exercises

Teradata University Network (TUN) and Other Hands-On Exercises

1. Go to the Teradata University Network Web site (teradatauniversitynetwork.com) or a URL given by your instructor. Locate Web seminars related to data mining and neural networks. Specifically, view the seminar given by Professor Hugh Watson at the SPIRIT2005 conference at Oklahoma State University; then, answer the following questions:

a. Which real-time application at Continental Airlines might have used a neural network?

b. What inputs and outputs can be used in building a neural network application?

c. Given that its data mining applications are in real time, how might Continental implement a neural network in practice?

d. What other neural network applications would you propose for the airline industry?

2. Go to the Teradata University Network Web site (teradatauniversitynetwork.com) or a URL given by your instructor. Locate the Harrah’s case. Read the case and answer the following questions:

a. Which of the Harrah’s data applications are most likely implemented using neural networks?

b. What other applications could Harrah’s develop using the data it collects from its customers?

c. What are some concerns you might have as a customer at this casino?

3. A bankruptcy-prediction problem can be viewed as a problem of classification. The data set you will be using for this problem includes five ratios that have been computed from the financial statements of real-world firms. These five ratios have been used in studies involving bankruptcy prediction. The first sample includes data on firms that went bankrupt and firms that did not. This will be your training sample for the neural network. The second sample of 10 firms also consists of some bankrupt firms and some nonbankrupt firms. Your goal is to use neural networks, SVM, and nearest neighbor algorithms to build a model using the first 20 data points and then to test its performance on the other 10 data points. (Try to analyze the new cases yourself manually before you run the neural network and see how well you do.) The following tables show the training sample and test data you should use for this exercise.

Training Sample

Firm WC/TA RE/TA EBIT/TA MVE/TD S/TA BR/NB

1 0.1650 0.1192 0.2035 0.8130 1.6702 1

2 0.1415 0.3868 0.0681 0.5755 1.0579 1

3 0.5804 0.3331 0.0810 1.1964 1.3572 1

4 0.2304 0.2960 0.1225 0.4102 3.0809 1

5 0.3684 0.3913 0.0524 0.1658 1.1533 1

6 0.1527 0.3344 0.0783 0.7736 1.5046 1

7 0.1126 0.3071 0.0839 1.3429 1.5736 1

8 0.0141 0.2366 0.0905 0.5863 1.4651 1

9 0.2220 0.1797 0.1526 0.3459 1.7237 1

10 0.2776 0.2567 0.1642 0.2968 1.8904 1

11 0.2689 0.1729 0.0287 0.1224 0.9277 0

12 0.2039 -0.0476 0.1263 0.8965 1.0457 0

13 0.5056 -0.1951 0.2026 0.5380 1.9514 0

14 0.1759 0.1343 0.0946 0.1955 1.9218 0

15 0.3579 0.1515 0.0812 0.1991 1.4582 0

16 0.2845 0.2038 0.0171 0.3357 1.3258 0

17 0.1209 0.2823 -0.0113 0.3157 2.3219 0

18 0.1254 0.1956 0.0079 0.2073 1.4890 0

19 0.1777 0.0891 0.0695 0.1924 1.6871 0

20 0.2409 0.1660 0.0746 0.2516 1.8524 0


Test Data

Firm WC/TA RE/TA EBIT/TA MVE/TD S/TA BR/NB

A 0.1759 0.1343 0.0946 0.1955 1.9218 ?

B 0.3732 0.3483 -0.0013 0.3483 1.8223 ?

C 0.1725 0.3238 0.1040 0.8847 0.5576 ?

D 0.1630 0.3555 0.0110 0.3730 2.8307 ?

E 0.1904 0.2011 0.1329 0.5580 1.6623 ?

F 0.1123 0.2288 0.0100 0.1884 2.7186 ?

G 0.0732 0.3526 0.0587 0.2349 1.7432 ?

H 0.2653 0.2683 0.0235 0.5118 1.8350 ?

I 0.1070 0.0787 0.0433 0.1083 1.2051 ?

J 0.2921 0.2390 0.0673 0.3402 0.9277 ?

Describe the results of the neural network, SVM, and nearest neighbor model predictions, including software, architecture, and training information.
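If you choose a scripting tool for this exercise, the following is one possible, purely illustrative scikit-learn sketch; it assumes the 20 training rows and 10 test rows above have been saved to files named train.csv and test.csv (hypothetical file names), and the model settings shown are untuned starting points rather than recommended choices.

# Illustrative sketch: build ANN, SVM, and kNN models on the bankruptcy data (assumed CSV files).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

features = ["WC/TA", "RE/TA", "EBIT/TA", "MVE/TD", "S/TA"]
train = pd.read_csv("train.csv")   # the 20 labeled firms (BR/NB = 1 or 0)
test = pd.read_csv("test.csv")     # the 10 firms A-J to be classified

X_train, y_train = train[features], train["BR/NB"]
X_test = test[features]

models = {
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(5,), max_iter=5000, random_state=1)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.predict(X_test))   # predicted BR/NB labels for firms A-J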

4. The purpose of this exercise is to develop models to predict the type of forest cover using a number of cartographic measures. The given data set (see Online Supplements) includes four wilderness areas found in the Roosevelt National Forest of northern Colorado. A total of 12 cartographic measures were utilized as independent variables; seven major forest cover types were used as dependent variables.

This is an excellent example of a multi-class classification problem. The data set is rather large (with 581,012 unique instances) and feature rich. As you will see, the data are also raw and skewed (unbalanced for different cover types). As a model builder, you are to make necessary decisions to preprocess the data and build the best possible predictor. Use your favorite tool to build the models for neural networks, SVM, and nearest neighbor algorithms, and document the details of your results and experiences in a written report. Use screenshots within your report to illustrate important and interesting findings. You are expected to discuss and justify any decision that you make along the way. The independent and dependent variables are listed in the following tables.

Independent Variables

Number  Name                                    Description
1       Elevation                               Elevation in meters
2       Aspect                                  Aspect in degrees azimuth
3       Slope                                   Slope in degrees
4       Horizontal_Distance_To_Hydrology        Horizontal distance to nearest surface water features
5       Vertical_Distance_To_Hydrology          Vertical distance to nearest surface water features
6       Horizontal_Distance_To_Roadways         Horizontal distance to nearest roadway
7       Hillshade_9am                           Hill shade index at 9 a.m., summer solstice
8       Hillshade_Noon                          Hill shade index at noon, summer solstice
9       Hillshade_3pm                           Hill shade index at 3 p.m., summer solstice
10      Horizontal_Distance_To_Fire_Points      Horizontal distance to nearest wildfire ignition points
11      Wilderness_Area (4 binary variables)    Wilderness area designation
12      Soil_Type (40 binary variables)         Soil-type designation

Dependent Variable

Number  Name                           Description
1       Cover_Type (7 unique types)    Forest cover–type designation

Note: More details about the data set (variables and observations) can be found in the online file.
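For readers working in Python, scikit-learn happens to ship this same forest cover data set via fetch_covtype. The sketch below is only a starting point under assumed choices (a single hidden-layer ANN, standard scaling, a stratified split); the class imbalance and other preprocessing decisions still need to be addressed as the exercise requires.

# Illustrative starting point for the forest cover-type exercise.
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

covtype = fetch_covtype()          # 581,012 instances, 54 columns, 7 cover types (targets 1-7)
X, y = covtype.data, covtype.target

# Stratified split preserves the (unbalanced) class proportions in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# max_iter is kept deliberately small here for speed; expect a convergence warning.
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, random_state=1))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))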

5. Go to UCI Machine-Learning Repository (archive.ics.uci.edu/ml/index.php), identify four data sets for classification-type problems, and use these data sets to build and compare ANN, SVM, kNN, and Naïve Bayes models. To do so, you can use any analytics tool. We suggest you use a free, open-source analytics tool such as KNIME (knime.org) or Orange (orange.biolab.si). Prepare a well-written report to summarize your findings.

6. Go to Google Scholar (scholar.google.com). Conduct a search to find two papers written in the last five years that compare and contrast multiple machine-learning methods for a given problem domain. Observe commonalities and differences among their findings and prepare a report to summarize your understanding.

Team Assignments and Role-Playing Projects

1. Consider the following set of data that relates daily electricity usage to a function of the outside high temperature (for the day):

Temperature, X Kilowatts, Y

46.8 12,530

52.1 10,800

55.1 10,180

59.2 9,730

61.9 9,750

66.2 10,230

69.9 11,160

76.8 13,910

79.7 15,110

79.3 15,690

80.2 17,020

83.3 17,880



a. Plot the raw data. What pattern do you see? What do you think is really affecting electricity usage?

b. Solve this problem with linear regression Y = a + bX (in a spreadsheet). How well does this work? Plot your results. What is wrong? Calculate the sum-of-squares error and R².

c. Solve this problem by using nonlinear regression. We recommend a quadratic function, Y = a + b1X + b2X². How well does this work? Plot your results. Is anything wrong? Calculate the sum-of-squares error and R².

d. Break the problem into three sections (look at the plot). Solve it using three linear regression models, one for each section. How well does this work? Plot your results. Calculate the sum-of-squares error and R². Is this modeling approach appropriate? Why or why not?

e. Build a neural network to solve the original problem. (You might have to scale the X and Y values to be between 0 and 1.) Train the network (on the entire set of data) and solve the problem (i.e., make predictions for each of the original data items). How well does this work? Plot your results. Calculate the sum-of-squares error and R².

f. Which method works best and why?
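As an illustrative aid for parts (b) and (c), the short NumPy sketch below fits the linear and quadratic models and computes the sum-of-squares error and R²; the data are keyed in from the table above, and using np.polyfit is just one reasonable choice among many.

# Minimal sketch for the electricity-usage regression (parts b and c of the assignment).
import numpy as np

x = np.array([46.8, 52.1, 55.1, 59.2, 61.9, 66.2, 69.9, 76.8, 79.7, 79.3, 80.2, 83.3])
y = np.array([12530, 10800, 10180, 9730, 9750, 10230, 11160, 13910, 15110, 15690, 17020, 17880])

def fit_and_report(degree):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)               # sum-of-squares error
    r2 = 1 - sse / np.sum((y - y.mean()) ** 2)   # coefficient of determination
    print(f"degree {degree}: SSE = {sse:,.0f}, R^2 = {r2:.3f}")

fit_and_report(1)   # part (b): Y = a + bX
fit_and_report(2)   # part (c): Y = a + b1*X + b2*X^2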

2. Build a real-world neural network. Using demo software downloaded from the Web (e.g., NeuroSolutions at neurodimension.com or another neural network tool/site), identify real-world data (e.g., start searching on the Web at archive.ics.uci.edu/ml/index.php or use data from an organization with which someone in your group has a contact) and build a neural network to make predictions. Topics might include sales forecasts, predicting success in an academic program (e.g., predict GPA from high school ranking and SAT scores, being careful to look out for “bad” data, such as GPAs of 0.0) or housing prices; or survey the class for weight, gender, and height and try to predict height based on the other two factors. You could also use U.S. Census data by state on this book’s Web site or at census.gov to identify a relationship between education level and income. How good are your predictions? Compare the results to predictions generated using standard statistical methods (regression). Which method is better? How could your system be embedded in a decision support system (DSS) for real decision making?

3. For each of the following applications, would it be better to use a neural network or an expert system? Explain your answers, including possible exceptions or special conditions.

a. Diagnosis of a well-established but complex disease
b. Price lookup subsystem for a high-volume merchandise seller
c. Automated voice inquiry processing system
d. Training of new employees
e. Handwriting recognition

4. Consider the following data set, which includes three attributes and a classification for admission decisions into an MBA program:


GMAT    GPA    Quantitative GMAT    Decision

650 2.75 35 NO

580 3.50 70 NO

600 3.50 75 YES

450 2.95 80 NO

700 3.25 90 YES

590 3.50 80 YES

400 3.85 45 NO

640 3.50 75 YES

540 3.00 60 ?

690 2.85 80 ?

490 4.00 65 ?

a. Using the data given here as examples, develop your own manual expert rules for decision making.

b. Build and test a neural network model using your favorite data mining tool. Experiment with different model parameters to “optimize” the predictive power of your model.

c. Build and test a support vector machine model using your favorite data mining tool. Experiment with different model parameters to “optimize” your model’s predictive power. Compare the results with ANN and SVM.

d. Report the predictions on the last three observations from each of the three classification approaches (ANN, SVM, and kNN). Comment on the results.

e. Comment on the similarities and differences of these three prediction approaches. What did you learn from this exercise?

5. You have worked on neural networks and other data mining techniques. Give examples of the use of each of them. Based on your knowledge, how would you differentiate among these techniques? Assume that a few years from now you will come across a situation in which neural network or other data mining techniques could be used to build an interesting application for your organization. You have an intern working with you to do the grunt work. How will you decide whether the application is appropriate for a neural network or another data mining model? Based on your homework assignments, what specific software guidance can you provide so that your intern is productive for you quickly? Your answer for this question might mention the specific software, describe how to go about setting up the model/neural network, and validate the application.

Internet Exercises

1. Explore the Web sites of several neural network vendors, such as California Scientific Software (calsci.com), NeuralWare (neuralware.com), and Ward Systems Group (wardsystems.com), and review some of their products. Download at least two demos and install, run, and compare them.

2. A very good repository of data that have been used to test the performance of neural network and other machine-learning algorithms can be accessed at https://archive.ics.uci.edu/ml/index.php. Some of the data sets are really meant to test the limits of current machine-learning algorithms and compare their performance against new approaches to learning. However, some of the smaller data sets can be useful for exploring the functionality of the software you might download in Internet Exercise 1 or the software that is available at StatSoft.com (i.e., Statistica Data Miner with extensive neural network capabilities). Download at least one data set from the UCI repository (e.g., Credit Screening Databases, Housing Database). Then apply neural networks as well as decision tree methods as appropriate. Prepare a report on your results. (Some of these exercises could also be completed in a group or even as semester-long projects for term papers and so on.)

3. Go to calsci.com and read about the company’s various business applications. Prepare a report that summarizes the applications.

4. Go to nd.com. Read about the company’s applications in investment and trading. Prepare a report about them.

5. Go to nd.com. Download the trial version of NeuroSolutions for Excel and experiment with it using one of the data sets from the exercises in this chapter. Prepare a report about your experience with the tool.

6. Go to neoxi.com. Identify at least two software tools that have not been mentioned in this chapter. Visit Web sites of those tools and prepare a brief report on their capabilities.

7. Go to neuroshell.com. Look at Gee Whiz examples. Comment on the feasibility of achieving the results claimed by the developers of this neural network model.

8. Go to easynn.com. Download the trial version of the software. After the installation of the software, find the sample file called Houseprices.tvq. Retrain the neural network and test the model by supplying some data. Prepare a report about your experience with this software.

9. Visit tibco.com. Download at least three white papers of applications. Which of these applications might have used neural networks?


10. Go to neuralware.com. Prepare a report about the products the company offers.

11. Go to ibm.com. Download at least two customer success stories or case studies that use advanced analytics or machine learning. Prepare a presentation for your understanding of these application cases.

12. Go to sas.com. Download at least two customer success stories or case studies that use advanced analytics or machine learning. Prepare a presentation for your understanding of these application cases.

13. Go to teradata.com. Download at least two customer success stories or case studies where advanced analytics or machine learning is used. Prepare a presentation for your understanding of these application cases.



Deep Learning and Cognitive Computing

LEARNING OBJECTIVES

■■ Learn what deep learning is and how it is changing the world of computing

■■ Know the placement of deep learning within the broad family of artificial intelligence (AI) learning methods

■■ Understand how traditional “shallow” artificial neural networks (ANN) work

■■ Become familiar with the development and learning processes of ANN

■■ Develop an understanding of the methods to shed light into the ANN black box

■■ Know the underlying concept and methods for deep neural networks

■■ Become familiar with different types of deep learning methods

■■ Understand how convolutional neural networks (CNN) work

■■ Learn how recurrent neural networks (RNN) and long short-term memory (LSTM) networks work

■■ Become familiar with the computer frameworks for implementing deep learning

■■ Know the foundational details about cognitive computing

■■ Learn how IBM Watson works and what types of applications it can be used for

Artificial intelligence (AI) is making a re-entrance into the world of computing and into our lives, this time far stronger and much more promising than before. This unprecedented re-emergence and the new level of expectations can largely be attributed to deep learning and cognitive computing. These two latest buzzwords define the leading edge of AI and machine learning today. Evolving out of traditional artificial neural networks (ANN), deep learning is changing the very foundation of how machine learning works. Thanks to large collections of data and improved computational resources, deep learning is making a profound impact on how computers can discover complex patterns using the self-extracted features from the data (as opposed to a data scientist providing the feature vector to the learning algorithm). Cognitive computing—first popularized by IBM Watson and its success against the best human players in the game show Jeopardy!—makes it possible to deal with a new class of problems, the type of problems that are thought to be solvable only by human ingenuity and creativity, ones that are characterized by ambiguity and uncertainty. This chapter covers the concepts, methods, and applications of these two cutting-edge AI technology trends.

6.1 Opening Vignette: Fighting Fraud with Deep Learning and Artificial Intelligence 316
6.2 Introduction to Deep Learning 320
6.3 Basics of “Shallow” Neural Networks 325
6.4 Process of Developing Neural Network–Based Systems 334
6.5 Illuminating the Black Box of ANN 340
6.6 Deep Neural Networks 343
6.7 Convolutional Neural Networks 349
6.8 Recurrent Networks and Long Short-Term Memory Networks 360
6.9 Computer Frameworks for Implementation of Deep Learning 368
6.10 Cognitive Computing 370

6.1 OPENING VIGNETTE: Fighting Fraud with Deep Learning and Artificial Intelligence

THE BUSINESS PROBLEM

Danske Bank is a Nordic universal bank with strong local roots and bridges to the rest of the world. Founded in October 1871, Danske Bank has helped people and businesses in the Nordics realize their ambitions for over 145 years. Its headquarters is in Denmark, with core markets in Denmark, Finland, Norway, and Sweden.

Mitigating fraud is a top priority for banks. According to the Association of Certified Fraud Examiners, businesses lose more than $3.5 trillion each year to fraud. The problem is pervasive across the financial industry and is becoming more prevalent and sophisticated each month. As customers conduct more banking online across a wider variety of channels and devices, there are more opportunities for fraud to occur. Adding to the problem, fraudsters are becoming more creative and technologically savvy—they are also using advanced technologies such as machine learning—and new schemes to defraud banks are evolving rapidly.

Old methods for identifying fraud, such as using human-written rules engines, catch only a small percentage of fraud cases and produce a significantly high number of false positives. While false negatives end up costing money to the bank, chasing after a large number of false positives not only costs time and money but also blemishes customer trust and satisfaction. To improve probability predictions and identify a much higher percentage of actual cases of fraud while reducing false alarms, banks need new forms of analytics. This includes using artificial intelligence.

Danske Bank, like other global banks, is seeing a seismic shift in customer interactions. In the past, most customers handled their transactions in a bank branch. Today, almost all interactions take place digitally through a mobile phone, tablet, ATM, or call center. This provides more “surface area” for fraud to occur. The bank needed to modernize its fraud detection defenses. It struggled with a low 40 percent fraud detection rate and was managing up to 1,200 false positives per day—and 99.5 percent of all cases the bank was investigating were not fraud related. That large number of false alarms required a substantial investment of people, time, and money to investigate what turned out to be dead ends. Working with Think Big Analytics, a Teradata company, Danske Bank made a strategic decision to apply innovative analytic techniques, including AI, to better identify instances of fraud while reducing false positives.


THE SOLUTION: DEEP LEARNING ENHANCES FRAUD DETECTION

Danske Bank integrated deep learning with graphics processing unit (GPU) appliances that were also optimized for deep learning. The new software system helps the analytics team to identify potential cases of fraud while intelligently avoiding false positives. Operational decisions are shifted from users to AI systems. However, human intervention is still necessary in some cases. For example, the model can identify anomalies, such as debit card purchases taking place around the world, but analysts are needed to determine whether that is fraud or a bank customer simply made an online purchase that sent a payment to China and then bought an item the next day from a retailer based in London.

Danske Bank’s analytic approach employs a “champion/challenger” methodology. With this approach, deep learning systems compare models in real time to determine which one is most effective. Each challenger processes data in real time, learning as it goes which traits are more likely to indicate fraud. If a process dips below a certain threshold, the model is fed more data, such as the geolocation of customers or recent ATM transactions. When a challenger outperforms other challengers, it transforms into a champion, giving the other models a roadmap to successful fraud detection.
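The case gives no implementation details, but the champion/challenger idea itself can be sketched in a few lines. The Python below is purely illustrative (the evaluation function, promotion margin, and names are assumptions, not Danske Bank’s actual system): challengers are scored on recent data, and one is promoted to champion only if it clearly outperforms the incumbent.

# Illustrative champion/challenger selection loop (an assumed sketch, not the bank's implementation).
def evaluate(model, labeled_transactions):
    """Placeholder scoring: return a performance measure (e.g., fraud-detection AUC)
    for `model` on the most recent labeled transactions."""
    X, y = labeled_transactions
    return model.score(X, y)   # any appropriate metric could be substituted here

def select_champion(champion, challengers, labeled_transactions, margin=0.01):
    """Keep the current champion unless a challenger beats it by a clear margin."""
    best, best_score = champion, evaluate(champion, labeled_transactions)
    for challenger in challengers:
        score = evaluate(challenger, labeled_transactions)
        if score > best_score + margin:     # promote only on a clear improvement
            best, best_score = challenger, score
    return best   # the (possibly new) champion used for real-time scoring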

THE RESULTS

Danske Bank implemented a modern enterprise analytic solution leveraging AI and deep learning, and it has paid big dividends. The bank was able to:

• Realize a 60 percent reduction in false positives with an expectation to reach as high as 80 percent.

• Increase true positives by 50 percent.

• Focus resources on actual cases of fraud.

The following graph (see Figure 6.1) shows how true and false positive rates improved with advanced analytics (including deep learning). The red dot represents the old rules engine, which caught only about 40 percent of all fraud. Deep learning improved significantly upon machine learning, allowing Danske Bank to better detect fraud with much lower false positives.

Enterprise analytics is rapidly evolving and moving into new learning systems enabled by AI. At the same time, hardware and processors are becoming more powerful and specialized, and algorithms more accessible, including those available through open source. This gives banks the powerful solutions needed to identify and mitigate fraud. As Danske Bank learned, building and deploying an enterprise-grade analytics solution that meets its specific needs and leverages its data sources delivers more value than traditional off-the-shelf tools could have provided. With AI and deep learning, Danske Bank now has the ability to better uncover fraud without being burdened by an unacceptable amount of false positives. The solution also allows the bank’s engineers, data scientists, lines of business, and investigative officers from Interpol, local police, and other agencies to collaborate to uncover fraud, including sophisticated fraud rings. With its enhanced capabilities, the enterprise analytic solution is now being used across other business areas of the bank to deliver additional value.

Because these technologies are still evolving, implementing deep learning and AI solutions can be difficult for companies to achieve on their own. They can benefit by partnering with a company that has the proven capabilities to implement technology-enabled solutions that deliver high-value outcomes. As shown in this case, Think Big Analytics, a Teradata company, has the expertise to configure specialized hardware and software frameworks to enable new operational processes. The project entailed integrating open-source solutions, deploying production models, and then applying deep learning analytics to extend and improve the models. A framework was created to manage and track the models in the production system and to make sure the models could be trusted. These models enabled the underlying system to make autonomous decisions in real time that aligned with the bank’s procedural, security, and high-availability guidelines. The solution provided new levels of detail, such as time series and sequences of events, to better assist the bank with its fraud investigations. The entire solution was implemented very quickly—from kickoff to live in only five months. Figure 6.2 shows a generalized framework for AI and deep learning–based enterprise-level analytics solutions.

In summary, Danske Bank undertook a multi-step project to productionize machine-learning techniques while developing deep learning models to test those techniques. The integrated models helped identify the growing problem of fraud. For a visual summary, watch the video (https://www.teradata.com/Resources/Videos/Danske-Bank-Innovating-in-Artificial-Intelligence) and/or read the blog (http://blogs.teradata.com/customers/danske-bank-innovating-artificial-intelligence-deep-learning-detect-sophisticated-fraud/).

FIGURE 6.1 Deep Learning Improves Both True Positives and True Negatives. (The plot of true positive rate versus false positive rate compares the old rules engine, classic machine learning, and several deep learning models: Ensemble (area = 0.89), CNN (area = 0.95), ResNet (area = 0.94), and LSTM (area = 0.90), along with a random-prediction baseline.)


QUESTIONS FOR THE OPENING VIGNETTE

1. What is fraud in banking?
2. What are the types of fraud that banking firms are facing today?
3. What do you think are the implications of fraud on banks and on their customers?
4. Compare the old and new methods for identifying and mitigating fraud.
5. Why do you think deep learning methods provided better prediction accuracy?
6. Discuss the trade-off between false positive and false negative (type 1 and type 2 errors) within the context of predicting fraudulent activities.

WHAT WE CAN LEARN FROM THIS VIGNETTE

As you will see in this chapter, AI in general and the methods of machine learning in particular are evolving and advancing rapidly. The use of large digitized data sources, both from inside and outside the organization, both structured and unstructured, along with advanced computing systems (software and hardware combinations), has paved the way toward dealing with problems that were thought to be unsolvable just a few years ago. Deep learning and cognitive computing (the cutting edge of today’s AI systems) are helping enterprises to make accurate and timely decisions by harnessing the rapidly expanding Big Data resources. As shown in this opening vignette, this new generation of AI systems is capable of solving problems much better than their older counterparts. In the domain of fraud detection, traditional methods have always been marginally useful, having higher than desired false positive rates and causing unnecessary investigations and thereby dissatisfaction for customers. As difficult as problems such as fraud detection are, new AI technologies like deep learning are making them solvable with a high level of accuracy and applicability.

Source: Teradata Case Study. “Danske Bank Fights Fraud with Deep Learning and AI.” https://www.teradata.com/Resources/Case-Studies/Danske-Bank-Fight-Fraud-With-Deep-Learning-and-AI (accessed August 2018). Used with permission.

FIGURE 6.2 A Generalized Framework for AI and Deep Learning–Based Analytics Solutions. (Cross-functional teams move through an iterative cycle of activities such as analyzing data, engineering, simulating, testing, integrating, going live, validating insights, and handing over to production. The framework’s components include: AI Strategy, which analyzes business priorities, identifies AI use cases, reviews key enterprise AI capabilities, and provides recommendations and next steps for customers to successfully get value from AI; AI Rapid Analytic Consulting Engagement (RACE), which uses AI exploration to test use cases and provide a proof of value for AI approaches; AI Foundation, which operationalizes use cases through data science and engineering and builds and deploys a deep learning platform integrating data sources, models, and business processes; and AI as-a-Service, which manages an iterative, stage-gate process for analytic models from development to handover to operations.)


6.2 INTRODUCTION TO DEEP LEARNING

About a decade ago, conversing with an electronic device (in human language, intelligently) would have been inconceivable, something that could only be seen in sci-fi movies. Today, however, thanks to the advances in AI methods and technologies, almost everyone has experienced this unthinkable phenomenon. You probably have already asked Siri or Google Assistant several times to dial a number from your phone address book or to find an address and give you specific directions while you were driving. Sometimes when you were bored in the afternoon, you may have asked Google Home or Amazon’s Alexa to play some music in your favorite genre on the device or your TV. You might have been surprised at times when you uploaded a group photo of your friends on Facebook and observed its tagging suggestions, where the name tags often exactly match your friends’ faces in the picture. Translating a manuscript from a foreign language does not require hours of struggling with a dictionary; it is as easy as taking a picture of that manuscript in the Google Translate mobile app and giving it a fraction of a second. These are only a few of the many, ever-increasing applications of deep learning that have promised to make life easier for people.

Deep learning, as the newest and perhaps at this moment the most popular member of the AI and machine-learning family, has a goal similar to those of the other machine-learning methods that came before it: mimic the thought process of humans—using mathematical algorithms to learn from data pretty much the same way that humans learn. So, what is really different (and advanced) in deep learning? Here is the most commonly pronounced differentiating characteristic of deep learning over traditional machine learning. The performance of traditional machine-learning algorithms such as decision trees, support vector machines, logistic regression, and neural networks relies heavily on the representation of the data. That is, only if we (analytics professionals or data scientists) provide those traditional machine-learning algorithms with relevant and sufficient pieces of information (a.k.a. features) in proper format are they able to “learn” the patterns and thereby perform their prediction (classification or estimation), clustering, or association tasks with an acceptable level of accuracy. In other words, these algorithms need humans to manually identify and derive features that are theoretically and/or logically relevant to the objectives of the problem at hand and feed these features into the algorithm in a proper format. For example, in order to use a decision tree to predict whether a given customer will return (or churn), the marketing manager needs to provide the algorithm with information such as the customer’s socioeconomic characteristics—income, occupation, educational level, and so on (along with demographic and historical interactions/transactions with the company). But the algorithm itself is not able to define such socioeconomic characteristics and extract such features, for instance, from survey forms completed by the customer or obtained from social media.

While such a structured, human-mediated machine-learning approach has been working fine for rather abstract and formal tasks, it is extremely challenging to have the approach work for some informal, yet seemingly easy (to humans), tasks such as face identification or speech recognition since such tasks require a great deal of knowledge about the world (Goodfellow et al., 2016). It is not straightforward, for instance, to train a machine-learning algorithm to accurately recognize the real meaning of a sentence spoken by a person just by manually providing it with a number of grammatical or semantic features. Accomplishing such a task requires a “deep” knowledge about the world that is not easy to formalize and explicitly present. What deep learning has added to the classic machine-learning methods is in fact the ability to automatically acquire the knowledge required to accomplish such informal tasks and consequently extract some advanced features that contribute to the superior system performance.

To develop an intimate understanding of deep learning, one should learn where it fits within the broader family of AI methods. A simple hierarchical relationship diagram, or a taxonomy-like representation, may in fact provide such a holistic understanding. In an attempt to do this, Goodfellow and his colleagues (2016) categorized deep learning as part of the representation learning family of methods. Representation learning techniques entail one type of machine learning (which is also a part of AI) in which the emphasis is on learning and discovering features by the system in addition to discovering the mapping from those features to the output/target. Figure 6.3 uses a Venn diagram to illustrate the placement of deep learning within the overarching family of AI-based learning methods.

Figure 6.4 highlights the differences in the steps/tasks that need to be performed when building a typical deep learning model versus the steps/tasks performed when building models with classic machine-learning algorithms. As shown in the top two workflows, knowledge-based systems and classic machine-learning methods require data scientists to manually create the features (i.e., the representation) to achieve the desired output. The bottommost workflows show that deep learning enables the computer to derive some complex features from simple concepts that would be very effort intensive (or perhaps impossible in some problem situations) to be discovered by humans manually, and then it maps those advanced features to the desired output.

From a methodological viewpoint, although deep learning is generally believed to be a new area in machine learning, its initial idea goes back to the late 1980s, just a few decades after the emergence of artificial neural networks, when LeCun and colleagues (1989) published an article about applying backpropagation networks for recognizing handwritten ZIP codes. In fact, as it is being practiced today, deep learning seems to be nothing but an extension of neural networks, with the idea that deep learning is able to deal with more complicated tasks with a higher level of sophistication by employing many layers of connected neurons along with much larger data sets to automatically characterize variables and solve the problems, but only at the expense of a great deal of computational effort. This very high computational requirement and the need for very large data sets were the two main reasons why the initial idea had to wait more than two decades until some advanced computational and technological infrastructure emerged for deep learning’s practical realization. Although the scale of neural networks has dramatically increased in the past decade by the advancement of related technologies, it is still estimated that having artificial deep neural networks with the comparable number of neurons and level of complexity existing in the human brain will take several more decades.

FIGURE 6.3 A Venn Diagram Showing the Placement of Deep Learning within the Overarching AI-Based Learning Methods. (Nested sets: Artificial Intelligence contains Machine Learning, which contains Representation Learning, which contains Deep Learning; examples such as CNN, RNN, LSTM, and autoencoders sit inside deep learning; decision trees, logistic regression, clustering, and PCA/ICA sit inside machine learning; and robotics, fuzzy logic, and knowledge-based/expert systems sit elsewhere within AI.)

In addition to the computer infrastructures, as mentioned, the availability of large and feature-rich digitized data sets was another key reason for the development of successful deep learning applications in recent years. Obtaining good performance from a deep learning algorithm used to be a very difficult task that required extensive skills and experience/understanding to design task-specific networks, and therefore, not many were able to develop deep learning for practical and/or research purposes. Large training data sets, however, have greatly compensated for the lack of intimate knowledge and reduced the level of skill needed for implementing deep neural networks. Nevertheless, although the size of available data sets has exponentially increased in recent years, a great challenge, especially for supervised learning of deep networks, is now the labeling of the cases in these huge data sets. As a result, a great deal of research is ongoing, focusing on how we can take advantage of large quantities of unlabeled data for semisupervised or unsupervised learning or how we can develop methods to label examples in bulk in a reasonable time.

The following section of this chapter provides a general introduction to neural networks, from where deep learning has originated. Following the overview of these “shallow” neural networks, the chapter introduces different types of deep learning architectures and how they work, some common applications of these deep learning architectures, and some popular computer frameworks to use in implementing deep learning in practice. Since, as mentioned, the basics of deep learning are the same as those of artificial neural networks, in the following section we provide a brief coverage of the neural network architecture (namely, multilayered perceptron [MLP]-type neural networks, which was omitted in the neural network section in Chapter 5 because it was to be covered here) to focus on their mathematical principles and then explain how the various types of deep learning architectures/approaches were derived from these foundations. Application Case 6.1 provides an interesting example of what deep learning and advanced analytics techniques can achieve in the field of football.

FIGURE 6.4 Illustration of the Key Differences between Classic Machine-Learning Methods and Representation Learning/Deep Learning (shaded boxes indicate components that are able to learn directly from data). (Knowledge-based systems map inputs to outputs through manually created representations; classic machine learning maps manually created features to outputs; generic representation learning auto-creates features and maps them to the output; deep learning builds more advanced features from simple features before mapping them to the output.)

Application Case 6.1 Finding the Next Football Star with Artificial Intelligence

Football. Soccer. The beautiful game. Whatever you call it, the world’s most popular sport is being transformed by a Dutch start-up bringing AI to the pitch. SciSports, founded in 2012 by two self-proclaimed football addicts and data geeks, is innovating on the edge of what is possible. The sports analytics company uses streaming data and applies machine learning, deep learning, and AI to capture and analyze these data, making way for innovations in everything from player recruitment to virtual reality for fans.

Player Selection Goes High Tech

In the era of eight-figure contracts, player recruitment is a high-stakes game. The best teams are not those with the best players but the best combination of players. Scouts and coaches have used observation, rudimentary data, and intuition for decades, but savvy clubs now are using advanced analytics to identify rising stars and undervalued players. “The SciSkill Index evaluates every professional football player in the world in one universal index,” says SciSports founder and CEO Giels Brouwer. The company uses machine-learning algorithms to calculate the quality, talent, and value of more than 200,000 players. This helps clubs find talent, look for players who fit a certain profile, and analyze their opponents.

Every week, more than 1,500 matches in 210 leagues are analyzed by the SciSkill technology. Armed with this insight, SciSports partners with elite football clubs across Europe and other continents to help them sign the right players. This has led to several unexpected—and in some cases lucrative—player acquisitions. For example, a second-division Dutch player did not want to renew his contract, so he went out as a free agent. A new club reviewed the SciSkill index and found his data intriguing. That club was not too sure at first because it thought he looked clumsy in scouting—but the data told the true story. The club signed him as the third striker, and he quickly moved into a starting role and became its top goal scorer. His rights were sold at a large premium within two years, and now he is one of the top goal scorers in Dutch professional football.

Real-Time 3D Game Analysis

Traditional football data companies generate data only on players who have the ball, leaving everything else undocumented. This provides an incomplete picture of player quality. Seeing an opportunity to capture the immense amount of data regarding what happens away from the ball, SciSports developed a camera system called BallJames.

BallJames is a real-time tracking technology that automatically generates 3D data from video. Fourteen cameras placed around a stadium record every movement on the field. BallJames then generates data such as the precision, direction, and speed of the passing, sprinting strength, and jumping strength. “This forms a complete picture of the game,” says Brouwer. “The data can be used in lots of cool ways, from allowing fans to experience the game from any angle using virtual reality, to sports betting and fantasy sports.” He added that the data can even help coaches on the bench. “When they want to know if a player is getting tired, they can substitute players based on analytics.”

Machine Learning and Deep Learning

SciSports models on-field movements using machine-learning algorithms, which by nature improve on performing a task as the player gains more experience. On the pitch, BallJames works by automatically assigning a value to each action, such as a corner kick. Over time, these values change based on their success rate. A goal, for example, has a high value, but a contributing action—which may have previously had a low value—can become more valuable as the platform masters the game. Wouter Roosenburg, SciSports chief technology officer, says AI and machine learning will play an important role in the future of SciSports and football analytics in general. “Existing mathematical models model existing knowledge and insights in football, while artificial intelligence and machine learning will make it possible to discover new connections that people wouldn’t make themselves.”

To accurately compile 3D images, BallJames must distinguish between players, referees, and the ball. SAS Event Stream Processing enables real-time image recognition using deep learning models. “By combining our deep learning models into SAS Viya, we can train our models in-memory in the cloud, on our cameras or wherever our resources are,” says Roosenburg. The ability to deploy deep learning models in memory onto cameras and then do the inferencing in real time is cutting-edge science. “Having one uniform platform to manage the entire 3-D production chain is invaluable,” says Roosenburg. “Without SAS Viya, this project would not be possible.”

Adding Oomph to Open Source

Previously SciSports exclusively used open source to build models. It now benefits from an end-to-end platform that allows analytical teams to work in their language of choice and share a single, managed analytical asset inventory across the organization. According to Brouwer, this enables the firm to attract employees with different open-source skills yet still manage the production chain using one platform. “My CTO tells me he loves that our data scientists can do all the research in open source and he doesn’t have to worry about the production of the models,” says Brouwer. “What takes 100 lines of code in Python only takes five in SAS. This speeds our time to market, which is crucial in sports analytics.”

SciSports Facts & Figures: one universal index covering every professional football player; 200,000 players analyzed in the SciSkill Index; 14 cameras around the pitch enabling real-time analysis.

Since its inception, SciSports has quickly become one of the world’s fastest-growing sports analytics companies. Brouwer says the versatility of the SAS Platform has also been a major factor. “With SAS, we’ve got the ability to scale processing power up or down as needed, put models into production in real time, develop everything in one platform and integrate with open source. Our ambition is to bring real-time data analytics to billions of soccer fans all over the world. By partnering with SAS, we can make that happen.”

Questions for Case 6.1

1. What does SciSports do? Look at its Web site for more information.
2. How can advanced analytics help football teams?
3. What is the role of deep learning in solutions provided by SciSports?

Sources: SAS Customer Stories. “Finding the Next Football Star with Artificial Intelligence.” www.sas.com/en_us/customers/scisports.html (accessed August 2018). Copyright © 2018 SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Used with permission.


SECTION 6.2 REVIEW QUESTIONS

1. What is deep learning? What can deep learning do?
2. Compared to traditional machine learning, what is the most prominent difference of deep learning?
3. List and briefly explain different learning methods in AI.
4. What is representation learning, and how does it relate to deep learning?

6.3 BASICS OF “SHALLOW” NEURAL NETWORKS

Artificial neural networks are essentially simplified abstractions of the human brain and its complex biological networks of neurons. The human brain has a set of billions of interconnected neurons that facilitate our thinking, learning, and understanding of the world around us. Theoretically speaking, learning is nothing but the establishment and adaptation of new or existing interneuron connections. In the artificial neural networks, however, neurons are processing units (also called processing elements [PEs]) that perform a set of predefined mathematical operations on the numerical values coming from the input variables or from the other neuron outputs to create and push out its own outputs. Figure 6.5 shows a schematic representation of a single-input and single-output neuron (more accurately, the processing element in artificial neural networks).

In this figure, p represents a numerical input. Each input goes into the neuron with an adjustable weight w and a bias term b. A weight function multiplies the input by the weight to produce the weighted input z, and a net input function (the summation shown in the figure) adds the bias term to the weighted input. The output of the net input function (n, known as the net input) then goes through another function called the transfer (a.k.a. activation) function (shown by f) for conversion and the production of the actual output a. In other words:

a = f (wp + b)


FIGURE 6.5 General Single-Input Artificial Neuron Representation: the input p is multiplied by the weight w, the bias b is added to form the net input n, and the transfer function f produces the output a = f(wp + b).


A numerical example: if w = 2, p = 3, and b = -1, then a = f (2 * 3 - 1) = f (5). Various types of transfer functions are commonly used in the design of neural networks. Table 6.1 shows some of the most common transfer functions and their corresponding operations. Note that in practice, selection of proper transfer functions for a network requires a broad knowledge of neural networks—characteristics of the data as well as the specific purpose for which the network is created.

Just to provide an illustration, if in the previous example we had a hard limit transfer function, the actual output a would be a = hardlim(5) = 1. There are some guidelines for choosing the appropriate transfer function for each set of neurons in a network. These guidelines are especially robust for the neurons located at the output layer of the network. For example, if the nature of the output for a model is binary, we are advised to use a sigmoid transfer function at the output layer so that it produces an output between 0 and 1, which represents the conditional probability of y = 1 given x, or P(y = 1 | x). Many neural network textbooks provide and elaborate on such guidelines for the different layers of a network with some consistency and much disagreement, suggesting that the best practices should (and usually do) come from experience.

TABLE 6.1 Common Transfer (Activation) Functions in Neural Networks

Transfer Function                                   Form              Operation
Hard limit                                          a = hardlim(n)    a = +1 if n > 0; a = 0 if n < 0
Linear                                              a = purelin(n)    a = n
Log-Sigmoid                                         a = logsig(n)     a = 1 / (1 + e^-n)
Positive linear (a.k.a. rectified linear or ReLU)   a = poslin(n)     a = n if n > 0; a = 0 if n < 0
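To make the table concrete, here is a minimal Python sketch (written for this discussion, not taken from any particular neural network package; the function names simply mirror the table) that applies each transfer function to the single-input neuron example from the text, a = f(wp + b) with w = 2, p = 3, and b = -1:

```python
# Transfer (activation) functions from Table 6.1 applied to a single-input neuron.
import numpy as np

def hardlim(n):   # hard limit: 1 if n > 0, else 0
    return np.where(n > 0, 1.0, 0.0)

def purelin(n):   # linear
    return n

def logsig(n):    # log-sigmoid: 1 / (1 + e^-n)
    return 1.0 / (1.0 + np.exp(-n))

def poslin(n):    # positive linear (ReLU)
    return np.maximum(0.0, n)

w, p, b = 2.0, 3.0, -1.0          # values from the numerical example in the text
n = w * p + b                     # net input: 2*3 - 1 = 5
for f in (hardlim, purelin, logsig, poslin):
    print(f.__name__, f(n))       # hardlim(5) = 1, purelin(5) = 5, logsig(5) ≈ 0.993, poslin(5) = 5
```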


Typically, a neuron has more than a single input. In that case, each individual input pi can be shown as an element of the input vector p. Each of the individual input values would have its own adjustable weight wi of the weight vector W. Figure 6.6 shows a multiple-input neuron with R individual inputs.

For this neuron, the net input n can be expressed as:

n = w1,1 p1 + w1,2 p2 + w1,3 p3 + . . . + w1,R pR + b

Considering the input vector p as an R × 1 vector and the weight vector W as a 1 × R vector, n can be written in matrix form as:

n = Wp + b

where Wp is a scalar (i.e., a 1 × 1 matrix).

Moreover, each neural network is typically composed of multiple neurons connected to each other and structured in consecutive layers so that the outputs of a layer work as the inputs to the next layer. Figure 6.7 shows a typical neural network with four neurons at the input (i.e., first) layer, four neurons at the hidden (i.e., middle) layer, and a single neuron at the output (i.e., last) layer. Each of the neurons has its own weight, weighting function, bias, and transfer function and processes its own input(s) as described.
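The matrix form n = Wp + b maps directly onto array operations. The following short sketch, with illustrative (made-up) values for p, W, and b, computes the net input and output of a multiple-input neuron like the one in Figure 6.6, assuming a log-sigmoid transfer function:

```python
# A multiple-input neuron: n = Wp + b, a = f(n). Values are illustrative only.
import numpy as np

R = 3                                   # number of inputs
p = np.array([[1.0], [0.5], [-2.0]])    # input vector p (R x 1)
W = np.array([[0.2, -0.4, 0.1]])        # weight vector W (1 x R)
b = 0.3                                 # bias term

n = W @ p + b                           # net input (a 1 x 1 result)
a = 1.0 / (1.0 + np.exp(-n))            # log-sigmoid transfer function
print(n.item(), a.item())
```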

FIGURE 6.6 Typical Multiple-Input Neuron with R Individual Inputs: the input vector p (R × 1) is multiplied by the weight vector W (1 × R), the bias b is added to form the net input n, and the transfer function f produces the output a = f(Wp + b).

FIGURE 6.7 Typical Neural Network with Three Layers (Input, Hidden, and Output) and Eight Neurons.



While the inputs, weighting functions, and transfer functions in a given network are fixed, the values of the weights and biases are adjustable. The process of adjusting weights and biases in a neural network is what is commonly called training. In fact, in practice, a neural network cannot be used effectively for a prediction problem unless it is well trained by a sufficient number of examples with known actual outputs (a.k.a. targets). The goal of the training process is to adjust network weights and biases such that the network output for each set of inputs (i.e., each sample) is adequately close to its corresponding target value.

Application Case 6.2 provides a case where computer gaming companies are using advanced analytics to better understand and engage with their customers.

Application Case 6.2 Gaming Companies Use Data Analytics to Score Points with Players

Video gamers are a special breed. Sure, they spend a lot of time playing games, but they’re also build- ing social networks. Like sports athletes, video game players thrive on competition. They play against other gamers online. Those who earn first place, or even second or third place, have bragging rights. And like athletes who invest a lot of time training, video gamers take pride in the number of hours they spend playing. Furthermore, as games increase in complexity, gamers take pride in developing unique skills to best their compatriots.

A New Level of Gaming

Video gaming has evolved from the days of PAC-MAN and arcades. The widespread availability of the Internet has fueled the popularity of video games by bringing them into people’s homes via a wide range of electronics such as the personal computer and mobile devices. The world of computer games is now a powerful and profitable business.

According to NewZoo’s Global Games Market Report from April 2017, the global games market in 2017 saw:

• $109 billion in revenues.

• 7.8 percent increase from the previous year.

• 2.2 billion gamers globally.

• 42 percent of the market being mobile.

Video game companies can tap into this environment and learn valuable information about their customers, especially their behaviors and the underlying motivations. These customer data enable companies to improve the gaming experience and better engage players.

Traditionally, the gaming industry appealed to its customers—the gamers—by offering striking graphics and captivating visualizations. As technology advanced, the graphics became more vivid with hi-def renditions. Companies have continued to use technology in highly creative ways to develop games that attract customers and capture their interests, which results in more time spent playing and higher affinity levels. What video game companies have not done as well is to fully utilize technology to understand the contextual factors that drive sustained brand engagement.

Know the Players

In today’s gaming world, creating an exciting product is no longer enough. Games must strongly appeal to the visual and auditory senses in an era when people expect cool graphics and cutting-edge sound effects. Games must also be properly marketed to reach highly targeted player groups. There are also opportunities to monetize gaming characters in the form of commercially available merchandise (e.g., toy store characters) or movie rights.



Making a game successful requires programmers, designers, scenarists, musicians, and marketers to work together and share information. That is where gamer and gaming data come into play.

For example, the size of a gamer’s network— the number and types of people a gamer plays with or against—usually correlates with more time spent playing and more money that is spent. The more relationships gamers have, the higher the likeli- hood they will play more games with more people because they enjoy the experience. Network effects amplify engagement volumes.

These data also help companies better understand the types of games each individual likes to play. These insights enable the company to recommend additional games across other genres that will likely exert a positive impact on player engagement and satisfaction. Companies can also use these data in marketing campaigns to target new gamers or entice existing gamers to upgrade their memberships, for example, to premium levels.

Monetize Player Behaviors

Collaborative filtering (cFilter) is an advanced analytic function that makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The cFilter function supposes that if User A has the same opinion as User B on one issue, then User A is more likely to have User B’s opinion on a different issue when compared to a random user. This means that predictions are specific to a gamer but based on data from many other gamers.

Filtering systems are often used by online retailers to make product recommendations. The analytics can determine products that a customer will like based on what other shoppers who made similar purchases also bought, liked, or rated highly. There are many examples across other industries such as healthcare, finance, manufacturing, and telecommunication.
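The cFilter function itself is not shown here, but the underlying idea can be illustrated with a small, hypothetical gamer-by-game ratings matrix: similarities between gamers are computed (here with cosine similarity), and a gamer’s predicted interest in an unplayed game is the similarity-weighted average of other gamers’ ratings. A rough sketch, with invented data:

```python
# A generic user-based collaborative filtering sketch (not the cFilter function);
# the ratings matrix and game names are hypothetical.
import numpy as np

games = ["Shooter", "Racing", "Puzzle", "Strategy"]
ratings = np.array([          # rows = gamers, columns = games, 0 = not played/rated
    [5, 3, 0, 1],             # gamer A
    [4, 0, 0, 1],             # gamer B
    [1, 1, 5, 4],             # gamer C
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 1                                  # recommend games for gamer B
sims = np.array([cosine(ratings[target], ratings[i]) for i in range(len(ratings))])
sims[target] = 0.0                          # ignore self-similarity
pred = sims @ ratings / (sims.sum() + 1e-9) # similarity-weighted average of others' ratings
unplayed = ratings[target] == 0
print("Recommend:", [g for g, u in zip(games, unplayed) if u])
print("Predicted scores:", dict(zip(games, np.round(pred, 2))))
```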

The cFilter analytic function offers several benefits to online video game companies:

• Marketers can run more effective campaigns. Connections between gamers naturally form to create clusters. Marketers can isolate common player characteristics and leverage those insights for campaigns. Conversely, they can isolate players who do not belong to a cluster and determine what unique characteristics contribute to their nonconforming behaviors.

• Companies can improve player retention. A strong membership in a community of gam- ers decreases the chances of churn. The greater the incentives for gamers to belong to a group of active participants, the more desire they have to engage in competitions. This increases the “stickiness” of customers and can lead to more game subscriptions.

• Data insights lead to improved customer satisfaction. Clusters indicate a desire for certain types of games that correspond to distinct gamer interests and behaviors. Companies can create gaming experiences that are unique to each player. Enticing more people to play and play longer enhances gamer satisfaction.

Once they understand why customers want to play games and uncover their relationships with other gamers, companies can create the right incentives for players to keep returning. This ensures a sus- tained customer base and stable revenue streams.

Boost Loyalty and Revenue

Regardless of the genre, each video game has passionate players who seek each other out for competitions. The thrill of a conquest attracts avid engagement. Over time, distinct networks of gamers are formed, with each participant constructing social relationships that often lead to more frequent and intense gaming interactions.

The gaming industry is now utilizing data analytics and visualizations to discern customer behaviors better and uncover player motivations. Looking at customer segments is no longer enough. Companies are now looking at microsegments that go beyond traditional demographics like age or geo- graphic location to understand customer preferences such as favorite games, preferred levels of difficulty, or game genres.

By gaining analytic insights into gamer strategies and behaviors, companies can create unique gaming experiences that are attuned to these behaviors. By engaging players with the games and features they desire, video game companies gain a devoted following, grow profits, and develop new revenue streams through merchandising ventures.

For a visual treat, watch a short video (https://www.teradata.com/Resources/Videos/Art-of-Analytics-The-Sword) to see how companies can use analytics to decipher gamer relationships that drive user behaviors and lead to better games.

Questions for Case 6.2

1. What are the main challenges for gaming companies?

2. How can analytics help gaming companies stay competitive?

3. What types of data can gaming companies obtain and use for analytics?

Source: Teradata Case Study. https://www.teradata.com/Resources/Case-Studies/Gaming-Companies-Use-Data-Analytics (accessed August 2018).

Technology Insight 6.1 briefly describes the common components (or elements) of a typical artificial neural network along with their functional relationships.

TECHNOLOGY INSIGHT 6.1 Elements of an Artificial Neural Network

A neural network is composed of processing elements that are organized in different ways to form the network’s structure. The basic processing unit in a neural network is the neuron. A number of neurons are then organized to establish a network of neurons. Neurons can be organized in a number of different ways; these various network patterns are referred to as topologies or network architectures (some of the most common architectures are summarized in Chapter 5). One of the most popular approaches, known as the feedforward-multilayered perceptron, allows all neurons to link the output in one layer to the input of the next layer, but it does not allow any feedback linkage (Haykin, 2009).

Processing Element (PE)

The PE of an ANN is an artificial neuron. Each neuron receives inputs, processes them, and delivers a single output as shown in Figure 6.5. The input can be raw input data or the output of other processing elements. The output can be the final result (e.g., 1 means yes, 0 means no), or it can be input to other neurons.

Network Structure

Each ANN is composed of a collection of neurons that are grouped into layers. A typical structure is shown in Figure 6.8. Note the three layers: input, intermediate (called the hidden layer), and output. A hidden layer is a layer of neurons that takes input from the previous layer and converts those inputs into outputs for further processing. Several hidden layers can be placed between the input and output layers, although it is common to use only one hidden layer. In that case, the hidden layer simply converts inputs into a nonlinear combination and passes the transformed inputs to the output layer. The most common interpretation of the hidden layer is as a feature-extraction mechanism; that is, the hidden layer converts the original inputs in the problem into a higher-level combination of such inputs.

In ANN, when information is processed, many of the processing elements perform their computations at the same time. This parallel processing resembles the way the human brain works, and it differs from the serial processing of conventional computing.



FIGURE 6.8 Neural Network with One Hidden Layer. PE: processing element (an artificial representation of a biological neuron); Xi: inputs to a PE; Y: output generated by a PE; Σ: summation function; f: activation/transfer function.

Input

Each input corresponds to a single attribute. For example, if the problem is to decide on approval or disapproval of a loan, attributes could include the applicant’s income level, age, and home ownership status. The numeric value, or the numeric representation of a non-numeric value, of an attribute is the input to the network. Several types of data, such as text, pictures, and voice, can be used as inputs. Preprocessing may be needed to convert symbolic/non-numeric data into meaningful numeric inputs or to scale numeric data.

Outputs

The output of a network contains the solution to a problem. For example, in the case of a loan application, the output can be “yes” or “no.” The ANN assigns numeric values to the output, which may then need to be converted into a categorical output using a threshold value so that the results would be 1 for “yes” and 0 for “no.”

Connection Weights

Connection weights are the key elements of an ANN. They express the relative strength (or mathematical value) of the input data or the many connections that transfer data from layer to layer. In other words, weights express the relative importance of each input to a processing element and, ultimately, to the output. Weights are crucial in that they store learned patterns of information. It is through repeated adjustments of weights that a network learns.

Summation Function

The summation function computes the weighted sums of all input elements entering each processing element. A summation function multiplies each input value by its weight and totals the values for a weighted sum. The formula for n inputs (represented with X) in one processing element is shown in Figure 6.9a, and for several processing elements, the summation function formulas are shown in Figure 6.9b.

Transfer Function

The summation function computes the internal stimulation, or activation level, of the neuron. Based on this level, the neuron may or may not produce an output.


FIGURE 6.9 Summation Function for (a) a Single Neuron/PE and (b) Several Neurons/PEs. For a single neuron: Y = X1W1 + X2W2. For several neurons: Y1 = X1W11 + X2W21, Y2 = X1W12 + X2W22, and Y3 = X2W23.

FIGURE 6.10 Example of ANN Transfer Function. With inputs X1 = 3, X2 = 1, X3 = 2 and weights W1 = 0.2, W2 = 0.4, W3 = 0.1, the summation function gives Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2, and the (sigmoid) transfer function gives YT = 1/(1 + e^-1.2) = 0.77.

The relationship between the internal activation level and the output can be linear or nonlinear. The relationship is expressed by one of several types of transformation (transfer) functions (see Table 6.1 for a list of commonly used activation functions). Selection of the specific activation function affects the network’s operation. Figure 6.10 shows the calculation for a simple sigmoid-type activation function example.

The transformation modifies the output levels to fit within a reasonable range of values (typically between 0 and 1). This transformation is performed before the output reaches the next level. Without such a transformation, the value of the output can become very large, especially when there are several layers of neurons. Sometimes a threshold value is used instead of a transformation function. A threshold value is a hurdle value for the output of a neuron to trigger the next level of neurons. If an output value is smaller than the threshold value, it will not be passed to the next level of neurons. For example, any value of 0.5 or less becomes 0, and any value above 0.5 becomes 1. A transformation can occur at the output of each processing element, or it can be performed only at the final output nodes.
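The Figure 6.10 example can be reproduced with a few lines of Python; the threshold step at the end is optional and is only included to illustrate the hurdle-value idea described above:

```python
# Weighted summation followed by a sigmoid transfer function (Figure 6.10 numbers).
import math

X = [3, 1, 2]            # inputs X1, X2, X3
W = [0.2, 0.4, 0.1]      # weights W1, W2, W3

Y = sum(x * w for x, w in zip(X, W))      # summation: 3(0.2) + 1(0.4) + 2(0.1) = 1.2
YT = 1.0 / (1.0 + math.exp(-Y))           # transfer: 1/(1 + e^-1.2) ≈ 0.77
output = 1 if YT > 0.5 else 0             # optional threshold instead of passing YT on
print(round(Y, 2), round(YT, 2), output)  # 1.2 0.77 1
```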


Application Case 6.3 provides an interesting use case where advanced analytics and deep learning are being used to prevent the extinction of rare animals.

Application Case 6.3 Artificial Intelligence Helps Protect Animals from Extinction

“There are some people who want to kill animals like the lions and cheetahs. We would like to teach them, there are not many left,” say WildTrack officials. The more we can study their behavior, the more we can help to protect them—and sustain the earth’s biodiversity that supports us all. Their tracks tell a collective story that holds incredible value in conservation. Where are they going? How many are left? There is much to be learned by monitoring footprints of endangered species like the cheetah.

WildTrack, a nonprofit organization, was founded in 2004 by Zoe Jewell and Sky Alibhai, a veterinarian and a wildlife biologist, respectively, who had been working for many years in Africa monitoring black and white rhinos. While in Zimbabwe, in the early 1990s, they collected and presented data to show that invasive monitoring techniques used for black rhinos were negatively impacting female fertility and began to develop a footprint identification technique. Interest from researchers around the world who needed a cost-effective and noninvasive approach to wildlife monitoring sparked WildTrack.

Artificial intelligence may help people recreate some of the skills used by indigenous trackers. WildTrack researchers are exploring the value AI can bring to conservation. They think that AI solutions are designed to enhance human efforts—not replace them. With deep learning, given enough data, a computer can be trained to perform humanlike tasks such as identifying footprint images and recognizing patterns in a similar way to indigenous trackers—but with the added ability to apply these concepts at a much larger scale and more rapid pace. Analytics really underpins the whole thing, potentially giving insights into species populations that WildTrack never had before.

The WildTrack footprint identification technique is a tool for noninvasive monitoring of endangered species through digital images of footprints. Measurements from these images are analyzed by customized mathematical models that help to identify the species, individual, sex, and age class. AI could add the ability to adapt through progressive learning algorithms and tell an even more complete story.

Obtaining crowdsourced data is the next important step toward redefining what conservation looks like in the future. Ordinary people would not necessarily be able to dart a rhino, but they can take an image of a footprint. WildTrack has data coming in from everywhere—too much to manage traditionally. That’s really where AI comes in. It can automate repetitive learning through data, performing frequent, high-volume, computerized tasks reliably and without fatigue.

“Our challenge is how to harness artificial intelligence to create an environment where there’s room for us, and all species in this world,” says Alibhai.

Questions for Case 6.3

1. What is WildTrack and what does it do?

2. How can advanced analytics help WildTrack?

3. What are the roles that deep learning plays in this application case?

Source: SAS Customer Story. “Can Artificial Intelligence Help Protect These Animals from Extinction? The Answer May Lie in Their Footprints.” https://www.sas.com/en_us/explore/analytics-in-action/impact/WildTrack.html (accessed August 2018); WildTrack.org.


u SECTION 6.3 REVIEW QUESTIONS

1. How does a single artificial neuron (i.e., PE) work?
2. List and briefly describe the most commonly used ANN activation functions.
3. What is MLP, and how does it work?
4. Explain the function of weights in ANN.
5. Describe the summation and activation functions in MLP-type ANN architecture.


6.4 PROCESS OF DEVELOPING NEURAL NETWORK–BASED SYSTEMS

Although the development process of ANN is similar to the structured design methodologies of traditional computer-based information systems, some phases are unique or have some unique aspects. In the process described here, we assume that the preliminary steps of system development, such as determining information requirements, conducting a feasibility analysis, and gaining a champion in top management for the project, have been completed successfully. Such steps are generic to any information system.

As shown in Figure 6.11, the development process for an ANN application includes nine steps.

FIGURE 6.11 Development Process of an ANN Model. Step 1: Collect, organize, and format the data. Step 2: Separate data into training, validation, and testing sets. Step 3: Decide on a network architecture and structure. Step 4: Select a learning algorithm. Step 5: Set network parameters and initialize their values. Step 6: Initialize weights and start training (and validation). Step 7: Stop training and freeze the network weights. Step 8: Test the trained network. Step 9: Deploy the network for use on unknown new cases. Feedback loops allow going back to get more data or reformat the data, re-separate the data into subsets, change the network architecture, change the learning algorithm, or change the network parameters and reset and restart the training.


In step 1, the data to be used for training and testing the network are collected. Important considerations are that the particular problem is amenable to a neural network solution and that adequate data exist and can be obtained. In step 2, training data must be identified, and a plan must be made for testing the performance of the network.

In steps 3 and 4, a network architecture and a learning method are selected. The availability of a particular development tool or the capabilities of the development personnel may determine the type of neural network to be constructed. Also, certain problem types have demonstrated high success rates with certain configurations (e.g., multilayer feedforward neural networks for bankruptcy prediction [Altman (1968), Wilson and Sharda (1994), and Olson, Delen, and Meng (2012)]). Important considerations are the exact number of neurons and the number of layers. Some packages use genetic algo- rithms to select the network design.

There are several parameters for tuning the network to the desired learning performance level. Part of the process in step 5 is the initialization of the network weights and parameters, followed by the modification of the parameters as training performance feedback is received. Often, the initial values are important in determining the efficiency and length of training. Some methods change the parameters during training to enhance performance.

Step 6 transforms the application data into the type and format required by the neural network. This may require writing software to preprocess the data or performing these operations directly in an ANN package. Data storage and manipulation techniques and processes must be designed for conveniently and efficiently retraining the neural network when needed. The application data representation and ordering often influence the efficiency and possibly the accuracy of the results.

In steps 7 and 8, training and testing are conducted iteratively by presenting input and desired or known output data to the network. The network computes the outputs and adjusts the weights until the computed outputs are within an acceptable tolerance of the known outputs for the input cases. The desired outputs and their relationships to input data are derived from historical data (i.e., a portion of the data collected in step 1).

In step 9, a stable set of weights is obtained. Then the network can reproduce the desired outputs given inputs such as those in the training set. The network is ready for use as a stand-alone system or as part of another software system where new input data will be presented to it and its output will be a recommended decision.

Learning Process in ANN

In supervised learning, the learning process is inductive; that is, connection weights are derived from existing cases. The usual process of learning involves three tasks (see Figure 6.12):

1. Compute temporary outputs.
2. Compare outputs with desired targets.
3. Adjust the weights and repeat the process.

Like any other supervised machine-learning technique, neural network training is usually done by defining a performance function (F) (a.k.a. cost function or loss function) and optimizing (minimizing) that function by changing model parameters. Usually, the performance function is nothing but a measure of error (i.e., the difference between the actual output and the target) across all inputs of a network. There are several types of error measures (e.g., sum of squared errors, mean squared error, cross-entropy, or even custom measures), all of which are designed to capture the difference between the network outputs and the desired (target) outputs.
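As a small illustration (with made-up output and target values), the following sketch computes three of these common error measures over a batch of network outputs:

```python
# Common performance (loss) functions over a batch of network outputs; values are illustrative.
import numpy as np

y_actual = np.array([0.9, 0.2, 0.7])   # network (actual) outputs
y_target = np.array([1.0, 0.0, 1.0])   # desired outputs (targets)

sse = np.sum((y_target - y_actual) ** 2)    # sum of squared errors
mse = np.mean((y_target - y_actual) ** 2)   # mean squared error
# binary cross-entropy (outputs interpreted as probabilities)
ce = -np.mean(y_target * np.log(y_actual) + (1 - y_target) * np.log(1 - y_actual))
print(sse, mse, ce)
```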

The training process begins by calculating outputs for a given set of inputs using some random weights and biases.


Once the network outputs are on hand, the performance function can be computed. The difference between the actual output (Y or YT) and the desired output (Z) for a given set of inputs is an error called delta (in calculus, the Greek symbol delta, ∆, means “difference”).

The objective is to minimize delta (i.e., reduce it to 0 if possible), which is done by adjusting the network’s weights. The key is to change the weights in the proper direction, making changes that reduce delta (i.e., error). Different ANNs compute delta in different ways, depending on the learning algorithm being used. Hundreds of learning algorithms are available for various situations and configurations of ANN.

Backpropagation for ANN Training

The optimization of performance (i.e., minimization of the error or delta) in the neural network is usually done by an algorithm called stochastic gradient descent (SGD), an iterative gradient-based optimizer used for finding the minimum (i.e., the lowest point) of performance functions, as in the case of neural networks. The idea behind the SGD algorithm is that the derivative of the performance function with respect to each current weight or bias indicates the amount of change in the error measure per unit of change in that weight or bias element. These derivatives are referred to as network gradients. Calculation of network gradients in neural networks requires application of an algorithm called backpropagation, the most popular neural network learning algorithm, which applies the chain rule of calculus to compute the derivatives of functions formed by composing other functions whose derivatives are known [more on the mathematical details of this algorithm can be found in Rumelhart, Hinton, and Williams (1986)].

FIGURE 6.12 Supervised Learning Process of an ANN: the ANN model computes the output; if the desired output is achieved, the learning stops and the weights are frozen; otherwise, the weights are adjusted and the process repeats.


Backpropagation (short for back-error propagation) is the most widely used supervised learning algorithm in neural computing (Principe, Euliano, and Lefebvre, 2000). By using the SGD mentioned previously, the implementation of backpropagation algorithms is relatively straightforward. A neural network with backpropagation learning includes one or more hidden layers. This type of network is considered feedforward because there are no interconnections between the output of a processing element and the input of a node in the same layer or in a preceding layer. Externally provided correct patterns are compared with the neural network’s output during (supervised) training, and feedback is used to adjust the weights until the network has categorized all training patterns as correctly as possible (the error tolerance is set in advance).

Starting with the output layer, errors between the network-generated actual outputs and the desired outputs are used to correct/adjust the weights for the connections between the neurons (see Figure 6.13). For any output neuron j, the error (delta) = (Zj - Yj)(df/dx), where Z and Y are the desired and actual outputs, respectively. Using the sigmoid function, f = [1 + exp(-x)]^-1, where x is proportional to the sum of the weighted inputs to the neuron, is an effective way to compute the output of a neuron in practice. With this function, the derivative of the sigmoid is df/dx = f(1 - f), and the error becomes a simple function of the desired and actual outputs. The factor f(1 - f), the derivative of the logistic (sigmoid) function, serves to keep the error correction well bounded. The weight of each input to the jth neuron is then changed in proportion to this calculated error. A more complicated expression can be derived to work backward in a similar way from the output neurons through the hidden layers to calculate the corrections to the associated weights of the inner neurons. This complicated method is an iterative approach to solving a nonlinear optimization problem that is very similar in meaning to the one characterizing multiple linear regression.

In backpropagation, the learning algorithm includes the following procedures:

1. Initialize weights with random values and set other parameters.
2. Read in the input vector and the desired output.
3. Compute the actual output via the calculations, working forward through the layers.
4. Compute the error.
5. Change the weights by working backward from the output layer through the hidden layers.
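The following is a minimal sketch of this procedure for a single sigmoid neuron trained on a toy data set (the logical AND function); the learning rate, the data, and the number of passes are illustrative choices, not prescriptions from the text:

```python
# Delta-rule training of a single sigmoid neuron (no hidden layer), following
# the backpropagation steps listed above. Data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # inputs
Z = np.array([0.0, 0.0, 0.0, 1.0])                              # desired outputs (logical AND)

w = rng.normal(size=2)       # step 1: initialize weights with random values
b = 0.0
alpha = 0.5                  # learning rate

for epoch in range(2000):
    for x, z in zip(X, Z):                       # step 2: read input and desired output
        y = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # step 3: compute actual output (forward pass)
        delta = (z - y) * y * (1.0 - y)          # step 4: error scaled by the sigmoid derivative f(1 - f)
        w += alpha * delta * x                   # step 5: adjust weights (and bias) working backward
        b += alpha * delta

print(np.round([1 / (1 + np.exp(-(w @ x + b))) for x in X], 2))  # approaches [0, 0, 0, 1]
```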

FIGURE 6.13 Backpropagation of Error for a Single Neuron. The inputs X1, ..., Xn with weights W1, ..., Wn are combined by the summation S = Σ XiWi (for i = 1 to n), the transfer function produces Y = f(S), and the error term (Zi - Yi), scaled by a learning-rate factor a, is propagated back to adjust the weights.


This procedure is repeated for the entire set of input vectors until the desired output and the actual output agree within some predetermined tolerance. Given the calculation requirements for one iteration, training a large network can take a very long time; therefore, in one variation, a set of cases is run forward and an aggregated error is fed backward to speed the learning. Sometimes, depending on the initial random weights and network parameters, the network does not converge to a satisfactory performance level. When this is the case, new random weights must be generated, and the network parameters, or even its structure, may have to be modified before another attempt is made. Current research is aimed at developing algorithms and using parallel computers to improve this process. For example, genetic algorithms (GA) can be used to guide the selection of the network parameters to maximize the performance of the desired output. In fact, most commercial ANN software tools now use GA to help users “optimize” the network parameters in a semiautomated manner.

A central concern in the training of any type of machine-learning model is overfitting. It happens when the trained model is highly fitted to the training data set but performs poorly on external data sets. Overfitting causes serious issues with respect to the generalizability of the model. A large group of strategies known as regularization strategies is designed to prevent models from overfitting by making changes or defining constraints for the model parameters or the performance function.

In the classic ANN models of small size, a common regularization strategy to avoid overfitting is to assess the performance function for a separate validation data set as well as the training data set after each iteration. Whenever the performance stops improving for the validation data, the training process is stopped.

FIGURE 6.14 Illustration of Overfitting in ANN—Gradually Changing Error Rates in the Training and Validation Data Sets as the Number of Iterations Increases (error versus training iterations; the best model corresponds to the point where the validation-set error stops decreasing while the training-set error keeps decreasing).


Figure 6.14 shows a typical graph of the error measure by the number of iterations of training. As shown, in the beginning, the error decreases in both training and validation data as more and more iterations are run; but from a specific point (shown by the dashed line), the error starts increasing in the validation set while still decreasing in the training set. This means that beyond that number of iterations, the model becomes overfitted to the data set with which it is trained and cannot necessarily perform well when it is fed some external data. That point actually represents the recommended number of iterations for training a given neural network.
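A minimal sketch of this early-stopping idea is shown below; the training and validation error curves are simulated for illustration only, standing in for the curves produced by an actual training run:

```python
# Early stopping: stop once the validation error stops improving and keep the
# weights from the best iteration. Error curves are simulated for illustration.
import numpy as np

iterations = np.arange(1, 201)
train_err = 1.0 / iterations                                        # keeps decreasing on the training set
valid_err = 1.0 / iterations + 0.00005 * (iterations - 80) ** 2     # starts rising after roughly iteration 80

best_iter, best_err, patience, stall = 0, float("inf"), 10, 0
for i, err in enumerate(valid_err):
    if err < best_err:
        best_err, best_iter, stall = err, i, 0        # remember the best model so far
    else:
        stall += 1
        if stall >= patience:                         # no improvement for `patience` iterations
            break
print("Stop training at iteration", i + 1, "- keep weights from iteration", best_iter + 1)
```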

Technology Insight 6.2 discusses some of the popular neural network software and offers some Web links to more comprehensive ANN-related software sites.

TECHNOLOGY INSIGHT 6.2 ANN Software

Many tools are available for developing neural networks (see this book’s Web site and the resource lists at PC AI, pcai.com). Some of these tools function like software shells. They provide a set of standard architectures, learning algorithms, and parameters, along with the ability to manipulate the data. Some development tools can support several network paradigms and learning algorithms.

Neural network implementations are also available in most of the comprehensive pre- dictive analytics and data mining tools, such as the SAS Enterprise Miner, IBM SPSS Modeler (formerly Clementine), and Statistica Data Miner. Weka, RapidMiner, Orange, and KNIME are open-source free data mining software tools that include neural network capabilities. These free tools can be downloaded from their respective Web sites; simple Internet searches on the names of these tools should lead you to the download pages. Also, most of the commercial software tools are available for download and use for evaluation purposes (usually they are limited on time of availability and/or functionality).

Many specialized neural network tools make the building and deployment of a neural network model an easier undertaking in practice. Any listing of such tools would be incomplete. Online resources such as Wikipedia (en.wikipedia.org/wiki/Artificial_neural_network), Google’s or Yahoo!’s software directory, and the vendor listings on pcai.com are good places to locate the latest information on neural network software vendors. Some of the vendors that have been around for a while and have reported industrial applications of their neural network software include California Scientific (BrainMaker), NeuralWare, NeuroDimension Inc., Ward Systems Group (Neuroshell), and Megaputer. Again, the list can never be complete.

Some ANN development tools are spreadsheet add-ins. Most can read spreadsheet, database, and text files. Some are freeware or shareware. Some ANN systems have been developed in Java to run directly on the Web and are accessible through a Web browser interface. Other ANN products are designed to interface with expert systems as hybrid de- velopment products.

Developers may instead prefer to use more general programming languages, such as C, C#, C++, Java, and so on, readily available R and Python libraries, or spreadsheets to program the model, perform the calculations, and deploy the results. A common practice in this area is to use a library of ANN routines. Many ANN software providers and open-source platforms provide such programmable libraries. For example, hav.Software (hav.com) provides a library of C++ classes for implementing stand-alone or embedded feedforward, simple recurrent, and random- order recurrent neural networks. Computational software such as MATLAB also includes neural network–specific libraries.

340 Part II • Predictive Analytics/Machine Learning

u SECTION 6.4 REVIEW QUESTIONS

1. List the nine steps in conducting a neural network project.
2. What are some of the design parameters for developing a neural network?
3. Draw and briefly explain the three-step process of learning in ANN.
4. How does backpropagation learning work?
5. What is overfitting in ANN learning? How does it happen, and how can it be mitigated?
6. Describe the different types of neural network software available today.

6.5 ILLUMINATING THE BLACK BOX OF ANN

Neural networks have been used as an effective tool for solving highly complex real-world problems in a wide range of application areas. Even though ANN have been proven to be superior predictors and/or cluster identifiers in many problem scenarios (compared to their traditional counterparts), in some applications there exists an additional need to know “how the model does what it does.” ANN are typically known as black boxes, capable of solving complex problems but lacking an explanation of their capabilities. This lack of transparency is commonly referred to as the “black-box” syndrome.

It is important to be able to explain a model’s “inner being”; such an explanation offers assurance that the network has been properly trained and will behave as desired once deployed in a business analytics environment. Such a need to “look under the hood” might be attributable to a relatively small training set (as a result of the high cost of data acquisition) or a very high liability in case of a system error. One example of such an application is the deployment of airbags in vehicles. Here, both the cost of data acquisition (crashing vehicles) and the liability concerns (danger to human lives) are rather significant. Another representative example of the importance of explanation is loan-application processing. If an applicant is refused a loan, he or she has the right to know why. Having a prediction system that does a good job of differentiating good and bad applications may not be sufficient if it does not also provide the justification of its predictions.

A variety of techniques have been proposed for the analysis and evaluation of trained neural networks. These techniques provide a clear interpretation of how a neural network does what it does; that is, specifically how (and to what extent) the individual inputs factor into the generation of specific network output. Sensitivity analysis has been the front-runner of the techniques proposed for shedding light on the black-box characterization of trained neural networks.

Sensitivity analysis is a method for extracting the cause-and-effect relationships among the inputs and the outputs of a trained neural network model. In the process of performing sensitivity analysis, the trained neural network’s learning capability is disabled so that the network weights are not affected. The basic procedure behind sensitivity analysis is that the inputs to the network are systematically perturbed within the allowable value ranges, and the corresponding change in the output is recorded for each and every input variable (Principe et al., 2000). Figure 6.15 shows a graphical illustration of this process. The first input is varied between its mean plus and minus a user-defined number of standard deviations (or, for categorical variables, all of its possible values are used) while all other input variables are fixed at their respective means (or modes). The network output is computed for a user-defined number of steps above and below the mean. This process is repeated for each input. As a result, a report is generated to summarize the variation of each output with respect to the variation in each input. The generated report often contains a column plot (along with numeric values presented on the x-axis), reporting the relative sensitivity values for each input variable. A representative example of sensitivity analysis on ANN models is provided in Application Case 6.4.
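A rough sketch of this procedure is shown below; the “trained network” is a stand-in function rather than an actual ANN, and the data are randomly generated, but the perturbation logic mirrors the description above:

```python
# Sensitivity analysis: perturb one input at a time around its mean while holding
# the other inputs at their means, and record the range of output variation.
import numpy as np

def trained_network(x):                      # stand-in for a trained (frozen) model
    return 2.0 * x[0] - 0.5 * x[1] + 0.1 * x[2] ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * [1.0, 2.0, 0.5] + [0.0, 5.0, 1.0]   # historical inputs
means, stds = X.mean(axis=0), X.std(axis=0)

n_std, steps = 1.0, 11                       # user-defined perturbation range and resolution
sensitivity = []
for j in range(X.shape[1]):
    grid = np.linspace(means[j] - n_std * stds[j], means[j] + n_std * stds[j], steps)
    outputs = []
    for value in grid:
        x = means.copy()                      # all other inputs fixed at their means
        x[j] = value                          # perturb only input j
        outputs.append(trained_network(x))
    sensitivity.append(max(outputs) - min(outputs))   # observed change in output

print(dict(zip(["x1", "x2", "x3"], np.round(sensitivity, 3))))
```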


FIGURE 6.15 A Figurative Illustration of Sensitivity Analysis on an ANN Model: systematically perturbed inputs are fed into the trained ANN (the “black box”), and the observed changes in the outputs are recorded.

Application Case 6.4 Sensitivity Analysis Reveals Injury Severity Factors in Traffic Accidents

According to the National Highway Traffic Safety Administration (NHTSA), over 6 million traffic accidents claim more than 41,000 lives each year in the United States. Causes of accidents and related injury severity are of special interest to traffic safety researchers. Such research is aimed at reducing not only the number of accidents but also the severity of injury. One way to accomplish the latter is to identify the most profound factors that affect injury severity. Understanding the circumstances under which drivers and passengers are more likely to be severely injured (or killed) in a vehicle accident can help improve the overall driving safety situation. Factors that potentially elevate the risk of injury severity of vehicle occupants in the event of an accident include demographic and/or behavioral characteristics of the person (e.g., age, gender, seatbelt usage, use of drugs or alcohol while driving), environmental factors and/or roadway conditions at the time of the accident (e.g., surface conditions, weather or light conditions, direction of the impact, vehicle orientation in the crash, occurrence of a rollover), as well as technical characteristics of the vehicle itself (e.g., age, body type).

In an exploratory data mining study, Delen et al. (2006) used a large sample of data—30,358 police-reported accident records obtained from the General Estimates System of NHTSA—to identify which factors become increasingly more important in escalating the probability of injury severity during a traffic crash. Accidents examined in this study included a geographically representative sample of multiple-vehicle collision accidents, single-vehicle fixed-object collisions, and single-vehicle noncollision (rollover) crashes.

Contrary to many of the previous studies con- ducted in this domain, which have primarily used regression-type generalized linear models where the functional relationships between injury severity and crash-related factors are assumed to be linear (which is an oversimplification of the reality in most real-world situations), Delen and his colleagues (2006) decided to go in a different direction. Because ANN are known to be superior in capturing highly nonlinear complex relationships between the predictor variables (crash factors) and the target variable (severity level of the injuries), they decided to use a series of ANN models to estimate the significance of the crash factors on the level of injury severity sustained by the driver.


From a methodological standpoint, Delen et al. (2006) followed a two-step process. In the first step, they developed a series of prediction models (one for each injury severity level) to capture the in-depth relationships between the crash-related factors and a specific level of injury severity. In the second step, they conducted sensitivity analysis on the trained neural network models to identify the prioritized importance of crash-related factors as they relate to different injury severity levels. In the formulation of the study, the five-class prediction problem was decomposed into a number of binary classification models to obtain the granularity of information needed to identify the “true” cause-and-effect relationships between the crash-related factors and different levels of injury severity. As shown in Figure 6.16, eight different neural network models were developed and used in the sensitivity analysis to identify the key determinants of increased injury severity levels.

The results revealed considerable differences among the models built for different injury severity levels. This implies that the most influential factors in prediction models highly depend on the level of injury severity. For example, the study revealed that the variable seatbelt use was the most important determinant for predicting higher levels of injury severity (such as incapacitating injury or fatality), but it was one of the least significant predictors for lower levels of injury severity (such as non-incapacitating injury and minor injury). Another interesting finding involved gender: the driver’s gender was among the significant predictors for lower levels of injury severity, but it was not among the significant factors for higher levels of injury severity, indicating that more serious injuries do not depend on the driver being male or female. Another interesting and somewhat intuitive finding of the study indicated that age becomes an increasingly more significant factor as the level of injury severity increases, implying that older people are more likely to incur severe injuries (and fatalities) in serious vehicle crashes than younger people.

Questions for Case 6.4

1. How does sensitivity analysis shed light on the black box (i.e., neural networks)?

2. Why would someone choose to use a black-box tool such as neural networks over theoretically sound, mostly transparent statistical tools like logistic regression?

3. In this case, how did neural networks and sensi- tivity analysis help identify injury-severity factors in traffic accidents?

Sources: Delen, D., R. Sharda, & M. Bessonov. (2006). “Identifying Significant Predictors of Injury Severity in Traffic Accidents Using a Series of Artificial Neural Networks.” Accident Analysis and Prevention, 38(3), pp. 434–444; Delen, D., L. Tomak, K. Topuz, & E. Eryarsoy (2017). “Investigating Injury Severity Risk Factors in Automobile Crashes with Predictive Analytics and Sensitivity Analysis Methods.” Journal of Transport & Health, 4, pp. 118–131.

FIGURE 6.16 Graphical Representation of the Eight Binary ANN Model Configurations. Model labels 1.1–1.4 and 2.1–2.4 contrast binary category labels 0 and 1 across the injury severity levels: No Injury (35.4%), Probable Injury (23.6%), Non-Incapacitating (19.6%), Incapacitating (17.8%), and Fatal Injury (3.6%).



u SECTION 6.5 REVIEW QUESTIONS

1. What is the so-called black-box syndrome?
2. Why is it important to be able to explain an ANN’s model structure?
3. How does sensitivity analysis work in ANN?
4. Search the Internet to find other methods to explain ANN methods. Report the results.

6.6 DEEP NEURAL NETWORKS

Until recently (before the advent of the deep learning phenomenon), most neural network applications involved network architectures with only a few hidden layers and a limited number of neurons in each layer. Even in relatively complex business applications of neural networks, the number of neurons in networks hardly exceeded a few thousand. In fact, the processing capability of computers at the time was such a limiting factor that central processing units (CPUs) were hardly able to run networks involving more than a couple of layers in a reasonable time. In recent years, the development of graphics processing units (GPUs), along with the associated programming languages (e.g., CUDA by NVIDIA) that enable people to use them for data analysis purposes, has led to more advanced applications of neural networks. GPU technology has enabled us to successfully run neural networks with over a million neurons. These larger networks are able to go deeper into the data features and extract more sophisticated patterns that could not be detected otherwise.

While deep networks can handle a considerably larger number of input variables, they also need relatively larger data sets to be trained satisfactorily; using small data sets for training deep networks typically leads to overfitting of the model to the training data and poor and unreliable results when the model is applied to external data. Thanks to the Internet- and Internet of Things (IoT)-based data-capturing tools and technologies, larger data sets are now available in many application domains for deeper neural network training.

The input to a regular ANN model is typically an array of size R * 1, where R is the number of input variables. In the deep networks, however, we are able to use tensors (i.e., N-dimensional arrays) as input. For example, in image recognition networks, each input (i.e., image) can be represented by a matrix indicating the color codes used in the image pixels; or for video processing purposes, each video can be represented by several matrices (i.e., a 3D tensor), each representing an image involved in the video. In other words, tensors provide us with the ability to include additional dimensions (e.g., time, location) in analyzing the data sets.
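For illustration, the following snippet contrasts the classic R × 1 input array with 2-D, 3-D, and 4-D tensor inputs (the sizes below are arbitrary):

```python
# Tensor-shaped inputs: a grayscale image as a 2-D array, a color image as a 3-D
# tensor, and a short video clip as a 4-D tensor (frames x height x width x channels).
import numpy as np

feature_vector = np.zeros(10)               # classic ANN input: R x 1 array (R = 10 variables)
gray_image     = np.zeros((28, 28))         # image as a matrix of pixel intensities
color_image    = np.zeros((28, 28, 3))      # 3-D tensor: height x width x RGB channels
video_clip     = np.zeros((30, 28, 28, 3))  # 4-D tensor: 30 frames of color images

for x in (feature_vector, gray_image, color_image, video_clip):
    print(x.ndim, "dimensions, shape", x.shape)
```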

Except for these general differences, the different types of deep networks involve various modifications to the architecture of standard neural networks that equip them with distinct capabilities of dealing with particular data types for advanced purposes. In the fol- lowing section, we discuss some of these special network types and their characteristics.

Feedforward Multilayer Perceptron (MLP)-Type Deep Networks

MLP deep networks, also known as deep feedforward networks, are the most general type of deep networks. These networks are simply large-scale neural networks that can contain many layers of neurons and handle tensors as their input. The types and characteristics of the network elements (i.e., weight functions, transfer functions) are pretty much the same as in the standard ANN models. These models are called feedforward because the flow of information through them always moves forward and no feedback connections (i.e., connections in which the outputs of a model are fed back to itself) are allowed. Neural networks in which feedback connections are allowed are called recurrent neural networks (RNN). General RNN architectures, as well as a specific variation of RNNs called long short-term memory networks, are discussed in later sections of this chapter.

344 Part II • Predictive Analytics/Machine Learning

Generally, a sequential order of layers has to be held between the input and the output layers in the MLP-type network architecture. This means that the input vector has to pass through all layers sequentially and cannot skip any of them; moreover, it cannot be directly connected to any layer except for the very first one; the output of each layer is the input to the subsequent layer. Figure 6.17 demonstrates a vector representation of the first three layers of a typical MLP network. As shown, there is only one vector going into each layer, which is either the original input vector ( p for the first layer) or the output vector from the previous hidden layer in the network architecture (ai - 1 for the ith layer). There are, however, some special variations of MLP network architectures designed for specialized purposes in which these principles can be violated.
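The layer-by-layer flow can be sketched in a few lines of numpy; the layer sizes, weights, and choice of transfer functions below are illustrative, not prescribed:

```python
# Layered forward pass as in Figure 6.17: a1 = f1(W1 p + b1), a2 = f2(W2 a1 + b2),
# a3 = f3(W3 a2 + b3). Layer sizes and weights are illustrative random values.
import numpy as np

rng = np.random.default_rng(42)
sizes = [4, 5, 3, 1]                            # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.normal(size=(m, 1)) for m in sizes[1:]]

def relu(n):
    return np.maximum(0.0, n)

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

p = rng.normal(size=(sizes[0], 1))              # input vector
a = p
for W, b, f in zip(weights, biases, [relu, relu, sigmoid]):
    a = f(W @ a + b)                            # the output of each layer feeds the next layer
print(a)                                        # network output a3
```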

Impact of Random Weights in Deep MLP

Optimization of the performance (loss) function in many real applications of deep MLPs is a challenging issue. The problem is that applying the common gradient-based training algorithms with random initialization of weights and biases (which is very efficient for finding the optimal set of parameters in shallow neural networks) most of the time can lead to getting stuck in locally optimal solutions rather than catching the globally optimal values for the parameters. As the depth of the network increases, the chances of reaching a global optimum using random initializations with gradient-based algorithms decrease. In such cases, pretraining the network parameters using some unsupervised deep learning methods such as deep belief networks (DBNs) can be helpful (Hinton, Osindero, and Teh, 2006). DBNs are a type of a large class of deep neural networks called generative models. The introduction of DBNs in 2006 is considered the beginning of the current deep learning renaissance (Goodfellow et al., 2016), since prior to that, deep models were considered too difficult to optimize. In fact, the primary application of DBNs today is to improve classification models by pretraining their parameters.

Using these unsupervised learning methods, we can train the MLP layers one at a time, starting from the first layer, using the output of each layer as the input to the subsequent layer and initializing that layer with an unsupervised learning algorithm. At the end, we will have a set of initialized values for the parameters across the whole network. Those pretrained parameters, instead of randomly initialized parameters, can then be used as the initial values in the supervised learning of the MLP. This pretraining procedure has been shown to yield significant improvements in deep classification applications. Figure 6.18 illustrates the classification errors that resulted from training a deep MLP network with (blue circles) and without (black triangles) pretraining of parameters (Bengio, 2009).


whereas the black line indicates the error rates on the same testing data set when 2.5 million examples were initially used for unsupervised training of network parameters (using DBN) and then the other 7.5 million examples along with the initialized parameters were used to train a supervised classification model. The diagrams clearly show a significant improvement in terms of the classification error rate in the model pretrained by a deep belief network.

More Hidden Layers versus More Neurons?

An important question regarding deep MLP models is “Would it make sense (and produce better results) to restructure such networks with only a few layers, but many neurons in each?” In other words, the question is why we need deep MLP networks with many layers when we can include the same number of neurons in just a few layers (i.e., wide networks instead of deep networks). According to the universal approximation theorem (Cybenko, 1989; Hornik, 1991), a sufficiently large single-layer MLP network is able to approximate any continuous function. Although theoretically founded, such a layer with many neurons may be prohibitively large and hence may fail to learn the underlying patterns correctly. A deeper network can reduce the number of neurons required at each layer and hence decrease the generalization error. Although theoretically this is still an open research question, in practice using more layers in a network seems to be more effective and computationally more efficient than using many neurons in a few layers.
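As a rough, purely illustrative comparison of the two design choices, the short Python sketch below counts the weights and biases of a wide single-hidden-layer network against a deeper network built from narrower layers; all layer sizes are made up for illustration.

# Parameter counts (weights + biases) of fully connected networks; layer sizes are made up.
def dense_params(sizes):
    """Total number of weights and biases for fully connected layers of the given sizes."""
    return sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

wide = [100, 10000, 10]            # one very wide hidden layer
deep = [100, 300, 300, 300, 10]    # three moderately sized hidden layers

print("wide network parameters:", dense_params(wide))   # 1,110,010
print("deep network parameters:", dense_params(deep))   # 213,910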

Like typical artificial neural networks, multilayer perceptron networks can also be used for various prediction, classification, and clustering purposes. Especially when a large number of input variables are involved, or in cases where the input naturally takes the form of an N-dimensional array, a deep multilayer network design needs to be employed.

Application Case 6.5 provides an excellent case for the use of advanced analytics to better manage traffic flows in crowded cities.

FIGURE 6.18 The Effect of Pretraining Network Parameters on Improving Results of a Classification-Type Deep Neural Network (classification error, on a logarithmic scale, versus the number of training examples seen, in millions).

Application Case 6.5 Georgia DOT Variable Speed Limit Analytics Help Solve Traffic Congestion

The Background

When the Georgia Department of Transportation (GDOT) wanted to optimize the use of Big Data and advanced analytics to gain insight into transporta- tion, it worked with Teradata to develop a proof of concept evaluation of GDOT’s variable speed limit (VSL) pilot project.

The VSL concept has been adopted in many parts of the world, but it is still relatively new in the United States. As GDOT explains,

VSL are speed limits that change based on road, traffic, and weather conditions. Electronic signs slow down traffic ahead of congestion or bad weather to smooth out flow, diminish stop-and-go conditions, and reduce crashes. This low-cost, cutting-edge technology alerts drivers in real time to speed changes due to conditions down the road. More consistent speeds improve safety by helping to prevent rear-end and lane changing collisions due to sudden stops.

Quantifying the customer service, safety, and efficiency benefits of VSL is extremely important to GDOT. This fits within a wider need to understand the effects of investments in intelligent transporta- tion systems as well as other transportation systems and infrastructures.

VSL Pilot Project on I-285 in Atlanta

GDOT conducted a VSL pilot project on the north- ern half, or “top end,” of I-285 that encircles Atlanta. This 36-mile stretch of highway was equipped with 88 electronic speed limit signs that adjusted speed limits in 10 mph increments from 65 miles per hour (mph) to the minimum of 35 mph. The objectives were twofold:

1. Analyze speeds on the highway before versus after implementation of VSL.

2. Measure the impact of VSL on driving conditions.

To obtain an initial view of the traffic, the Teradata data science solution identified the loca- tions and durations of “persistent slowdowns.” If highway speeds are above “reference speed,” then

traffic is considered freely flowing. Falling below the reference speed at any point on the highway is considered a slowdown. When a slowdown persists across multiple consecutive minutes, it is defined as a persistent slowdown.

By creating an analytic definition of slowdowns, it is possible to convert voluminous and highly variable speed data into patterns that support closer investigation. The early analyses of the data revealed that the clockwise and counterclockwise directions of the same highway may show significantly different frequencies and durations of slowdowns. To better understand how slowdowns affect highway traffic, it is useful to take our new definition and zoom in on a specific situation. Figure 6.19 shows a specific but typical Atlanta afternoon on I-285, at a section of highway where traffic is moving clockwise, from west to east, between mile marker MM10 in the west and the east end at MM46.

FIGURE 6.19 Traffic Moving Clockwise during the Afternoon (duration, in minutes, of suspected bottleneck slowdowns by pseudo mile marker and time of day; December 11, 2014).

The first significant slowdown occurred at 3:00 p.m. near MM32. The size of the circles represents duration (measured in minutes). The slowdown at MM32 was nearly four hours long. As the slowdown “persisted,” traffic speed diminished behind it. The slowdown that formed at MM32 became a bottleneck that caused traffic behind it to slow down as well. The “comet trail” of backed-up traffic at the top left of Figure 6.20 illustrates the sequential formation of slowdowns at MM32 and then farther west, each starting later in the afternoon and not lasting as long.

Measuring Highway Speed Variability

The patterns of slowdowns on the highway as well as their different timings and locations led us to ques- tion their impact on drivers. If VSL could help driv- ers better anticipate the stop-and-go nature of the slowdowns, then being able to quantify the impact would be of interest to GDOT. GDOT was particu- larly concerned about what happens when a driver first encounters a slowdown. “While we do not know what causes the slowdown, we do know that driv- ers have made speed adjustments. If the slowdown was caused by an accident, then the speed reduction could be quite sudden; alternatively, if the slowdown was just caused by growing volumes of traffic, then the speed reduction might be much more gradual.”


Identifying Bottlenecks and Traffic Turbulence

A bottleneck starts as a slowdown at a particular location. Something like a “pinch point” occurs on the highway. Then, over a period of time, traffic slows down behind the original pinch point. A bottleneck is a length of highway where traffic falls below 60 percent of reference speed and can stay at that level for miles. Figure 6.20 shows a conceptual representation of a bottleneck.

FIGURE 6.20 Graphical Depiction of a Bottleneck on a Highway (traffic speed drops below 60 percent of reference speed within the bottleneck's zone of influence, which is where the turbulence reduction opportunity lies).

While bottlenecks are initiated by a pinch point, or slowdown, that forms the head of the queue, it is the end of the queue that is the most interesting. The area at the back of a queue is where traffic encounters a transition from free flow to slowly


moving congested conditions. In the worst condi- tions, the end of the queue can experience a rapid transition. Drivers moving at highway speed may unexpectedly encounter slower traffic. This condi- tion is ripe for accidents and is the place where VSL can deliver real value.

Powerful New Insight on Highway Congestion

The availability of new Big Data sources that describe the “ground truth” of traffic conditions on highways provides rich new opportunities for developing and analyzing highway performance metrics. Using just a single data source on detailed highway speeds, we produced two new and distinctive metrics using Teradata advanced data science capabilities.

First, by defining and measuring persistent slowdowns, we helped traffic engineers understand the frequency and duration of slow speed locations on a highway. The distinction of measuring a per- sistent slowdown versus a fleeting one is uniquely challenging and requires data science. It provides the ability to compare the number, duration, and location of slowdowns in a way that is more infor- mative and compelling than simple averages, vari- ances, and outliers in highway speeds.

The second metric was the ability to measure turbulence caused by bottlenecks. By identifying where bottlenecks occur and then narrowing in on their very critical zones of influence, we can make measurements of speeds and traffic deceleration tur- bulence within those zones. Data science and ana- lytics capabilities demonstrated reduced turbulence when VSL is active in the critical zone of a bottleneck.

There is much more that could be explored within this context. For example, it is natural to assume that because most traffic is on the road dur- ing rush hours, VSL provides the most benefits dur- ing these high-traffic periods. However, the opposite may be true, which could provide a very important benefit of the VSL program.

Although this project was small in size and was just a proof of concept, a combination of similar projects beyond just transportation, under the name of “smart cities,” is underway around the United States and abroad. The goal is to use a variety of data, from sensors to multimedia and from rare event reports to satellite images, along with advanced analytics that include deep learning and cognitive computing, to transform the dynamic nature of cities for the better for all stakeholders.

Questions for Case 6.5

1. What was the nature of the problems that GDOT was trying to solve with data science?

2. What type of data do you think was used for the analytics?

3. What were the data science metrics developed in this pilot project? Can you think of other metrics that can be used in this context?

Source: Teradata Case Study. “Georgia DOT Variable Speed Limit Analytics Help Solve Traffic Congestion.” https://www.teradata.com/Resources/Case-Studies/Georgia-DOT-Variable-Speed-Limit-Analytics (accessed July 2018); “Georgia DOT Variable Speed Limits.” www.dot.ga.gov/DriveSmart/SafetyOperation/Pages/VSL.aspx (accessed August 2018). Used with permission from Teradata.

In the next section, we discuss a very popular variation of deep MLP architecture called convolutional neural network (CNN) specifically designed for computer vision applications (e.g., image recognition, handwritten text processing).

■ SECTION 6.6 REVIEW QUESTIONS

1. What is meant by “deep” in deep neural networks? Compare deep neural networks to shallow neural networks.

2. What is GPU? How does it relate to deep neural networks?
3. How does a feedforward multilayer perceptron-type deep network work?



4. Comment on the impact of random weights in developing deep MLP.
5. Which strategy is better: more hidden layers versus more neurons?

6.7 CONVOLUTIONAL NEURAL NETWORKS

CNNs (LeCun et al., 1989) are among the most popular types of deep learning methods. CNNs are in essence variations of the deep MLP architecture, initially designed for com- puter vision applications (e.g., image processing, video processing, text recognition) but are also applicable to nonimage data sets.

The main characteristic of the convolutional networks is having at least one layer in- volving a convolution weight function instead of general matrix multiplication. Figure 6.21 illustrates a typical convolutional unit.

Convolution, typically denoted by its own operator symbol, is a linear operation that essentially aims at extracting simple patterns from sophisticated data patterns. For instance, in processing an image containing several objects and colors, convolution functions can extract simple patterns such as the existence of horizontal or vertical lines or edges in different parts of the picture. We discuss convolution functions in more detail in the next section.

A layer containing a convolution function in a CNN is called a convolution layer. This layer is often followed by a pooling (a.k.a. subsampling) layer. Pooling layers are in charge of consolidating the large tensors to one with a smaller size and reducing the number of model parameters while keeping their important features. Different types of pooling layers are also discussed in the following sections.

Convolution Function

In the description of MLP networks, it was said that the weight function is generally a matrix manipulation function that multiplies the weight matrix by the input vector to produce the output vector in each layer. With a very large input vector/tensor, which is the case in most deep learning applications, we need a large number of weight parameters so that each single input to each neuron can be assigned its own weight parameter. For instance, in an image-processing task using a neural network for images of size 150 × 150 pixels, each input matrix will contain 22,500 (i.e., 150 times 150) integers, each of which should be assigned its own weight parameter per each neuron it goes into throughout the network. Therefore, even a single layer requires thousands of weight parameters to be defined and trained. As one might guess, this fact would dramatically increase the required time and processing power to train a network, since in each training iteration, all of those weight parameters have to be updated by the SGD algorithm. The solution to this problem is the convolution function.

FIGURE 6.21 Typical Convolutional Network Unit (the weight function is a convolution of the kernel w with the input p, followed by the transfer function f: a = f(w ⊛ p + b)).


The convolution function can be thought of as a trick to address the issue defined in the previous paragraph. The trick is called parameter sharing, which in addition to computational efficiency provides additional benefits. Specifically, in a convolution layer, instead of having a weight for each input, there is a set of weights referred to as the convolution kernel or filter, which is shared between inputs and moves around the input matrix to produce the outputs. The kernel is typically represented as a small matrix W of size r × c; for a given input matrix V, then, the convolution function can be stated as:

$$z_{i,j} = \sum_{k=1}^{r} \sum_{l=1}^{c} w_{k,l}\, v_{i+k-1,\, j+l-1}$$

For example, assume that the input matrix to a layer and the convolution kernel are

$$V = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}, \qquad W = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

Figure 6.22 illustrates how the convolution output can be computed. As shown, each element of the output matrix results from summing up the element-by-element products of the kernel and a corresponding r × c (in this example, 2 × 2, because the kernel is 2 × 2) subset of the input matrix elements. So, in the example shown, the element at the second column of the first row of the output matrix is in fact 0(0) + 1(1) + 1(1) + 1(0) = 2.

FIGURE 6.22 Convolution of a 2 × 2 Kernel by a 3 × 6 Input Matrix.

It can be seen that the magnitude of each element in the output matrix directly depends on how well the kernel matches the 2 × 2 part of the input matrix involved in the calculation of that element. For example, the element at the fourth column of the first row of the output matrix is the result of convoluting the kernel with a part of the input matrix that is exactly the same as the kernel (shown in Figure 6.23), and it therefore attains the maximum possible value. This suggests that by applying the convolution operation, we are converting the input matrix into an output in which the parts that contain a particular feature (reflected by the kernel) stand out with the largest values.

FIGURE 6.23 The Output of the Convolution Operation Is Maximized When the Kernel Exactly Matches the Part of the Input Matrix That Is Being Convoluted.
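The worked example above can be reproduced in a few lines of Python; the sketch below slides the 2 × 2 kernel W over the 3 × 6 input V and sums the element-by-element products, so the printed output contains 2 in the second column of the first row and the maximal value 3 in the fourth column.

import numpy as np

# Reproduce the worked example: slide the 2x2 kernel W over the 3x6 input V and apply
# z[i, j] = sum_k sum_l w[k, l] * v[i + k, j + l] (zero-based indexing here, versus the
# one-based indexing used in the formula in the text).
V = np.array([[1, 0, 1, 0, 1, 0],
              [1, 1, 0, 1, 1, 1],
              [1, 1, 1, 0, 0, 1]])
W = np.array([[0, 1],
              [1, 1]])

r, c = W.shape
rows, cols = V.shape[0] - r + 1, V.shape[1] - c + 1   # output loses r - 1 rows and c - 1 columns
Z = np.zeros((rows, cols), dtype=int)
for i in range(rows):
    for j in range(cols):
        Z[i, j] = np.sum(W * V[i:i + r, j:j + c])     # element-by-element products, summed

print(Z)
# Z[0, 1] == 2 and Z[0, 3] == 3, the maximum, where the input patch equals the kernel.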

This characteristic of convolution functions is especially useful in practical image- processing applications. For instance, if the input matrix represents the pixels of an image,


a particular kernel representing a specific shape (e.g., a diagonal line) may be convoluted into that image to extract parts of the image involving that specific shape. Figure 6.24, for example, shows the result of applying a 3 * 3 horizontal line kernel to a 15 * 15 image of a square.

Clearly, the horizontal kernel produces an output in which the location of horizon- tal lines (as a feature) in the original input image is identified.

Convolution using a kernel of size r × c will reduce the number of rows and columns in the output by r − 1 and c − 1, respectively. In the previous example, using a 2 × 2 kernel for convolution, the output matrix has 1 row and 1 column less than the input matrix. To prevent this change of size, we can pad the outside of the input matrix with zeros before convolving, that is, add r − 1 rows and c − 1 columns of zeros to the input matrix. On the other hand, if we want the output matrix to be even smaller, we can have the kernel take larger strides, or kernel movements. Normally, the kernel is moved one step at a time (i.e., stride = 1) when performing the convolution. By increasing this stride to 2, the size of the output matrix is reduced by a factor of 2.
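The effect of kernel size, padding, and stride on the output dimensions can be summarized in a one-line formula; the sketch below is a simple illustration of it, and the specific calls are only examples.

# How padding and stride change the size of the convolution output along one dimension.
def conv_output_size(n, k, total_padding=0, stride=1):
    """Output length for an input of length n and a kernel of length k."""
    return (n + total_padding - k) // stride + 1

print(conv_output_size(6, 2))                   # 5: one column fewer than the input (k - 1 lost)
print(conv_output_size(6, 2, total_padding=1))  # 6: padding by k - 1 zeros preserves the size
print(conv_output_size(6, 2, stride=2))         # 3: a stride of 2 roughly halves the output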

Although the main benefit of employing convolution in deep networks is parameter sharing, which effectively reduces the required time and processing power to train the network by reducing the number of weight parameters, it involves some other benefits as well. A convolution layer in a network has a property called equivariance to translation (Goodfellow et al., 2016). It simply means that any shift in the input leads to the same shift in the output. For instance, moving an object in the input image by 10 pixels in a particular direction will lead to moving its representation in the output image by 10 pixels in the same direction. Apart from image-processing applications, this feature is especially useful for analyzing time-series data using convolutional networks, where convolution can produce a kind of timeline that shows when each feature appears in the input.

It should be noted that in almost all of the practical applications of convolutional networks, many convolution operations are used in parallel to extract various kinds of features from the data, because a single feature is hardly enough to fully describe the inputs for classification or recognition purposes. Also, as noted before, in most real-world applications, we have to represent the inputs as multi-dimensional tensors. For instance, in processing color images as opposed to grayscale pictures, instead of having 2D tensors (i.e., matrices) that represent the brightness of each pixel, one has to use 3D tensors because each pixel should be defined using the intensities of the red, blue, and green colors.

FIGURE 6.24 Example of Using Convolution for Extracting Features (Horizontal Lines in This Example) from Images: a 3 × 3 horizontal-line kernel applied to a 15 × 15 input image of a square.


Pooling

Most of the time, a convolution layer is followed by another layer known as the pooling (a.k.a. subsampling) layer. The purpose of a pooling layer is to consolidate elements in the input matrix to produce a smaller output matrix while maintaining the important features. Normally, a pooling function involves an r × c consolidation window (similar to a kernel in the convolution function) that moves around the input matrix and in each move calculates some summary statistic of the elements involved in the consolidation window and places it in the corresponding element of the output matrix. For example, a particular type of pooling function called average pooling takes the average of the input matrix elements involved in the consolidation window and puts that average value as an element of the output matrix in the corresponding location. Similarly, the max pooling function (Zhou et al.) takes the maximum of the values in the window as the output element. Unlike convolution, for the pooling function, given the size of the consolidation window (i.e., r and c), the stride should be carefully selected so that there are no overlaps in the consolidations. The pooling operation using an r × c consolidation window reduces the number of rows and columns of the input matrix by a factor of r and c, respectively. For example, using a 3 × 3 consolidation window, a 15 × 15 matrix will be consolidated to a 5 × 5 matrix.

Pooling, in addition to reducing the number of parameters, is especially useful in the image-processing applications of deep learning in which the critical task is to determine whether a feature (e.g., a particular animal) is present in an image while the exact spatial lo- cation of the same in the picture is not important. However, if the location of features is im- portant in a particular context, applying a pooling function could potentially be misleading.

You can think of pooling as an operation that summarizes large inputs whose features are already extracted by the convolution layer and shows us just the important parts (i.e., features) in each small neighborhood of the input space. For instance, in the case of the image-processing example shown in Figure 6.24, if we place a max pooling layer after the convolution layer using a 3 × 3 consolidation window, the output will be like what is shown in Figure 6.25. As shown, the 15 × 15 already-convoluted image is consolidated into a 5 × 5 image while the main features (i.e., horizontal lines) are maintained therein.
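A minimal sketch of the max pooling operation is given below; the 6 × 6 toy input and the 3 × 3 non-overlapping consolidation window are illustrative choices.

import numpy as np

# Max pooling with a non-overlapping r x c consolidation window.
def max_pool(x, r, c):
    rows, cols = x.shape[0] // r, x.shape[1] // c
    out = np.zeros((rows, cols), dtype=x.dtype)
    for i in range(rows):
        for j in range(cols):
            out[i, j] = x[i * r:(i + 1) * r, j * c:(j + 1) * c].max()  # keep only the largest value
    return out

x = np.arange(36).reshape(6, 6)   # a 6 x 6 toy "image"
print(max_pool(x, 3, 3))          # consolidated to 2 x 2; each dimension shrinks by a factor of 3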

Sometimes pooling is used just to modify the size of matrices coming from the pre- vious layer and convert them to a specified size required by the following layer in the network.

FIGURE 6.25 An Example of Applying Max Pooling on an Output Image to Reduce Its Size.

There are various types of pooling operations such as max pooling, average pooling, the L2 norm of a rectangular neighborhood, and weighted average pooling. The choice of the proper pooling operation, as well as the decision of whether to include a pooling layer in the network at all, depends highly on the context and properties of the problem that the network is solving. There are some guidelines in the literature to help network designers make such decisions (Boureau et al., 2011; Boureau, Ponce, and LeCun, 2010; Scherer, Müller, and Behnke, 2010).

Image Processing Using Convolutional Networks

Real applications of deep learning in general and CNNs in particular highly depend on the availability of large, annotated data sets. Theoretically, CNNs can be applied to many practical problems, and today many large and feature-rich databases are available for such applications. Nevertheless, the biggest challenge is that in supervised learning applications, one needs an already annotated (i.e., labeled) data set to train the model before it can be used for prediction/identification of other unknown cases. Whereas extracting features of data sets using CNN layers is an unsupervised task, the extracted features will not be of much use without labeled cases with which to develop a classification network in a supervised learning fashion. That is why image classification networks traditionally involve two pipelines: visual feature extraction and image classification.

ImageNet (http://www.image-net.org) is an ongoing research project that pro- vides researchers with a large database of images, each linked to a set of synonym words (known as synset) from WordNet (a word hierarchy database). Each synset represents a particular concept in the WordNet. Currently, WordNet includes more than 100,000 synsets, each of which is supposed to be illustrated by an average of 1,000 images in the ImageNet. ImageNet is a huge database for developing image processing–type deep networks. It contains more than 15 million labeled images in 22,000 categories. Because of its sheer size and proper categorization, ImageNet is by far the most widely used benchmarking data set to assess the efficiency and accuracy of deep networks designed by deep learning researchers.

One of the first convolutional networks designed for image classification using the ImageNet data set was AlexNet (Krizhevsky, Sutskever, and Hinton, 2012). It was composed of five convolution layers followed by three fully connected (a.k.a. dense) layers (see Figure 6.26 for a schematic representation of AlexNet). One of the contributions of this relatively simple architecture that made its training remarkably faster and computationally efficient was the use of rectified linear unit (ReLU) transfer functions in the convolution layers instead of the traditional sigmoid functions. By doing so, the designers


addressed the issue called the vanishing gradient problem, which is caused by the very small derivatives of sigmoid functions in their saturated regions. The other important contribution of this network, which has played a dramatic role in improving the efficiency of deep networks, was the introduction of the concept of dropout layers to CNNs as a regularization technique to reduce overfitting. A dropout layer typically comes after the fully connected layers and applies a random probability to the neurons to switch off some of them and make the network sparser.

FIGURE 6.26 Architecture of AlexNet, a Convolutional Network for Image Classification (five convolution layers C1–C5 with 96, 256, 384, 384, and 256 feature maps, followed by fully connected layers FC6–FC8 with 4,096, 4,096, and 1,000 units).
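To show how these pieces fit together, the following Keras sketch assembles a small, AlexNet-inspired classifier with convolution layers using ReLU transfer functions, max pooling, fully connected layers, and a dropout layer; it is not the actual AlexNet configuration, and the input shape, layer sizes, and number of classes are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

# A highly simplified, AlexNet-inspired sketch (not the actual AlexNet configuration).
model = models.Sequential([
    layers.Input(shape=(150, 150, 3)),                    # 150 x 150 RGB image
    layers.Conv2D(32, kernel_size=5, activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D(pool_size=3),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=3),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),                 # fully connected (dense) layer
    layers.Dropout(0.5),                                  # randomly switch off half the neurons
    layers.Dense(10, activation="softmax"),               # 10 illustrative output classes
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()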

In recent years, in addition to a large number of data scientists who showcase their deep learning capabilities, a number of well-known industry-leading companies such as Microsoft, Google, and Facebook have participated in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The goal in the ILSVRC classification task is to design and train networks that are capable of classifying 1.2 million input images into one of 1,000 image categories. For instance, GoogLeNet (a.k.a. Inception), a deep convolutional network architecture designed by Google researchers, was the winning architecture of ILSVRC 2014 with a 22-layer network and only a 6.66 percent classification error rate, only slightly worse than the human-level classification error rate (5.1%) (Russakovsky et al., 2015). The main contribution of the GoogLeNet architecture was to introduce a module called Inception. The idea behind Inception is that because one would have no idea which size of convolution kernel would perform best on a particular data set, it is better to include multiple convolutions and let the network decide which one to use. Therefore, as shown in Figure 6.27, in each convolution layer, the data coming from the previous layer is passed through multiple types of convolution and the outputs are concatenated before going to the next layer. Such an architecture allows the model to take into account both local features via smaller convolutions and highly abstracted features via larger ones.

FIGURE 6.27 Conceptual Representation of the Inception Feature in GoogLeNet (1 × 1, 3 × 3, and 5 × 5 convolutions and 3 × 3 max pooling are applied in parallel to the previous layer and their outputs are concatenated).
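A rough sketch of such a parallel-convolution module, written with the Keras functional API, is shown below; the filter counts and input shape are illustrative, and the module is only loosely inspired by the actual Inception design.

import tensorflow as tf
from tensorflow.keras import layers

# An Inception-style module: convolutions of different kernel sizes (and a max-pooling
# branch) are applied to the same input in parallel and their outputs are concatenated.
def inception_module(x, filters=32):
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(filters, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])   # filter concatenation

inputs = tf.keras.Input(shape=(64, 64, 3))
outputs = inception_module(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()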

Google recently launched a new service, Google Lens, that uses deep learning arti- ficial neural network algorithms (along with other AI techniques) to deliver information about the images captured by users from their nearby objects. This involves identifying the objects, products, plants, animals, and locations and providing information about them on the Internet. Some other features of this service are the capability of saving


contact information from a business card image on the phone, identifying type of plants and breed of animals, identifying books and movies from their cover photos, and provid- ing information (e.g., stores, theaters, shopping, reservations) about them. Figure 6.28 shows two examples of using the Google Lens app on an Android mobile device.

Even though more accurate networks (e.g., He, Zhang, Ren, & Sun, 2015) have since been developed, in terms of efficiency and processing requirements (i.e., smaller number of layers and parameters) GoogLeNet is considered to be one of the best architectures to date. Apart from AlexNet and GoogLeNet, several other convolutional network architectures such as Residual Networks (ResNet), VGGNet, and Xception have been developed and have contributed to the image-processing area, all relying on the ImageNet database.

In a May 2018 effort to address the labor-intensive task of labeling images on a large scale, Facebook published a weakly supervised training image recognition deep learning project (Mahajan et al., 2018). This project used hashtags made by the users on the im- ages posted on Instagram as labels and trained a deep learning image recognition model based on that. The model was trained using 3.5 billion Instagram images labeled with around 17,000 hashtags using 336 GPUs working in parallel; the training procedure took a few weeks to be accomplished. A preliminary version of the model (trained using only 1 billion images and 1,500 hashtags) was then tested on the ImageNet benchmark data set and is reported to have outperformed the state-of-the-art models in terms of accuracy by more than 2 percent. This big achievement by Facebook surely will open doors to a new world of image processing using deep learning since it can dramatically increase the size of available image data sets that are labeled for training purposes.

Use of deep learning and advanced analytics methods to classify images has evolved into the recognition of human faces and has become a very popular application for a variety of purposes. It is discussed in Application Case 6.6.

FIGURE 6.28 Two Examples of Using Google Lens, a Service Based on Convolutional Deep Networks for Image Recognition. Source: ©2018 Google LLC, used with permission. Google and the Google logo are registered trademarks of Google LLC.

Application Case 6.6 From Image Recognition to Face Recognition

Face recognition, although seemingly similar to image recognition, is a much more complicated undertaking. The goal of face recognition is to identify the individual as opposed to the class the subject belongs to (human), and this identification task needs to be performed in a nonstatic (i.e., with a moving person) 3D environment. Face recognition has been an active research field in AI for many decades with limited success until recently. Thanks to the new generation of algorithms (i.e., deep learning) coupled with large data sets and computational power, face recognition technology is starting to make a significant impact on real-world applications. From security to marketing, face recognition and the variety of applications/use cases of this technology are increasing at an astounding pace.

Some of the premier examples of face recogni- tion (both in advancements in technology and in the creative use of the technology perspectives) come from China. Today in China, face recognition is a very hot topic both from business development and from application development perspectives. Face recognition has become a fruitful ecosystem with hundreds of start-ups in China. In personal and/or business settings, people in China are widely using and relying on devices whose security is based on automatic recognition of their faces.

As perhaps the largest scale practical applica- tion case of deep learning and face recognition in the world today, the Chinese government recently started a project known as “Sharp Eyes” that aims at establishing a nationwide surveillance system based on face recognition. The project plans to integrate security cameras already installed in public places with private cameras on buildings and to utilize AI and deep learning to analyze the videos from those cameras. With millions of cameras and billions of lines of code, China is building a high-tech authori- tarian future. With this system, cameras in some cit- ies can scan train and bus stations as well as airports to identify and catch China’s most wanted suspected criminals. Billboard-size displays can show the faces of jaywalkers and list the names and pictures of peo- ple who do not pay their debts. Facial recognition scanners guard the entrances to housing complexes.

An interesting example of this surveillance system is the “shame game” (Mozur, 2018). An

intersection south of Changhong Bridge in the city of Xiangyang previously was a nightmare. Cars drove fast, and jaywalkers darted into the street. Then, in the summer of 2017, the police put up cameras linked to facial recognition technology and a big out- door screen. Photos of lawbreakers were displayed alongside their names and government identifica- tion numbers. People were initially excited to see their faces on the screen until propaganda outlets told them that this was a form of punishment. Using this, citizens not only became a subject of this shame game but also were assigned negative citizenship points. Conversely, on the positive side, if people are caught on camera showing good behavior, like pick- ing up a piece of trash from the road and putting it into a trash can or helping an elderly person cross an intersection, they get positive citizenship points that can be used for a variety of small awards.

China already has an estimated 200 million sur- veillance cameras—four times as many as the United States. The system is mainly intended to be used for tracking suspects, spotting suspicious behavior, and predicting crimes. For instance, to find a criminal, the image of a suspect can be uploaded to the system, matching it against millions of faces recognized from videos of millions of active security cameras across the country. This can find individuals with a high degree of similarity. The system also is merged with a huge database of information on medical records, travel bookings, online purchases, and even social media activities of every citizen and can monitor practically everyone in the country (with 1.4 billion people), tracking where they are and what they are doing each moment (Denyer, 2018). Going beyond narrowly defined security purposes, the govern- ment expects Sharp Eyes to ultimately assign every individual in the country a “social credit score” that specifies to what extent she or he is trustworthy.

While such an unrestricted application of deep learning (i.e., spying on citizens) is against the privacy and ethical norms and regulations of many western countries, including the United States, it is becoming a common practice in countries with less restrictive privacy laws and concerns as in China. Even western countries have begun to plan on employing similar technologies in limited scales only for security and


Text Processing Using Convolutional Networks

In addition to image processing, which was in fact the main reason for the popularity and development of convolutional networks, they have been shown to be useful in some large-scale text mining tasks as well. Especially since 2013, when Google published its word2vec project (Mikolov et al., 2013; Mikolov, Sutskever, Chen, Corrado, and Dean, 2013), the applications of deep learning for text mining have increased remarkably.

Word2vec is a two-layer neural network that gets a large text corpus as the input and converts each word in the corpus to a numeric vector of any given size (typically ranging from 100 to 1,000) with very interesting features. Although word2vec itself is not a deep learning algorithm, its outputs (word vectors also known as word embeddings) already have been widely used in many deep learning research and commercial projects as inputs.

One of the most interesting properties of word vectors created by the word2vec algorithm is maintaining the words’ relative associations. For example, vector operations

vector (‘King’) - vector (‘Man’) + vector (‘Woman’)

and

vector (‘London’) - vector (‘England’) + vector (‘France’)

will result in a vector very close to vector (‘Queen’) and vector (‘Paris’), respectively. Figure 6.29 shows a simple vector representation of the first example in a two-dimensional vector space.

FIGURE 6.29 Typical Vector Representation of Word Embeddings in a Two-Dimensional Space (King − Man + Woman ends up near Queen).

Moreover, the vectors are specified in such a way that those of a similar context are placed very close to each other in the n-dimensional vector space. For instance, in the word2vec model pretrained by Google using a corpus including about 100 billion words (taken from Google News), the closest vectors to the vector (‘Sweden’) in terms of cosine distance, as shown in Table 6.2, identify European country names near the Scandinavian region, the same region in which Sweden is located.

TABLE 6.2 Example of the word2vec Project Indicating the Closest Word Vectors to the Word “Sweden”

Word           Cosine Distance
Norway         0.760124
Denmark        0.715460
Finland        0.620022
Switzerland    0.588132
Belgium        0.585635
Netherlands    0.574631
Iceland        0.562368
Estonia        0.547621
Slovenia       0.531408
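Pretrained word vectors such as these can be explored with the gensim library; in the sketch below, the file name of the pretrained Google News vectors is an assumption, and the printed neighbors and analogies would resemble, but not exactly match, the examples discussed here.

from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (the file name is the commonly distributed archive
# and is assumed here).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

# Words used in similar contexts end up close together in the vector space
print(vectors.most_similar("Sweden", topn=5))

# Vector arithmetic preserves relative associations: king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))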

Additionally, because word2vec takes into account the contexts in which a word has been used, and how frequently it has been used in each context, when inferring the meaning of the word, it enables us to represent each term by its semantic context instead of just the syntactic/symbolic term itself. As a result, word2vec addresses several word variation issues that used to be problematic in traditional text mining activities. In other words,

crime prevention purposes. The FBI’s Next Generation Identification System, for instance, is a lawful appli- cation of facial recognition and deep learning that compares images from crime scenes with a national database of mug shots to identify potential suspects.

Questions for Case 6.6

1. What are the technical challenges in face recognition?

2. Beyond security and surveillance purposes, where else do you think face recognition can be used?

3. What are the foreseeable social and cultural problems with developing and using face recog- nition technology?

Sources: Mozur, P. (2018, June 8). “Inside China’s Dystopian Dreams: A.I., Shame and Lots of Cameras.” The New York Times. https://www.nytimes.com/2018/07/08/business/china-surveillance-technology.html; Denyer, S. (2018, January). “Beijing Bets on Facial Recognition in a Big Drive for Total Surveillance.” The Washington Post. https://www.washingtonpost.com/news/world/wp/2018/01/07/feature/in-china-facial-recognition-is-sharp-end-of-a-drive-for-total-surveillance/?noredirect=on&utm_term=.e73091681b31.


word2vec is able to handle and correctly represent words including typos, abbreviations, and informal conversations. For instance, the words Frnce, Franse, and Frans would all get roughly the same word embeddings as their original counterpart France. Word embeddings are also able to determine other interesting types of associations such as distinction of entities (e.g., vector (‘human’) − vector (‘animal’) ≈ vector (‘ethics’)) or geopolitical associations (e.g., vector (‘Iraq’) − vector (‘violence’) ≈ vector (‘Jordan’)).

By providing such a meaningful representation of textual data, in recent years word2vec has driven many deep learning–based text mining projects in a wide range of contexts (e.g., medical, computer science, social media, marketing), and various types of deep networks have been applied to the word embeddings created by this algorithm to accomplish different objectives. In particular, a large group of studies has developed convolutional networks applied to word embeddings with the aim of relation extraction from textual data sets. Relation extraction is one of the subtasks of natural language processing (NLP) that focuses on determining whether two or more named entities recognized in the text form specific relationships (e.g., “A causes B”; “B is caused by A”). For instance, Zeng et al. (2014) developed a deep convolutional network (see Figure 6.30) to classify relations between specified entities in sentences. To this end, these researchers


used a matrix format to represent each sentence. Each column of the input matrices is in fact the word embedding (i.e., vector) associated with one of the words involved in the sentence. Zeng et al. then used a convolutional network, shown in the right box in Figure 6.30, to automatically learn the sentence-level features and concatenate those features (i.e., the output vector of the CNN) with some basic lexical features (e.g., the order of the two words of interest within the sentence and the left and right tokens for each of them). The concatenated feature vector is then fed into a classification layer with a softmax transfer function, which determines the type of relationship between the two words of interest among multiple predefined types. The softmax transfer function is the most common type of function used in classification layers, especially when the number of classes is more than two. For classification problems with only two outcome categories, log-sigmoid transfer functions are also very popular. The proposed approach by Zeng et al. was shown to correctly classify the relation between the marked terms in sentences of a sample data set with an 82.7 percent accuracy.

FIGURE 6.30 CNN Architecture for the Relation Extraction Task in Text Mining (word representations pass through window processing and convolution to produce sentence-level features, which are concatenated with lexical-level features before the output layer).
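For reference, the softmax transfer function mentioned above takes only a few lines of code; the sketch below converts a vector of made-up raw class scores into probabilities that sum to 1.

import numpy as np

# The softmax transfer function used in classification layers.
def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # one raw score per relationship class (made-up values)
print(softmax(scores))               # approximately [0.659, 0.242, 0.099]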

In a similar study, Nguyen and Grishman (2015) used a four-layer convolutional network with multiple kernel sizes in each convolution layer, fed by the real-valued vectors of the words included in sentences, to classify the type of relationship between the two marked words in each sentence. In the input matrix, each row was the word embedding associated with the word in the same sequence position in the sentence as the row number. In addition, these researchers included two more columns in the input matrices to represent the relative position of each word (either positive or negative) with regard to each of the marked terms. The automatically extracted features were then passed through a classification layer with a softmax function for the type of relationship to be determined. Nguyen and Grishman trained their model using 8,000 annotated examples (with 19 predefined classes of relationships), tested the trained model on a validation set of 2,717 examples, and achieved a classification accuracy of 61.32 percent (i.e., more than 11 times better performance than guessing).

Such text mining approaches using convolutional deep networks can be extended to various practical contexts. Again, the big challenge here, just as in image processing, is the lack of sufficiently large annotated data sets for supervised training of deep networks. A distant supervision method of training has been proposed (Mintz et al., 2009) to address this challenge. It suggests that large amounts of training data can be produced by aligning knowledge base (KB) facts with texts. In fact, this approach is based on the assumption that if a particular type of relation exists between an entity pair (e.g., “A” is a component of “B”) in the KB, then every text document containing the mention of the


entity pair would express that relation. However, since this assumption was not very realistic, Riedel, Yao, and McCallum (2010) later relaxed it by modeling the problem as a multi-instance learning problem. They suggest assigning labels to a bag of instances rather than to a single instance, which can reduce the noise of the distant supervision method and create more realistic labeled training data sets (Kumar, 2017).

■ SECTION 6.7 REVIEW QUESTIONS

1. What is CNN?
2. For what type of applications can CNN be used?
3. What is convolution function in CNN and how does it work?
4. What is pooling in CNN? How does it work?
5. What is ImageNet and how does it relate to deep learning?
6. What is the significance of AlexNet? Draw and describe its architecture.
7. What is GoogLeNet? How does it work?
8. How does CNN process text? What are word embeddings, and how do they work?
9. What is word2vec, and what does it add to traditional text mining?

6.8 RECURRENT NETWORKS AND LONG SHORT-TERM MEMORY NETWORKS

Human thinking and understanding rely to a great extent on context. It is crucial for us, for example, to know that a particular speaker uses very sarcastic language (based on his previous speeches) to fully catch all the jokes that he makes. Similarly, trying to understand the real meaning of the word fall (i.e., either the season or to collapse) in the sentence “It is a nice day of fall” without knowledge of the other words in the surrounding sentences would only be guessing, not necessarily understanding. Knowledge of context is typically formed by observing events that happened in the past. In fact, human thoughts are persistent, and we use every piece of information we previously acquired about an event in the process of analyzing it rather than throwing away our past knowledge and thinking from scratch every time we face similar events or situations. Hence, there seems to be a recurrence in the way humans process information.

While deep MLP and convolutional networks are specialized for processing a static grid of values like an image or a matrix of word embeddings, sometimes the sequence of input values is also important to the operation of the network to accomplish a given task and hence should be taken into account. Another popular type of neural networks is recurrent neural network (RNN) (Rumelhart et al., 1986), which is specifically de- signed to process sequential inputs. An RNN basically models a dynamic system where (at least in one of its hidden neurons) the state of the system (i.e., output of a hidden neuron) at each time point t depends on both the inputs to the system at that time and its state at the previous time point t - 1. In other words, RNNs are the type of neural networks that have memory and that apply that memory to determine their future out- puts. For instance, in designing a neural network to play chess, it is important to take into account several previous moves while training the network, because a wrong move by a player can lead to the eventual loss of the game in the subsequent 10–15 plays. Also, to understand the real meaning of a sentence in an essay, sometimes we need to rely on the information portrayed in the previous several sentences or paragraphs. That is, for a true understanding, we need the context built sequentially and collectively over time. Therefore, it is crucial to consider a memory element for the neural network that takes into account the effect of prior moves (in the chess example) and prior sentences and paragraphs (in the essay example) to determine the best output. This memory portrays and creates the context required for the learning and understanding.


In static networks like MLPs and CNNs, we are trying to find some functions (i.e., network weights and biases) that map the inputs to some outputs that are as close as possible to the actual target. In dynamic networks like RNNs, on the other hand, both inputs and outputs are sequences (patterns). Therefore, a dynamic network is a dynamic system rather than a function because its output depends not only on the input but also on the previous outputs. Most RNNs use the following general equation to define the values of their hidden units (Goodfellow et al., 2016).

a(t) = f (a(t - 1), p(t),u)

In this equation, a(t) represents the state of the system at time t, and p(t) and u rep- resent the input to the unit at time t and the parameters, respectively. Applying the same general equation for calculating the state of system at time t - 1, we will have:

a(t - 1) = f (a(t - 2), p(t - 1),u)

In other words:

a(t) = f ( f (a(t - 2), p(t - 1),u), p(t),u)

And this equation can be extended multiple times for any given sequence length. Graphically, a recurrent unit in a network can be depicted in a circuit diagram like the one shown in Figure 6.31. In this figure, D represents the tap delay line, or simply the delay element of the network, which at each time point t contains a(t − 1), the previous output value of the unit. Sometimes, instead of just one value, we store several previous output values in D to account for the effect of all of them. Also, iw and lw represent the weight vectors applied to the input and the delay, respectively.

FIGURE 6.31 Typical Recurrent Unit, in Which the Delay Element D Feeds the Previous Output Back into the Neuron: a(t) = f(iw·p(t) + lw·a(t − 1) + b).
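The recurrence depicted in Figure 6.31 can be illustrated with a single recurrent neuron in a few lines of Python; the weights, bias, tanh transfer function, and input sequence below are made-up values.

import numpy as np

# A bare-bones sketch of the recurrence a(t) = f(iw * p(t) + lw * a(t-1) + b).
iw, lw, b = 0.8, 0.5, 0.1          # input weight, feedback (delay) weight, bias
inputs = [1.0, 0.0, -1.0, 0.5]     # p(1), p(2), p(3), p(4)

a = 0.0                            # initial state a(0)
for t, p in enumerate(inputs, start=1):
    a = np.tanh(iw * p + lw * a + b)   # the new state depends on the input and the old state
    print(f"a({t}) = {a:.4f}")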

Technically speaking, any network with feedback can actually be called a deep network, because even with a single layer, the loop created by the feedback can be thought of as a static MLP-type network with many layers (see Figure 6.32 for a graphical illustration of this structure). However, in practice, each recurrent neural network would involve dozens of layers, each with feedback to itself, or even to the previous layers, which makes a recurrent neural network even deeper and more complicated.

FIGURE 6.32 Unfolded View of a Typical Recurrent Network.

Because of the feedbacks, computation of gradients in the recurrent neural net- works would be somewhat different from the general backpropagation algorithm used


for the static MLP networks. There are two alternative approaches for computing the gradients in RNNs, namely, real-time recurrent learning (RTRL) and backpropagation through time (BPTT), whose explanation is beyond the scope of this chapter. Nevertheless, the general purpose remains the same; once the gradients have been computed, the same procedures are applied to optimize the learning of the network parameters.

LSTM networks (Hochreiter & Schmidhuber, 1997) are variations of recurrent neural networks that today are known as the most effective sequence modeling technique and are the basis of many practical applications. In a dynamic network, the weights are called the long-term memory, while the feedback plays the role of the short-term memory.

In essence, only the short-term memory (i.e., feedbacks; previous events) provides a network with the context. In a typical RNN, the information in the short-term memory is continuously replaced as new information is fed back into the network over time. That is why RNNs perform well when the gap between the relevant information and the place where it is needed is small. For instance, to predict the last word in the sentence “The referee blew his whistle,” we just need to look a few words back (i.e., to the referee). Since in this case the gap between the relevant information (i.e., the referee) and where it is needed (i.e., to predict whistle) is small, an RNN can easily perform this learning and prediction task.

However, sometimes the relevant information required to perform a task is far away from where it is needed (i.e., the gap is large). Therefore, it is quite likely that it would have already been replaced by other information in the short-term memory by the time it is needed for the creation of the proper context. For instance, to predict the last word in “I went to a carwash yesterday. It cost $5 to wash my car,” there is a relatively larger gap between the relevant information (i.e., carwash) and where it is needed. Sometimes we may even need to refer to the previous paragraphs to reach the relevant information for predicting the true meaning of a word. In such cases, RNNs usually do not perform well since they cannot keep the information in their short-term memory for a long enough time. Fortunately, LSTM networks do not have such a shortcoming. The term long short- term memory network then refers to a network in which we are trying to remember what happened in the past (i.e., feedbacks; previous outputs of the layers) for a long enough time so that it can be used/leveraged in accomplishing the task when needed.

From an architectural viewpoint, the memory concept (i.e., remembering “what happened in the past”) is incorporated in LSTM networks by adding four layers to the typical recurrent network architecture: three gate layers, namely the input gate, the forget (a.k.a. feedback) gate, and the output gate, and an additional layer called the Constant Error Carousel (CEC), also known as the state unit, that integrates those gates and makes them interact with the other layers. Each gate is nothing but a layer with two inputs, one from the network input and the other a feedback from the final output of the whole network. The gates involve log-sigmoid transfer functions. Therefore, their outputs will be between 0 and 1 and describe how much of each component (either input, feedback, or output) should be let through the network. Also, the CEC is a layer that falls between the


input and the output layers in a recurrent network architecture and applies the gates' outputs to make the short-term memory long.

To have a long short-term memory means that we want to keep the effect of previous outputs for a longer time. However, we typically do not want to indiscriminately remember everything that has happened in the past. Therefore, gating provides us with the capability of remembering prior outputs selectively. The input gate allows selective inputs to the CEC; the forget gate clears the CEC of unwanted previous feedbacks; and the output gate allows selective outputs from the CEC. Figure 6.33 shows a simple depiction of a typical LSTM architecture.

FIGURE 6.33 Typical Long Short-Term Memory (LSTM) Network Architecture (the input, forget, and output gates regulate the flow of information through the Constant Error Carousel between the input and output layers).

In summary, the gates in the LSTM are in charge of controlling the flow of information through the network and dynamically change the time scale of integration based on the input sequence. As a result, LSTM networks are able to learn long-term dependencies among the sequence of inputs more easily than the regular RNNs.
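As an illustration of how an LSTM layer is typically used in practice, the Keras sketch below builds a small sequence classifier (e.g., for sentiment over sequences of word indices); the vocabulary size, sequence length, layer sizes, and binary output are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

# A minimal sketch of an LSTM-based sequence classifier.
model = models.Sequential([
    layers.Input(shape=(100,), dtype="int32"),         # sequences of 100 word indices
    layers.Embedding(input_dim=10000, output_dim=64),  # learn 64-dimensional word embeddings
    layers.LSTM(32),                                   # gated recurrent layer with long short-term memory
    layers.Dense(1, activation="sigmoid"),             # binary outcome (e.g., positive/negative)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()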

Application Case 6.7 illustrates the use of text processing in the context of under- standing customer opinions and sentiments toward innovatively designing and develop- ing new and improved products and services.


Application Case 6.7 Deliver Innovation by Understanding Customer Sentiments

Analyzing product and customer behavior provides valuable insights into what consumers want, how they interact with products, and where they encounter usability issues. These insights can lead to new feature designs and development or even new products.

Understanding customer sentiment and knowing what consumers truly think about products or a brand are traditional pain points. Customer journey analytics provides insights into these areas, yet these solutions are not all designed to integrate vital sources of unstructured data such as call center notes or social media feedback. In today's world, unstructured notes are part of core communications in virtually every industry, for example:

• Medical professionals record patient observations.
• Auto technicians write down safety information.
• Retailers track social media for consumer comments.
• Call centers monitor customer feedback and take notes.

Bringing together notes, which are usually available as free-form text, with other data for analysis has been difficult. That is because each industry has its own specific terms, slang, shorthand, and acronyms embedded in the data. Finding meaning and business insights first requires the text to be changed into a structured form. This manual process is expensive, time consuming, and prone to errors, especially as data scales to ever-increasing volumes. One way that companies can leverage notes without codifying the text is to use text clustering. This analytic technique quickly identifies common words or phrases for rapid insights.

Text and Notes Can Lead to New and Improved Products

Leveraging the insights and customer sentiment uncovered during a text and sentiment analysis can spark innovation. Companies such as vehicle manufacturers can use the intelligence to improve customer service and deliver an elevated customer experience. By learning what customers like and dislike about current products, companies can improve their design, such as adding new features to a vehicle to enhance the driving experience.

Forming word clusters also allows companies to identify safety issues. If an auto manufacturer sees that numerous customers are expressing negative sentiments about black smoke coming from their vehicle, the company can respond. Likewise, manufacturers can address safety issues that are a concern to customers. With comments grouped into buckets, companies have the ability to focus on specific customers who experienced a similar problem. This allows a company to, for instance, offer a rebate or special promotion to those who experienced black smoke.

Understanding sentiments can better inform a vehicle manufacturer's policies. For example, customers have different lifetime values. A customer who complains just once but has a very large lifetime value can be a more urgent candidate for complaint resolution than a customer with a lower lifetime value and multiple issues. One may have spent $5,000 buying the vehicle from a used vehicle lot. Another may have a history of buying new cars from the manufacturer and spent $30,000 to buy the vehicle on the showroom floor.

Analyzing Notes Enables High-Value Business Outcomes

Managing the life cycle of products and services continues to be a struggle for most companies. The massive volumes of data now available have complicated life cycle management, creating new challenges for innovation. At the same time, the rapid rise of consumer feedback through social media has left businesses without a strategy for digesting, measuring, or incorporating the information into their product innovation cycle—meaning they miss a crucial amount of intelligence that reflects a customer's actual thoughts, feelings, and emotions.

Text and sentiment analysis is one solution to this problem. Deconstructing topics from masses of text allows companies to see what common issues, complaints, or positive or negative sentiments customers have about products. These insights can lead to high-value outcomes, such as improving products or creating new ones that deliver a better user experience, responding in a timely manner to safety issues, and identifying which product lines are most popular with consumers.

Example: Visualizing Auto Issues with “The Safety Cloud”

The Teradata Art of Analytics uses data science, Teradata® Aster® Analytics, and visualization techniques to turn data into one-of-a-kind artwork. To demonstrate the unique insights offered by text clustering, data scientists used the Art of Analytics to create "The Safety Cloud."

The scientists used advanced analytics algorithms on safety inspector and call center notes from an automobile manufacturer. The analytics identified and systematically extracted common words and phrases embedded in the data volumes. The blue cluster represents power steering failure. The pink is engine stalls. Yellow is black smoke in the exhaust. Orange is brake failure. The manufacturer can use this information to gauge how big the problem is and whether it is safety related, and if so, then take actions to fix it.

For a visual summary, you can watch the video (http://www.teradata.com/Resources/Videos/Art-of-Analytics-Safety-Cloud).

Questions for Case 6.7

1. Why do you think sentiment analysis is gaining overwhelming popularity?

2. How does sentiment analysis work? What does it produce?

3. In addition to the specific examples in this case, can you think of other businesses and industries that can benefit from sentiment analysis? What is common among the companies that can benefit greatly from sentiment analysis?

Source: Teradata Case Study. "Deliver Innovation by Understanding Customer Sentiments." http://assets.teradata.com/resourceCenter/downloads/CaseStudies/EB9859.pdf (accessed August 2018). Used with permission.

LSTM Networks Applications

Since their emergence in the late 1990s (Hochreiter & Schmidhuber, 1997), LSTM networks have been widely used in many sequence modeling applications, including image captioning (i.e., automatically describing the content of images) (Vinyals, Toshev, Bengio, and Erhan, 2017, 2015; Xu et al., 2015), handwriting recognition and generation (Graves, 2013; Graves and Schmidhuber, 2009; Keysers et al., 2017), parsing (Liang et al., 2016; Vinyals, Kaiser, et al., 2015), speech recognition (Graves and Jaitly, 2014; Graves, Jaitly, and Mohamed, 2013; Graves, Mohamed, and Hinton, 2013), and machine translation (Bahdanau, Cho, and Bengio, 2014; Sutskever, Vinyals, and Le, 2014).


Currently, we are surrounded by multiple deep learning solutions working on the basis of speech recognition, such as Apple's Siri, Google Now, Microsoft's Cortana, and Amazon's Alexa, several of which we deal with on a daily basis (e.g., checking on the weather, asking for a Web search, calling a friend, and asking for directions on the map). Note taking is not a difficult, frustrating task anymore since we can easily record a speech or lecture, upload the digital recording to one of the several cloud-based speech-to-text service providers' platforms, and download the transcript in a few seconds. The Google cloud-based speech-to-text service, for example, supports 120 languages and their variants and has the ability to convert speech to text either in real time or from recorded audio. The Google service automatically handles noise in the audio; accurately punctuates the transcripts with commas, question marks, and periods; and can be customized by the user to a specific context by being given a set of terms and phrases that are very likely to be used in a speech and recognizing them appropriately.
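As a rough illustration of how such a service is called programmatically, the following sketch uses the google-cloud-speech Python client to transcribe a recorded audio file with automatic punctuation and a few context phrases. The class and parameter names reflect the client library as the authors understand it and should be treated as assumptions to verify against the current documentation; the storage path and phrases are hypothetical.

```python
# Minimal sketch (assumed google-cloud-speech API; verify against current docs before use)
from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical recording stored in a Cloud Storage bucket
audio = speech.RecognitionAudio(uri="gs://my-bucket/lecture.flac")

config = speech.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,            # commas, question marks, periods
    speech_contexts=[speech.SpeechContext(        # context phrases likely to occur in the talk
        phrases=["LSTM", "recurrent neural network"])],
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```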

Machine translation refers to a subfield of AI that employs computer programs to translate speech or text from one language to another. One of the most comprehensive machine translation systems is Google's Neural Machine Translation (GNMT) platform. GNMT is basically an LSTM network with eight encoder and eight decoder layers designed by a group of Google researchers in 2016 (Wu et al., 2016). GNMT is specialized for translating whole sentences at a time, as opposed to the former version of the Google Translate platform, which was a phrase-based translator. The network is capable of naturally handling the translation of rare words (previously a challenge in machine translation) by dividing the words into a set of common subword units. GNMT currently supports automatic sentence translations between more than 100 languages. Figure 6.34 shows how a sample sentence was translated from French to English by GNMT and by a human translator. It also indicates how closely the GNMT translations between different language pairs were ranked by human speakers compared with translations made by humans.

Input sentence (French): "Pour l'ancienne secrétaire d'Etat, il s'agit de faire oublier un mois de cafouillages et de convaincre l'auditoire que M. Trump n'a pas l'étoffe d'un président"

Phrase-based translation: "For the former secretary of state, this is to forget a month of bungling and convince the audience that Mr. Trump has not the makings of a president"

Neural network (GNMT) translation: "For the former secretary of state, it is a question of forgetting a month of muddles and convincing the audience that Mr. Trump does not have the stuff of a president"

Human translation: "The former secretary of state has to put behind her a month of setbacks and convince the audience that Mr. Trump does not have what it takes to be a president"

FIGURE 6.34 Example Indicating the Close-to-Human Performance of the Google Neural Machine Translator (GNMT)


Although machine translation has been revolutionized by virtue of LSTMs, it still encounters challenges that keep it far from fully automated, high-quality translation. As in image-processing applications, there is a lack of sufficient training data (manually translated by humans) for many language pairs on which the network can be trained. As a result, translations between rare languages are usually done through a bridging language (mostly English), which may result in higher chances of error.

In 2014, Microsoft launched its Skype Translator service, a free voice translation service involving both speech recognition and machine translation with the ability to translate real-time conversations in 10 languages. Using this service, people speaking different languages can talk to each other in their own languages via a Skype voice or video call, and the system recognizes their voices and translates every sentence through a translator bot in near real time for the other party. To provide more accurate translations, the deep networks used in the backend of this system were trained using conversational language (i.e., using materials such as translated Web pages, movie subtitles, and casual phrases taken from people's conversations on social networking Web sites) rather than the formal language commonly used in documents. The output of the speech recognition module then goes through TrueText, a Microsoft technology for normalizing text that is capable of identifying the mistakes and disfluencies (e.g., pausing during speech, repeating parts of the speech, or adding fillers like "um" and "ah") that people commonly produce in their conversations, and accounting for them to make better translations. Figure 6.35 shows the four-step process involved in the Microsoft Skype Translator, each step of which relies on the LSTM type of deep neural network.

SECTION 6.8 REVIEW QUESTIONS

1. What is RNN? How does it differ from CNN?
2. What is the significance of "context," "sequence," and "memory" in RNN?
3. Draw and explain the functioning of a typical recurrent neural network unit.
4. What is the LSTM network, and how does it differ from RNNs?
5. List and briefly describe three different types of LSTM applications.
6. How do Google's Neural Machine Translation and Microsoft Skype Translator work?

(The figure traces a spoken sentence through automatic speech recognition, TrueText normalization, machine translation, and text-to-speech synthesis.)

FIGURE 6.35 Four-Step Process of Translating Speech Using Deep Networks in the Microsoft Skype Translator.


6.9 COMPUTER FRAMEWORKS FOR IMPLEMENTATION OF DEEP LEARNING

Deep learning owes its recent popularity, to a great extent, to advances in the software and hardware infrastructure required for its implementation. In the past few decades, GPUs have been revolutionized to support the playing of high-resolution videos as well as advanced video games and virtual reality applications. However, GPUs' huge processing potential had not been effectively utilized for purposes other than graphics processing until a few years ago. Thanks to software libraries such as Theano (Bergstra et al., 2010), Torch (Collobert, Kavukcuoglu, and Farabet, 2011), Caffe (Jia et al., 2014), PyLearn2 (Goodfellow et al., 2013), TensorFlow (Abadi et al., 2016), and MXNet (Chen et al., 2015), developed with the purpose of programming GPUs for general-purpose processing (just as CPUs) and particularly for deep learning and the analysis of Big Data, GPUs have become a critical enabler of modern-day analytics. The operation of these libraries mostly relies on a parallel computing platform and application programming interface (API) developed by NVIDIA called Compute Unified Device Architecture (CUDA), which enables software developers to use GPUs made by NVIDIA for general-purpose processing. In fact, each deep learning framework consists of a high-level scripting language (e.g., Python, R, Lua) and a library of deep learning routines usually written in C (for using CPUs) or CUDA (for using GPUs).

We next introduce some of the most popular software libraries used for deep learning by researchers and practitioners, including Torch, Caffe, TensorFlow, Theano, and Keras, and discuss some of their specific properties.

Torch

Torch (Collobert et al., 2011) is an open-source scientific computing framework (available at www.torch.ch) for implementing machine-learning algorithms using GPUs. The Torch framework is a library based on LuaJIT, a compiled version of the popular Lua programming language (www.lua.org). In fact, Torch adds a number of valuable features to Lua that make deep learning analyses possible; it adds support for n-dimensional arrays (i.e., tensors), whereas tables (i.e., two-dimensional arrays) are normally the only data-structuring method used by Lua. Additionally, Torch includes routine libraries for manipulating (i.e., indexing, slicing, transposing) tensors, linear algebra, neural network functions, and optimization. More importantly, while Lua by default uses the CPU to run programs, Torch enables the use of GPUs for running programs written in the Lua language.

The easy and extremely fast scripting properties of LuaJIT, along with its flexibility, have made Torch a very popular framework for practical deep learning applications; today its latest version, Torch7, is widely used by a number of big companies in the deep learning area, including Facebook, Google, and IBM, in their research labs as well as for their commercial applications.

Caffe

Caffe is another open-source deep learning framework (available at http://caffe.berkeleyvision.org) created by Yangqing Jia (2013), a PhD student at the University of California–Berkeley, and further developed by the Berkeley AI Research (BAIR) group. Caffe has multiple options to be used as a high-level scripting language, including the command line, Python, and MATLAB interfaces. The deep learning libraries in Caffe are written in the C++ programming language.

In Caffe, everything is done using text files instead of code. That is, to implement a network, generally we need to prepare two text files with the .prototxt extension that are communicated by the Caffe engine via JavaScript Object Notation (JSON) format.


The first text file, known as the architecture file, defines the architecture of the network layer by layer, where each layer is defined by a name, a type (e.g., data, convolution, output), the names of its previous (bottom) and next (top) layers in the architecture, and some required parameters (e.g., kernel size and stride for a convolutional layer). The second text file, known as the solver file, specifies the properties of the training algorithm, including the learning rate, maximum number of iterations, and processing unit (CPU or GPU) to be used for training the network.

While Caffe supports multiple types of deep network architectures, such as CNN and LSTM, it is particularly known as an efficient framework for image processing due to its incredible speed in processing image files. According to its developers, it is able to process over 60 million images per day (i.e., 1 ms/image) using a single NVIDIA K40 GPU. In 2017, Facebook released an improved version of Caffe called Caffe2 (www.caffe2.ai) with the aim of improving the original framework to be effectively used for deep learning architectures other than CNN and with a special emphasis on portability for performing cloud and mobile computations while maintaining scalability and performance.

TensorFlow

Another popular open-source deep learning framework is TensorFlow. It was originally developed and written in Python and C++ by the Google Brain group in 2011 as DistBelief, but it was further developed into TensorFlow in 2015. TensorFlow at this time is the only deep learning framework that, in addition to CPUs and GPUs, supports Tensor Processing Units (TPUs), a type of processor developed by Google in 2016 for the specific purpose of neural network machine learning. In fact, TPUs were specifically designed by Google for the TensorFlow framework.

Although Google has not yet made TPUs available to the market, it is reported that it has used them in a number of its commercial services such as Google Search, Street View, Google Photos, and Google Translate, with significant improvements reported. A detailed study performed by Google shows that TPUs deliver 30 to 80 times higher performance per watt than contemporary CPUs and GPUs (Sato, Young, and Patterson, 2017). For example, it has been reported (Ung, 2016) that in Google Photos an individual TPU can process over 100 million images per day (i.e., 0.86 ms/image). Such a unique feature will probably put TensorFlow well ahead of the alternative frameworks as soon as Google makes TPUs commercially available.

Another interesting feature of TensorFlow is its visualization module, TensorBoard. Implementing a deep neural network is a complex and confusing task. TensorBoard is a Web application with a handful of visualization tools for visualizing network graphs and plotting quantitative network metrics, with the aim of helping users to better understand what is going on during training procedures and to debug possible issues.

Theano

In 2007, the Deep Learning Group at the University of Montreal developed the initial version of a Python library, Theano (http://deeplearning.net/software/theano), to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays (i.e., tensors) on CPU or GPU platforms. Theano was one of the first deep learning frameworks and later became a source of inspiration for the developers of TensorFlow. Theano and TensorFlow both pursue a similar procedure in the sense that in both a typical network implementation involves two sections: in the first section, a computational graph is built by defining the network variables and the operations to be performed on them; the second section runs that graph (in Theano by compiling the graph into a function, and in TensorFlow by creating a session). In fact, what happens in these libraries is that the user defines the structure of the network by providing some simple and symbolic syntax understandable even for beginners in programming, and the library automatically generates appropriate code in either C (for processing on the CPU) or CUDA (for processing on the GPU) to implement the defined network. Hence, users without any knowledge of programming in C or CUDA and with just a minimum knowledge of Python are able to efficiently design and implement deep learning networks on GPU platforms.


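The two-section, build-then-run procedure can be illustrated with a tiny example that adds two matrices. The snippet below is a minimal sketch assuming the classic Theano API and the TensorFlow 1.x graph-and-session API described here; current TensorFlow versions execute eagerly and no longer require a session.

```python
# --- Theano: define a symbolic graph, then compile it into a callable function ---
import theano
import theano.tensor as T

x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y                                  # section 1: build the computational graph
add = theano.function([x, y], z)           # section 2: compile the graph into a function
print(add([[1, 2]], [[3, 4]]))             # [[4. 6.]]

# --- TensorFlow 1.x: define a graph, then run it inside a session ---
import tensorflow as tf                    # assumes a TensorFlow 1.x installation

a = tf.placeholder(tf.float32, shape=(1, 2))
b = tf.placeholder(tf.float32, shape=(1, 2))
c = a + b                                  # section 1: build the graph
with tf.Session() as sess:                 # section 2: run the graph
    print(sess.run(c, feed_dict={a: [[1, 2]], b: [[3, 4]]}))
```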

Theano also includes some built-in functions to visualize computational graphs as well as to plot network performance metrics, even though its visualization features are not comparable to TensorBoard's.

Keras: An Application Programming Interface

While all of the deep learning frameworks described here require users to be familiar with their own syntax (by reading their documentation) to successfully train a network, there are fortunately some easier, more user-friendly ways to do so. Keras (https://keras.io/) is an open-source neural network library written in Python that functions as a high-level application programming interface (API) and is able to run on top of various deep learning frameworks, including Theano and TensorFlow. In essence, Keras takes the key properties of the network building blocks (i.e., the types of layers, transfer functions, and optimizers) through an extremely simple syntax, automatically generates the corresponding syntax for one of the deep learning frameworks, and runs that framework in the backend. While Keras is efficient enough to build and run general deep learning models in just a few minutes, it does not provide several advanced operations offered by TensorFlow or Theano. Therefore, when dealing with special deep network models that require advanced settings, one still needs to use those frameworks directly instead of Keras (or other APIs such as Lasagne) as a proxy.
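As a minimal sketch of this simplicity, the following few lines define, compile, and train a small LSTM-based classifier with Keras; the layer sizes and the randomly generated training data are purely illustrative assumptions.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Illustrative data: 1,000 sequences, 20 time steps, 8 features each, binary labels
x_train = np.random.rand(1000, 20, 8)
y_train = np.random.randint(0, 2, size=(1000, 1))

model = Sequential()
model.add(LSTM(32, input_shape=(20, 8)))   # one LSTM layer with 32 units
model.add(Dense(1, activation='sigmoid'))  # output layer for binary classification

# Keras translates this specification into backend (e.g., TensorFlow) operations
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=32)
```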

SECTION 6.9 REVIEW QUESTIONS

1. Despite the relatively short history of deep learning implementations, why do you think there are several different computing frameworks for it?

2. Define CPU, NVIDIA, CUDA, and deep learning, and comment on the relationship between them.

3. List and briefly define the characteristics of different deep learning frameworks.
4. What is Keras, and how is it different from the other frameworks?

6.10 COGNITIVE COMPUTING

We are witnessing a significant acceleration in the way technology is evolving. Things that once took decades now take months, and the things that we used to see only in sci-fi movies are becoming reality, one after another. Therefore, it is safe to say that in the next decade or two, technological advancements will transform how people live, learn, and work in a rather dramatic fashion. The interactions between humans and technology will become intuitive, seamless, and perhaps transparent. Cognitive computing will have a significant role to play in this transformation. Generally speaking, cognitive computing refers to computing systems that use mathematical models to emulate (or partially simulate) the human cognition process to find solutions to complex problems and situations where the potential answers can be imprecise. While the term cognitive computing is often used interchangeably with AI and smart search engines, the phrase itself is closely associated with IBM's cognitive computer system Watson and its success on the television show Jeopardy! Details on Watson's success on Jeopardy! can be found in Application Case 6.8.

According to the Cognitive Computing Consortium (2018), cognitive computing makes a new class of problems computable. It addresses highly complex situations that are characterized by ambiguity and uncertainty; in other words, it handles the kinds of problems that are thought to be solvable only by human ingenuity and creativity. In today's dynamic, information-rich, and unstable situations, data tend to change frequently, and they often conflict. The goals of users evolve as they learn more and redefine their objectives. To respond to the fluid nature of users' understanding of their problems, the cognitive computing system offers a synthesis not just of information sources but also of influences, contexts, and insights. To achieve such a high level of performance, cognitive systems often need to weigh conflicting evidence and suggest an answer that is "best" rather than "right." Figure 6.36 illustrates a general framework for cognitive computing in which data and AI technologies are used to solve complex real-world problems.

How Does Cognitive Computing Work?

As one would guess from the name, cognitive computing works much like the human thought process, reasoning mechanism, and cognitive system. These cutting-edge computing systems can find and synthesize data from various information sources and weigh the context and conflicting evidence inherent in the data to provide the best possible answers to a given question or problem. To achieve this, cognitive systems include self-learning technologies that use data mining, pattern recognition, deep learning, and NLP to mimic the way the human brain works.

(The figure shows structured data (POS, transactions, OLAP, CRM, SCM, external sources) and unstructured data (social media, multimedia, IoT, literature) being combined with AI algorithms and software/hardware (machine learning, NLP, search, cloud, GPUs) to build, test, and validate solutions to complex problems (health, economic, humanitarian, social), producing outcomes such as saved lives, improved economy, better security, engaged customers, higher revenues, reduced risks, and improved living.)

FIGURE 6.36 Conceptual Framework for Cognitive Computing and Its Promises.


Using computer systems to solve the types of problems that humans are typically tasked with requires vast amounts of structured and unstructured data fed to machine-learning algorithms. Over time, cognitive systems are able to refine the way in which they learn and recognize patterns and the way they process data to become capable of anticipating new problems and modeling and proposing possible solutions.

To achieve those capabilities, cognitive computing systems must have the following key attributes as defined by the Cognitive Computing Consortium (2018):

• Adaptive: Cognitive systems must be flexible enough to learn as information changes and goals evolve. The systems must be able to digest dynamic data in real time and make adjustments as the data and environment change.

• Interactive: Human-computer interaction (HCI) is a critical component in cognitive systems. Users must be able to interact with cognitive machines and define their needs as those needs change. The technologies must also be able to interact with other processors, devices, and cloud platforms.

• Iterative and stateful: Cognitive computing technologies can also identify problems by asking questions or pulling in additional data if a stated problem is vague or incomplete. The systems do this by maintaining information about similar situations that have previously occurred.

• Contextual: Understanding context is critical in thought processes, so cognitive systems must understand, identify, and mine contextual data, such as syntax, time, location, domain, requirements, and a specific user's profile, tasks, or goals. Cognitive systems may draw on multiple sources of information, including structured and unstructured data and visual, auditory, or sensor data.

How Does Cognitive Computing Differ from AI?

Cognitive computing is often used interchangeably with AI, the umbrella term used for technologies that rely on data and scientific methods/computations to make (or help/support in making) decisions. But there are differences between the two terms, which can largely be found within their purposes and applications. AI technologies include—but are not limited to—machine learning, neural computing, NLP, and, most recently, deep learning. With AI systems, especially machine-learning systems, data are fed into the algorithm for processing (an iterative and time-demanding process that is often called training) so that the system "learns" the variables and the interrelationships among them and can produce predictions (or characterizations) about a given complex problem or situation. Applications based on AI and cognitive computing include intelligent assistants, such as Amazon's Alexa, Google Home, and Apple's Siri. A simple comparison between cognitive computing and AI is given in Table 6.3 (Reynolds and Feldman, 2014; CCC, 2018).

As can be seen in Table 6.3, the differences between AI and cognitive computing are rather marginal. This is expected because cognitive computing is often characterized as a subcomponent of AI or an application of AI technologies tailored for a specific purpose. AI and cognitive computing both utilize similar technologies and are applied to similar industry segments and verticals. The main difference between the two is the purpose: while cognitive computing is aimed at helping humans to solve complex problems, AI is aimed at automating processes that are performed by humans; at the extreme, AI is striving to replace humans with machines for tasks requiring "intelligence," one at a time.

In recent years, cognitive computing typically has been used to describe AI systems that aim to simulate the human thought process. Human cognition involves real-time analysis of the environment, context, and intent, among many other variables that inform a person's ability to solve problems. A number of AI technologies are required for a computer system to build cognitive models that mimic human thought processes, including machine learning, deep learning, neural networks, NLP, text mining, and sentiment analysis.


In general, cognitive computing is used to assist humans in their decision-making process. Some examples of cognitive computing applications include supporting medical doctors in their treatment of disease. IBM Watson for Oncology, for example, has been used at Memorial Sloan Kettering Cancer Center to provide oncologists with evidence-based treatment options for cancer patients. When medical staff input questions, Watson generates a list of hypotheses and offers treatment options for doctors to consider. Whereas AI relies on algorithms to solve a problem or to identify patterns hidden in data, cognitive computing systems have the loftier goal of creating algorithms that mimic the human brain's reasoning process to help humans solve an array of problems as the data and the problems constantly change.

In dealing with complex situations, context is important, and cognitive computing systems make context computable. They identify and extract context features such as time, location, task, history, or profile to present a specific set of information that is appropriate for an individual or for a dependent application engaged in a specific process at a specific time and place. According to the Cognitive Computing Consortium, they provide machine-aided serendipity by wading through massive collections of diverse information to find patterns and then apply those patterns to respond to the needs of the user at a particular moment. In a sense, cognitive computing systems aim at redefining the nature of the relationship between people and their increasingly pervasive digital environment. They may play the role of assistant or coach for the user, and they may act virtually autonomously in many problem-solving situations. The boundaries of the processes and domains these systems can affect are still elastic and emergent. Their output may be prescriptive, suggestive, instructive, or simply entertaining.

In the short time of its existence, cognitive computing has proved to be useful in many domains and complex situations and is evolving into many more. The typical use cases for cognitive computing include the following:

• Development of smart and adaptive search engines
• Effective use of natural language processing
• Speech recognition
• Language translation
• Context-based sentiment analysis
• Face recognition and facial emotion detection
• Risk assessment and mitigation
• Fraud detection and mitigation
• Behavioral assessment and recommendations

TABLE 6.3 Cognitive Computing versus Artificial Intelligence (AI)

Technologies used. Cognitive computing: machine learning, natural language processing, neural networks, deep learning, text mining, sentiment analysis. AI: machine learning, natural language processing, neural networks, deep learning.

Capabilities offered. Cognitive computing: simulates human thought processes to assist humans in finding solutions to complex problems. AI: finds hidden patterns in a variety of data sources to identify problems and provide potential solutions.

Purpose. Cognitive computing: augment human capability. AI: automate complex processes by acting like a human in certain situations.

Industries. Cognitive computing: customer service, marketing, healthcare, entertainment, service sector. AI: manufacturing, finance, healthcare, banking, securities, retail, government.



Cognitive analytics is a term that refers to cognitive computing–branded technology platforms, such as IBM Watson, that specialize in processing and analyzing large, unstructured data sets. Typically, word processing documents, e-mails, videos, images, audio files, presentations, Web pages, social media, and many other data formats need to be manually tagged with metadata before they can be fed into a traditional analytics engine and Big Data tools for computational analyses and insight generation. The principal benefit of utilizing cognitive analytics over those traditional Big Data analytics tools is that for cognitive analytics such data sets do not need to be pretagged. Cognitive analytics systems can use machine learning to adapt to different contexts with minimal human supervision. These systems can be equipped with a chatbot or search assistant that understands queries, explains data insights, and interacts with humans in human languages.

Cognitive Search

Cognitive search is the new generation of search methods that uses AI (advanced indexing, NLP, and machine learning) to return results that are much more relevant to users. Forrester defines cognitive search and knowledge discovery solutions as "a new generation of enterprise search solutions that employ AI technologies such as natural language processing and machine learning to ingest, understand, organize, and query digital content from multiple data sources" (Gualtieri, 2017). Cognitive search creates searchable information out of nonsearchable content by leveraging cognitive computing algorithms to create an indexing platform.

Searching for information is a tedious task. Although current search engines do a very good job of finding relevant information in a timely manner, their sources are limited to publicly available data over the Internet. Cognitive search proposes the next generation of search, tailored for use in enterprises. It is different from traditional search because, according to Gualtieri (2017), it:

• Can handle a variety of data types. Search is no longer just about unstructured text contained in documents and in Web pages. Cognitive search solutions can also accommodate structured data contained in databases and even nontraditional enterprise data such as images, video, audio, and machine-/sensor-generated logs from IoT devices.

• Can contextualize the search space. In information retrieval, the context is important. Context takes the traditional syntax-/symbol-driven search to a new level where it is defined by semantics and meaning.

• Employ advanced AI technologies. The distinguishing characteristic of cognitive search solutions is that they use NLP and machine learning to understand and organize data, predict the intent of the search query, improve the relevancy of results, and automatically tune the relevancy of results over time.

• Enable developers to build enterprise-specific search applications. Search is not just about a text box on an enterprise portal. Enterprises build search applications that embed search in customer 360 applications, pharma research tools, and many other business process applications. Virtual digital assistants such as Amazon Alexa, Google Now, and Siri would be useless without powerful searches behind the scenes. Enterprises wishing to build similar applications for their customers will also benefit from cognitive search solutions. Cognitive search solutions provide software development kits (SDKs), APIs, and/or visual design tools that allow developers to embed the power of the search engine in other applications.


Figure 6.37 shows the progressive evolution of search methods, from the good old keyword search to modern-day cognitive search, on two dimensions: ease of use and value proposition.

IBM Watson: Analytics at Its Best

IBM Watson is perhaps the smartest computer system built to date. Since the emergence of computers and subsequently AI in the late 1940s, scientists have compared the performance of these "smart" machines with human minds. Accordingly, in the mid- to late 1990s, IBM researchers built a smart machine and used the game of chess (generally credited as the game of smart humans) to test its ability against the best of human players. On May 11, 1997, an IBM computer called Deep Blue beat the world chess grandmaster after a six-game match series: two wins for Deep Blue, one for the champion, and three draws. The match lasted several days and received massive media coverage around the world. It was the classic plot line of human versus machine. Beyond the chess contest, the intention of developing this kind of computer intelligence was to make computers able to handle the kinds of complex calculations needed to help discover new drugs and to do the broad financial modeling needed to identify trends and do risk analysis, handle large database searches, and perform massive calculations needed in advanced fields of science.

After a couple of decades, IBM researchers came up with another idea that was perhaps more challenging: a machine that could not only play the American TV quiz show Jeopardy! but also beat the best of the best. Compared to chess, Jeopardy! is much more challenging. While chess is well structured and has very simple rules and therefore is a very good match for computer processing, Jeopardy! is neither simple nor structured. Jeopardy! is a game designed to test human intelligence and creativity. Therefore, a computer designed to play the game needed to be a cognitive computing system that can work and think like a human. Making sense of the imprecision inherent in human language was the key to success.

(The figure plots the progression from keyword search to semantic search, contextual search, cognitive search, and natural human interaction (NHI) along two dimensions, ease of use and value proposition, with each step adding indexing, NLP, and machine-learning capabilities.)

FIGURE 6.37 Progressive Evolution of Search Methods.


In 2010, an IBM research team developed Watson, an extraordinary computer system—a novel combination of advanced hardware and software—designed to answer questions posed in natural human language. The team built Watson as part of the DeepQA project and named it after IBM’s first president, Thomas J. Watson. The team that built Watson was looking for a major research challenge: one that could rival the scientific and popular interest of Deep Blue and would have clear relevance to IBM’s business interests. The goal was to advance computational science by exploring new ways for computer technology to affect science, business, and society at large. Accordingly, IBM research undertook a challenge to build Watson as a computer system that could compete at the human champion level in real time on Jeopardy! The team wanted to create a real-time automatic contestant on the show capable of listening, understanding, and responding, not merely a laboratory exercise. Application Case 6.8 provides some of the details on IBM Watson’s participation in the game show.

Application Case 6.8 IBM Watson Competes against the Best at Jeopardy!

In 2011, to test its cognitive abilities, Watson competed on the quiz show Jeopardy! in the first-ever human-versus-machine matchup for the show. In a two-game, combined-point match (broadcast in three Jeopardy! episodes during February 14–16), Watson beat Brad Rutter, the highest all-time money winner on Jeopardy!, and Ken Jennings, the record holder for the longest championship streak (75 days). In these episodes, Watson consistently outperformed its human opponents on the game's signaling device, but it had trouble responding to a few categories, notably those having short clues containing only a few words. Watson had access to 200 million pages of structured and unstructured content, consuming four terabytes of disk storage. During the game, Watson was not connected to the Internet.

Meeting the Jeopardy! challenge required advancing and incorporating a variety of text mining and NLP technologies, including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relationship detection, logical form generation, and knowledge representation and reasoning. Winning at Jeopardy! required accurately computing confidence in answers. The questions and content are ambiguous and noisy, and none of the individual algorithms is perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy! this confidence is used to determine whether the computer will "ring in" or "buzz in" for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between one and six seconds, with an average around three seconds.

Watson was an excellent example of the rapid advancement of computing technology and what it is capable of doing. Although still not as creatively/natively smart as human beings, computer systems like Watson are evolving to change the world we are living in, hopefully for the better.

Questions for Case 6.8

1. In your opinion, what are the most unique fea- tures about Watson?

2. In what other challenging games would you like to see Watson compete against humans? Why?

3. What are the similarities and differences between Watson’s and humans’ intelligence?

Sources: Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, D. Kalyanpur, A. Lally, J. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. (2010). "Building Watson: An Overview of the DeepQA Project." AI Magazine, 31(3), pp. 59–79; IBM Corporation. (2011). "The DeepQA Project." https://researcher.watson.ibm.com/researcher/view_group.php?id=2099 (accessed May 2018).



How Does Watson Do It?

What is under the hood of Watson? How does it do what it does? The system behind Watson, which is called DeepQA, is a massively parallel, text mining–focused, probabilistic evidence–based computational architecture. For the Jeopardy! challenge, Watson used more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique the IBM team used was how it combined them in DeepQA such that overlapping approaches could bring their strengths to bear and contribute to improvements in accuracy, confidence, and speed.

DeepQA is an architecture with an accompanying methodology that is not specific to the Jeopardy! challenge. These are the overarching principles in DeepQA:

• Massive parallelism. Watson needed to exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

• Many experts. Watson needed to be able to integrate, apply, and contextually evaluate a wide range of loosely coupled probabilistic questions and content analytics.

• Pervasive confidence estimation. No component of Watson committed to an answer; all components produced features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learned how to stack and combine the scores.

• Integration of shallow and deep knowledge. Watson needed to balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.

Figure 6.38 illustrates the DeepQA architecture at a very high level. More technical details about the various architectural components and their specific roles and capabilities can be found in Ferrucci et al. (2010).

What Is the Future for Watson?

The Jeopardy! challenge helped IBM address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, as well as a significant R&D budget, Watson managed to perform at human expert levels in terms of precision, confidence, and speed on the Jeopardy! quiz show.

(The figure depicts the DeepQA pipeline: a natural-language question is translated and decomposed through analysis; a primary search over answer and evidence sources generates candidate hypotheses; each hypothesis passes through soft filtering, support evidence retrieval, and deep evidence scoring; and the scored hypotheses are merged, ranked, and synthesized into a final answer with an associated level of confidence.)

FIGURE 6.38 A High-Level Depiction of DeepQA Architecture



After the show, the big question was "So what now?" Was developing Watson all for a quiz show? Absolutely not! Showing the rest of the world what Watson (and the cognitive system behind it) could do became an inspiration for the next generation of intelligent information systems. For IBM, it was a demonstration of what is possible with cutting-edge analytics and computational sciences. The message is clear: If a smart machine can beat the best of the best in humans at what they are the best at, think about what it can do for your organizational problems.

The innovative and futuristic technologies that made Watson one of the most acclaimed technological advances of this decade are being leveraged as the computational foundation for several tools to analyze and characterize unstructured data for prediction-type problems. These experimental tools include Tone Analyzer and Personality Insights. Using textual content, these tools have shown the ability to predict outcomes of complex social events and globally popular competitions.

WATSON PREDICTS THE WINNER OF 2017 EUROVISION SONG CONTEST. A tool developed on the foundations of IBM Watson, Watson Tone Analyzer, uses computational linguistics to identify tone in written text. Its broader goal is to have business managers use the Tone Analyzer to understand posts, conversations, and communications of target customer populations and to respond to their needs and wants in a timely manner. One could, for example, use this tool to monitor social media and other Web-based content, including wall posts, tweets, product reviews, and discussion boards as well as longer documents such as articles and blog posts. Or one could use it to monitor customer service interactions and support-related conversations. Although it may sound like any other text-based detection system built on sentiment analysis, Tone Analyzer differs from those systems in how it analyzes and characterizes textual content. Watson Tone Analyzer measures social tendencies and opinions, using a version of the Big-5, the five categories of personality traits (i.e., openness, agreeableness, conscientiousness, extroversion, and neuroticism), along with other emotional categories, to detect the tone in a given piece of text. As an example, Slowey (2017b) used IBM's Watson Tone Analyzer to predict the winner of the 2017 Eurovision Song Contest. Using nothing but the lyrics from the previous years' competitions, Slowey discovered a pattern suggesting that most winners had high levels of agreeableness and conscientiousness. The results (produced before the contest) indicated that Portugal would win the contest, and that is exactly what happened. Try it out yourself, either through the Web demo described below or programmatically, as sketched after the steps:

• Go to Watson Tone Analyzer (https://tone-analyzer-demo.ng.bluemix.net).
• Copy and paste your own text in the provided text entry field.
• Click "Analyze."
• Observe the summary results as well as the specific sentences where specific tones are the strongest.
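For readers who prefer to script the same analysis, the sketch below uses the ibm-watson Python SDK. The class names, version string, and authentication pattern are assumptions based on the SDK as the authors understand it, and the API key, service URL, and sample text are placeholders; verify everything against the current IBM Cloud documentation before use.

```python
# Hedged sketch (assumed ibm-watson SDK usage; verify names and parameters before use)
import json
from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('YOUR_API_KEY')            # placeholder credential
tone_analyzer = ToneAnalyzerV3(version='2017-09-21', authenticator=authenticator)
tone_analyzer.set_service_url('YOUR_SERVICE_URL')           # placeholder endpoint

text = "I went to a carwash yesterday. It cost $5 to wash my car."
result = tone_analyzer.tone(tone_input={'text': text},
                            content_type='application/json').get_result()
print(json.dumps(result, indent=2))                         # document- and sentence-level tones
```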

Another tool built on the linguistic foundations of IBM Watson is Watson Personality Insights, which works quite similarly to Watson Tone Analyzer. In another fun application case, Slowey (2017a) used Watson Personality Insights to predict the winner of the best picture category at the 2017 Oscar Academy Awards. Using the scripts of the movies from past years, Slowey developed a generalized profile for winners and then compared that profile to those of the newly nominated movies to identify the upcoming winner. Although in this case Slowey incorrectly predicted Hidden Figures as the winner, the methodology she followed was unique and innovative and hence deserves credit. To try the Watson Personality Insights tool yourself, just go to https://personality-insights-demo.ng.bluemix.net/, copy and paste your own textual content into the "Body of Text" section, and observe the outcome.


One of the worthiest endeavors for Watson (or Watson-like large-scale cognitive computing systems) is to help doctors and other medical professionals diagnose diseases and identify the best treatment options that would work for an individual patient. Although Watson is new, this very noble and worthy task is not new to the world of computing. In the early 1970s, several researchers at Stanford University developed a computer system, MYCIN, to identify bacteria causing severe infections, such as bacteremia and meningitis, and to recommend antibiotics with the dosage adjusted for the specifics of an individual patient (Buchanan and Shortliffe, 1984). This six-year effort relied on a rule-based expert system, a type of AI system, in which the diagnosis and treatment knowledge nuggets/rules were elicited from a large number of experts (i.e., doctors with ample experience in the specific medical domain). The resulting system was then tested on new patients, and its performance was compared to those of the experienced doctors used as the knowledge sources/experts. The results favored MYCIN, providing a clear indication that properly designed and implemented AI-based computer systems can meet and often exceed the effectiveness and efficiency of even the best medical experts. After more than four decades, Watson is now trying to pick up where MYCIN left off in the mission of using smart computer systems to improve the health and well-being of humans by helping doctors with the contextual information they need to diagnose and treat their patients better and more quickly.

The first industry targeted to utilize Watson was healthcare, followed by security, finance, retail, education, public services, and research. The following sections provide short descriptions of what Watson can do (and, in many cases, is doing) for these industries.

HEALTHCARE AND MEDICINE The challenges that healthcare is facing today are rather big and multifaceted. With the aging U.S. population, which may be partially attributed to better living conditions and advanced medical discoveries fueled by a variety of technological innovations, demand for healthcare services is increasing faster than the supply of resources. As we all know, when there is an imbalance between demand and supply, prices go up and quality suffers. Therefore, we need cognitive systems like Watson to help decision makers optimize the use of their resources in both clinical and managerial settings.

According to healthcare experts, only 20 percent of the knowledge that physicians use to diagnose and treat patients is evidence based. Considering that the amount of medical information available is doubling every five years and that much of these data are unstructured, physicians simply do not have time to read every journal that can help them keep up-to-date with the latest advances. Given the growing demand for services and the complexity of medical decision making, how can healthcare providers address these problems? The answer could be to use Watson or similar cognitive systems that have the ability to help physicians in diagnosing and treating patients by analyzing large amounts of data—both structured data coming from electronic medical record databases and unstructured text coming from physician notes and published literature—to provide evidence for faster and better decision making. First, the physician and the patient can describe symptoms and other related factors to the system in natural language. Watson can then identify the key pieces of information and mine the patient's data to find relevant facts about family history, current medications, and other existing conditions. It can then combine that information with current findings from tests and then form and test hypotheses for potential diagnoses by examining a variety of data sources—treatment guidelines, electronic medical record data, doctors' and nurses' notes, and peer-reviewed research and clinical studies. Next, Watson can suggest potential diagnostics and treatment options with a confidence rating for each suggestion.


Watson also has the potential to transform healthcare by intelligently synthesizing fragmented research findings published in a variety of outlets. It can dramatically change the way medical students learn. It can help healthcare managers to be proactive about upcoming demand patterns, optimally allocate resources, and improve processing of payments. Early examples of leading healthcare providers that use Watson-like cognitive systems include MD Anderson, the Cleveland Clinic, and Memorial Sloan Kettering.

SECURITY As the Internet expands into every facet of our lives—e-commerce, e-business, smart grids for energy, smart homes for remote control of residential gadgets and appliances—to make things easier to manage, it also opens up the potential for ill-intended people to intrude in our lives. We need smart systems like Watson that are capable of constantly monitoring for abnormal behavior and, when it is identified, preventing people from accessing our lives and harming us. This could be at the corporate or even national security system level; it could also be at the personal level. Such a smart system could learn who we are and become a digital guardian that could make inferences about activities related to our life and alert us whenever abnormal things happen.

FINANCE The financial services industry faces complex challenges. Regulatory measures as well as social and governmental pressures for financial institutions to be more inclusive have increased. And the customers the industry serves are more empowered, demanding, and sophisticated than ever before. With so much financial information generated each day, it is difficult to properly harness the appropriate information on which to act. Perhaps the solution is to create smarter client engagement by better understanding risk profiles and the operating environment. Major financial institutions are already working with Watson to infuse intelligence into their business processes. Watson is tackling data-intensive challenges across the financial services sector, including banking, financial planning, and investing.

RETAIL The retail industry is changing rapidly in response to customers' needs and wants. Empowered by mobile devices and social networks that give them easier access to more information faster than ever before, customers have high expectations for products and services. While retailers are using analytics to keep up with those expectations, their bigger challenge is efficiently and effectively analyzing the growing mountain of real-time insights that could give them a competitive advantage. Watson's cognitive computing capabilities for analyzing massive amounts of unstructured data can help retailers reinvent their decision-making processes around pricing, purchasing, distribution, and staffing. Because of its ability to understand and answer questions in natural language, Watson is an effective and scalable solution for analyzing and responding to social sentiment based on data obtained from social interactions, blogs, and customer reviews.

EDUCATION Given the rapidly changing characteristics of students, who are more visually oriented and stimulated, constantly connected to social media and social networks, and have increasingly shorter attention spans, what should the future of education and the classroom look like? The next generation of educational systems should be tailored to fit the needs of the new generation with customized learning plans, personalized textbooks (digital ones with integrated multimedia, including audio, video, and animated graphs/charts), dynamically adjusted curricula, and perhaps smart digital tutors and 24/7 personal advisors. Watson seems to have what it takes to make all this happen. With its NLP capability, students can converse with it just as they do with their teachers, advisors, and friends. This smart assistant can answer students' questions, satisfy their curiosity, and help them stay on track throughout their educational journey.

GOVERNMENT For local, regional, and national governments, the exponential rise of Big Data presents an enormous dilemma. Today's citizens are more informed and empowered than ever before, and that means they have high expectations for the value of the public sector serving them. And government organizations can now gather enormous volumes of unstructured, unverified data that could serve their citizens, but only if those data can be analyzed efficiently and effectively. IBM Watson's cognitive computing may help make sense of this data deluge, speeding governments' decision-making processes and helping public employees to focus on innovation and discovery.

RESEARCH Every year, hundreds of billions of dollars are spent on research and development, most of it documented in patents and publications, creating an enormous amount of unstructured data. To contribute to the extant body of knowledge, one needs to sift through these data sources to find the outer boundaries of research in a particular field. This is very difficult, if not impossible, work if it is done with traditional means, but Watson can act as a research assistant to help collect and synthesize information to keep people updated on recent findings and insights. For instance, the New York Genome Center is using the IBM Watson cognitive computing system to analyze the genomic data of patients diagnosed with a highly aggressive and malignant brain cancer and to more rapidly deliver personalized, life-saving treatment to patients with this disease (Royyuru, 2014).

SECTION 6.10 REVIEW QUESTIONS

1. What is cognitive computing, and how does it differ from other computing paradigms?
2. Draw a diagram and explain the conceptual framework of cognitive computing. Make sure to include inputs, enablers, and expected outcomes in your framework.
3. List and briefly define the key attributes of cognitive computing.
4. How does cognitive computing differ from ordinary AI techniques?
5. What are the typical use cases for cognitive analytics?
6. Explain what the terms cognitive analytics and cognitive search mean.
7. What is IBM Watson and what is its significance to the world of computing?
8. How does Watson work?
9. List and briefly explain five use cases for IBM Watson.

Chapter Highlights

• Deep learning is among the latest trends in AI that come with great expectations.

• The goal of deep learning is similar to that of the other machine-learning methods, which is to use sophisticated mathematical algorithms to learn from data in a way that resembles how humans learn.

• What deep learning has added to the classic machine-learning methods is the ability to automatically acquire the features required to accomplish highly complex and unstructured tasks.

• Deep learning belongs to the representation learning family of methods within AI.

• The recent emergence and popularity of deep learning can largely be attributed to very large data sets and rapidly advancing computing infrastructures.

• Artificial neural networks emulate the way the human brain works. The basic processing unit is a neuron. Multiple neurons are grouped into layers and linked together.

• In a neural network, knowledge is stored in the weights associated with the connections between neurons.

• Backpropagation is the most popular learning paradigm of feedforward neural networks.

• An MLP-type neural network consists of an input layer, an output layer, and a number of hidden layers. The nodes in one layer are connected to the nodes in the next layer.

• Each node at the input layer typically represents a single attribute that may affect the prediction.

• The usual process of learning in a neural network involves three steps: (1) compute temporary outputs based on the inputs and the current (initially random) weights, (2) compare the computed outputs with the desired targets to measure the error, and (3) adjust the weights and repeat the process (a minimal code sketch of this loop, including a validation set to control overfitting, appears after this list).

• Developing neural network–based systems requires a step-by-step process. It includes data preparation and preprocessing, training and testing, and conversion of the trained model into a production system.

• Neural network software allows for easy experimentation with many models. Although neural network modules are included in all major data mining software tools, specific neural network packages are also available.

• Neural network applications abound in almost all business disciplines as well as in virtually all other functional areas.

• Overfitting occurs when neural networks are trained for a large number of iterations with relatively small data sets. To prevent overfitting, the training process is controlled by an assessment process using a separate validation data set.

• Neural networks are known as black-box models. Sensitivity analysis is often used to shed light into the black box to assess the relative importance of input features.

• Deep neural networks broke the generally accepted notion that "no more than two hidden layers are needed to formulate complex prediction problems." They increase the number of hidden layers to arbitrarily large values to better represent the complexity in the data set.

• MLP deep networks, also known as deep feedforward networks, are the most general type of deep networks.

• The impact of random weights on the learning process of deep MLP is shown to be a significant issue. Nonrandom assignment of the initial weights seems to significantly improve the learning process in deep MLP.

• Although there is no generally accepted theoretical basis for this, it is believed and empirically shown that in deep MLP networks, multiple layers perform better and converge faster than fewer layers with many neurons.

• CNNs are arguably the most popular and most successful deep learning methods.

• CNNs were initially designed for computer vision applications (e.g., image processing, video processing, text recognition) but have also been shown to be applicable to non-image and non-text data sets.

• The main characteristic of convolutional networks is having at least one layer that uses a convolution weight function instead of general matrix multiplication.

• The convolution function is a method to address the issue of having too many network weight parameters by introducing the notion of parameter sharing.

• In CNN, a convolution layer is often followed by another layer known as the pooling (a.k.a. subsampling) layer. The purpose of a pooling layer is to consolidate elements in the input matrix in order to produce a smaller output matrix while maintaining the important features (a small illustrative sketch of a convolution-plus-pooling stack appears after this list).

• ImageNet is an ongoing research project that provides researchers with a large database of images, each linked to a set of synonym words (known as a synset) from WordNet (a word hierarchy database).

• AlexNet is one of the first convolutional networks designed for image classification using the ImageNet data set. Its success rapidly popularized the use and reputation of CNNs.

• GoogLeNet (a.k.a. Inception), a deep convolutional network architecture designed by Google researchers, was the winning architecture at ILSVRC 2014.

• Google Lens is an app that uses deep learning artificial neural network algorithms to deliver information about the nearby objects that users capture in images.

• Google's word2vec project remarkably increased the use of CNN-type deep learning for text mining applications.

• RNN is another deep learning architecture designed to process sequential inputs.

• RNNs have memory to remember previous information when determining context-specific, time-dependent outcomes.

• A variation of RNN, the LSTM network is today known as the most effective sequence-modeling technique and is the basis of many practical applications.

• Two emerging LSTM applications are Google Neural Machine Translator and Microsoft Skype Translator.

• Deep learning implementation frameworks include Torch, Caffe, TensorFlow, Theano, and Keras.

• Cognitive computing makes a new class of problems computable by addressing highly complex situations that are characterized by ambiguity and uncertainty; in other words, it handles the kinds of problems that are thought to be solvable by human ingenuity and creativity.

• Cognitive computing finds and synthesizes data from various information sources and weighs the context and conflicting evidence inherent in the data in order to provide the best possible answers to a given question or problem.

• The key attributes of cognitive computing include being adaptive, interactive, iterative, stateful, and contextual.

• Cognitive analytics is a term that refers to cognitive computing–branded technology platforms, such as IBM Watson, that specialize in the processing and analysis of large unstructured data sets.

• Cognitive search is a new generation of search that uses AI (advanced indexing, NLP, and machine learning) to return results that are much more relevant to the user than those of traditional search methods.

• IBM Watson is perhaps the smartest computer system built to date. It coined and popularized the term cognitive computing.

• IBM Watson beat the best human players (the two most successful competitors) at the quiz game Jeopardy!, showcasing the ability of computers to perform tasks that are designed for human intelligence.

• Watson and systems like it are now in use in many application areas including healthcare, finance, security, and retail.
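To make some of the points above more concrete (the layered MLP structure, the three-step learning loop, and validation-based overfitting control), the following is a minimal sketch using the Keras library discussed in this chapter. The synthetic data, layer sizes, and training settings are illustrative assumptions, not a recommended configuration.

# A minimal sketch (not from the chapter) of an MLP-type network in Keras.
# Inputs flow through hidden layers to an output node, the weights are
# adjusted iteratively from the prediction error, and a held-out validation
# set is used to stop training before overfitting sets in.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic tabular data: 1,000 rows, 10 input attributes, binary target
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype("float32")  # a simple, learnable rule

# MLP: one node per input attribute, two hidden layers, one output node
model = keras.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),    # hidden layer 1
    layers.Dense(8, activation="relu"),     # hidden layer 2
    layers.Dense(1, activation="sigmoid"),  # output layer (classification)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs the three-step loop (forward pass, error computation, weight
# update); validation_split holds out data to monitor overfitting, and
# EarlyStopping halts training when validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=100,
                    batch_size=32, callbacks=[early_stop], verbose=0)
print("Stopped after", len(history.history["loss"]), "epochs")

The validation split and the early-stopping callback together play the role of the separate assessment process described in the overfitting bullet above.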
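Similarly, the convolution and pooling bullets can be illustrated with a small, assumed Keras stack (28 × 28 grayscale inputs and 10 output classes are arbitrary choices, not a prescribed architecture). Each convolution layer applies a shared set of weights across the whole image, and each pooling layer shrinks the resulting feature maps while keeping their salient values.

# An illustrative sketch of the convolution + pooling idea summarized above.
from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # small grayscale images (assumed)
    layers.Conv2D(8, kernel_size=3, activation="relu"),   # convolution with shared weights
    layers.MaxPooling2D(pool_size=2),                     # pooling: each 2x2 block -> 1 value
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # e.g., 10 image classes (assumed)
])
cnn.summary()  # shows each feature map shrinking after every pooling layer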

Key Terms

activation function, artificial intelligence (AI), artificial neural networks (ANN), backpropagation, black-box syndrome, Caffe, cognitive analytics, cognitive computing, cognitive search, connection weight, constant error carousel (CEC), convolution function, convolutional neural network (CNN), deep belief network (DBN), deep learning, deep neural network, DeepQA, Google Lens, GoogLeNet, Google Neural Machine Translator (GNMT), graphics processing unit (GPU), hidden layer, IBM Watson, ImageNet, Keras, long short-term memory (LSTM), machine learning, Microsoft Skype Translator, multilayer perceptron (MLP), MYCIN, network structure, neural network, neuron, overfitting, perceptron, performance function, pooling, processing element (PE), recurrent neural network (RNN), representation learning, sensitivity analysis, stochastic gradient descent (SGD), summation function, supervised learning, TensorFlow, Theano, threshold value, Torch, transfer function, word embeddings, word2vec

Questions for Discussion

1. What is deep learning? What can deep learning do that traditional machine-learning methods cannot?
2. List and briefly explain different learning paradigms/methods in AI.
3. What is representation learning, and how does it relate to machine learning and deep learning?
4. List and briefly describe the most commonly used ANN activation functions.
5. What is MLP, and how does it work? Explain the function of summation and activation weights in MLP-type ANN.
6. List and briefly describe the nine-step process in conducting a neural network project.
7. Draw and briefly explain the three-step process of learning in ANN.
8. How does the backpropagation learning algorithm work?
9. What is overfitting in ANN learning? How does it happen, and how can it be prevented?
10. What is the so-called black-box syndrome? Why is it important to be able to explain an ANN's model structure?
11. How does sensitivity analysis work in ANN? Search the Internet to find other methods to explain ANN methods.
12. What is meant by "deep" in deep neural networks? Compare deep neural networks to shallow neural networks.
13. What is GPU? How does it relate to deep neural networks?
14. How does a feedforward multilayer perceptron–type deep network work?
15. Comment on the impact of random weights in developing deep MLP.
16. Which strategy is better: more hidden layers versus more neurons?
17. What is CNN?
18. For what type of applications can CNN be used?
19. What is the convolution function in CNN, and how does it work?
20. What is pooling in CNN? How does it work?
21. What is ImageNet, and how does it relate to deep learning?
22. What is the significance of AlexNet? Draw and describe its architecture.
23. What is GoogLeNet? How does it work?
24. How does CNN process text? What are word embeddings, and how do they work?
25. What is word2vec, and what does it add to traditional text mining?
26. What is RNN? How does it differ from CNN?
27. What is the significance of context, sequence, and memory in RNN?
28. Draw and explain the functioning of a typical recurrent neural network unit.
29. What is an LSTM network, and how does it differ from RNNs?
30. List and briefly describe three different types of LSTM applications.
31. How do Google's Neural Machine Translation and Microsoft Skype Translator work?
32. Despite its short tenure, why do you think deep learning already has several different computing frameworks for implementation?
33. Define and comment on the relationship between GPU, NVIDIA, CUDA, and deep learning.
34. List and briefly define the characteristics of different deep learning frameworks.
35. What is Keras, and how does it differ from other frameworks?
36. What is cognitive computing, and how does it differ from other computing paradigms?
37. Draw a diagram and explain the conceptual framework of cognitive computing. Make sure to include inputs, enablers, and expected outcomes in your framework.
38. List and briefly define the key attributes of cognitive computing.
39. How does cognitive computing differ from ordinary AI techniques?
40. What are the typical use cases for cognitive analytics?
41. What is cognitive analytics? What is cognitive search?
42. What is IBM Watson, and what is its significance to the world of computing?
43. How does IBM Watson work?
44. List and briefly explain five use cases for IBM Watson.

Exercises

Teradata University Network (TUN) and Other Hands-On and Internet Exercises

1. Go to the Teradata University Network Web site (teradatauniversitynetwork.com). Search for teaching and learning materials (e.g., articles, application cases, white papers, videos, exercises) on deep learning, cognitive computing, and IBM Watson. Read the material you have found. If needed, also conduct a search on the Web to enhance your findings. Write a report on your findings.

2. Deep learning is relatively new to the world of analytics. Its application cases and success stories are just starting to emerge on the Web. Conduct a comprehensive search on your school's digital library resources to identify at least five journal articles where interesting deep learning applications are described. Write a report on your findings.

3. Most of the applications of deep learning today are developed using R- and/or Python-based open-source computing resources. Identify those resources (frameworks such as Torch, Caffe, TensorFlow, Theano, Keras) available for building deep learning models and applications. Compare and contrast their capabilities and limitations. Based on your findings and understanding of these resources, if you were to develop a deep learning application, which one would you choose to employ? Explain and justify/defend your choice.

4. Cognitive computing has become a popular term to define and characterize the extent of the ability of machines/computers to show "intelligent" behavior. Thanks to IBM Watson and its success on Jeopardy!, cognitive computing and cognitive analytics are now part of many real-world intelligent systems. In this exercise, identify at least three application cases where cognitive computing was used to solve complex real-world problems. Summarize your findings in a professionally organized report.

5. Download the KNIME analytics platform, one of the most popular free/open-source software tools, from knime.org. Identify the deep learning examples (where Keras is used to build some exemplary prediction/classification models) in its example folder. Study the models in detail. Understand what each does and how exactly it does it. Then, using a different but similar data set, build and test your own deep learning prediction model. Report your findings and experiences in a written document.

6. Search for articles related to "cognitive search." Identify at least five pieces of written material (a combination of journal articles, white papers, blog posts, application cases, etc.). Read and summarize your findings. Explain your understanding of cognitive search and how it differs from regular search methods.

7. Go to Teradata.com. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

8. Go to SAS.com. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

9. Go to IBM.com. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

10. Go to TIBCO.com or some other advanced analytics company Web site. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

References

Abadi, M., P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, . . . M. Isard. (2016). “TensorFlow: A System for Large-Scale Machine Learning.” OSDI, 16, pp. 265–283.

Altman, E. I. (1968). “Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy.” The Journal of Finance, 23(4), pp. 589–609.

Bahdanau, D., K. Cho, & Y. Bengio. (2014). “Neural Machine Translation by Jointly Learning to Align and Translate.” ArXiv Preprint ArXiv:1409.0473.

Bengio, Y. (2009). “Learning Deep Architectures for AI.” Foundations and Trends® in Machine Learning, 2(1), pp. 1–127.

Bergstra, J., O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, . . . Y. Bengio. (2010). “Theano: A CPU and GPU Math Compiler in Python.” Proceedings of the Ninth Python in Science Conference, Vol. 1.

Bi, R. (2014). “When Watson Meets Machine Learning.” www.kdnuggets.com/2014/07/watson-meets-machine-learning.html (accessed June 2018).

Boureau, Y.-L., N. Le Roux, F. Bach, J. Ponce, & Y. LeCun (2011). “Ask the Locals: Multi-Way Local Pooling for Image Recognition.” Proceedings of the International Com- puter Vision (ICCV’11) IEEE International Conference, pp. 2651–2658.

Boureau, Y.-L., J. Ponce, & Y. LeCun. (2010). “A Theoretical Analysis of Feature Pooling in Visual Recognition.” Pro- ceedings of International Conference on Machine Learn- ing (ICML’10), pp. 111–118.

Buchanan, B. G., & E. H. Shortliffe. (1984). Rule Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley.

Cognitive Computing Consortium. (2018). https://cognitivecomputingconsortium.com/resources/cognitive-computing-defined/#1467829079735-c0934399-599a (accessed July 2018).

Chen, T., M. Li, Y. Li, M. Lin, N. Wang, M. Wang, . . . Z. Zhang. (2015). “Mxnet: A Flexible and Efficient Machine Learn- ing Library for Heterogeneous Distributed Systems.” ArXiv Preprint ArXiv:1512.01274.

Collobert, R., K. Kavukcuoglu, & C. Farabet. (2011). “Torch7: A Matlab-like Environment for Machine Learning.” Big- Learn, NIPS workshop.

Cybenko, G. (1989). “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals and Systems, 2(4), 303–314.

DeepQA. (2011). “DeepQA Project: FAQ, IBM Corporation.” https://researcher.watson.ibm.com/researcher/ view_group.php?id=2099 (accessed May 2018).

Delen, D., R. Sharda, & M. Bessonov. (2006). “Identifying Significant Predictors of Injury Severity in Traffic Accidents Using a Series of Artificial Neural Networks.” Accident Analysis & Prevention, 38(3), 434–444.

Denyer, S. (2018, January). “Beijing Bets on Facial Recognition in a Big Drive for Total Surveillance.” The Washington Post. https://www.washingtonpost.com/news/world/wp/2018/01/07/feature/in-china-facial-recognition-is-sharp-end-of-a-drive-for-total-surveillance/?noredirect=on&utm_term=.e73091681b31.

Feldman, S., J. Hanover, C. Burghard, & D. Schubmehl. (2012). “Unlocking the Power of Unstructured Data.” IBM White Paper. http://www-01.ibm.com/software/ebusiness/jstart/downloads/unlockingUnstructuredData.pdf (accessed May 2018).

Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, & C. Welty. (2010). “Building Watson: An Overview of the DeepQA Project.” AI Magazine, 31(3), pp. 59–79.

Goodfellow, I., Y. Bengio, & A. Courville. (2016). “Deep Learning.” Cambridge, MA: MIT Press.

Goodfellow, I. J., D. Warde-Farley, P. Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, . . . Y. Bengio. (2013). “Pylearn2: A Machine Learning Research Library.” ArXiv Preprint ArX- iv:1308.4214.

Graves, A. (2013). “Generating Sequences with Recurrent Neural Networks.” ArXiv Preprint ArXiv:1308.0850.

Graves, A., & N. Jaitly. (2014). “Towards End-to-End Speech Recognition with Recurrent Neural Networks.” Proceed- ings on International Conference on Machine Learning, pp. 1764–1772.

Graves, A., N. Jaitly, & A. Mohamed. (2013). “Hybrid Speech Recognition with Deep Bidirectional LSTM.” IEEE Work- shop on Automatic Speech Recognition and Understand- ing, pp. 273–278.

Graves, A., A. Mohamed, & G. Hinton. (2013). “Speech Recog- nition with Deep Recurrent Neural Networks.” IEEE Acous- tics, Speech and Signal Processing (ICASSP) International Conference, pp. 6645–6649.

Graves, A., & J. Schmidhuber. (2009). “Offline Handwriting Recognition with Multidimensional Recurrent Neural Net- works.” Advances in Neural Information Processing Sys- tems. Cambridge, MA: MIT Press, pp. 545–552.

Gualtieri, M. (2017). “Cognitive Search Is the AI Version of Enterprise Search, Forrester.” go.forrester.com/blogs/17-06-12-cognitive_search_is_the_ai_version_of_enterprise_search/ (accessed July 2018).

Haykin, S. S. (2009). Neural Networks and Learning Machines, 3rd ed. Upper Saddle River, NJ: Prentice Hall.

He, K., X. Zhang, S. Ren, & J. Sun. (2015). “Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.” Proceedings of the IEEE Interna- tional Conference on Computer Vision, pp. 1026–1034.

Hinton, G. E., S. Osindero, & Y.-W. Teh. (2006). “A Fast Learn- ing Algorithm for Deep Belief Nets.” Neural Computation, 18(7), 1527–1554.

Hochreiter, S., & J. Schmidhuber (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735–1780.

Hornik, K. (1991). “Approximation Capabilities of Multilayer Feedforward Networks.” Neural Networks, 4(2), 251–257.

IBM. (2011). “IBM Watson.” www.ibm.com/watson/ (ac- cessed July 2017).

Jia, Y. (2013). “Caffe: An Open Source Convolutional Architec- ture for Fast Feature Embedding.” http://Goo.Gl/Fo9YO8 (accessed June 2018).

Jia, Y., E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, . . . T. Darrell. (2014). “Caffe: Convolutional Architecture for Fast Feature Embedding.” Proceedings of the ACM International Conference on Multimedia, pp. 675–678.

Keysers, D., T. Deselaers, H. A. Rowley, L.-L. Wang, & V.  Carbune. (2017). “Multi-Language Online Handwriting Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), pp. 1180–1194.

Krizhevsky, A., I. Sutskever, & G. Hinton. (2012). “Imagenet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems, pp. 1097–1105.

Kumar, S. (2017). “A Survey of Deep Learning Methods for Relation Extraction.” http://arxiv.org/abs/1705.03645. (accessed June 2018)

LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, & L. D. Jackel. (1989). “Backpropagation Ap- plied to Handwritten ZIP Code Recognition.” Neural Com- putation, 1(4), 541–551.

Liang, X., X. Shen, J. Feng, L. Lin, & S. Yan. (2016). “Seman- tic Object Parsing with Graph LSTM.” European Con- ference on Computer Vision. New York, NY: Springer, pp. 125–143.

Mahajan, D., R. Girshick, V. Ramanathan, M. Paluri, & L. van der Maaten. (2018). “Advancing State-of-the-Art Image Recognition with Deep Learning on Hashtags.” https://code.facebook.com/posts/1700437286678763/advancing-state-of-the-art-image-recognition-with-deep-learning-on-hashtags/ (accessed June 2018).

Mikolov, T., K. Chen, G. Corrado, & J. Dean. (2013). “Effi- cient Estimation of Word Representations in Vector Space.” ArXiv Preprint ArXiv:1301.3781.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, & J. Dean. (2013). “Distributed Representations of Words and Phrases and Their Compositionality” Advances in Neural Informa- tion Processing Systems, pp. 3111–3119.

Mintz, M., S. Bills, R. Snow, & D. Jurafsky. (2009). “Distant Supervision for Relation Extraction Without Labeled Data.” Proceedings of the Joint Conference of the Forty-Seventh Annual Meeting of the Association for Computational Lin- guistics and the Fourth International Joint Conference on Natural Language Processing of the AFNLP, Vol. 2, pp. 1003–1011.

Mozur, P. (2018, June 8). “Inside China’s Dystopian Dreams: A.I., Shame and Lots of Cameras.” The New York Times.

Nguyen, T. H., & R. Grishman. (2015). “Relation Extraction: Perspective from Convolutional Neural Networks.” Pro- ceedings of the First Workshop on Vector Space Modeling for Natural Language Processing, pp. 39–48.

Olson, D. L., D. Delen, and Y. Meng. (2012). “Comparative Analysis of Data Mining Models for Bankruptcy Predic- tion.” Decision Support Systems, 52(2), pp. 464–473.

Principe, J. C., N. R. Euliano, and W. C. Lefebvre. (2000). Neu- ral and Adaptive Systems: Fundamentals Through Simula- tions. New York: Wiley.

Reynolds, H., & S. Feldman. (2014, July/August). “Cognitive Computing: Beyond the Hype.” KM World, 23(7), p. 21.

Riedel, S., L. Yao, & A. McCallum. (2010). “Modeling Rela- tions and Their Mentions Without Labeled Text.” Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases., New York, NY: Springer, pp. 148–163

Robinson, A., J. Levis, & G. Bennett. (2010, October). “Informs to Officially Join Analytics Movement.” ORMS Today.

Royyuru, A. (2014). “IBM’s Watson Takes on Brain Cancer: Analyzing Genomes to Accelerate and Help Clinicians Per- sonalize Treatments.” Thomas J. Watson Research Center, www.research.ibm.com/articles/genomics.shtml (ac- cessed September 2014).

Rumelhart, D. E., G. E. Hinton, & R. J. Williams. (1986). “Learn- ing Representations by Back-Propagating Errors.” Nature, 323(6088), pp. 533.

Russakovsky, O., J. Deng, H. Su, J. Krause, S. Satheesh, S.  Ma,  .  .  . M. Bernstein. (2015). “Imagenet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision, 115(3), 211–252.

Sato, K., C. Young, & D. Patterson. (2017). “An In-Depth Look at Google’s First Tensor Processing Unit (TPU).” https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu (accessed June 2018).

Scherer, D., A. Müller, & S. Behnke. (2010). “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition.” International Conference on Artificial Neural Networks., New York, NY: Springer, 92–101.

Slowey, L. (2017a, January 25). “Winning the Best Picture Oscar: IBM Watson and Winning Predictions.” https://www.ibm.com/blogs/internet-of-things/best-picture-oscar-watson-predicts/ (accessed August 2018).

Slowey, L. (2017b, May 10). “Watson Predicts the Winners: Eurovision 2017.” https://www.ibm.com/blogs/internet-of-things/eurovision-watson-tone-predictions/ (accessed August 2018).

Sutskever, I., O. Vinyals, & Q. V. Le. (2014). “Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems, pp. 3104–3112.

Ung, G. M. (2016, May). “Google’s Tensor Processing Unit Could Advance Moore’s Law 7 Years into the Future.” PCWorld. https://www.pcworld.com/article/3072256/google-io/googles-tensor-processing-unit-said-to-advance-moores-law-seven-years-into-the-future.html (accessed July 2018).

Vinyals, O., L. Kaiser, T. Koo, S. Petrov, I. Sutskever, & G. Hinton. (2015). “Grammar As a Foreign Language.” Advances in Neural Information Processing Systems, pp. 2773–2781.

Vinyals, O., A. Toshev, S. Bengio, & D. Erhan. (2015). “Show and Tell: A Neural Image Caption Generator.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164.

Vinyals, O., A. Toshev, S. Bengio, & D. Erhan. (2017). “Show and Tell: Lessons Learned from the 2015 MSCOCO Image Cap- tioning Challenge.” Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652–663.

Wilson, R. L., & R. Sharda. (1994). “Bankruptcy Prediction Using Neural Networks.” Decision Support Systems, 11(5), 545–557.

Wu, Y., M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, & K. Macherey. (2016). “Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation.” ArXiv Preprint ArXiv:1609.08144.

Xu, K., J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, & Y. Bengio. (2015). “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” Proceedings of the Thirty-Second International Conference on Machine Learning, pp. 2048–2057.

Zeng, D., K. Liu, S. Lai, G. Zhou, & J. Zhao. (2014). “Relation Classification via Convolutional Deep Neural Network.” http://aclweb.org/anthology/C/C14/C14-1220.pdf (accessed June 2018).

Zhou, Y.-T., R. Chellappa, A. Vaid, & B. K. Jenkins. (1988). “Image Restoration Using a Neural Network.” IEEE Trans- actions on Acoustics, Speech, and Signal Processing, 36(7), pp. 1141–1151.
