diff --git a/other/SVM-proj_example.html b/other/SVM-proj_example.html
new file mode 100644
index 0000000..fcf9b3f
--- /dev/null
+++ b/other/SVM-proj_example.html
@@ -0,0 +1,837 @@

Final Project: Support Vector Machines (SVM)
+

Introduction

+

Welcome to this tutorial on support vector machines (SVM)! The SVM algorithm is a powerful and flexible approach to modeling a variety of real-world problems, and we have written this tutorial to give you everything you need to start applying the approach today.

+

Who are we? We are two graduate students at the School of Data Science at the University of Virginia. Both of us have a strong interest in SVMs and their applications, and we want to share that enthusiasm and knowledge with other students and professionals.

+

But more importantly, who are you? You are probably an undergraduate, graduate, or even a new professional interested in expanding your knowledge of machine learning and data science. You’ve heard of some of the basic approaches, like linear, logistic, and lasso or ridge regression, and you’re familiar with some of the basic concepts, like cross-validation, the types of machine learning problems (classification and regression, etc.), and maybe the bias-variance trade-off. You haven’t seen SVMs before and you’re intrigued, or maybe you’ve tried other tutorials but found your head spinning with mathematical jargon: Primal problem. Dual problem. Kernel trick. Hilbert space.

+

Good news for you! This tutorial has been designed for readers who are more interested in conceptual understanding and practical applications than deep mathematical derivations, and we will focus on developing an intuition for how several varieties of SVM work: by the end of the tutorial, you should be prepared to apply your knowledge to binary classification, multiclass classification, regression, and one-class (novelty detection) problems. The tutorial should be understandable at the Master’s or advanced undergraduate level—but if you want to delve deeper into the theory, this tutorial should equip you with the conceptual understanding to better approach the formal literature.

+

Throughout the tutorial, we will include conceptual questions and coding demonstrations for you to develop a practical understanding. Although we will include example answers, we encourage you to actually attempt these problems before inspecting the solutions: the best way to develop a practical understanding is to try to apply it, especially when you have the answers available to check your work and learn from your mistakes. This tutorial is built in R, but the approaches learned here will generalize to other languages such as Python. Feel free to follow along in a different programming language of your choosing, but you will need to learn the conventions of different packages from the ones used here. (For Python, we recommend scikit-learn.)

+
+
+

Why should I use SVM?

+

Before we get started, let’s establish some of the advantages and disadvantages of SVM, relative to other approaches you may have seen—such as generalized linear models (GLMs), lasso/ridge/elastic net regularization, or deep learning. As a data scientist, analyst, or engineer, you should first consider the advantages and the disadvantages of a technique before applying it to a problem, rather than the other way around (applying something incorrectly, and finding errors only after wasting your own time and effort).

+

Depending on your experience, some of this may not make sense immediately—that’s OK! Consider returning to this section after completing the rest of the tutorial, to further cement what you have learned.

+
+

Advantages

+
    +
  • One method for classification, regression and novelty/outlier detection. There are not many methods in data science and machine learning that are general enough to handle not only supervised problems with binary class labels, multiple class labels, or quantitative labels, but also unsupervised learning! SVM generalizes to all of these problems.

  • +
  • Easily model non-linear class boundaries and relationships, without extensive feature engineering and model selection. This will sound unremarkable to practitioners of deep learning, but for GLM and lasso/ridge users, SVM opens a wide range of new possibilities. No method will entirely eliminate your need to do feature engineering—it’s a key part of the work of any data scientist—but SVM kernels (which will be explained in more detail in a later section) make life much easier. The SVM will easily fit non-linear relationships in an automatic and data-driven way—rather than through an exhaustive, arbitrary, and subjective search.

  • +
  • Balance interpretation with predictive power. Often, machine learning is visualized on a kind of continuum: a trade-off between simpler, less powerful, but interpretable models on one hand (like linear regression) and highly complicated, state-of-the-art, completely uninterpretable models on the other (like some deep learning). Using this approach, SVMs sit directly in the middle. Using the simplest kernel (linear), you can model very similar relationships to GLMs. Increasing the complexity (and losing interpretability) is then as simple as changing the kernel—a single parameter. Rarely in other techniques do you, the data scientist, have such control over this trade-off. In SVMs, you can simply “flip a switch” from one side to the other, and observe whether you receive enough increased predictive power to justify sacrificing interpretation.

  • +
  • SVMs are, quite simply, fast. SVMs give some of the advantages typical of more complicated models (see above) but at a computational speed more typical of GLM-like methods.

  • +
  • Regularization comes “built-in.” A GLM practitioner may spend a lot of time and effort choosing the best approach to regularize a problem: should we use the ridge penalty? How about lasso? Do we need something even more general, like elastic net? SVMs do not have this consideration. Although the hyperparameters still need to be tuned, as in GLMs, it isn’t necessary to “pick” a regularization approach: it’s already a part of SVM. SVMs are especially powerful in contexts where there are far more dimensions than observations, which are intractable for GLMs without regularization. For example, SVMs are frequently used in brain imaging, on datasets with 10,000s of dimensions but only a few hundred observations.

  • +
+
+
+

Disadvantages

+
    +
  • SVM just predicts classes, not class probabilities. If your previous experience is largely with GLMs, Bayesian modeling, and/or deep learning, this may be a surprising distinction, but it is critically important. Most of the previous methods usually provide probabilities of class membership. SVM does not. For example,

    +
      +
    • Probabilistic methods: We predict that patient 1 has a 64 percent chance of having the disease.

    • +
    • SVM: We predict that patient 1 has the disease. We don’t make any claim about the probability. It may be that your intended use case really needs predicted probabilities—but if that’s the case, SVM should not be the method you use.

    • +
  • +
  • SVM uses geometry, not statistics. For data scientists and academics with a more statistics-oriented background, it can be jarring to suddenly lose access to all of the statistical inference a GLM provides: hypothesis tests for the overall model (e.g., \(F\)-test) and every coefficient (e.g., Wald test) and extensive diagnostic plots to test the model assumptions. SVM is a much more direct optimization problem, and doesn’t make as many assumptions (or produce as many statistical inferences).

  • +
  • Depending on your model, you may have more hyperparameters to tune. The cost of cross-validation, with a grid search, grows exponentially as you have more parameters. If you perform a grid search with 10 possibilities, then with two hyperparameters you have \(10^2=100\) possibilities; for three hyperparameters, \(10^3 = 1000\), etc. With classification and a linear kernel, you will only have one hyperparameter \(C\). But with regression and a RBF kernel, you will have three: \(C, \varepsilon, \gamma\). Your cross-validation may become more computationally intensive than a GLM-based approach.

  • +
+
+
+
+

The Maximum-Margin Classifier

+

Suppose you are given a set of training observations belonging to one of two classes, each of which is described by some predictor variables we’ll call \(X_1\) and \(X_2\). You’re then asked to classify a new unknown point based simply on the training points you have before you. How might you go about this? An intuitive or natural start might be to plot the observations in order to see how they’re clustered visually. Could you draw a line separating both groups? Depending on which side of the line the new point lands on, you should be able to make a more informed decision about how to classify that point. Now, how is this accomplished when we have a larger number of predictors?

+

+

In the most general sense, we just defined a 1-D line that could fully separate our 2-D training data by their target class. If our data were 3-D, we would’ve drawn a 2-D surface. Basically, given \(p\) predictor variables, our goal is to develop a \((p-1)\)-dimensional separator, otherwise known as a hyperplane. Mathematically, this hyperplane can be defined by the equation,

+

\[\begin{equation} + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p = 0 + \tag{1} +\end{equation}\]

+

for parameters \(\beta_i\) and predictors \(X_i\). If an observation with \(p\) predictors (which we’ll represent as a \(p\)-dimensional vector \(\textbf{X} = (X_1, X_2, \ldots, X_p)^T\)) satisfies this equation, the vector lies on this hyperplane. In our case, however, we purposely define a hyperplane so that none of the training observations lie directly on it, and we only care about which side of the hyperplane a new test observation lies on.
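To make the “which side” idea concrete, here is a minimal sketch in R (not from the original tutorial; the coefficients and points are made up) that classifies points by the sign of the left-hand side of Equation 1:

# A hypothetical hyperplane in two dimensions: 1 - 2*X1 + 3*X2 = 0
beta0 <- 1
beta  <- c(-2, 3)

# Two made-up observations (rows are observations, columns are X1 and X2)
X <- rbind(c(2.0, 0.5),
           c(0.5, 2.0))

# The sign of beta0 + X %*% beta tells us which side of the hyperplane each point lies on
side <- sign(beta0 + X %*% beta)
ifelse(side > 0, 'class +1', 'class -1')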

+

If we are able to fully separate the two classes, then where exactly do we draw our separating hyperplane? There are actually an infinite number of ways we could do this, once you think about all the infinitesimal little translations and rotations that we could apply to the hyperplane.

+

Before scrolling any further, take a few minutes to consider where you would draw the hyperplane on the following data. Why would you choose this one?

+

+

Our best choice is to use the hyperplane that is the farthest away from all the training observations on either side of the plane, otherwise known as the maximum-margin hyperplane. The margin is just the distance between the closest observation and the separator, and we want it to be as large as possible. If we use this hyperplane to decide the target class of our test observations, we call this the maximum-margin classifier.

+

+

Now step back and think about what it means to classify observations this way. We care not only about which side of the hyperplane an observation lies on, but also how far from the hyperplane it is: if an observation is far from the classifier, small shifts in where the hyperplane sits won’t change its classification. You might hope that the training and future test observations end up far away from (and on the correct side of) the classifier, otherwise we would have to redefine it—and you’d be right! You might also be wondering how this applies to data where no separating hyperplane exists at all. Unfortunately, this is where the maximum-margin classifier falls short, but fortunately we have a backup plan.

+

+

Do we really benefit the most from fully separating our data with complete accuracy, or can we get away with only being right most of the time? It turns out the latter is not only possible but often more robust than using the maximum-margin classifier. Unlike before, we’ll now allow some observations to lie within the margin, and sometimes even on the incorrect side of the hyperplane. These observations are known as support vectors. This is what’s known as using a soft margin, and it leads us to a generalization of the maximum-margin classifier: the support vector classifier.

+
+
+

Introducing the Hyperparameter, \(C\)

+

Ideally, we would like the dataset to be linearly separable, and all of the points to lie outside the margin, on the correct side of the hyperplane. As discussed above, however, that cannot always be the case. Let \(\xi_i\) be the degree to which the \(i\)th point is “on the wrong side.” That is:

+
  1. If the datapoint is outside the margin and correctly classified, \(\xi_i=0\).
  2. If the datapoint is inside the margin but still correctly classified, \(\xi_i\) is its distance from the margin. In this case, \(\xi_i\) is still relatively small.
  3. If the datapoint is not correctly classified—it lies on the wrong side of the hyperplane—then \(\xi_i\) is its distance from the correct side of the margin. In this case, \(\xi_i\) could be quite large.
+

Observations in cases 2 and 3 are the support vectors, vectors where \(\xi_i>0\).

+

+

Before proceeding, try to determine how many support vectors there are, of each class in the above graph, and which point has the largest \(\xi_i\).

+

Did you say four (two red, and two blue) support vectors? And the leftmost red point has the largest \(\xi_i\)? That is correct.

+

We can then write the following optimization, where \(\beta\) are the coefficients of the hyperplane and \(\beta_0\) is the intercept, with some \(C > 0\):

+

\[\begin{equation} + \min_{\beta, \beta_0} \left( \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^N \xi_i \right) + \tag{2} +\end{equation}\] Since the \(\xi_i\) are distances, all \(\xi_i \geq 0\).

+

This is the optimization problem behind the support vector machine! Note that the margin is \(M = \frac{1}{\|\beta\|}\), so the SVM trades off between two priorities: keeping the margin large (by keeping \(\|\beta\|\) small) and keeping the total error penalty \(C \sum_i \xi_i\) small.

+ +

There is an obvious trade-off here. We could maximize the margin to an enormous value—extending far away from the hyperplane—but then \(\sum \xi_i\) would be enormous, since most points would fall within the margin. We could also trivially drive \(C\sum \xi_i\) to 0 by shrinking the margin toward 0 (since no points, other than those falling exactly on the hyperplane, could be inside it)—but that requires \(\|\beta\|\), and with it the first term, to blow up.

+

Let’s begin with the linearly-separable case again. Imagine you see the following data:

+

+

Take a few minutes, before scrolling any further, to consider where you would draw the hyperplane here.

+

There isn’t one correct answer! Instead, it depends on what value you choose for \(C\). Take a second look at Equation 2. You can see that \(C\) is the weight applied to the various \(\xi_i\), which you can think of as our classifier’s errors. (You may be used to seeing \(\epsilon\) used for errors, but epsilon has another meaning in SVM. See the support vector regression section.) So you can think of \(C\) as the cost of misclassifications.

+

You may have decided that you wanted to linearly separate the training data, no matter what. After all, that’s all the data you have, so it’s what you should use, right? You might have chosen a hyperplane like this, which corresponds to a very large value of \(C\):

+
library(e1071)
+svm = svm(formula=y ~ ., data=data, cost=1e10, kernel='linear')
+
+plot(svm, data, grid=1e3)
+

+

Alternatively, you might have chosen to tolerate that strange outlier. The classes otherwise seem quite separable, after all. That point could be an observation error or just a very rare occurrence, you may have reasoned, and the true separating hyperplane is probably something like this (a model with smaller \(C\)):

+
svm = svm(formula=y ~ ., data=data, cost=1, kernel='linear')
+
+plot(svm, data, grid=1e3)
+

+

Here are a few key concepts to understand:

  • Large values of \(C\) make misclassifications expensive, so the classifier works hard to separate every training point—even at the cost of a boundary that may not generalize well.
  • Small values of \(C\) tolerate some training errors in exchange for a more robust boundary.

Take a third look at equation 2, reprinted here for your convenience:

+

\[\begin{equation} + \min_{\beta, \beta_0} \left( \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^N \xi_i \right) + \tag{2 revisited} +\end{equation}\]

+

This leads us to one additional insight: Large values of \(C\) lead to smaller margins; smaller values of \(C\) lead to larger margins. Again, remember the trade-off: with a smaller \(C\), violations are cheap, so the optimizer accepts more slack in exchange for a smaller \(\|\beta\|\). Since the margin is \(\frac{1}{\| \beta \|}\), this increases the margin.

+

How can we choose \(C\)? On very rare occasions, you might be lucky and not have to do so. In brain imaging, for example, there are typically so many dimensions (tens of thousands) compared to observations (hundreds) that it’s very easy for SVM to find a perfectly separating hyperplane. (To see why having more dimensions than observations makes data easily separable, imagine separating only two points in 3D space.) In this field, it’s actually common just to set \(C\) arbitrarily large to make sure a separating hyperplane is selected, e.g., \(C = 100\).

+

Usually, however, you won’t be so lucky. Like any hyperparameter in other domains of machine learning (such as \(\lambda\) in lasso and ridge penalties), you will probably find \(C\) using cross-validation with a grid search. A common approach is to vary along an exponential scale, for example, by testing powers of 2: \(2^{-5}, 2^{-4}, \ldots , 2^4, 2^5\). If you absolutely need a default, \(C=1\) is a reasonably balanced choice, but as a rule SVM hyperparameters usually need to be tuned. Advanced users might also choose \(C\) more efficiently using Bayesian optimization.
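As a rough sketch of that grid search with e1071 (assuming the same data frame `data` with a factor response `y` used in the plots above), you might write something like this:

library(e1071)

# Grid-search C over powers of 2, using 10-fold cross-validation
tuned <- tune.svm(y ~ ., data = data, kernel = 'linear',
                  cost = 2^(-5:5),
                  tunecontrol = tune.control(sampling = 'cross', cross = 10))

tuned$best.parameters   # the cost with the lowest cross-validation error
tuned$best.model        # an svm object refit at that cost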

+
+
+

An alternative: Introducing the Hyperparameter, \(\nu\)

+

You may also encounter an alternative parameterization of SVM called \(\nu\)-SVM, where there is no longer any parameter \(C\), but instead a \(\nu\). Instead of a cost \(C > 0\), this parameterization chooses a proportion \(\nu \in [0, 1]\). This is a lower bound on the proportion of points that are support vectors, and an upper bound on our training misclassification rate. (Recall that the support vectors are the points on the wrong side of the margin, or even the wrong side of the hyperplane: points with \(\xi_i > 0\).) You can probably already see the resemblance between the two approaches: increasing \(\nu\) increases the proportion of support vectors, which means it increases the margin, the same as decreasing \(C\). The optimization can be written like this:

+

\[\begin{equation} + \min_{\beta, \beta_0} \left( \frac{1}{2} \|\beta\|^2 + \nu \beta_0 + \frac{1}{N} \sum_{i=1}^N \xi_i \right) + \tag{3} +\end{equation}\]

+

If you read the literature, you may see some authors write this in terms of the training sample size \(\ell\) (rather than \(N\)), the non-intercept coefficients \(\mathbf{w}\) (rather than \(\beta\)), and a margin variable \(\rho = -\beta_0\). Notice that \(\sum_{i=1}^N \xi_i\) is no longer weighted by \(C\); instead, \(\nu\) enters through its own term, and increasing \(\nu\) has roughly the same effect as decreasing \(C\): the two parameters have an approximately inverse relationship.

+

Can you guess which one of these models has a larger \(\nu\)? Hint: the support vectors are shown with X symbols, and all other points are Os.

+
svm = svm(formula=y ~ ., data=data, type='nu-classification', nu=0.001, kernel='linear')
+
+plot(svm, data, grid=1e3)
+

+
svm = svm(formula=y ~ ., data=data, type='nu-classification', nu=0.1, kernel='linear')
+
+plot(svm, data, grid=1e3)
+

+

The second plot has many more support vectors, so it’s intuitive that it has a larger \(\nu\). It also looks similar to the plot with smaller \(C\) from above.
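If you prefer to check this numerically rather than visually, the fitted svm object stores its support vector count. A small sketch (again assuming the same `data`):

# Compare how many support vectors each value of nu produces
fit_small_nu <- svm(y ~ ., data = data, type = 'nu-classification',
                    nu = 0.001, kernel = 'linear')
fit_large_nu <- svm(y ~ ., data = data, type = 'nu-classification',
                    nu = 0.1, kernel = 'linear')

# Total number of support vectors for each model
c(small_nu = fit_small_nu$tot.nSV, large_nu = fit_large_nu$tot.nSV)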

+
+
+

Introducing the Kernels

+

So far, we’ve assumed that the best hyperplane to act as our classifier is a linear boundary in our input feature space. In practice, however, we might not expect this assumption to hold in every situation we see. Maybe there is some other enlarged feature space, with more dimensions, in which we could separate the data—for example, we could add quadratic terms for some vectors \(x = (a,b)\):

+

\[\begin{equation*} + (a, b) \rightarrow (a^2, ab, b^2, a, b) +\end{equation*}\]

+

But as we increase the number of dimensions, mapping our data to this expanded space, calculating distances to the margin (to run the SVM described previously), and mapping the resulting hyperplane back to the original space becomes expensive. In some popular kernels, such as the radial basis function (below), the number of dimensions in the expanded space is actually infinite! How can we possibly calculate a distance in an infinite-dimensional space? The problem seems computationally intractable.

+

One way SVMs get around this is to use the kernel trick: instead of ever constructing the enlarged feature space explicitly, we use certain functions to compute the inner products (in that enlarged space) directly from pairs of the original input vectors. Three popular examples of these functions, known as kernel functions, are:

  • Linear: \(K(x, x') = \langle x, x' \rangle\)
  • Polynomial of degree \(d\): \(K(x, x') = (1 + \langle x, x' \rangle)^d\)
  • Radial basis function (RBF): \(K(x, x') = \exp(-\gamma \|x - x'\|^2)\)

where the inner product of two \(p\)-vectors \(x\) and \(x'\) is defined as

+

\[\begin{equation} + \langle x, x' \rangle = \sum^{p}_{j = 1}x_j x'_j + \tag{4} +\end{equation}\]
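To make these definitions concrete, here is a small illustrative sketch of the three kernels written directly as R functions of two vectors (these are the textbook forms above, not e1071’s internal implementation):

# Linear kernel: just the inner product
k_linear <- function(x, xp) sum(x * xp)

# Polynomial kernel of degree d (with the constant and scale both set to 1)
k_poly <- function(x, xp, d = 2) (1 + sum(x * xp))^d

# Radial basis function (RBF) kernel with width parameter gamma
k_rbf <- function(x, xp, gamma = 1) exp(-gamma * sum((x - xp)^2))

x1 <- c(1, 2); x2 <- c(2, 0)
c(linear = k_linear(x1, x2),
  polynomial = k_poly(x1, x2, d = 3),
  rbf = k_rbf(x1, x2, gamma = 0.5))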

+

To demonstrate these kernels, we’ll use the following data:

+

We’ll also formally introduce code for running SVMs in R using the e1071 library. Here are the four functions you will mainly use:

  • svm(): fit an SVM with fixed hyperparameters.
  • tune.svm(): search a grid of hyperparameter values, by default using cross-validation.
  • best.svm(): run the same search as tune.svm(), but return the best fitted model directly.
  • plot(): visualize a fitted classifier’s decision regions over two of the features.

We will use best.svm for simplicity, and tune the hyperparameters using 10-fold cross-validation.

+
+

Linear

+
# 2-d representations of the linear kernel
+svm = best.svm(y ~ ., data=data, kernel='linear',
+               cost = 10^(-5:5),
+               tunecontrol = tune.control(sampling='cross', cross=10))
+plot(svm, data, grid=1000)
+

+
+
+

Polynomial

+
svm = best.svm(y ~ ., data=data, kernel='polynomial',
+               cost = 10^(-5:5), degree=2:5, gamma=1, coef0=1,
+               tunecontrol = tune.control(sampling='cross', cross=10))
+plot(svm, data, grid=1000)
+

+
+
+

Radial basis function

+
svm = best.svm(y ~ ., data=data, kernel='radial',
+               cost = 10^(-3:3), gamma=10^(-5:5),
+               tunecontrol = tune.control(sampling='cross', cross=10))
+plot(svm, data, grid=1000)
+

+
+
+

Sigmoid

+
svm = best.svm(y ~ ., data=data, kernel='sigmoid',
+               cost = 10^(-3:3), gamma=10^(-5:5), coef0=10^(-5:5),
+               tunecontrol = tune.control(sampling='cross', cross=10))
+plot(svm, data, grid=1000)
+

+

In the enlarged space (which we never directly see), the hyperplane is still linear. But when it is mapped back to the original space, it can take on a variety of curved shapes (determined by the choice of kernel). If this seems unintuitive, you have probably already seen this concept, when you fit a simple quadratic curve \(y = ax^2 + bx + c\) using linear regression: you created two-dimensional data, \(\mathbf{x} = (x, x^2)\), fit the model using multiple linear regression, and then plotted the curve \(y = f(x)\) as a function of \(x\) alone. The same idea underlies the kernel trick.
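If it helps, here is that quadratic-regression analogy as a short sketch (simulated data, for illustration only): we build the “enlarged” feature \(x^2\) by hand, fit a model that is linear in the enlarged space, and the fitted curve is non-linear in the original \(x\).

set.seed(1)
x <- seq(-3, 3, length.out = 100)
y <- 2 * x^2 - x + 1 + rnorm(100, sd = 1)

# A model that is linear in the enlarged feature space (x, x^2)...
fit <- lm(y ~ x + I(x^2))

# ...but whose predictions trace a curve in the original one-dimensional space
plot(x, y)
lines(x, predict(fit), col = 'red')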

+

The difference now is that we cannot easily map the data into the new space directly—so we rewrite inner products in that space, in terms of the smaller space’s variables. It happens that this allows us to write formulas that are tractable, even for some infinite-dimensional enlarged feature spaces.

+

The effect of the parameter \(C\) should be even clearer in these enlarged feature spaces. Previously for the non-separable linear boundary case, large/small values of \(C\) decreased/increased the size of our margin. Now, large values of \(C\) will lead to tighter fitting boundaries around classes in the original feature space, whereas small values of \(C\) will lead to smoother curved boundaries.

+
# RBF example to demonstrate the following point
+svm = svm(y ~ ., data=data, kernel='radial',
+          cost = 1e5, gamma=1)
+plot(svm, data, grid=1000)
+

+

Considerations for choosing kernels:

+
    +
  • You may also have guessed already that the greater flexibility of our hyperplane might carry a risk of overfitting. After all, we are no longer constrained to linear forms—for RBF, we’re not even constrained to forms with a finite number of terms. If you anticipated this, great work! You are correct. Tuning \(C\) properly, on unseen data (e.g., using cross-validation), is now even more important. Above, see an example of a badly-overfit model using the RBF kernel and a very large \(C = 10^5\).

  • +
  • Choosing increasingly complex kernels sacrifices the interpretation of the model coefficients. For RBF, you cannot even write them out any more! However, RBF is highly flexible and often achieves high accuracy. When choosing a kernel, feel free to try multiple options for accuracy, and consider whether (for your particular problem) sacrificing interpretability is worth it.

  • +
  • Note that many kernels have hyperparameters you will have to tune, in addition to \(C\). In the above definitions, the polynomial kernel has a degree; the RBF kernel has a hyperparameter \(\gamma\). Some authors parameterize these kernels with even more hyperparameters; for example, in the R package e1071, the polynomial kernel is more general and has 3 hyperparameters \(\gamma, c_0, d\): \(K(x, x') = (c_0 + \gamma \langle x, x'\rangle)^d\). The above definition was the special case where \(\gamma = c_0 = 1\).

  • +
+

In summary, the advantage of using kernel functions is that we never have to actually compute the coordinates of our input data in the kernel’s higher-dimensional implicit feature space. Instead, we just compute what the inner products of all pairs of our data would be in the higher-dimensional space, which is computationally far more feasible. By using these higher-dimensional feature spaces, we can fit many different non-linear relationships while still using SVM just as before.

+
+
+
+

Multiclass classification

+

Up until now, we’ve only used SVMs to solve binary classification problems. Luckily, there have been many different proposals to extend SVMs to multiclass problems. We will briefly present two simple and popular approaches here:

  • One-versus-All (also called One-versus-Rest): for \(K\) classes, train \(K\) binary classifiers, each separating one class from all of the others. To classify a new observation, evaluate all \(K\) decision functions \(\beta_{0k} + \boldsymbol{X}^T \boldsymbol{\beta_k}\) and choose the class whose classifier gives the largest value.

For example, we might train (with four classes) the following classifiers:

| Classifier | 1 vs 2,3,4 | 2 vs 1,3,4 | 3 vs 1,2,4 | 4 vs 1,2,3 |
|---|---|---|---|---|
| \(\beta_{0k} + \boldsymbol{X}^T \boldsymbol{\beta_k}\) | 0.5 | 1.2 | 1.4 | 0.3 |
+

So we select class 3 as the prediction.
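e1071 handles multiclass problems for you (using One-versus-One internally, as noted below), but to see the One-versus-All idea in action, here is a hedged sketch that implements it by hand on the iris data; the object names are made up for illustration.

library(e1071)
data(iris)
classes <- levels(iris$Species)

# Fit one binary SVM per class: "this class" versus "everything else"
ova_models <- lapply(classes, function(k) {
  y_k <- factor(ifelse(iris$Species == k, k, 'other'))
  svm(x = iris[, 1:4], y = y_k, kernel = 'linear')
})

# For one new observation, collect each model's decision value for its own class
new_obs <- iris[1, 1:4]
scores <- sapply(seq_along(classes), function(i) {
  dv <- attr(predict(ova_models[[i]], new_obs, decision.values = TRUE),
             'decision.values')
  # The column name "A/B" tells us which class positive values favour;
  # flip the sign if our target class is listed second
  if (startsWith(colnames(dv)[1], classes[i])) dv[1] else -dv[1]
})

# Predict the class whose classifier gives the largest decision value
classes[which.max(scores)]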

  • One-versus-One: for \(K\) classes, train \(\frac{K(K-1)}{2}\) binary classifiers, one for each pair of classes. To classify a new observation, run every pairwise classifier and let each one “vote” for a class; choose the class with the most votes.

For example, we might train (with four classes) the following classifiers:

| Classifier | 1 vs 2 | 1 vs 3 | 1 vs 4 | 2 vs 3 | 2 vs 4 | 3 vs 4 |
|---|---|---|---|---|---|---|
| Vote | 1 | 1 | 4 | 3 | 4 | 4 |
+

So 4 receives the most “votes,” and we select 4 as the prediction.

+

Here’s a demo of multiclass classification with the R package e1071, using a classic problem in machine learning, the Iris dataset. Note that behind the scenes, the One-versus-One approach is used.

+
# Load Iris data
+data(iris)
+head(iris)
+
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+## 1          5.1         3.5          1.4         0.2  setosa
+## 2          4.9         3.0          1.4         0.2  setosa
+## 3          4.7         3.2          1.3         0.2  setosa
+## 4          4.6         3.1          1.5         0.2  setosa
+## 5          5.0         3.6          1.4         0.2  setosa
+## 6          5.4         3.9          1.7         0.4  setosa
+
# Size of data
+n <- nrow(iris)
+
+# Proportion of data separated into training and test sets
+train_size = .75
+
+# Randomly choose training data indices, set seed for reproducibility 
+set.seed(1)
+ind <- sample(n, round(train_size*n))  
+train <- iris[ind,]
+test <- iris[-ind,]
+
+# Train the SVM
+svm_iris <- best.svm(Species~., 
+                 data = train, 
+                 type='C-classification',
+                 kernel='radial',  # Use RBF kernel 
+                 gamma=10^(-5:5), cost=10^(-5:5),  # Tune the parameters
+                 tunecontrol = tune.control(sampling='cross', cross=10))
+
+# Plot the trained classification Petal.Width vs Length where Sepal.Width & Length are 3 & 5 respectively
+plot(svm_iris, train, Petal.Width ~ Petal.Length,  
+     slice=list(Sepal.Width=3, Sepal.Length=5), grid=1000)
+

+
# Predict on the test set with trained SVM
+prediction <- predict(svm_iris, test)
+
+# Show confusion matrix of our predictions
+conf_mat <- table(test$Species, prediction)
+conf_mat
+
##             prediction
+##              setosa versicolor virginica
+##   setosa         13          0         0
+##   versicolor      0         14         0
+##   virginica       0          1        10
+
# Accuracy of our predictions
+sum(diag(conf_mat))/nrow(test)
+
## [1] 0.9736842
+
+
+

Support Vector Regression (SVR)

+

We will now show how the support vector approach generalizes beyond just classification problems, starting with regression. Now, our labels are not binary classes, but instead real-valued.

+

Instead of trying to maximize the margin of a hyperplane dividing two classes, our hyperplane will now attempt to fit the data, just as in a typical regression problem. We define another hyperparameter \(\varepsilon\), the width of an “\(\varepsilon\)-tube” around our hyperplane. It closely resembles the margin from the previous approach, but instead of separating two classes, it bounds the region in which the hyperplane’s predictions are considered close enough.

+

In this approach, we define two different types of “errors,” \(\xi_i\) and \(\xi^*_i\). Each observation \(x_i\) falls into one of three categories. Let \(y_i\) be the actual label, and \(\hat{y}_i\) the SVR prediction.

  1. If the observation falls inside the \(\varepsilon\)-tube (\(|y_i - \hat{y}_i| \leq \varepsilon\)), then \(\xi_i = \xi^*_i = 0\): no error is recorded.
  2. If the observation falls above the tube (\(y_i > \hat{y}_i + \varepsilon\); the prediction is too low), then \(\xi_i = y_i - \hat{y}_i - \varepsilon\) and \(\xi^*_i = 0\).
  3. If the observation falls below the tube (\(y_i < \hat{y}_i - \varepsilon\); the prediction is too high), then \(\xi^*_i = \hat{y}_i - y_i - \varepsilon\) and \(\xi_i = 0\).

+

Already, you may see the conceptual basis of \(C\) and \(\varepsilon\) emerging. \(\varepsilon\) represents a kind of tolerance for errors: if the prediction is wrong by \(\varepsilon\) or less, then we apply no cost/penalty for this observation. (Note that SVMs scale both the \(x\) and \(y\) to mean 0 and unit variance, so \(\varepsilon=0.1\), for example, means the same thing for any SVR problem.) If we do apply a penalty, however, because the point falls outside the \(\varepsilon\)-tube, then \(C\) controls how much penalty we apply. Note also that \(\xi_i\) and \(\xi^*_i\) are never negative, and at least one of them is always 0 (both are 0 only for points inside the tube).
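As a tiny sketch of that bookkeeping (with made-up numbers for the labels, predictions, and \(\varepsilon\)):

# Actual labels, SVR predictions, and the tube half-width (all made up)
y_actual <- c(1.00, 2.00, 3.00)
y_hat    <- c(1.05, 2.80, 2.10)
eps      <- 0.1

# xi is positive only when the prediction is too low by more than eps;
# xi_star is positive only when the prediction is too high by more than eps
xi      <- pmax(0, y_actual - y_hat - eps)
xi_star <- pmax(0, y_hat - y_actual - eps)
cbind(xi, xi_star)   # at least one of the two is always zero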

+

We can then set up the optimization for this problem

+

\[\begin{equation} + \min_{\beta, \beta_0} \left( \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^N (\xi_i + \xi^*_i) \right) + \tag{5} +\end{equation}\]

+

Does this look familiar? Once you understand the \(\varepsilon\)-tube, and what it means for \(\xi_i\) and \(\xi^*_i\), the optimization problem is almost exactly the same as in support vector classification! The first term is identical, and the second changes only to accommodate our new type of error term: it’s still a penalty, weighted by \(C\), for the size of our model’s mistakes.

+

This approach is especially powerful when you combine it with what you’ve already learned about kernels. Every student in data science and statistics has had the experience of frustratingly trying to find the best non-linear transformations to model a messy non-linear relationship. For example, consider a very difficult problem like this one:

+

+

The true relationship, plotted in black, is \[\begin{equation*} + E[y \mid X = x] = 5 \frac{1+x}{1+x^2} + 0.005x^3 +\end{equation*}\] One hundred points are sampled from this relationship, with a random error term of \(\mathcal{N}(0, 0.5)\) added to every \(y\)-value (blue). This kind of problem would be exceedingly difficult with an ordinary GLM if you didn’t know the data-generating mechanism (given above). What kind of features could you possibly engineer for this?
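The code that generated this example isn’t shown, so here is a hedged sketch of how data like this could be simulated; the object names (`f`, `X`, `Y`, `data`, `support`), the \(x\)-range, and the seed are assumptions made so that the snippet below can run.

# The true mean function described above
f <- function(x) 5 * (1 + x) / (1 + x^2) + 0.005 * x^3

set.seed(1)
X <- runif(100, min = -5, max = 15)          # assumed x-range, for illustration
Y <- f(X) + rnorm(100, mean = 0, sd = 0.5)   # Gaussian noise on every y-value
data <- data.frame(X = X, Y = Y)

# A fine grid used only to draw the true curve in black
support <- seq(min(X), max(X), length.out = 500)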

+

With the kernel trick, we don’t need to worry about this. Below, we demonstrate how easily the RBF kernel can fit to this problem.

+
svm = svm(Y ~ X, data=data, kernel='radial',
+          cost = 1e3, epsilon=0.1, gamma=1e3)
+# (svm() fits with fixed hyperparameters; use tune.svm() or best.svm() to cross-validate them)
+
+# We have to plot the data ourselves, since plot.svm only handles classification problems.
+yhat = fitted.values(svm)
+plot(X, yhat, col='red', xlab="X", ylab="Y", type='p')
+lines(support, f(support), col='black')
+

+

Using the kernel trick, our predictions (red) fit remarkably well—and we didn’t even have to do any feature engineering!

+

(Note that svm, tune.svm, and best.svm automatically detect that our \(y\)-variable is continuous, rather than a discrete factor. These methods will default to classification or regression (with \(C\)) based on the \(y\)-variable you provide. You can also specify the task manually (or choose to use \(\nu\) rather than \(C\)) using the optional type argument.)

+
+
+

One-class SVM

+

At this point, you should be well acquainted with the typical pattern of SVM classification. Based on our training from both types of target classes, we can develop a model that can detect both types of classes within a set of test observations. Now consider the following example:

+

You are given a list of foods that someone enjoys. The list also contains information about each food’s corresponding recipe/ingredients. You’re then asked to recommend different types of foods that you think they might also enjoy. In other words, just find foods that you know are similar to the accepted ones. However, we also know that this person is a picky eater, so we want to be careful not to recommend foods that they won’t enjoy. Do you think perhaps we could use SVMs to do the job?

+

There’s one fundamental difference between this scenario and the ones we’ve presented throughout this tutorial: here, our training set comprises only observations of the positive class. It turns out this constraint makes it much harder to identify what representative negative examples would look like. This scenario is known as novelty detection: our training set is not polluted by outliers, and we are interested in detecting whether new observations are outliers.

+

We can actually use a very similar optimization to what we used for \(\nu\)-SVM! Here’s the optimization:

+

\[\begin{equation} + \min_{\beta, \beta_0} \left( \frac{1}{2} \|\beta\|^2 + \frac{1}{\nu N} \sum_{i=1}^N \xi_i + \beta_0 \right) + \tag{6} +\end{equation}\]

+

Rather than \(C\), one-class SVM uses \(\nu \in [0, 1]\). As in \(\nu\)-SVM, \(\nu\) is a lower bound on the proportion of points that are support vectors. If \(\nu\) approaches 0, it’s a hard-margin classifier (we won’t have any vectors within the margin). Increasing \(\nu\) allows a softer approach.

+

Here’s a demo of one-class classification, using the banana dataset from the imbalance library:

+
library(imbalance)
+data("banana")
+
+# Show the imbalance in the classes
+summary(banana$Class)
+
## negative positive 
+##     2376      264
+
plot(banana$At1, banana$At2, col=banana$Class)
+

+
# Choose only one of the classes 
+df <- subset(banana, Class == "positive")
+x <- subset(df, select = -Class) # Make x training variables
+y <- df$Class # Make y training variable(dependent)
+model <- svm(x,
+             type='one-classification', 
+             kernel = "radial",
+             gamma= .01, nu=.9) # Train a one-class SVM
+
+# create predictions
+prediction <- predict(model,subset(banana, select=-Class))
+
+# Show confusion matrix of our predictions
+conf_mat <- table(banana$Class, prediction)
+conf_mat
+
##           prediction
+##            FALSE TRUE
+##   negative  2376    0
+##   positive   237   27
+
# Accuracy of our predictions
+sum(diag(conf_mat))/nrow(banana)
+
## [1] 0.9102273
+
+
+

Conclusion

+

Congratulations! You have reached the end of the tutorial. You now have a good understanding of support vector machines in their various parameterizations and contexts. You’ve seen enough code to begin applying the approach yourself in R, and the documentation for e1071 will help you the rest of the way. More importantly, though, you’ve developed a conceptual understanding that will accompany you to other packages and programming languages, like Python or Matlab.

+

We hope you have enjoyed the tutorial!

+
+
+

References

+

[1]P.-H. Chen, C.-J. Lin, and B. Schölkopf, “A tutorial on ν-support vector machines,” Appl. Stochastic Models Bus. Ind., vol. 21, no. 2, pp. 111–136, Mar. 2005.

+

[2]I. Cordón, S. García, A. Fernández, and F. Herrera, imbalance: Preprocessing Algorithms for Imbalanced Datasets. 2018.

+

[3]H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola, and V. Vapnik, “Support Vector Regression Machines,” in Advances in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. MIT Press, 1997, pp. 155–161.

+

[4]T. Hastie, R. Tibshirani, and J. Friedman, Elements of Statistical Learning, Second. Stanford: Springer, 2008.

+

[5]G. James, D. Witten, T. Hastie, and R. Tibshirani, Eds., An introduction to statistical learning: with applications in R. New York: Springer, 2013.

+

[6]S. LaConte, S. Strother, V. Cherkassky, J. Anderson, and X. Hu, “Support vector machines for temporal classification of block design fMRI data,” NeuroImage, vol. 26, no. 2, pp. 317–329, Jun. 2005.

+

[7]L. M. Manevitz and M. Yousef, “One-Class SVMs for Document Classification,” Journal of Machine Learning Research, pp. 139–154, 2001.

+

[8]D. Meyer et al., e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. 2019.

+

[9]B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the Support of a High-Dimensional Distribution,” Neural Computation, vol. 13, p. 2001, 1999.

+

[10]A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, Aug. 2004.

+
diff --git a/other/Trees-proj_example.html b/other/Trees-proj_example.html
new file mode 100644
index 0000000..1b88c76
--- /dev/null
+++ b/other/Trees-proj_example.html
@@ -0,0 +1,2360 @@
+

Overview of Classification Trees using R

+ +
+ +
+ +
+

Introduction to Decision Trees and CART

+

All of us have used decision trees in our day-to-day lives. In fact, my mornings often start with a decision tree to decide, “Do I really have to wake up?”

+
+Every Morning Decision Tree +
+
+

A decision tree is a hierarchical representation of possible solutions conditioned on certain factors. It is a tree-like model that can be used to determine a course of action.

+
+

Take the above decision tree, for instance: whether I wake up or not depends on whether my alarm rang, and further on whether I have less than 30 minutes until class. Hence, decisions are made recursively.

+
+

Terminology

+

Before diving further into decision trees and how they help address problems of classification and regression, let’s familiarize ourselves with a few terms:

+
    +
  • Leaf/Terminal Node: Represents our final decision
  • +
  • Decision Node: Represents the choice that we are making and will have outward branches based on the number of choices we have.
  • +
  • Root Node: Represents the top-most decision node.
  • +
+

Let’s dive in and understand how decision trees are being used to solve machine learning problems.

+
+
+

Classification and Regression Trees (CART)

+
+

CART is a supervised machine learning algorithm introduced by Breiman et al. that uses recursive partitioning to model classification and regression problems as binary decision trees.

+
+

To understand how CART works, we will be working with a dataset on online news popularity.

+
+
+

Implementation

+
+

About the Dataset

+
    +
  • Source: data.world
  • +
  • Description: The dataset contains 61 attributes of 39644 news articles. Detailed description of the attributes can be found here
  • +
+
+
+

Problem Statement

+

This dataset will be used to model the popularity of a news article as a classification problem. One of the attributes of the dataset is the number of shares an article gets, which serves as a measure of its popularity. If an article gets 1400 shares or more it is considered a popular article \(^{[3]}\); otherwise it is unpopular.

+
+
+

Overview of the modeling process

+
  1. Split the data into train and test set
  2. Train the model on the training set
  3. Predict using the trained model on the test set
  4. Evaluate the model
+

1. Split the data into train and test set

+

We will be using 80% of the data as the training set and 20% as the test set. To get a random split we can use sample() in R. You can set a seed to ensure that you get the same random split every time.
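A hedged sketch of that split (the object name `news_df`, the seed, and the use of the `shares` column to derive `is_popular` are assumptions based on the description above; the original code isn’t shown):

# Derive the binary target from the share count (1400+ shares = popular)
news_df$is_popular <- factor(ifelse(news_df$shares >= 1400, 1, 0))
news_df$shares <- NULL   # drop the raw share count so it can't leak into the model
# (other non-predictive columns, such as an article URL, would also be dropped in practice)

set.seed(42)                            # assumed seed, for reproducibility
n         <- nrow(news_df)
train_idx <- sample(n, round(0.8 * n))  # 80% of the rows go to training
train_df  <- news_df[train_idx, ]
test_df   <- news_df[-train_idx, ]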

+ +
+

Note: A very important step in model building is feature engineering. We have skipped it entirely because this tutorial focuses on the algorithms and feature engineering is outside its scope; proper feature engineering and transformation can provide greater accuracy.

+
+

2. Train the model on training set

+

Now that we have a training set, we will use the package rpart to create and train a classification tree that determines whether an article is popular or not. Another package, rpart.plot, is used to visualize the fitted tree.
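The tree shown below was produced by the authors’ own code, which isn’t displayed; a sketch of that step could look like this:

library(rpart)
library(rpart.plot)

# Fit a classification tree on the training set
# (method = 'class' because is_popular is a factor)
tree_model <- rpart(is_popular ~ ., data = train_df, method = 'class')

# Visualize the fitted tree
rpart.plot(tree_model)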

+ +

+

Using the training data, the CART algorithm learns which features help predict the popularity of an article. From the above tree, we can see that if the article’s best keyword (its most popular keyword) has an average number of shares (kw_avg_avg) of 2824 or more and the article is about entertainment, then whether the article is predicted to be popular depends on whether it is published on a weekday or a weekend: it has a 0.68 probability of being popular if published on a weekend, but only 0.42 if published on a weekday. Hence, a logical inference from this model is that an entertainment article containing popular keywords is more likely to be popular if it is published on a weekend rather than a weekday.

+
+

Note: is_weekend = 0 indicates it is a weekday.

+
+

3. Predict using the trained model on the test set

+

Using the predict() function, we can get predictions on the test dataset using the trained model.
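A minimal sketch of that call (assuming the `tree_model` and `test_df` objects from the sketches above):

# Class predictions from the fitted tree on the held-out test set
pred <- predict(tree_model, newdata = test_df, type = 'class')
head(pred)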

+ +

4. Evaluate the model

+

Accuracy, area under the curve (AUC), confusion matrix, precision, recall/sensitivity, and specificity are some of the metrics popularly used to evaluate classification models.

+

The confusionMatrix() from the caret package provides us with a summary of these metrics, based on the predictions from the model.
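A sketch of that evaluation (the numbers printed below come from the authors’ model, so your results may differ):

library(caret)

# Confusion matrix plus accuracy, sensitivity, specificity, kappa, etc.
confusionMatrix(pred, test_df$is_popular)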

+ +
## Confusion Matrix and Statistics
+## 
+##           Reference
+## Prediction    0    1
+##          0 2128 1374
+##          1 1547 2880
+##                                           
+##                Accuracy : 0.6316          
+##                  95% CI : (0.6209, 0.6422)
+##     No Information Rate : 0.5365          
+##     P-Value [Acc > NIR] : < 2e-16         
+##                                           
+##                   Kappa : 0.2569          
+##                                           
+##  Mcnemar's Test P-Value : 0.00146         
+##                                           
+##             Sensitivity : 0.5790          
+##             Specificity : 0.6770          
+##          Pos Pred Value : 0.6077          
+##          Neg Pred Value : 0.6506          
+##              Prevalence : 0.4635          
+##          Detection Rate : 0.2684          
+##    Detection Prevalence : 0.4417          
+##       Balanced Accuracy : 0.6280          
+##                                           
+##        'Positive' Class : 0               
+## 
+
+
+
+

How to determine the best split?

+

In the section above we saw that the CART algorithm was able to infer rules which split our dataset into subsets that are as homogeneous as possible. The underlying goal of a classification tree is to split the dataset into subsets where each subset belongs to only one class. A classification tree tries to come up with decision boundaries by splitting on the input features in such a way that each resulting subset has most of its samples belonging to a single class. This raises the question: how do we determine the best split?

+

As the goal is to partition the dataset into subsets that are as homogeneous (pure) as possible, the best split can be determined by measuring the impurity (misclassifications) introduced by each candidate rule and choosing the rule that reduces the impurity the most.

+
+

Gini Index

+
+

The Gini index is an impurity-based criterion that measures the probability that a randomly chosen observation would be misclassified if it were labeled at random according to the class distribution produced by the partitioning criterion.

+
+

Mathematically, the Gini index is calculated as \[G = 1 - \sum_{c \in C} p_c^2\] where,
\(C\) is the set of classes; for our example \(C = \{1, 0\}\), where 1 indicates popular and 0 indicates unpopular
\(p_c\) is the proportion of the samples that belong to class \(c\)

+

The value of the Gini index lies in \([0,1)\). A value of 0 indicates that there are no misclassifications: some \(p_c = 1\), meaning all samples belong to the same class. In the case of binary classification, a uniform class distribution implies that the proportion of each class is 0.5, so in our example the Gini index would be calculated as \[G = 1 - P(popular = 0)^2 - P(popular = 1)^2\] \[= 1 - 0.5^2 - 0.5^2 = 1 - 0.25 - 0.25 = 0.5\] So the maximum value the Gini index can take is 0.5 for binary classification, and \(1 - \frac{1}{n}\) for an \(n\)-class classification. As a lower Gini index indicates less impurity, at each decision node the algorithm minimizes the Gini index, choosing the splitting criterion which yields the minimum.
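As a quick sanity check of these numbers, here is a tiny sketch that computes the Gini index from a vector of class labels:

# Gini index of a node, given the class labels of the observations in it
gini_index <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

gini_index(c(1, 1, 1, 1))     # pure node: 0
gini_index(c(1, 1, 0, 0))     # 50/50 binary node: 0.5
gini_index(c('a', 'b', 'c'))  # three balanced classes: 1 - 1/3, about 0.67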

+
+
+

Information Gain

+
+

Information gain is another impurity-based criterion; it measures impurity as the difference in the entropy of the samples before and after the split.

+
+

Entropy is a measure of randomness. Mathematically, entropy and information gain are calculated as follows:

+

\[H(S) = -\sum_{c \in C} p(c)\log_{2}p(c)\] where,

+
    +
  • \(S\) is the dataset before the split
  • +
  • \(C\) is the set of classes in \(S\); for our example \(C = \{1, 0\}\), where 1 indicates popular and 0 indicates unpopular
  • +
  • \(p(c)\) is the proportion of observations in class c in the dataset \(S\)
  • +
+

\(H(S) = 0\) implies that there is no randomness in the data, and all observations belong to the same class

+

\[IG(A, S) = H(S) - \sum_{t \in T} p(t)H(t)\] where,

+
    +
  • \(H(S)\) is the entropy of the dataset before the split.
  • +
  • \(T\) are the subsets created after splitting \(S\) on feature \(A\) based on some criterion. For example, in our case \(A\) could be is_weekend, then for \(A\) the corresponding \(S\) dataset would be all observations where data_channel_is_entertainment is 1, and \(T\) would be the 2 subsets created for weekend and weekdays.
  • +
  • \(p(t)\) is the proportion of the number of observations in \(T\) to the number of observations in \(S\)
  • +
  • \(H(t)\) is the entropy of the subset \(t\)
  • +
+

Information Gain provides us with a measure of the reduction in randomness given the splitting criterion. Hence, at each decision node the algorithm tries to maximize the information gain, and chooses splitting criterion which leads to maximum information gain.

+

The rpart package uses gini index as the default splitting criterion but it provides an option to use information gain as the splitting criterion too.
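A sketch of that option (using rpart’s `parms` argument; the comparison the authors show below came from their own code):

# The same tree, but grown with information gain (entropy) instead of the default Gini index
tree_info <- rpart(is_popular ~ ., data = train_df, method = 'class',
                   parms = list(split = 'information'))
rpart.plot(tree_info)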

+ +

+

As seen above, the Gini index and information gain produce similar results in most cases. Based on a study by L. E. Raileanu et al., the choice between Gini index and information gain matters in only about 2% of cases. Hence, the Gini index is usually the way to go, as it is computationally less expensive than information gain.

+
+
+

Stopping Criterion:

+

The CART algorithm recursively uses the impurity-based criterion to determine splits at each decision node, until it reaches a homogeneous subset or the improvement falls below a certain threshold.

+
+
+
+

Hyperparameter Tuning

+

Hyperparameters are parameters that define the model architecture. They control the mechanics of an algorithm and are set before the algorithm is trained, governing its responsiveness, learning speed, and efficiency. Tuning them helps fine-tune the model and can produce better performance.

+

A decision tree has 3 important hyperparameters:

+
    +
  • cp : complexity parameter is the minimum improvement in the model needed at each node. It is based on the cost function of the model and acts as a stopping parameter.
  • +
  • minsplit : the minimum number of datapoints needed to accept a split.
  • +
  • maxdepth : the maximum depth up to which a tree can grow.
  • +
+

cp is considered one of the most important hyperparameters, as it helps speed up the search for splits: splits which do not meet the set improvement threshold are pruned away. Let us look at how playing around with these parameters alters model performance. plotcp() plots the cp values against the cross-validation error.

+ +

+

Using rpart.control() we can define hyperparameters
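A sketch with illustrative values (the exact settings the authors used to produce the output below aren’t shown, so treat these numbers as placeholders):

# The three hyperparameters discussed above
ctrl <- rpart.control(cp = 0.001,      # minimum improvement required to keep a split
                      minsplit = 50,   # minimum observations in a node before splitting
                      maxdepth = 10)   # maximum depth the tree may reach

tuned_tree <- rpart(is_popular ~ ., data = train_df, method = 'class', control = ctrl)
plotcp(tuned_tree)   # cp values vs. cross-validation error, as mentioned above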

+ +

+ +

+ +
## [1] "Decision Boundary at 0.559922744379563"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 2202 1343
+##          1 1473 2911
+## [1] "Accuracy: 0.644848026232816"
+

As you can see, tuning the tree improved our accuracy from around 63% to 64%, but the tree has become harder to visualize and interpret. Based on the cp plot we can see that beyond 0.003, there is hardly any improvement in the cross validation error and so we can prune the tree to simplify it.
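A sketch of that pruning step (assuming the `tuned_tree` object from the previous sketch):

# Prune the tuned tree back at cp = 0.003, as suggested by the cp plot
pruned_tree <- prune(tuned_tree, cp = 0.003)
rpart.plot(pruned_tree)

# Re-evaluate on the test set
pruned_pred <- predict(pruned_tree, newdata = test_df, type = 'class')
mean(pruned_pred == test_df$is_popular)   # accuracy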

+ +

+ +
## [1] "Decision Boundary at 0.568987943003288"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 2136 1334
+##          1 1539 2920
+## [1] "Accuracy: 0.637659225627444"
+

After pruning, the model is much easier to interpret and the accuracy is 63.7%, which is still around 64%. The trade-off between model complexity and accuracy must be handled on a case-by-case basis, depending on the problem statement. For example, if this were a dataset predicting whether a patient is likely to have a disease or not, we would choose higher accuracy over lower model complexity.

+
+
+

Advantages and Disadvantages

+

A Decision Tree comes with its own shares of pros and cons:

+
+

Advantages

+
  1. They are easy to understand, interpret, and visualize.
  2. A decision tree can handle both numerical and categorical variables natively. At each decision node, the splitting criterion is based on a feature: for a categorical feature, the split is defined with respect to observations belonging to certain categories/levels, so transformations like one-hot encoding are not required; for numerical features, the split is based on whether observations fall above or below a threshold. As discussed above, the best split is chosen using the impurity measure, and the type of feature is irrelevant to this selection.
  3. Decision trees, being hierarchical structures, are able to learn multiple decision boundaries from the data and hence are more effective at modeling non-linear data.
  4. They perform well with larger datasets, as trees are computationally efficient.
+
+
+

Disadvantages

+
  1. Decision trees are prone to overfitting, especially when the tree is deep. A deep tree tries to capture the fine granularity of the data and becomes a complex model; a slight change in the data can drastically change the fitted tree, making it unstable.
  2. For categorical variables with several levels, impurity-based criteria are biased in favor of attributes with more observations, i.e. on an imbalanced dataset, decision trees can give biased results.
+
+
+
+
+

Bagged Trees

+
+

Motivation

+

A major drawback of decision trees is their high variance. A slight change in the dataset can result in a completely different series of splits, which makes the model interpretation unreliable. Bagged Trees help address this issue by averaging multiple trees grown from subsamples of the data.

+
+

Bagging, a.k.a. bootstrap aggregation, is an ensemble technique wherein multiple models are trained on bootstrapped samples from the dataset, and an aggregate of the predictions from these models is taken as the final prediction.

+
+

Ensembling refers to combining several weak learners to form one strong learner. In the context of decision trees, the weak learners would be shallow trees that perform only slightly better than random guessing.

+

Bootstrap Sampling refers to sampling from a dataset with replacement. This implies that the sample can have some observations multiple times and some not at all.

+
+Bootstrap Sampling +
+
+
+

Algorithm

+
+Bagging +
+

Bagging can be mathematically represented as \[\hat{y}(x) = \frac{1}{N} \sum_{i=1}^N T_i(x)\] where
  • \(\hat{y}(x)\) is the aggregated prediction at input \(x\)
  • \(N\) is the number of trees created
  • \(T_i\) is the tree trained on the \(i\)th bootstrapped sample

+
+

How do bagged trees help in reducing the variance of our predictions?

+

Mathematically the variance is given as, \[Var(X) = E[X^2] - (E[X])^2\]

+

Therefore, the variance of our predicted values, is calculated as:

+

\[Var(\hat{y}) = \frac{1}{N^2}\sum_{i=1}^N Var(T_i(x)) + \frac{2}{N^2} \sum_{i<j} Cov(T_i(x), T_j(x))\]

+

\[= \frac{1}{N^2}\sum_{i=1}^N Var(T_i(x)) + \frac{2}{N^2} \sum_{i<j} \sqrt{Var(T_i(x))Var(T_j(x))}\, Cor(T_i(x), T_j(x))\]

+

From the above equation we can see that the variance of the predicted values grows with the correlation between the predictions of different trees. As decision trees are unstable, growing trees on different bootstrapped samples gives models that have little correlation with one another. As the correlation between models decreases, so does the variance, and hence we are able to reduce the overall variance of our predictions by using bagged trees.

+
+
+
+

Implementation

+

To understand the bagging algorithm, let us first implement the algorithm on our own.
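The authors’ hand-rolled implementation (which produced the output below) isn’t shown; here is one possible sketch of the idea, reusing the object names from the earlier sketches and a plain 0.5 probability threshold rather than the tuned decision boundary reported below:

set.seed(42)
n_trees <- 30
n_train <- nrow(train_df)

# Grow one tree per bootstrap sample and record P(popular) for every test row
prob_matrix <- sapply(1:n_trees, function(i) {
  boot_idx  <- sample(n_train, n_train, replace = TRUE)           # bootstrap sample
  boot_tree <- rpart(is_popular ~ ., data = train_df[boot_idx, ], method = 'class')
  predict(boot_tree, newdata = test_df, type = 'prob')[, '1']
})

# Aggregate: average the probabilities, then threshold
avg_prob    <- rowMeans(prob_matrix)
bagged_pred <- factor(ifelse(avg_prob > 0.5, 1, 0), levels = c(0, 1))
mean(bagged_pred == test_df$is_popular)   # test-set accuracy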

+ +
## [1] "Decision Boundary at 0.482908550592783"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 1893 1074
+##          1 1782 3180
+## [1] "Accuracy: 0.639803253878169"
+

You can do the same thing using bagging() from the ipred package. This implementation can also produce out-of-bag estimates of the error rate (misclassification error, in the case of classification trees), using the samples that were not drawn into each bootstrap sample.
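A sketch matching the call shown in the output below (30 bootstrap replications, with out-of-bag error estimation turned on):

library(ipred)

bag_model <- bagging(is_popular ~ ., data = train_df, nbagg = 30, coob = TRUE)
print(bag_model)   # includes the out-of-bag misclassification estimate

# Predictions on the test set
bag_pred <- predict(bag_model, newdata = test_df)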

+ + + +
## 
+## Bagging classification trees with 30 bootstrap replications 
+## 
+## Call: bagging.data.frame(formula = is_popular ~ ., data = train_df, 
+##     nbagg = 30, coob = TRUE)
+## 
+## Out-of-bag estimate of misclassification error:  0.3633
+ +
## [1] "Decision Boundary at 0.516666666666667"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 2281 1440
+##          1 1394 2814
+## [1] "Accuracy: 0.642577878673225"
+

Since bagged trees are simply an ensemble of independent decision trees, the same hyperparameters that apply to decision trees also apply to bagged trees, and you can specify them for bagged trees using rpart.control(). The bagged model above gave an accuracy of 64.25% while using simple decision trees like our first model, since it used the same default hyperparameter values.

+
+
+

Advantages and Disadvantages

+
+

Advantages

+
    +
  1. Bagged Trees reduce the variance of predictions and thus help in preventing overfitting.
  2. +
+
+
+

Disadvantages

+
  1. The cost of computation is high, since we need to train a large number of trees.
  2. Compared to decision trees, bagged models are harder to interpret. Since the final output is a combination of multiple trees, the model’s predictions are no longer intuitive.
+
+
+
+
+

Random Forest

+

Random forest is a supervised learning algorithm which, just like CART, can perform both classification and regression. What makes random forests different from bagged trees is that they do not consider every variable in the dataset when choosing a split; rather, only a random subset of the columns is considered at each node. So, random forests can be thought of as adding random sampling of the features on top of the bootstrapping of observations.

+

Some advantages of Random Forest over previously discussed algorithms are

+
    +
  1. Random forest helps reduce overfitting. Overfitting means the model fits the training data so closely that it learns the anomalies of the dataset rather than a generalizable pattern; such models show large variance when tested on new data. In a random forest, each tree is built on a subset of the observations, which yields a more generalized tree structure, and the error on the out-of-bag samples (samples not used in training) shrinks, reducing overfitting. The number of trees also helps: when we aggregate a large number of weakly correlated trees, the variance of the overall model decreases, because the more trees we average, the more their individual errors cancel out.
  2. +
+

Let’s take a deeper dive into how a random forest reduces the variance of the model! Randomness enters in two ways: each tree is trained on a bootstrap sample of the observations, and each split considers only a random subset of the features, which together produce a large number of weakly correlated trees in the final model. The core idea of random forest is to combine many high-variance trees into a low-variance ensemble, striking a favorable trade-off between bias and variance and helping the algorithm avoid overfitting the data.

+
    +
  1. High accuracy - it predicts with high accuracy even on large datasets; in today’s world of big data it is important for an algorithm to handle large datasets well, and this is one of random forest’s strongest points.
  2. +
  3. Performs reliable implicit feature selection
  4. +
  5. Requires almost no input preparation
  6. +
  7. Can be easily grown in parallel
  8. +
+
+

Algorithm

+

The below figure summarizes the Random Forest Algorithm:

+
[Figure: the Random Forest algorithm]
+

The training data is randomly sampled with replacement to form bootstrap subsets. A number of decision trees are grown, one per bootstrap sample, and each tree produces its own prediction. In classification problems, the final class assigned to an observation is the mode of the classes predicted for it by the individual trees. This is the basic algorithm of random forest. Each tree splits the data differently because it considers a different random set of variables, with splits chosen to maximize information gain. When a new data point arrives, it is passed through every trained tree, and the trees’ votes are aggregated to reach the final decision, a process also called majority voting.

+
+
+

Implementation

+

Two popular packages in R for implementing random forests are

+
    +
  1. randomForest
  2. +
  3. party
  4. +
+

The party package implementation uses conditional inference trees, which, although they have their advantages, are time-consuming to build. Hence we use the randomForest package for our use case.
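A sketch of the fit and its evaluation; the randomForest() arguments mirror the call echoed in the output below, while the use of caret::confusionMatrix() for the test-set evaluation is our assumption based on the output format.

library(randomForest)

set.seed(42)
# 1000 trees; importance = TRUE stores the variable importance measures
rf_fit <- randomForest(is_popular ~ ., data = train_df,
                       ntree = 1000L, importance = TRUE)
rf_fit                                    # OOB error estimate and confusion matrix

# Evaluate on the held-out test set
rf_pred <- predict(rf_fit, newdata = test_df)
caret::confusionMatrix(rf_pred, test_df$is_popular)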

+ + +
## 
+## Call:
+##  randomForest(formula = is_popular ~ ., data = train_df, ntree = 1000L,      importance = TRUE) 
+##                Type of random forest: classification
+##                      Number of trees: 1000
+## No. of variables tried at each split: 7
+## 
+##         OOB estimate of  error rate: 32.5%
+## Confusion matrix:
+##      0     1 class.error
+## 0 8892  5923   0.3997975
+## 1 4384 12516   0.2594083
+ +
## Confusion Matrix and Statistics
+## 
+##           Reference
+## Prediction    0    1
+##          0 2170 1182
+##          1 1505 3072
+##                                           
+##                Accuracy : 0.6611          
+##                  95% CI : (0.6506, 0.6715)
+##     No Information Rate : 0.5365          
+##     P-Value [Acc > NIR] : < 2.2e-16       
+##                                           
+##                   Kappa : 0.3145          
+##                                           
+##  Mcnemar's Test P-Value : 5.236e-10       
+##                                           
+##             Sensitivity : 0.5905          
+##             Specificity : 0.7221          
+##          Pos Pred Value : 0.6474          
+##          Neg Pred Value : 0.6712          
+##              Prevalence : 0.4635          
+##          Detection Rate : 0.2737          
+##    Detection Prevalence : 0.4228          
+##       Balanced Accuracy : 0.6563          
+##                                           
+##        'Positive' Class : 0               
+## 
+
+

Hyperparameters

+

mtry

+

The mtry parameter sets the number of variables considered for splitting at each tree node. Choosing an optimal value of mtry can be tricky: picking a small value can produce a large number of weak trees, while choosing a high value can suppress the contribution of less influential predictors.

+

tuneRF is a function in the randomForest package which tunes mtry to its optimal value with respect to the OOB error.

+

The default mtry value tells us that out of the 58 predictor variables, each split considers only 7 randomly chosen variables, so the 1000 trees we grow will be very different from each other. To balance variety, efficiency, and accuracy, we use the tuneRF function to find an optimized number of features to consider at each split.
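A sketch of how tuneRF() might be called here; the column and target names follow the rest of the tutorial, and the step-size and improvement thresholds are illustrative rather than the values actually used.

library(randomForest)

set.seed(42)
x_train <- train_df[, setdiff(names(train_df), "is_popular")]
y_train <- train_df$is_popular

# Search left and right of the starting mtry, keeping the value with the lowest OOB error
tuned <- tuneRF(x_train, y_train,
                mtryStart  = 7,       # the classification default, roughly sqrt(58)
                ntreeTry   = 500,
                stepFactor = 1.5,     # multiply/divide mtry by this factor each step
                improve    = 0.01,    # minimum relative OOB improvement to keep searching
                trace      = TRUE, plot = TRUE)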

+ +
## mtry = 7  OOB error = 0% 
+## Searching left ...
+## Searching right ...
+

+

In the randomForest package, the default mtry for classification models is the square root of the number of predictor variables.

+

nodesize

+

nodesize sets the minimum number of observations in a terminal node. In the randomForest package we can also specify maxnodes, i.e. the maximum number of terminal nodes a tree may have, while in the party package we can specify the maxdepth of the trees.

+

sampsize and replace

+

sampsize controls how many observations are drawn for each tree, so we face a trade-off between smaller per-tree training sets and more diverse trees. The replace parameter chooses between sampling with and without replacement, and switching it can produce a small change in the performance of the model.

+
+
+

OOB Error

+

The OOB score is random forest’s built-in validation method, which provides an approximately unbiased estimate of performance. Each decision tree is trained on a bootstrap sample of the training data; the observations left out of that sample form the tree’s out-of-bag set, which is used to calculate the OOB error. Because OOB samples act as unseen data for the trees that did not train on them, they provide an honest estimate of the model’s performance.

+
+
+
+

Importance of Variables

+

The randomForest package in R calculates variable importance by two methods:

+
    +
  1. Gini Variable Importance Measure
  2. +
  3. Permutation Variable Importance Measure
  4. +
+

The Gini measure for a variable is the mean decrease in node impurity (information gain) produced by splits on that variable, averaged across all trees. However, Gini importance can produce strongly biased results: it tends to favor continuous variables and categorical variables with many levels, assigning them inflated importance scores.

+

Permutation Variable Importance (PVI) – First, the OOB prediction error is recorded. Then the OOB prediction error is recalculated after permuting (randomly shuffling) each variable in turn. The difference between the two errors is averaged over all the trees, and the importance score is based on that average: the most important variable is the one whose permutation decreases the model’s accuracy the most.

+

However, the permutation variable importance method has limitations too: it does not handle multicollinearity among variables well. A variable may have little effect on its own, yet its correlation with other predictors can inflate its importance score.

+

The party package in R comes to the rescue! Random forest models built with this package provide a function to calculate conditional variable importance. The traditional approaches above (Gini and PVI) produce an importance score by measuring the effect of permuting the predictor \(X_j\) with respect to \(Y\). The conditional variable importance available in party, however, analyzes the relationship between \(X_j\) and \(Y\) given the correlation structure of \(X_j\) with the other predictors.

+

The code chunk below shows how variable importances can be calculated in the randomForest package.
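A sketch of what that chunk might contain, assuming the rf_fit model from the previous section (it was fit with importance = TRUE, so both measures are available):

library(randomForest)

imp <- importance(rf_fit)   # MeanDecreaseAccuracy (permutation) and MeanDecreaseGini
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ])

varImpPlot(rf_fit)          # plots the two measures side by side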

+ +

+
+

Note that we fit the model only once; all of the variable importance calculations are performed after the model has been trained.

+
+

Drawbacks of Random Forest

  1. High accuracy comes at the cost of computational resources, since we need to train a large number of trees.
  2. Apart from a handful of tuning parameters, we have very little control over what the model does; ensembling many trees makes the model harder to interpret than a single decision tree.

+
+
+
+

Boosting and Boosted Trees

+

Boosting is a framework that prioritizes misclassified samples in order to classify them better. Boosted trees achieve this by giving the misclassified samples a higher weight than the rest when the next tree is built. So far we have seen that bagging creates multiple independent trees; boosting instead creates trees sequentially, each improving on the last. It mainly targets reducing the bias of a model, whereas bagging mainly helps with reducing variance.

+

The idea of boosting comes from the observation that a combination of weak learners can outperform a single strong learner. A weak learner is a model that classifies only slightly better than random guessing; a strong learner produces good accuracy and is highly correlated with the true classes of the target variable. Whereas a random forest, for example, builds several independent and generally deep trees (strong learners), boosted algorithms build shallow trees (weak learners), each focused on the observations the previous learners misclassified. This sequential, stage-wise nature and the focus on misclassification error are what make boosting advantageous.

+

Let \(h_{m}\) be the weak learner we train at the \(m\)-th stage, \(F_{m-1}\) be the model built through stage \(m-1\), and \(y\) be the target variable. The basic idea is that the weak learner is trained on the residuals of the current model.

+

\[h_{m}(x) \approx y - F_{m-1}(x)\] and the model at stage \(m\) is updated as follows: \[F_{m}(x) = F_{m-1}(x) + h_{m}(x)\] Thus the generalized form of the sequential additive boosting framework can be written as \[F(x) = F_{base}(x) + \sum_{m=2}^{M}h_{m}(x)\] where \(F_{base}\) is the first tree fit to the data.

+
+

Why Boosting

+

Sequential addition of weak learners allows the model to be corrected, or tweaked, precisely where it does not perform well, which makes for slow, steady learning: once a tree is built the model has not finished learning, but rather waits for further weak trees to be fit, focused heavily on the samples it has not yet classified well. This results in improved accuracy.

+

Each time a tree is added to the chain of boosted trees, the aggregate model can be evaluated. This offers another key advantage in preventing overfitting: cross-validation after each tree is added helps detect when the error metric has stopped improving, so training can stop before the algorithm is lured into further reducing the training error and overfitting. This is called early stopping.

+

One important thing to remember, however, is that sequentially trying to fix misclassified observations places a lot of weight on those samples, so boosting is sensitive to outliers and noise. This is where early stopping, as described above, can really help avoid such negative effects. We will see further down this article how it can be controlled with hyperparameters and tuning.

+

Also, since boosting builds weak learners, its trees are usually shallow, which makes each iteration faster than in frameworks that need deeper independent trees. Furthermore, boosting is general: where necessary, a user can define their own loss function and run the boosting framework to optimize it.

+
+
+
+

Gradient Boosted Machine

+

The class of gradient boosted machines uses gradient information to optimize the loss function in boosting, so gradient boosted trees are often viewed as performing gradient descent on the loss, with the trees supplying the descent directions. Gradient descent is a generic optimization algorithm that moves down the slope of a function in search of a minimum; here that function is the loss to be minimized by the algorithm.

+

Friedman formulated gradient boosted trees as such a gradient descent procedure. If the weak learner, the tree \(h_{m}\) for \(m \in 1, 2, \dots, M\), is thought of as the gradient step in the sequential additive boosting model, and \(\gamma_{m}\) is the step size, then tree creation can be formulated as a gradient descent problem in which the model is updated sequentially and \(\gamma_{m}\) is calculated as shown in the equations below.

+

\[ F_{m}(x) = F_{m-1}(x) + \gamma_{m}h_{m}(x),\] \[\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n}L\big(y_{i}, F_{m-1}(x_{i}) + \gamma h_m(x_{i})\big)\]

+

Further, we may modify the above equation with a shrinkage parameter or learning rate to avoid over-fitting. This shrinkage parameter shrinks the incremental update to the base learner by reducing the effects of the new weak learner \(h_{m}\).

+

\[ F_{m}(x) = F_{m-1}(x) + \nu . \gamma_{m}h_{m}(x), \text{ }0<\nu \le 1\]

+
+

Implementation

+

We use the R package gbm for the following sections.

+
+
+

Hyperparameters

+

GBM includes the following important hyperparameters, among others, as described in the R package gbm.

+
    +
  1. n.trees: Number of trees to build
  2. +
  3. interaction.depth: The depth to which the trees will be built
  4. +
  5. shrinkage: The learning rate at which gbm adds weak learners to the base learner.
  6. +
  7. n.minobsinnode: The minimum number of observations in a node to accept a split.
  8. +
  9. bag.fraction: The fraction of observations to consider for building a single tree.
  10. +
  11. cv.folds: Number of folds for cross-validation if cross-validation is used.
  12. +
  13. train.fraction: The proportion of data to be considered for training of the entire ensemble of trees. The rest is used as validation data.
  14. +
+

We start by training our first gbm model using the code below. The hyperparameters are set arbitrarily for now and are only present so we can check how the code runs. We choose a bernoulli distribution in the distribution parameter to let gbm know that we are working on a binary classification problem; with this choice gbm minimizes the Bernoulli deviance. We may also specify the number of cores through n.cores, which lets R try to run each fold of cross-validation on a different core and often helps with speed. Setting it to NULL makes gbm infer the number of available cores using the detectCores function in the parallel package.
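A sketch of that first fit; the arguments mirror the call echoed by gbm below, and we assume is_popular has been coded as 0/1, which the bernoulli distribution requires.

library(gbm)

set.seed(42)
gbm_fit <- gbm(is_popular ~ .,
               distribution      = "bernoulli",   # binary classification (0/1 response)
               data              = train_df,
               n.trees           = 1000,
               interaction.depth = 5,
               n.minobsinnode    = 10,
               shrinkage         = 0.05,
               bag.fraction      = 0.7,
               cv.folds          = 5,
               verbose           = FALSE,
               n.cores           = NULL)           # let gbm detect the available cores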

+ +

The gbm.perf function allows us to track the cross-validation deviance across iterations. The plot it produces shows the validation deviance in green and the training deviance in black. We clearly see that the training deviance continues to fall while the validation deviance stabilizes after a point. This shows how cross validation helps us avoid overfitting. The blue line shows the optimal number of trees beyond which the model might have started overfitting. This is a key advantage of GBMs.
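A sketch of that check:

# Plot training (black) and cross-validation (green) deviance; the returned value
# is the iteration with the lowest CV deviance, marked by the blue vertical line
best_iter <- gbm.perf(gbm_fit, method = "cv")
best_iter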

+ +
## gbm(formula = is_popular ~ ., distribution = "bernoulli", data = train_df, 
+##     n.trees = 1000, interaction.depth = 5, n.minobsinnode = 10, 
+##     shrinkage = 0.05, bag.fraction = 0.7, cv.folds = 5, verbose = FALSE, 
+##     n.cores = NULL)
+## A gradient boosted model with bernoulli loss function.
+## 1000 iterations were performed.
+## The best cross-validation iteration was 991.
+## There were 58 predictors of which 58 had non-zero influence.
+ +

+
## [1] 991
+ +
## Optimal trees: 991
+

The following code chunk helps us evaluate the model on the testing dataset we held out for prediction early on. When used with a gbm model and type = 'response', the predict function returns predicted probabilities, so we need to convert these to classes; to do this we call our find_auc() function and evaluate the model.
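A sketch of that evaluation; since the exact signature of the tutorial's find_auc() helper is not shown here, the sketch simply thresholds the probabilities at 0.5 as a stand-in for the decision boundary the helper reports.

# Predicted probabilities at the optimal number of trees found by gbm.perf()
gbm_prob <- predict(gbm_fit, newdata = test_df,
                    n.trees = best_iter, type = "response")

# Placeholder thresholding; the tutorial's find_auc() picks the boundary itself
gbm_class <- factor(ifelse(gbm_prob > 0.5, 1, 0), levels = c(0, 1))
caret::confusionMatrix(gbm_class, factor(test_df$is_popular, levels = c(0, 1)))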

+ +
## [1] "Decision Boundary at 0.527258706739771"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 2397 1348
+##          1 1278 2906
+## [1] "Accuracy: 0.668810694917392"
+
+
+

Tuning Hyperparameters

+

Above, we chose the hyperparameters arbitrarily. The performance of a model depends heavily on the hyperparameters selected to run it, so we now tune the model to find the set of hyperparameters that yields optimal performance.

+

We first define a hyperparameter grid using the expand.grid function. This defines the range of values over which gbm will be run to determine the best-fitting model. In the parameter grid we also define two variables to store the optimal number of trees and the minimum deviance from each combination.
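A sketch of such a grid; the values shown are taken from the combinations visible in the results table below, so the real grid may have contained more.

# Hyperparameter grid plus two columns to record each combination's results
hyper_grid <- expand.grid(
  learning_rate = 0.05,
  depth         = c(3, 5, 9),
  min_obs_node  = c(5, 10, 15),
  bag_fraction  = c(0.5, 0.7),
  n_trees_opt   = 0,     # filled in by the tuning loop
  Deviance      = 0      # filled in by the tuning loop
)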

+ +

Now we construct a loop to iterate over the hyperparameter space. In each iteration we train a gbm model and record its error statistics, more specifically the validation deviance. Along with this we also record the optimal number of trees. This will help us in selecting our final model once we have been able to iterate over our entire hyperparameter space.
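A sketch of such a loop; the grid column names follow the sketch above, and cv.error is the per-iteration cross-validation deviance stored on a gbm fit.

set.seed(42)
for (i in seq_len(nrow(hyper_grid))) {
  fit <- gbm(is_popular ~ .,
             distribution      = "bernoulli",
             data              = train_df,
             n.trees           = 1000,
             shrinkage         = hyper_grid$learning_rate[i],
             interaction.depth = hyper_grid$depth[i],
             n.minobsinnode    = hyper_grid$min_obs_node[i],
             bag.fraction      = hyper_grid$bag_fraction[i],
             cv.folds          = 5,
             verbose           = FALSE)

  hyper_grid$n_trees_opt[i] <- which.min(fit$cv.error)  # best iteration for this combo
  hyper_grid$Deviance[i]    <- min(fit$cv.error)        # its cross-validation deviance
}

head(hyper_grid[order(hyper_grid$Deviance), ], 10)      # best combinations first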

+ + +
##    learning_rate depth min_obs_node bag_fraction n_trees_opt Deviance
+## 1           0.05     9           15          0.7         379 1.205874
+## 2           0.05     9            5          0.7         468 1.206524
+## 3           0.05     9           10          0.7         390 1.206866
+## 4           0.05     5            5          0.7         591 1.207157
+## 5           0.05     5            5          0.5         745 1.207220
+## 6           0.05     9           15          0.5         412 1.207776
+## 7           0.05     5           10          0.5         560 1.208000
+## 8           0.05     5           15          0.7         512 1.208149
+## 9           0.05     3           10          0.7         900 1.208185
+## 10          0.05     5           10          0.7         607 1.208268
+

The first row of the output shows the optimal hyperparameters. Now that we have found them, we train our final model and use it to evaluate performance on the held-out test dataset. In the code chunk above we saved the optimal hyperparameters in the variable final_parameters.

+ + +

+
## [1] 504
+ +
## [1] "Decision Boundary at 0.486421312493663"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 2176 1098
+##          1 1499 3156
+## [1] "Accuracy: 0.672468154874511"
+

The above code chunk shows the results from our final model, gbm_final, and we see how the accuracy improves over our initial model, gbm_fit. We also see that the GBM model has outperformed the random forest model.

+
+
+
+

Pruning

+

Pruning is a technique to trim a tree after it has been grown fully. Most algorithms decide whether to split a node based on the information gain seen at that particular split; pruning lets us delay that decision. With pruning, a tree is grown to its maximum depth (usually a hyperparameter set by the user) and the cumulative information gain of each split is then evaluated, which counteracts the greediness of the algorithm. Gradient boosting algorithms such as GBM are greedy in nature.

+

Greedy algorithms such as GBMs make the splitting decision based on the information gained immediately at a split. However, the overall information gain of a split, once subsequent splits are taken into account, may turn out to be greater than the gain seen immediately at that particular split; pruning makes it possible to exploit this. In the pruning framework, node splitting is not greedy: the tree is first grown deeper, after which the overall information gain of each split is calculated and the decision to retain or discard the split is made. If a split is retained, the node keeps its immediate children unchanged; if it is discarded, the node at which the split is discarded is treated as a terminal (leaf) node.

+

The package xgboost is an open-source gradient boosting framework that incorporates pruning, making it smarter than purely greedy gradient boosting implementations. This is one of xgboost's main advantages.

+
+
+

Xgboost

+

xgboost is a very powerful package that makes training gradient boosted trees efficient by incorporating parallelization and early stopping. Early stopping halts training when improvement across subsequent iterations is no longer noticeable according to some stopping criterion. Together, parallelization and early stopping make xgboost much faster than many other gradient boosting frameworks.

+
+

Hyperparameters

+
    +
  1. nrounds: Number of iterations or number of trees to build
  2. +
  3. max_depth: The maximum depth to which a tree is allowed to grow
  4. +
  5. eta: The learning rate.
  6. +
  7. min_child_weight: The minimum sum of instance weights in a child node to accept a split.
  8. +
  9. subsample: The fraction of observations to consider for building a single tree.
  10. +
  11. colsample_bytree: The proportion of columns to be considered when constructing each tree.
  12. +
  13. nfolds: Number of folds for cross-validation if cross-validation is used.
  14. +
  15. early_stopping_rounds: xgboost also allows early stopping. This parameter defines the number of trees after which training is stopped if improvement in validation error is not noticed.
  16. +
  17. nthread: Helps in parallelization. Number of threads to run simultaneously
  18. +
+

The following code shows the xgboost set-up that we follow. As in our gbm example, we first define a hyperparameter grid and then iterate through the hyperparameter space to obtain the optimal parameters; we then construct our final model with those hyperparameters and evaluate the results. The function xgb.cv() lets xgboost train under cross-validation, but its result cannot be used with R's predict function to predict values. So we use xgb.cv() for tuning and then train with the tuned parameters via the xgboost() function, whose output can be used with predict on new data.

+

One way xgboost achieves its speed is by working on numeric matrices, so we first convert our data frames to matrices, as shown in the following code chunk.
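A sketch of that conversion; column names follow the rest of the tutorial, the as.character() step is only needed if is_popular is stored as a factor, and the xgb.DMatrix wrapper is our own convenience for the cross-validation step below.

library(xgboost)

# xgboost wants numeric matrices plus a separate 0/1 label vector
x_train <- as.matrix(train_df[, setdiff(names(train_df), "is_popular")])
y_train <- as.numeric(as.character(train_df$is_popular))
x_test  <- as.matrix(test_df[, setdiff(names(test_df), "is_popular")])
y_test  <- as.numeric(as.character(test_df$is_popular))

# Wrapping the training data in an xgb.DMatrix is convenient for xgb.cv()
dtrain <- xgb.DMatrix(data = x_train, label = y_train)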

+ +

Now that we have converted the data to matrices, it is ready to be supplied to xgboost, so we begin tuning the hyperparameters.
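A sketch of the tuning loop; the grid columns mirror the results table below, and the error column is read from xgb.cv()'s evaluation log (test_error_mean is the default classification-error metric for binary:logistic).

xgb_grid <- expand.grid(
  eta              = c(0.01, 0.05),
  max_depth        = c(5, 7),
  min_child_weight = c(1, 3),
  subsample        = c(0.5, 0.7),
  colsample_bytree = c(0.6, 0.8),
  n_trees_opt      = 0,
  error            = 0
)

set.seed(42)
for (i in seq_len(nrow(xgb_grid))) {
  params <- list(
    objective        = "binary:logistic",
    eta              = xgb_grid$eta[i],
    max_depth        = xgb_grid$max_depth[i],
    min_child_weight = xgb_grid$min_child_weight[i],
    subsample        = xgb_grid$subsample[i],
    colsample_bytree = xgb_grid$colsample_bytree[i]
  )

  cv <- xgb.cv(params = params, data = dtrain, nrounds = 1000, nfold = 5,
               early_stopping_rounds = 20, verbose = 0)

  xgb_grid$n_trees_opt[i] <- which.min(cv$evaluation_log$test_error_mean)
  xgb_grid$error[i]       <- min(cv$evaluation_log$test_error_mean)
}

head(xgb_grid[order(xgb_grid$error), ], 10)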

+ + +
##     eta max_depth min_child_weight subsample colsample_bytree n_trees_opt
+## 1  0.01         5                1       0.5              0.6         984
+## 2  0.01         7                3       0.5              0.8         615
+## 3  0.05         5                3       0.7              0.6         345
+## 4  0.01         7                3       0.7              0.6         702
+## 5  0.01         7                1       0.7              0.8         616
+## 6  0.01         5                3       0.5              0.6         994
+## 7  0.01         7                1       0.5              0.6         750
+## 8  0.01         7                1       0.7              0.6         653
+## 9  0.01         5                1       0.7              0.6         777
+## 10 0.01         5                3       0.5              0.8         898
+##        error
+## 1  0.3199747
+## 2  0.3199750
+## 3  0.3205102
+## 4  0.3206055
+## 5  0.3208893
+## 6  0.3210150
+## 7  0.3210153
+## 8  0.3210467
+## 9  0.3212990
+## 10 0.3213938
+

Now that we have zeroed in on the optimal parameters, we train our final xgboost model and evaluate results.
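A sketch of that final step, reusing the matrices and grid from the sketches above and again thresholding at 0.5 as a placeholder for the reported decision boundary.

best <- xgb_grid[which.min(xgb_grid$error), ]   # top row of the tuning results

xgb_final <- xgboost(
  data    = x_train,
  label   = y_train,
  params  = list(
    objective        = "binary:logistic",
    eta              = best$eta,
    max_depth        = best$max_depth,
    min_child_weight = best$min_child_weight,
    subsample        = best$subsample,
    colsample_bytree = best$colsample_bytree
  ),
  nrounds = best$n_trees_opt,
  verbose = 0
)

xgb_prob  <- predict(xgb_final, x_test)                  # predicted probabilities
xgb_class <- factor(ifelse(xgb_prob > 0.5, 1, 0), levels = c(0, 1))
caret::confusionMatrix(xgb_class, factor(y_test, levels = c(0, 1)))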

+ + +
## [1] "Decision Boundary at 0.540467292070389"
+## [1] "Confusion Matrix:"
+##           Reference
+## Prediction    0    1
+##          0 2503 1474
+##          1 1172 2780
+## [1] "Accuracy: 0.666288308740068"
+
+
+
+

References

+
    +
  1. “What Is a Decision Tree? - Examples, Advantages & Role in Management.” Study.com, 27 September 2015, study.com/academy/lesson/what-is-a-decision-tree-examples-advantages-role-in-management.html
  2. +
  3. Breiman,L., Friedman,J., Olshen,R. and Stone,C. (1984) Classification and Regression Trees. Wadsworth, Belmont, CA.
  4. +
  5. K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
  6. +
  7. Chapter 4: Decision Trees Algorithms, medium.com, 06 October 2017, medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
  8. +
  9. L. E. Raileanu and K. Stoffel. Theoretical comparison between the gini index and information gain criteria. University of Neuchatel, 2000
  10. +
  11. Variable Importance plot code is adapted from https://gist.github.com/ramhiser/6dec3067f087627a7a85
  12. +
  13. Probst, P., Wright, M. & Boulesteix, A.-L. Hyperparameters and tuning strategies for random forest. WIREs Data Mining Knowledge Discovery 9, e1301, https://doi.org/10.1002/widm.1301 (2019)
  14. +
  15. Strobl C, Hothorn T, Zeileis A (2009). “Party on! – A New, Conditional Variable Importance Measure for Random Forests Available in the party Package.” The R Journal,1(2), 14–17. URL http://journal.R-project.org/archive/2009-2/RJournal_2009-2_Strobl~et~al.pdf.
  16. +
  17. Fit Random Forest Model from http://code.env.duke.edu/projects/mget/export/HEAD/MGET/Trunk/PythonPackage/dist/TracOnlineDocumentation/Documentation/ArcGISReference/RandomForestModel.FitToArcGISTable.html
  18. +
  19. “Gradient Boosting Machines.” Gradient Boosting Machines · UC Business Analytics R Programming Guide, http://uc-r.github.io/gbm_regression.
  20. +
  21. “Gbm.” Function | R Documentation, https://www.rdocumentation.org/packages/gbm/versions/2.1.5/topics/gbm.
  22. +
  23. He, Tong. “Xgboost v0.90.0.2.” Xgboost Package | R Documentation, https://www.rdocumentation.org/packages/xgboost/versions/0.90.0.2.
  24. +
  25. “Gradient Boosting.” Wikipedia, Wikimedia Foundation, 21 Oct. 2019, https://en.wikipedia.org/wiki/Gradient_boosting.
  26. +
  27. Chen, et al. “XGBoost: A Scalable Tree Boosting System.” ArXiv.org, 10 June 2016, https://arxiv.org/abs/1603.02754.
  28. +
+
+
+ + +
+ +
+
+ + + + + + + + + +