Overview of Classification Trees using R
Introduction to Decision Trees and CART
All of us have used decision trees in our day-to-day lives. In fact, my mornings often start with using a decision tree to decide "Do I really have to wake up?".
A Decision Tree is a hierarchical representation of possible solutions conditioned on certain factors. It is a tree-like model that can be used to determine a course of action.
Take the above decision tree for instance: whether I wake up or not depends on whether my alarm rang, and further on whether I have less than 30 minutes to get to class. Hence, decisions are being made recursively.
Terminologies
Before diving further into decision trees and how they help address classification and regression problems, let's familiarize ourselves with a few terms:
- Leaf/Terminal Node: Represents our final decision.
- Decision Node: Represents the choice that we are making and will have outward branches based on the number of choices we have.
- Root Node: Represents the top-most decision node.
Let’s dive in and understand how decision trees are being used to solve machine learning problems.
Classification and Regression Trees (CART)
CART is a supervised machine learning algorithm introduced by Breiman et al. that uses recursive partitioning to model classification and regression problems as binary decision trees.
To understand how CART works, we will be working with a dataset on online news popularity.
Implementation
About the Dataset

- Source: data.world
- Description: The dataset contains 61 attributes of 39644 news articles. A detailed description of the attributes can be found here.
Problem Statement
This dataset will be used to model the popularity of a news article as a classification problem. One of the attributes of the dataset is the number of shares an article gets, which is a measure of its popularity. If an article gets 1400 shares or more it is considered to be a popular article \(^{[3]}\); otherwise it is unpopular.
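The is_popular target used throughout the rest of this tutorial is derived from the share counts; a minimal sketch of that step (assuming the raw column is named shares, as in the original online news popularity data) could be:

## derive the binary target from the raw share counts (sketch; column name assumed)
online_news$is_popular <- factor(ifelse(online_news$shares >= 1400, 1, 0))
online_news$shares <- NULL ## drop the raw count so it cannot leak into the models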
Overview of the modeling process

1. Split the data into train and test sets
2. Train the model on the training set
3. Predict on the test set using the trained model
4. Evaluate the model
1. Split the data into train and test set
We will be using 80% of the data as the training set and 20% as the test set. To get a random split we can use the sample() function in R. You can set a seed to ensure that you get the same randomized split every time.
n <- nrow(online_news)
n_train <- round(0.80 * n) ## no. of rows for training set
set.seed(252526)
train_indices <- sample(1:n, n_train) ## training indices
train_df <- online_news[train_indices, ] ## training set
test_df <- online_news[-train_indices, ] ## test set
Note: A very important step in model building is feature engineering. We have skipped it entirely because this tutorial focuses on the algorithms themselves; feature engineering and transformation can provide greater accuracy.
2. Train the model on training set
Now that we have a training set, we will use the rpart package to create and train a classification tree that determines whether an article is popular or not. Another package, rpart.plot, is used to visualize the fitted tree.
library(rpart)       ## classification and regression trees
library(rpart.plot)  ## visualizing rpart trees

## method='class' indicates it's going to be a classification tree
dt_model1 <- rpart(is_popular ~ ., data = train_df, method = "class")
## for a regression tree you can use method='anova'

## yesno=2 indicates which branch is chosen for true
## extra=4 shows the probability of each class
rpart.plot(dt_model1, yesno = 2, extra = 4) ## visualize the decision tree
Using the training data, the CART algorithm learns which features help in predicting the popularity of an article. From the tree above we can see that, based on the training data, if the best keyword (most popular keyword) of the article has an average number of shares (kw_avg_avg) of 2824 or more and the article is on entertainment, then whether the article is popular is conditioned on it being published during the week or on a weekend. The article has a 0.68 probability of being popular if it is published on a weekend* but only a probability of 0.42 if it's published on a weekday. Hence, a logical inference from this model is that entertainment articles containing popular keywords should be published on a weekend rather than on a weekday to maximize their chance of being popular.
Note: is_weekend = 0 indicates it is a weekday.
3. Predict using the trained model on the test set
Using the predict() function, we can get predictions on the test dataset from the trained model.
## type='class' gives classes as predictions
## another option is type='prob', which gives the probabilities of belonging to each class
dt_pred1 <- predict(object = dt_model1, newdata = test_df, type = 'class')
4. Evaluate the model
Accuracy, area under the curve (AUC), confusion matrix, precision, recall/sensitivity, and specificity are some of the metrics popularly used to evaluate classification models.
The confusionMatrix() function from the caret package provides us with a summary of these metrics, based on the predictions from the model.
## other packages also define a confusionMatrix() function; prefixing the package name ensures the right one is called
caret::confusionMatrix(dt_pred1, test_df$is_popular)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2128 1374
## 1 1547 2880
##
## Accuracy : 0.6316
## 95% CI : (0.6209, 0.6422)
## No Information Rate : 0.5365
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.2569
##
## Mcnemar's Test P-Value : 0.00146
##
## Sensitivity : 0.5790
## Specificity : 0.6770
## Pos Pred Value : 0.6077
## Neg Pred Value : 0.6506
## Prevalence : 0.4635
## Detection Rate : 0.2684
## Detection Prevalence : 0.4417
## Balanced Accuracy : 0.6280
##
## 'Positive' Class : 0
##
How to determine the best split?
In the section above we saw that the CART algorithm was able to infer rules which split our dataset into subsets that are as homogeneous as possible. The underlying goal of a classification tree is to split the dataset into subsets where each subset belongs to only one class. A classification tree tries to come up with decision boundaries by splitting the input features in such a way that most samples in each resulting subset belong to a single class. This raises the question: how do we determine the best split?
As the goal is to partition the dataset into subsets that are as homogeneous (pure) as possible, the best split can be determined by measuring the impurity (misclassifications) introduced by each rule and choosing the rule that reduces the impurity the most.
Gini Index
The Gini Index is an impurity-based criterion that measures the probability of an observation being wrongly classified given the partitioning criterion.
Mathematically, the Gini Index is calculated as \[G = 1 - \sum_{c \in C} p_c^2\] where,
- \(C\) is the set of classes; for our example \(C = \{1, 0\}\), where 1 indicates popular and 0 indicates unpopular
- \(p_c\) is the proportion of the samples that belong to class \(c\)
The value of the Gini Index lies in \([0,1)\), where 0 indicates that there are no misclassifications: \(p_c = 1\) for one class, meaning all samples belong to the same class. In the case of binary classification, a random distribution would imply that the proportion of each class is 0.5, so in our example the Gini Index would be \[G = 1 - P(popular = 0)^2 - P(popular = 1)^2\] \[= 1 - 0.5^2 - 0.5^2 = 1 - 0.25 - 0.25 = 0.5\] So the maximum value the Gini Index can take is 0.5 for binary classification, and \(1 - \frac{1}{n}\) for an n-class classification. As a lower Gini Index indicates less impurity, at each decision node the algorithm tries to minimize the Gini Index and chooses the splitting criterion that yields the minimum.
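As a quick illustration (not part of the original analysis), the Gini Index of a node can be computed directly from its class proportions:

## Gini Index for a vector of class proportions (illustrative helper)
gini_index <- function(p) 1 - sum(p^2)

gini_index(c(0.5, 0.5)) ## 0.5, the maximum for a binary problem
gini_index(c(1, 0))     ## 0, a perfectly pure node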
Information Gain
Information Gain is also an impurity-based criterion; it measures impurity as the difference in entropy of the samples before and after the split.
Entropy is a measure of randomness. Mathematically Entropy and Information Gain are calculated as follows:
\[H(S) = -\sum_{c \in C} p(c)\log_{2}p(c)\] where,
- \(S\) is the dataset before the split
- \(C\) is the set of classes in \(S\); for our example \(C = \{1, 0\}\), where 1 indicates popular and 0 indicates unpopular
- \(p(c)\) is the proportion of observations of class \(c\) in the dataset \(S\)
\(H(S) = 0\) implies that there is no randomness in the data, and all observations belong to the same class
\[IG(A, S) = H(S) - \sum_{t \in T} p(t)H(t)\] where,
- \(H(S)\) is the entropy of the dataset before the split.
- \(T\) is the set of subsets created after splitting \(S\) on feature \(A\) based on some criterion. For example, in our case \(A\) could be is_weekend; the corresponding \(S\) would then be all observations where data_channel_is_entertainment is 1, and \(T\) would be the 2 subsets created for weekend and weekday articles.
- \(p(t)\) is the proportion of the number of observations in \(t\) to the number of observations in \(S\).
- \(H(t)\) is the entropy of the subset \(t\).
Information Gain provides us with a measure of the reduction in randomness given the splitting criterion. Hence, at each decision node the algorithm tries to maximize the information gain and chooses the splitting criterion that leads to the maximum gain.
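To make the formulas concrete, here is a small illustrative sketch (again, not part of the original analysis) that computes entropy and the information gain of a candidate split from class labels:

## entropy of a vector of class labels
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}

## information gain of splitting the labels y into the groups defined by `split`
info_gain <- function(y, split) {
  weights <- prop.table(table(split))
  entropy(y) - sum(sapply(names(weights),
                          function(g) weights[g] * entropy(y[split == g])))
}

## toy example: a perfectly informative split recovers the full parent entropy
y     <- c(1, 1, 1, 0, 0, 0)
split <- c("a", "a", "a", "b", "b", "b")
info_gain(y, split) ## 1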
The rpart package uses the Gini Index as the default splitting criterion, but it also provides an option to use information gain instead.
dt_model2 <- rpart(is_popular ~ ., data = train_df, method = "class", parms = list(split = 'information'))
rpart.plot(dt_model2, yesno = 2, extra = 4) ## visualize the decision tree
As seen above, the Gini Index and information gain produce similar results in most cases. Based on a study by L. E. Raileanu et al., the choice between the Gini Index and information gain matters in only about 2% of cases. Hence, the Gini Index is usually the way to go, as it is computationally less expensive than information gain.
Stopping Criterion
The CART algorithm recursively applies the impurity-based criterion to determine splits at each decision node until it reaches a homogeneous subset, or until the improvement falls below a certain threshold.
Hyperparameter Tuning
Hyperparameters are parameters that define the model architecture. They control the mechanics of an algorithm and are set before the algorithm is trained. They govern the responsiveness, learning speed and efficiency of the algorithm, so tuning them helps fine-tune the model and can produce better performance.
A decision tree has 3 important hyperparameters:

- cp: the complexity parameter, i.e. the minimum improvement in the model required at each node. It is based on the cost function of the model and acts as a stopping parameter.
- minsplit: the minimum number of data points needed to attempt a split.
- maxdepth: the maximum depth up to which a tree can grow.

cp is considered one of the most important hyperparameters as it speeds up the search for splits: splits that do not meet the set threshold are pruned away. Let us look at how playing around with these parameters alters model performance. plotcp() plots the cp values against the cross-validation error, as sketched below.
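For instance, a minimal sketch (using the model trained earlier) for inspecting the complexity parameter:

plotcp(dt_model1)  ## cp values vs cross-validated error
printcp(dt_model1) ## the same information as a table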
Using rpart.control() we can define these hyperparameters.
library(pROC) ## provides the roc() function

#' Evaluate model accuracy using a confusion matrix and AUC, and determine the decision boundary.
#' To find the optimal threshold between the true positive rate (sensitivity) and the false
#' positive rate (1 - specificity), the function below chooses a threshold and prints a summary
#' of the resulting accuracy.
#' @param data dataset containing the actual test values
#' @param predictions predicted probabilities for the target variable
#' @param plt logical; whether to plot the ROC curve
find_auc <- function(data, predictions, plt = FALSE){
  auc <- roc(data[,'is_popular'], predictions, plot = plt, quiet = TRUE)
  # print(auc$auc)
  ## pick the threshold that maximizes sensitivity + specificity
  thresholds <- cbind(auc$thresholds, auc$sensitivities + auc$specificities)
  thr <- subset(thresholds, thresholds[,2] == max(thresholds[,2]))[,1]
  print(paste('Decision Boundary at', thr))
  cnf <- caret::confusionMatrix(as.factor(ifelse(predictions >= thr, 1, 0)),
                                as.factor(data[,'is_popular']))
  print('Confusion Matrix:')
  print(cnf$table)
  print(paste('Accuracy:', cnf$overall[1]))
}

dt_hyperparam <- rpart.control(minsplit = 20, maxdepth = 10, cp = 0.001)
dt_model3 <- rpart(is_popular ~ ., data = train_df, method = "class", control = dt_hyperparam)
rpart.plot(dt_model3, yesno = 2, extra = 4)
dt_pred3 <- predict(object = dt_model3, newdata = test_df, type = 'prob')
find_auc(test_df, dt_pred3[,2]) ## the second column contains the prob of class being 1
## [1] "Decision Boundary at 0.559922744379563"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 2202 1343
## 1 1473 2911
## [1] "Accuracy: 0.644848026232816"
As you can see, tuning the tree improved our accuracy from around 63% to about 64%, but the tree has become harder to visualize and interpret. From the cp plot we can see that beyond a cp of 0.003 there is hardly any improvement in the cross-validation error, so we can prune the tree to simplify it.
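The pruning step itself is not shown here; assuming the cp cutoff of 0.003 read off the cp plot, a minimal sketch using rpart's prune() would be:

## prune the tuned tree back to the cp value chosen from the cp plot (sketch)
dt_model3 <- prune(dt_model3, cp = 0.003)
rpart.plot(dt_model3, yesno = 2, extra = 4)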
dt_pred3 <- predict(object = dt_model3, newdata = test_df, type = 'prob')
find_auc(test_df, dt_pred3[,2]) ## the second column contains the prob of class being 1
## [1] "Decision Boundary at 0.568987943003288"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 2136 1334
## 1 1539 2920
## [1] "Accuracy: 0.637659225627444"
After pruning, the model is much easier to interpret and the accuracy is 63.7%, which is still around 64%. The trade-off between model complexity and accuracy must be weighed on a case-by-case basis, depending on the problem statement. For example, if this dataset were used to predict whether a patient is likely to have a disease, we would choose higher accuracy over lower model complexity.
Advantages and Disadvantages
A decision tree comes with its own share of pros and cons.
Advantages

- They are easy to understand, interpret and visualize.
- A decision tree can handle both numerical and categorical variables natively. At each decision node, the splitting criterion is based on a feature. For a categorical feature the splitting criterion is set with respect to observations belonging to a certain category/level/class, so data transformations like one-hot encoding are not required. For numerical features, the splitting criterion is based on observations being higher or lower than a threshold. As discussed above, the best splitting criterion is chosen based on the impurity measure, and the type of feature is irrelevant to this selection.
- Decision trees, being a hierarchical structure, are able to learn multiple decision boundaries from the data and hence are efficient at modeling non-linear data.
- They perform well on larger datasets, as trees are computationally efficient.
Disadvantages

- Decision trees are prone to overfitting, most prominently when the tree is deep. The tree tries to capture the granularity of the data and hence becomes a complex model. A slight change in the data can drastically affect the model, making decision trees unstable.
- In the case of categorical variables with several levels, impurity-based criteria are biased in favor of attributes with more observations, i.e. for an imbalanced dataset decision trees can give biased results.
Bagged Trees
Motivation
A major drawback of decision trees is their high variance. A slight change in the dataset can result in a completely different series of splits, which makes model interpretation unreliable. Bagged trees help address this issue by averaging multiple trees grown from subsamples of the data.
Bagging, a.k.a. Bootstrap Aggregation, is an ensemble technique wherein multiple models are trained on bootstrapped samples from the dataset and an aggregate of the predictions from these models is taken as the final prediction.
Ensembling refers to combining several weak learners to form one strong learner. In the context of decision trees, weak learners are represented by shallow trees that perform only slightly better than random guessing.
Bootstrap sampling refers to sampling from a dataset with replacement. This implies that a sample can contain some observations multiple times and some not at all.
Algorithm
Bagging can be mathematically represented as \[\hat{y} = \frac{1}{N} \sum_{i=1}^N T_i(x)\] where,

- \(\hat{y}\) is the predicted value
- \(N\) is the number of trees created
- \(T_i\) is the tree trained on the \(i\)-th bootstrapped sample
How do bagged trees help in reducing the variance of our predictions?
Mathematically the variance is given as \[Var(X) = E[X^2] - (E[X])^2\]
Therefore, the variance of our predicted values is calculated as:
\[Var(\hat{y}) = \frac{1}{N^2}\sum_{i=1}^N Var(T_i(x)) + \frac{2}{N^2} \sum_{i<j} Cov(T_i(x), T_j(x))\]
\[= \frac{1}{N^2}\sum_{i=1}^N Var(T_i(x)) + \frac{2}{N^2} \sum_{i<j} \sqrt{Var(T_i(x))Var(T_j(x))}\, Cor(T_i(x), T_j(x))\]
From the above equation we can see that the variance of the predicted values is proportional to the correlation between the predictions from different trees. As decision trees are unstable, generating trees on bootstrapped samples gives models that have little correlation with each other. As the correlation between models decreases, so does the variance, and hence bagged trees reduce the overall variance of our predictions.
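As an illustrative sketch (not from the original article), a quick simulation shows how averaging correlated predictions reduces variance roughly to \(\rho + \frac{1-\rho}{N}\) times that of a single tree:

## simulate N correlated "tree predictions" for one observation and compare the
## variance of a single tree with the variance of their average
set.seed(1)
N     <- 25    ## number of trees
rho   <- 0.3   ## pairwise correlation between trees
n_sim <- 10000

sims <- replicate(n_sim, {
  common     <- rnorm(1)   ## shared component, induces correlation rho
  individual <- rnorm(N)   ## independent component per tree
  trees <- sqrt(rho) * common + sqrt(1 - rho) * individual
  c(single = trees[1], bagged = mean(trees))
})

apply(sims, 1, var) ## bagged variance is close to rho + (1 - rho) / N, well below 1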
Implementation
To understand the bagging algorithm, let us first implement it ourselves.
library(tibble) ## for tibble()

nbagg <- 5 ## number of bagged trees
seeds <- c(123, 456, 789, 654, 321)
bagging.eval <- tibble(.rows = nrow(test_df))
for (i in 1:nbagg) {
  set.seed(seeds[i])
  B <- length(train_indices) / 2 ## number of samples being bootstrapped with replacement
  ## extract bootstrapped sample
  bootstrap_sample_idx <- sample(train_indices, size = B, replace = TRUE)
  ## train a model on the bootstrapped sample
  bagged_model <- rpart(is_popular ~ ., data = train_df[bootstrap_sample_idx,], method = "class")
  ## generate predictions from each model and store them
  bagging.eval[,i] <- predict(bagged_model, newdata = test_df, type = "prob")[,2]
}
## Aggregation: average the predicted probabilities across the bagged trees
bagging.eval$agg_pred <- rowMeans(bagging.eval)
find_auc(test_df, bagging.eval$agg_pred)
## [1] "Decision Boundary at 0.482908550592783"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 1893 1074
## 1 1782 3180
## [1] "Accuracy: 0.639803253878169"
You can do the same thing using the bagging() function in the ipred package. This implementation can also produce out-of-bag estimates of the error rate (the misclassification error in the case of classification trees) using the samples that were not used to train each tree.
## nbagg represents the number of bagged trees
## coob = TRUE will calculate the OOB error rate
library(ipred)
bagged_model2 <- bagging(is_popular ~ ., nbagg = 30, data = train_df, coob = TRUE)
##
## Bagging classification trees with 30 bootstrap replications
##
## Call: bagging.data.frame(formula = is_popular ~ ., data = train_df,
## nbagg = 30, coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.3633
bagged_pred2 <- predict(object = bagged_model2, newdata = test_df, type = "prob")
find_auc(test_df, bagged_pred2[,2])
## [1] "Decision Boundary at 0.516666666666667"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 2281 1440
## 1 1394 2814
## [1] "Accuracy: 0.642577878673225"
Since bagged trees are simply an ensemble of independent decision trees, the same hyperparameters that apply to decision trees also apply to bagged trees. Using rpart.control() you can specify hyperparameters for the bagged trees (a sketch is shown below). The bagged model above gave an accuracy of 64.25%, as it used simple decision trees with the same default hyperparameter values as our first model.
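For example (a sketch; ipred::bagging forwards extra arguments such as control on to rpart, but check the package documentation before relying on this):

## bagged trees with custom rpart hyperparameters (sketch; values mirror dt_hyperparam above)
bagged_model3 <- bagging(is_popular ~ ., data = train_df, nbagg = 30, coob = TRUE,
                         control = rpart.control(minsplit = 20, maxdepth = 10, cp = 0.001))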
Advantages and Disadvantages
Advantages

- Bagged trees reduce the variance of predictions and thus help in preventing overfitting.

Disadvantages

- The cost of computation is high, since we need to train a large number of trees.
- Compared to decision trees, bagged models are harder to interpret. Since the final output is a combination of multiple trees, the model's predictions are no longer intuitive.
Random Forest
Random forest is a supervised learning algorithm which can perform both classification and regression, just like CART. What makes random forest different from bagged trees is that it does not consider all the variables in the dataset when choosing which variable to split on; rather, it works only on a subset of the columns at each node. So random forests can be thought of as applying random sampling to the features in addition to bootstrapping the observations.
Some advantages of random forest over the previously discussed algorithms are:

- Random forest helps in reducing overfitting. Overfitting means that we fit the data so closely that the model learns the anomalies of the dataset as well, instead of being a generic model; such a model becomes prone to large variance when tested on new data. In a random forest, each tree is built on a small set of observations, which results in a more generalized tree structure. The variance on out-of-bag samples (samples not used in training) becomes smaller, hence reducing overfitting. Another factor which helps get rid of overfitting is the number of trees: when we aggregate a large number of weakly correlated trees, the variance of the overall model decreases because the more trees there are, the more their individual errors average out.

Let's take a deeper dive into how random forest can help reduce the variance of the model! Randomness in random forest arises from bagging (bootstrap sampling), so every time a tree is created it is trained on a new dataset, creating a large number of weakly correlated trees in the final model. The core concept of random forest is creating multiple high-variance trees to obtain a low-variance ensemble model, reaching a good trade-off between bias and variance. This helps the algorithm not overfit the data.

- High accuracy: for large datasets it predicts with very high accuracy. In today's world of big data it is important for an algorithm to handle large datasets and perform well, and this is one of the strongest points of random forest.
- It performs reliable implicit feature selection.
- It requires almost no input preparation.
- It can easily be grown in parallel.
Algorithm
The figure below summarizes the random forest algorithm:
The training data is randomly sampled with replacement to form subsets, a number of decision trees are grown on the different bootstrap samples, and each tree produces its own predictions. In classification problems the final class assigned to an observation is the mode of the classes predicted for that observation across all trees. This is the basic random forest algorithm. Each tree splits the data differently, based on different sets of variables, with each split decided by an impurity criterion such as information gain. When a new data point needs to be classified, it is passed to each trained tree; to get the final decision, the predictions of all the trees are aggregated, which is also called majority voting (a toy sketch of this step follows below).
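The aggregation step can be illustrated with a toy sketch (not from the original article): given a matrix of per-tree class predictions, the final class for each observation is simply the most frequent vote.

## toy illustration of majority voting: each row is an observation, each column a tree's predicted class
tree_preds <- matrix(c(1, 1, 0,
                       0, 0, 0,
                       1, 0, 1), nrow = 3, byrow = TRUE)

majority_vote <- function(votes) as.integer(names(which.max(table(votes))))
apply(tree_preds, 1, majority_vote) ## 1 0 1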
Implementation
The 2 popular packages in R to implement random forest are:

- randomForest
- party

The party package implementation uses conditional inference trees which, although they have their advantages, are time consuming to build. Hence we have used the randomForest package for our use case.
library(randomForest)

train_model <- randomForest(formula = is_popular ~ .,
                            data = train_df, ntree = 1000L,
                            importance = TRUE) ## importance = TRUE extracts variable importance measures
##
## Call:
## randomForest(formula = is_popular ~ ., data = train_df, ntree = 1000L, importance = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 32.5%
## Confusion matrix:
## 0 1 class.error
## 0 8892 5923 0.3997975
## 1 4384 12516 0.2594083
## Printing the model shows the OOB error estimate, which is explained in a later section
pred_val <- predict(train_model, newdata = test_df, type = "response")

confusionMatrix(pred_val, test_df$is_popular) ## confusion matrix calculated on the test dataset
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2170 1182
## 1 1505 3072
##
## Accuracy : 0.6611
## 95% CI : (0.6506, 0.6715)
## No Information Rate : 0.5365
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3145
##
## Mcnemar's Test P-Value : 5.236e-10
##
## Sensitivity : 0.5905
## Specificity : 0.7221
## Pos Pred Value : 0.6474
## Neg Pred Value : 0.6712
## Prevalence : 0.4635
## Detection Rate : 0.2737
## Detection Prevalence : 0.4228
## Balanced Accuracy : 0.6563
##
## 'Positive' Class : 0
##
Hyperparameters
mtry
mtry specifies the number of variables considered for splitting at each tree node. Choosing the optimal value of mtry can be tricky: picking a small value of mtry can produce a large number of weak-performing trees, while choosing a high value can suppress the information gain of less influential predictors.
tuneRF is a function in the randomForest package which tunes a random forest for the optimal value of mtry (with respect to the OOB error).
The default mtry value tells us that out of the 58 predictor variables, only 7 randomly picked variables are considered at each split, so the 1000 trees we grow will all be quite different from each other. To bring better variety, efficiency and accuracy we use the tuneRF function to get an optimized number of features to be considered for the split at each node.
## exclude the response from the predictor matrix so the target is not leaked into the search
bestmtry <- tuneRF(x = subset(train_df, select = -is_popular), y = train_df$is_popular,
                   stepFactor = 1, improve = 0.01, trace = TRUE, plot = TRUE)
## mtry = 7 OOB error = 0%
## Searching left ...
## Searching right ...
In the randomForest package, the default mtry for classification models is the square root of the number of predictor variables.
nodesize
nodesize specifies the minimum number of observations in a terminal node. In the randomForest package we can also specify maxnodes, i.e. the maximum number of terminal nodes a tree can have, while in the party package we can specify the maxdepth of the trees.
sampsize and replace
We need to handle the trade-off between smaller training samples and more diverse trees. We can sample with or without replacement by changing the replace parameter, and control the size of each sample with sampsize. Switching between sampling with and without replacement can have a small positive effect on the performance of the model; a sketch of these arguments is given below.
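A sketch of how these arguments could be set in randomForest (the values here are arbitrary, not tuned):

## illustrative call only: the nodesize/sampsize/replace values below are arbitrary, not tuned
rf_controlled <- randomForest(is_popular ~ ., data = train_df, ntree = 500,
                              nodesize = 10,                          ## min observations in a terminal node
                              sampsize = floor(0.7 * nrow(train_df)), ## size of each bootstrap sample
                              replace = TRUE)                         ## sample with replacement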
OOB Error
The OOB score is random forest's built-in validation method, which provides unbiased estimates. Each decision tree is trained on a bootstrap sample drawn from the training dataset; the remaining part of the training set forms the out-of-bag (OOB) sample, which is used to calculate the OOB error. OOB samples act as unseen data and thus provide an honest estimate of the model's performance.
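The OOB error of the fitted forest can be inspected directly; a brief sketch based on the structure of a randomForest object:

## err.rate has one row per tree, with an "OOB" column tracking the cumulative OOB error
tail(train_model$err.rate[, "OOB"], 1) ## OOB error after all 1000 trees
plot(train_model)                      ## OOB error as a function of the number of trees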
Importance of Variables
The randomForest package in R calculates the importance of variables by two methods:

- Gini variable importance measure
- Permutation variable importance measure
The Gini measure for a variable is calculated as the mean decrease in Gini impurity across all the splits made on that variable, over all trees. However, Gini as a measure of variable importance produces strongly biased results: it favors continuous variables and categorical variables with many levels, assigning them higher importance scores.
Permutation variable importance (PVI): first, the OOB prediction error of the model is recorded. Then the OOB prediction error is calculated again after permuting (randomly shuffling) each variable individually. The difference between the two errors is averaged over all the trees, and importance is assigned based on that difference; the most important variable is the one for which the accuracy of the model decreases the most.
However, the permutation variable importance method has limitations too: it fails to handle multicollinearity between variables when calculating importance scores. A variable on its own may have little effect on the model, yet its importance score can be inflated when it is correlated with other variables.
The party package in R comes to the rescue! Random forest models built using this package provide a function to calculate conditional variable importance. The traditional approaches above (Gini and PVI) produce an importance score by computing the effect of permuting the predictor variable \(X_j\) with respect to \(Y\). The conditional variable importance available in party, however, analyzes the relationship between \(X_j\) and \(Y\) given the correlation structure of \(X_j\) with respect to the other variables (a brief sketch follows).
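A hedged sketch of how conditional importance could be computed with party (cforest can be very slow on a dataset of this size, so only a small number of trees is used here):

## conditional variable importance with party (illustrative sketch; kept small because cforest is slow)
library(party)
cf_model <- cforest(is_popular ~ ., data = train_df,
                    controls = cforest_unbiased(ntree = 50, mtry = 7))
cond_imp <- varimp(cf_model, conditional = TRUE)
head(sort(cond_imp, decreasing = TRUE))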
The code chunk below shows how variable importance can be calculated with the randomForest package.
library(dplyr)
library(ggplot2)

## type = 1 extracts the permutation importance (mean decrease in accuracy)
var_importance <- tibble(variable = setdiff(colnames(train_df), "is_popular"),
                         importance = as.vector(importance(train_model, type = 1)))
var_importance <- arrange(var_importance, desc(importance))
var_importance$variable <- factor(var_importance$variable, levels = var_importance$variable)

p <- ggplot(var_importance, aes(x = variable, weight = importance, fill = variable))
p <- p + geom_bar() + ggtitle("Variable Importance from Random Forest Fit")
p <- p + xlab("Predictor Variable") + ylab("Variable Importance (Mean Decrease in Accuracy)")
p <- p + scale_fill_discrete(name = "Variable Name")
p + theme(axis.text.x = element_blank(),
          axis.text.y = element_text(size = 12),
          axis.title = element_text(size = 12),
          plot.title = element_text(size = 12),
          legend.title = element_text(size = 12),
          legend.text = element_text(size = 12))
Note that we are only modeling once; all of the variable importance analysis is done after the modeling.
Drawbacks of Random Forest
1. High accuracy comes at the cost of computational resources, since we need to train a large number of trees.
2. Apart from some tuning hyperparameters, we have very little control over what the model does. The ensembling of trees makes the model harder to interpret compared to a single decision tree.
+Boosting and Boosted Trees
Boosting is a framework that prioritizes misclassified samples in order to classify them better. In boosted trees this is achieved by giving each set of misclassified samples a higher weight than the rest of the samples when building a new tree. So far we have seen that bagged trees create multiple independent trees; boosting, however, entails creating multiple trees sequentially, each improving on the previous one. It mainly targets reducing the bias of a model, while bagging mainly helps with reducing variance.
The idea of boosting comes from the fact that several weak learners may be better than a single strong learner. A weak learner is a model that classifies only slightly better than random guessing; a strong learner, on the other hand, is one that achieves good accuracy and is highly correlated with the true classes of the target variable. The key reason this often works is that random forests, for example, build several independent trees (which are generally deeper and thus strong learners), while boosted algorithms build shallow trees (weak learners) from the misclassified observations of a base learner (say, the first tree). This sequential, stage-wise focus on the misclassification error is what gives boosting its advantage.
Let \(h_{m}\) be the weak learner we train at the \(m\)-th stage, \(F_{m-1}\) be our base learner and \(y\) be the target variable. Then the basic idea is that the weak learner is trained on the residuals of the base learner:
\[h_{m}(x) = y - F_{m-1}(x)\] And the \(m\)-th base learner is updated as follows: \[F_{m}(x) = F_{m-1}(x) + h_{m}(x)\] Thus the generalized form of the sequential additive boosting framework can be written as \[F(x) = F_{base}(x) + \sum_{m=2}^{M}h_{m}(x)\] where \(F_{base}\) is the first tree that is initialized (a toy sketch of this update follows).
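To make the update concrete, here is a toy sketch (an illustration only, not the gbm implementation used later) of this sequential residual-fitting idea using shallow rpart regression trees:

## toy additive boosting on residuals with shallow regression trees
## (illustrative sketch of the update F_m = F_{m-1} + h_m, not the gbm algorithm itself)
library(rpart)
toy_boost <- function(X, y, M = 50, depth = 2) {
  df    <- as.data.frame(X)
  pred  <- rep(mean(y), length(y))   ## F_base: start from the mean prediction
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    df$residual <- y - pred                                 ## what the ensemble still gets wrong
    trees[[m]]  <- rpart(residual ~ ., data = df,
                         control = rpart.control(maxdepth = depth))
    pred <- pred + predict(trees[[m]], newdata = df)        ## F_m = F_{m-1} + h_m
  }
  list(trees = trees, fitted = pred)
}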
Why Boosting
Sequential addition of weak learners allows correcting or tweaking the model where it does not perform well, which amounts to slow learning: once a tree is built the model has not finished learning, but instead waits for further weak trees to be built that focus heavily on the samples it has not yet been able to classify well. This results in improved accuracy.
At each step, when a tree is added to the chain of boosted trees, the aggregate model can be evaluated. This holds another key advantage in preventing the model from overfitting: cross-validation after each tree is added helps detect when the error metric has stopped improving, and thus allows training to stop before the algorithm is lured into further reducing the training error and overfitting. This is called early stopping.
One important thing to remember, however, is that sequential attempts to fix misclassified observations place a lot of weight on these samples, and hence boosting is sensitive to outliers and noise. This is where early stopping, as described above, can really help avoid such negative effects. We will see further down this article how it can be controlled with hyperparameters and tuning.
Also, since boosting builds weak learners, these are usually shallow trees. This results in increased speed compared to other frameworks that usually need deeper, independent trees. Furthermore, boosting is generalizable: where necessary, a user can define their own loss function and run the boosting framework to optimize it.
Gradient Boosted Machine
The class of gradient boosted machines uses gradient information to optimize the loss function in boosting, so gradient boosted trees are often viewed as gradient descent with a modified loss function. Gradient descent is a generic optimization algorithm that traverses down the slopes of a function with the aim of finding its minimum; here, that function is the loss function being optimized by the algorithm.
Friedman proposed this modification of the gradient descent loss function for gradient boosted trees. If the weak learner, say the tree \(h_{m}\) for \(m \in \{1, 2, \dots, M\}\), is thought of as the gradient step in the sequential additive boosting model, and \(\gamma_{m}\) is the step size, we can formulate tree creation as a gradient descent problem in which the model is updated sequentially and \(\gamma_{m}\) is calculated as shown in the equations below.
\[F_{m}(x) = F_{m-1}(x) + \gamma_{m}h_{m}(x),\] \[\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n}L\big(y_{i}, F_{m-1}(x_{i}) + \gamma h_m(x_{i})\big)\]
Further, we may modify the above equation with a shrinkage parameter, or learning rate, to avoid overfitting. The shrinkage parameter shrinks the incremental update to the base learner by reducing the contribution of the new weak learner \(h_{m}\):
\[F_{m}(x) = F_{m-1}(x) + \nu\,\gamma_{m}h_{m}(x), \quad 0 < \nu \le 1\]
Implementation
We use the R package gbm for the following sections.
Hyperparameters
GBM includes the following important hyperparameters, among others, as described in the R package gbm:

- n.trees: the number of trees to build.
- interaction.depth: the depth to which each tree is grown.
- shrinkage: the learning rate at which gbm adds weak learners to the base learner.
- n.minobsinnode: the minimum number of observations in a node to accept a split.
- bag.fraction: the fraction of observations to consider for building a single tree.
- cv.folds: the number of folds for cross-validation, if cross-validation is used.
- train.fraction: the proportion of data used for training the entire ensemble of trees; the rest is used as validation data.
We start by training our first gbm model using the code below. The hyperparameters at this point are set arbitrarily and are only there to check that the code runs. We choose a bernoulli distribution through the distribution parameter to let gbm know that we are working on a binary classification problem; with bernoulli, gbm by default minimizes the deviance of our binary logistic model. We can also specify the number of cores through n.cores, which lets R try to run each fold of cross-validation on a different core and often helps with speed. Setting this to NULL makes gbm infer the number of available cores on the machine using the detectCores function from the parallel package.
library(gbm)

gbm_fit <- gbm( is_popular ~ ., distribution = "bernoulli", data = train_df,
                n.trees = 1000,
                interaction.depth = 5,
                n.minobsinnode = 10,
                shrinkage = 0.05,
                bag.fraction = 0.7,
                cv.folds = 5,
                n.cores = NULL,
                verbose = FALSE
)
The gbm.perf function allows us to track the cross-validation deviance across iterations. The plot it produces shows the validation deviance in green and the training deviance in black. We can clearly see that the training deviance continues to fall while the validation deviance stabilizes after a point; this shows how cross-validation helps us avoid overfitting. The blue line marks the optimal number of trees, beyond which the model would start to overfit. This is a key advantage of GBMs.
## gbm(formula = is_popular ~ ., distribution = "bernoulli", data = train_df,
## n.trees = 1000, interaction.depth = 5, n.minobsinnode = 10,
## shrinkage = 0.05, bag.fraction = 0.7, cv.folds = 5, verbose = FALSE,
## n.cores = NULL)
## A gradient boosted model with bernoulli loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 991.
## There were 58 predictors of which 58 had non-zero influence.


## [1] 991

## Optimal trees: 991
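The optimal iteration count reported above is stored for the prediction step below as ntrees_min_dev; presumably this was obtained with gbm.perf, which returns the estimated optimal number of iterations, e.g.:

## gbm.perf plots the deviance curves and returns the optimal number of boosting iterations
ntrees_min_dev <- gbm.perf(gbm_fit, method = "cv")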
The following code chunk evaluates our model on the test dataset which we held out for prediction early on. When used with a gbm model and type = 'response', the predict function produces predicted probabilities, so we need to convert these to classes. To do this we call our find_auc() function and evaluate the model.
# Prediction Error
test_y_pred <- predict(gbm_fit, newdata = test_df, n.trees = ntrees_min_dev, type = 'response')
thr <- find_auc(test_df, test_y_pred)
## [1] "Decision Boundary at 0.527258706739771"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 2397 1348
## 1 1278 2906
## [1] "Accuracy: 0.668810694917392"
Tuning Hyperparameters
Above, we chose the hyperparameters arbitrarily. The performance of a model is heavily dependent on the hyperparameters selected to run it, so we tune our model to find the set of hyperparameters that yields optimal performance.
We first define a hyperparameter grid using the expand.grid function. This defines the range of values over which gbm will be run to determine the best-fitting model. In the parameter grid we have also defined two variables to store the number of optimal trees and the minimum deviance from each iteration.
# Hyperparameter Grid
parameter_grid <- expand.grid(
  learning_rate = c(0.01, 0.05, 0.1),
  depth = c(3, 5, 9),
  min_obs_node = c(5, 10, 15),
  bag_fraction = c(0.5, 0.7),
  n_trees_opt = 0, # Number of optimal trees
  Deviance = 0 # Minimum deviance
)
Now we construct a loop to iterate over the hyperparameter space. In each iteration we train a gbm model and record its error statistics, more specifically the validation deviance. Along with this we also record the optimal number of trees. This will help us in selecting our final model once we have been able to iterate over our entire hyperparameter space.
# Looping across all parameters ----
for(i in 1:nrow(parameter_grid)) {

  cat('Model No:', i, '\n')

  # train model
  gbm_tune <- gbm( is_popular ~ ., distribution = "bernoulli", data = train_df,
                   n.trees = 1000,
                   interaction.depth = parameter_grid$depth[i],
                   shrinkage = parameter_grid$learning_rate[i],
                   n.minobsinnode = parameter_grid$min_obs_node[i],
                   bag.fraction = parameter_grid$bag_fraction[i],
                   train.fraction = .6,
                   n.cores = NULL,
                   verbose = FALSE
  )

  cat('Deviance:', min(gbm_tune$valid.error), '\n')
  # add minimum validation error and optimal number of trees to the grid
  parameter_grid$n_trees_opt[i] <- which.min(gbm_tune$valid.error)
  parameter_grid$Deviance[i] <- min(gbm_tune$valid.error)
}
# Finalizing Parameters ----
parameter_grid <- parameter_grid %>% dplyr::arrange(Deviance)
final_parameters <- parameter_grid[1,]
parameter_grid %>% head(10)
## learning_rate depth min_obs_node bag_fraction n_trees_opt Deviance
## 1 0.05 9 15 0.7 379 1.205874
## 2 0.05 9 5 0.7 468 1.206524
## 3 0.05 9 10 0.7 390 1.206866
## 4 0.05 5 5 0.7 591 1.207157
## 5 0.05 5 5 0.5 745 1.207220
## 6 0.05 9 15 0.5 412 1.207776
## 7 0.05 5 10 0.5 560 1.208000
## 8 0.05 5 15 0.7 512 1.208149
## 9 0.05 3 10 0.7 900 1.208185
## 10 0.05 5 10 0.7 607 1.208268
The first row of the output shows the optimal hyperparameters. Now that we have found them, we will train our final model and use it to evaluate the performance on the held-out test dataset. In the code chunk above we saved the optimal hyperparameters in the variable final_parameters.
# Final GBM Tree ----
gbm_final <- gbm( is_popular ~ ., distribution = "bernoulli", data = train_df,
                  n.trees = 8000,
                  interaction.depth = final_parameters$depth,
                  shrinkage = final_parameters$learning_rate,
                  n.minobsinnode = final_parameters$min_obs_node,
                  bag.fraction = final_parameters$bag_fraction,
                  cv.folds = 4,
                  n.cores = NULL, # will use all cores by default
                  verbose = FALSE
)
## [1] 504
ntrees_final <- which.min(gbm_final$cv.error)

# Prediction Error
test_y_pred <- predict(gbm_final, newdata = test_df, n.trees = ntrees_final, type = 'response')
thr <- find_auc(test_df, test_y_pred)
## [1] "Decision Boundary at 0.486421312493663"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 2176 1098
## 1 1499 3156
## [1] "Accuracy: 0.672468154874511"
The above code chunk shows results from our final model, gbm_final, and we see how the accuracy improves over our initial model, gbm_fit. We also see that the GBM model has outperformed the random forest model.
Pruning
Pruning is a technique to trim a tree after it has been grown fully. Most algorithms decide whether to split a node based on the information gain they see at that particular split; pruning allows us to delay the decision about the final split. With pruning, a tree is grown to its maximum depth (usually a hyperparameter set by the user) and then the cumulative information gain of each split is evaluated. This overcomes the greediness of such algorithms; gradient boosting algorithms such as GBM are greedy in nature.
Greedy algorithms such as the class of GBMs make the splitting decision based on the information gained immediately at a split. However, the overall information gain of a split, once subsequent splits are taken into account, may turn out to be greater than the gain seen immediately at that particular split; this is what pruning facilitates. In the pruning framework, node splitting is not greedy: the tree is grown deeper, after which the overall information gain of each split is calculated and the decision to retain or discard the split is made. If a split is retained, the resolution of the node into its immediate children is kept unchanged; when a split is discarded, the node at which the split is discarded is treated as a terminal (leaf) node.
The package xgboost is an open-source framework of gradient boosting algorithms that incorporates pruning, making it smarter than other greedy gradient boosting implementations. This is the main advantage of xgboost.
Xgboost
Xgboost is a very powerful package that makes training gradient boosted trees very efficient by incorporating parallelization and early stopping. Early stopping allows training to stop when the improvement in subsequent iterations is no longer noticeable based on some stopping criterion. Parallelization and early stopping make xgboost much faster than other gradient boosting frameworks.
Hyperparameters

- nrounds: the number of iterations, i.e. the number of trees to build.
- max_depth: the maximum depth to which a tree is allowed to grow.
- eta: the learning rate.
- min_child_weight: the minimum sum of instance weights in a child node to accept a split.
- subsample: the fraction of observations to consider for building a single tree.
- colsample_bytree: the proportion of columns to be considered when constructing each tree.
- nfold: the number of folds for cross-validation, if cross-validation (xgb.cv) is used.
- early_stopping_rounds: xgboost also allows early stopping; this parameter defines the number of rounds after which training is stopped if no improvement in the validation error is observed.
- nthread: helps with parallelization; the number of threads to run simultaneously.
The following code shows the xgboost set-up that we follow. As with our gbm example, we will tune the parameters of the xgboost model: we set up the hyperparameter grid first and then iterate through the hyperparameter space to obtain the optimal parameters. Following this we construct our final tree with the optimal hyperparameters and evaluate the results. The function xgb.cv() allows xgboost to train with cross-validation; however, it cannot be used with R's predict function to predict values. So we use xgb.cv for tuning and then use the tuned parameters to train with the xgboost() function, which can be used with predict on new data.
Xgboost is very fast, and one way it achieves this is by working on matrices. So we first convert our data frames to matrices as shown in the following code chunk.
library(xgboost)

X_train <- as.matrix(subset(train_df, select = -c(is_popular)))
y_train <- as.matrix(subset(train_df, select = c(is_popular)))
X_test <- as.matrix(subset(test_df, select = -c(is_popular)))
y_test <- as.matrix(subset(test_df, select = c(is_popular)))
Now that we have converted the data to matrices, everything is ready to be supplied to xgboost, and we begin tuning the hyperparameters.
# Hyperparameter grid
xgb_parameter_grid <- expand.grid(eta = c(.01, .05, .1),
                                  max_depth = c(3, 5, 7),
                                  min_child_weight = c(1, 3),
                                  subsample = c(0.5, 0.7),
                                  colsample_bytree = c(0.6, 0.8),
                                  n_trees_opt = 0,
                                  error = 0
)

# Loop for tuning
for(i in 1:nrow(xgb_parameter_grid)) {

  cat('Model No:', i, '\n')

  # create parameter list
  hyper_parameters <- list(eta = xgb_parameter_grid$eta[i],
                           max_depth = xgb_parameter_grid$max_depth[i],
                           min_child_weight = xgb_parameter_grid$min_child_weight[i],
                           subsample = xgb_parameter_grid$subsample[i],
                           colsample_bytree = xgb_parameter_grid$colsample_bytree[i]
  )

  # train model with cross-validation
  xgb_tune <- xgb.cv( data = X_train, label = y_train,
                      params = hyper_parameters,
                      nrounds = 1000,
                      nfold = 4,
                      objective = "binary:logistic",
                      metrics = 'error',
                      verbose = 0,
                      early_stopping_rounds = 100
  )

  cat('Mean Error:', min(xgb_tune$evaluation_log$test_error_mean), '\n')
  # add minimum cross-validation error and optimal number of trees to the grid
  xgb_parameter_grid$n_trees_opt[i] <- which.min(xgb_tune$evaluation_log$test_error_mean)
  xgb_parameter_grid$error[i] <- min(xgb_tune$evaluation_log$test_error_mean)
}
# Finalizing Parameters ----
xgb_parameter_grid <- xgb_parameter_grid %>% dplyr::arrange(error)
final_parameters <- xgb_parameter_grid[1,]

xgb_parameter_grid %>% head(10)
## eta max_depth min_child_weight subsample colsample_bytree n_trees_opt
## 1 0.01 5 1 0.5 0.6 984
## 2 0.01 7 3 0.5 0.8 615
## 3 0.05 5 3 0.7 0.6 345
## 4 0.01 7 3 0.7 0.6 702
## 5 0.01 7 1 0.7 0.8 616
## 6 0.01 5 3 0.5 0.6 994
## 7 0.01 7 1 0.5 0.6 750
## 8 0.01 7 1 0.7 0.6 653
## 9 0.01 5 1 0.7 0.6 777
## 10 0.01 5 3 0.5 0.8 898
## error
## 1 0.3199747
## 2 0.3199750
## 3 0.3205102
## 4 0.3206055
## 5 0.3208893
## 6 0.3210150
## 7 0.3210153
## 8 0.3210467
## 9 0.3212990
## 10 0.3213938
Now that we have zeroed in on the optimal parameters, we train our final xgboost model and evaluate the results.
# Final Tuned Params
hyper_parameters <- list(eta = final_parameters$eta,
                         max_depth = final_parameters$max_depth,
                         min_child_weight = final_parameters$min_child_weight,
                         subsample = final_parameters$subsample,
                         colsample_bytree = final_parameters$colsample_bytree
)

# Final Model
xgb_final <- xgboost( data = X_train, label = y_train,
                      params = hyper_parameters,
                      nrounds = final_parameters$n_trees_opt,
                      objective = "binary:logistic",
                      verbose = 0
)
# Prediction Error
test_y_pred <- predict(xgb_final, newdata = X_test)
thr <- find_auc(y_test, test_y_pred)
## [1] "Decision Boundary at 0.540467292070389"
## [1] "Confusion Matrix:"
## Reference
## Prediction 0 1
## 0 2503 1474
## 1 1172 2780
## [1] "Accuracy: 0.666288308740068"
References

- "What Is a Decision Tree? - Examples, Advantages & Role in Management." Study.com, 27 September 2015, study.com/academy/lesson/what-is-a-decision-tree-examples-advantages-role-in-management.html
- Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
- K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
- Chapter 4: Decision Trees Algorithms, medium.com, 06 October 2017, medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
- L. E. Raileanu and K. Stoffel. Theoretical comparison between the Gini index and information gain criteria. University of Neuchatel, 2000.
- Variable importance plot code adapted from https://gist.github.com/ramhiser/6dec3067f087627a7a85
- Probst, P., Wright, M. & Boulesteix, A.-L. Hyperparameters and tuning strategies for random forest. WIREs Data Mining and Knowledge Discovery 9, e1301, https://doi.org/10.1002/widm.1301 (2019)
- Strobl, C., Hothorn, T., Zeileis, A. (2009). "Party on! - A New, Conditional Variable Importance Measure for Random Forests Available in the party Package." The R Journal, 1(2), 14-17. http://journal.R-project.org/archive/2009-2/RJournal_2009-2_Strobl~et~al.pdf
- Fit Random Forest Model, http://code.env.duke.edu/projects/mget/export/HEAD/MGET/Trunk/PythonPackage/dist/TracOnlineDocumentation/Documentation/ArcGISReference/RandomForestModel.FitToArcGISTable.html
- "Gradient Boosting Machines." UC Business Analytics R Programming Guide, http://uc-r.github.io/gbm_regression
- "gbm." R Documentation, https://www.rdocumentation.org/packages/gbm/versions/2.1.5/topics/gbm
- He, Tong. "xgboost v0.90.0.2." R Documentation, https://www.rdocumentation.org/packages/xgboost/versions/0.90.0.2
- "Gradient Boosting." Wikipedia, Wikimedia Foundation, 21 Oct. 2019, https://en.wikipedia.org/wiki/Gradient_boosting
- Chen, T., et al. "XGBoost: A Scalable Tree Boosting System." ArXiv.org, 10 June 2016, https://arxiv.org/abs/1603.02754