-
Notifications
You must be signed in to change notification settings - Fork 108
Random Forest and Random Decision Tree
Random forest is a good algorithm because
- It is based on CART (classification and regression tree), so it can solve linear classification problem as well as a large types of non-linear classification problem.
- It ensemble CART by bagging, which cause it hard to over-fitting
- It can be easily paralleled because building every tree is independent with each other.
The standard steps to build random forest is
- select a number of samples randomly with bootstrapping.
- randomly select a subset of features.
- build a CART based on features subset and selected samples
- repeat above 3 steps many times and get N CART, and prediction of one sample is average prediction of N CART.
The standard ways to build CART is
BuildCART {
Q Queue
Q <- root
while !Q.empty {
v = Q.pop_front()
left, right = v.BestSplit()
v.left = left
v.right = right
Q <- left
Q <- right
}
}
The key function of CART is how to choose best split features to split current samples into 2 groups (left, right), and CART use gini to find the best split. However, BestSplit() function is cost of time because it will evaluate every possible split for every feature in current samples, and then choose the best. This is why CART and Random Forest is cost of time in many large dataset.
In order to solve this problem, we can use Random Decision Tree (http://www.cs.columbia.edu/~wfan/software.htm). Unlike Random Forest (RF), Random Decision Tree (RDT) does not use CART to find the decision tree for current samples. In the other word, RDT does not use BestSplit() function, instead, it just find a random feature split to split current samples. Many experiments shows that RDT is good enough in many datasets.
However, RDT is too random, and this may cause it loose effective in many case, but RF is cost of time. This inspire us to think about the following idea: In the BestSplit(), we does not check every possible split for every feature in all current samples, instead, we only check that in a subset of current samples.
Thus, in our implementation of RF, we support following parameters
- max-depth : max depth of single CART within RF
- min-leaf-size : if a node have less than min-leaf-size samples, we will stop split this node
- tree-count : how many trees will be used in RF
- feature-count : how many percentage of features will be selected as features subset in RF
- rf-sample-ratio : how many percentage of samples will be check in every single CART when split node via gini