Our goal is to try to predict whether someone has colorectal cancer based on the abundance of different bacteria in their stool microbiome.
Features: 16S rRNA gene abundances clustered into OTUs (representing bacterial abundances)
Label: Health outcome (whether the patient has colorectal cancer or not)
Classification algorithm: Random forest
- Data from: https://github.com/SchlossLab/Sze_CRCMetaAnalysis_mBio_2018
- Modified from: https://github.com/BTopcuoglu/MachineLearning/blob/master/code/learning/model_pipeline.R
- Further reading on this project: https://www.biorxiv.org/content/10.1101/816090v1
Credit: Thank you Zena Lapp for your live-coding scripts.
First thing we do is download the dataset. There are 2 ways of doing this:
1. Clone this repository in your terminal on Mac, or in Git Bash on Windows if you have it:
   `git clone https://github.com/um-dang/machine-learning-pipelines-r.git`
2. Create a folder in your Documents directory called `machine-learning-pipelines-r`. Then within that folder, create another folder called `data`. Download the data here, then move the `data.tsv` file you downloaded into the `data` folder.
3. Open RStudio. Go to the `File` tab, click on `New Project`, create a project in an `Existing Directory`, navigate to the `machine-learning-pipelines-r` directory, and start the new project. Now you can open a new R script by clicking on the green plus in RStudio.
4. First we will load packages. If you haven't installed the packages before, please run these commands in your RStudio console:
install.packages('randomForest')
install.packages('caret')
install.packages('tidyverse')
If you have already installed these, all you have to type now is:
library(randomForest)
5. We are now ready to read in our data.
data = read.delim('data/data.tsv') # path relative to the project directory; use '../data/data.tsv' if your script is saved in a subfolder
6. Explore the data:
> data[1:5,1:5]
cancer Otu00001 Otu00002 Otu00003 Otu00004
1 0 0.09804447 0.06288435 0.076198630 0.0033046927
2 0 0.05759443 0.03689744 0.074058219 0.0148711170
3 1 0.10072328 0.21027574 0.319777397 0.0003304693
4 0 0.08893651 0.13013291 0.488869863 0.0000000000
5 1 0.45459416 0.00178536 0.008561644 0.0006609385
7. Learn about the data:
- The rows are different samples, each from a different person
- The first column is whether the person has cancer or not
- 0 means no cancer, 1 means cancer
- This is the label
- The rest of the columns are abundance of different OTUs
- OTU stands for operational taxonomic unit, which is kind of like a species
- Scaled to be between 0 and 1
- These are the features we will use to classify/predict the label
How many samples do we have? How many features?
table(data$cancer)
0 1
172 120
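The table above gives the class counts. To answer the sample and feature counts directly, a quick check with base R (optional, not part of the original script):

```r
dim(data)       # rows = samples, columns = the cancer label plus the OTU features
nrow(data)      # number of samples
ncol(data) - 1  # number of OTU features (subtracting the label column)
```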
8. Do we have any missing data?
sum(is.na(data))
[1] 0
Since we don't have any missing data, we don't have to remove any of the samples.
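If the data did contain missing values, one common option (just a sketch, not needed here since nothing is missing) would be to drop the affected samples:

```r
# keep only rows with no missing values (shown for completeness, not needed for this dataset)
data <- data[complete.cases(data), ]
```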
9. Split data into train and test set:
The next step is to split the data into a training set (80% of the data) and a test set (20% of the data). We will make a random forest model using the training set and then test the model using the test set.
Why are we doing this? Because to have a reliable model, we need to follow the ML pipeline seen in Figure 1.
We need to have held-out test data that will not be used for training the model. To create a training dataset and a held-out test dataset, we need to make changes to our dataset.
- First we need to change our label column to a factor. Random forest needs the label to be a factor if we want to do classification modeling. We are classifying `having cancer` or `not having cancer`.
# change the label to a factor (categorical variable) instead of a number
data$cancer = as.factor(data$cancer)
- Randomly order the samples:
random_ordered <- data[sample(nrow(data)),]
- Determine the number of training samples
number_training_samples <- ceiling(nrow(random_ordered) * 0.8)
- Create training set:
train <- random_ordered[1:number_training_samples,]
- Create testing set
test <- random_ordered[(number_training_samples + 1):nrow(random_ordered),]
Now we have one data split where 80% of the data will be used to train and 20% of the data will be held out to test on once the model is trained and ready to go.
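A couple of optional sanity checks (not in the original script): confirm both classes show up in each split, and note that calling `set.seed()` before the `sample()` call above makes the split reproducible.

```r
# optional checks on the split
table(train$cancer)  # class counts in the training set
table(test$cancer)   # class counts in the held-out test set
# tip: call set.seed(<any integer>) before sample() above for a reproducible split
```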
- Train the model on training data.
# Default mtry for classification is floor(sqrt(number of features)); for regression it is max(floor(p/3), 1)
# Default ntree = 500
rf_train = randomForest(cancer ~ .,
                        data = train,
                        ntree = 500,
                        mtry = 500)
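Before touching the held-out test set, you can print the fitted object; for classification, randomForest reports an out-of-bag (OOB) error estimate and a confusion matrix computed on the training data (your exact numbers will vary with the random split):

```r
rf_train  # shows the call, ntree, mtry, the OOB error estimate, and the OOB confusion matrix
```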
- How does our model do? We need to test on the held-out 20%. We use the `predict` function, which is part of the `randomForest` package.
test_pred = predict(rf_train, test)
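If you want to peek at the raw predictions first (optional), they come back as a factor with one predicted class per test sample:

```r
head(test_pred)   # first few predicted classes
table(test_pred)  # how many test samples were predicted into each class
```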
How good is this prediction? To find out, let's write a function to calculate the accuracy of our model.
# function to compare predicted to actual class
check_pred_class = function(dat, pred){
# correct answers
actual = data.frame(id = rownames(dat), cancer = dat$cancer)
# predicted answers
predicted = data.frame(id = rownames(dat), cancer = pred)
# compare predicted to actual
comparison = merge(actual, predicted, by = 'id', all = F)
# fraction correct
sum(comparison$cancer.x == comparison$cancer.y)/nrow(comparison)
}
Now we have the actual `test` data and the predicted `test_pred`. Let's use our function to calculate the accuracy of our model.
check_pred_class(test, test_pred)
[1] 0.7758621
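For reference, the same accuracy can be computed with a one-liner, since `predict()` returns the predictions in the same order as the rows of `test`:

```r
# fraction of test samples predicted correctly
mean(test_pred == test$cancer)
```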
What if we want to do better with our pipeline and also test many different models, not just random forest?
Let's look at the `caret` package, which is a helpful wrapper that makes our life easier!
The `caret` package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for all the steps of machine learning.
- Load the caret package:
library(caret)
- We first need to change the numeric outcomes to a string. Caret doesn't like having numeric outcome variables.
train$cancer <- ifelse(train$cancer == 1, "cancer", "normal")
test$cancer <- ifelse(test$cancer == 1, "cancer", "normal")
- The syntax for training `caret` models is a little different from what we used before. Because we can use many different models here, they created a generic `train` function. We define what the training `data` is, then the `method` as random forest. We also define which `metric` we want to use to evaluate the model. You can look at what options you have with caret here: http://topepo.github.io/caret/index.html. We also choose to do a better job with our pipeline by adding a cross-validation step to our training step.
- Let's create a cross-validation scheme. This is an internal data split used to build a better model: we test different `mtry` parameters and decide which one is better.
- How many folds will there be in our cross-validation step? If we pick 5, then 80% of the training data will be used to train and 20% will be used to test the different `mtry` options. This is repeated until each fold has been tested. (This is an internal data split, applied after the first outer data split that created the held-out test set.)
cv <- trainControl(method="cv", number=5)
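`trainControl` supports other resampling schemes too; for example, repeated cross-validation is a common alternative (just a sketch, we stick with plain 5-fold CV here):

```r
# an alternative we are not using: 5-fold CV repeated 3 times
# cv_repeated <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
```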
- What `mtry` options are we trying in cross-validation?
grid <- expand.grid(mtry = c(500, 1000))
- Let's train the model:
trained_model <- train(cancer ~ ., data=train, method = "rf", metric = "Accuracy", tuneGrid = grid, trControl = cv, ntree=500, returnResamp="final") # not tuning ntree
- Our model is trained and we can see how each `mtry` did.
trained_model
Random Forest
234 samples
5206 predictors
2 classes: 'cancer', 'normal'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 187, 187, 187, 188, 187
Resampling results across tuning parameters:
mtry Accuracy Kappa
500 0.7393154 0.4175905
1000 0.7435708 0.4279107
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 1000.
So, `mtry=1000` was better than `mtry=500`, and the cross-validation step helped us recognize that. The `caret` package automatically trained on the full training data with `mtry=1000` after determining that it was the best option.
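The `train` object also stores the cross-validation results, so you can inspect or plot them directly (these are standard caret accessors):

```r
trained_model$results   # accuracy and kappa for each mtry tried
trained_model$bestTune  # the mtry value caret selected
plot(trained_model)     # accuracy as a function of mtry
```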
- Now that we have the trained model and it has picked the best `mtry` to use, let's predict on the test set.
rf_pred <- predict(trained_model, test)
- Let's see how the model did. We can use the `confusionMatrix` function in the `caret` package.
confusionMatrix(rf_pred, as.factor(test$cancer))
Confusion Matrix and Statistics
Reference
Prediction cancer normal
cancer 10 1
normal 13 34
Accuracy : 0.7586
95% CI : (0.6283, 0.8613)
No Information Rate : 0.6034
P-Value [Acc > NIR] : 0.009653
Kappa : 0.4461
Mcnemar's Test P-Value : 0.003283
Sensitivity : 0.4348
Specificity : 0.9714
Pos Pred Value : 0.9091
Neg Pred Value : 0.7234
Prevalence : 0.3966
Detection Rate : 0.1724
Detection Prevalence : 0.1897
Balanced Accuracy : 0.7031
'Positive' Class : cancer
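The output of `confusionMatrix` can also be saved and queried programmatically, which is handy if you want to pull out individual statistics:

```r
# store the confusion matrix object and extract individual statistics
cm <- confusionMatrix(rf_pred, as.factor(test$cancer))
cm$overall["Accuracy"]     # overall accuracy
cm$byClass["Sensitivity"]  # sensitivity for the 'cancer' (positive) class
```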