Difference in performance between mikropml and caret #331

victor5lm · 2023-02-17T13:34:58Z

victor5lm
Feb 17, 2023

Hi! For my MSc Thesis, I have compared the performance (using the AUC over the test set) between caret (using trainControl and train functions) and mikropml (run_ml function) on the same dataset (consisting in abundances from a 16S rRNA sequencing study) by means of Random Forest (rf). Even considering that mikropml is based on caret, and that I used the same exact seed, LOOCV and an 80/20 split for both procedures, AUCs differ and, most importantly, selected hyperparameter values also differ in a more considerable way, even if the provided range of values used for grid search was also the same in both procedures. Is there a possible explanation for this?

This is the code I used for the caret-based RF model:

# Hyperparameter values for grid search

    tunegrid_rf <- expand.grid(.mtry = seq(1, 70))
    ntrees <- seq(1, 100)

    set.seed(2019)
    fitControl_rf <- trainControl(
        method = "LOOCV", # LOOCV as CV method
        savePredictions = TRUE,
        classProbs = TRUE,
        summaryFunction = twoClassSummary,
        verboseIter = TRUE,
        search = "grid",
    )

    models_caret <- list() # List in which all models will be stored. The best one will be chosen from here

    for (ntree in ntrees) {
        print(ntree)
        set.seed(2019)
        fit <- train(trainDescr, trainClass.bi, # trainDescr is the matrix of data and trainClass.bi is a dataframe indicating the group to which each sample belongs
                    method = "rf", # random forest
                    tuneGrid = tunegrid_rf,
                    trControl = fitControl,
                    ntree = ntree)
        key <- toString(ntree)
        models_caret[[key]] <- fit
    }

This is the code I used for the mikropml-based RF model:

## Random Forest -----

    ntrees <- seq(1, 100) # trees to be tested
    tuning_rf <- list(mtry = seq(1, 70))
    models_mikropml <- list() # List in which all models will be stored. The best one will be chosen from here

    for (ntree in ntrees) {
        print(ntree)
        set.seed(2019)
        results_rf <- run_ml(data, "rf", # data is a matrix containing all abundances for each sample and taxon, having a final column called "HPF_group", which can be seen as the outcome colname below.
        outcome_colname = "HPF_group",
        cross_val = caret::trainControl(method = "LOOCV"),
        training_frac = 0.80, seed = 2019,
        calculate_performance = TRUE, ntree = ntree,
        hyperparameters = tuning_rf,
        find_feature_importance = TRUE)
        key <- toString(ntree)
        models_mikropml[[key]] <- results_rf
    }

AUCs for the best (best meaning the one with the highest AUC out of all the built models using LOOCV and grid search) caret-based model and the best mikropml-based one were 0.85 and 0.9, respectively. Selected hyperparameter values were mtry=7 and ntree=4 for the caret-based one and mtry=15 and ntree=17 for the mikropml-based one. What would this difference in hyperparameter values be due to?

Answered by kelly-sovacool

Feb 20, 2023

There are a few reasons I would expect different performance values and different run times with these two code samples:

You set find_feature_importance = TRUE, which takes a considerable amount of run time. Caret's train() doesn't do permutation feature importance.
You set calculate_performance = TRUE, which calculates the model performance on the test set, although it shouldn't be too slow. Caret's train() doesn't do this step.
You let mikropml randomly split the dataset into a training and testing set, rather than specify the exact training set as you did for caret. You can give training_frac a vector of indices if you want to specify the exact training set. (Note: since you set the s…

View full answer

kelly-sovacool · 2023-02-20T16:16:25Z

kelly-sovacool
Feb 20, 2023
Maintainer

There are a few reasons I would expect different performance values and different run times with these two code samples:

You set find_feature_importance = TRUE, which takes a considerable amount of run time. Caret's train() doesn't do permutation feature importance.
You set calculate_performance = TRUE, which calculates the model performance on the test set, although it shouldn't be too slow. Caret's train() doesn't do this step.
You let mikropml randomly split the dataset into a training and testing set, rather than specify the exact training set as you did for caret. You can give training_frac a vector of indices if you want to specify the exact training set. (Note: since you set the same seed for every iteration of your for loop, I would expect you to get the same training/testing split every time. I'm not sure what value there is in repeating model training with the exact same seed.)

You specified cross_val = caret::trainControl(method = "LOOCV") in run_ml() but for caret you used fitControl_rf, which you defined as:

fitControl_rf <- trainControl(
    method = "LOOCV", # LOOCV as CV method
    savePredictions = TRUE,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    verboseIter = TRUE,
    search = "grid",
)

Given these major differences I'm not at all surprised to see different performance values, as well as a longer runtime for run_ml().

http://www.schlosslab.org/mikropml/reference/run_ml.html

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difference in performance between mikropml and caret #331

{{title}}

Replies: 1 comment

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Difference in performance between mikropml and caret #331

victor5lm Feb 17, 2023

Replies: 1 comment

kelly-sovacool Feb 20, 2023 Maintainer

victor5lm
Feb 17, 2023

kelly-sovacool
Feb 20, 2023
Maintainer