Clarify what you can expect to do after bundling, i.e. `predict` #50

ClaudiuPapasteri · 2023-03-04T18:37:39Z

I am not sure if this is a known issue, as it doesn't appear in the docs. It seems that, apart from predict, other methods like tidy or rank_results fail on the unbundled object.
This SO post references the same problem.

library(tidymodels)
library(agua)
library(bundle) # provides bundle() and unbundle()
h2o_start()

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
  auto_ml() %>%
  set_engine("h2o", max_runtime_secs = 120, seed = 1) %>%
  set_mode("regression")

normalized_rec <-
  recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())

auto_wflow <-
  workflow() %>%
  add_model(auto_spec) %>%
  add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)

# Save (auto_fit was already fitted above)
auto_fit_bundle <- bundle(auto_fit)
saveRDS(auto_fit_bundle, file = "test.h2o.auto_fit.rds") # save the object

# Load
auto_fit_bundle <- readRDS("test.h2o.auto_fit.rds")
auto_fit <- unbundle(auto_fit_bundle)

rank_results(auto_fit)
tidy(auto_fit)

Error in UseMethod("rank_results") :
no applicable method for 'rank_results' applied to an object of class "c('H2ORegressionModel', 'H2OModel', 'Keyed')"

juliasilge · 2023-03-06T16:22:05Z

That's true, yep! The focus of bundle is to capture the references needed by a model to make predictions in a new environment. For more info, you can look at:

I would generally expect functions like tidy() and rank_results() to be called during model development, and not so much during model deployment. Can you share a bit more about your use case?
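
For readers who do need such summaries after deployment, one possible pattern is to compute them while the original fit is still in memory and save them next to the bundle. This is only a sketch under assumptions, not something the bundle docs prescribe: it swaps in an xgboost model (the engine used later in this thread) so that it runs without an H2O cluster, it uses xgboost::xgb.importance() as a stand-in for tidy() or rank_results(), and the list layout and file path are hypothetical.

```r
library(bundle)
library(parsnip)

# Fit a small model (xgboost stands in for the H2O AutoML fit above)
mod <-
  boost_tree(trees = 5, mtry = 3) |>
  set_mode("regression") |>
  set_engine("xgboost") |>
  fit(mpg ~ ., data = mtcars[1:25, ])

# Development-time summary, computed before bundling
importance <- xgboost::xgb.importance(model = extract_fit_engine(mod))

# Save the bundle and the summary together (hypothetical layout)
artifact <- list(model = bundle(mod), importance = importance)
path <- tempfile(fileext = ".rds")
saveRDS(artifact, path)

# Later: restore, unbundle for prediction, read the summary as plain data
restored <- readRDS(path)
restored_mod <- unbundle(restored$model)
preds <- predict(restored_mod, new_data = mtcars[26:32, ])
restored$importance
```

The summary travels as an ordinary data frame, so reading it back does not depend on what unbundle() restores.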

ClaudiuPapasteri · 2023-03-07T13:08:03Z

Thank you for the helpful reply, I suspected this was the case and the links you shared made it much clearer. Unfortunately, although the scope of the bundle package should be clear for everyone, possible affordances of the post-bundle object (except for prediction from it) are not so obvious (for me, at least). Maybe it would be helpful to state this more clearly in the documentation.
Anyway, thank you guys for the awesome package ecosystem, and thank you Julia, your work and talks inspired and helped me throughout my data journey. It's an honor ...

juliasilge · 2023-03-07T16:54:03Z

Thank you so much for the kind words! ❤️

Let's keep this issue open and clarify some of the documentation about what you can expect to do after bundling, especially in the README and main vignette.

(As a side note, I also maintain butcher and this is about the same as how butcher works. Sometimes we keep components in butcher that are needed for something like predict(interval="prediction") but not just your typical predictions.)

Steviey · 2024-03-29T17:42:58Z

Can we use the bundle package to:

1. save a tidy model
2. reload the tidy model
3. refit the tidy model on new data
4. predict on new data

... and if so, how, when taking parsnip::auto_ml() with the h2o engine into consideration?

Referring to:

https://rstudio.github.io/bundle/

https://rstudio.github.io/bundle/articles/bundle.html

https://rstudio.github.io/bundle/reference/bundle_h2o.html

juliasilge · 2024-03-29T20:16:26Z

@Steviey The normal usage that we expect after bundling is to predict with your model, but if you can get out the parsnip object, you should be able to refit:

library(bundle)
library(parsnip)
library(callr)

## bundle a model
mod <-
    boost_tree(trees = 5, mtry = 3) %>%
    set_mode("regression") %>%
    set_engine("xgboost") %>%
    fit(mpg ~ ., data = mtcars[1:25,])

bundled_mod <- bundle(mod)

## fit the model to new data
r(
  func = function(bundled_mod) {
    library(bundle)
    library(parsnip)
    
    unbundled_mod <- unbundle(bundled_mod)
    fittable_model <- extract_spec_parsnip(unbundled_mod)
    fittable_model |> fit(mpg ~ ., data = mtcars[26:32,])
  },
  args = list(
    bundled_mod = bundled_mod
  )
)
#> parsnip model object
#> 
#> ##### xgb.Booster
#> Handle is invalid! Suggest using xgb.Booster.complete
#> raw: 7.7 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 0.3, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 5, watchlist = x$watchlist, 
#>     verbose = 0, nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.3", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 10 
#> niter: 5
#> nfeatures : 10 
#> evaluation_log:
#>   iter training_rmse
#>  <num>         <num>
#>      1     16.923941
#>      2     12.953166
#>      3     10.022720
#>      4      7.801856
#>      5      6.089100

Created on 2024-03-29 with reprex v2.1.0

Steviey · 2024-03-30T00:45:01Z

@juliasilge Thank you Julia. extract_spec_parsnip() returns a parsnip model specification. Does this include hyperparameters from earlier training and fits done before bundling? Would this include submodels from the leaderboard of an h2o AutoML model?

juliasilge · 2024-03-31T22:48:28Z

@Steviey Hmmmm, I am not entirely sure as I don't have a ton of experience with H2O. I think a good venue for this kind of question is the agua repo: https://github.com/tidymodels/agua

Steviey · 2024-04-01T03:27:51Z

@juliasilge Thank you Julia for the response. The h2o issue goes deeper, into h2o itself, as mentioned for example here: business-science/modeltime.h2o#14
I would guess this is still not really resolved after some years, so my hope was that the bundle package would do the job entirely.

More generally, for tidymodels (models other than h2o):
I'm mainly interested in refitting on new data, but with previously searched hyperparameters. Let's say I train a model and search for hyperparameters on one day, bundle and save the model or workflow for later use, and then the next day unbundle and refit on new or additional data. Can we then reuse the effort and compute time from the day before, namely the best hyperparameters found before bundling? Are they included in the bundle for later use? Or do we have to save and retrieve them separately?

This could be an ecological question too (green ML/AI).

Maybe related:
tidymodels/tune#84

If bundle requires separate steps for this, I'm not sure whether this is still best practice:

exec(update, object = tree_mod, !!!final_param)

juliasilge · 2024-04-01T16:43:36Z

@Steviey The bundle package can handle bundling up the needed references but doesn't have functionality for getting the best hyperparameters; you'd need to get that through tidymodels infrastructure in either tune or agua. Once you have those hyperparameters, then definitely bundle will work. 👍
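
A rough sketch of that division of labour, assuming the best hyperparameters have already been found with tune. Here `best_params` is a hypothetical stand-in for the one-row result of something like tune::select_best(), with values borrowed from the xgboost example earlier in the thread; the "day 1" and "day 2" split is illustrative only and both steps run in one session.

```r
library(bundle)
library(parsnip)
library(tune)

# Specification with placeholders for the tuned arguments
spec <-
  boost_tree(trees = tune(), mtry = tune()) |>
  set_mode("regression") |>
  set_engine("xgboost")

# Hypothetical stand-in for the output of tune::select_best()
best_params <- tibble::tibble(trees = 5, mtry = 3)

# Day 1: plug in the hyperparameters, fit, and bundle
final_spec <- finalize_model(spec, best_params)
day1_fit <- fit(final_spec, mpg ~ ., data = mtcars[1:25, ])
bundled <- bundle(day1_fit)

# Day 2: unbundle, pull out the finalized specification, refit on other data
day2_spec <- extract_spec_parsnip(unbundle(bundled))
day2_fit <- fit(day2_spec, mpg ~ ., data = mtcars[26:32, ])
```

In this sketch the extracted specification carries the finalized arguments because they were set on the spec before fitting. The tuning results themselves (for example the full grid or a leaderboard) are not part of this and would need to be saved separately.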

Steviey · 2024-04-02T04:18:47Z

@juliasilge OK, then I would bet on finalize more than on update.

simonpcouch · 2024-07-14T22:45:45Z

Feels worth mentioning that the Value documentation for each bundle method states:

The output of unbundle() is a model object that is ready to predict() on new data, and other restored functionality (like plotting or summarizing) is supported as a side effect only.

I would argue that this is sufficient to set expectations for what users can do with unbundled objects. :)
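
As a minimal illustration of that contract, the sketch below reuses the xgboost model and the callr pattern from the refit example earlier in the thread, but only calls predict() in the fresh session. The choice of rows for training and prediction is arbitrary.

```r
library(bundle)
library(parsnip)
library(callr)

# Fit and bundle in the current session
mod <-
  boost_tree(trees = 5, mtry = 3) |>
  set_mode("regression") |>
  set_engine("xgboost") |>
  fit(mpg ~ ., data = mtcars[1:25, ])

bundled_mod <- bundle(mod)

# Unbundle and predict in a fresh R session
preds <- r(
  func = function(bundled_mod) {
    library(bundle)
    library(parsnip)

    unbundled_mod <- unbundle(bundled_mod)
    predict(unbundled_mod, new_data = mtcars[26:32, ])
  },
  args = list(
    bundled_mod = bundled_mod
  )
)
```

Anything beyond predict() on `unbundled_mod` may or may not work, depending on the model type.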

juliasilge · 2024-07-15T15:32:39Z

That's a great point @simonpcouch. 👍

We haven't heard a lot of other confusion on this point to date, so let's close this as complete. We can revisit in the future as necessary!

juliasilge changed the title ~~H2O AutoML with agua: beyond predict other methods fail~~ Clarify what you can expect to do after bundling, i.e. predict Mar 7, 2023

juliasilge added the documentation label Mar 7, 2023

juliasilge closed this as completed Jul 15, 2024

joranE mentioned this issue Jul 22, 2024

Bundling XGBoost objects removes variable names when applying xgb.importance() #66

Closed
