[RFC] remove 'categorical_feature' and 'feature_name' parameters in cv() and train() #6435

jameslamb · 2024-04-30T02:57:33Z

Proposal

I'm requesting comment on the following proposal:

remove keyword argument categorical_feature from cv() and train() in the R and Python packages
remove keyword argument feature_name from cv() and train() in the Python package
remove keyword argument colnames from cv() and train() in the R package

And doing all of these only after the packages issuing deprecation warnings for 2-3 releases.

Summary

Both the R and Python packages expose functions cv() (for cross-validation) and train() (for regular entire-dataset training). These functions require a LightGBM Dataset object.

The Dataset object holds attributes categorical_features and feature_names, and allows setting those via constructor keyword arguments and set_{attr}() methods.

Despite that, these cv() and train() functions also take categorical_features and feature_names as keyword arguments.

Python cv()

LightGBM/python-package/lightgbm/engine.py

Lines 569 to 570 in 92a8741

    
           feature_name: _LGBM_FeatureNameConfiguration = "auto", 
        
           categorical_feature: _LGBM_CategoricalFeatureConfiguration = "auto",

Python train()

LightGBM/python-package/lightgbm/engine.py

Lines 62 to 63 in 92a8741

    
           feature_name: _LGBM_FeatureNameConfiguration = "auto", 
        
           categorical_feature: _LGBM_CategoricalFeatureConfiguration = "auto",

R-package cv()

LightGBM/R-package/R/lgb.cv.R

Lines 90 to 91 in 92a8741

    
           , colnames = NULL 
        
           , categorical_feature = NULL

R-package train()

LightGBM/R-package/R/lgb.train.R

Lines 57 to 58 in 92a8741

    
           colnames = NULL, 
        
           categorical_feature = NULL,

These keyword arguments aren't providing any value, in my opinion. Their values are just forwarded along to calls like this:

LightGBM/python-package/lightgbm/engine.py

Lines 738 to 740 in 92a8741

    
           train_set._update_params(params)._set_predictor(predictor).set_feature_name(feature_name).set_categorical_feature( 
        
               categorical_feature 
        
           )

Which at best is redundant with the Dataset class, and at worst could lead to runtime exceptions (if the Dataset has already been constructed).

Motivation

Would simplify the library's interface without any loss of functionality.

If this proposal is accepted, the Dataset class would be the only place that this information is provided to train() and cv().

References

Inspired by this post I noticed on Stack Overflow: https://stackoverflow.com/questions/78383840/in-lightgbm-why-do-the-train-and-the-cv-apis-accept-categorical-feature-argument/78405996#78405996

xgboost does not expose such arguments in train() (code link) or cv() (code link).

These arguments have been part of the API since September 2017: ef77806#diff-9bd633ead0bdfe9540c42a618efd9e559cca16c522ad844a09fcf4ffc7d6e84c.

The text was updated successfully, but these errors were encountered:

borchero · 2024-05-01T18:50:03Z

I was not aware that there parameters were even available, I'm very much in favor of the removal across language APIs.

mayer79 · 2024-05-04T18:33:00Z

Very good idea. I have in mind that it is also in the fit() method of the Scikit-learn API. Do we need it there because that is the first time LightGBM sees the data?

jameslamb · 2024-05-04T19:21:43Z

Thanks for pointing that out @mayer79

Yes, the scikit-learn interface needs these inputs because it takes in the raw data (e.g. numpy, scipy, pandas), not a lightgbm.Dataset. And that it by design, to match the types of inputs estimators in the scikit-learn library accept.

jameslamb · 2024-05-06T04:07:45Z

While working through this tonight to add deprecation warnings, I found a few more things in lgb.cv() in the R package that I think should be deprecated.

passing anything other than a Dataset to data

LightGBM/R-package/R/lgb.cv.R

Lines 103 to 109 in 6e78e69

    
           # If 'data' is not an lgb.Dataset, try to construct one using 'label' 
        
           if (!.is_Dataset(x = data)) { 
        
             if (is.null(label)) { 
        
               stop("'label' must be provided for lgb.cv if 'data' is not an 'lgb.Dataset'") 
        
             } 
        
             data <- lgb.Dataset(data = data, label = label) 
        
           }

passing label and weight

LightGBM/R-package/R/lgb.cv.R

Lines 79 to 80 in 6e78e69

    
           , label = NULL 
        
           , weight = NULL

{xgboost} does not support passing such things in cv() in its R or Python packages.

LightGBM's python package also does not support those things.

I think restricting cv() to only take in a Dataset will help simplify that interface.

cc @mayer79

mayer79 · 2024-05-06T19:57:28Z

I like. Maybe worth noting that the one and only @david-cortes very recently removed non-dataset input in xgb.cv().

jameslamb · 2024-05-06T20:39:21Z

Oh nice, thanks @mayer79 ! I hadn't seen dmlc/xgboost#10031.

StrikerRUS · 2024-07-24T20:45:58Z

I'm +1 for all the changes discussed in this issue.

jameslamb · 2024-08-02T05:22:59Z

Thanks! I'll put up a PR removing these deprecated arguments. The deprecation warnings have now been part of 2 releases.

…m.train(...)` function calls. (#454) Using `categorical_feature` parameter in `lightgbm.Dataset()` instead of `lightgbm.train(...)` eliminates the following warnings: ``` test/gbdt/test_gbdt.py: 60 warnings /usr/local/lib/python3.10/dist-packages/lightgbm/engine.py:187: LGBMDeprecationWarning: Argument 'categorical_feature' to train() is deprecated and will be removed in a future release. Set 'categorical_feature' when calling lightgbm.Dataset() instead. See microsoft/LightGBM#6435. _emit_dataset_kwarg_warning("train", "categorical_feature") ``` --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

jameslamb added question awaiting review labels Apr 30, 2024

jameslamb mentioned this issue May 1, 2024

release v4.4.0 #6439

Merged

33 tasks

This was referenced May 6, 2024

[R-package] [python-package] deprecate Dataset arguments to cv() and train() #6446

Merged

[python-package] make LGBMDeprecationWarning inherit from FutureWarning #6447

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] remove 'categorical_feature' and 'feature_name' parameters in cv() and train() #6435

[RFC] remove 'categorical_feature' and 'feature_name' parameters in cv() and train() #6435

jameslamb commented Apr 30, 2024 •

edited

Loading

borchero commented May 1, 2024

mayer79 commented May 4, 2024

jameslamb commented May 4, 2024

jameslamb commented May 6, 2024

mayer79 commented May 6, 2024

jameslamb commented May 6, 2024

StrikerRUS commented Jul 24, 2024

jameslamb commented Aug 2, 2024

[RFC] remove 'categorical_feature' and 'feature_name' parameters in cv() and train() #6435

[RFC] remove 'categorical_feature' and 'feature_name' parameters in cv() and train() #6435

Comments

jameslamb commented Apr 30, 2024 • edited Loading

Proposal

Summary

Motivation

References

borchero commented May 1, 2024

mayer79 commented May 4, 2024

jameslamb commented May 4, 2024

jameslamb commented May 6, 2024

mayer79 commented May 6, 2024

jameslamb commented May 6, 2024

StrikerRUS commented Jul 24, 2024

jameslamb commented Aug 2, 2024

jameslamb commented Apr 30, 2024 •

edited

Loading