FLAML with mixed numerical and categorical features #1226
Sukantabasu asked this question in Q&A (Unanswered)
Replies: 1 comment
-
automl does some simple preprocessing before invoking the trained estimator. That could be the reason. Could you try applying automl.feature_transformer to the data before using bestMod for prediction?
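A minimal sketch of that suggestion, assuming feature_transformer exposes a scikit-learn-style transform() method (worth verifying on your FLAML version) and that df_X is the raw feature frame from the question below:

bestMod = automl.best_model_for_estimator('lgbm')

# Apply the same preprocessing that automl.predict() performs internally
# before the data reaches the underlying LightGBM model.
X_trans = automl.feature_transformer.transform(df_X)
Y2 = bestMod.predict(X_trans)

# Note: for classification tasks FLAML also encodes the labels internally,
# so the raw estimator's output may still be label-encoded.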
-
Hi Everyone,
I have been using FLAML for a while. Thus far, all my datasets have included only numerical features. Recently, I started working with a new dataset that has both numerical and categorical features. I am using lightgbm and noticed an interesting behavior. Let us assume I have a dataframe df_X that has a few categorical columns; df_X.dtypes are: float64, float64, float64, float64, category, category, float64, etc.
After training, if I do the following, I get good results.
Y1 = automl.predict(df_X)
However, if I do the following, I get quite erroneous results.
bestMod = automl.best_model_for_estimator('lgbm')
Y2 = bestMod.predict(df_X)
Typically, with numerical data, I use multiple estimators: 'lgbm', 'xgboost', 'rf', etc. Since I am using mixed features (without one-hot encoding), I was testing my code with lgbm only. My understanding is that lgbm handles categorical data very efficiently with integer encoding.
For my trial run, I was expecting Y1 and Y2 to be identical. Why is there a difference? I did not see any such difference for purely numerical dataframes.
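A minimal, hypothetical sketch of the setup described above (column names, toy data, task, estimator list, and time budget are placeholders rather than the actual dataset):

import pandas as pd
from flaml import AutoML

# Hypothetical frame mixing float64 and pandas 'category' columns,
# mirroring the dtypes listed above.
df_X = pd.DataFrame({
    "x1": [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9],
    "x2": [2.0, 1.0, 0.0, 3.0, 2.0, 1.0, 0.0, 3.0],
    "c1": pd.Categorical(["a", "b", "a", "b", "a", "b", "a", "b"]),
})
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

automl = AutoML()
automl.fit(df_X, y, task="regression", estimator_list=["lgbm"], time_budget=10)

Y1 = automl.predict(df_X)                        # goes through FLAML's internal preprocessing
bestMod = automl.best_model_for_estimator("lgbm")
Y2 = bestMod.predict(df_X)                       # raw estimator, no preprocessing applied
# Y1 and Y2 can differ because the raw estimator expects the transformed features.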
Best regards,
Sukanta