Training fails when bagging_freq > 1 and bagging_fraction is very small #6622
Thanks for using LightGBM, and for taking the time to open an excellent report with a reproducible example! It really helped with the investigation. Running your reproducible example with the latest development version of LightGBM, I see some logs that are helpful. Please consider including more logs in your reports in the future.
Even following that recommendation, though, I do see the same behavior you saw. I tried experimenting and found that I can reproduce this even with 1,000 samples!

```python
import pandas as pd
import numpy as np
import lightgbm as lgb

num_samples = 1_000

for bagging_frac in [0.99, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06]:
    try:
        bst = lgb.train(
            params={
                "seed": 1,
                "bagging_fraction": bagging_frac,
                "bagging_freq": 5,
                "verbose": -1
            },
            train_set=lgb.Dataset(
                data=pd.DataFrame({
                    "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
                    "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
                }),
                label=np.linspace(start=10.0, stop=80.0, num=num_samples),
            )
        )
        status = "success"
    except lgb.basic.LightGBMError:
        status = "fail"
    print(f"bagging_frac = {bagging_frac}: {status}")

# bagging_frac = 0.99: success
# bagging_frac = 0.75: success
# bagging_frac = 0.5: success
# bagging_frac = 0.4: success
# bagging_frac = 0.3: success
# bagging_frac = 0.2: success
# bagging_frac = 0.1: success
# bagging_frac = 0.01: success
# bagging_frac = 0.001: success
# bagging_frac = 0.0001: fail
# bagging_frac = 1e-05: fail
# bagging_frac = 1e-06: fail
```

Interestingly, if I remove `bagging_freq`, even the tiniest `bagging_fraction` values succeed. So it looks to me that this check could be triggered under the following mix of conditions:

* bagging is actually enabled (`bagging_fraction < 1.0` and a non-zero `bagging_freq`)
* `bagging_fraction` is so small relative to the dataset size that the sampled bag would contain 0 rows
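To make that arithmetic concrete: assuming the effective bag size works out to roughly `int(bagging_fraction * num_data)` (my assumption here, I have not traced the exact C++ code path), the cutoff in the experiment above lines up with the point where that product drops below 1.

```python
# Back-of-the-envelope check (assumption: the bag keeps roughly
# int(bagging_fraction * num_data) rows; not taken from LightGBM source).
num_data = 1_000
for bagging_fraction in [1e-03, 1e-04]:
    approx_bag_size = int(bagging_fraction * num_data)
    print(f"bagging_fraction={bagging_fraction}: approx. bag size = {approx_bag_size}")

# bagging_fraction=0.001: approx. bag size = 1   -> something left to train on
# bagging_fraction=0.0001: approx. bag size = 0  -> empty bag
```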
I tested that with an even bigger dataset... I can even trigger this failure for a dataset with 100,000 observations!!

```python
num_samples = 100_000

bst = lgb.train(
    params={
        "seed": 1,
        "bagging_fraction": 0.1 / num_samples,
        "bagging_freq": 5,
        "verbose": -1
    },
    train_set=lgb.Dataset(
        data=pd.DataFrame({
            "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
            "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
        }),
        label=np.linspace(start=10.0, stop=80.0, num=num_samples),
    )
)
# lightgbm.basic.LightGBMError: Check failed: (num_data) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/io/dataset.cpp, line 39 .
```

This definitely looks like a bug to me, and not necessarily one that would only affect small datasets.
Some other minor notes...
Very interesting application! Can you share any more about the real-world reason(s) that you are training "millions of models" with the same hyper-parameters? I have some ideas about situations where that might happen, but knowing more precisely what you're trying to accomplish would help us to recommend alternatives. For example, if this is some sort of consumer app generating predictions on user-specific data (like a fitness tracker), then training a LightGBM model is probably unnecessary for such a small amount of data (as you sort of mentioned), and you might want to do something else entirely when there are only a handful of samples.
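For instance, a hypothetical fallback along these lines (the 1-row threshold and the helper name are made up purely for illustration) skips LightGBM entirely for degenerate items:

```python
import numpy as np

def fit_constant_or_lgbm(X, y, train_lgbm_fn):
    """Hypothetical helper: for degenerate inputs, return a trivial
    'predict the label mean' model instead of calling LightGBM at all."""
    if len(y) <= 1:
        mean = float(np.mean(y))
        return lambda X_new: np.full(len(X_new), mean)
    # otherwise defer to whatever normally trains the LightGBM model
    return train_lgbm_fn(X, y)
```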
I've updated your post to use the text of the error message you observed instead of an image, so it can be found from search engines by other people hitting that error. Please see https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors for more discussion of that practice.
Thanks for the quick answer! A few more words on our application:

From your experiment we observe as well that as long as `bagging_fraction * num_samples < 1`, the training runs through.
This is very very interesting, thanks so much for the details! And thanks for choosing LightGBM for this important application, we'll do our best to support you 😊
I looked into this some more, and I realize I forgot something very important.... bagging is only enabled if you also set `bagging_freq` to a non-zero value. That's described at https://lightgbm.readthedocs.io/en/latest/Parameters.html#bagging_fraction.
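As a quick illustration of that (same toy data as above; my expectation, without having dug through the C++, is that the tiny `bagging_fraction` is simply ignored while `bagging_freq` stays at its default of 0):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

num_samples = 1_000
bst = lgb.train(
    params={
        "seed": 1,
        "bagging_fraction": 1e-06,
        # no "bagging_freq" here -> it defaults to 0 -> bagging stays disabled
        "verbose": -1,
    },
    train_set=lgb.Dataset(
        data=pd.DataFrame({
            "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
            "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
        }),
        label=np.linspace(start=10.0, stop=80.0, num=num_samples),
    ),
)
print("trained without error")
```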
By "runs through", did you mean "fails"? Or did you maybe mean to use `>=`?

I think that is what's happening here ... if you set a non-zero `bagging_freq` and a `bagging_fraction` small enough that `bagging_fraction * num_samples < 1`, training fails no matter how many samples there are.

Code that shows that:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

def _attempt_to_train(num_samples):
    bagging_fraction = 0.99 / num_samples
    param_str = f"num_samples={num_samples}, bagging_frac={bagging_fraction}"
    try:
        bst = lgb.train(
            params={
                "seed": 1,
                "bagging_fraction": bagging_fraction,
                "bagging_freq": 1,
                "verbose": -1
            },
            train_set=lgb.Dataset(
                data=pd.DataFrame({
                    "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
                    "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
                }),
                label=np.linspace(start=10.0, stop=80.0, num=num_samples),
            )
        )
        print(f"success ({param_str})")
    except lgb.basic.LightGBMError:
        print(f"failure ({param_str})")

num_sample_vals = [
    1,
    2,
    100,
    1_000,
    10_000,
    100_000
]

for n in num_sample_vals:
    _attempt_to_train(n)
```

This makes sense... you're asking LightGBM to do something impossible. I think LightGBM's behavior in this situation should be changed in the following ways:
The case where you train on a single sample is unlikely to produce a particularly useful model, and under LightGBM's default settings (for example, `min_data_in_leaf = 20`) such a model will just be a single constant prediction anyway.
Yes, this is definitely a good idea! I didn't suggest it because your post included the constraint that you wanted to use identical hyperparameters for every model. There are a few other parameters whose values you might want to change based on the number of samples, as sketched below.
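Something along these lines (the threshold and the fallback values here are purely illustrative, not tuned recommendations) would keep one shared config while guarding against the empty-bag case:

```python
BASE_PARAMS = {
    "seed": 1,
    "verbose": -1,
    "bagging_freq": 1,
    "bagging_fraction": 0.8,
}

def params_for(num_samples: int) -> dict:
    """Copy the shared params, disabling bagging for items so small that the
    requested bag would be empty (the guard mirrors the condition discussed above)."""
    params = dict(BASE_PARAMS)
    if params["bagging_fraction"] * num_samples < 1:
        # an empty bag is what trips "Check failed: (num_data) > (0)"
        params["bagging_freq"] = 0
        params["bagging_fraction"] = 1.0
    return params

# usage sketch: bst = lgb.train(params=params_for(num_samples), train_set=dtrain)
```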
You might find the discussion in #5194 relevant to this.
Hi again!
Sorry, that was a mistake, I actually meant: "If `bagging_fraction * num_samples >= 1`, the training runs through."
Currently, we do have the same standard set used for all items and would prefer not to have it parametrisable per item (for simplicity reasons), but it's not a hard limitation. I think we can accept making the bagging parameters depend on the number of samples.

Thanks a lot again for the help! :)
Ok great, thanks for the excellent report and for sharing so much information with me! We'll leave this open to track the work I suggested in #6622 (comment). Any interest in trying to contribute that? It'd require changes only on the C/C++ side of the project. No worries if not, I'll have some time in the near future to attempt it.
Sorry for the late answer, I unfortunately have no experience with C/C++, so it would be challenging for me. Gotta pass on that.
No problem! Thanks again for the great report and interesting discussion. We'll work on a fix for this.
Hello,
We've recently encountered a problematic edge case with LightGBM.
When simultaneously using bagging and training on a single data point, the model training fails.
Our expectation would have been that the model disregards any bagging mechanism in that case.
While training a model on a single data point is surely questionable from an analytical point of view, we regularly train millions of models (with the same hyper-parameter set) and cannot guarantee that the number of training samples exceeds 1 for all of them.
Is there any rationale behind this behaviour? What would you reckon is the best way to go about this one?
Reproducible example
Executing this code snippet leads to this error:
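A minimal sketch of that setup (the feature values and the exact bagging parameters here are illustrative, chosen to match the conditions described above):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# a single training sample, with bagging switched on
bst = lgb.train(
    params={
        "bagging_fraction": 0.5,
        "bagging_freq": 1,
        "verbose": -1,
    },
    train_set=lgb.Dataset(
        data=pd.DataFrame({"FEATURE_1": [1.0], "FEATURE_2": [12.0]}),
        label=np.array([1.0]),
    ),
)
# lightgbm.basic.LightGBMError: Check failed: (num_data) > (0)
#     at .../src/io/dataset.cpp, line 39 .
```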
But by setting bagging_fraction to 1, the model is correctly trained (and has a single leaf with output 1).
Environment info
python=3.10
pandas=2.2.2
lightgbm=4.5.0
Additional Comments
It seems like the error is raised when `bagging_fraction * num_samples < 1`.