[python-package] fix retrain on sequence dataset #6414

eromoe · 2024-04-11T13:34:41Z

StrikerRUS · 2024-07-13T16:41:09Z

Hi @eromoe ! Sorry for the delayed response!

Could you please add a test that will fail on the current master branch and will pass with your fixes?

Python tests are stored here: https://github.com/microsoft/LightGBM/tree/master/tests/python_package_test.
Feel free to ask for any help with tests.

eromoe · 2024-07-14T06:56:27Z

@microsoft-github-policy-service agree

eromoe · 2024-07-14T06:57:50Z

@StrikerRUS I just added the test

StrikerRUS · 2024-07-14T12:14:42Z

@eromoe Thank you very much! I'm triggering CI tests to see how it goes.

tests/python_package_test/test_sequence.py

jameslamb

Thanks for this, I left some other suggestions on these changes.

jameslamb · 2024-07-14T15:47:03Z

tests/python_package_test/test_sequence.py

+        params,
+        dataset,
+        init_model=model1,
+    )


Please make this test stricter than simply "training runs without error". Could you add assertions after training checking that:

- model1 and model2 have the expected number of trees

  ref: LightGBM/tests/python_package_test/test_basic.py, lines 50 to 51 in 830763d

      assert bst.current_iteration() == 20
      assert bst.num_trees() == 20

- dataset has the expected dimensions (feature names, number of rows)

tests/python_package_test/test_sequence.py

jameslamb · 2024-07-14T15:49:33Z

tests/python_package_test/test_sequence.py

+
+
+def test_list_of_sequence():
+    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)


When you move this test, please use the cached version of this function.

LightGBM/tests/python_package_test/test_basic.py

Line 221 in 830763d

*load_breast_cancer(return_X_y=True), test_size=0.1, random_state=2

That'll make the tests a little faster.

python-package/lightgbm/basic.py

Co-authored-by: James Lamb <[email protected]>

eromoe · 2024-07-15T07:10:16Z

@jameslamb Updated.

jameslamb

Thanks, please see my next round of comments. Please let us know if you need help with the tests.

We want to give you space to write the tests yourself, but if you do not have interest then please say so and I'll push changes to the tests directly to this PR.

jameslamb · 2024-07-18T03:25:20Z

tests/python_package_test/test_basic.py

+
+    assert sum([len(s) for s in seq_ds.get_data()]) == X.shape[0]
+    assert len(seq_ds.get_feature_name()) == X.shape[1]
+    assert seq_ds.get_data() == seqs


These checks should be moved after training, to avoid this test failure: https://github.com/microsoft/LightGBM/actions/runs/9935170010/job/27451324230?pr=6414

FAILED tests/python_package_test/test_basic.py::test_retrain_list_of_sequence - Exception: Cannot get data before construct Dataset

Please run the tests yourself before pushing another commit.

    sh build-python.sh bdist_wheel install
    pytest tests/python_package_test/test_basic.py::test_retrain_list_of_sequence

eromoe · 2024-07-19

Oh, I only tested the code in a Jupyter notebook for this case.
Sorry, I don't have an environment for sh, so I cannot build the wheel to run pytest.

jameslamb · 2024-07-19

Ok, I will push testing changes for you.

jameslamb · 2024-07-18T03:26:24Z

tests/python_package_test/test_basic.py

+    assert model2.current_iteration() == 20 
+    assert model2.num_trees() == 20


Suggested change

-    assert model2.current_iteration() == 20
-    assert model2.num_trees() == 20
+    assert model2.current_iteration() == 40
+    assert model2.num_trees() == 40

These don't look correct. Performing training once with "num_boost_round": 20, then continued training again with "num_boost_round": 20, should result in a model with 40 boosting rounds.

jameslamb · 2024-07-18T03:28:35Z

tests/python_package_test/test_basic.py

+    X, y = load_breast_cancer(return_X_y=True)
+    seqs = _create_sequence_from_ndarray(X, 2, 100)
+
+    seq_ds = lgb.Dataset(seqs, label=y, free_raw_data=False)


Why was free_raw_data=False necessary here? If it wasn't, please remove it.

eromoe · 2024-07-19

If free_raw_data=True, model2 cannot get the data; as I remember, it would raise an exception.

jameslamb · 2024-07-18T03:31:33Z

tests/python_package_test/test_basic.py

+    model1 = lgb.train(
+        params,
+        seq_ds,
+        keep_training_booster=True,


Suggested change

-        keep_training_booster=True,

Using keep_training_booster=True here works if the initial model (to be passed to init_model) is going to be passed on to training later in memory in the same process, but is that the situation that led to #6413?

I expect it will be more common to instead want to continue training with a model loaded from a file + a Sequence object in memory.

Could you please modify this test to not use keep_training_booster=True, or explain why it's necessary?

eromoe · 2024-07-19

Because I have a rolling time-series training project.
The dataset is too large to load into memory. To use memory efficiently, I read 12 months of data and build a model, make predictions for one month into the future, then update the data source (scrolling forward one month) and retrain the model.
Retraining uses only the most recent 12 months of data, which means the old model, with its old weights, is only updated by recent data.

    model = None
    for idx, (train_idx, test_idx) in enumerate(
        scroll_train_test(dates_partition, train_size=TRAIN_LOAD_STEP, test_size=TEST_LOAD_STEP, align_idx=train_end_idx)
    ):
        train_partitions = dates_partition[train_idx]
        test_partitions = dates_partition[test_idx]
        train_df = read_partitioned_df(train_partitions, pre_train_partitions, train_df)
        test_df = read_partitioned_df(test_partitions, pre_train_partitions, test_df)
        ....
        model = lgb.train(
            params,
            train_data,
            init_model=model,
            num_boost_round=num_boost_round,
            keep_training_booster=True,
        )

Since it is in a loop, there is no need to dump the model to a file; I just reuse it.
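The rolling-window scheme described in this comment can be sketched with the standard library alone. This is only an illustration: `rolling_windows` is a hypothetical helper written for this example, not the `scroll_train_test` function used above (whose implementation is not shown), and the training call is replaced by a comment.

```python
from typing import Iterator, List, Tuple


def rolling_windows(
    n_partitions: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) index lists, sliding forward by test_size."""
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_partitions:
        split = start + train_size
        yield list(range(start, split)), list(range(split, split + test_size))
        start += test_size


# 15 monthly partitions: train on 12 months, predict the following month.
months = [f"2023-{m:02d}" for m in range(1, 13)] + ["2024-01", "2024-02", "2024-03"]

model = None  # stands in for the Booster carried from one window to the next
for train_idx, test_idx in rolling_windows(len(months), train_size=12, test_size=1):
    train_months = [months[i] for i in train_idx]
    test_months = [months[i] for i in test_idx]
    # The real loop would load only these partitions here and call
    # lgb.train(..., init_model=model, keep_training_booster=True).
    print(f"train {train_months[0]}..{train_months[-1]} -> predict {test_months[0]}")
```

With 15 partitions, a 12-month training window, and a 1-month test window, the generator yields three windows, and only one window's worth of data needs to be in memory at a time.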

jameslamb · 2024-07-19

Thank you for explaining that. Very interesting use of Sequence!

But the fact that you want to use this functionality in one specific way (with the model held in memory the entire time) does not mean that that's the only pattern that should be tested.

It's very common to use LightGBM's training continuation functionality starting from a model file... for example, to update an existing model once a month based on newly-arrived data. It's important that all LightGBM training-continuation codepaths support that pattern.

Anyway, like I mentioned in #6414 (comment), I can push testing changes here. Once you see the diff of the changes I push, I'd be happy to answer any questions you have.

fix retrain on sequence dataset

2b7811b

eromoe requested review from guolinke, jameslamb, shiyu1994, jmoralez and borchero as code owners April 11, 2024 13:34

StrikerRUS added awaiting review fix labels Jul 13, 2024

StrikerRUS added awaiting response and removed awaiting review labels Jul 13, 2024

add testing for incremental training on Dataset with lgb.Sequence

a07800c

github-actions bot removed the awaiting response label Jul 14, 2024

StrikerRUS reviewed Jul 14, 2024


jameslamb requested changes Jul 14, 2024


jameslamb changed the title ~~fix retrain on sequence dataset~~ [python-package] fix retrain on sequence dataset Jul 14, 2024

eromoe and others added 4 commits July 15, 2024 14:38

Update python-package/lightgbm/basic.py

3ac186c

Co-authored-by: James Lamb <[email protected]>

Update tests/python_package_test/test_sequence.py

ecd5746

Co-authored-by: James Lamb <[email protected]>

move seqence test to test_basic

b3bcf37

add test for seq_ds

48f062c

jameslamb requested changes Jul 18, 2024
