Training doesn't resume from previous checkpoint using max_train_steps #1204
Replies: 4 comments
-
Well, that is normal. You are no longer resuming the old training run, as you have changed everything. It's not really recommended to change anything within a single training run, let alone the entire dataset or the step schedule.
-
This change would be across two separate training runs. I'm following Caith's recommendation on training new subjects into a "base LoKR" that was previously trained on styles.
-
you want to use …
-
Tried this, but the trainer threw a tensor size error on the first step. Went back to using epochs, and training starts fine.
-
2024-11-21 01:52:34,920 [INFO] Reached the end (58 epochs) of our training run (42 epochs). This run will do zero steps.
If I set max_train_steps to 0 and change num_train_epochs to 100, training starts fine. I haven't counted, but the updated dataset used for the resume may be smaller than the original dataset.
My brain thinks in steps, so I'd prefer to use steps over epochs.
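For anyone hitting this later, here is a minimal sketch of arithmetic that can produce that exact log line. It assumes the trainer derives its resume epoch as checkpoint steps divided by the new dataset's steps-per-epoch; the variable names and numbers are hypothetical, not SimpleTuner's actual code:

```python
import math

# Illustrative numbers only; real values come from the checkpoint and
# the resume config. The derivation below is one plausible reading of
# the log message, not the trainer's actual source.
global_step = 14500       # optimizer steps stored in the checkpoint
batch_size = 4
old_dataset_size = 1400   # dataset the original schedule was built from
new_dataset_size = 1000   # smaller dataset swapped in for the resume

def steps_per_epoch(n_images: int, batch: int) -> int:
    """Optimizer steps needed to see the dataset once."""
    return math.ceil(n_images / batch)

# Epoch target, fixed when the schedule was derived from the old dataset:
num_train_epochs = math.ceil(
    global_step / steps_per_epoch(old_dataset_size, batch_size)
)  # -> 42

# Resume position, recomputed against the new (smaller) dataset, where
# each epoch now covers fewer steps, so the same step count maps to a
# later epoch:
first_epoch = global_step // steps_per_epoch(new_dataset_size, batch_size)  # -> 58

if first_epoch >= num_train_epochs:
    print(f"Reached the end ({first_epoch} epochs) of our training run "
          f"({num_train_epochs} epochs). This run will do zero steps.")
```

If that reading is right, it also explains why the workaround trains: setting max_train_steps to 0 and num_train_epochs to 100 raises the epoch target well above the recomputed resume epoch (58), so the run has headroom to continue.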