Training doesn't resume from previous checkpoint using max_train_steps #1204
Replies: 4 comments
-
Well, that is normal. You are no longer resuming the old training run, as you have changed everything. It's not really recommended to change anything within a single training run, let alone the entire dataset or the step schedule.
-
This change would be across two separate training runs. I'm following Caith's recommendation on training new subjects into a "base LoKR" that was previously trained on styles.
-
you want to use …
-
Tried this, but the trainer threw a tensor size error on the first step. Went back to using epochs, and training starts fine.
-
2024-11-21 01:52:34,920 [INFO] Reached the end (58 epochs) of our training run (42 epochs). This run will do zero steps.
If I set max_train_steps to 0 and change num_train_epochs to 100, training starts fine. I haven't counted, but the updated dataset used for the resume may be smaller than the original dataset.
My brain thinks in steps, so I'd prefer to use steps over epochs.
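For anyone hitting this later, here is a minimal sketch of arithmetic that can produce that exact log line. It assumes the trainer derives its resume epoch as checkpoint steps divided by the new dataset's steps-per-epoch; the variable names and numbers are hypothetical, not SimpleTuner's actual code:

```python
import math

# Illustrative numbers only; real values come from the checkpoint and
# the resume config. The derivation below is one plausible reading of
# the log message, not the trainer's actual source.
global_step = 14500       # optimizer steps stored in the checkpoint
batch_size = 4
old_dataset_size = 1400   # dataset the original schedule was built from
new_dataset_size = 1000   # smaller dataset swapped in for the resume

def steps_per_epoch(n_images: int, batch: int) -> int:
    """Optimizer steps needed to see the dataset once."""
    return math.ceil(n_images / batch)

# Epoch target, fixed when the schedule was derived from the old dataset:
num_train_epochs = math.ceil(
    global_step / steps_per_epoch(old_dataset_size, batch_size)
)  # -> 42

# Resume position, recomputed against the new (smaller) dataset, where
# each epoch now covers fewer steps, so the same step count maps to a
# later epoch:
first_epoch = global_step // steps_per_epoch(new_dataset_size, batch_size)  # -> 58

if first_epoch >= num_train_epochs:
    print(f"Reached the end ({first_epoch} epochs) of our training run "
          f"({num_train_epochs} epochs). This run will do zero steps.")
```

If that reading is right, it also explains why the workaround trains: setting max_train_steps to 0 and num_train_epochs to 100 raises the epoch target well above the recomputed resume epoch (58), so the run has headroom to continue.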