Prerequisite

#44 (adapter_model.bin files), which fixes "Adapter model is just 400 bytes when using finetune.py" (#38) and "lora weights are not saved correctly" (#41).

Problem
Model has finished training and the output looks like this:
- checkpoint-#/adapter_model/ directory exists
- adapter_model.bin is MB in size, not 443 bytes
- completed file exists

We want to load from the checkpoint here:
qlora/qlora.py
Lines 309 to 311 in f96eec1
which needs checkpoint_dir to be set.

checkpoint_dir comes from get_last_checkpoint:

qlora/qlora.py
Lines 587 to 591 in f96eec1
✅ Currently I'm seeing:
which is correct.
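For orientation, here is a minimal sketch of that call site, assuming qlora.py is importable and that get_last_checkpoint returns a (checkpoint_dir, completed) pair as described above; the log message is my own wording, not the repo's:

```python
# Sketch of how the entry point derives checkpoint_dir (approximate, not verbatim qlora.py).
from qlora import get_last_checkpoint  # assumes qlora.py is on the import path

def resolve_checkpoint_dir(output_dir: str):
    # Returns the checkpoint to load/resume from, or None.
    checkpoint_dir, completed_training = get_last_checkpoint(output_dir)
    if completed_training:
        print('Training already completed for this output_dir.')  # wording assumed
    return checkpoint_dir
```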
Unfortunately, the case when is_completed = True also means:

❌ checkpoint_dir = None:

qlora/qlora.py
Lines 562 to 564 in f96eec1
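Those lines sit inside get_last_checkpoint. To make the failure mode concrete, here is a self-contained sketch of what that function appears to do, reconstructed from the behaviour described in this issue; the directory-scanning details are assumptions, not verbatim code:

```python
import os
from os.path import exists, isdir, join

def get_last_checkpoint_sketch(checkpoint_dir):
    """Approximation of qlora.py's get_last_checkpoint (details assumed)."""
    if isdir(checkpoint_dir):
        is_completed = exists(join(checkpoint_dir, 'completed'))
        if is_completed:
            # The early return at issue: it discards the checkpoint path
            # even though checkpoint-#/adapter_model/ exists on disk.
            return None, True  # already finished
        # Otherwise pick the highest-numbered checkpoint-<step> directory.
        max_step = 0
        for name in os.listdir(checkpoint_dir):
            if name.startswith('checkpoint-') and isdir(join(checkpoint_dir, name)):
                max_step = max(max_step, int(name.split('-')[-1]))
        if max_step == 0:
            return None, is_completed  # training started but no checkpoint yet
        return join(checkpoint_dir, f'checkpoint-{max_step}'), is_completed
    return None, False  # first run, nothing to resume
```

With a completed marker on disk, this returns (None, True), so the caller never receives the checkpoint path even though the trained adapter exists.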
Therefore, checkpoint_dir is not actually used and ❌ the model is reset to the default base model:

qlora/qlora.py
Line 317 in f96eec1

which means the generated output will reflect the base model, not what was trained.
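The downstream consequence (Lines 309 to 317) is roughly the following branch; this is a sketch assuming the standard peft API, and the LoraConfig hyperparameters are placeholders rather than the repo's actual settings:

```python
from os.path import join
from peft import LoraConfig, PeftModel, get_peft_model

def attach_lora_sketch(base_model, checkpoint_dir):
    """Approximation of the checkpoint_dir branch in model construction."""
    if checkpoint_dir is not None:
        # Resume path: load the trained adapter weights from the checkpoint.
        return PeftModel.from_pretrained(base_model, join(checkpoint_dir, 'adapter_model'))
    # checkpoint_dir is None: attach fresh, untrained LoRA modules, so generation
    # reflects only the base model -- exactly the symptom described above.
    config = LoraConfig(
        r=64, lora_alpha=16, lora_dropout=0.0,
        bias='none', task_type='CAUSAL_LM',
    )
    return get_peft_model(base_model, config)
```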
Workaround
It all works after I remove this line:
qlora/qlora.py
Line 564 in f96eec1
- if is_completed: return None, True # already finished
I'm not certain why it is needed though.
Hope this helps save someone else the hours I just wasted thinking my training/dataset/etc was a problem when it was really not even using the trained model 😆
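If you want to double-check that the trained adapter is really what changes the output, without patching qlora.py, you can attach it explicitly with peft at inference time; the model name and checkpoint path below are hypothetical placeholders:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = 'huggyllama/llama-7b'                     # placeholder base model
adapter_dir = 'output/checkpoint-1000/adapter_model'  # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16, device_map='auto')
model = PeftModel.from_pretrained(base, adapter_dir)  # applies the trained LoRA weights

prompt = 'Explain QLoRA in one sentence.'
inputs = tokenizer(prompt, return_tensors='pt').to(base.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```

Comparing this against generation from the base model alone makes it obvious whether the adapter weights are actually being applied.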
Comments

Your "workaround" is a very good fix; it clearly works and should be merged ASAP.

I can confirm it's working. I was trying to run predict with a trained checkpoint, but the LoRA weights were not loaded. After removing the line mentioned above, the predictions are finally normal.