
[bug] Completed model does not load from checkpoint / generate produces same as base model #76

Open
Glavin001 opened this issue May 29, 2023 · 3 comments

Comments


Glavin001 commented May 29, 2023


Problem

Model has finished training and the output looks like this:

[screenshot of the training output]

  • checkpoint-#/adapter_model/ directory exists
    • adapter_model.bin is megabytes in size, not 443 bytes (a quick size check is sketched below this list).
  • completed file exists
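
For reference, a quick way to sanity-check the saved adapter size (a minimal sketch; the path is hypothetical, and the ~443-byte file is the empty-adapter symptom while a healthy LoRA adapter is megabytes):

# Minimal sketch (hypothetical path): a trained adapter should be megabytes,
# not a ~443-byte file written from an empty state dict.
import os

path = 'output/checkpoint-1875/adapter_model/adapter_model.bin'
size_bytes = os.path.getsize(path)
print(f"adapter_model.bin: {size_bytes / 1e6:.1f} MB")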

We want to load from the checkpoint here:

qlora/qlora.py

Lines 309 to 311 in f96eec1

if checkpoint_dir is not None:
    print("Loading adapters from checkpoint.")
    model = PeftModel.from_pretrained(model, join(checkpoint_dir, 'adapter_model'))

which needs checkpoint_dir to be set.

checkpoint_dir comes from get_last_checkpoint:

qlora/qlora.py

Lines 587 to 591 in f96eec1

checkpoint_dir, completed_training = get_last_checkpoint(args.output_dir)
if completed_training:
    print('Detected that training was already completed!')
model = get_accelerate_model(args, checkpoint_dir)

✅ Currently I'm seeing:

Detected that training was already completed!

which is correct.

Unfortunately, the case where is_completed = True also means checkpoint_dir = None:

qlora/qlora.py

Lines 562 to 564 in f96eec1

if isdir(checkpoint_dir):
    is_completed = exists(join(checkpoint_dir, 'completed'))
    if is_completed: return None, True # already finished

Therefore, checkpoint_dir is never actually used, and
❌ the model is reset to the default base model:

qlora/qlora.py

Line 317 in f96eec1

model = get_peft_model(model, config)

which means the generated output will reflect the base model, not what was trained.
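
To make the failure concrete, here is the control flow at generation time once training has completed (a sketch stitched together from the snippets above, not new repo code):

# Sketch of the buggy flow when the 'completed' marker exists:
checkpoint_dir, completed_training = get_last_checkpoint(args.output_dir)
# -> returns (None, True) because of the early return shown above
model = get_accelerate_model(args, checkpoint_dir)
# inside get_accelerate_model:
#   checkpoint_dir is None, so PeftModel.from_pretrained is skipped
#   and it falls through to get_peft_model(model, config),
#   i.e. a fresh, untrained adapter on top of the base model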

Workaround

It all works after I remove this line:

qlora/qlora.py

Line 564 in f96eec1

if is_completed: return None, True # already finished

-        if is_completed: return None, True # already finished

I'm not certain why it is needed though.
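
For completeness, here is what get_last_checkpoint looks like with that line removed (a sketch; I'm assuming the rest of the function matches f96eec1). The completed flag is still reported, but the newest checkpoint directory is returned as well, so get_accelerate_model can load the trained adapter:

import os
from os.path import exists, join, isdir

def get_last_checkpoint(checkpoint_dir):
    if isdir(checkpoint_dir):
        is_completed = exists(join(checkpoint_dir, 'completed'))
        # early return removed: report completion, but keep looking for checkpoints
        max_step = 0
        for filename in os.listdir(checkpoint_dir):
            if isdir(join(checkpoint_dir, filename)) and filename.startswith('checkpoint'):
                max_step = max(max_step, int(filename.replace('checkpoint-', '')))
        if max_step == 0:
            return None, is_completed  # output_dir exists, but no checkpoint saved yet
        return join(checkpoint_dir, f'checkpoint-{max_step}'), is_completed
    return None, False  # first training run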

Hope this saves someone else the hours I just wasted thinking my training/dataset/etc. was the problem, when the trained model wasn't even being used 😆

KKcorps (Contributor) commented May 29, 2023

Are you able to resume training from the checkpoint with this?

Glavin001 (Author) commented

Ah, maybe not. Are you thinking we also need #79?

I was testing with a smaller max_steps value, so training finished early and resuming from checkpoints didn't matter as much.

Now that fine-tuning and generation are working, I'll be increasing max_steps and would likely benefit from your pull request. Thanks!


Maxwell-Lyu commented May 30, 2023

🥰
Your "workaround" is a very good fix; it clearly works and should be merged ASAP.

I can confirm it works. I was trying to run prediction with a trained checkpoint, but the LoRA weights were not loaded. After removing the line mentioned above, prediction is finally normal.
