
[bug] Completed model does not load from checkpoint / generate produces same as base model #76

Open
Glavin001 opened this issue May 29, 2023 · 3 comments

Comments


Glavin001 commented May 29, 2023


Problem

Model has finished training and the output looks like this:

[screenshot of the training output]

  • checkpoint-#/adapter_model/ directory exists
    • adapter_model.bin is megabytes in size, not 443 bytes (a quick size check is sketched below this list).
  • completed file exists
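
For reference, a quick way to sanity-check the saved adapter size (a minimal sketch; the path is hypothetical, and the ~443-byte file is the empty-adapter symptom while a healthy LoRA adapter is megabytes):

# Minimal sketch (hypothetical path): a trained adapter should be megabytes,
# not a ~443-byte file written from an empty state dict.
import os

path = 'output/checkpoint-1875/adapter_model/adapter_model.bin'
size_bytes = os.path.getsize(path)
print(f"adapter_model.bin: {size_bytes / 1e6:.1f} MB")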

We want to load from the checkpoint here:

qlora/qlora.py

Lines 309 to 311 in f96eec1

if checkpoint_dir is not None:
    print("Loading adapters from checkpoint.")
    model = PeftModel.from_pretrained(model, join(checkpoint_dir, 'adapter_model'))

which needs checkpoint_dir to be set.

checkpoint_dir comes from get_last_checkpoint:

qlora/qlora.py

Lines 587 to 591 in f96eec1

checkpoint_dir, completed_training = get_last_checkpoint(args.output_dir)
if completed_training:
    print('Detected that training was already completed!')
model = get_accelerate_model(args, checkpoint_dir)

✅ Currently I'm seeing:

Detected that training was already completed!

which is correct.

Unfortunately, the case where is_completed = True also means checkpoint_dir = None:

qlora/qlora.py

Lines 562 to 564 in f96eec1

if isdir(checkpoint_dir):
    is_completed = exists(join(checkpoint_dir, 'completed'))
    if is_completed: return None, True # already finished

Therefore, checkpoint_dir is never actually used, and
❌ the model is reset to the default base model:

qlora/qlora.py

Line 317 in f96eec1

model = get_peft_model(model, config)

which means the generated output will reflect the base model, not what was trained.
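
To make the failure concrete, here is the control flow at generation time once training has completed (a sketch stitched together from the snippets above, not new repo code):

# Sketch of the buggy flow when the 'completed' marker exists:
checkpoint_dir, completed_training = get_last_checkpoint(args.output_dir)
# -> returns (None, True) because of the early return shown above
model = get_accelerate_model(args, checkpoint_dir)
# inside get_accelerate_model:
#   checkpoint_dir is None, so PeftModel.from_pretrained is skipped
#   and it falls through to get_peft_model(model, config),
#   i.e. a fresh, untrained adapter on top of the base model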

Workaround

It all works after I remove this line:

qlora/qlora.py

Line 564 in f96eec1

if is_completed: return None, True # already finished

-        if is_completed: return None, True # already finished

I'm not certain why it is needed though.
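
For completeness, here is what get_last_checkpoint looks like with that line removed (a sketch; I'm assuming the rest of the function matches f96eec1). The completed flag is still reported, but the newest checkpoint directory is returned as well, so get_accelerate_model can load the trained adapter:

import os
from os.path import exists, join, isdir

def get_last_checkpoint(checkpoint_dir):
    if isdir(checkpoint_dir):
        is_completed = exists(join(checkpoint_dir, 'completed'))
        # early return removed: report completion, but keep looking for checkpoints
        max_step = 0
        for filename in os.listdir(checkpoint_dir):
            if isdir(join(checkpoint_dir, filename)) and filename.startswith('checkpoint'):
                max_step = max(max_step, int(filename.replace('checkpoint-', '')))
        if max_step == 0:
            return None, is_completed  # output_dir exists, but no checkpoint saved yet
        return join(checkpoint_dir, f'checkpoint-{max_step}'), is_completed
    return None, False  # first training run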

Hope this saves someone else the hours I just wasted thinking my training/dataset/etc. was the problem, when the trained model wasn't even being used 😆

KKcorps (Contributor) commented May 29, 2023

Are you able to resume training from the checkpoint with this?

Glavin001 (Author) commented

Ah, maybe not. Are you thinking we also need #79?

I was testing with a smaller max_steps value, so training finished early and resuming from checkpoints didn't matter as much.

Now that fine-tuning and generation are working, I'll be increasing max_steps and would likely benefit from your pull request. Thanks!


Maxwell-Lyu commented May 30, 2023

🥰
Your "workaround" is a very good fix; it clearly works and should be merged ASAP.

I can confirm it works. I was trying to run prediction with a trained checkpoint, but the LoRA weights were not loaded. After removing the line mentioned above, prediction is finally normal.
