error saving checkpoint when model is compiled #1399

Closed
felipemello1 opened this issue Aug 23, 2024 · 2 comments

Comments

felipemello1 (Contributor) commented Aug 23, 2024

File "/data/users/felipemello/torchtune/recipes/lora_finetune_single_device.py", line 679, in train
    self.save_checkpoint(epoch=curr_epoch)
  File "/data/users/felipemello/torchtune/recipes/lora_finetune_single_device.py", line 539, in save_checkpoint
    self._checkpointer.save_checkpoint(
  File "/data/users/felipemello/torchtune/torchtune/utils/_checkpointing/_checkpointer.py", line 526, in save_checkpoint
    state_dict[utils.MODEL_KEY] = convert_weights.tune_to_hf(
  File "/data/users/felipemello/torchtune/torchtune/models/convert_weights.py", line 198, in tune_to_hf
    new_key = get_mapped_key(key, inverted_mapping_dict)
  File "/data/users/felipemello/torchtune/torchtune/models/convert_weights.py", line 59, in get_mapped_key
    raise Exception(
Exception: Error converting the state dict. Found unexpected key: "_orig_mod.tok_embeddings.weight". Please make sure you're loading a checkpoint with the right format. 

with this command:

tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device loss=torchtune.modules.loss.ChunkedCrossEntropyLoss loss.num_output_chunks=16 optimizer_in_bwd=False enable_activation_checkpointing=True optimizer._component_=torch.optim.AdamW compile=True  dataset.source=Yukang/LongAlpaca-12k dataset.packed=False dataset.split=train[:10%] dataset.train_on_input=True tokenizer.max_seq_len=8192 gradient_accumulation_steps=1 max_steps_per_epoch=20 epochs=1 batch_size=2

in this PR: #1390

environment:

conda create -n torchtune_debugging python=3.11
conda activate torchtune_debugging
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -e ".[dev]"
pip uninstall torchao
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device model.lora_attn_modules="[q_proj, v_proj]" model.apply_lora_to_mlp=False model.apply_lora_to_output=False model.lora_rank=8 model.lora_alpha=16 
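For context, the `_orig_mod.` prefix in the failing key comes from `torch.compile`, which wraps the model in an `OptimizedModule` and nests the original module under `_orig_mod`, so every state dict key is prefixed and `convert_weights.tune_to_hf` no longer recognizes it. A minimal sketch of one possible workaround (an assumption on my part, not necessarily how this gets fixed in torchtune) would be to strip the prefix before conversion:

```python
# Sketch of a possible workaround (assumption, not the actual torchtune fix):
# torch.compile wraps the model in an OptimizedModule, so every state_dict key
# gains an "_orig_mod." prefix that convert_weights.tune_to_hf doesn't expect.
# Stripping the prefix restores the original key names before conversion.

def strip_compile_prefix(state_dict: dict) -> dict:
    """Remove the "_orig_mod." prefix that torch.compile adds to parameter names."""
    prefix = "_orig_mod."
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

# e.g. in save_checkpoint, before calling convert_weights.tune_to_hf:
# state_dict[utils.MODEL_KEY] = strip_compile_prefix(state_dict[utils.MODEL_KEY])
```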
felipemello1 (Contributor, Author) commented:

possibly related: #596

ebsmothers (Contributor) commented Aug 30, 2024

This should be fixed by using in-place compile.
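For reference, a minimal sketch of the difference (assuming PyTorch >= 2.2, where `nn.Module.compile()` compiles the module in place instead of returning a wrapper):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)

# Wrapping compile: torch.compile returns an OptimizedModule, so state_dict
# keys are prefixed with "_orig_mod." and no longer match the expected names.
compiled = torch.compile(model)
print(list(compiled.state_dict().keys()))  # ['_orig_mod.weight', '_orig_mod.bias']

# In-place compile: the module compiles itself and its state_dict keys are
# unchanged, so checkpoint conversion (e.g. convert_weights.tune_to_hf) keeps working.
model.compile()
print(list(model.state_dict().keys()))  # ['weight', 'bias']
```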
