error saving checkpoint when model is compiled #1399

Closed
felipemello1 opened this issue Aug 23, 2024 · 2 comments

Comments

felipemello1 (Contributor) commented Aug 23, 2024

File "/data/users/felipemello/torchtune/recipes/lora_finetune_single_device.py", line 679, in train
    self.save_checkpoint(epoch=curr_epoch)
  File "/data/users/felipemello/torchtune/recipes/lora_finetune_single_device.py", line 539, in save_checkpoint
    self._checkpointer.save_checkpoint(
  File "/data/users/felipemello/torchtune/torchtune/utils/_checkpointing/_checkpointer.py", line 526, in save_checkpoint
    state_dict[utils.MODEL_KEY] = convert_weights.tune_to_hf(
  File "/data/users/felipemello/torchtune/torchtune/models/convert_weights.py", line 198, in tune_to_hf
    new_key = get_mapped_key(key, inverted_mapping_dict)
  File "/data/users/felipemello/torchtune/torchtune/models/convert_weights.py", line 59, in get_mapped_key
    raise Exception(
Exception: Error converting the state dict. Found unexpected key: "_orig_mod.tok_embeddings.weight". Please make sure you're loading a checkpoint with the right format. 

with this command:

tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device loss=torchtune.modules.loss.ChunkedCrossEntropyLoss loss.num_output_chunks=16 optimizer_in_bwd=False enable_activation_checkpointing=True optimizer._component_=torch.optim.AdamW compile=True  dataset.source=Yukang/LongAlpaca-12k dataset.packed=False dataset.split=train[:10%] dataset.train_on_input=True tokenizer.max_seq_len=8192 gradient_accumulation_steps=1 max_steps_per_epoch=20 epochs=1 batch_size=2

in this PR: #1390

environment:

conda create -n torchtune_debugging python=3.11
conda activate torchtune_debugging
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -e ".[dev]"
pip uninstall torchao
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device model.lora_attn_modules="[q_proj, v_proj]" model.apply_lora_to_mlp=False model.apply_lora_to_output=False model.lora_rank=8 model.lora_alpha=16 
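For context, the `_orig_mod.` prefix in the failing key comes from `torch.compile`, which wraps the model in an `OptimizedModule` and nests the original module under `_orig_mod`, so every state dict key is prefixed and `convert_weights.tune_to_hf` no longer recognizes it. A minimal sketch of one possible workaround (an assumption on my part, not necessarily how this gets fixed in torchtune) would be to strip the prefix before conversion:

```python
# Sketch of a possible workaround (assumption, not the actual torchtune fix):
# torch.compile wraps the model in an OptimizedModule, so every state_dict key
# gains an "_orig_mod." prefix that convert_weights.tune_to_hf doesn't expect.
# Stripping the prefix restores the original key names before conversion.

def strip_compile_prefix(state_dict: dict) -> dict:
    """Remove the "_orig_mod." prefix that torch.compile adds to parameter names."""
    prefix = "_orig_mod."
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

# e.g. in save_checkpoint, before calling convert_weights.tune_to_hf:
# state_dict[utils.MODEL_KEY] = strip_compile_prefix(state_dict[utils.MODEL_KEY])
```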
felipemello1 (Contributor, Author) commented:

possibly related: #596

ebsmothers (Contributor) commented Aug 30, 2024

This should be fixed by using in-place compile.
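For reference, a minimal sketch of the difference (assuming PyTorch >= 2.2, where `nn.Module.compile()` compiles the module in place instead of returning a wrapper):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)

# Wrapping compile: torch.compile returns an OptimizedModule, so state_dict
# keys are prefixed with "_orig_mod." and no longer match the expected names.
compiled = torch.compile(model)
print(list(compiled.state_dict().keys()))  # ['_orig_mod.weight', '_orig_mod.bias']

# In-place compile: the module compiles itself and its state_dict keys are
# unchanged, so checkpoint conversion (e.g. convert_weights.tune_to_hf) keeps working.
model.compile()
print(list(model.state_dict().keys()))  # ['weight', 'bias']
```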
