
Mistral QLoRA and config spring cleaning #670

Merged: ebsmothers merged 11 commits from mistral-qlora into main on Apr 11, 2024

Conversation

ebsmothers (Contributor) commented on Apr 9, 2024

Context

Changelog

  • Add a Mistral 7B QLoRA recipe
  • Also a lot of other stuff (sorry)
    • While I was adding the Mistral 7B QLoRA recipe, I was pretty befuddled by our config -> recipe mapping. @joecummings, @kartikayk, and I had discussed this previously: configs -> recipes should never be a one-to-many mapping. It's confusing and a maintenance nightmare. So I
      • split Mistral recipes out into single-device and multi-device
      • tried to improve the "docstrings" across our different configs to make them consistent and fix any obvious errors (I'm sure I missed/introduced a few)
      • removed the distinction between single-device and single-device low-memory for our full finetunes. Now single-device implicitly means low-memory, i.e. these configs ship with optimizer_in_bwd=True and paged AdamW out of the box (see the sketch below).
    • Also, we never use cpu_offload in our recipes, all the cpu_offload: False entries polluting our configs are confusing, and single-GPU FSDP is confusing in general, so: no more CPU offload.

Also, I have no idea why half our recipes default to epochs=1 and half default to epochs=3, but this is already enough of a hodgepodge PR, so I'll tackle that separately.
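
To make the new single-device low-memory defaults concrete, here is a minimal sketch of the relevant config fragment. It is illustrative only: the keys come from this PR's description and diffs, but the exact values (e.g. the learning rate) are assumptions rather than a copy of any shipped config.

    # Hypothetical fragment of a single-device full-finetune config with the
    # low-memory defaults described above.
    optimizer:
      _component_: bitsandbytes.optim.PagedAdamW  # paged AdamW from bitsandbytes
      lr: 2e-5
    optimizer_in_bwd: True  # run optimizer steps during the backward pass to save memory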

Test plan

Llama2 7B recipes

Full finetune single device

tune run full_finetune_single_device --config llama2/7B_full_low_memory tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|17|Loss: 0.7681398391723633:   0%| 

Full finetune distributed

tune run --nproc_per_node=2 full_finetune_distributed --config llama2/7B_full tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|11|Loss: 0.8318166136741638:   0%|

LoRA finetune single device

tune run lora_finetune_single_device --config llama2/7B_lora_single_device tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|48|Loss: 1.070238709449768:   0%|▏

LoRA finetune distributed

tune run --nproc_per_node 2 lora_finetune_distributed --config llama2/7B_lora tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|112|Loss: 0.913817286491394:   1%|█  

QLoRA finetune single device

tune run lora_finetune_single_device --config llama2/7B_qlora_single_device tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|43|Loss: 0.947695791721344:   0%|▎  

Mistral 7B recipes

Full finetune single device

tune run full_finetune_single_device --config mistral/7B_full_low_memory tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|11|Loss: 1.0883381366729736:   0%

LoRA finetune single device

tune run lora_finetune_single_device --config mistral/7B_lora_single_device tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|146|Loss: 1.1544195413589478:   1%|█▉

Full finetune distributed

tune run --nproc_per_node=2 full_finetune_distributed --config mistral/7B_full tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|8|Loss: 0.6514360308647156:   0%|

LoRA finetune distributed

tune run --nproc_per_node=2 lora_finetune_distributed --config mistral/7B_lora tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|5|Loss: 1.7303991317749023:   0%|

QLoRA finetune single device

tune run lora_finetune_single_device --config mistral/7B_qlora_single_device tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|7|Loss: 1.9033024311065674:   0%| 

pytorch-bot commented on Apr 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/670

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 63b3b40 with merge base 8bb3aae:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Apr 9, 2024
ebsmothers changed the title from "[wip] Mistral QLoRA and config spring cleaning" to "Mistral QLoRA and config spring cleaning" on Apr 9, 2024
ebsmothers marked this pull request as ready for review on Apr 9, 2024 21:54
# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

The repo-id flag is deprecated; pass the repo id as a positional argument.
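
For reference, a sketch of what the positional form might look like, following the pattern of the tune download example later in this thread; the token and output-dir values are placeholders, not taken from this PR:

    # tune download mistralai/Mistral-7B-v0.1 \
    #   --hf-token <HF_TOKEN> \
    #   --output-dir <OUTPUT_DIR>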

# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
Contributor:

Suggested change:
- # tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
+ # tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed \

Contributor:

Btw, can we group all flag arguments together and do tune run recipe --flags instead of placing them in between? cc @joecummings

Contributor:

Can you explain more, @RdoubleA?

#
# This config assumes that you've run the following command before launching
# this run:
- # tune download --repo-id meta-llama/Llama-2-7b \
+ # tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

Remove repo id

- # tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
- # --config llama2/7B_full_single_device_low_memory \
+ # tune run full_finetune_single_device \
+ # --config llama2/7B_full_single_device \
Contributor:

Suggested change:
- # --config llama2/7B_full_single_device \
+ # --config mistral/7B_full_single_device \

ebsmothers (Author):

I knew some of those copy-pastes would come back to bite me. Thanks

resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
- epochs: 1
+ epochs: 3
Contributor:

You did epochs=1 in another config? What's the reason for the difference?

ebsmothers (Author):

I alluded to this in the PR summary, but from my perspective there is no rhyme or reason to how we are currently setting epochs in our configs. I did a quick pass; here's the current state of the world:

3 epochs:
  • Gemma 2B full
  • Mistral 7B LoRA
  • Mistral 7B full
  • Llama2 7B full single device
  • Llama2 13B full
  • Llama2 7B full

1 epoch:
  • Llama2 7B LoRA
  • Llama2 13B LoRA
  • Llama2 7B LoRA single device
  • Llama2 7B QLoRA single device
  • Llama2 7B full single device low memory

So it seems like 1 epoch is used only by the Llama2 LoRA configs, but then also, weirdly, by the low-memory single-device full finetune (though not the regular single-device full finetune, which I am scrapping anyway).

In that case, I would keep this one as-is and change the Llama2 single-device one to 3 epochs, so that the dividing line is just "Llama2 LoRA configs train for one epoch, all others train for three." Honestly I don't really understand that either and have half a mind to set everything to one epoch. Is there any reason not to do that?

optimizer:
- _component_: torch.optim.SGD
+ _component_: bitsandbytes.optim.PagedAdamW
lr: 2e-5
Contributor:

Why does Mistral use 5e-6 but Llama uses a different LR?

ebsmothers (Author):

Mistral FT hyperparams have not really been extensively tuned, cc @kartikayk who may have more context there

# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

Repo id

# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
Contributor:

Suggested change:
- # tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
+ # tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \

kartikayk (Contributor) left a comment:

This is an awesome PR! Thanks for pushing these changes!

@@ -28,7 +28,6 @@ model:
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
- quantize_base: True
Contributor:

Why does this go away?

ebsmothers (Author):

It wasn't actually needed to begin with: qlora_llama2_7b is just a partial of lora_llama2_7b with quantize_base=True.
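
In config terms, a minimal sketch of why the flag was redundant; the component path shown here is an illustrative assumption, while the LoRA values match the diff context above:

    model:
      _component_: torchtune.models.llama2.qlora_llama2_7b  # quantize_base is already baked into this builder
      lora_rank: 8
      lora_alpha: 16
      # quantize_base: True  # redundant with the qlora builder, hence removed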

@@ -19,7 +19,7 @@
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
- # For single device lora finetuning please use 7B_lora_single_device.yaml
+ # For single device LoRA finetuning please use 7B_lora_single_device.yaml
Contributor:

Sorry, GitHub doesn't let me comment on the exact line, but mind updating the tune download command here as well? The command should drop the --repo-id flag:

     tune download meta-llama/Llama-2-13b-hf \
     --hf-token <HF_TOKEN> \
     --output-dir /tmp/llama2-13b-hf

@@ -14,7 +14,7 @@
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
- # tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
+ # tune run full_finetune_single_device \
Contributor:

Same comment on tune download

optimizer:
- _component_: torch.optim.SGD
+ _component_: bitsandbytes.optim.PagedAdamW
Contributor:

Awesome!

# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

This actually lets me comment :)

Same comment on tune download

return FSDP(
model,
auto_wrap_policy=wrap_policy,
device_id=device,
mixed_precision=None,
sharding_strategy=_get_sharding_strategy(strategy),
cpu_offload=CPUOffload(offload_params=True) if cpu_offload else None,
Contributor:

Just remove the param entirely.

joecummings (Contributor):

Can you look at #640 for some inspo on updating the comments in configs?

joecummings (Contributor):

> removed the distinction of single-device vs. single-device low-memory for our full finetunes. Now single-device implicitly means low-memory

I thought we were going the other way with this, i.e. we'd explicitly call out the recipes as low-memory rather than single-device.

joecummings (Contributor) left a comment:

lgtm

ebsmothers merged commit 6e9ea22 into main on Apr 11, 2024 (30 of 31 checks passed).
ebsmothers deleted the mistral-qlora branch on Apr 11, 2024 at 17:33.