Mistral QLoRA and config spring cleaning #670
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/670
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 63b3b40 with merge base 8bb3aae: the following job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
recipes/configs/mistral/7B_full.yaml
Outdated
# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
The --repo-id flag is deprecated; pass the repo ID as a positional argument instead.
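For the Mistral configs that would look something like this (mirroring the Llama2 example later in this thread; the output dir is just an example path):

tune download mistralai/Mistral-7B-v0.1 \
  --hf-token <HF_TOKEN> \
  --output-dir /tmp/Mistral-7B-v0.1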
recipes/configs/mistral/7B_full.yaml
Outdated
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
Btw can we group all flag arguments together and do tune run recipe --flags instead of in between? cc @joecummings
Can you explain more @RdoubleA ?
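For reference, a sketch of the two orderings being discussed (recipe and flag values here are illustrative, not a claim about what the CLI currently accepts):

# current: launcher flags sit between `tune run` and the recipe name
tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config mistral/7B_full
# proposed: recipe name first, with all flags grouped after it
tune run full_finetune_distributed --nnodes 1 --nproc_per_node 4 --config mistral/7B_full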
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id meta-llama/Llama-2-7b \
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Remove repo id
# tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
# --config llama2/7B_full_single_device_low_memory \
# tune run full_finetune_single_device \
# --config llama2/7B_full_single_device \
# --config llama2/7B_full_single_device \
# --config mistral/7B_full_single_device \
I knew some of those copy-pastes would come back to bite me. Thanks
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 1
epochs: 3
You did epochs=1 in another config? What's the reason for the difference?
I alluded to this in the PR summary, but from my perspective there is no rhyme or reason as to how we are setting epochs in our configs currently. Did a quick pass, here's the current state of the world:
3 epochs:
- Gemma 2B full
- Mistral 7B LoRA
- Mistral 7B full
- Llama2 7B full single device
- Llama2 13B full
- Llama2 7B full

1 epoch:
- Llama2 7B LoRA
- Llama2 13B LoRA
- Llama2 7B LoRA single device
- Llama2 7B QLoRA single device
- Llama2 7B full single device low memory
So it seems like 1 epoch is only the Llama2 LoRA configs, but then also, weirdly, the low-memory single-device full finetune (though not the regular single-device full finetune, which I am scrapping anyways).
In that case, I would keep this one as-is and change the Llama2 single-device one to 3 epochs, so that the dividing line is just "Llama2 LoRA configs train for one epoch, all others train for 3 epochs". Honestly I don't really understand that distinction either and have half a mind to set everything to one epoch. Is there any reason not to do that?
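Whichever default we land on, the epoch count can always be overridden at launch time with a config override (the config name below is just an example):

tune run full_finetune_single_device \
  --config mistral/7B_full_single_device \
  epochs=1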
optimizer:
  _component_: torch.optim.SGD
  _component_: bitsandbytes.optim.PagedAdamW
  lr: 2e-5
Why does mistral use 5e-6 but llama uses a different LR?
The Mistral fine-tuning hyperparameters haven't really been extensively tuned; cc @kartikayk, who may have more context there.
recipes/configs/mistral/7B_lora.yaml
Outdated
# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Same --repo-id comment here.
recipes/configs/mistral/7B_lora.yaml
Outdated
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
# tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
This is an awesome PR! Thanks for pushing these changes!
@@ -28,7 +28,6 @@ model:
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16
  quantize_base: True
Why does this go away?
It wasn't actually needed to begin with: qlora_llama2_7b is just a partial of lora_llama2_7b with quantize_base=True.
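Roughly, the builder relationship looks like this (a sketch, not the exact definition in torchtune's llama2 model builders):

from functools import partial

from torchtune.models.llama2 import lora_llama2_7b

# The QLoRA builder is the LoRA builder with the base weights quantized,
# so repeating quantize_base in the QLoRA config is redundant.
qlora_llama2_7b = partial(lora_llama2_7b, quantize_base=True)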
@@ -19,7 +19,7 @@
#   checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device lora finetuning please use 7B_lora_single_device.yaml
# For single device LoRA finetuning please use 7B_lora_single_device.yaml
Sorry, GitHub doesn't let me comment on the exact line, but mind updating the tune download command here as well? The command should drop --repo-id:
tune download meta-llama/Llama-2-13b-hf \
--hf-token <HF_TOKEN> \
--output-dir /tmp/llama2-13b-hf
@@ -14,7 +14,7 @@
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
# tune run full_finetune_single_device \
Same comment on tune download
optimizer:
  _component_: torch.optim.SGD
  _component_: bitsandbytes.optim.PagedAdamW
Awesome!
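One small note, assuming bitsandbytes isn't already pulled in as a dependency: configs that use bitsandbytes.optim.PagedAdamW need the package installed, e.g.:

pip install bitsandbytes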
recipes/configs/mistral/7B_full.yaml
Outdated
# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
This actually lets me comment :)
Same comment on tune download
return FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=device,
    mixed_precision=None,
    sharding_strategy=_get_sharding_strategy(strategy),
    cpu_offload=CPUOffload(offload_params=True) if cpu_offload else None,
Just remove the cpu_offload param entirely.
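For concreteness, a sketch of the call with the param dropped, reusing the names from the snippet above (so an excerpt, not a drop-in patch):

return FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=device,
    mixed_precision=None,
    sharding_strategy=_get_sharding_strategy(strategy),
    # cpu_offload dropped entirely: the recipes never enable it, and the
    # CPUOffload conditional only added noise.
)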
Can you look at #640 for some inspiration on updating the comments in configs?
I thought we were going the other way with this, e.g. we explicitly call out the recipes as low memory, rather than single device.
lgtm
Context

Changelog
(optimizer_in_bwd=True and paged AdamW out of the box).
cpu_offload in our recipes and all the cpu_offload: False polluting our configs is confusing, and in general single-GPU FSDP is confusing, so... no more CPU offload.
Also I have no idea why half our recipes default to epochs=1 and half default to epochs=3, but this is already enough of a hodgepodge PR, so will tackle that separately.

Test plan
Llama2 7B recipes
Full finetune single device
Full finetune distributed
LoRA finetune single device
LoRA finetune distributed
QLoRA finetune single device
Mistral 7B recipes
Full finetune single device
LoRA finetune single device
Full finetune distributed
LoRA finetune distributed
QLoRA finetune single device