
Mistral QLoRA and config spring cleaning #670

Merged: ebsmothers merged 11 commits from mistral-qlora into main on Apr 11, 2024

Conversation

ebsmothers (Contributor) commented on Apr 9, 2024

Context

Changelog

  • Add a Mistral 7B QLoRA recipe
  • Also a lot of other stuff (sorry)
    • While I was adding the Mistral 7B QLoRA recipe, I was pretty befuddled by our config -> recipe mapping. @joecummings, @kartikayk, and I had discussed this previously: configs -> recipes should never be a one-to-many mapping. It's confusing and a maintenance nightmare. So I
      • split Mistral recipes out into single-device and multi-device
      • tried to improve the "docstrings" across our different configs to make them consistent and fix any obvious errors (I'm sure I missed/introduced a few)
      • removed the distinction between single-device and single-device low-memory for our full finetunes. Now single-device implicitly means low-memory, i.e. these configs ship with optimizer_in_bwd=True and paged AdamW out of the box (see the sketch below).
    • Also, we never use cpu_offload in our recipes, all the cpu_offload: False entries polluting our configs are confusing, and single-GPU FSDP is confusing in general, so: no more CPU offload.

Also, I have no idea why half our recipes default to epochs=1 and half default to epochs=3, but this is already enough of a hodgepodge PR, so I'll tackle that separately.
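
To make the new single-device low-memory defaults concrete, here is a minimal sketch of the relevant config fragment. It is illustrative only: the keys come from this PR's description and diffs, but the exact values (e.g. the learning rate) are assumptions rather than a copy of any shipped config.

    # Hypothetical fragment of a single-device full-finetune config with the
    # low-memory defaults described above.
    optimizer:
      _component_: bitsandbytes.optim.PagedAdamW  # paged AdamW from bitsandbytes
      lr: 2e-5
    optimizer_in_bwd: True  # run optimizer steps during the backward pass to save memory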

Test plan

Llama2 7B recipes

Full finetune single device

tune run full_finetune_single_device --config llama2/7B_full_low_memory tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|17|Loss: 0.7681398391723633:   0%| 

Full finetune distributed

tune run --nproc_per_node=2 full_finetune_distributed --config llama2/7B_full tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|11|Loss: 0.8318166136741638:   0%|

LoRA finetune single device

tune run lora_finetune_single_device --config llama2/7B_lora_single_device tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|48|Loss: 1.070238709449768:   0%|▏

LoRA finetune distributed

tune run --nproc_per_node 2 lora_finetune_distributed --config llama2/7B_lora tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|112|Loss: 0.913817286491394:   1%|█  

QLoRA finetune single device

tune run lora_finetune_single_device --config llama2/7B_qlora_single_device tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt']
...
1|43|Loss: 0.947695791721344:   0%|▎  

Mistral 7B recipes

Full finetune single device

tune run full_finetune_single_device --config mistral/7B_full_low_memory tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|11|Loss: 1.0883381366729736:   0%

LoRA finetune single device

tune run lora_finetune_single_device --config mistral/7B_lora_single_device tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|146|Loss: 1.1544195413589478:   1%|█▉

Full finetune distributed

tune run --nproc_per_node=2 full_finetune_distributed --config mistral/7B_full tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|8|Loss: 0.6514360308647156:   0%|

LoRA finetune distributed

tune run --nproc_per_node=2 lora_finetune_distributed --config mistral/7B_lora tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|5|Loss: 1.7303991317749023:   0%|

QLoRA finetune single device

tune run lora_finetune_single_device --config mistral/7B_qlora_single_device tokenizer.path=/data/users/ebs/checkpoints/mistral/tokenizer.model checkpointer=torchtune.utils.FullModelHFCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints/mistral
...
1|7|Loss: 1.9033024311065674:   0%| 

pytorch-bot commented on Apr 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/670

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 63b3b40 with merge base 8bb3aae:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Apr 9, 2024
ebsmothers changed the title from "[wip] Mistral QLoRA and config spring cleaning" to "Mistral QLoRA and config spring cleaning" on Apr 9, 2024
ebsmothers marked this pull request as ready for review on Apr 9, 2024 21:54
# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

The repo-id flag is deprecated; pass the repo id as a positional argument.
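
For reference, a sketch of what the positional form might look like, following the pattern of the tune download example later in this thread; the token and output-dir values are placeholders, not taken from this PR:

    # tune download mistralai/Mistral-7B-v0.1 \
    #   --hf-token <HF_TOKEN> \
    #   --output-dir <OUTPUT_DIR>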

# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
Contributor:

Suggested change:
- # tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
+ # tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed \

Contributor:

Btw, can we group all flag arguments together and do tune run recipe --flags instead of placing them in between? cc @joecummings

Contributor:

Can you explain more, @RdoubleA?

#
# This config assumes that you've run the following command before launching
# this run:
- # tune download --repo-id meta-llama/Llama-2-7b \
+ # tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

Remove repo id

- # tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
- # --config llama2/7B_full_single_device_low_memory \
+ # tune run full_finetune_single_device \
+ # --config llama2/7B_full_single_device \
Contributor:

Suggested change:
- # --config llama2/7B_full_single_device \
+ # --config mistral/7B_full_single_device \

ebsmothers (Author):

I knew some of those copy-pastes would come back to bite me. Thanks

resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
- epochs: 1
+ epochs: 3
Contributor:

You did epochs=1 in another config? What's the reason for the difference?

ebsmothers (Author):

I alluded to this in the PR summary, but from my perspective there is no rhyme or reason to how we are currently setting epochs in our configs. I did a quick pass; here's the current state of the world:

3 epochs:
  • Gemma 2B full
  • Mistral 7B LoRA
  • Mistral 7B full
  • Llama2 7B full single device
  • Llama2 13B full
  • Llama2 7B full

1 epoch:
  • Llama2 7B LoRA
  • Llama2 13B LoRA
  • Llama2 7B LoRA single device
  • Llama2 7B QLoRA single device
  • Llama2 7B full single device low memory

So it seems like 1 epoch is used only by the Llama2 LoRA configs, but then also, weirdly, by the low-memory single-device full finetune (though not the regular single-device full finetune, which I am scrapping anyway).

In that case, I would keep this one as-is and change the Llama2 single-device one to 3 epochs, so that the dividing line is just "Llama2 LoRA configs train for one epoch, all others train for three." Honestly I don't really understand that either and have half a mind to set everything to one epoch. Is there any reason not to do that?

optimizer:
- _component_: torch.optim.SGD
+ _component_: bitsandbytes.optim.PagedAdamW
lr: 2e-5
Contributor:

Why does Mistral use 5e-6 but Llama uses a different LR?

ebsmothers (Author):

Mistral FT hyperparams have not really been extensively tuned, cc @kartikayk who may have more context there

# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

Repo id

# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
Contributor:

Suggested change:
- # tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
+ # tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \

kartikayk (Contributor) left a comment:

This is an awesome PR! Thanks for pushing these changes!

@@ -28,7 +28,6 @@ model:
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
- quantize_base: True
Contributor:

Why does this go away?

ebsmothers (Author):

It wasn't actually needed to begin with: qlora_llama2_7b is just a partial of lora_llama2_7b with quantize_base=True.
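
In config terms, a minimal sketch of why the flag was redundant; the component path shown here is an illustrative assumption, while the LoRA values match the diff context above:

    model:
      _component_: torchtune.models.llama2.qlora_llama2_7b  # quantize_base is already baked into this builder
      lora_rank: 8
      lora_alpha: 16
      # quantize_base: True  # redundant with the qlora builder, hence removed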

@@ -19,7 +19,7 @@
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
- # For single device lora finetuning please use 7B_lora_single_device.yaml
+ # For single device LoRA finetuning please use 7B_lora_single_device.yaml
Contributor:

Sorry, GitHub doesn't let me comment on the exact line, but mind updating the tune download command here as well? The command should drop the --repo-id flag:

     tune download meta-llama/Llama-2-13b-hf \
     --hf-token <HF_TOKEN> \
     --output-dir /tmp/llama2-13b-hf

@@ -14,7 +14,7 @@
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
- # tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
+ # tune run full_finetune_single_device \
Contributor:

Same comment on tune download

optimizer:
- _component_: torch.optim.SGD
+ _component_: bitsandbytes.optim.PagedAdamW
Contributor:

Awesome!

# This config uses hyperparameters based on small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor:

This actually lets me comment :)

Same comment on tune download

return FSDP(
model,
auto_wrap_policy=wrap_policy,
device_id=device,
mixed_precision=None,
sharding_strategy=_get_sharding_strategy(strategy),
cpu_offload=CPUOffload(offload_params=True) if cpu_offload else None,
Contributor:

Just remove the param entirely.

joecummings (Contributor):

Can you look at #640 for some inspo on updating the comments in configs?

joecummings (Contributor):

> removed the distinction of single-device vs. single-device low-memory for our full finetunes. Now single-device implicitly means low-memory

I thought we were going the other way with this, i.e. we'd explicitly call out the recipes as low-memory rather than single-device.

joecummings (Contributor) left a comment:

lgtm

ebsmothers merged commit 6e9ea22 into main on Apr 11, 2024 (30 of 31 checks passed).
ebsmothers deleted the mistral-qlora branch on Apr 11, 2024 at 17:33.