Mistral QLoRA and config spring cleaning #670

Merged: 11 commits merged Apr 11, 2024
Changes from 6 commits
3 changes: 0 additions & 3 deletions recipes/configs/gemma/2B_full.yaml
@@ -69,9 +69,6 @@ gradient_accumulation_steps: 1
# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

3 changes: 0 additions & 3 deletions recipes/configs/llama2/13B_full.yaml
@@ -68,9 +68,6 @@ gradient_accumulation_steps: 1
# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

2 changes: 1 addition & 1 deletion recipes/configs/llama2/13B_lora.yaml
@@ -19,7 +19,7 @@
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device lora finetuning please use 7B_lora_single_device.yaml
# For single device LoRA finetuning please use 7B_lora_single_device.yaml
Contributor: Sorry, GitHub doesn't let me comment on the exact line, but mind updating the tune download command here as well? The command should remove repo-id:

     tune download meta-llama/Llama-2-13b-hf \
     --hf-token <HF_TOKEN> \
     --output-dir /tmp/llama2-13b-hf

# or 7B_qlora_single_device.yaml and update the model and checkpoints to
# the 13B model.

3 changes: 0 additions & 3 deletions recipes/configs/llama2/7B_full.yaml
@@ -63,9 +63,6 @@ gradient_accumulation_steps: 1
# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

8 changes: 4 additions & 4 deletions recipes/configs/llama2/7B_full_single_device.yaml
@@ -14,7 +14,7 @@
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
# tune run full_finetune_single_device \
Contributor: Same comment on tune download.

# --config llama2/7B_full_single_device \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
@@ -48,15 +48,15 @@ resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 3
epochs: 1
optimizer:
_component_: torch.optim.SGD
_component_: bitsandbytes.optim.PagedAdamW
Contributor: Awesome!

lr: 2e-5
Contributor: Why does mistral use 5e-6 but llama uses a different LR?

Contributor Author: Mistral FT hyperparams have not really been extensively tuned, cc @kartikayk who may have more context there.

optimizer_in_bwd: True
loss:
_component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1
optimizer_in_bwd: False


# Training environment
2 changes: 1 addition & 1 deletion recipes/configs/llama2/7B_lora.yaml
@@ -19,7 +19,7 @@
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device lora finetuning please use 7B_lora_single_device.yaml
# For single device LoRA finetuning please use 7B_lora_single_device.yaml
# or 7B_qlora_single_device.yaml


1 change: 0 additions & 1 deletion recipes/configs/llama2/7B_qlora_single_device.yaml
@@ -28,7 +28,6 @@ model:
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
quantize_base: True
Contributor: Why does this go away?

Contributor Author: It wasn't actually needed to begin with. qlora_llama2_7b is just a partial of lora_llama2_7b with quantize_base=True.
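For illustration, a minimal sketch of what such a partial builder might look like, assuming lora_llama2_7b exposes a quantize_base flag (the exact import path is an assumption):

    from functools import partial
    from torchtune.models.llama2 import lora_llama2_7b

    # Sketch: the QLoRA builder as a thin wrapper around the LoRA builder,
    # differing only in that the base model weights are quantized.
    qlora_llama2_7b = partial(lora_llama2_7b, quantize_base=True)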


checkpointer:
_component_: torchtune.utils.FullModelMetaCheckpointer
23 changes: 20 additions & 3 deletions recipes/configs/mistral/7B_full.yaml
@@ -1,9 +1,29 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Mistral 7B model
#
# This config uses hyperparameters based on a small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor: The repo-id flag is deprecated; pass it as a positional arg.

Contributor: This actually lets me comment :) Same comment on tune download.
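Presumably the updated command would mirror the corrected 13B command earlier in this thread, with the repo passed positionally (paths as in this config):

    tune download mistralai/Mistral-7B-v0.1 \
    --hf-token <HF_TOKEN> \
    --output-dir /tmp/Mistral-7B-v0.1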

# --hf-token <HF_TOKEN> \
# --output-dir /tmp/Mistral-7B-v0.1
#
# Run this config on 4 GPUs using the following:
# tune run --nproc_per_node 4 full_finetune_distributed --config mistral/7B_full
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
Contributor: Suggested change:
    - # tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
    + # tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed \

Contributor: Btw can we group all flag arguments together and do tune run recipe --flags instead of in between? cc @joecummings

Contributor: Can you explain more @RdoubleA?
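Presumably the proposal is to group the launcher flags after the recipe name rather than between run and the recipe, e.g. (illustrative only; whether the CLI accepts this ordering is an assumption):

    # current
    tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config mistral/7B_full
    # proposed
    tune run full_finetune_distributed --nnodes 1 --nproc_per_node 4 --config mistral/7B_full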

# --config mistral/7B_full \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# Single device full finetuning requires more memory optimizations. It's
# best to use 7B_full_single_device.yaml for those cases

# Tokenizer
tokenizer:
@@ -48,9 +68,6 @@ gradient_accumulation_steps: 1
# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

@@ -1,30 +1,33 @@
# Config for single device full finetuning in full_finetune_single_device.py
# using a Llama2 7B model
# using a Mistral 7B model
#
# This config uses hyperparameters based on a small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id meta-llama/Llama-2-7b \
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor: Remove repo id.

# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2
# --output-dir /tmp/Mistral-7B-v0.1
#
# To launch on a single device, run the following command from root:
# tune run full_finetune_single_device \
# --config llama2/7B_full_single_device_low_memory \
# --config mistral/7B_full_single_device \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 1 full_finetune_single_device \
# --config llama2/7B_full_single_device_low_memory \
# tune run full_finetune_single_device \
# --config llama2/7B_full_single_device \
Contributor: Suggested change:
    - # --config llama2/7B_full_single_device \
    + # --config mistral/7B_full_single_device \

Contributor Author: I knew some of those copy-pastes would come back to bite me. Thanks.

# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only for training on single device.


# Tokenizer
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/llama2/tokenizer.model
_component_: torchtune.models.mistral.mistral_tokenizer
path: /tmp/Mistral-7B-v0.1/tokenizer.model

# Dataset
dataset:
@@ -35,31 +38,33 @@ shuffle: True

# Model Arguments
model:
_component_: torchtune.models.llama2.llama2_7b
_component_: torchtune.models.mistral.mistral_7b

checkpointer:
_component_: torchtune.utils.FullModelMetaCheckpointer
checkpoint_dir: /tmp/llama2
checkpoint_files: [consolidated.00.pth]
_component_: torchtune.utils.FullModelHFCheckpointer
checkpoint_dir: /tmp/Mistral-7B-v0.1
checkpoint_files: [
pytorch_model-00001-of-00002.bin,
pytorch_model-00002-of-00002.bin
]
recipe_checkpoint: null
output_dir: /tmp/llama2
model_type: LLAMA2
output_dir: /tmp/Mistral-7B-v0.1/
model_type: MISTRAL
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 1
epochs: 3
Contributor: You did epochs=1 in another config? What's the reason for the difference?

Contributor Author: I alluded to this in the PR summary, but from my perspective there is no rhyme or reason as to how we are setting epochs in our configs currently. Did a quick pass; here's the current state of the world:

3 epochs
Gemma 2B full
Mistral 7B lora
Mistral 7B full
Llama2 7B full single device
Llama2 13B full
Llama2 7B full

1 epoch
Llama2 7B LoRA
Llama2 13B LoRA
Llama2 7B LoRA single device
Llama2 7B QLoRA single device
Llama2 7B full single device low memory

So it seems like 1 epoch is used only for the Llama2 LoRA configs, but then also, weirdly, the low-memory single-device full finetune (but not the regular single-device full finetune, which I am scrapping anyway).

In that case, I would keep this one as-is and change the Llama2 single-device one to 3 epochs so that the dividing line is just "Llama2 LoRA configs train for one epoch, all others train for 3 epochs". Honestly I don't really understand that either and have half a mind to set everything to one epoch. Is there any reason not to do that?

optimizer:
_component_: bitsandbytes.optim.PagedAdamW
lr: 2e-5
optimizer_in_bwd: True
lr: 5e-6
loss:
_component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1
optimizer_in_bwd: True


# Training environment
# Training env
device: cuda

# Memory management
@@ -72,5 +77,5 @@ dtype: bf16
metric_logger:
_component_: torchtune.utils.metric_logging.DiskLogger
log_dir: ${output_dir}
output_dir: /tmp/alpaca-llama2-finetune
output_dir: /tmp/Mistral-7B-v0.1/
log_every_n_steps: null
20 changes: 20 additions & 0 deletions recipes/configs/mistral/7B_lora.yaml
@@ -1,9 +1,29 @@
# Config for multi-device LoRA finetuning in lora_finetune_distributed.py
# using a Mistral 7B model
#
# This config uses hyperparameters based on a small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
Contributor: Repo id.

# --hf-token <HF_TOKEN> \
# --output-dir /tmp/Mistral-7B-v0.1
#
# Run this config on 4 GPUs using the following:
# tune run --nproc_per_node 4 lora_finetune_distributed --config mistral/7B_lora
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
Contributor: Suggested change:
    - # tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
    + # tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \

# --config mistral/7B_lora \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device LoRA finetuning please use 7B_lora_single_device.yaml
# or 7B_qlora_single_device.yaml for those cases


# Tokenizer
98 changes: 98 additions & 0 deletions recipes/configs/mistral/7B_lora_single_device.yaml
@@ -0,0 +1,98 @@
# Config for single device LoRA finetuning in lora_finetune_single_device.py
# using a Mistral 7B model
#
# This config uses hyperparameters based on a small set of experiments and information
# available on various forums. These are not meant to replicate the numbers
# from the paper
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id mistralai/Mistral-7B-v0.1 \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/Mistral-7B-v0.1
#
# To launch on a single device, run the following command from root:
# tune run lora_finetune_single_device \
# --config mistral/7B_lora_single_device \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run lora_finetune_single_device \
# --config mistral/7B_lora_single_device \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only for training on single device.

# Tokenizer
tokenizer:
_component_: torchtune.models.mistral.mistral_tokenizer
path: /tmp/Mistral-7B-v0.1/tokenizer.model

# Dataset
dataset:
_component_: torchtune.datasets.alpaca_dataset
train_on_input: True
seed: null
shuffle: True

# Model Arguments
model:
_component_: torchtune.models.mistral.lora_mistral_7b
lora_attn_modules: ['q_proj', 'k_proj', 'v_proj']
apply_lora_to_mlp: True
apply_lora_to_output: True
lora_rank: 64
lora_alpha: 16

checkpointer:
_component_: torchtune.utils.FullModelHFCheckpointer
checkpoint_dir: /tmp/Mistral-7B-v0.1
checkpoint_files: [
pytorch_model-00001-of-00002.bin,
pytorch_model-00002-of-00002.bin
]
recipe_checkpoint: null
output_dir: /tmp/Mistral-7B-v0.1
model_type: MISTRAL
resume_from_checkpoint: False

optimizer:
_component_: torch.optim.AdamW
lr: 2e-5

lr_scheduler:
_component_: torchtune.modules.get_cosine_schedule_with_warmup
num_warmup_steps: 100

loss:
_component_: torch.nn.CrossEntropyLoss

# Fine-tuning arguments
batch_size: 4
epochs: 3
max_steps_per_epoch: null
gradient_accumulation_steps: 1

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
_component_: torchtune.utils.metric_logging.DiskLogger
log_dir: ${output_dir}
output_dir: /tmp/Mistral-7B-v0.1
log_every_n_steps: null

# Showcase the usage of the PyTorch profiler
# Set enabled to False as it's only needed for debugging training
profiler:
_component_: torchtune.utils.profiler
enabled: False
output_dir: /tmp/alpaca-llama2-finetune/torchtune_perf_tracing.json