
feat: move to accelerate launch for distributed training #92

Merged
merged 13 commits on Mar 18, 2024
39 changes: 29 additions & 10 deletions README.md
@@ -60,9 +60,13 @@

```bash
# if you want to use one GPU on multi-gpu machine
export CUDA_VISIBLE_DEVICES=0

# MODEL_PATH=meta-llama/Llama-2-7b-hf # Huggingface model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the dataset
# OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
# ... (remaining training arguments not shown)
```

### Multiple GPUs with FSDP

The recommendation is to use [huggingface accelerate](https://huggingface.co/docs/accelerate/en/index) to launch multi-gpu jobs, in particular when using FSDP:
- `accelerate` is written on top of [`torch.distributed.run`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py).
- the `accelerate launch` CLI is highly similar to `torchrun`; it spawns multiple jobs (one for each GPU).
- `accelerate` is tightly integrated with the [huggingface Trainer](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).

The `accelerate launch` CLI is run with specific command line arguments; see the example below. Default arguments can be supplied by passing a
`--config_file` argument; see the [reference docs](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) and [fixtures/accelerate_fsdp_defaults.yaml](./fixtures/accelerate_fsdp_defaults.yaml) for sample defaults.

```bash
# Please set the environment variables:
# MASTER_PORT=1234 # The port on which the rank-0 process listens; set this to an unused port
# MODEL_PATH=meta-llama/Llama-2-7b-hf # Huggingface model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the training dataset
# OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

accelerate launch \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
--num_processes=8 \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--torch_dtype bfloat16 \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_config tuning/config/fsdp_config.json \
--include_tokens_per_second \
--packing False \
--response_template "\n### Response:" \
--dataset_text_field "output"
```


For `GPTBigCode` models, Hugging Face has enabled Flash v2, and proper sharding of the model can be obtained by setting `fsdp_transformer_layer_cls_to_wrap` to `'GPTBigCodeBlock'` in the accelerate FSDP config (see [fixtures/accelerate_fsdp_defaults.yaml](./fixtures/accelerate_fsdp_defaults.yaml)).
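As a minimal sketch, the relevant portion of the accelerate config would look like the following; the key name and value are taken from the commented example in [fixtures/accelerate_fsdp_defaults.yaml](./fixtures/accelerate_fsdp_defaults.yaml):

```yaml
# Sketch of the FSDP section of an accelerate config file for GPTBigCode models.
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  # wrap each GPTBigCode transformer block as its own FSDP unit
  fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock"
```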

### LoRA Tuning Example

22 changes: 11 additions & 11 deletions examples/prompt_tuning_twitter_complaints/README.md
@@ -30,20 +30,22 @@ dataset.to_json("twitter_complaints.json")
### Prompt Tuning
We will switch our PEFT method from LoRA to Prompt Tuning (`pt`).
```bash
# Please set the environment variables:
# MASTER_PORT=1234 # The port on which the rank-0 process listens; set this to an unused port
# MODEL_PATH=meta-llama/Llama-2-7b-hf # Huggingface model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the training dataset
# OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

accelerate launch \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--peft_method pt \
--torch_dtype bfloat16 \
--tokenizer_name_or_path $MODEL_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_config tuning/config/fsdp_config.json \
--include_tokens_per_second \
--packing False \
--response_template "\n### Label:" \
60 changes: 60 additions & 0 deletions fixtures/accelerate_fsdp_defaults.yaml
@@ -0,0 +1,60 @@
# options that can be used with accelerate config are neatly documented here -
# https://github.com/huggingface/accelerate/blob/ee163b66fb7848892519e804688cb4ae981aacbe/docs/source/package_reference/cli.md

# type of compute environment, no need to change
compute_environment: LOCAL_MACHINE # other option: AMAZON_SAGEMAKER

# use FSDP distributed compute
distributed_type: FSDP

# FSDP specific configurations
fsdp_config:

# use this for training transformers
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP

# this controls the FSDP pipelining
fsdp_backward_prefetch_policy: BACKWARD_PRE # set to BACKWARD_PRE for the most time-efficient pipeline
# but requires the most memory. BACKWARD_POST is the less
# memory intensive option

# setting this to true will increase forward memory by prefetching the next FSDP all-gather, while performing
# the current forward pass.
fsdp_forward_prefetch: false

# setting this will offload model and optimizer parameters to the CPU, to save GPU memory at a significant
# increase of CPU time.
fsdp_offload_params: false

fsdp_sharding_strategy: 1 # set to FULL_SHARD (1), SHARD_GRAD_OP (2),
# 3 is NO_SHARD, effectively disabling FSDP
# 4, 5 are HYBRID_ modes for multi-node training only.

fsdp_state_dict_type: FULL_STATE_DICT # set to FULL_STATE_DICT (1), SHARDED_STATE_DICT (3)
# 2 is LOCAL_STATE_DICT where parameters are still flattened
# 3 is efficient, but requires know-how to use the sharded checkpoint.

fsdp_cpu_ram_efficient_loading: true # for large models set to true, model loaded on single process
fsdp_sync_module_states: true # for large models set to true, model loaded on single process

# not needed for HF models that have `_no_split_modules` defined
# the example below is for GPTBigCode
# fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock"

# for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
# learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options
# would be fp16, bf16, etc.
mixed_precision: 'no'

machine_rank: 0 # rank of the machine where accelerate is launched
num_machines: 1
num_processes: 1 # default, override with --num_processes

# the rendezvous method to use in distributed training. Other option is c10d
rdzv_backend: static
same_network: true

# the arguments below are required when training in a multi-node setup;
# for multi-GPU single-node training they default to the values below
# main_process_ip: 127.0.0.1 # override with --main_process_ip
# main_process_port: 29500 # override with --main_process_port
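As a usage sketch, the per-run values above can be overridden on the `accelerate launch` command line; the flags below are the standard `accelerate launch` options referenced in the comments above, and the node count, process count, address, and port are placeholders:

```bash
# Launch the same config on a 2-node cluster with 8 GPUs per node (16 processes total).
# Run this on every node; only NODE_RANK differs per node (0 on the main node, 1 on the other).
# MASTER_ADDR/MASTER_PORT are placeholders for the main node's IP and an unused port.
accelerate launch \
    --config_file fixtures/accelerate_fsdp_defaults.yaml \
    --num_machines=2 \
    --machine_rank=$NODE_RANK \
    --num_processes=16 \
    --main_process_ip=$MASTER_ADDR \
    --main_process_port=$MASTER_PORT \
    tuning/sft_trainer.py \
    --model_name_or_path $MODEL_PATH \
    --training_data_path $TRAIN_DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --torch_dtype bfloat16
```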
12 changes: 0 additions & 12 deletions tuning/config/fsdp_config.json

This file was deleted.
