feat: move to accelerate launch for distributed training #92

Merged
merged 13 commits into from
Mar 18, 2024
43 changes: 34 additions & 9 deletions README.md
Expand Up @@ -60,6 +60,10 @@ Current supported and tested models are `Llama2` (7 and 13B configurations have
# if you want to use one GPU on multi-gpu machine
export CUDA_VISIBLE_DEVICES=0

MODEL_PATH=llama-7b-hf # Huggingface model id or path to a checkpoint
DATA_PATH=twitter_complaints.json # Path to the dataset
OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $DATA_PATH \
Expand All @@ -83,15 +87,38 @@ python tuning/sft_trainer.py \
```

### Multiple GPUs with FSDP

We recommend using [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/en/index) to launch multi-GPU jobs, in particular when using FSDP:
- `accelerate` is written on top of [`torch.distributed.run`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py).
- The `accelerate launch` CLI is highly similar to `torchrun` and spawns multiple jobs (one for each GPU).
- It is tightly integrated with the [Hugging Face Trainer](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).

Run the `accelerate launch` CLI with specific command line arguments, as in the example below. Default arguments are supplied by passing a
`--config_file` argument; see the [reference docs](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) and [fixtures/accelerate_fsdp_defaults.yaml](./fixtures/accelerate_fsdp_defaults.yaml) for sample defaults.

```bash
torchrun \
--nnodes=1 \
--nproc_per_node=8 \
--master_port=1234 \
MODEL_PATH=llama-7b-hf # Huggingface model id or path to a checkpoint
DATA_PATH=twitter_complaints.json # Path to the dataset
OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

# MASTER_PORT and MASTER_ADDR are required for multi-node training and
# are not needed for multi-GPU training on a single node
MASTER_PORT=1234 # The port on which the process with rank 0 listens
MASTER_ADDR=x.x.x.x # The IP address of the node with rank 0


```bash
accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
--num_machines=1 \
--num_processes=8 \
--main_process_port=1234 \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $DATA_PATH \
--bf16 True \
--data_path $DATA_PATH \
--torch_dtype bfloat16 \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
Expand All @@ -104,16 +131,14 @@ tuning/sft_trainer.py \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_config tuning/config/fsdp_config.json \
--include_tokens_per_second \
--packing False \
--response_template "\n### Response:" \
--dataset_text_field "output"
```
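
As a rough illustration of how the launch arguments in the example above fit together, the following Python sketch assembles an `accelerate launch` argument list. The helper name `build_launch_command` and its parameters are hypothetical and not part of this repository. It uses only flags that appear in the example above, plus `--machine_rank`, which mirrors the `machine_rank` key of the accelerate config. It assumes, as the comments in the example state, that the main process address and port only matter for multi-node runs.

```python
import shlex
from typing import List, Optional, Sequence


def build_launch_command(
    script: str,
    script_args: Sequence[str],
    num_processes: int,
    config_file: str = "fixtures/accelerate_fsdp_defaults.yaml",
    num_machines: int = 1,
    machine_rank: int = 0,
    main_process_ip: Optional[str] = None,
    main_process_port: Optional[int] = None,
) -> List[str]:
    """Assemble an `accelerate launch` argv list (hypothetical helper)."""
    if num_processes < 1 or num_machines < 1:
        raise ValueError("num_processes and num_machines must be positive")

    cmd = [
        "accelerate", "launch",
        "--config_file", config_file,
        "--num_machines", str(num_machines),
        "--num_processes", str(num_processes),
    ]

    if num_machines > 1:
        # Multi-node: every node must know where the rank 0 process listens.
        if main_process_ip is None or main_process_port is None:
            raise ValueError(
                "multi-node runs need main_process_ip and main_process_port"
            )
        cmd += [
            "--machine_rank", str(machine_rank),
            "--main_process_ip", main_process_ip,
            "--main_process_port", str(main_process_port),
        ]

    # Everything after the script path is passed to the training script.
    return cmd + [script, *script_args]


if __name__ == "__main__":
    argv = build_launch_command(
        "tuning/sft_trainer.py",
        ["--model_name_or_path", "llama-7b-hf", "--output_dir", "out"],
        num_processes=8,
    )
    # Print a shell-quoted command line for inspection.
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps the result safe to hand to `subprocess.run` without shell quoting issues.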


For `GPTBigCode` models, Hugging Face has enabled Flash v2 and one can simply replace the `'LlamaDecoderLayer'` with `'GPTBigCodeBlock'` in `tuning/config/fsdp_config.json` for proper sharding of the model.
Typically, the transformer layer module is passed as the wrapping unit so that each layer forms one FSDP unit. For `GPTBigCode` models, Hugging Face has enabled Flash Attention v2, and one can simply replace `'LlamaDecoderLayer'` with `'GPTBigCodeBlock'` in `config/accelerate_fsdp_llama_2_procs.yaml` for proper sharding of the model.
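
The replacement described above can be sketched in a few lines of Python. This is only an illustration: `retarget_wrap_class` is a hypothetical helper, and the sample text stands in for the relevant line of the accelerate config rather than reproducing the real file.

```python
def retarget_wrap_class(
    config_text: str,
    old_cls: str = "LlamaDecoderLayer",
    new_cls: str = "GPTBigCodeBlock",
) -> str:
    """Return config text with the FSDP wrap class swapped (hypothetical helper)."""
    if old_cls not in config_text:
        # Fail loudly instead of silently returning an unchanged config.
        raise ValueError(f"{old_cls!r} not found; nothing to replace")
    return config_text.replace(old_cls, new_cls)


if __name__ == "__main__":
    # Stand-in for the relevant line of the accelerate config file.
    sample = "fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer\n"
    print(retarget_wrap_class(sample), end="")
```

Raising when the old class name is absent guards against editing a config that was already retargeted or that relies on the model's own no-split modules.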

### LoRA Tuning Example

16 changes: 10 additions & 6 deletions examples/prompt_tuning_twitter_complaints/README.md
Expand Up @@ -35,15 +35,21 @@ MODEL_PATH=llama-7b-hf
DATA_PATH=twitter_complaints.json
OUTPUT_PATH=out

torchrun \
--nnodes=1 \
--nproc_per_node=8 \
--master_port=1234 \
# MASTER_PORT and MASTER_ADDR are required for multi-node training and
# are not needed for multi-GPU training on a single node
MASTER_PORT=1234 # The port on which the process with rank 0 listens
MASTER_ADDR=x.x.x.x # The IP address of the node with rank 0

accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--peft_method pt \
--torch_dtype bfloat16 \
--tokenizer_name_or_path $MODEL_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
Expand All @@ -56,8 +62,6 @@ tuning/sft_trainer.py \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_config tuning/config/fsdp_config.json \
--include_tokens_per_second \
--packing False \
--response_template "\n### Label:" \
60 changes: 60 additions & 0 deletions fixtures/accelerate_fsdp_defaults.yaml
@@ -0,0 +1,60 @@
# options that can be used with accelerate config are neatly documented here -
# https://github.com/huggingface/accelerate/blob/ee163b66fb7848892519e804688cb4ae981aacbe/docs/source/package_reference/cli.md

# type of compute environment, no need to change
compute_environment: LOCAL_MACHINE # AMAZON_SAGEMAKER

# use FSDP distributed compute
distributed_type: FSDP

# FSDP specific configurations
fsdp_config:

  # use this for training transformers
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP

  # this controls the FSDP pipelining
  fsdp_backward_prefetch_policy: BACKWARD_PRE  # set to BACKWARD_PRE for the most time-efficient pipeline,
                                               # but it requires the most memory. BACKWARD_POST is the less
                                               # memory-intensive option

  # setting this to true increases forward-pass memory by prefetching the next FSDP all-gather
  # while the current forward pass is still running.
  fsdp_forward_prefetch: false

  # setting this to true offloads model and optimizer parameters to the CPU, saving GPU memory at the cost of
  # significantly more CPU time.
  fsdp_offload_params: false

  fsdp_sharding_strategy: 1  # set to FULL_SHARD (1) or SHARD_GRAD_OP (2);
                             # 3 is NO_SHARD, effectively disabling FSDP;
                             # 4 and 5 are HYBRID_ modes for multi-node training only.

  fsdp_state_dict_type: FULL_STATE_DICT  # set to FULL_STATE_DICT (1) or SHARDED_STATE_DICT (3);
                                         # 2 is LOCAL_STATE_DICT, where parameters are still flattened;
                                         # 3 is efficient, but requires know-how to use the sharded checkpoint.

  fsdp_cpu_ram_efficient_loading: true  # for large models set to true, model loaded on single process
  fsdp_sync_module_states: true         # set to true together with the option above, so that the weights
                                        # loaded on rank 0 are broadcast to the other ranks

  # not needed for HF models that define _no_split_modules
  # the example below is for GPTBigCode
  # fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock"

# for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
# learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options
# would be fp16, bf16, etc.
mixed_precision: 'no'

machine_rank: 0 # rank of the machine where accelerate is launched
num_machines: 1
num_processes: 1 # default, override with --num_processes

# the rendezvous method to use in distributed training. Other option is c10d
rdzv_backend: static
same_network: true

# below arguments are required when training in multi-node setup
# for multi-gpu single node, the below values default to
# main_process_ip: 127.0.0.1 # override with --main_process_ip
# main_process_port: 29500 # override with --main_process_port
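
The numeric `fsdp_sharding_strategy` values documented in the config above can be made explicit with a small lookup. The Python sketch below is illustrative only: the helper `check_sharding_strategy` is hypothetical, and the two hybrid names (`HYBRID_SHARD`, `HYBRID_SHARD_ZERO2`) are assumptions, since the config comments refer to them only as `HYBRID_` modes.

```python
SHARDING_STRATEGIES = {
    1: "FULL_SHARD",
    2: "SHARD_GRAD_OP",
    3: "NO_SHARD",            # effectively disables FSDP
    4: "HYBRID_SHARD",        # assumed name for the first HYBRID_ mode
    5: "HYBRID_SHARD_ZERO2",  # assumed name for the second HYBRID_ mode
}


def check_sharding_strategy(value: int, num_machines: int = 1) -> str:
    """Map a numeric strategy to its name and reject inconsistent setups."""
    if value not in SHARDING_STRATEGIES:
        raise ValueError(f"unknown fsdp_sharding_strategy: {value}")
    name = SHARDING_STRATEGIES[value]
    # The config comments describe the HYBRID_ modes as multi-node only.
    if name.startswith("HYBRID_") and num_machines < 2:
        raise ValueError(f"{name} is intended for multi-node training only")
    return name


if __name__ == "__main__":
    print(check_sharding_strategy(1))
    print(check_sharding_strategy(4, num_machines=2))
```

A check like this could run before launching a job, so that a hybrid mode on a single node is caught early rather than at training start.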
12 changes: 0 additions & 12 deletions tuning/config/fsdp_config.json

This file was deleted.
