Multi-GPU switch from TorchRun to Accelerate #91

Closed
README.md: 26 changes (16 additions & 10 deletions)
@@ -83,15 +83,26 @@ python tuning/sft_trainer.py \
 ```
 
 ### Multiple GPUs with FSDP
+
+The recommendation is to use [huggingface accelerate](https://huggingface.co/docs/accelerate/en/index) to launch multi-gpu jobs, in particular when using FSDP:
+- `accelerate` is written on top of [`torch.distributed.run`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py).
+- the `accelerate launch` CLI is highly similar to `torchrun`; it spawns multiple jobs (one for each gpu).
+- it is tightly integrated with the [huggingface Trainer](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).
+
+The `accelerate launch` CLI should be run with specific command line arguments; see the example below. Default arguments are handled by passing in a
+`--config_file` argument; see the [reference docs](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) and [fixtures/accelerate_fsdp_defaults.yaml](./fixtures/accelerate_fsdp_defaults.yaml) for sample defaults.
+
+
 ```bash
-torchrun \
---nnodes=1 \
---nproc_per_node=8 \
---master_port=1234 \
+accelerate launch \
+--config_file fixtures/accelerate_fsdp_defaults.yaml \
+--num_machines=1 \
+--num_processes=8 \
+--main_process_port=1234 \
 tuning/sft_trainer.py \
 --model_name_or_path $MODEL_PATH \
 --data_path $DATA_PATH \
---bf16 True \
+--torch_dtype bfloat16 \
 --output_dir $OUTPUT_PATH \
 --num_train_epochs 5 \
 --per_device_train_batch_size 4 \
@@ -104,17 +115,12 @@ tuning/sft_trainer.py \
 --warmup_ratio 0.03 \
 --lr_scheduler_type "cosine" \
 --logging_steps 1 \
---fsdp "full_shard auto_wrap" \
---fsdp_config tuning/config/fsdp_config.json \
 --include_tokens_per_second \
 --packing False \
 --response_template "\n### Response:" \
 --dataset_text_field "output"
 ```
 
-
-For `GPTBigCode` models, Hugging Face has enabled Flash v2 and one can simply replace the `'LlamaDecoderLayer'` with `'GPTBigCodeBlock'` in `tuning/config/fsdp_config.json` for proper sharding of the model.
-
 ### LoRA Tuning Example
 
 ```bash
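For anyone migrating their own launch scripts, the flag correspondence implied by the diff above is summarized below; this is a sketch based only on the change shown here, not an exhaustive `torchrun`-to-`accelerate` mapping.

```bash
# torchrun flag             ->  accelerate launch equivalent (as used in this change)
#   --nnodes=1              ->  --num_machines=1
#   --nproc_per_node=8      ->  --num_processes=8
#   --master_port=1234      ->  --main_process_port=1234
# plus a new --config_file fixtures/accelerate_fsdp_defaults.yaml supplying default arguments
```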
fixtures/accelerate_fsdp_defaults.yaml: 52 changes (52 additions & 0 deletions)
@@ -0,0 +1,52 @@
# type of compute environment; no need to change
compute_environment: LOCAL_MACHINE # other option: AMAZON_SAGEMAKER

# use FSDP distributed compute
distributed_type: FSDP

# FSDP specific configurations
fsdp_config:

# use this for training transformers
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP

# this controls the FSDP pipelining
fsdp_backward_prefetch_policy: BACKWARD_PRE # set to BACKWARD_PRE for the most time-efficient pipeline,
                                            # but it requires the most memory; BACKWARD_POST is the
                                            # less memory-intensive option

# setting this to true will increase forward-pass memory by prefetching the next FSDP all-gather
# while performing the current forward pass.
fsdp_forward_prefetch: false

# setting this will offload model and optimizer parameters to the CPU, saving GPU memory at the cost of a
# significant increase in CPU time.
fsdp_offload_params: false

fsdp_sharding_strategy: 1 # set to FULL_SHARD (1) or SHARD_GRAD_OP (2);
                          # 3 is NO_SHARD, effectively disabling FSDP;
                          # 4 and 5 are HYBRID_ modes for multi-node training only.

fsdp_state_dict_type: FULL_STATE_DICT # set to FULL_STATE_DICT (1) or SHARDED_STATE_DICT (3);
                                      # 2 is LOCAL_STATE_DICT, where parameters are still flattened;
                                      # 3 is efficient, but requires know-how to use the sharded checkpoint.

# for large models set both to true: the model is then loaded on a single process and its
# states are synced to the other processes
fsdp_cpu_ram_efficient_loading: true
fsdp_sync_module_states: true

# not needed for HF models that have `_no_split_modules` defined;
# the example below is for GPTBigCode
# fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock"
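# a hypothetical illustration (not part of this file): the analogous setting for Llama models,
# whose decoder layer class is named in the README section rewritten above
# fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer"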

# for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
# learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options
# would be fp16, bf16, etc.
mixed_precision: 'no'

machine_rank: 0 # rank of the machine where accelerate is launched
num_machines: 1
num_processes: 1 # default, override with --num_processes

# the rendezvous method to use in distributed training. Other option is c10d
rdzv_backend: static
same_network: true
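As the `num_processes` comment above indicates, values in this defaults file can be overridden on the `accelerate launch` command line. A minimal sketch, reusing the environment variables from the README example and eliding the remaining `sft_trainer.py` arguments:

```bash
# launch with the sample defaults, overriding the process count for a 2-GPU run
accelerate launch \
  --config_file fixtures/accelerate_fsdp_defaults.yaml \
  --num_processes=2 \
  tuning/sft_trainer.py \
  --model_name_or_path $MODEL_PATH \
  --data_path $DATA_PATH \
  --output_dir $OUTPUT_PATH
```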
tuning/config/fsdp_config.json: 12 changes (0 additions & 12 deletions)

This file was deleted.
