feat: update launch training with accelerate for multi-gpu (#98)

* add accelerate launch script

Signed-off-by: Anh-Uong <[email protected]>

* give ownership of fms-hf-tuning repo to tuning user

Signed-off-by: Anh-Uong <[email protected]>

* fix: training script param

Signed-off-by: Anh-Uong <[email protected]>

* format script, add logging, add fsdp defaults file

Signed-off-by: Anh-Uong <[email protected]>

* set default accelerate config and set num_processes if multi-gpu

Signed-off-by: Anh-Uong <[email protected]>

* refactor copy and chmod

Signed-off-by: Anh-Uong <[email protected]>

* run accelerate script by default, run fmt

Signed-off-by: Anh-Uong <[email protected]>

* allow for multiGPU to be empty, run lint

Signed-off-by: Anh-Uong <[email protected]>

* explicitly set single GPU

Signed-off-by: Anh-Uong <[email protected]>

* docs: build and run image with configs

Signed-off-by: Anh-Uong <[email protected]>

* fix building dockerfile

Signed-off-by: Anh-Uong <[email protected]>

* fixes based on review comments

- determine list of store action params
- only override num_processes if no config_file
- update json key multiGPU to accelerate_launch_args
- update docs

Signed-off-by: Anh-Uong <[email protected]>

* multiGPU interpreted outside of accelerate params

Signed-off-by: Anh-Uong <[email protected]>

* Add support for parsing more accelerate launch params (#1)

* Add support for parsing more accelerate launch params

Signed-off-by: Thara Palanivel <[email protected]>

* Formatting

Signed-off-by: Thara Palanivel <[email protected]>

* Addressing review comments

Signed-off-by: Thara Palanivel <[email protected]>

---------

Signed-off-by: Thara Palanivel <[email protected]>
Signed-off-by: Anh-Uong <[email protected]>

* fix logic for addt param parsing, docs

Signed-off-by: Anh-Uong <[email protected]>

* Use config only if multi_gpu (#2)

* Use fsdp config only if multi_gpu

Signed-off-by: Thara Palanivel <[email protected]>

* Simplifying multi-gpu logic

Signed-off-by: Thara Palanivel <[email protected]>

* Fixing typo

Signed-off-by: Thara Palanivel <[email protected]>

* Address review comments

Signed-off-by: Thara Palanivel <[email protected]>

* Fix typo

Signed-off-by: Thara Palanivel <[email protected]>

---------

Signed-off-by: Thara Palanivel <[email protected]>
Signed-off-by: Anh-Uong <[email protected]>

* doc and comment updates from feedback

Signed-off-by: Anh-Uong <[email protected]>

---------

Signed-off-by: Anh-Uong <[email protected]>
Signed-off-by: Thara Palanivel <[email protected]>
Co-authored-by: tharapalanivel <[email protected]>
anhuong and tharapalanivel authored Apr 2, 2024
1 parent 79b0fd3 commit 2df20ba
Showing 4 changed files with 290 additions and 7 deletions.
13 changes: 8 additions & 5 deletions build/Dockerfile
@@ -109,8 +109,11 @@ RUN git clone https://github.com/foundation-model-stack/fms-hf-tuning.git && \
 RUN mkdir -p /licenses
 COPY LICENSE /licenses/
 
-COPY launch_training.py /app
-RUN chmod +x /app/launch_training.py
+# Copy scripts and default configs
+COPY build/launch_training.py build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/
+RUN chmod +x /app/launch_training.py /app/accelerate_launch.py
+
+ENV FSDP_DEFAULTS_FILE_PATH="/app/accelerate_fsdp_defaults.yaml"
 
 # Need a better way to address this hack
 RUN touch /.aim_profile && \
@@ -120,10 +123,10 @@ RUN touch /.aim_profile && \
 
 # create tuning user and give ownership to dirs
 RUN useradd -u $USER_UID tuning -m -g 0 --system && \
-    chown -R $USER:0 /app && \
-    chmod -R g+rwX /app
+    chown -R $USER:0 /app /tmp && \
+    chmod -R g+rwX /app /tmp
 
 WORKDIR /app
 USER ${USER}
 
-CMD [ "tail", "-f", "/dev/null" ]
+CMD [ "python", "/app/accelerate_launch.py" ]
165 changes: 165 additions & 0 deletions build/README.md
@@ -0,0 +1,165 @@
# Building fms-hf-tuning as an Image

The Dockerfile provides a way of running the fms-hf-tuning SFT Trainer. It installs the needed dependencies and adds two scripts that parse arguments to pass to SFT Trainer. The `accelerate_launch.py` script runs by default when the image starts; it parses the arguments and triggers SFT Trainer for single- or multi-GPU tuning by running `accelerate launch launch_training.py`.
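
For a two-GPU run such as the example config further below, the argument list the wrapper assembles and hands to accelerate's `launch_command` looks roughly like this (a sketch only; the exact flags depend on the JSON config):

```py
args = [
    "--num_processes", "2",
    "--config_file", "/app/accelerate_fsdp_defaults.yaml",  # added automatically for multi-GPU runs
    "/app/launch_training.py",  # the training script is always appended last
]
```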

## Configuration

The scripts accept a JSON-formatted config whose location or contents are provided via environment variables. Set `SFT_TRAINER_CONFIG_JSON_PATH` to the mounted path of the JSON config. Alternatively, set `SFT_TRAINER_CONFIG_JSON_ENV_VAR` to the base64-encoded JSON config, which can be produced with the function below:

```py
import base64

def encode_json(my_json_string):
    base64_bytes = base64.b64encode(my_json_string.encode("ascii"))
    txt = base64_bytes.decode("ascii")
    return txt

with open("test_config.json") as f:
    contents = f.read()

encode_json(contents)
```
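
To sanity-check the encoding locally, the string can be decoded the same way the launcher's `txt_to_obj` helper does (base64-decode, then parse as JSON). A minimal sketch, reusing `encode_json` from the snippet above:

```py
import base64
import json

def decode_json(encoded):
    # Mirrors txt_to_obj in build/accelerate_launch.py for the JSON case
    return json.loads(base64.b64decode(encoded.encode("ascii")))

with open("test_config.json") as f:
    original = json.loads(f.read())

# Round trip: encoding then decoding should give back the same config
assert decode_json(encode_json(json.dumps(original))) == original
```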

The keys for the JSON config are all of the flags available to use with [SFT Trainer](https://huggingface.co/docs/trl/sft_trainer#trl.SFTTrainer).

For configuring `accelerate launch`, use the key `accelerate_launch_args` and pass the set of flags accepted by [accelerate launch](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch). Since these flags are passed via the JSON config, each key must match the long-form flag name. For example, to enable the flag `--quiet`, use the JSON key `"quiet"`; the short form `"q"` will fail.
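
As an illustration of how these keys become CLI flags, the loop in `build/accelerate_launch.py` behaves roughly like the simplified sketch below (the real script inspects accelerate's argparse actions to decide which keys are value-less flags):

```py
config = {"num_processes": 2, "use_fsdp": True, "quiet": True}

args = []
for key, val in config.items():
    if isinstance(val, bool):
        # boolean true -> bare long-form flag, e.g. --quiet
        if val:
            args.append(f"--{key}")
    else:
        # everything else -> "--key value"
        args.extend([f"--{key}", str(val)])

print(args)  # ['--num_processes', '2', '--use_fsdp', '--quiet']
```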

For example, the config below runs fine tuning with two GPUs and FSDP:

```json
{
    "accelerate_launch_args": {
        "num_machines": 1,
        "main_process_port": 1234,
        "num_processes": 2,
        "use_fsdp": true,
        "fsdp_backward_prefetch_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_sharding_strategy": 1,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_cpu_ram_efficient_loading": true,
        "fsdp_sync_module_states": true
    },
    "model_name_or_path": "/llama/13B",
    "training_data_path": "/data/twitter_complaints.json",
    "output_dir": "/output/llama-7b-pt-multigpu",
    "num_train_epochs": 5.0,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "save_strategy": "epoch",
    "learning_rate": 0.03,
    "weight_decay": 0.0,
    "lr_scheduler_type": "cosine",
    "logging_steps": 1.0,
    "packing": false,
    "include_tokens_per_second": true,
    "response_template": "\n### Label:",
    "dataset_text_field": "output",
    "use_flash_attn": true,
    "torch_dtype": "bfloat16",
    "tokenizer_name_or_path": "/llama/13B"
}
```

Users should always set `num_processes` to be explicit about the number of processes to run tuning on. When `num_processes` is greater than 1, the default [FSDP config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/fixtures/accelerate_fsdp_defaults.yaml) is used. You can also supply your own defaults by specifying a config file with the key `config_file`. Any of the values in these configs can be overridden by passing the corresponding flags via `accelerate_launch_args` in the JSON config.
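
For reference, the selection logic in `build/accelerate_launch.py` roughly reduces to the sketch below (`pick_config_file` is an illustrative helper, not a function in the script):

```py
import os

def pick_config_file(accelerate_config: dict):
    num_processes = accelerate_config.get("num_processes")
    if num_processes and num_processes > 1 and not accelerate_config.get("config_file"):
        default = os.getenv("FSDP_DEFAULTS_FILE_PATH", "/app/accelerate_fsdp_defaults.yaml")
        if os.path.exists(default):
            return default  # fall back to the bundled FSDP defaults
    return accelerate_config.get("config_file")  # may be None

print(pick_config_file({"num_processes": 2}))  # /app/accelerate_fsdp_defaults.yaml (if present)
```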

Note that `num_processes`, which is the total number of processes to be launched in parallel, should match the number of GPUs to run on. The number of GPUs used can also be set with the environment variable `CUDA_VISIBLE_DEVICES`. If `num_processes=1`, the script assumes a single GPU.
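
If you prefer to derive `num_processes` rather than hard-code it, a small helper along these lines can be run wherever training will execute (this assumes `torch` is installed there; it is not part of the image's launch flow):

```py
import os

import torch  # assumption: torch is available where this helper runs

def gpus_visible() -> int:
    # CUDA_VISIBLE_DEVICES, if set, limits which GPUs torch can see
    visible = os.getenv("CUDA_VISIBLE_DEVICES")
    if visible:
        return len([d for d in visible.split(",") if d])
    return torch.cuda.device_count()

print(f"Set accelerate_launch_args.num_processes to {gpus_visible()}")
```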


## Building the Image

With docker, build the image at the top level with:

```sh
docker build . -t sft-trainer:mytag -f build/Dockerfile
```

## Running the Image

Run the sft-trainer image with the JSON config env var and the volume mounts set up:

```sh
docker run -v $(pwd)/config.json:/app/config.json -v $MODEL_PATH:/model -v $TRAINING_DATA_PATH:/data/twitter_complaints.json --env SFT_TRAINER_CONFIG_JSON_PATH=/app/config.json sft-trainer:mytag
```

This will run `accelerate_launch.py` with the JSON config passed.

An example Kubernetes Pod for deploying sft-trainer, which requires creating PVCs with the model and input dataset as well as any mounts needed for the output tuned model:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sft-trainer-config
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "main_process_port": 1234,
        "num_processes": 2,
        "use_fsdp": true,
        "fsdp_backward_prefetch_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_sharding_strategy": 1,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_cpu_ram_efficient_loading": true,
        "fsdp_sync_module_states": true
      },
      "model_name_or_path": "/llama/13B",
      "training_data_path": "/data/twitter_complaints.json",
      "output_dir": "/output/llama-7b-pt-multigpu",
      "num_train_epochs": 5.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "save_strategy": "epoch",
      "learning_rate": 0.03,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": true,
      "torch_dtype": "bfloat16",
      "tokenizer_name_or_path": "/llama/13B"
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: sft-trainer-test
spec:
  containers:
    - env:
        - name: SFT_TRAINER_CONFIG_JSON_PATH
          value: /config/config.json
      image: sft-trainer:mytag
      imagePullPolicy: IfNotPresent
      name: tuning-test
      resources:
        limits:
          nvidia.com/gpu: "2"
        requests:
          nvidia.com/gpu: "2"
      volumeMounts:
        - mountPath: /data/input
          name: input-data
        - mountPath: /data/output
          name: output-data
        - mountPath: /config
          name: sft-trainer-config
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  volumes:
    - name: input-data
      persistentVolumeClaim:
        claimName: input-pvc
    - name: output-data
      persistentVolumeClaim:
        claimName: output-pvc
    - name: sft-trainer-config
      configMap:
        name: sft-trainer-config
```
114 changes: 114 additions & 0 deletions build/accelerate_launch.py
@@ -0,0 +1,114 @@
# Copyright The FMS HF Tuning Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Script wraps launch_training to run with accelerate for multi and single GPU cases.
Read accelerate_launch_args configuration via environment variable `SFT_TRAINER_CONFIG_JSON_PATH`
for the path to the JSON config file with parameters or `SFT_TRAINER_CONFIG_JSON_ENV_VAR`
for the encoded config string to parse.
"""

# Standard
import json
import os
import base64
import pickle
import logging

# Third Party
from accelerate.commands.launch import launch_command_parser, launch_command


def txt_to_obj(txt):
    base64_bytes = txt.encode("ascii")
    message_bytes = base64.b64decode(base64_bytes)
    try:
        # If the bytes represent a JSON string
        return json.loads(message_bytes)
    except UnicodeDecodeError:
        # Otherwise the bytes are a pickled python dictionary
        return pickle.loads(message_bytes)


def main():
    LOGLEVEL = os.environ.get("LOG_LEVEL", "WARNING").upper()
    logging.basicConfig(level=LOGLEVEL)

    json_configs = {}
    json_path = os.getenv("SFT_TRAINER_CONFIG_JSON_PATH")
    json_env_var = os.getenv("SFT_TRAINER_CONFIG_JSON_ENV_VAR")

    if json_path:
        with open(json_path, "r", encoding="utf-8") as f:
            json_configs = json.load(f)

    elif json_env_var:
        json_configs = txt_to_obj(json_env_var)

    parser = launch_command_parser()
    # Map to determine which flags don't require a value to be set
    actions_type_map = {
        action.dest: type(action).__name__ for action in parser._actions
    }

    # Parse accelerate_launch_args
    accelerate_launch_args = []
    accelerate_config = json_configs.get("accelerate_launch_args", {})
    if accelerate_config:
        logging.info("Using accelerate_launch_args configs: %s", accelerate_config)
        for key, val in accelerate_config.items():
            if actions_type_map.get(key) == "_AppendAction":
                for param_val in val:
                    accelerate_launch_args.extend([f"--{key}", str(param_val)])
            elif (actions_type_map.get(key) == "_StoreTrueAction" and val) or (
                actions_type_map.get(key) == "_StoreFalseAction" and not val
            ):
                accelerate_launch_args.append(f"--{key}")
            else:
                accelerate_launch_args.append(f"--{key}")
                # Only need to add the value for params that aren't flags, i.e. --quiet takes no value
                if actions_type_map.get(key) == "_StoreAction":
                    accelerate_launch_args.append(str(val))

    num_processes = accelerate_config.get("num_processes")
    if num_processes:
        # if multi GPU setting and accelerate config_file not passed by user,
        # use the default config for default set of parameters
        if num_processes > 1 and not accelerate_config.get("config_file"):
            # Add default FSDP config
            fsdp_filepath = os.getenv(
                "FSDP_DEFAULTS_FILE_PATH", "/app/accelerate_fsdp_defaults.yaml"
            )
            if os.path.exists(fsdp_filepath):
                logging.info("Using accelerate config file: %s", fsdp_filepath)
                accelerate_launch_args.extend(["--config_file", fsdp_filepath])

        elif num_processes == 1:
            logging.info("num_processes=1 so setting env var CUDA_VISIBLE_DEVICES=0")
            os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    else:
        logging.warning(
            "num_processes param was not passed in. Value from config file (if available) will \
            be used or accelerate launch will determine number of processes automatically"
        )

    # Add training_script
    accelerate_launch_args.append("/app/launch_training.py")

    logging.debug("accelerate_launch_args: %s", accelerate_launch_args)
    args = parser.parse_args(args=accelerate_launch_args)
    logging.debug("accelerate launch parsed args: %s", args)
    launch_command(args)


if __name__ == "__main__":
    main()
5 changes: 3 additions & 2 deletions build/launch_training.py
@@ -66,7 +66,8 @@ def main():
     LOGLEVEL = os.environ.get("LOG_LEVEL", "WARNING").upper()
     logging.basicConfig(level=LOGLEVEL)
 
-    logging.info("Attempting to launch training script")
+    logging.info("Initializing launch training script")
+
     parser = transformers.HfArgumentParser(
         dataclass_types=(
             configs.ModelArguments,
@@ -122,7 +123,7 @@ def main():
     elif peft_method_parsed == "pt":
         tune_config = prompt_tuning_config
 
-    logging.debug(
+    logging.info(
         "Parameters used to launch training: \
         model_args %s, data_args %s, training_args %s, tune_config %s",
         model_args,
