
Commit

Merge branch 'main' into default_params
Signed-off-by: Thara Palanivel <[email protected]>
tharapalanivel committed Apr 2, 2024
2 parents 4cf6e5a + 2df20ba commit 0f806b9
Showing 3 changed files with 287 additions and 5 deletions.
13 changes: 8 additions & 5 deletions build/Dockerfile
@@ -109,8 +109,11 @@ RUN git clone https://github.com/foundation-model-stack/fms-hf-tuning.git && \
RUN mkdir -p /licenses
COPY LICENSE /licenses/

-COPY launch_training.py /app
-RUN chmod +x /app/launch_training.py
+# Copy scripts and default configs
+COPY build/launch_training.py build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/
+RUN chmod +x /app/launch_training.py /app/accelerate_launch.py
+
+ENV FSDP_DEFAULTS_FILE_PATH="/app/accelerate_fsdp_defaults.yaml"

# Need a better way to address this hack
RUN touch /.aim_profile && \
@@ -120,10 +123,10 @@ RUN touch /.aim_profile && \

# create tuning user and give ownership to dirs
RUN useradd -u $USER_UID tuning -m -g 0 --system && \
-    chown -R $USER:0 /app && \
-    chmod -R g+rwX /app
+    chown -R $USER:0 /app /tmp && \
+    chmod -R g+rwX /app /tmp

WORKDIR /app
USER ${USER}

-CMD [ "tail", "-f", "/dev/null" ]
+CMD [ "python", "/app/accelerate_launch.py" ]
165 changes: 165 additions & 0 deletions build/README.md
@@ -0,0 +1,165 @@
# Building fms-hf-tuning as an Image

The Dockerfile provides a way of running the fms-hf-tuning SFT Trainer. It installs the required dependencies and adds two scripts that parse arguments to pass to the SFT Trainer. When the image runs, the `accelerate_launch.py` script is executed by default; it parses the arguments and runs `accelerate launch launch_training.py` to trigger the SFT Trainer on a single GPU or multiple GPUs.

## Configuration

The scripts accept a JSON-formatted config, which is passed in via environment variables. `SFT_TRAINER_CONFIG_JSON_PATH` can be set to the mounted path of the JSON config file. Alternatively, `SFT_TRAINER_CONFIG_JSON_ENV_VAR` can be set to the JSON config encoded with the function below:

```py
import base64

def encode_json(my_json_string):
    base64_bytes = base64.b64encode(my_json_string.encode("ascii"))
    txt = base64_bytes.decode("ascii")
    return txt

with open("test_config.json") as f:
    contents = f.read()

encode_json(contents)
```
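
The encoded string can then be supplied to the container as `SFT_TRAINER_CONFIG_JSON_ENV_VAR` (for example via `docker run --env`). A minimal sketch, assuming `test_config.json` holds a valid SFT Trainer config:

```py
import base64
import json

with open("test_config.json") as f:
    config = json.load(f)  # also validates that the file is proper JSON

encoded = base64.b64encode(json.dumps(config).encode("ascii")).decode("ascii")
print(encoded)  # paste this value into SFT_TRAINER_CONFIG_JSON_ENV_VAR
```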

The keys for the JSON config are all of the flags available to use with [SFT Trainer](https://huggingface.co/docs/trl/sft_trainer#trl.SFTTrainer).

For configuring `accelerate launch`, use the key `accelerate_launch_args` and pass the set of flags accepted by [accelerate launch](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch). Since these flags are passed via the JSON config, each key must match the long-form flag name. For example, to enable the `--quiet` flag, use the JSON key `"quiet"`; using the short form `"q"` will fail.

For example, the config below runs fine tuning on two GPUs with FSDP:

```json
{
    "accelerate_launch_args": {
        "num_machines": 1,
        "main_process_port": 1234,
        "num_processes": 2,
        "use_fsdp": true,
        "fsdp_backward_prefetch_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_sharding_strategy": 1,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_cpu_ram_efficient_loading": true,
        "fsdp_sync_module_states": true
    },
    "model_name_or_path": "/llama/13B",
    "training_data_path": "/data/twitter_complaints.json",
    "output_dir": "/output/llama-7b-pt-multigpu",
    "num_train_epochs": 5.0,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "save_strategy": "epoch",
    "learning_rate": 0.03,
    "weight_decay": 0.0,
    "lr_scheduler_type": "cosine",
    "logging_steps": 1.0,
    "packing": false,
    "include_tokens_per_second": true,
    "response_template": "\n### Label:",
    "dataset_text_field": "output",
    "use_flash_attn": true,
    "torch_dtype": "bfloat16",
    "tokenizer_name_or_path": "/llama/13B"
}
```
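
To see how the long-form keys under `accelerate_launch_args` become CLI flags, here is a minimal sketch (not part of this commit) of the mapping that `accelerate_launch.py` performs with accelerate's argument parser; the sample dict is a hypothetical subset of the config above:

```py
from accelerate.commands.launch import launch_command_parser

parser = launch_command_parser()
# Map each known flag name to its argparse action type, as the launcher does
actions_type_map = {a.dest: type(a).__name__ for a in parser._actions}

sample = {"num_machines": 1, "num_processes": 2, "use_fsdp": True}
flags = []
for key, val in sample.items():
    kind = actions_type_map.get(key)
    if kind == "_StoreTrueAction" and val:
        flags.append(f"--{key}")        # boolean flag: key only
    else:
        flags.append(f"--{key}")
        if kind == "_StoreAction":
            flags.append(str(val))      # value-taking flag: key then value

print(flags)  # e.g. ['--num_machines', '1', '--num_processes', '2', '--use_fsdp']
```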

Users should always set `num_processes` to be explicit about the number of processes to run tuning on. When `num_processes` is greater than 1, the default [FSDP config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/fixtures/accelerate_fsdp_defaults.yaml) is used. You can also supply your own defaults by specifying a config file with the `config_file` key. Any of the values in these configs can be overridden by passing the corresponding flags via `accelerate_launch_args` in the JSON config.

Note that `num_processes`, which is the total number of processes to be launched in parallel, should match the number of GPUs to run on. The number of GPUs used can also be set via the environment variable `CUDA_VISIBLE_DEVICES`. If `num_processes=1`, the script assumes a single-GPU run.
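
As a rough illustration of the behaviour described above, the handling of `num_processes` in `accelerate_launch.py` (shown in full later in this commit) boils down to roughly the following; `resolve_launch_defaults` is a hypothetical helper used only for this sketch:

```py
import os

def resolve_launch_defaults(accelerate_config: dict, launch_args: list) -> list:
    num_processes = accelerate_config.get("num_processes")
    if num_processes and num_processes > 1 and not accelerate_config.get("config_file"):
        # Multi-GPU run with no user-supplied accelerate config: fall back to
        # the FSDP defaults packaged into the image.
        fsdp = os.getenv("FSDP_DEFAULTS_FILE_PATH", "/app/accelerate_fsdp_defaults.yaml")
        if os.path.exists(fsdp):
            launch_args.extend(["--config_file", fsdp])
    elif num_processes == 1:
        # Single process: pin training to one GPU.
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    return launch_args
```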


## Building the Image

With Docker, build the image from the repository root with:

```sh
docker build . -t sft-trainer:mytag -f build/Dockerfile
```

## Running the Image

Run the sft-trainer image with the JSON config environment variable and the required volume mounts set up:

```sh
docker run -v config.json:/app/config.json -v $MODEL_PATH:/model -v $TRAINING_DATA_PATH:/data/twitter_complaints.json --env SFT_TRAINER_CONFIG_JSON_PATH=/app/config.json sft-trainer:mytag
```

This will run `accelerate_launch.py` with the JSON config passed.

Below is an example Kubernetes Pod for deploying sft-trainer. It requires creating PVCs that hold the model and the input dataset, plus any mounts needed for the output tuned model:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sft-trainer-config
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "main_process_port": 1234,
        "num_processes": 2,
        "use_fsdp": true,
        "fsdp_backward_prefetch_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_sharding_strategy": 1,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_cpu_ram_efficient_loading": true,
        "fsdp_sync_module_states": true
      },
      "model_name_or_path": "/llama/13B",
      "training_data_path": "/data/twitter_complaints.json",
      "output_dir": "/output/llama-7b-pt-multigpu",
      "num_train_epochs": 5.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "save_strategy": "epoch",
      "learning_rate": 0.03,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": true,
      "torch_dtype": "bfloat16",
      "tokenizer_name_or_path": "/llama/13B"
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: sft-trainer-test
spec:
  containers:
  - env:
    - name: SFT_TRAINER_CONFIG_JSON_PATH
      value: /config/config.json
    image: sft-trainer:mytag
    imagePullPolicy: IfNotPresent
    name: tuning-test
    resources:
      limits:
        nvidia.com/gpu: "2"
      requests:
        nvidia.com/gpu: "2"
    volumeMounts:
    - mountPath: /data/input
      name: input-data
    - mountPath: /data/output
      name: output-data
    - mountPath: /config
      name: sft-trainer-config
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  volumes:
  - name: input-data
    persistentVolumeClaim:
      claimName: input-pvc
  - name: output-data
    persistentVolumeClaim:
      claimName: output-pvc
  - name: sft-trainer-config
    configMap:
      name: sft-trainer-config
```
114 changes: 114 additions & 0 deletions build/accelerate_launch.py
@@ -0,0 +1,114 @@
# Copyright The FMS HF Tuning Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Script wraps launch_training to run with accelerate for multi and single GPU cases.
Read accelerate_launch_args configuration via environment variable `SFT_TRAINER_CONFIG_JSON_PATH`
for the path to the JSON config file with parameters or `SFT_TRAINER_CONFIG_JSON_ENV_VAR`
for the encoded config string to parse.
"""

# Standard
import json
import os
import base64
import pickle
import logging

# Third Party
from accelerate.commands.launch import launch_command_parser, launch_command


def txt_to_obj(txt):
    base64_bytes = txt.encode("ascii")
    message_bytes = base64.b64decode(base64_bytes)
    try:
        # If the bytes represent JSON string
        return json.loads(message_bytes)
    except UnicodeDecodeError:
        # Otherwise the bytes are a pickled python dictionary
        return pickle.loads(message_bytes)


def main():
    LOGLEVEL = os.environ.get("LOG_LEVEL", "WARNING").upper()
    logging.basicConfig(level=LOGLEVEL)

    json_configs = {}
    json_path = os.getenv("SFT_TRAINER_CONFIG_JSON_PATH")
    json_env_var = os.getenv("SFT_TRAINER_CONFIG_JSON_ENV_VAR")

    if json_path:
        with open(json_path, "r", encoding="utf-8") as f:
            json_configs = json.load(f)

    elif json_env_var:
        json_configs = txt_to_obj(json_env_var)

    parser = launch_command_parser()
    # Map to determine which flags don't require a value to be set
    actions_type_map = {
        action.dest: type(action).__name__ for action in parser._actions
    }

    # Parse accelerate_launch_args
    accelerate_launch_args = []
    accelerate_config = json_configs.get("accelerate_launch_args", {})
    if accelerate_config:
        logging.info("Using accelerate_launch_args configs: %s", accelerate_config)
        for key, val in accelerate_config.items():
            if actions_type_map.get(key) == "_AppendAction":
                for param_val in val:
                    accelerate_launch_args.extend([f"--{key}", str(param_val)])
            elif (actions_type_map.get(key) == "_StoreTrueAction" and val) or (
                actions_type_map.get(key) == "_StoreFalseAction" and not val
            ):
                accelerate_launch_args.append(f"--{key}")
            else:
                accelerate_launch_args.append(f"--{key}")
                # Only need to add key for params that aren't flags ie. --quiet
                if actions_type_map.get(key) == "_StoreAction":
                    accelerate_launch_args.append(str(val))

    num_processes = accelerate_config.get("num_processes")
    if num_processes:
        # if multi GPU setting and accelerate config_file not passed by user,
        # use the default config for default set of parameters
        if num_processes > 1 and not accelerate_config.get("config_file"):
            # Add default FSDP config
            fsdp_filepath = os.getenv(
                "FSDP_DEFAULTS_FILE_PATH", "/app/accelerate_fsdp_defaults.yaml"
            )
            if os.path.exists(fsdp_filepath):
                logging.info("Using accelerate config file: %s", fsdp_filepath)
                accelerate_launch_args.extend(["--config_file", fsdp_filepath])

        elif num_processes == 1:
            logging.info("num_processes=1 so setting env var CUDA_VISIBLE_DEVICES=0")
            os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    else:
        logging.warning(
            "num_processes param was not passed in. Value from config file (if available) will \
            be used or accelerate launch will determine number of processes automatically"
        )

    # Add training_script
    accelerate_launch_args.append("/app/launch_training.py")

    logging.debug("accelerate_launch_args: %s", accelerate_launch_args)
    args = parser.parse_args(args=accelerate_launch_args)
    logging.debug("accelerate launch parsed args: %s", args)
    launch_command(args)


if __name__ == "__main__":
    main()
