docs: add note on ephemeral storage (foundation-model-stack#106)
* docs: add note on ephemeral storage

Signed-off-by: Anh-Uong <[email protected]>

* review suggestion, reword docs

Co-authored-by: Sukriti Sharma <[email protected]>
Signed-off-by: Anh Uong <[email protected]>

---------

Signed-off-by: Anh-Uong <[email protected]>
Signed-off-by: Anh Uong <[email protected]>
Co-authored-by: Sukriti Sharma <[email protected]>
2 people authored and jbusche committed Apr 9, 2024
1 parent abc10af commit 0c6bf43
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions build/README.md
@@ -61,7 +61,7 @@ For example, the below config is used for running with two GPUs and FSDP for fin
}
```

Users should always set `num_processes` to be explicit about the number of processes to run tuning on. When `num_processes` is greater than 1, the [FSDP config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/fixtures/accelerate_fsdp_defaults.yaml) is used by default. You can also set your own default values by specifying your own config file using key `config_file`. Any of these values in configs can be overwritten by passing in flags via `accelerate_launch_args` in the JSON config.
Users should always set `num_processes` to be explicit about the number of processes to run tuning on. When `num_processes` is greater than 1, the [FSDP config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/fixtures/accelerate_fsdp_defaults.yaml) is used by default. Thus in the above example, you don't need to pass in the FSDP flags since they match the ones used in the default FSDP config. You can also set your own default values by specifying your own config file using key `config_file`. Any of these values in configs can be overwritten by passing in flags via `accelerate_launch_args` in the JSON config.

Note that `num_processes`, which is the total number of processes to be launched in parallel, should match the number of GPUs to run on. The number of GPUs used can also be set via the environment variable `CUDA_VISIBLE_DEVICES`. If `num_processes=1`, the script will assume single-GPU.
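
For illustration only, here is a minimal sketch of how these keys might be combined in the JSON config. It assumes `num_processes` and `config_file` are nested under `accelerate_launch_args` as described above; the other keys and paths (`model_name_or_path`, `output_dir`, the config file location) are hypothetical placeholders, not values prescribed by this README:

```json
{
    "accelerate_launch_args": {
        "num_processes": 2,
        "config_file": "/data/config/my_accelerate_config.yaml"
    },
    "model_name_or_path": "/data/models/base-model",
    "output_dir": "/data/output/tuning-run"
}
```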

@@ -141,8 +141,12 @@ containers:
resources:
  limits:
    nvidia.com/gpu: "2"
    memory: 200Gi
    cpu: "10"
    ephemeral-storage: 2Ti
  requests:
    nvidia.com/gpu: "2"
    memory: 80Gi
    cpu: "5"
volumeMounts:
  - mountPath: /data/input
    name: input-data
@@ -163,3 +167,5 @@ volumes:
    configMap:
      name: sft-trainer-config
```
The above kube resource values are not hard requirements; however, they have proven useful when running some models (such as the LLaMa-13b model). If ephemeral storage is not defined, you will likely hit the error `The node was low on resource: ephemeral-storage. Container was using 1498072868Ki, which exceeds its request of 0.`, where the pod runs low on storage while tuning the model.
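
As a non-authoritative sketch, this is where an ephemeral-storage request could sit alongside the limit shown above; the `1Ti` request is illustrative, not a value prescribed by this README:

```yaml
resources:
  limits:
    ephemeral-storage: 2Ti   # cap on scratch disk the container may consume
  requests:
    ephemeral-storage: 1Ti   # illustrative request so the scheduler accounts for scratch space
```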
