build: use poetry for reproducible virtual environments #209

Draft
wants to merge 45 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
6a8c5b0
feat: update pyproject.toml to use poetry
VassilisVassiliadis Jun 10, 2024
4aba398
fix: add the tuning package
VassilisVassiliadis Jun 11, 2024
97349ca
fix: make group dependencies optional
VassilisVassiliadis Jun 11, 2024
c3a1cf5
build: update poetry.lock contents
VassilisVassiliadis Jun 11, 2024
6638038
build: update the Dockerfile to use the poetry lock file
VassilisVassiliadis Jun 11, 2024
54163fc
feat: use poetry-dynamic-versioning
VassilisVassiliadis Jun 12, 2024
b404f8c
build: update aim version to 3.22.0 and trl to 0.8.6
VassilisVassiliadis Jun 21, 2024
3655a49
build: use poetry when running tests and building the wheel
VassilisVassiliadis Jun 21, 2024
fbacdd4
docs: document how to install the repository using poetry
VassilisVassiliadis Jun 21, 2024
51fea97
build: deal with git error regarding dubious ownership
VassilisVassiliadis Jun 21, 2024
4903eee
build: update flash-attn constraint to ^2.5.6
VassilisVassiliadis Jun 21, 2024
dfc818c
fix: add simpleeval to python dependencies
VassilisVassiliadis Jun 21, 2024
19c7acc
fix: support 3.9 to 3.11
VassilisVassiliadis Jun 21, 2024
8baa00a
fix: fms_acceleration dependency
VassilisVassiliadis Jun 21, 2024
13b1893
fix: install fms-hf-tuning with poetry before running pytest
VassilisVassiliadis Jun 21, 2024
1a31ba2
refactor: rename build python file and tests as launcher
VassilisVassiliadis Jun 23, 2024
72a92bc
feat: use gen_train_args() method to generate train args and use eval…
VassilisVassiliadis Jun 23, 2024
b8e4443
fix: logging.error doesn't exist in transformers.utils.logging (v4.39.3)
VassilisVassiliadis Jun 23, 2024
f40a21b
feat: add missing dependencies to dev group
VassilisVassiliadis Jun 23, 2024
2313ebc
build: update tox to use poetry
VassilisVassiliadis Jun 23, 2024
f0fc824
fix: add the missing launcher scripts for the python package
VassilisVassiliadis Jun 23, 2024
0af2888
fix: tox lint and tox coverage
VassilisVassiliadis Jun 24, 2024
fc5e362
fix: tox lint and tox coverage
VassilisVassiliadis Jun 24, 2024
1a719b2
build: point fms_acceleration dependency to its last known commit 40a…
VassilisVassiliadis Jun 24, 2024
0d71c23
build: remove the pinned version of fms_acceleration
VassilisVassiliadis Jun 24, 2024
6a46d26
build: update the lock file after modifying pyproject.toml
VassilisVassiliadis Jun 24, 2024
771e2f9
refactor: move launcher scripts back into build
VassilisVassiliadis Jun 25, 2024
1359e6f
build: remove `build` from the `dev` optional dependencies group
VassilisVassiliadis Jun 25, 2024
865bd6c
build: update tox not install poetry in the virtual environment it tests
VassilisVassiliadis Jun 25, 2024
db5832e
build: update CI/CD to install poetry in --user site
VassilisVassiliadis Jun 25, 2024
9739f90
build: install poetry for build-and-publish workflow
VassilisVassiliadis Jun 25, 2024
21a1e5c
docs: document how to build a dev environment with poetry and tox
VassilisVassiliadis Jun 25, 2024
109e478
docs: how to install optional dependency groups
VassilisVassiliadis Jun 25, 2024
3aa293a
docs: fix broken link to fms-acceleration docs
VassilisVassiliadis Jun 25, 2024
ecc62bb
docs: fix typo
VassilisVassiliadis Jun 26, 2024
c3321c1
refactor: update test_run_with_additional_callbacks() unit-test
VassilisVassiliadis Jun 26, 2024
e8ff81b
fix: use optional `extra` dependencies instead of `groups`
VassilisVassiliadis Jun 27, 2024
53025ee
fix: update dockerfile to install poetry in an isolated environment
VassilisVassiliadis Jun 27, 2024
2158078
fix: replace "poetry install --with" with "--extras" in tox.ini
VassilisVassiliadis Jul 2, 2024
896da24
fix: use poetry install --extras dev --no-root" for the fmt tox envir…
VassilisVassiliadis Jul 2, 2024
4a3316e
chore: upgrade trl version to ">=0.9.3,<1.0"
VassilisVassiliadis Jul 2, 2024
ca9ab4e
chore: remove fms-accel extra group
VassilisVassiliadis Jul 2, 2024
a7e2bbb
build: fix the dockerfile by installing/uninstalling wheel and build
VassilisVassiliadis Jul 2, 2024
5207283
Update CONTRIBUTING.md
Ssukriti Jul 3, 2024
0828eb9
Update README.md
Ssukriti Jul 3, 2024
2 changes: 2 additions & 0 deletions .github/workflows/build-and-publish.yaml
@@ -43,6 +43,8 @@ jobs:
run: |
python -m pip install --upgrade pip
python -m pip install tox
python -m pip install poetry --user
fabianlim (Collaborator), Jul 4, 2024:
Why is there a need to install this with --user? If you can install poetry the same way as tox, that obviates the need for the PATH setting below.

If the aim is isolation, then I don't think installing in user space achieves it, since any further pip install commands will still find the poetry in the user site and try to update it?

Collaborator:

There are quite a few instances of these in other workflows as well.

Collaborator:
tox always creates its own isolated venv. Agree it would probably work without --user.

Contributor (author):
Good catch. tox and poetry should go in the same virtual environment, and that environment should not be the one we install fms-hf-tuning in. Because tox creates a new virtual environment for each of the environments it processes (e.g. `py`, `fmt`, etc.), poetry and tox will be outside the tested virtual environment.

We don't need to run tox "inside" one of the virtual environments that tox creates, but we do need to run poetry. Here's how I chose to do that.

First, I installed poetry under the user pip install directory (which by default on Linux is ~/.local/). Then I updated the file that $GITHUB_ENV points to so that, in subsequent steps, the $PATH environment variable contains the path ~/.local/bin.

This ensures that the poetry commands running inside tox can use the poetry executable. Alternatively, we can create a new virtual environment under a directory of our choosing, e.g. /tmp/isolated; in there, we install just poetry and then update $GITHUB_ENV so that $PATH includes /tmp/isolated/bin.
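A minimal sketch of that alternative (the /tmp/isolated path is illustrative, not part of this PR):

```bash
# Hypothetical alternative: install poetry into its own venv instead of the --user site,
# then expose it to later workflow steps via $GITHUB_ENV.
python -m venv /tmp/isolated
/tmp/isolated/bin/pip install poetry
echo "PATH=$PATH:/tmp/isolated/bin" >> "$GITHUB_ENV"
```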

echo "PATH=$PATH:~/.local/bin" >> "$GITHUB_ENV"
- name: Build and test with tox
run: tox -e ${{ matrix.python-version.tox }}
- name: Build and check wheel package
2 changes: 2 additions & 0 deletions .github/workflows/coverage.yaml
@@ -18,5 +18,7 @@ jobs:
run: |
python -m pip install --upgrade pip
python -m pip install tox
python -m pip install poetry --user
echo "PATH=$PATH:~/.local/bin" >> "$GITHUB_ENV"
- name: Check Coverage
run: tox -e coverage
2 changes: 2 additions & 0 deletions .github/workflows/format.yml
@@ -33,6 +33,8 @@ jobs:
run: |
python -m pip install --upgrade pip
python -m pip install tox
python -m pip install poetry --user
echo "PATH=$PATH:~/.local/bin" >> "$GITHUB_ENV"
- name: Check formatting
run: tox -e fmt
- name: Run pylint
2 changes: 2 additions & 0 deletions .github/workflows/test.yaml
@@ -23,5 +23,7 @@ jobs:
run: |
python -m pip install --upgrade pip
python -m pip install tox
python -m pip install poetry --user
echo "PATH=$PATH:~/.local/bin" >> "$GITHUB_ENV"
- name: Run unit tests
run: tox -e py
41 changes: 38 additions & 3 deletions CONTRIBUTING.md
@@ -42,6 +42,8 @@ If additional new Python module dependencies are required, think about where to
- If they're optional dependencies for additional functionality, then put them in the pyproject.toml file, as was done for [flash-attn](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/pyproject.toml#L44) or [aim](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/pyproject.toml#L45).
- If it's an additional dependency for development, then add it to the [dev](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/pyproject.toml#L43) dependencies.

If using `poetry` to install dependencies, you can optionally leverage [poetry add](https://python-poetry.org/docs/cli/#add) to add new dependencies to pyproject.toml.
Collaborator:
I feel this is a bit confusing. If we have a poetry flow and there is a new dep, shouldn't we have some guidance on when it has to be added to the lock file?

Also, some guidance to say that for optional deps it should not be added.

Collaborator:

True, maybe we can continue adding to pyproject.toml, like we do now, and then we have to run poetry update? I can change it to that instead.

Contributor (author):

There are two ways to add a new dependency. One is to use poetry, i.e. `poetry add`; the other is to manually edit the pyproject.toml file and then run `poetry update`, which ensures poetry.lock is consistent with pyproject.toml (that's basically what `poetry add` does on your behalf).

If you opt for the second route but forget to run `poetry update`, then the next time you run `poetry install` (or the CI/CD runs it) you'll get an error saying that poetry.lock and pyproject.toml are inconsistent and that you should run `poetry update`.
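A short sketch of the two routes (the `simpleeval` package is only an illustrative example here):

```bash
# Route 1: poetry edits pyproject.toml and re-resolves poetry.lock for you
poetry add simpleeval

# Route 2: edit pyproject.toml by hand, then re-sync the lock file yourself
poetry update
```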


#### Code Review

Once you've [created a pull request](#how-can-i-contribute), maintainers will review your code and may make suggestions to fix before merging. It will be easier for your pull request to receive reviews if you consider the criteria the reviewers follow while working. Remember to:
@@ -82,12 +84,45 @@ The following tools are required:
- [git](https://git-scm.com)
- [python](https://www.python.org) (v3.8+)
- [pip](https://pypi.org/project/pip/) (v23.0+)
- [poetry](https://python-poetry.org/docs/#installation) (v1.8.3+)
- Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. It should in no case be installed in the environment of the project that is to be managed by Poetry. This ensures that Poetry’s own dependencies will not be accidentally upgraded or uninstalled.
- [tox](https://tox.wiki/en/4.15.1/installation.html) (v4.15.1+)
  - Just like `poetry`, install `tox` in an isolated virtual environment

Installation:
```
pip install -U datasets
pip install -e .
```

```bash
: Install poetry and tox in an isolated virtual environment
python3 -m venv isolated
./isolated/bin/pip install -U pip setuptools
./isolated/bin/pip install poetry tox

: Ensure you can access poetry and tox without activating
: the isolated virtual environment
export PATH=$PATH:`pwd`/isolated/bin

: Create your development virtual environment
python3 -m venv venv
. venv/bin/activate

: Install a dev version (similar to pip install -e ".[dev]") of fms-hf-tuning
poetry install --extras dev
```


> Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), then you need to install these requirements:

```
poetry install --extras dev,flash-attn
```

If you wish to use [aim](https://github.com/aimhubio/aim), then you need to install it:
```
poetry install --extras aim
```

If you wish to use [fms-acceleration](https://github.com/foundation-model-stack/fms-acceleration) follow the instructions in [this section of README.md](README.md#fms-acceleration).

Ssukriti (Collaborator), Jul 3, 2024, suggested change:
Alternatively, you could continue with `pip install -e .` if you do not wish to leverage the lock file and have environment constraints.

<details>
<summary>Linting</summary>

11 changes: 9 additions & 2 deletions README.md
@@ -7,6 +7,7 @@ This repo provides basic tuning scripts with support for specific models. The re

## Installation

### 1. Install from wheel
```
pip install fms-hf-tuning
```
@@ -27,7 +28,13 @@ If you wish to use [fms-acceleration](https://github.com/foundation-model-stack/
```
pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
```
`fms-acceleration` is a collection of plugins that packages that accelerate fine-tuning / training of large models, as part of the `fms-hf-tuning` suite. For more details on see [this section below](#fms-acceleration).
`fms-acceleration` is a collection of plugins that packages that accelerate fine-tuning / training of large models, as part of the `fms-hf-tuning` suite. For more details see [this section below](#fms-acceleration).

### 2. Build from source

We have committed a `poetry.lock` file to allow reproducible environments. If building from source, you can clone the repository and use poetry to install, as described in the [development docs](/CONTRIBUTING.md#development).

If building in a Dockerfile, you can use `poetry export --format requirements.txt` to generate a pinned requirements file from the lock file and install the same dependencies from it. Maintainers regularly update the lock file.
Collaborator:
This line is not very clear. This will not do any installation, correct? It's just an export of the locked deps to a requirements file?

Collaborator:
Yes, it's just to export the requirements from the lock file, and then we have to install. Hmm, should we just link our Dockerfile as an example?

Follow-up discussion, Jul 8, 2024:
@VassilisVassiliadis there is a PR to add back the [fms-accel] dep. It was removed before with the intention of being added back: #223

@fabianlim I'll rebase the PR once #223 makes its way into the main branch.

Why will poetry constrain packages under optional dependencies? That makes no sense.

I'm actually not sure. I assume they do this so that `poetry install` and `poetry install -E fms-accel` both end up installing the same non-optional dependencies (in our case, transformers).
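A minimal sketch of the export-then-install flow mentioned above (the output path is illustrative; `poetry export` relies on the poetry-plugin-export plugin, which the Dockerfile below installs alongside poetry):

```bash
# Export pinned dependencies from poetry.lock into a requirements file,
# then install them with plain pip so poetry is not needed afterwards.
poetry export --format requirements.txt --output /tmp/requirements.txt
pip install --requirement /tmp/requirements.txt
# Finally install the project itself without re-resolving its dependencies
pip install --no-deps .
```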


## Data format
We support two data formats:
@@ -385,7 +392,7 @@ Equally you can pass in a JSON configuration for running tuning. See [build doc]

### FMS Acceleration

`fms-acceleration` is fuss-free approach to access a curated collection of acceleration plugins that acclerate your `tuning/sft-trainer.py` experience. Accelerations that apply to a variety of use-cases, e.g., PeFT / full-finetuning, are being planned for. As such, the accelerations are grouped into *plugins*; only install the plugins needed for the acceleration of interest. The plugins are housed in the [seperate repository found here](https://github.com/foundation-model-stack/fms-acceleration).
`fms-acceleration` is fuss-free approach to access a curated collection of acceleration plugins that accelerate your `tuning/sft-trainer.py` experience. Accelerations that apply to a variety of use-cases, e.g., PeFT / full-finetuning, are being planned for. As such, the accelerations are grouped into *plugins*; only install the plugins needed for the acceleration of interest. The plugins are housed in the [separate repository found here](https://github.com/foundation-model-stack/fms-acceleration).

To access `fms-acceleration` features the `[fms-accel]` dependency must first be installed:
```
53 changes: 34 additions & 19 deletions build/Dockerfile
@@ -110,29 +110,44 @@ RUN dnf install -y git && \
rm -f /usr/share/doc/perl-Net-SSLeay/examples/server_key.pem && \
dnf clean all
USER ${USER}
WORKDIR /tmp
# Ensure that git directory is owned by current user, otherwise git raises
# "fatal: detected dubious ownership" for `/tmp`
WORKDIR /tmp/fms-hf-tuning

# Install poetry and its dependencies inside an isolated virtual environment which we
# will not copy into the release-base layer
RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
python -m pip install --user build
COPY --chown=${USER}:root tuning tuning
COPY .git .git
COPY pyproject.toml pyproject.toml
python -m venv venv /tmp/isolated && \
/tmp/isolated/bin/pip install poetry poetry-plugin-export

# Build a wheel if PyPi wheel_version is empty else download the wheel from PyPi
RUN if [[ -z "${WHEEL_VERSION}" ]]; \
then python -m build --wheel --outdir /tmp; \
else pip download fms-hf-tuning==${WHEEL_VERSION} --dest /tmp --only-binary=:all: --no-deps; \
fi && \
ls /tmp/*.whl >/tmp/bdist_name
COPY --chown=${USER}:root tuning tuning
COPY --chown=${USER}:root .git .git
COPY --chown=${USER}:root pyproject.toml pyproject.toml
COPY --chown=${USER}:root poetry.lock poetry.lock
COPY README.md README.md

# Install from the wheel
# Install using poetry if PyPi wheel_version is empty else download the wheel from PyPi
Collaborator, suggested change:
# Install using poetry if PyPi wheel_version is empty else download the wheel from PyPi
# Install using poetry if PyPi wheel_version is empty else download the wheel from PyPi
# If creating your own Dockerfile, we suggest using poetry export for a reproducible environment

RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
python -m pip install --user wheel && \
python -m pip install --user "$(head bdist_name)" && \
python -m pip install --user "$(head bdist_name)[flash-attn]" && \
# Clean up the wheel module. It's only needed by flash-attn install
python -m pip uninstall wheel build -y && \
# Cleanup the bdist whl file
rm $(head bdist_name) /tmp/bdist_name
if [[ -z "${WHEEL_VERSION}" ]]; then \
# Extract requirements from poetry and install them in ~/.local \
# Need wheel and build for the flash-attn package \
python -m pip install --user wheel build && \
python -m pip install --user --requirement <(/tmp/isolated/bin/poetry export --format requirements.txt) && \
# Next install the package with flash-attn \
python -m pip install --user ".[flash-attn]" && \
python -m pip uninstall wheel build -y ; \
else \
# This will use whatever dependencies versions satisfy the pyproject.toml constraints \
# but they won't necessarily be the exact same versions as present in poetry.lock \
# First, install fms-hf-tuning to get its dependencies which include torch. \
# Then install with the flash-attn extras as the latter expects torch to be present \
python -m pip install --user wheel build && \
python -m pip install --user "fms-hf-tuning==${WHEEL_VERSION}" && \
python -m pip install --user "fms-hf-tuning[flash-attn]==${WHEEL_VERSION}" && \
python -m pip uninstall wheel build -y ; \
fi

RUN python -m pip freeze

## Final image ################################################
FROM release-base as release
53 changes: 32 additions & 21 deletions build/accelerate_launch.py
@@ -18,30 +18,33 @@
"""

# Standard
import os
from pathlib import Path
import logging
import os
import shutil
import subprocess
import sys
import traceback
import tempfile
import shutil
from pathlib import Path
import traceback

# Third Party
from accelerate.commands.launch import launch_command
import torch.distributed.elastic.multiprocessing.errors

# Local
# First Party
from build.utils import (
get_highest_checkpoint,
process_accelerate_launch_args,
serialize_args,
get_highest_checkpoint,
)
from tuning.utils.config_utils import get_json_config

# Local
from tuning.config.tracker_configs import FileLoggingTrackerConfig
from tuning.utils.config_utils import get_json_config
from tuning.utils.error_logging import (
write_termination_log,
USER_ERROR_EXIT_CODE,
INTERNAL_ERROR_EXIT_CODE,
USER_ERROR_EXIT_CODE,
write_termination_log,
)

ERROR_LOG = "/dev/termination-log"
@@ -89,6 +92,20 @@ def main():
# Launch training
#
##########

def handle_sft_trainer_exit_error(return_code):
# If the subprocess throws an exception, the base exception is hidden in the
# subprocess call and is difficult to access at this level. However, that is not
# an issue because sft_trainer.py would have already written the exception
# message to termination log.
logging.error(traceback.format_exc())
# The exit code that sft_trainer.py threw is captured in e.returncode

if return_code not in [INTERNAL_ERROR_EXIT_CODE, USER_ERROR_EXIT_CODE]:
return_code = INTERNAL_ERROR_EXIT_CODE
write_termination_log(f"Unhandled exception during training. {e}")
sys.exit(return_code)

original_output_dir = job_config.get("output_dir")
with tempfile.TemporaryDirectory() as tempdir:
try:
@@ -98,19 +115,13 @@ def main():
os.environ["SFT_TRAINER_CONFIG_JSON_ENV_VAR"] = updated_args

launch_command(args)
except torch.distributed.elastic.multiprocessing.errors.ChildFailedError as e:
# This is what accelerate.commands.launch.multi_gpu_launcher() raises
# (when using >1 GPUs)
handle_sft_trainer_exit_error(e.get_first_failure()[1].exitcode)
except subprocess.CalledProcessError as e:
# If the subprocess throws an exception, the base exception is hidden in the
# subprocess call and is difficult to access at this level. However, that is not
# an issue because sft_trainer.py would have already written the exception
# message to termination log.
logging.error(traceback.format_exc())
# The exit code that sft_trainer.py threw is captured in e.returncode

return_code = e.returncode
if return_code not in [INTERNAL_ERROR_EXIT_CODE, USER_ERROR_EXIT_CODE]:
return_code = INTERNAL_ERROR_EXIT_CODE
write_termination_log(f"Unhandled exception during training. {e}")
sys.exit(return_code)
# This is what accelerate.commands.launch.simple_launcher() raises
handle_sft_trainer_exit_error(e.returncode)
except Exception as e: # pylint: disable=broad-except
logging.error(traceback.format_exc())
write_termination_log(f"Unhandled exception during training. {e}")
6 changes: 3 additions & 3 deletions build/utils.py
@@ -13,14 +13,14 @@
# limitations under the License.

# Standard
import os
import base64
import logging
import os
import pickle
import base64

# Third Party
import torch
from accelerate.commands.launch import launch_command_parser
import torch


def get_highest_checkpoint(dir_path):