Itwinai jlab Docker image (#236)
* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLflow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolint config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests
matbun authored Nov 14, 2024
1 parent 86f536f commit 5a710ed
Showing 38 changed files with 1,523 additions and 511 deletions.
6 changes: 6 additions & 0 deletions .github/linters/.hadolint.yaml
@@ -0,0 +1,6 @@
failure-threshold: warning
ignored:
- DL3008 # Pin versions in apt get install.
- DL3013 # Pin versions in pip. TODO: remove.
- DL4001 # Either use Wget or Curl but not both
- DL3003 # Use WORKDIR to switch to a directory
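This config is typically consumed by pointing hadolint at it when linting a Dockerfile. A minimal sketch of such an invocation, assuming hadolint is installed locally (the target Dockerfile path is illustrative):

```shell
# Lint a Dockerfile with the repo-level hadolint config.
# Rules listed under "ignored" (DL3008, DL3013, DL4001, DL3003) are skipped,
# and any remaining finding at "warning" severity or above fails the run.
hadolint --config .github/linters/.hadolint.yaml env-files/torch/Dockerfile
```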
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yml
@@ -34,5 +34,5 @@ jobs:
# Default environment names are ".venv-pytorch" and ".venv-tf"
- name: Run pytest for workflows
shell: bash -l {0}
run: .venv-pytorch/bin/pytest -v ./tests/ -m "not slurm"
run: .venv-pytorch/bin/pytest -v ./tests/ -m "not hpc"

58 changes: 14 additions & 44 deletions README.md
@@ -59,8 +59,8 @@ environment for PyTorch:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

##### TensorFlow environment
@@ -80,8 +80,8 @@ environment for TensorFlow:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

### Install itwinai for users
@@ -227,8 +227,8 @@ Commands to be executed before activating the python virtual environment:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

- When not on an HPC: do nothing.
@@ -261,8 +261,8 @@ Commands to be executed before activating the python virtual environment:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

- When not on an HPC: do nothing.
@@ -346,14 +346,16 @@ For example, in `ghcr.io/intertwin-eu/itwinai:0.2.2-torch2.6-jammy`:
The `TAG` follows the convention:

```text
X.Y.Z-[torch|tf]x.y-distro
[jlab-]X.Y.Z-(torch|tf)x.y-distro
```

Where:

- `X.Y.Z` is the **itwinai version**
- `(torch|tf)` is an exclusive OR between "torch" and "tf". You can pick one or the other, but not both.
- `x.y` is the **version of the ML framework** (e.g., PyTorch or TensorFlow)
- `distro` is the OS distro in the container (e.g., Ubuntu Jammy)
- `jlab-` is prepended to the tag of images including JupyterLab
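The convention above is regular enough to check mechanically. A minimal sketch, assuming the pattern described by the bullets (the regex and helper below are illustrative, not part of the repository):

```python
import re

# [jlab-]X.Y.Z-(torch|tf)x.y-distro  (hypothetical validator, not from the repo)
TAG_RE = re.compile(
    r"^(jlab-)?"            # optional prefix for images bundling JupyterLab
    r"\d+\.\d+\.\d+-"       # itwinai version X.Y.Z
    r"(torch|tf)\d+\.\d+-"  # exactly one ML framework and its version x.y
    r"[a-z]+$"              # OS distro codename, e.g. "jammy"
)

def is_valid_tag(tag: str) -> bool:
    """Return True if the tag follows the documented naming convention."""
    return TAG_RE.match(tag) is not None

print(is_valid_tag("0.2.2-torch2.6-jammy"))     # True
print(is_valid_tag("jlab-0.2.2-tf2.16-jammy"))  # True
print(is_valid_tag("0.2.2-torchtf2.6-jammy"))   # False: torch XOR tf
```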

### Image Names and Their Purpose

@@ -362,42 +364,10 @@ We use different image names to group similar images under the same namespace:
- **`itwinai`**: Production images. These should be well-maintained and orderly.
- **`itwinai-dev`**: Development images. Tags can vary, and may include random
hashes.
- **`itwinai-cvmfs`**: Images that need to be made available through CVMFS.
- **`itwinai-cvmfs`**: Images that need to be made available through CVMFS via
[Unpacker](https://gitlab.cern.ch/unpacked/sync).

> [!WARNING]
> It is very important to keep the number of tags for `itwinai-cvmfs` as low
> as possible. Tags should only be created under this namespace when strictly
> necessary. Otherwise, this could cause issues for the converter.

<!--
### Micromamba installation (deprecated)

To manage Conda environments we use micromamba, a light weight version of conda.

It is suggested to refer to the
[Manual installation guide](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#manual-installation).

Consider that Micromamba can eat a lot of space when building environments because packages are cached on
the local filesystem after being downloaded. To clear cache you can use `micromamba clean -a`.
Micromamba data are kept under the `$HOME` location. However, in some systems, `$HOME` has a limited storage
space and it would be cleverer to install Micromamba in another location with more storage space.
Thus by changing the `$MAMBA_ROOT_PREFIX` variable. See a complete installation example for Linux below, where the
default `$MAMBA_ROOT_PREFIX` is overridden:

```bash
cd $HOME
# Download micromamba (This command is for Linux Intel (x86_64) systems. Find the right one for your system!)
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
# Install micromamba in a custom directory
MAMBA_ROOT_PREFIX='my-mamba-root'
./bin/micromamba shell init $MAMBA_ROOT_PREFIX
# To invoke micromamba from Makefile, you need to add explicitly to $PATH
echo 'PATH="$(dirname $MAMBA_EXE):$PATH"' >> ~/.bashrc
```

**Reference**: [Micromamba installation guide](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html).

-->
> necessary. Otherwise, this could cause issues for the Unpacker.
45 changes: 30 additions & 15 deletions env-files/tensorflow/Dockerfile
@@ -6,24 +6,39 @@ ARG IMG_TAG=24.08-tf2-py3

FROM nvcr.io/nvidia/tensorflow:${IMG_TAG}

WORKDIR /usr/src/app
# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

LABEL org.opencontainers.image.source="https://github.com/interTwin-eu/itwinai"
LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

RUN apt-get update && apt-get install -y \
# Needed by Prov4ML/yProvML to generate provenance graph
dot2tex \
&& apt-get clean -y && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir --upgrade pip \
&& pip install --no-cache-dir \
tf_keras==2.16.* \
"prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" \
ray[tune]

# Install itwinai
COPY pyproject.toml ./
COPY src ./
COPY env-files/tensorflow/create_container_env.sh ./
RUN bash create_container_env.sh
COPY pyproject.toml pyproject.toml
COPY src src
RUN pip install --no-cache-dir . \
&& itwinai sanity-check --tensorflow --optional-deps ray

# Create non-root user
RUN groupadd -g 10001 jovyan \
&& useradd -m -u 10000 -g jovyan jovyan \
&& chown -R jovyan:jovyan /usr/src/app
USER jovyan:jovyan
# Additional pip deps
ARG REQUIREMENTS=env-files/torch/requirements/requirements.txt
COPY "${REQUIREMENTS}" additional-requirements.txt
RUN pip install --no-cache-dir -r additional-requirements.txt

# ENTRYPOINT [ "/bin/sh" ]
# CMD [ ]

LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"
19 changes: 0 additions & 19 deletions env-files/tensorflow/create_container_env.sh

This file was deleted.

4 changes: 2 additions & 2 deletions env-files/tensorflow/generic_tf.sh
@@ -88,9 +88,9 @@ pip3 install --no-cache-dir tf_keras==2.16.*

# Install Pov4ML
if [[ "$OSTYPE" =~ ^darwin ]] ; then
pip install "prov4ml[apple]@git+https://github.com/matbun/ProvML@main" || exit 1
pip install "prov4ml[apple,nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
else
pip install "prov4ml[linux]@git+https://github.com/matbun/ProvML@main" || exit 1
pip install "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
fi

# Install itwinai: MUST be last line of the script for the user installation script to work!
84 changes: 64 additions & 20 deletions env-files/torch/Dockerfile
@@ -1,31 +1,75 @@
ARG IMG_TAG=23.09-py3
ARG IMG_TAG=24.05-py3
# ARG IMG_TAG=23.09-py3

# 23.09-py3: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-09.html
# 24.04-py3: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-04.html

FROM nvcr.io/nvidia/pytorch:${IMG_TAG}

# https://stackoverflow.com/a/56748289
ARG IMG_TAG
LABEL org.opencontainers.image.source="https://github.com/interTwin-eu/itwinai"
LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"

# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

WORKDIR /usr/src/app
RUN apt-get update && apt-get install -y \
# Needed by Prov4ML/yProvML to generate provenance graph
dot2tex \
&& apt-get clean -y && rm -rf /var/lib/apt/lists/*

# https://github.com/mpi4py/mpi4py/pull/431
RUN env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py
RUN pip install --no-cache-dir --upgrade pip \
&& env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py

# DeepSpeed, Horovod and other deps
ENV HOROVOD_WITH_PYTORCH=1 \
HOROVOD_WITHOUT_TENSORFLOW=1 \
HOROVOD_WITHOUT_MXNET=1 \
CMAKE_CXX_STANDARD=17 \
HOROVOD_MPI_THREADS_DISABLE=1 \
HOROVOD_CPU_OPERATIONS=MPI \
HOROVOD_GPU_ALLREDUCE=NCCL \
HOROVOD_NCCL_LINK=SHARED \
# DeepSpeed
# DS_BUILD_CCL_COMM=1 \
DS_BUILD_UTILS=1 \
DS_BUILD_AIO=1 \
DS_BUILD_FUSED_ADAM=1 \
DS_BUILD_FUSED_LAMB=1 \
DS_BUILD_TRANSFORMER=1 \
DS_BUILD_STOCHASTIC_TRANSFORMER=1 \
DS_BUILD_TRANSFORMER_INFERENCE=1
# Torch: reuse the global torch in the container
RUN CONTAINER_TORCH_VERSION="$(python -c 'import torch;print(torch.__version__)')" \
&& pip install --no-cache-dir torch=="$CONTAINER_TORCH_VERSION" \
deepspeed==0.15.* \
git+https://github.com/horovod/horovod.git@3a31d93 \
"prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" \
ray[tune] \
# fix .triton/autotune/Fp16Matmul_2d_kernel.pickle bug
&& pver="$(python --version 2>&1 | awk '{print $2}' | cut -f1-2 -d.)" \
&& line=$(cat -n "/usr/local/lib/python${pver}/dist-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py" | grep os.rename | awk '{print $1}' | head -n 1) \
&& sed -i "${line}s|^|#|" "/usr/local/lib/python${pver}/dist-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py"

# Install itwinai
COPY pyproject.toml ./
COPY src ./
COPY env-files/torch/create_container_env.sh ./
RUN bash create_container_env.sh ${IMG_TAG}

# Create non-root user
RUN groupadd -g 10001 jovyan \
&& useradd -m -u 10000 -g jovyan jovyan \
&& chown -R jovyan:jovyan /usr/src/app
USER jovyan:jovyan

LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"
COPY pyproject.toml pyproject.toml
COPY src src
# Torch: reuse the global torch in the container
RUN CONTAINER_TORCH_VERSION="$(python -c 'import torch;print(torch.__version__)')" \
&& pip install --no-cache-dir torch=="$CONTAINER_TORCH_VERSION" \
.[torch] \
&& itwinai sanity-check --torch \
--optional-deps deepspeed \
--optional-deps horovod \
--optional-deps ray

# Additional pip deps
ARG REQUIREMENTS=env-files/torch/requirements/requirements.txt
COPY "${REQUIREMENTS}" additional-requirements.txt
RUN pip install --no-cache-dir -r additional-requirements.txt
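Given the `IMG_TAG` and `REQUIREMENTS` build args shown in this Dockerfile, a build might be invoked as follows. This is a sketch under stated assumptions: the registry, itwinai version, and resulting tag are illustrative and not taken from the commit.

```shell
# Build the torch base image from the repo root (example values only).
docker build \
  --build-arg IMG_TAG=24.05-py3 \
  --build-arg REQUIREMENTS=env-files/torch/requirements/requirements.txt \
  -f env-files/torch/Dockerfile \
  -t ghcr.io/intertwin-eu/itwinai:0.2.2-torch2.4-jammy .
```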
1 change: 1 addition & 0 deletions env-files/torch/createEnvVega.sh
@@ -19,6 +19,7 @@ ml GCCcore/11.3.0
#ml NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
ml NCCL
ml cuDNN/8.9.7.29-CUDA-12.3.0
ml UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0

# You should have CUDA 12.3 now

73 changes: 0 additions & 73 deletions env-files/torch/create_container_env.sh

This file was deleted.

