Itwinai jlab Docker image (#236)
* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLflow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolint config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests
matbun authored Nov 14, 2024
1 parent 86f536f commit 5a710ed
Showing 38 changed files with 1,523 additions and 511 deletions.
6 changes: 6 additions & 0 deletions .github/linters/.hadolint.yaml
@@ -0,0 +1,6 @@
failure-threshold: warning
ignored:
- DL3008 # Pin versions in apt get install.
- DL3013 # Pin versions in pip. TODO: remove.
- DL4001 # Either use Wget or Curl but not both
- DL3003 # Use WORKDIR to switch to a directory
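This config is typically consumed by pointing hadolint at it when linting a Dockerfile. A minimal sketch of such an invocation, assuming hadolint is installed locally (the target Dockerfile path is illustrative):

```shell
# Lint a Dockerfile with the repo-level hadolint config.
# Rules listed under "ignored" (DL3008, DL3013, DL4001, DL3003) are skipped,
# and any remaining finding at "warning" severity or above fails the run.
hadolint --config .github/linters/.hadolint.yaml env-files/torch/Dockerfile
```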
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yml
@@ -34,5 +34,5 @@ jobs:
# Default environment names are ".venv-pytorch" and ".venv-tf"
- name: Run pytest for workflows
shell: bash -l {0}
run: .venv-pytorch/bin/pytest -v ./tests/ -m "not slurm"
run: .venv-pytorch/bin/pytest -v ./tests/ -m "not hpc"

58 changes: 14 additions & 44 deletions README.md
@@ -59,8 +59,8 @@ environment for PyTorch:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

##### TensorFlow environment
@@ -80,8 +80,8 @@ environment for TensorFlow:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

### Install itwinai for users
@@ -227,8 +227,8 @@ Commands to be executed before activating the python virtual environment:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

- When not on an HPC: do nothing.
@@ -261,8 +261,8 @@ Commands to be executed before activating the python virtual environment:

```bash
ml --force purge
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/11.7
ml GCCcore/11.3.0 NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0 cuDNN
ml Python CMake/3.24.3-GCCcore-11.3.0 mpi4py OpenMPI CUDA/12.3
ml GCCcore/11.3.0 NCCL cuDNN/8.9.7.29-CUDA-12.3.0 UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0
```

- When not on an HPC: do nothing.
@@ -346,14 +346,16 @@ For example, in `ghcr.io/intertwin-eu/itwinai:0.2.2-torch2.6-jammy`:
The `TAG` follows the convention:

```text
X.Y.Z-[torch|tf]x.y-distro
[jlab-]X.Y.Z-(torch|tf)x.y-distro
```

Where:

- `X.Y.Z` is the **itwinai version**
- `(torch|tf)` is an exclusive OR between "torch" and "tf". You can pick one or the other, but not both.
- `x.y` is the **version of the ML framework** (e.g., PyTorch or TensorFlow)
- `distro` is the OS distro in the container (e.g., Ubuntu Jammy)
- `jlab-` is prepended to the tag of images including JupyterLab
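The convention above is regular enough to check mechanically. A minimal sketch, assuming the pattern described by the bullets (the regex and helper below are illustrative, not part of the repository):

```python
import re

# [jlab-]X.Y.Z-(torch|tf)x.y-distro  (hypothetical validator, not from the repo)
TAG_RE = re.compile(
    r"^(jlab-)?"            # optional prefix for images bundling JupyterLab
    r"\d+\.\d+\.\d+-"       # itwinai version X.Y.Z
    r"(torch|tf)\d+\.\d+-"  # exactly one ML framework and its version x.y
    r"[a-z]+$"              # OS distro codename, e.g. "jammy"
)

def is_valid_tag(tag: str) -> bool:
    """Return True if the tag follows the documented naming convention."""
    return TAG_RE.match(tag) is not None

print(is_valid_tag("0.2.2-torch2.6-jammy"))     # True
print(is_valid_tag("jlab-0.2.2-tf2.16-jammy"))  # True
print(is_valid_tag("0.2.2-torchtf2.6-jammy"))   # False: torch XOR tf
```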

### Image Names and Their Purpose

@@ -362,42 +364,10 @@ We use different image names to group similar images under the same namespace:
- **`itwinai`**: Production images. These should be well-maintained and orderly.
- **`itwinai-dev`**: Development images. Tags can vary, and may include random
hashes.
- **`itwinai-cvmfs`**: Images that need to be made available through CVMFS.
- **`itwinai-cvmfs`**: Images that need to be made available through CVMFS via
[Unpacker](https://gitlab.cern.ch/unpacked/sync).

> [!WARNING]
> It is very important to keep the number of tags for `itwinai-cvmfs` as low
> as possible. Tags should only be created under this namespace when strictly
> necessary. Otherwise, this could cause issues for the converter.

<!--
### Micromamba installation (deprecated)

To manage Conda environments we use micromamba, a light weight version of conda.

It is suggested to refer to the
[Manual installation guide](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#manual-installation).

Consider that Micromamba can eat a lot of space when building environments because packages are cached on
the local filesystem after being downloaded. To clear cache you can use `micromamba clean -a`.
Micromamba data are kept under the `$HOME` location. However, in some systems, `$HOME` has a limited storage
space and it would be cleverer to install Micromamba in another location with more storage space.
Thus by changing the `$MAMBA_ROOT_PREFIX` variable. See a complete installation example for Linux below, where the
default `$MAMBA_ROOT_PREFIX` is overridden:

```bash
cd $HOME
# Download micromamba (This command is for Linux Intel (x86_64) systems. Find the right one for your system!)
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
# Install micromamba in a custom directory
MAMBA_ROOT_PREFIX='my-mamba-root'
./bin/micromamba shell init $MAMBA_ROOT_PREFIX
# To invoke micromamba from Makefile, you need to add explicitly to $PATH
echo 'PATH="$(dirname $MAMBA_EXE):$PATH"' >> ~/.bashrc
```

**Reference**: [Micromamba installation guide](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html).

-->
> necessary. Otherwise, this could cause issues for the Unpacker.
45 changes: 30 additions & 15 deletions env-files/tensorflow/Dockerfile
@@ -6,24 +6,39 @@ ARG IMG_TAG=24.08-tf2-py3

FROM nvcr.io/nvidia/tensorflow:${IMG_TAG}

WORKDIR /usr/src/app
# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

LABEL org.opencontainers.image.source="https://github.com/interTwin-eu/itwinai"
LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

RUN apt-get update && apt-get install -y \
# Needed by Prov4ML/yProvML to generate provenance graph
dot2tex \
&& apt-get clean -y && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir --upgrade pip \
&& pip install --no-cache-dir \
tf_keras==2.16.* \
"prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" \
ray[tune]

# Install itwinai
COPY pyproject.toml ./
COPY src ./
COPY env-files/tensorflow/create_container_env.sh ./
RUN bash create_container_env.sh
COPY pyproject.toml pyproject.toml
COPY src src
RUN pip install --no-cache-dir . \
&& itwinai sanity-check --tensorflow --optional-deps ray

# Create non-root user
RUN groupadd -g 10001 jovyan \
&& useradd -m -u 10000 -g jovyan jovyan \
&& chown -R jovyan:jovyan /usr/src/app
USER jovyan:jovyan
# Additional pip deps
ARG REQUIREMENTS=env-files/torch/requirements/requirements.txt
COPY "${REQUIREMENTS}" additional-requirements.txt
RUN pip install --no-cache-dir -r additional-requirements.txt

# ENTRYPOINT [ "/bin/sh" ]
# CMD [ ]

LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"
19 changes: 0 additions & 19 deletions env-files/tensorflow/create_container_env.sh

This file was deleted.

4 changes: 2 additions & 2 deletions env-files/tensorflow/generic_tf.sh
@@ -88,9 +88,9 @@ pip3 install --no-cache-dir tf_keras==2.16.*

# Install Pov4ML
if [[ "$OSTYPE" =~ ^darwin ]] ; then
pip install "prov4ml[apple]@git+https://github.com/matbun/ProvML@main" || exit 1
pip install "prov4ml[apple,nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
else
pip install "prov4ml[linux]@git+https://github.com/matbun/ProvML@main" || exit 1
pip install "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
fi

# Install itwinai: MUST be last line of the script for the user installation script to work!
84 changes: 64 additions & 20 deletions env-files/torch/Dockerfile
@@ -1,31 +1,75 @@
ARG IMG_TAG=23.09-py3
ARG IMG_TAG=24.05-py3
# ARG IMG_TAG=23.09-py3

# 23.09-py3: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-09.html
# 24.04-py3: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-04.html

FROM nvcr.io/nvidia/pytorch:${IMG_TAG}

# https://stackoverflow.com/a/56748289
ARG IMG_TAG
LABEL org.opencontainers.image.source="https://github.com/interTwin-eu/itwinai"
LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"

# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

WORKDIR /usr/src/app
RUN apt-get update && apt-get install -y \
# Needed by Prov4ML/yProvML to generate provenance graph
dot2tex \
&& apt-get clean -y && rm -rf /var/lib/apt/lists/*

# https://github.com/mpi4py/mpi4py/pull/431
RUN env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py
RUN pip install --no-cache-dir --upgrade pip \
&& env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py

# DeepSpeed, Horovod and other deps
ENV HOROVOD_WITH_PYTORCH=1 \
HOROVOD_WITHOUT_TENSORFLOW=1 \
HOROVOD_WITHOUT_MXNET=1 \
CMAKE_CXX_STANDARD=17 \
HOROVOD_MPI_THREADS_DISABLE=1 \
HOROVOD_CPU_OPERATIONS=MPI \
HOROVOD_GPU_ALLREDUCE=NCCL \
HOROVOD_NCCL_LINK=SHARED \
# DeepSpeed
# DS_BUILD_CCL_COMM=1 \
DS_BUILD_UTILS=1 \
DS_BUILD_AIO=1 \
DS_BUILD_FUSED_ADAM=1 \
DS_BUILD_FUSED_LAMB=1 \
DS_BUILD_TRANSFORMER=1 \
DS_BUILD_STOCHASTIC_TRANSFORMER=1 \
DS_BUILD_TRANSFORMER_INFERENCE=1
# Torch: reuse the global torch in the container
RUN CONTAINER_TORCH_VERSION="$(python -c 'import torch;print(torch.__version__)')" \
&& pip install --no-cache-dir torch=="$CONTAINER_TORCH_VERSION" \
deepspeed==0.15.* \
git+https://github.com/horovod/horovod.git@3a31d93 \
"prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" \
ray[tune] \
# fix .triton/autotune/Fp16Matmul_2d_kernel.pickle bug
&& pver="$(python --version 2>&1 | awk '{print $2}' | cut -f1-2 -d.)" \
&& line=$(cat -n "/usr/local/lib/python${pver}/dist-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py" | grep os.rename | awk '{print $1}' | head -n 1) \
&& sed -i "${line}s|^|#|" "/usr/local/lib/python${pver}/dist-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py"

# Install itwinai
COPY pyproject.toml ./
COPY src ./
COPY env-files/torch/create_container_env.sh ./
RUN bash create_container_env.sh ${IMG_TAG}

# Create non-root user
RUN groupadd -g 10001 jovyan \
&& useradd -m -u 10000 -g jovyan jovyan \
&& chown -R jovyan:jovyan /usr/src/app
USER jovyan:jovyan

LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"
COPY pyproject.toml pyproject.toml
COPY src src
# Torch: reuse the global torch in the container
RUN CONTAINER_TORCH_VERSION="$(python -c 'import torch;print(torch.__version__)')" \
&& pip install --no-cache-dir torch=="$CONTAINER_TORCH_VERSION" \
.[torch] \
&& itwinai sanity-check --torch \
--optional-deps deepspeed \
--optional-deps horovod \
--optional-deps ray

# Additional pip deps
ARG REQUIREMENTS=env-files/torch/requirements/requirements.txt
COPY "${REQUIREMENTS}" additional-requirements.txt
RUN pip install --no-cache-dir -r additional-requirements.txt
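Given the `IMG_TAG` and `REQUIREMENTS` build args shown in this Dockerfile, a build might be invoked as follows. This is a sketch under stated assumptions: the registry, itwinai version, and resulting tag are illustrative and not taken from the commit.

```shell
# Build the torch base image from the repo root (example values only).
docker build \
  --build-arg IMG_TAG=24.05-py3 \
  --build-arg REQUIREMENTS=env-files/torch/requirements/requirements.txt \
  -f env-files/torch/Dockerfile \
  -t ghcr.io/intertwin-eu/itwinai:0.2.2-torch2.4-jammy .
```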
1 change: 1 addition & 0 deletions env-files/torch/createEnvVega.sh
@@ -19,6 +19,7 @@ ml GCCcore/11.3.0
#ml NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
ml NCCL
ml cuDNN/8.9.7.29-CUDA-12.3.0
ml UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.3.0

# You should have CUDA 12.3 now

73 changes: 0 additions & 73 deletions env-files/torch/create_container_env.sh

This file was deleted.

