* Refactor Dockerfiles
* Refactor container gen script
* ADD jlab dockerfile
* First working version of jlab container
* ADD CMCC requirements
* update dockerfiles
* ADD nvconda and refactor
* Update containers
* ADD containers
* ADD simple plus dockerfile
* Update NV deps
* Update CUDA
* Add comment
* Cleanup
* Cleanup
* UPDATE README
* Refactor
* Fix linter
* Refactor dockerfiles and improve tests
* Refactor
* Refactor
* Fix
* Add first tests for HPC
* First broken tests for HPC
* Update tests and strategy
* UPDATE tests
* Update horovod tests
* Update tests and jlab deps
* Add MLFLow tracking URI
* ADD distributed trainer tests
* mpirun container deepspeed
* Fix distributed strategy tests on multi-node
* ADD srun launcher
* Refactor jobscript
* Cleanup
* isort tests
* Refactor scripts
* Minor fixes
* Add logging to file for all workers
* Add jupyter base files
* Add jupyter base files
* spelling
* Update provenance deps
* Update DS version
* Update prov docs
* Cleanup
* add nvidia dep
* Remove incomplete work
* update pyproject
* ADD hadolit config file
* FIX flag
* Fix linters
* Refactor
* Update prov4ml
* Update pytest CI
* Minor fix
* Incorporate feedback
* Update Dockerfiles
* Incorporate feedback
* Update comments
* Refactor tests
Showing 38 changed files with 1,523 additions and 511 deletions.
@@ -0,0 +1,6 @@
failure-threshold: warning
ignored:
  - DL3008 # Pin versions in apt get install.
  - DL3013 # Pin versions in pip. TODO: remove.
  - DL4001 # Either use Wget or Curl but not both
  - DL3003 # Use WORKDIR to switch to a directory
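
For context, a configuration like the one above is usually handed to hadolint through its `--config` option. The sketch below is an illustration only and is not part of the commit: the `HADOLINT_CONFIG` and `DOCKERFILE` variables and their default paths are assumptions, and the lint step is skipped when hadolint or either file is unavailable.

```shell
#!/usr/bin/env bash
# Sketch: lint one Dockerfile with an explicit hadolint configuration file.
# Both paths are assumptions; substitute the real ones from the repository.
config_file="${HADOLINT_CONFIG:-.hadolint.yaml}"
dockerfile="${DOCKERFILE:-Dockerfile}"

if ! command -v hadolint >/dev/null 2>&1; then
    # hadolint is not on PATH, so there is nothing to run.
    lint_status="skipped"
elif [ ! -f "$config_file" ] || [ ! -f "$dockerfile" ]; then
    # One of the input files is missing.
    lint_status="skipped"
else
    # Rules listed under `ignored:` in the config are not reported.
    hadolint --config "$config_file" "$dockerfile"
    lint_status=$?
fi

echo "lint_status=${lint_status}"
```

With `failure-threshold: warning`, findings at warning severity or above should make hadolint exit non-zero, which is what lets a CI job fail on them.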
@@ -6,24 +6,39 @@ ARG IMG_TAG=24.08-tf2-py3

FROM nvcr.io/nvidia/tensorflow:${IMG_TAG}

WORKDIR /usr/src/app
# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

LABEL org.opencontainers.image.source="https://github.com/interTwin-eu/itwinai"
LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

RUN apt-get update && apt-get install -y \
    # Needed by Prov4ML/yProvML to generate provenance graph
    dot2tex \
    && apt-get clean -y && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir \
    tf_keras==2.16.* \
    "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" \
    ray[tune]

# Install itwinai
COPY pyproject.toml ./
COPY src ./
COPY env-files/tensorflow/create_container_env.sh ./
RUN bash create_container_env.sh
COPY pyproject.toml pyproject.toml
COPY src src
RUN pip install --no-cache-dir . \
    && itwinai sanity-check --tensorflow --optional-deps ray

# Create non-root user
RUN groupadd -g 10001 jovyan \
    && useradd -m -u 10000 -g jovyan jovyan \
    && chown -R jovyan:jovyan /usr/src/app
USER jovyan:jovyan
# Additional pip deps
ARG REQUIREMENTS=env-files/torch/requirements/requirements.txt
COPY "${REQUIREMENTS}" additional-requirements.txt
RUN pip install --no-cache-dir -r additional-requirements.txt

# ENTRYPOINT [ "/bin/sh" ]
# CMD [ ]

LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"
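
The `SHELL ["/bin/bash", "-o", "pipefail", "-c"]` line in the diff above follows hadolint rule DL4006. As a small illustration of why that matters (a sketch for this write-up, not code from the commit), the snippet below runs the same failing pipeline with and without `pipefail` and compares the exit statuses.

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's exit status is that of its LAST command,
# so the failure of `false` is masked by the successful `cat`.
bash -c 'false | cat'
status_default=$?

# With pipefail (the same flags the Dockerfile SHELL instruction passes),
# the pipeline fails if any command in it fails.
bash -o pipefail -c 'false | cat'
status_pipefail=$?

echo "default=${status_default} pipefail=${status_pipefail}"
# prints: default=0 pipefail=1
```

In a Dockerfile this means a `RUN` step containing a pipe stops the build when an early stage of the pipe fails, instead of silently continuing.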
This file was deleted.
@@ -1,31 +1,75 @@
ARG IMG_TAG=23.09-py3
ARG IMG_TAG=24.05-py3
# ARG IMG_TAG=23.09-py3

# 23.09-py3: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-09.html
# 24.04-py3: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-04.html

FROM nvcr.io/nvidia/pytorch:${IMG_TAG}

# https://stackoverflow.com/a/56748289
ARG IMG_TAG
LABEL org.opencontainers.image.source="https://github.com/interTwin-eu/itwinai"
LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"

# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

WORKDIR /usr/src/app
RUN apt-get update && apt-get install -y \
    # Needed by Prov4ML/yProvML to generate provenance graph
    dot2tex \
    && apt-get clean -y && rm -rf /var/lib/apt/lists/*

# https://github.com/mpi4py/mpi4py/pull/431
RUN env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py
RUN pip install --no-cache-dir --upgrade pip \
    && env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py

# DeepSpeed, Horovod and other deps
ENV HOROVOD_WITH_PYTORCH=1 \
    HOROVOD_WITHOUT_TENSORFLOW=1 \
    HOROVOD_WITHOUT_MXNET=1 \
    CMAKE_CXX_STANDARD=17 \
    HOROVOD_MPI_THREADS_DISABLE=1 \
    HOROVOD_CPU_OPERATIONS=MPI \
    HOROVOD_GPU_ALLREDUCE=NCCL \
    HOROVOD_NCCL_LINK=SHARED \
    # DeepSpeed
    # DS_BUILD_CCL_COMM=1 \
    DS_BUILD_UTILS=1 \
    DS_BUILD_AIO=1 \
    DS_BUILD_FUSED_ADAM=1 \
    DS_BUILD_FUSED_LAMB=1 \
    DS_BUILD_TRANSFORMER=1 \
    DS_BUILD_STOCHASTIC_TRANSFORMER=1 \
    DS_BUILD_TRANSFORMER_INFERENCE=1
# Torch: reuse the global torch in the container
RUN CONTAINER_TORCH_VERSION="$(python -c 'import torch;print(torch.__version__)')" \
    && pip install --no-cache-dir torch=="$CONTAINER_TORCH_VERSION" \
    deepspeed==0.15.* \
    git+https://github.com/horovod/horovod.git@3a31d93 \
    "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" \
    ray[tune] \
    # fix .triton/autotune/Fp16Matmul_2d_kernel.pickle bug
    && pver="$(python --version 2>&1 | awk '{print $2}' | cut -f1-2 -d.)" \
    && line=$(cat -n "/usr/local/lib/python${pver}/dist-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py" | grep os.rename | awk '{print $1}' | head -n 1) \
    && sed -i "${line}s|^|#|" "/usr/local/lib/python${pver}/dist-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py"

# Install itwinai
COPY pyproject.toml ./
COPY src ./
COPY env-files/torch/create_container_env.sh ./
RUN bash create_container_env.sh ${IMG_TAG}

# Create non-root user
RUN groupadd -g 10001 jovyan \
    && useradd -m -u 10000 -g jovyan jovyan \
    && chown -R jovyan:jovyan /usr/src/app
USER jovyan:jovyan

LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
LABEL org.opencontainers.image.licenses=MIT
LABEL maintainer="Matteo Bunino - [email protected]"
COPY pyproject.toml pyproject.toml
COPY src src
# Torch: reuse the global torch in the container
RUN CONTAINER_TORCH_VERSION="$(python -c 'import torch;print(torch.__version__)')" \
    && pip install --no-cache-dir torch=="$CONTAINER_TORCH_VERSION" \
    .[torch] \
    && itwinai sanity-check --torch \
    --optional-deps deepspeed \
    --optional-deps horovod \
    --optional-deps ray

# Additional pip deps
ARG REQUIREMENTS=env-files/torch/requirements/requirements.txt
COPY "${REQUIREMENTS}" additional-requirements.txt
RUN pip install --no-cache-dir -r additional-requirements.txt
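
The DeepSpeed workaround in the `RUN` step above derives the Python `major.minor` version, finds the first line mentioning `os.rename` in `matmul_ext.py`, and comments that line out with `sed`. The sketch below reproduces the same idea on a throwaway file so it can be tried outside the container. It is an illustration under stated assumptions: the stand-in file contents and the hard-coded version string are invented for the demo, `grep -n -m 1` replaces the original `cat -n | grep | awk | head` chain, and `sed -i` without a suffix assumes GNU sed.

```shell
#!/usr/bin/env bash
# Stand-in for matmul_ext.py (contents are made up for this demo).
target="$(mktemp)"
cat > "$target" <<'EOF'
import os
def save(tmp, dst):
    print("saving")
    os.rename(tmp, dst)
def save_again(tmp, dst):
    os.rename(tmp, dst)
EOF

# Extract "major.minor" from a `python --version` style string.
# The string is hard-coded here so the demo does not depend on a local Python.
pver="$(echo 'Python 3.10.12' | awk '{print $2}' | cut -f1-2 -d.)"

# Line number of the FIRST line that mentions os.rename.
line="$(grep -n -m 1 'os.rename' "$target" | cut -d: -f1)"

# Prefix only that line with '#', leaving later matches untouched.
sed -i "${line}s|^|#|" "$target"

echo "pver=${pver} line=${line}"
# prints: pver=3.10 line=4
cat "$target"
```

Commenting out a statement this way only yields valid Python if the enclosing block still contains another statement, which is why the stand-in function keeps a `print` call next to the patched line.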
This file was deleted.