Commit ac2d781
updated README
Alexander Zoechbauer committed Oct 31, 2023
1 parent 29831b3
Showing 5 changed files with 59 additions and 29 deletions.
48 changes: 34 additions & 14 deletions README.md
@@ -9,41 +9,61 @@ for a quick overview of this platform for advanced AI/ML workflows in digital tw
If you want to integrate a new use case, you can follow this
[step-by-step guide](https://intertwin-eu.github.io/T6.5-AI-and-ML/docs/How-to-use-this-software.html).

## CMCC Use Case
To run the cyclones training workflow:
```
micromamba run -p ./.venv python run-workflow.py -f ./use-cases/cyclones/workflows/workflow-train.yml
```

-## Installation
+## Requirements

-The containers were build using Apptainer version 1.1.8-1.el8
+The containers were built using Apptainer version 1.1.8-1.el8 and Podman version 4.4.1.

-### Building the containers
+### Base Container

The containers are built on top of the [NVIDIA PyTorch NGC Containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). The NGC containers ship with preinstalled libraries such as CUDA, cuDNN, NCCL, and PyTorch that are mutually compatible, which reduces dependency issues and provides maximum portability. The current version used is ```nvcr.io/nvidia/pytorch:23.09-py3```, which is based on CUDA 12.2.1 and PyTorch 2.1.0a0+32f93b1.
If you need other specs, consult the [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) to find the right base container for you.
Once you have found the right container, alter the corresponding pull line in ```containers/apptainer/apptainer_build.sh``` accordingly.
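In practice, switching the base container amounts to changing the image tag in that pull line. A minimal sketch of the edit (```23.05-py3``` is a hypothetical alternative release; check the Release Notes before choosing one):

```
# Rewrite the NGC tag in the pull command. This operates on a stand-in
# string; in the repository you would apply the same substitution to the
# pull line in containers/apptainer/apptainer_build.sh.
line='apptainer pull torch.sif docker://nvcr.io/nvidia/pytorch:23.09-py3'
echo "$line" | sed 's|pytorch:23.09-py3|pytorch:23.05-py3|'
# → apptainer pull torch.sif docker://nvcr.io/nvidia/pytorch:23.05-py3
```

After the edit, re-run the build script so the new base image is pulled.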


### Running the itwinai container

There are currently three ways to execute the itwinai container on a SLURM cluster.

1. Direct build on the HPC system
2. Use build on the [itwinai repo](https://github.com/interTwin-eu/itwinai/pkgs/container/t6.5-ai-and-ml) and pull to HPC system
3. Deploy to Kubernetes cluster and offload to HPC via [interLink](https://github.com/interTwin-eu/interLink)

![container workflow](docs/docs/img/containers.png)

##### Direct build
To build the container directly on the HPC system, select the desired base container by altering the following line inside ```containers/apptainer/apptainer_build.sh```:
```
apptainer pull torch.sif docker://nvcr.io/nvidia/pytorch:23.09-py3
```

As mentioned above, additional libraries are installed on top of the NGC container; they are listed inside ```env-files/torch/pytorch-env-gpu-container.txt```.

-Once you are satisified with the libraries run:
+Install the itwinai libraries by running:
```
./containers/apptainer/apptainer_build.sh
```

### Run the containers
Run the startscript with
```
sbatch use-cases/mnist/torch/startscript.sh
```
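The start script is essentially a SLURM batch script wrapping an `apptainer` call. A minimal sketch of such a script (job name, resource flags, and time limit are placeholders, not the repository's actual values; see `use-cases/mnist/torch/startscript.sh` for the real one):

```
#!/bin/bash
#SBATCH --job-name=itwinai-mnist     # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# run the containerized training (path relative to the repo root)
srun apptainer run --nv containers/apptainer/itwinai.sif
```

Submit it with ```sbatch``` from the repository root.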

##### GitHub Container Registry build
With this method you can simply pull the prebuilt container from the GitHub Container Registry:
```
apptainer pull containers/apptainer/itwinai.sif docker://ghcr.io/intertwin-eu/t6.5-ai-and-ml:containers
```

Run the startscript with
```
sbatch startscript.sh
sbatch use-cases/mnist/torch/startscript.sh
```

##### InterLink
Execution via interLink has not been tested yet.

### Future work
It is planned to build the container automatically via GitHub Actions.
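A sketch of what such a GitHub Actions workflow could look like (file name, trigger, and tag are assumptions, not the project's actual setup):

```
# .github/workflows/container.yml (hypothetical)
name: Build container
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      packages: write
    steps:
      - uses: actions/checkout@v4
      # authenticate against the GitHub Container Registry
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # build the image and push it to GHCR
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/intertwin-eu/t6.5-ai-and-ml:containers
```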
4 changes: 2 additions & 2 deletions containers/apptainer/apptainer_build.sh
@@ -13,10 +13,10 @@ export APPTAINER_CACHEDIR=$(mktemp -d -p $PWD/Cache)
export APPTAINER_TMPDIR=$(mktemp -d -p $PWD/TMP)

# official NVIDIA NVCR container with Torch==2.0.0
-apptainer pull containers/apptainer/torch.sif docker://nvcr.io/nvidia/pytorch:23.09-py3
+apptainer pull containers/apptainer/itwinai.sif docker://nvcr.io/nvidia/pytorch:23.09-py3

# run bash to create envs
echo "running ./containers/apptainer/apptainer_build_env.sh"
-apptainer exec torch.sif bash -c "./containers/apptainer/apptainer_build_env.sh"
+apptainer exec itwinai.sif bash -c "./containers/apptainer/apptainer_build_env.sh"

#eof
4 changes: 2 additions & 2 deletions containers/apptainer/apptainer_build_env.sh
@@ -5,5 +5,5 @@ nname='torch_env'
# source ${nname}/bin/activate

# install wheels -- from this point on, feel free to add anything
-pip3 install -r ./env-files/torch/pytorch-env-gpu-container.txt
-pip3 install -e .
+#pip3 install -r ./env-files/torch/pytorch-env-gpu-container.txt
+pip3 install -e .[dev]
Binary file added docs/docs/img/containers.png
32 changes: 21 additions & 11 deletions use-cases/mnist/torch/startscript.sh
@@ -24,12 +24,9 @@
# parameters
debug=false # display debug info

-CONTAINERPATH="/p/project/intertwin/zoechbauer1/T6.5-AI-and-ML/containers/apptainer/torch.sif"
+CONTAINERPATH="/p/project/intertwin/zoechbauer1/T6.5-AI-and-ML/containers/apptainer/itwinai.sif"

-#EXEC="python train.py -p pipeline.yaml --download-only" #for bash
-EXEC="python train.py -p pipeline.yaml" #for SLURM
-
-SLURM_EXECUTION=true
+SLURM_EXECUTION=false

#switch to use case folder
cd use-cases/mnist/torch
@@ -68,15 +65,28 @@ if [ "$debug" = true ] ; then
fi



# This is how to override the default run command in the container, e.g.:

#EXEC="python train.py -p pipeline.yaml --download-only" #for bash
# if [ "$SLURM_EXECUTION" = true ]; then
# srun --cpu-bind=none bash -c "apptainer exec --nv \
# $CONTAINERPATH \
# $EXEC"
# else
# apptainer exec --nv \
# $CONTAINERPATH \
# $EXEC
# fi

#Choose SLURM execution or bash script execution
if [ "$SLURM_EXECUTION" = true ]; then
-srun --cpu-bind=none bash -c "apptainer exec --nv \
-$CONTAINERPATH \
-$EXEC"
+srun --cpu-bind=none bash -c "apptainer run --nv \
+$CONTAINERPATH"

else
-apptainer exec --nv \
-$CONTAINERPATH \
-$EXEC
+apptainer run --nv \
+$CONTAINERPATH
fi

#eof
