Commit ac2d781
updated README
Alexander Zoechbauer committed Oct 31, 2023
1 parent 29831b3
Showing 5 changed files with 59 additions and 29 deletions.
48 changes: 34 additions & 14 deletions README.md
@@ -9,41 +9,61 @@ for a quick overview of this platform for advanced AI/ML workflows in digital tw
If you want to integrate a new use case, you can follow this
[step-by-step guide](https://intertwin-eu.github.io/T6.5-AI-and-ML/docs/How-to-use-this-software.html).

## CMCC Use Case
To run the cyclones training workflow:
```
micromamba run -p ./.venv python run-workflow.py -f ./use-cases/cyclones/workflows/workflow-train.yml
```

-## Installation
+## Requirements

-The containers were build using Apptainer version 1.1.8-1.el8
+The containers were built using Apptainer version 1.1.8-1.el8 and Podman version 4.4.1.

-### Building the containers
+### Base Container

The containers are built on top of the [NVIDIA PyTorch NGC Containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). The NGC containers ship with preinstalled libraries such as CUDA, cuDNN, NCCL, and PyTorch that are mutually compatible, which reduces dependency issues and provides maximum portability. The current version used is ```nvcr.io/nvidia/pytorch:23.09-py3```, which is based on CUDA 12.2.1 and PyTorch 2.1.0a0+32f93b1.
If you need other specs, consult the [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) to find the right base container for you.
Once you have found the right container, alter the corresponding pull line in ```containers/apptainer/apptainer_build.sh``` accordingly.
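In practice, switching the base container amounts to changing the image tag in that pull line. A minimal sketch of the edit (```23.05-py3``` is a hypothetical alternative release; check the Release Notes before choosing one):

```
# Rewrite the NGC tag in the pull command. This operates on a stand-in
# string; in the repository you would apply the same substitution to the
# pull line in containers/apptainer/apptainer_build.sh.
line='apptainer pull torch.sif docker://nvcr.io/nvidia/pytorch:23.09-py3'
echo "$line" | sed 's|pytorch:23.09-py3|pytorch:23.05-py3|'
# → apptainer pull torch.sif docker://nvcr.io/nvidia/pytorch:23.05-py3
```

After the edit, re-run the build script so the new base image is pulled.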


### Running the itwinai container

There are currently three ways to execute the itwinai container on a SLURM cluster.

1. Direct build on the HPC system
2. Use build on the [itwinai repo](https://github.com/interTwin-eu/itwinai/pkgs/container/t6.5-ai-and-ml) and pull to HPC system
3. Deploy to Kubernetes cluster and offload to HPC via [interLink](https://github.com/interTwin-eu/interLink)

![container workflow](docs/docs/img/containers.png)

##### Direct build
To build the container directly on the HPC system, select the desired base container by altering the following line inside ```containers/apptainer/apptainer_build.sh```:
```
apptainer pull torch.sif docker://nvcr.io/nvidia/pytorch:23.09-py3
```

As mentioned above, additional libraries are installed on top of the NGC container; they are listed inside ```env-files/torch/pytorch-env-gpu-container.txt```.

-Once you are satisified with the libraries run:
+Install the itwinai libraries by running:
```
./containers/apptainer/apptainer_build.sh
```

### Run the containers
Run the startscript with
```
sbatch use-cases/mnist/torch/startscript.sh
```
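The start script is essentially a SLURM batch script wrapping an `apptainer` call. A minimal sketch of such a script (job name, resource flags, and time limit are placeholders, not the repository's actual values; see `use-cases/mnist/torch/startscript.sh` for the real one):

```
#!/bin/bash
#SBATCH --job-name=itwinai-mnist     # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# run the containerized training (path relative to the repo root)
srun apptainer run --nv containers/apptainer/itwinai.sif
```

Submit it with ```sbatch``` from the repository root.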

##### GitHub Container Registry build
With this method you can simply pull the prebuilt container from the GitHub Container Registry:
```
apptainer pull containers/apptainer/itwinai.sif docker://ghcr.io/intertwin-eu/t6.5-ai-and-ml:containers
```

Run the startscript with
```
sbatch startscript.sh
sbatch use-cases/mnist/torch/startscript.sh
```

##### InterLink
Execution via interLink has not been tested yet.

### Future work
It is planned to build the container automatically via GitHub Actions.
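A sketch of what such a GitHub Actions workflow could look like (file name, trigger, and tag are assumptions, not the project's actual setup):

```
# .github/workflows/container.yml (hypothetical)
name: Build container
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      packages: write
    steps:
      - uses: actions/checkout@v4
      # authenticate against the GitHub Container Registry
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # build the image and push it to GHCR
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/intertwin-eu/t6.5-ai-and-ml:containers
```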
4 changes: 2 additions & 2 deletions containers/apptainer/apptainer_build.sh
@@ -13,10 +13,10 @@ export APPTAINER_CACHEDIR=$(mktemp -d -p $PWD/Cache)
export APPTAINER_TMPDIR=$(mktemp -d -p $PWD/TMP)

# official NVIDIA NVCR container with Torch==2.0.0
-apptainer pull containers/apptainer/torch.sif docker://nvcr.io/nvidia/pytorch:23.09-py3
+apptainer pull containers/apptainer/itwinai.sif docker://nvcr.io/nvidia/pytorch:23.09-py3

# run bash to create envs
echo "running ./containers/apptainer/apptainer_build_env.sh"
-apptainer exec torch.sif bash -c "./containers/apptainer/apptainer_build_env.sh"
+apptainer exec itwinai.sif bash -c "./containers/apptainer/apptainer_build_env.sh"

#eof
4 changes: 2 additions & 2 deletions containers/apptainer/apptainer_build_env.sh
@@ -5,5 +5,5 @@ nname='torch_env'
# source ${nname}/bin/activate

# install wheels -- from this point on, feel free to add anything
-pip3 install -r ./env-files/torch/pytorch-env-gpu-container.txt
-pip3 install -e .
+#pip3 install -r ./env-files/torch/pytorch-env-gpu-container.txt
+pip3 install -e .[dev]
Binary file added docs/docs/img/containers.png
32 changes: 21 additions & 11 deletions use-cases/mnist/torch/startscript.sh
@@ -24,12 +24,9 @@
# parameters
debug=false # display debug info

-CONTAINERPATH="/p/project/intertwin/zoechbauer1/T6.5-AI-and-ML/containers/apptainer/torch.sif"
+CONTAINERPATH="/p/project/intertwin/zoechbauer1/T6.5-AI-and-ML/containers/apptainer/itwinai.sif"

-#EXEC="python train.py -p pipeline.yaml --download-only" #for bash
-EXEC="python train.py -p pipeline.yaml" #for SLURM
-
-SLURM_EXECUTION=true
+SLURM_EXECUTION=false

#switch to use case folder
cd use-cases/mnist/torch
@@ -68,15 +65,28 @@ if [ "$debug" = true ] ; then
fi



# This is how to override the default run command in the container, e.g.:

#EXEC="python train.py -p pipeline.yaml --download-only" #for bash
# if [ "$SLURM_EXECUTION" = true ]; then
# srun --cpu-bind=none bash -c "apptainer exec --nv \
# $CONTAINERPATH \
# $EXEC"
# else
# apptainer exec --nv \
# $CONTAINERPATH \
# $EXEC
# fi

#Choose SLURM execution or bash script execution
if [ "$SLURM_EXECUTION" = true ]; then
-srun --cpu-bind=none bash -c "apptainer exec --nv \
-$CONTAINERPATH \
-$EXEC"
+srun --cpu-bind=none bash -c "apptainer run --nv \
+$CONTAINERPATH"

else
-apptainer exec --nv \
-$CONTAINERPATH \
-$EXEC
+apptainer run --nv \
+$CONTAINERPATH
fi

#eof
