Itwinai container (#197) · interTwin-eu/itwinai@9c82f18


Itwinai container (#197)

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <[email protected]>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <[email protected]>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performances improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* Distributed strategy launcher (#127)

Update ParseConfig

* Distributed strategy launcher (#128)

Remove experimental files

* Docs dev (#132)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>

* Distributed strategy launcher (#131)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performances improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* 3dgan integration (#134)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

* ADD offloading of 3DGAN training

* ADAPT 3DGAN training for singularity execution

* UPDATE test and fix linter

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD torch and tensorflow Docker containers

* Working DDP

* REFACTOR torch container build scripts

* FIX MPI env var set

* Incomplete containers

* UPDATE Dockerfiles

* REFACTOR Dockerfiles

* Rename

* UPDATE containers files and tutorial

* CLEANUP old doc pages

* ADD containers tutorials

* ADD containers tutorials

* UPDATE deps

* UPDATE deps

* UPDATE deps

* UPDATE docs and tutorials

* CLEANUP duplicates

* Update tests and scripts

* ADD labels

* CLEANUP

* Add docs and fix deepspeed launcher

* UPDATE linter settings

* FIX slow unit test on 3DGAN train

* ADD 3dgan sample dataset

---------

Co-authored-by: Roman Machacek <[email protected]>
Co-authored-by: linxUser3574 <[email protected]>
Co-authored-by: orviz <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: VerderK <[email protected]>

11 people authored Sep 2, 2024

1 parent 17c5d94 commit 9c82f18

.github/workflows/lint.yml

@@ -47,6 +47,9 @@ jobs:
               VALIDATE_CHECKOV: false # activate to lint k8s pods
               VALIDATE_SHELL_SHFMT: false
               VALIDATE_JSCPD: false
+              VALIDATE_MARKDOWN_PRETTIER: false
+              VALIDATE_YAML_PRETTIER: false
+              VALIDATE_PYTHON_PYINK: false
               # Only check new or edited files
               VALIDATE_ALL_CODEBASE: false

.readthedocs.yaml

@@ -35,4 +35,5 @@ sphinx:
     python:
        install:
       #  - wheel
+       - requirements: docs/pre-requirements.txt
        - requirements: docs/requirements.txt

README.md

@@ -149,6 +149,17 @@ Otherwise, if you are on an HPC system, please refer to
     [this section](#activate-itwinai-environment-on-hpc)
     explaining how to load the required environment modules before the python environment.
+    To  build a Docker image for the pytorch version (need to adapt `TAG`):
+    ```bash
+    # Local
+    docker buildx build -t itwinai:TAG -f env-files/torch/Dockerfile .
+    # Ghcr.io
+    docker buildx build -t ghcr.io/intertwin-eu/itwinai:TAG -f env-files/torch/Dockerfile .
+    docker push ghcr.io/intertwin-eu/itwinai:TAG
+    ```
     #### TensorFlow virtual environment
     Makefile targets for environment installation:
@@ -174,6 +185,17 @@ Otherwise, if you are on an HPC system, please refer to
     [this section](#activate-itwinai-environment-on-hpc)
     explaining how to load the required environment modules before the python environment.
+    To  build a Docker image for the tensorflow version (need to adapt `TAG`):
+    ```bash
+    # Local
+    docker buildx build -t itwinai:TAG -f env-files/tensorflow/Dockerfile .
+    # Ghcr.io
+    docker buildx build -t ghcr.io/intertwin-eu/itwinai:TAG -f env-files/tensorflow/Dockerfile .
+    docker push ghcr.io/intertwin-eu/itwinai:TAG
+    ```
     ### Activate itwinai environment on HPC
     Usually, HPC systems organize their software in modules which need to be imported by the users

docs/README.md

            
@@ -1,11 +1,52 @@
     # Read The Docs documentation page
     The python dependencies are organized in two requirements files, which
     must be installed in the following order:
     1. `pre-requirements.txt` contains torch and tensorflow.
     1. `requirements.txt` contains the packages which depend on torch and tensorflow,
     which should be installed *after* torch and tensorflow.
     ## Build docs locally
     TODO: explain
     To build the docs locally and visualize them in your browser, without relying on external
     services (e.g., Read The Docs cloud), use the following commands
     ```bash
     # Clone the repo, if not done yet
     git clone https://github.com/interTwin-eu/itwinai.git itwinai-docs
     cd itwinai-docs
     # The first time, you may need to install some Linux packages (assuming Ubuntu system here)
     sudo apt update && sudo apt install libmysqlclient-dev
     sudo apt install python3-sphinx
     # Create a python virtual environment and install itwinai and its dependencies
     python3 -m venv .venv-docs
     source .venv-docs/bin/activate
     pip install -r docs/pre-requirements.txt
     pip install -r docs/requirements.txt
     pip install sphinx-rtd-theme
     # Move to the docs folder and build them using Sphinx
     cd docs
     make clean
     make html
     # Serve a local HTTP server to navigate the newly created docs pages.
     # You can see the docs visiting http://localhost:8000 in your browser.
     python -m http.server --directory  _build/html/
     ```
     ### Build docs on JSC
     On JSC systems, the way of building the docs locally is similar to the method
     explained above. However, the environment setup must be slightly adapted to use
     some modules provided on the HPC system.
     To manage the docs, you can simply use the Makefile target
     belows.
     From the repository's root, create the docs virtual environment:
     ```bash
@@ -19,7 +60,7 @@ and serve them on localhost:
     make docs-jsc
     ```
-    ## RTD management page
+    ## Read The Docs management page
-    To manage the documentation page visit
+    To manage the documentation on Read The Docs (RTD) cloud, visit
     [https://readthedocs.org/projects/itwinai](https://readthedocs.org/projects/itwinai/).

docs/explain_advanced_workflow.rst

This file was deleted.

docs/pre-requirements.txt

@@ -0,0 +1,5 @@
+    wheel
+    tensorflow==2.16.*
+    torch==2.1.*
+    torchvision
+    torchaudio
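
Note on the split above: `docs/pre-requirements.txt` has to be installed before `docs/requirements.txt`, because the second file lists packages which depend on torch and tensorflow. A minimal Python sketch of that ordering, assuming the file paths shown in this diff. The helper names `pip_install_commands` and `install_in_order` are hypothetical and are not part of the repository.

```python
import subprocess
import sys

# Paths as they appear in this commit; the order is significant.
REQUIREMENT_FILES = (
    "docs/pre-requirements.txt",  # torch and tensorflow
    "docs/requirements.txt",      # packages which depend on torch and tensorflow
)


def pip_install_commands(files=REQUIREMENT_FILES, python=sys.executable):
    """Return one `pip install -r <file>` argv per file, preserving order."""
    return [[python, "-m", "pip", "install", "-r", path] for path in files]


def install_in_order(files=REQUIREMENT_FILES):
    """Run the installs one by one, stopping at the first failure."""
    for argv in pip_install_commands(files):
        subprocess.run(argv, check=True)
```

Calling `install_in_order()` from the repository root inside an activated virtual environment would mirror the two `pip install -r` lines used in `env-files/docs/create-docs-env-jsc.sh`.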

docs/requirements.txt

@@ -1,12 +1,7 @@
-    Sphinx==7.2.6
     sphinx-rtd-theme==2.0.0
     nbsphinx==0.9.4
     myst-parser==2.0.0
-    wheel
-    tensorflow==2.16.*
-    torch==2.1.*
-    torchvision
-    torchaudio
     git+https://github.com/thomas-bouvier/horovod.git@compile-cpp17
     deepspeed
     IPython

docs/tutorials/distrib-ml/torch-tutorial-containers.rst

@@ -0,0 +1,55 @@
+    itwinai and containers (Docker and Singularity)
+    ===============================================
+    In this tutorial you will learn how to use itwinai's container images to run your ML workflows
+    without having to set up the python environment by means of virtual environments.
+    .. include:: ../../../tutorials/distributed-ml/torch-tutorial-containers/README.md
+       :parser: myst_parser.sphinx_
+    Shell scripts
+    --------------
+    run_docker.sh
+    ++++++++++++++++
+    .. literalinclude:: ../../../tutorials/distributed-ml/torch-tutorial-containers/run_docker.sh
+       :language: bash
+    slurm.sh
+    ++++++++++++
+    .. literalinclude:: ../../../tutorials/distributed-ml/torch-tutorial-containers/slurm.sh
+       :language: bash
+    runall.sh
+    ++++++++++++++++
+    .. literalinclude:: ../../../tutorials/distributed-ml/torch-tutorial-containers/runall.sh
+       :language: bash
+    Pipeline configuration
+    -----------------------
+    config.yaml
+    ++++++++++++
+    .. literalinclude:: ../../../tutorials/distributed-ml/torch-tutorial-containers/config.yaml
+       :language: yaml
+    Python files
+    ------------------
+    model.py
+    ++++++++++++
+    .. literalinclude:: ../../../tutorials/distributed-ml/torch-tutorial-containers/model.py
+       :language: python
+    dataloader.py
+    +++++++++++++++
+    .. literalinclude:: ../../../tutorials/distributed-ml/torch-tutorial-containers/dataloader.py
+       :language: python

docs/tutorials/tutorials.rst

@@ -20,6 +20,7 @@ Distributed ML with PyTorch
        distrib-ml/torch_tutorial_2_trainer_class
        distrib-ml/torch-tutorial-GAN
        distrib-ml/torch_scaling_test
+       distrib-ml/torch-tutorial-containers
     Distributed ML with TensorFlow

env-files/docs/create-docs-env-jsc.sh

@@ -12,4 +12,5 @@ gcc --version
     rm -rf .venv-docs
     python -m venv .venv-docs
     source .venv-docs/bin/activate
+    pip install -r docs/pre-requirements.txt
     pip install -r docs/requirements.txt

env-files/tensorflow/Dockerfile

@@ -0,0 +1,29 @@
+    ARG IMG_TAG=24.08-tf2-py3
+    # 23.09-tf2-py3: tensorflow==2.13.0
+    # 24.04-tf2-py3: tensorflow==2.15.0
+    # 24.08-tf2-py3: tensorflow==2.16.1
+    FROM nvcr.io/nvidia/tensorflow:${IMG_TAG}
+    WORKDIR /usr/src/app
+    # Install itwinai
+    COPY pyproject.toml ./
+    COPY src ./
+    COPY env-files/tensorflow/create_container_env.sh ./
+    RUN bash create_container_env.sh
+    # Create non-root user
+    RUN groupadd -g 10001 jovyan \
+        && useradd -m -u 10000 -g jovyan jovyan \
+        && chown -R jovyan:jovyan /usr/src/app
+    USER jovyan:jovyan
+    # ENTRYPOINT [ "/bin/sh" ]
+    # CMD [  ]
+    LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
+    LABEL org.opencontainers.image.description="Base itwinai image with tensorflow dependencies and CUDA drivers"
+    LABEL org.opencontainers.image.licenses=MIT
+    LABEL maintainer="Matteo Bunino - [email protected]"

env-files/tensorflow/create_container_env.sh

@@ -0,0 +1,16 @@
+    #!/bin/bash
+    # Install dependencies in container, assuming that the container image
+    # is from NGC and tensorflow is already installed
+    pip install --no-cache-dir --upgrade pip
+    # WHEN USING TF >= 2.16:
+    # install legacy version of keras (2.16)
+    # Since TF 2.16, keras updated to 3.3,
+    # which leads to an error when more than 1 node is used
+    # https://keras.io/getting_started/
+    pip install --no-cache-dir tf_keras==2.16.*
+    # itwinai
+    pip --no-cache-dir install .
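
Note on the `tf_keras` pin above: installing the package only makes legacy Keras available. The Keras getting-started page linked in the script states that TensorFlow 2.16+ also needs the `TF_USE_LEGACY_KERAS=1` environment variable for `tf.keras` to resolve to it. Whether itwinai sets this variable is not visible in this diff, so the following is only a sketch under that assumption. Both helper names are hypothetical.

```python
import os


def needs_legacy_keras(tf_version):
    """True for TensorFlow >= 2.16, where Keras 3 becomes the default."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return (major, minor) >= (2, 16)


def enable_legacy_keras(tf_version, env=None):
    """Set TF_USE_LEGACY_KERAS=1 when needed; call before importing tensorflow."""
    env = os.environ if env is None else env
    if needs_legacy_keras(tf_version):
        env["TF_USE_LEGACY_KERAS"] = "1"
    return env
```

With the base images listed in the Dockerfile comments, `24.08-tf2-py3` (tensorflow 2.16.1) would take the legacy path, while `24.04-tf2-py3` (tensorflow 2.15.0) would not.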

env-files/tensorflow/generic_tf.sh

            
@@ -41,7 +41,7 @@ else
       echo "$ENV_NAME environment is created in ${cDir}"
     fi
-    pip3 install --upgrade pip
+    pip3 install --no-cache-dir  --upgrade pip
     # get wheel -- setuptools extension
     pip3 install --no-cache-dir wheel
@@ -84,7 +84,7 @@ fi
     # Since TF 2.16, keras updated to 3.3,
     # which leads to an error when more than 1 node is used
     # https://keras.io/getting_started/
-    pip3 install tf_keras
+    pip3 install --no-cache-dir  tf_keras==2.16.*
     # itwinai
-    pip3 install -e .[dev]
+    pip3 install --no-cache-dir  -e .[dev]

env-files/torch/Dockerfile

@@ -0,0 +1,31 @@
+    ARG IMG_TAG=23.09-py3
+    # 23.09-py3: torch==2.1.0
+    # 24.04-py3: torch==2.3.0
+    FROM nvcr.io/nvidia/pytorch:${IMG_TAG}
+    # https://stackoverflow.com/a/56748289
+    ARG IMG_TAG
+    WORKDIR /usr/src/app
+    # https://github.com/mpi4py/mpi4py/pull/431
+    RUN env SETUPTOOLS_USE_DISTUTILS=local python -m pip install --no-cache-dir mpi4py
+    # Install itwinai
+    COPY pyproject.toml ./
+    COPY src ./
+    COPY env-files/torch/create_container_env.sh ./
+    RUN bash create_container_env.sh ${IMG_TAG}
+    # Create non-root user
+    RUN groupadd -g 10001 jovyan \
+        && useradd -m -u 10000 -g jovyan jovyan \
+        && chown -R jovyan:jovyan /usr/src/app
+    USER jovyan:jovyan
+    LABEL org.opencontainers.image.source=https://github.com/interTwin-eu/itwinai
+    LABEL org.opencontainers.image.description="Base itwinai image with torch dependencies and CUDA drivers"
+    LABEL org.opencontainers.image.licenses=MIT
+    LABEL maintainer="Matteo Bunino - [email protected]"
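
Note on `IMG_TAG`: both Dockerfiles declare `ARG IMG_TAG` before `FROM`, so the NGC base image can be selected at build time with `--build-arg`. A hedged sketch that assembles the `docker buildx build` command from the README for either flavor. The function name, the `0.1` tag in the usage note, and the idea of scripting the build in Python are assumptions for illustration. The diff does not show whether the build succeeds with every base image listed in the comments.

```python
import shlex

# Dockerfile locations as added by this commit.
DOCKERFILES = {
    "torch": "env-files/torch/Dockerfile",
    "tensorflow": "env-files/tensorflow/Dockerfile",
}


def docker_build_argv(flavor, tag, base_img_tag=None, registry=None):
    """Build the argv for `docker buildx build`, run from the repository root."""
    image = f"{registry}/itwinai:{tag}" if registry else f"itwinai:{tag}"
    argv = ["docker", "buildx", "build", "-t", image, "-f", DOCKERFILES[flavor]]
    if base_img_tag is not None:
        # Overrides the default declared by `ARG IMG_TAG=...` in the Dockerfile.
        argv += ["--build-arg", f"IMG_TAG={base_img_tag}"]
    argv.append(".")  # build context: repository root
    return argv


def docker_build_command(*args, **kwargs):
    """Same as docker_build_argv, rendered as a single shell-quoted string."""
    return shlex.join(docker_build_argv(*args, **kwargs))
```

For example, `docker_build_command("torch", "0.1", base_img_tag="24.04-py3")` renders a local build on top of the `24.04-py3` base image, and passing `registry="ghcr.io/intertwin-eu"` produces the image name used for the `docker push` step in the README.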
