
Update Scalability Tutorial #262

Merged · 77 commits · Jan 9, 2025

Conversation

@jarlsondre (Collaborator) commented Dec 2, 2024

Summary

This PR updates the scalability tutorial in multiple ways:

  • Cleaner code
    • Removed many print statements that were not strictly necessary
    • Reused shared functionality, such as the argument parser and the train_epoch function, for easier readability
    • General formatting
    • Rewrote the tutorial text
  • SLURM scripts
    • The SLURM jobs now use the new Python scripts, e.g. for running all the strategies as well as the scaling test
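
The kind of reuse described above can be illustrated with a minimal sketch; the signature below is hypothetical and only shows the idea of one shared epoch loop, not the tutorial's actual train_epoch:

```python
def train_epoch(model_step, batches):
    """One epoch loop shared by every strategy; each strategy supplies its own step."""
    total_loss = 0.0
    for batch in batches:
        total_loss += model_step(batch)
    return total_loss

# DDP, DeepSpeed, and Horovod runs would differ only in how model_step is built:
loss = train_epoch(lambda b: float(b) * 0.1, range(5))
print(round(loss, 1))  # 1.0
```

Factoring the loop out this way is what lets the tutorial run all strategies through one code path instead of three near-identical scripts.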

Current Limitations

The scalability tutorial was originally created with simplicity in mind, so it tries to minimize its dependency on the itwinai project. Because of this, it does not use the itwinai trainer, which is required for the updated scalability metrics. The current implementation of the scalability tutorial therefore only provides the absolute and relative wall-clock time plots. This is, however, not very different from what it provided before.
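
Producing those wall-clock plots needs nothing beyond the standard library; a minimal sketch of the idea (illustrative names, not the itwinai EpochTimeTracker API) might look like this:

```python
import csv
import time

def time_epochs(train_epoch, num_epochs: int, num_nodes: int,
                out_csv: str = "epoch_times.csv"):
    """Record per-epoch wall-clock times and save them as CSV rows."""
    rows = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        train_epoch(epoch)  # the tutorial's shared training function
        rows.append({"epoch": epoch, "nodes": num_nodes,
                     "time_s": time.perf_counter() - start})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "nodes", "time_s"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The relative plot would then typically divide each node count's average epoch time by the single-node baseline, while the absolute plot shows the raw times.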

Examples of resulting plots

These plots were produced using a subset of 5000 samples, not the full ImageNet dataset.
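
A hypothetical sketch of how such a fixed-size subset could be drawn: pick a reproducible random set of indices, which something like torch.utils.data.Subset would then wrap around the full dataset. The function name and seed are illustrative, not the tutorial's actual code:

```python
import random

def subset_indices(dataset_size: int, subset_size: int, seed: int = 42) -> list:
    """Return a reproducible random sample of dataset indices, without duplicates."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), subset_size)

# e.g. 5000 indices out of the ImageNet-1k train split (~1.28M images):
indices = subset_indices(dataset_size=1_281_167, subset_size=5000)
print(len(indices))  # 5000
```

Fixing the seed matters here: every strategy and node count must train on the same subset for the wall-clock comparison to be fair.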

Relative Scalability Plot

(plot image attached in the PR)

Absolute Scalability Plot

(plot image attached in the PR)

List of tasks

I am including this list of tasks not only for my own benefit, but also to make it easier for reviewers to see what has been done. Keep in mind that this list is not exhaustive: I started adding to it after already finishing some tasks.

  • Fix DeepSpeed local rank problem
    • Turned out to be caused by some environment variables related to the OMPI local rank
  • Fix Horovod failing to load (running out of time?)
    • This turned out to be caused by me forgetting to train on a subset of ImageNet instead of the whole dataset...
  • Change the name of the default SLURM log directory
  • Refactor the tutorial files
    • They don't need to be perfect, but there is a lot of redundant code that just causes confusion IMO
  • Update the tutorial README
  • Allow the user to specify nodes for the scalability analysis
  • Allow the user to specify the ImageNet subset size
  • Allow the user to specify a folder for storing the scalability metrics
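
The last three tasks amount to exposing a few CLI options. A hedged sketch of what that interface could look like with argparse (the flag names and defaults are illustrative, not the tutorial's actual ones):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scalability tutorial options")
    parser.add_argument("--nodes", type=int, nargs="+", default=[1, 2, 4, 8],
                        help="Node counts to include in the scaling test")
    parser.add_argument("--subset-size", type=int, default=5000,
                        help="Number of ImageNet samples to train on")
    parser.add_argument("--metrics-dir", default="scalability-metrics",
                        help="Folder for storing scalability metrics")
    return parser

# Parsing an explicit argv list, as the SLURM scripts might pass it:
args = build_parser().parse_args(["--nodes", "1", "2", "--subset-size", "1000"])
print(args.nodes, args.subset_size, args.metrics_dir)
```

Sharing one parser like this across the tutorial scripts is also what the "Reusing functionality such as the argparser" point in the summary refers to.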

Related issues: #221, #263

jarlsondre and others added 30 commits November 11, 2024 15:36
* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLFLow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolit config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests
* update virgo generated dataset to use hdf5 format

* add functionality for selecting output location

* set new data format as standard

* make virgo work with new data loader and add progress bar

* remove old generation files and add script for concatenating hdf5 files

* remove old generation files and add script for concatenating hdf5 files

* rename folder using hyphens

* remove multiprocessing

* add multiprocessing at correct place

* update handling of seed and num processes

* Gpu monitoring (#237)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* remove redundant variable

* remove trailing whitespace

* fix issues from PR

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* add configurable and dynamic wait and warmup times for the profiler

* remove old plot

* move horovod import

* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Scalability test wall clock (#239)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* make virgo work with new data loader and add progress bar

* add contributors

* update ruff settings in pyproject

* update virgo dataset concatenation

* add isort option to ruff

* break imports on purpose

* break more imports to test

* remove ruff config file

* 😀

* test linter 😁

* remove comment in github workflows

* add validation python to linter and make more mistakes

* add linting errors to trainer

* remove isort and flake8 and replace with ruff

* update linters

* run formatter on virgo folder

* fix linting errors and stuff from PR

* update config

* change config for timing code

* update profiler to use 'with' for context managing

* fix profiler.py

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>
@jarlsondre jarlsondre marked this pull request as ready for review January 7, 2025 16:24
@jarlsondre jarlsondre changed the title from "[DRAFT] Scalability tutorial" to "Update Scalability Tutorial" on Jan 7, 2025
Review threads (since resolved) on:
  • tutorials/distributed-ml/torch-tutorial-1-mnist/train.py
  • use-cases/virgo/config.yaml
  • uv-tutorial.md
@annaelisalappe annaelisalappe left a comment

I'm a little unsure whether my build of the docs worked correctly; you might have to show me later where I can find the new tutorial in the docs :)

Review threads (since resolved) on:
  • src/itwinai/slurm/slurm_config.yaml
  • use-cases/virgo/config.yaml
@jarlsondre jarlsondre merged commit 7ab952e into main Jan 9, 2025
12 checks passed
@jarlsondre jarlsondre deleted the scalability-tutorial branch January 9, 2025 10:38
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

3 participants