
Update Scalability Tutorial #262

Merged · 77 commits · Jan 9, 2025

Conversation

@jarlsondre (Collaborator) commented Dec 2, 2024

Summary

This PR updates the scalability tutorial in multiple ways:

  • Cleaner code
    • Removed many print statements that were not strictly necessary
    • Reused shared functionality, such as the argument parser and the train_epoch function, for easier readability
    • General formatting
    • Rewrote the tutorial text
  • SLURM scripts
    • The SLURM jobs now use the new Python scripts, e.g. for running all the strategies as well as the scaling test
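
The kind of reuse described above can be illustrated with a minimal sketch; the signature below is hypothetical and only shows the idea of one shared epoch loop, not the tutorial's actual train_epoch:

```python
def train_epoch(model_step, batches):
    """One epoch loop shared by every strategy; each strategy supplies its own step."""
    total_loss = 0.0
    for batch in batches:
        total_loss += model_step(batch)
    return total_loss

# DDP, DeepSpeed, and Horovod runs would differ only in how model_step is built:
loss = train_epoch(lambda b: float(b) * 0.1, range(5))
print(round(loss, 1))  # 1.0
```

Factoring the loop out this way is what lets the tutorial run all strategies through one code path instead of three near-identical scripts.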

Current Limitations

The scalability tutorial was originally created with simplicity in mind, so it tries to minimize its dependency on the itwinai project. Because of this, it does not use the itwinai trainer, which is required for the updated scalability metrics. The current implementation of the scalability tutorial therefore only provides the absolute and relative wall-clock time plots. This is, however, not very different from what it provided before.
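
Producing those wall-clock plots needs nothing beyond the standard library; a minimal sketch of the idea (illustrative names, not the itwinai EpochTimeTracker API) might look like this:

```python
import csv
import time

def time_epochs(train_epoch, num_epochs: int, num_nodes: int,
                out_csv: str = "epoch_times.csv"):
    """Record per-epoch wall-clock times and save them as CSV rows."""
    rows = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        train_epoch(epoch)  # the tutorial's shared training function
        rows.append({"epoch": epoch, "nodes": num_nodes,
                     "time_s": time.perf_counter() - start})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "nodes", "time_s"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The relative plot would then typically divide each node count's average epoch time by the single-node baseline, while the absolute plot shows the raw times.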

Examples of resulting plots

These plots were produced using a subset of 5000 samples, not the full ImageNet dataset.
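
A hypothetical sketch of how such a fixed-size subset could be drawn: pick a reproducible random set of indices, which something like torch.utils.data.Subset would then wrap around the full dataset. The function name and seed are illustrative, not the tutorial's actual code:

```python
import random

def subset_indices(dataset_size: int, subset_size: int, seed: int = 42) -> list:
    """Return a reproducible random sample of dataset indices, without duplicates."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), subset_size)

# e.g. 5000 indices out of the ImageNet-1k train split (~1.28M images):
indices = subset_indices(dataset_size=1_281_167, subset_size=5000)
print(len(indices))  # 5000
```

Fixing the seed matters here: every strategy and node count must train on the same subset for the wall-clock comparison to be fair.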

Relative Scalability Plot

(plot image attached in the PR)

Absolute Scalability Plot

(plot image attached in the PR)

List of tasks

I am including this list of tasks not only for my own benefit, but also to make it easier for reviewers to see what has been done. Keep in mind that this list is not exhaustive: I started adding to it after already finishing some tasks.

  • Fix DeepSpeed local rank problem
    • Turned out to be caused by some environment variables related to the OMPI local rank
  • Fix Horovod failing to load (running out of time?)
    • This turned out to be caused by me forgetting to train on a subset of ImageNet instead of the whole dataset...
  • Change the name of the default SLURM log directory
  • Refactor the tutorial files
    • They don't need to be perfect, but there is a lot of redundant code that just causes confusion IMO
  • Update the tutorial README
  • Allow the user to specify nodes for the scalability analysis
  • Allow the user to specify the ImageNet subset size
  • Allow the user to specify a folder for storing the scalability metrics
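
The last three tasks amount to exposing a few CLI options. A hedged sketch of what that interface could look like with argparse (the flag names and defaults are illustrative, not the tutorial's actual ones):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scalability tutorial options")
    parser.add_argument("--nodes", type=int, nargs="+", default=[1, 2, 4, 8],
                        help="Node counts to include in the scaling test")
    parser.add_argument("--subset-size", type=int, default=5000,
                        help="Number of ImageNet samples to train on")
    parser.add_argument("--metrics-dir", default="scalability-metrics",
                        help="Folder for storing scalability metrics")
    return parser

# Parsing an explicit argv list, as the SLURM scripts might pass it:
args = build_parser().parse_args(["--nodes", "1", "2", "--subset-size", "1000"])
print(args.nodes, args.subset_size, args.metrics_dir)
```

Sharing one parser like this across the tutorial scripts is also what the "Reusing functionality such as the argparser" point in the summary refers to.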

Related issues: #221, #263

jarlsondre and others added 30 commits November 11, 2024 15:36
* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLFLow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolit config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests
* update virgo generated dataset to use hdf5 format

* add functionality for selecting output location

* set new data format as standard

* make virgo work with new data loader and add progress bar

* remove old generation files and add script for concatenating hdf5 files

* remove old generation files and add script for concatenating hdf5 files

* rename folder using hyphens

* remove multiprocessing

* add multiprocessing at correct place

* update handling of seed and num processes

* Gpu monitoring (#237)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* remove redundant variable

* remove trailing whitespace

* fix issues from PR

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* add configurable and dynamic wait and warmup times for the profiler

* remove old plot

* move horovod import

* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Scalability test wall clock (#239)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* make virgo work with new data loader and add progress bar

* add contributors

* update ruff settings in pyproject

* update virgo dataset concatenation

* add isort option to ruff

* break imports on purpose

* break more imports to test

* remove ruff config file

* 😀

* test linter 😁

* remove comment in github workflows

* add validation python to linter and make more mistakes

* add linting errors to trainer

* remove isort and flake8 and replace with ruff

* update linters

* run formatter on virgo folder

* fix linting errors and stuff from PR

* update config

* change config for timing code

* update profiler to use 'with' for context managing

* fix profiler.py

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>
@jarlsondre jarlsondre marked this pull request as ready for review January 7, 2025 16:24
@jarlsondre jarlsondre changed the title from "[DRAFT] Scalability tutorial" to "Update Scalability Tutorial" on Jan 7, 2025
Review threads (since resolved) on:
  • tutorials/distributed-ml/torch-tutorial-1-mnist/train.py
  • use-cases/virgo/config.yaml
  • uv-tutorial.md
@annaelisalappe annaelisalappe left a comment

I'm a little unsure whether my build of the docs worked correctly; you might have to show me later where I can find the new tutorial in the docs :)

Review threads (since resolved) on:
  • src/itwinai/slurm/slurm_config.yaml
  • use-cases/virgo/config.yaml
@jarlsondre jarlsondre merged commit 7ab952e into main Jan 9, 2025
12 checks passed
@jarlsondre jarlsondre deleted the scalability-tutorial branch January 9, 2025 10:38
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

3 participants