HPO tutorial (#270)
* Added first draft of virgo raytorchtrainer integration

* Added MLFlow logger integration to raytorchtrainer

* Added ray components to main scripts

* Added first draft of raydeepspeed strategy

* Update createEnvVega.sh

* Gpu monitoring (#237)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* remove redundant variable

* remove trailing whitespace

* fix issues from PR

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* add configurable and dynamic wait and warmup times for the profiler

* remove old plot

* move horovod import

* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Scalability test wall clock (#239)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Itwinai jlab Docker image (#236)

* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLFLow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolit config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests

* Virgo HDF5 file format (#240)

* update virgo generated dataset to use hdf5 format

* add functionality for selecting output location

* set new data format as standard

* make virgo work with new data loader and add progress bar

* remove old generation files and add script for concatenating hdf5 files

* remove old generation files and add script for concatenating hdf5 files

* rename folder using hyphens

* remove multiprocessing

* add multiprocessing at correct place

* update handling of seed and num processes

* make virgo work with new data loader and add progress bar

* add contributors

* update ruff settings in pyproject

* update virgo dataset concatenation

* add isort option to ruff

* break imports on purpose

* break more imports to test

* remove ruff config file

* 😀

* test linter 😁

* remove comment in github workflows

* add validation python to linter and make more mistakes

* add linting errors to trainer

* remove isort and flake8 and replace with ruff

* update linters

* run formatter on virgo folder

* fix linting errors and stuff from PR

* update config

* change config for timing code

* update profiler to use 'with' for context managing

* fix profiler.py

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Added working version of deepspeed strategy, added RayDistributedStrategy as parent class

* Changed files to create docs env and build docs remotely on juwels (#251)

* Changed files to create docs env and build docs remotely on juwels

* add docs extra to pyproject.toml

* Added updated information to README

* Grammar

* Trailing whitespaces (*melting emoji*)

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Jarl Sondre Sæther <[email protected]>

* Fix scalability bug (#252)

* add barrier implementation to distributed

* fix profiler, seemingly

* add print suppress to virgo

* fix import bug

* fix trailing whitespace 😇

* remove barrier method

* Isort, format, delete old files

* Formatting imports with ruff

* Linting

* Specified super linter should not use isort

* Unm-messed up the cli.py file

* Incorporated PR comments

* Fixed & sorted imports

* Remove horovod option from RayNoiseGeneratorTrainer

* Typo

* HPO tutorial first draft

* Incorporate PR comments (most importantly, change inheritance for ray strategies)

* First draft tutorial

* PR comments, refactored RayDDPStrategy

* PR comments (super)

* PR comments (refactored dataloader function in RayTorchTrainer)

* PR comments

* Linting

* Remove else and line break

* Removed patch version specifications, refactored slurm launcher script

* Added how-it-works for HPO, updated HPO tutorial

* Removed unused export in slurm script, removed abstract train method in RayTorchTrainer

* Bugfix in the search alg/ scheduler setting and linting

* Pyproject versions (PR comment)

* I had already done that actually, so nevermind... Undoing the last commit

* Updated tutorial

* Link and reference fixes

* bash linting error

* Added MNIST data to gitignore

* Removed MNIST files for testing

* PR comments

* Phrasing of one sentence

* Updated .gitignore

* Duplicate things

---------

Co-authored-by: Matteo Bunino <[email protected]>
Co-authored-by: Jarl Sæther <[email protected]>
3 people authored Dec 10, 2024
1 parent c5f7bc0 commit affbf51
Showing 12 changed files with 821 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,7 +2,7 @@
*_logs
logs_*
TODO
-/data
+data/
nohup*
tmp*
.tmp*
1 change: 1 addition & 0 deletions docs/conf.py
@@ -40,6 +40,7 @@
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.viewcode",
"sphinx_tabs.tabs",
"myst_parser",
"nbsphinx",
"sphinx.ext.napoleon",
139 changes: 139 additions & 0 deletions docs/how-it-works/hpo/explain-hpo.rst
@@ -0,0 +1,139 @@
.. _explain_hpo:

Hyperparameter Optimization
============================

**Author(s)**: Anna Lappe (CERN)

Hyperparameter optimization (HPO) is a core technique for improving machine learning model
performance. This page introduces the concepts behind HPO, covering key elements like
hyperparameters, search algorithms, schedulers, and more.
It also outlines the benefits and drawbacks of HPO, helping you make informed decisions when
applying it with itwinai.


Key Concepts
-------------

**Hyperparameters** are parameters in a machine learning model or training process that are set
before training begins. They are not learned from the data but can have a significant impact
on the model's performance and training efficiency. Examples of hyperparameters include:

* Learning rate
* Batch size
* Number of layers in a neural network
* Regularization coefficients (e.g., L2 penalty)

HPO is the process of systematically searching for the optimal set of hyperparameters to
maximize model performance on a given task.

The **search space** defines the range of values each hyperparameter can take. It may include
discrete values (e.g., [32, 64, 128]) or continuous ranges (e.g., learning rate from 1e-5 to 1e-1).
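
A minimal sketch of how such a search space could be written with Ray Tune (the HPO framework used by
itwinai, introduced later on this page) is shown below; the parameter names and ranges are arbitrary
examples, not recommendations:

.. code-block:: python

    from ray import tune

    # Example search space: a discrete choice for the batch size, a log-uniform
    # continuous range for the learning rate, and a uniform range for dropout.
    search_space = {
        "batch_size": tune.choice([32, 64, 128]),
        "learning_rate": tune.loguniform(1e-5, 1e-1),
        "dropout": tune.uniform(0.0, 0.5),
    }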

**Search algorithms** explore the hyperparameter search space to identify the best configuration.

Common approaches include:

* Grid Search: Exhaustive search over all combinations of hyperparameter values.
* Random Search: Randomly samples combinations from the search space.
* Bayesian Optimization: Uses a probabilistic surrogate model to learn the shape of the search space and predict promising configurations.
* Evolutionary Algorithms: Evolve hyperparameter configurations using population-based evolutionary concepts, e.g. genetic programming.

**Schedulers** manage the allocation of computational resources across multiple hyperparameter
configurations. They help prioritize promising configurations and terminate less effective
ones early to save resources.

Examples include:

* ASHA (Asynchronous Successive Halving Algorithm): Allocates resources by successively discarding the lowest-performing hyperparameter combinations.
* Median Stopping Rule: Stops trials that perform below the median of completed trials.
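
As an illustration, an ASHA scheduler could be configured in Ray Tune roughly as follows; the concrete
values (maximum epochs, grace period, reduction factor) are arbitrary assumptions:

.. code-block:: python

    from ray.tune.schedulers import ASHAScheduler

    # Stop unpromising trials early: every trial runs for at least `grace_period`
    # epochs, and at each halving step only the best fraction of trials survives,
    # up to a maximum of `max_t` epochs per trial.
    scheduler = ASHAScheduler(
        time_attr="training_iteration",
        max_t=20,
        grace_period=2,
        reduction_factor=2,
    )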

The **evaluation metric** determines the performance of a model on a validation set.
Common metrics include accuracy, F1 score, and mean squared error.
The choice of metric depends on the task and its goals.

A **trial** is the evaluation of one set of hyperparameters. Depending on whether you are
using a scheduler, this could be the entire training run, i.e. as many epochs as you
have specified, or it could be terminated early and thus run for fewer epochs.


When to Use HPO and Key Considerations
---------------------------------------
HPO can significantly enhance a model's predictive accuracy and generalization to unseen data
by finding the best hyperparameter settings.
However, there are some drawbacks, especially with regard to computational cost and resource
management. Distributed HPO in particular requires careful planning of computational resources
to avoid bottlenecks or excessive costs. Keep in mind that if you run four different trials on the
same amount of resources you would normally use for a single training, your training will take roughly four times as long.

Because of this, we want to design our HPO training wisely, so that we avoid unnecessary
computational cost. The following points might help you when you are getting started with HPO.

HPO is beneficial when:

* model performance is sensitive to hyperparameter settings.
* you have access to sufficient computational resources.
* good hyperparameter settings are difficult to determine manually.

To make sure you get the most out of your HPO training, there are some considerations you might want to keep in mind:

* Define the search space wisely: narrow it to plausible ranges to improve efficiency.
* Choose appropriate metrics: use metrics aligned with your task's goals. Some search algorithms also support multi-objective optimization, e.g. if you want to track both accuracy and L1 loss; just make sure that the metrics you track conform with your search algorithm.
* Allocate your resources strategically: balance the computational cost against the expected performance gains. This is what the scheduler is for, and it is generally a good idea to use one, unless you expect your objective function to be extremely heterogeneous, i.e. the performance of a hyperparameter configuration on the first (for example) ten epochs is not a good indicator of its future performance at all. You might also have experience in training your model and want to account for additional behaviors; for this there are additional parameters you may set, such as a grace period (a minimum number of iterations a configuration is allowed to run). The sketch below puts these pieces together.
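
As a rough sketch of how these pieces fit together in plain Ray Tune (outside of itwinai), a tuning run
could look like the following; the toy objective, metric name and trial budget are made-up examples,
and the exact API may differ slightly between Ray versions:

.. code-block:: python

    from ray import train, tune
    from ray.tune.schedulers import ASHAScheduler

    def trainable(config):
        # Placeholder objective: stands in for a real training loop that would
        # build a model from `config` and report a validation metric per epoch.
        for epoch in range(10):
            loss = (config["learning_rate"] - 0.01) ** 2 + 1.0 / (epoch + 1)
            train.report({"loss": loss})

    tuner = tune.Tuner(
        # Resources granted to each trial; add e.g. "gpu": 1 on a GPU node.
        tune.with_resources(trainable, {"cpu": 2}),
        param_space={"learning_rate": tune.loguniform(1e-5, 1e-1)},
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
            scheduler=ASHAScheduler(max_t=10, grace_period=2),
            num_samples=8,  # number of trials to sample from the search space
        ),
    )
    results = tuner.fit()
    print(results.get_best_result().config)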


Hyperparameter Optimization in itwinai
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that we know the key concepts behind HPO, we can explore how these are implemented in itwinai.
We'll introduce distributed HPO, describe the architecture and operation of the ``RayTorchTrainer``,
and see that with the itwinai HPO integration, you can start optimising the hyperparameters of your
models with very minimal changes to your existing itwinai pipeline.


Ray Overview
-------------

We use an open-source framework called Ray to facilitate distributed HPO. Ray provides two key
components used in itwinai:

* **Ray Train**: A module for distributed model training.
* **Ray Tune**: A framework for hyperparameter optimization, supporting a variety of search algorithms and schedulers.

Ray uses its own cluster architecture to distribute training. A Ray cluster consists of a group
of nodes that work together to execute distributed tasks. Each node can contribute computational
resources, and Ray schedules and manages these resources.

How a Ray Cluster Operates:

#. **Node Roles**: A cluster includes a head node (orchestrator) and worker nodes (executors).
#. **Task Scheduling**: Ray automatically schedules trials across nodes based on available resources.
#. **Shared State**: Nodes share data such as checkpoints and trial results via a central storage path.

We launch a Ray cluster using a dedicated SLURM job script; you may refer to `this script <https://github.com/interTwin-eu/itwinai/blob/main/tutorials/hpo-workflows/slurm_hpo.sh>`_.
It should be suitable for almost any case in which you wish to run an itwinai pipeline with Ray; the only
thing you may have to change is the ``#SBATCH`` directives to set the proper resource requirements.
Also refer to the `Ray documentation <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_
on this topic if you want to learn more about how to launch a Ray cluster with SLURM.


How Distributed Training Works with the RayTorchTrainer
--------------------------------------------------------

The ``RayTorchTrainer`` combines components from **Ray Train** and **Ray Tune**, enabling
distributed HPO to run within your pipeline while maintaining compatibility with other itwinai features.
Because it implements the same interface as the itwinai ``TorchTrainer``, you can easily
replace the itwinai ``TorchTrainer`` with the ``RayTorchTrainer`` in your pipeline with only a few modifications.
The key features of this trainer are:

#. **Compatibility**: Use all itwinai components (loggers, data getters, splitters, and so on) with the ``RayTorchTrainer``.
#. **Flexibility**: Distributed HPO works with various search algorithms and schedulers supported by Ray Tune.
#. **Minimal Code Changes**: Replace the ``TorchTrainer`` with the ``RayTorchTrainer`` with very minimal code changes and you're ready to run HPO.

In the ``TorchTrainer``, initialization tasks (e.g., model creation, logger setup) are done
outside of the ``train()`` function. However, in the ``RayTorchTrainer``, this logic must be
moved inside ``train()``, because Ray executes only the ``train()`` function for each trial independently, and a trial's resources are allocated only once ``train()`` is called.
Furthermore, distribution frameworks such as DDP or DeepSpeed are agnostic of the other trials, so they should be initialized only after the trial's resources have been allocated.
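
Schematically (this is an illustration of the structure only, not the actual itwinai ``RayTorchTrainer``
code), the per-trial setup therefore lives inside the training function, which Ray invokes only after the
trial's resources have been granted:

.. code-block:: python

    import torch
    from torch import nn

    def train(config):
        # Ray calls this function once per trial, on the workers assigned to that
        # trial. Only at this point are the trial's resources available, so the
        # model, optimizer, logger and the distribution framework (e.g. DDP or
        # DeepSpeed) are created here rather than beforehand.
        model = nn.Linear(16, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["learning_rate"])
        # ... distributed setup, data loading and the training loop follow here.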

For a hands-on tutorial for how to change your existing itwinai pipeline code to additionally
run HPO, or how to set up an HPO integration with itwinai from scratch, have a look at the
:doc:`HPO tutorial <../../tutorials/hpo-workflows/hpo-workflows>`.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -65,6 +65,7 @@ contains thoroughly tested features aligned with the toolkit's most recent release
how-it-works/training/training
how-it-works/loggers/explain_loggers
how-it-works/workflows/explain_workflows
how-it-works/hpo/explain-hpo

.. toctree::
:maxdepth: 2