HPO tutorial (#270)
* Added first draft of virgo raytorchtrainer integration

* Added MLFlow logger integration to raytorchtrainer

* Added ray components to main scripts

* Added first draft of raydeepspeed strategy

* Update createEnvVega.sh

* Gpu monitoring (#237)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* remove redundant variable

* remove trailing whitespace

* fix issues from PR

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* add configurable and dynamic wait and warmup times for the profiler

* remove old plot

* move horovod import

* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Scalability test wall clock (#239)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Itwinai jlab Docker image (#236)

* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLFLow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolit config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests

* Virgo HDF5 file format (#240)

* update virgo generated dataset to use hdf5 format

* add functionality for selecting output location

* set new data format as standard

* make virgo work with new data loader and add progress bar

* remove old generation files and add script for concatenating hdf5 files

* remove old generation files and add script for concatenating hdf5 files

* rename folder using hyphens

* remove multiprocessing

* add multiprocessing at correct place

* update handling of seed and num processes

* make virgo work with new data loader and add progress bar

* add contributors

* update ruff settings in pyproject

* update virgo dataset concatenation

* add isort option to ruff

* break imports on purpose

* break more imports to test

* remove ruff config file

* 😀

* test linter 😁

* remove comment in github workflows

* add validation python to linter and make more mistakes

* add linting errors to trainer

* remove isort and flake8 and replace with ruff

* update linters

* run formatter on virgo folder

* fix linting errors and stuff from PR

* update config

* change config for timing code

* update profiler to use 'with' for context managing

* fix profiler.py

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Added working version of deepspeed strategy, added RayDistributedStrategy as parent class

* Changed files to create docs env and build docs remotely on juwels (#251)

* Changed files to create docs env and build docs remotely on juwels

* add docs extra to pyproject.toml

* Added updated information to README

* Grammar

* Trailing whitespaces (*melting emoji*)

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Jarl Sondre Sæther <[email protected]>

* Fix scalability bug (#252)

* add barrier implementation to distributed

* fix profiler, seemingly

* add print suppress to virgo

* fix import bug

* fix trailing whitespace 😇

* remove barrier method

* Isort, format, delete old files

* Formatting imports with ruff

* Linting

* Specified super linter should not use isort

* Unm-messed up the cli.py file

* Incorporated PR comments

* Fixed & sorted imports

* Remove horovod option from RayNoiseGeneratorTrainer

* Typo

* HPO tutorial first draft

* Incorporate PR comments (most importantly, change inheritance for ray strategies)

* First draft tutorial

* PR comments, refactored RayDDPStrategy

* PR comments (super)

* PR comments (refactored dataloader function in RayTorchTrainer)

* PR comments

* Linting

* Remove else and line break

* Removed patch version specifications, refactored slurm launcher script

* Added how-it-works for HPO, updated HPO tutorial

* Removed unused export in slurm script, removed abstract train method in RayTorchTrainer

* Bugfix in the search alg/ scheduler setting and linting

* Pyproject versions (PR comment)

* I had already done that actually, so nevermind... Undoing the last commit

* Updated tutorial

* Link and reference fixes

* bash linting error

* Added MNIST data to gitignore

* Removed MNIST files for testing

* PR comments

* Phrasing of one sentence

* Updated .gitignore

* Duplicate things

---------

Co-authored-by: Matteo Bunino <[email protected]>
Co-authored-by: Jarl Sæther <[email protected]>
3 people authored Dec 10, 2024
1 parent c5f7bc0 commit affbf51
Showing 12 changed files with 821 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,7 +2,7 @@
*_logs
logs_*
TODO
-/data
+data/
nohup*
tmp*
.tmp*
1 change: 1 addition & 0 deletions docs/conf.py
@@ -40,6 +40,7 @@
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.viewcode",
"sphinx_tabs.tabs",
"myst_parser",
"nbsphinx",
"sphinx.ext.napoleon",
139 changes: 139 additions & 0 deletions docs/how-it-works/hpo/explain-hpo.rst
@@ -0,0 +1,139 @@
.. _explain_hpo:

Hyperparameter Optimization
============================

**Author(s)**: Anna Lappe (CERN)

Hyperparameter optimization (HPO) is a core technique for improving machine learning model
performance. This page introduces the concepts behind HPO, covering key elements like
hyperparameters, search algorithms, schedulers, and more.
It also outlines the benefits and drawbacks of HPO, helping you make informed decisions when
applying it with itwinai.


Key Concepts
-------------

**Hyperparameters** are parameters in a machine learning model or training process that are set
before training begins. They are not learned from the data but can have a significant impact
on the model's performance and training efficiency. Examples of hyperparameters include:

* Learning rate
* Batch size
* Number of layers in a neural network
* Regularization coefficients (e.g., L2 penalty)

HPO is the process of systematically searching for the optimal set of hyperparameters to
maximize model performance on a given task.

The **search space** defines the range of values each hyperparameter can take. It may include
discrete values (e.g., [32, 64, 128]) or continuous ranges (e.g., learning rate from 1e-5 to 1e-1).
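
A minimal sketch of how such a search space could be written with Ray Tune (the HPO framework used by
itwinai, introduced later on this page) is shown below; the parameter names and ranges are arbitrary
examples, not recommendations:

.. code-block:: python

    from ray import tune

    # Example search space: a discrete choice for the batch size, a log-uniform
    # continuous range for the learning rate, and a uniform range for dropout.
    search_space = {
        "batch_size": tune.choice([32, 64, 128]),
        "learning_rate": tune.loguniform(1e-5, 1e-1),
        "dropout": tune.uniform(0.0, 0.5),
    }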

**Search algorithms** explore the hyperparameter search space to identify the best configuration.

Common approaches include:

* Grid Search: Exhaustive search over all combinations of hyperparameter values.
* Random Search: Randomly samples combinations from the search space.
* Bayesian Optimization: Uses a probabilistic surrogate model to learn the shape of the search space and predict promising configurations.
* Evolutionary Algorithms: Evolve hyperparameter configurations using population-based evolutionary concepts, e.g. genetic programming.

**Schedulers** manage the allocation of computational resources across multiple hyperparameter
configurations. They help prioritize promising configurations and terminate less effective
ones early to save resources.

Examples include:

* ASHA (Asynchronous Successive Halving Algorithm): Allocates resources by successively discarding the lowest-performing hyperparameter combinations.
* Median Stopping Rule: Stops trials that perform below the median of completed trials.
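
As an illustration, an ASHA scheduler could be configured in Ray Tune roughly as follows; the concrete
values (maximum epochs, grace period, reduction factor) are arbitrary assumptions:

.. code-block:: python

    from ray.tune.schedulers import ASHAScheduler

    # Stop unpromising trials early: every trial runs for at least `grace_period`
    # epochs, and at each halving step only the best fraction of trials survives,
    # up to a maximum of `max_t` epochs per trial.
    scheduler = ASHAScheduler(
        time_attr="training_iteration",
        max_t=20,
        grace_period=2,
        reduction_factor=2,
    )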

The **evaluation metric** determines the performance of a model on a validation set.
Common metrics include accuracy, F1 score, and mean squared error.
The choice of metric depends on the task and its goals.

A **trial** is the evaluation of one set of hyperparameters. Depending on whether you are
using a scheduler, this could be the entire training run, i.e. as many epochs as you
have specified, or it could be terminated early and thus run for fewer epochs.


When to Use HPO and Key Considerations
---------------------------------------
HPO can significantly enhance a model's predictive accuracy and generalization to unseen data
by finding the best hyperparameter settings.
However, there are some drawbacks, especially with regard to computational cost and resource
management. Distributed HPO in particular requires careful planning of computational resources
to avoid bottlenecks or excessive costs. Keep in mind that if you run four different trials on the
same amount of resources you would normally use for a single training, your training will take roughly four times as long.

Because of this, we want to design our HPO training wisely, so that we avoid unnecessary
computational cost. The following points might help you when you are getting started with HPO.

HPO is beneficial when:

* model performance is sensitive to hyperparameter settings.
* you have access to sufficient computational resources.
* good hyperparameter settings are difficult to determine manually.

To make sure you get the most out of your HPO training, there are some considerations you might want to keep in mind:

* Define the search space wisely: narrow it to plausible ranges to improve efficiency.
* Choose appropriate metrics: use metrics aligned with your task's goals. Some search algorithms also support multi-objective optimization, e.g. if you want to track both accuracy and L1 loss; just make sure that the metrics you track conform with your search algorithm.
* Allocate your resources strategically: balance the computational cost against the expected performance gains. This is what the scheduler is for, and it is generally a good idea to use one, unless you expect your objective function to be extremely heterogeneous, i.e. the performance of a hyperparameter configuration on the first (for example) ten epochs is not a good indicator of its future performance at all. You might also have experience in training your model and want to account for additional behaviors; for this there are additional parameters you may set, such as a grace period (a minimum number of iterations a configuration is allowed to run). The sketch below puts these pieces together.
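
As a rough sketch of how these pieces fit together in plain Ray Tune (outside of itwinai), a tuning run
could look like the following; the toy objective, metric name and trial budget are made-up examples,
and the exact API may differ slightly between Ray versions:

.. code-block:: python

    from ray import train, tune
    from ray.tune.schedulers import ASHAScheduler

    def trainable(config):
        # Placeholder objective: stands in for a real training loop that would
        # build a model from `config` and report a validation metric per epoch.
        for epoch in range(10):
            loss = (config["learning_rate"] - 0.01) ** 2 + 1.0 / (epoch + 1)
            train.report({"loss": loss})

    tuner = tune.Tuner(
        # Resources granted to each trial; add e.g. "gpu": 1 on a GPU node.
        tune.with_resources(trainable, {"cpu": 2}),
        param_space={"learning_rate": tune.loguniform(1e-5, 1e-1)},
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
            scheduler=ASHAScheduler(max_t=10, grace_period=2),
            num_samples=8,  # number of trials to sample from the search space
        ),
    )
    results = tuner.fit()
    print(results.get_best_result().config)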


Hyperparameter Optimization in itwinai
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that we know the key concepts behind HPO, we can explore how these are implemented in itwinai.
We'll introduce distributed HPO, describe the architecture and operation of the ``RayTorchTrainer``,
and see that with the itwinai HPO integration, you can start optimising the hyperparameters of your
models with very minimal changes to your existing itwinai pipeline.


Ray Overview
-------------

We use an open-source framework called Ray to facilitate distributed HPO. Ray provides two key
components used in itwinai:

* **Ray Train**: A module for distributed model training.
* **Ray Tune**: A framework for hyperparameter optimization, supporting a variety of search algorithms and schedulers.

Ray uses its own cluster architecture to distribute training. A Ray cluster consists of a group
of nodes that work together to execute distributed tasks. Each node can contribute computational
resources, and Ray schedules and manages these resources.

How a Ray Cluster Operates:

#. **Node Roles**: A cluster includes a head node (orchestrator) and worker nodes (executors).
#. **Task Scheduling**: Ray automatically schedules trials across nodes based on available resources.
#. **Shared State**: Nodes share data such as checkpoints and trial results via a central storage path.

We launch a Ray cluster using a dedicated SLURM job script; you may refer to `this script <https://github.com/interTwin-eu/itwinai/blob/main/tutorials/hpo-workflows/slurm_hpo.sh>`_.
It should be suitable for almost any case in which you wish to run an itwinai pipeline with Ray; the only
thing you may have to change is the ``#SBATCH`` directives to set the proper resource requirements.
Also refer to the `Ray documentation <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_
on this topic if you want to learn more about how to launch a Ray cluster with SLURM.


How Distributed Training Works with the RayTorchTrainer
--------------------------------------------------------

The ``RayTorchTrainer`` combines components from **Ray Train** and **Ray Tune**, enabling
distributed HPO to run within your pipeline while maintaining compatibility with other itwinai features.
Because it implements the same interface as the itwinai ``TorchTrainer``, you can easily
replace the itwinai ``TorchTrainer`` with the ``RayTorchTrainer`` in your pipeline with only a few modifications.
The key features of this trainer are:

#. **Compatibility**: Use all itwinai components (loggers, data getters, splitters, and so on) with the ``RayTorchTrainer``.
#. **Flexibility**: Distributed HPO works with various search algorithms and schedulers supported by Ray Tune.
#. **Minimal Code Changes**: Replace the ``TorchTrainer`` with the ``RayTorchTrainer`` with very minimal code changes and you're ready to run HPO.

In the ``TorchTrainer``, initialization tasks (e.g., model creation, logger setup) are done
outside of the ``train()`` function. However, in the ``RayTorchTrainer``, this logic must be
moved inside ``train()``, because Ray executes only the ``train()`` function for each trial independently, and a trial's resources are allocated only once ``train()`` is called.
Furthermore, distribution frameworks such as DDP or DeepSpeed are agnostic of the other trials, so they should be initialized only after the trial's resources have been allocated.
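
Schematically (this is an illustration of the structure only, not the actual itwinai ``RayTorchTrainer``
code), the per-trial setup therefore lives inside the training function, which Ray invokes only after the
trial's resources have been granted:

.. code-block:: python

    import torch
    from torch import nn

    def train(config):
        # Ray calls this function once per trial, on the workers assigned to that
        # trial. Only at this point are the trial's resources available, so the
        # model, optimizer, logger and the distribution framework (e.g. DDP or
        # DeepSpeed) are created here rather than beforehand.
        model = nn.Linear(16, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["learning_rate"])
        # ... distributed setup, data loading and the training loop follow here.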

For a hands-on tutorial for how to change your existing itwinai pipeline code to additionally
run HPO, or how to set up an HPO integration with itwinai from scratch, have a look at the
:doc:`HPO tutorial <../../tutorials/hpo-workflows/hpo-workflows>`.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -65,6 +65,7 @@ contains thoroughly tested features aligned with the toolkit's most recent release
how-it-works/training/training
how-it-works/loggers/explain_loggers
how-it-works/workflows/explain_workflows
how-it-works/hpo/explain-hpo

.. toctree::
:maxdepth: 2