aurelienmorgan/retrain-pipelines

retrain-pipelines simplifies the creation and management of machine learning retraining pipelines. The package is designed to remove the complexity of building end-to-end ML retraining pipelines, allowing users to focus on their data and model architecture. With pre-built, highly adaptable pipeline examples that work out of the box, users can easily integrate their own data and begin retraining models with minimal-to-no setup.

Key features of retrain-pipelines include:

  • Model version blessing: Automatically compare the performance of retrained models against previous best versions to ensure only superior models are deployed.
  • Infrastructure validation: Each retraining pipeline includes inference pipeline packaging, local Docker container deployment, and request/response validation to ensure that models are production-ready.
  • Comprehensive documentation: Every retraining pipeline is fully documented with sections covering Exploratory Data Analysis (EDA), hyperparameter tuning, retraining steps, model performance metrics, and key commands for retrieving training artifacts. Additionally, DAG information for the retraining process is readily available for pipeline transparency and debugging.

In essence, retrain-pipelines offers a seamless solution: "Come with your data and it works" with the added benefit of flexibility for more advanced users to adjust and extend pipelines as needed.

Customizability & Adaptability

retrain-pipelines offers a high degree of flexibility, allowing users to tailor the pre-shipped pipelines to their specific needs:

  • Custom Preprocessing Functions: Users can provide their own Python functions for custom data preprocessing. For example, some built-in pipelines for tabular data allow optional bucketization of numerical features by name, but you can easily modify or extend these preprocessing steps to suit your dataset and feature requirements.
  • Custom Pipeline Card Generation: You can specify custom Python functions to generate pipeline cards, such as including specific performance charts or metrics relevant to your use case.
  • Custom HTML Templates: For further personalization, retrain-pipelines supports customizable HTML templates, enabling you to adjust formatting, insert specific charts, change page colors, or even add your company's logo to documentation pages.
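As an illustration of the custom preprocessing point above, bucketizing numerical features by name could look like the minimal sketch below. The function name `bucketize_features`, its parameters, and the list-of-dicts record layout are hypothetical and are not part of the retrain-pipelines API; the preprocessing module shipped with a given sample pipeline may be organized quite differently.

```python
from bisect import bisect_right


def bucketize_features(records, boundaries_by_feature):
    """Replace selected numerical features with a bucket index.

    Hypothetical helper for illustration, not part of retrain-pipelines.
    records: list of dicts, one per row.
    boundaries_by_feature: {feature_name: sorted list of bucket edges}.
    """
    bucketized = []
    for record in records:
        new_record = dict(record)  # copy, so the input rows stay untouched
        for name, edges in boundaries_by_feature.items():
            if name in new_record:
                # bisect_right counts how many edges are <= value,
                # which is the index of the bucket the value falls into
                new_record[name] = bisect_right(edges, new_record[name])
        bucketized.append(new_record)
    return bucketized


rows = [{"age": 17, "income": 1200.0}, {"age": 42, "income": 5300.0}]
result = bucketize_features(rows, {"age": [18, 35, 65]})
```

Only the features named in the mapping are touched, so the same function can be reused across datasets by changing the mapping alone.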

retrain-pipelines doesn't just streamline the retraining process; it helps teams innovate faster, iterate smarter, and deploy more robust models with confidence. Whether you're looking for an out-of-the-box solution or a highly customizable pipeline, retrain-pipelines is a companion for continuous model improvement.

Getting Started

You can trigger a retrain-pipelines launch from many different places.

local_launcher.webm

Sample pipelines

The retrain-pipelines package comes with off-the-shelf machine learning retraining pipelines. Find them at /sample_pipelines. For instance:

| framework | modality | task | model lib | serving | |
|---|---|---|---|---|---|
| Metaflow | Tabular | regression | Dask / LightGBM | ML Server | starter-kit |
| Metaflow | Tabular | classification | PyTorch / TabNet | TorchServe | starter-kit |

You can simply give one of those your data and it just runs. The only manual change required concerns the endpoint request and serving signatures, which are purposely hard-coded.
Indeed, the infra_validator step ensures that your inference pipeline (the one for which you are building continuous-retraining automation) keeps adhering to the schema expected by consumers of the inference endpoint. So, if you change the format of the required raw input data, you need to create a new retraining pipeline and assign it a new unique name. This ensures that any interface disruption between the inference endpoint and its consumer(s) is intentional.
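The idea of a hard-coded request signature can be sketched as follows. This is only an illustration under stated assumptions: `EXPECTED_REQUEST_SCHEMA`, `check_request`, and the field names are hypothetical, and the actual infra_validator step in the sample pipelines may check the schema in a different way.

```python
# Hypothetical request signature: field name -> accepted Python type(s).
# The field names are invented for illustration.
EXPECTED_REQUEST_SCHEMA = {
    "age": (int, float),
    "income": (int, float),
    "city": (str,),
}


def check_request(record, schema=EXPECTED_REQUEST_SCHEMA):
    """Return a list of schema violations (empty when the record conforms)."""
    problems = []
    for field, accepted_types in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], accepted_types):
            problems.append(f"wrong type for field: {field}")
    # fields the consumers never agreed on are reported as well
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


ok = check_request({"age": 30, "income": 1.0, "city": "Paris"})
broken = check_request({"age": "30", "income": 1.0})
```

Because the signature lives in one constant, any change to it is explicit and easy to spot in review, which matches the intent that interface changes be deliberate.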

some important markers

One of the things that make retrain-pipelines stand out is its focus on strong MLOps fundamentals.

model blessing 🔽 retrain-pipelines evaluates each newly retrained model version against the previous model versions from the same retraining pipeline. This ensures that no lesser-performing model ever gets into production.
The default sample pipelines each come with built-in evaluation criteria, but you can customize those to your own requirements. For instance, you can include an evaluation of model performance on a particular sub-population, to guard against potential incoming biases.
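A blessing rule of this kind could be sketched as below. The function `is_blessed`, the metric names, and the "at least as good on every tracked metric" criterion are assumptions made for illustration; they are not the evaluation criteria built into the sample pipelines.

```python
def is_blessed(new_metrics, best_metrics, higher_is_better=True):
    """Decide whether a retrained model may replace the current best one.

    Hypothetical rule: the new model must be at least as good as the
    previous best on every metric tracked for that previous best,
    including any sub-population metric.
    """
    if best_metrics is None:
        # first run of the pipeline: nothing to compare against
        return True
    for name, best_value in best_metrics.items():
        new_value = new_metrics[name]
        if higher_is_better and new_value < best_value:
            return False
        if not higher_is_better and new_value > best_value:
            return False
    return True


previous_best = {"overall_accuracy": 0.90, "subpopulation_accuracy": 0.86}
candidate = {"overall_accuracy": 0.91, "subpopulation_accuracy": 0.85}
blessed = is_blessed(candidate, previous_best)
```

In this example the candidate improves overall but regresses on the sub-population metric, so it is not blessed; this is the kind of gate the sub-population evaluation is meant to provide.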
infrastructure validation 🔽 retrain-pipelines makes sure the inference endpoint is tested prior to deployment. We package the preprocessing engine together with the newly retrained (and blessed) model version and the ML server of choice, and deploy it locally. We then send an inference request to that temporary endpoint and check for an HTTP 200 OK response with a valid payload format.
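The final check described above (status code plus payload format) could be sketched as follows. The function name `validate_inference_response` and the `"predictions"` key are hypothetical; the actual payload format depends on the ML server used by a given pipeline.

```python
import json


def validate_inference_response(status_code, body, expected_key="predictions"):
    """Return True when the response looks like a valid inference result.

    Hypothetical check: HTTP 200, a JSON object as body, and a non-empty
    list under `expected_key` (the key name is an assumption).
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # body is not a string, or is not valid JSON
        return False
    if not isinstance(payload, dict):
        return False
    predictions = payload.get(expected_key)
    return isinstance(predictions, list) and len(predictions) > 0


valid = validate_inference_response(200, '{"predictions": [1.5]}')
server_error = validate_inference_response(500, '{"predictions": [1.5]}')
```

Keeping the check as a pure function of the status code and body makes it easy to unit test without starting a container.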
pipeline cards 🔽 retrain-pipelines is strongly opinionated about giving quick access to the information ML engineers care about when it comes to retraining and serving.
That's why it offers a central place that can be navigated efficiently with a minimal number of clicks.

Pipeline card sections (thumbnails): overview, EDA, overall retraining, hyperparameter tuning, key artifacts, pipeline DAG.
Browse a live example for yourself here on W3Schools Spaces (click "continue" on the W3Schools landing page)
Third-party integrations 🔽 TensorBoard, PyTorch Profiler, Weights and Biases. retrain-pipelines aims to make the information ML engineers care about centrally available to them.
illustration with WandB in the LightGBM_hp_cv_WandB sample pipeline 🔽 In the LightGBM_hp_cv_WandB sample pipeline, for instance, you can view details of the logging performed during the different training_job steps of a given run. Follow the guidance from the video below:
wandb_integration.webm

customizability 🔽 As alluded to above, ML engineers are given a lot of room to customize retrain-pipelines workflows.
For starters, the sample pipelines themselves are freely modifiable. But it goes far beyond that. You can go deep into customization, since the defaults for preprocessing and for pipeline_card are fully amendable as well.
illustration with the LightGBM_hp_cv_WandB sample pipeline 🔽 Start by getting the defaults you'd like to customize (any combination of the 3 below):
  • preprocessing.py module
  • pipeline_card.py module
  • template.html html template
cd sample_pipelines/LightGBM_hp_cv_WandB/

from retraining_pipeline import LightGbmHpCvWandbFlow

LightGbmHpCvWandbFlow.copy_default_preprocess_module(".", exists_ok=True)
LightGbmHpCvWandbFlow.copy_default_pipeline_card_module(".", exists_ok=True)
LightGbmHpCvWandbFlow.copy_default_pipeline_card_html_template(".", exists_ok=True)

Once you have updated any of them, you can launch a retrain-pipelines run that uses them:

%retrain_pipelines_local retraining_pipeline.py run \
  --pipeline_card_artifacts_path "." \
  --preprocess_artifacts_path "."

retrain-pipelines inspectors

Inspectors are convenience methods that abstract away some of the logic to get access to artifacts logged during retrain-pipelines runs.

For instance:

  • browse_local_pipeline_card 🔽 With this convenience method, programmatically open a pipeline_card without the need to browse and click through an ML-framework UI:
    from retrain_pipelines.inspectors import browse_local_pipeline_card
    
    browse_local_pipeline_card(mf_flow_name)

    This opens the pipeline_card in a web-browser tab, so you don't have to look for it. It's ideal for quick iteration during the drafting phase: developers can now run/resume and browse in a single chain of instructions.


  • get_execution_source_code 🔽 With this convenience method, programmatically access the versioned source code that was used for a particular retrain-pipelines run. This relies on the WandB integration:
    from retrain_pipelines.inspectors import get_execution_source_code
    
    for source_code_artifact in get_execution_source_code(mf_run_id=<your_flow_id>):
      print(f" - {source_code_artifact.name} {source_code_artifact.url}")

    You can even have those artifacts downloaded on the fly:

    from retrain_pipelines.inspectors import explore_source_code
    # download and open file explorer
    explore_source_code(mf_run_id=<your_flow_id>)

  • plot_run_all_cv_tasks 🔽 Specific to retrain-pipelines runs that involve data parallelism, this inspector method plots each individual hyperparameter-tuning cross-validation training job, showing details for every data-parallel worker.
    For example, for executions of the LightGbmHpCvWandbFlow sample pipeline (which employs Dask for data-parallel training), it is called as follows:
    from retrain_pipelines.inspectors.hp_cv_inspector import plot_run_all_cv_tasks
    
    plot_run_all_cv_tasks(mf_run_id=<your_flow_id>)

    with results looking like below for a run with 6 different sets of hp values, 2 cross-validation folds and with 4 Dask data-parallel workers :


  • and more.

launch tests

pytest -s tests

build from source

python -m build --verbose pkg_src

install from source (dev mode)

pip install -e pkg_src

install from remote source

pip install git+https://github.com/aurelienmorgan/retrain-pipelines.git@master#subdirectory=pkg_src

PyPI

Find us at https://pypi.org/project/retrain-pipelines/


Please consider dropping us a star! ⭐