docs(examples): create demos on large data volumes #126

deepyaman · 2024-07-01T20:01:29Z

Objective

Create two real-world reference projects that showcase Ibis and IbisML at scale.

Documented end-to-end ML projects, including:
- data ingestion
- data exploration (using Ibis; stretch: produce visualizations using existing Ibis integrations)
- data processing (including feature engineering using Ibis)
- train-test split (manually using Ibis)
- last-mile feature preprocessing (using IbisML)
- handoff to model (approach TBD)
- modeling (one using Dask-XGBoost on GPU, another using PyTorch)
- stretch: real-time inference
Ideally, these can be written up as a blog post or a series of blog posts in the future.
They can also be submitted to conferences.
It could be useful to track approximate time needed for each stage of the project (e.g. to confirm whether most time really is spent on feature engineering).
Capture lessons learned on model handoff that can inform future IbisML work in that area, if any is needed.
We also expect feedback across the rest of the pipeline, but model handoff is where we have the most uncertainty.
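
For the "train-test split (manually using Ibis)" step, one common approach is to hash a stable row key into buckets and assign buckets to train or test, so the split is deterministic and needs no shuffling of the full dataset. The sketch below illustrates only that idea in plain Python so it runs without a backend; the salt, bucket count, 80/20 ratio, and `game-N` keys are all assumptions for illustration, not decisions made in this issue. In Ibis, a similar bucket could likely be built from a key column's `hash()` expression taken modulo the bucket count.

```python
import hashlib

NUM_BUCKETS = 100  # assumed granularity: 1% per bucket


def bucket_for(key: str, salt: str = "split-v1") -> int:
    """Map a row key to a stable bucket in [0, NUM_BUCKETS).

    The salt is a hypothetical knob: changing it yields a different split.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def is_train(key: str, train_pct: int = 80) -> bool:
    """Rows whose bucket falls below train_pct go to the training set."""
    return bucket_for(key) < train_pct


# Hypothetical keys standing in for, e.g., game identifiers.
game_ids = [f"game-{i}" for i in range(1_000)]
train = [g for g in game_ids if is_train(g)]
test = [g for g in game_ids if not is_train(g)]
print(len(train), len(test))
```

Because the assignment depends only on the key and the salt, re-running the pipeline on the same data reproduces the same split, and all rows sharing a key land on the same side.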

- Lichess live win probability using distributed XGBoost
  - Full dataset size: >12 TB
- TBD using PyTorch
  - (Backup option) NYC taxi dataset
  - (Backup option) Bureau of Transportation Statistics full airline dataset

github-project-automation bot added this to Ibis planning and roadmap Jul 1, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jul 1, 2024

deepyaman assigned deepyaman and jitingxu1 Jul 1, 2024

deepyaman added this to the Q3 2024 milestone Jul 1, 2024

deepyaman added the documentation label Jul 1, 2024

lostmygithubaccount removed this from the Q3 2024 milestone Jul 17, 2024