docs(examples): create demos on large data volumes #126

Open
deepyaman opened this issue Jul 1, 2024 · 0 comments
Labels: documentation (Improvements or additions to documentation)

deepyaman (Collaborator) commented Jul 1, 2024

Objective

Create two real-world reference projects that showcase Ibis and IbisML at scale.

Outcomes

  • Documented end-to-end ML projects, including:

    • data ingestion
    • data exploration (using Ibis; stretch: produce visualizations using existing Ibis integrations)
    • data processing (including feature engineering using Ibis)
    • train-test split (manually using Ibis)
    • last-mile feature preprocessing (using IbisML)
    • handoff to model (approach TBD)
    • modeling (one using Dask-XGBoost on GPU, another using PyTorch)
    • stretch: real-time inference

    Ideally, these can be written up as (series of) blog posts in the future.
    They can also be submitted to conferences.
    It could be useful to track approximate time needed for each stage of the project (e.g. to confirm whether most time really is spent on feature engineering).

  • Lessons learned on model handoff that can inform future work in that area for IbisML (if any is necessary)

  • We also expect feedback across the rest of the pipeline, but model handoff is where we have the most uncertainty
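
The "train-test split (manually using Ibis)" stage above implies a split that is reproducible and does not depend on row order, which matters at this data volume. One common way to get that is to hash a stable record key and bucket the result. The sketch below shows only the idea in plain Python; it does not use the Ibis API, and the key format (`game-<n>`), the `assign_split` helper, and the 80/20 ratio are all hypothetical choices made for illustration. In the actual projects the same logic would presumably be written as a column expression that the backend evaluates.

```python
import hashlib


def assign_split(key: str, test_fraction: float = 0.2, num_buckets: int = 1000) -> str:
    """Deterministically assign a record to 'train' or 'test' from its key.

    The key is hashed, the hash is reduced to a bucket number, and buckets
    below the threshold go to the test set. The same key always lands in
    the same split, regardless of row order or how the data is partitioned.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return "test" if bucket < test_fraction * num_buckets else "train"


# Hypothetical record keys standing in for a stable ID column.
game_ids = [f"game-{i}" for i in range(10_000)]
splits = {game_id: assign_split(game_id) for game_id in game_ids}

# The observed share is close to, but not exactly, test_fraction.
test_share = sum(split == "test" for split in splits.values()) / len(splits)
print(f"test share: {test_share:.3f}")
```

Because the assignment depends only on the key, re-running the pipeline or adding new records never moves an existing record between splits. The trade-off is that the test share is approximate rather than exact.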

Projects

  • Lichess live win probability using distributed XGBoost
    • Full dataset size: >12TB
  • TBD using PyTorch
  • (Backup option) NYC taxi dataset
  • (Backup option) Bureau of Transportation Statistics full airline dataset
deepyaman added this to the Q3 2024 milestone Jul 1, 2024
deepyaman added the documentation label Jul 1, 2024
lostmygithubaccount removed this from the Q3 2024 milestone Jul 17, 2024
Projects
Status: backlog
Development

No branches or pull requests

3 participants