Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compile_fn parameter for Trainer #20269

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

mieshkiwrk
Copy link

@mieshkiwrk mieshkiwrk commented Sep 10, 2024

Add support for compile_fn for Trainer for example to compile model after applying strategy

Example usage: needed to compile after applying DDP strategy to get pre/post forward also compiled

Fixes #20242


📚 Documentation preview 📚: https://pytorch-lightning--20269.org.readthedocs.build/en/20269/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 10, 2024
@mieshkiwrk
Copy link
Author

Both benchmarks checks failed due to timeout

@mieshkiwrk
Copy link
Author

bump

Copy link

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 81%. Comparing base (5be58f6) to head (4648ea2).
Report is 1 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (5be58f6) and HEAD (4648ea2). Click for more details.

HEAD has 553 uploads less than BASE
Flag BASE (5be58f6) HEAD (4648ea2)
cpu 147 21
lightning 106 16
pytest 87 2
python3.9 43 6
python3.10 42 6
lightning_fabric 25 0
gpu 4 2
python3.11 42 6
python3.12 20 3
pytorch2.1 38 12
pytest-full 64 21
pytorch2.3 9 3
pytorch_lightning 20 7
pytorch2.2 9 3
pytorch2.4 8 3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20269     +/-   ##
=========================================
- Coverage      89%      81%     -8%     
=========================================
  Files         267      264      -3     
  Lines       23084    23032     -52     
=========================================
- Hits        20585    18620   -1965     
- Misses       2499     4412   +1913     

@mieshkiwrk mieshkiwrk force-pushed the feature/trainer-compile-fn branch from 4648ea2 to 925c376 Compare October 24, 2024 06:21
@lantiga
Copy link
Collaborator

lantiga commented Nov 12, 2024

Thank you @mieshkiwrk.

The way we recommend users to use torch.compile with lightning is to call torch.compile on the model and then pass it to the trainer.

import torch
import lightning as L

model = MyLightningModule()

model = torch.compile(model)

trainer = L.Trainer()
trainer.fit(model)

This PR would add an additional entrypoint and there's probably a simpler way to go about it (for users).

We should replicate what Fabric does here: https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/wrappers.py#L421

where we capture the arguments passed to torch.compile so we can re-apply it when using strategies, just like we do in Fabric but in the Trainer:

#19280

Would you like to take a stab at it?

@lantiga lantiga added the waiting on author Waiting on user action, correction, or update label Nov 12, 2024
@mieshkiwrk
Copy link
Author

Let me try, looks like I see what's needed to be done

@mieshkiwrk
Copy link
Author

@lantiga, something like this would be fine? Wanted to make sure about re-using _unwrap_compiled and _to_compiled from fabric or should these be copied to lightning.pytorch wrappers?
If this approach would be okay I'll add comments, verify sanity checks and some unit test

@lantiga
Copy link
Collaborator

lantiga commented Nov 27, 2024

hey, thanks for updating the PR
the approach looks good! I don’t think we need reapply_compile, we will want this whenever we have an optimized model in input (the call is largely free, until we run the first forward)
I’d reuse Fabric’s wrapper, unless we need to tweak them in which case I’d duplicate them

@lantiga lantiga removed the waiting on author Waiting on user action, correction, or update label Dec 3, 2024
@lantiga
Copy link
Collaborator

lantiga commented Dec 3, 2024

Thanks for the update! Can you add tests, ideally testing that DDP and FSDP behave correctly when applied?

Later on (not for this PR), we should also look into ModelParallelStrategy, but that's for later.

@mieshkiwrk
Copy link
Author

Sure it's wip, it will just take a while due to limited free time

@lantiga lantiga added the waiting on author Waiting on user action, correction, or update label Dec 10, 2024
@mergify mergify bot removed the has conflicts label Jan 10, 2025
@mieshkiwrk
Copy link
Author

mieshkiwrk commented Jan 10, 2025

@lantiga Added 2 tests for DDP/FSDP, CI is failing due to:
##[error]Bash exited with code 137, which means it ran out of memory. Make sure the agent (container) host has sufficient memory configured.

Locally added tests passed for me on CPU, I dont have nvidia gpu for validation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pl Generic label for PyTorch Lightning package torch.compile waiting on author Waiting on user action, correction, or update
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add something like use_compile parameter for Trainer
2 participants