
A more encompassing fix for offloading + ac #1936

Merged 1 commit into pytorch:main from storage_use_count_fix on Nov 4, 2024

Conversation

@janeyx99 janeyx99 (Contributor) commented Oct 31, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.
Fixes #1867

Changelog

What are the changes made in this PR?

  • Broaden the condition for calling record_stream so that it is triggered whenever the storage use count is higher than the expected baseline (a hedged sketch of the idea follows this list).
  • Add a test case (which fails before this PR) to make sure the issue is fixed. How does the test case work? See the notes I wrote for myself when contriving it:
    temp
  • Note: this uses a private storage use_count API. I'm working on a less sketchy looking API in core, but we don't want to halt this fix due to the progress there. We can always come back and change this to use the newer, nicer API.
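To make the first bullet concrete, here is a minimal, hedged sketch of the broadened check. The function name, the expected_refcount baseline, and the stream argument are illustrative assumptions, not the exact torchtune implementation; only the torch._C._storage_Use_Count call mirrors the PR.

import torch

# Hedged sketch only: names and the baseline are assumptions, not torchtune's
# exact code.
def maybe_record_stream(
    maybe_gpu_tensor: torch.Tensor,
    stream: torch.cuda.Stream,
    expected_refcount: int,
) -> None:
    # Private API used by this PR: returns the use count of the underlying
    # StorageImpl. Materializing the Python storage object below itself
    # contributes one reference (see the review discussion further down).
    storage = maybe_gpu_tensor.untyped_storage()
    storage_refcount = torch._C._storage_Use_Count(storage._cdata)
    # If anything beyond the expected references still points at this storage,
    # the unpacked tensor may be needed after the backward node runs, so tell
    # the CUDA caching allocator which stream consumes it.
    if storage_refcount > expected_refcount:
        maybe_gpu_tensor.record_stream(stream)

record_stream keeps the caching allocator from handing the memory back out until the recorded stream's pending work has finished, which is what prevents the tensor-deletion data race from #1867.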

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Oct 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1936

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3c9b33c with merge base f560cbb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 31, 2024
janeyx99 added a commit to pytorch/pytorch that referenced this pull request Nov 1, 2024

Needed by pytorch/torchtune#1936

In favor over #139109, as exposing an existing API is better than adding a new one (and this enables a more robust fix)
@janeyx99 janeyx99 force-pushed the storage_use_count_fix branch from a326ab7 to 3c9b33c on November 1, 2024 20:54
@janeyx99 janeyx99 marked this pull request as ready for review November 1, 2024 20:55
@soulitzer soulitzer left a comment

Cool!

@codecov-commenter

Codecov Report

Attention: Patch coverage is 77.94118% with 15 lines in your changes missing coverage. Please review.

Project coverage is 68.43%. Comparing base (f560cbb) to head (3c9b33c).

Files with missing lines                                   Patch %   Lines
...s/torchtune/training/test_activation_offloading.py      80.30%    13 Missing ⚠️
torchtune/training/_activation_offloading.py                 0.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1936      +/-   ##
==========================================
+ Coverage   68.39%   68.43%   +0.04%     
==========================================
  Files         311      311              
  Lines       16901    16967      +66     
==========================================
+ Hits        11560    11612      +52     
- Misses       5341     5355      +14     


# unpacked tensor to exist after the backward node has executed.
storage_refcount = torch._C._storage_Use_Count(
    maybe_gpu_tensor.untyped_storage()._cdata
)

Under what conditions would the refcount not be 1? If maybe_gpu_tensor was gpu already and was used somewhere?

janeyx99 (Contributor Author):

It's usually 2, because calling the Python untyped_storage() creates a storage object that increases the count.
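For illustration, here is a small hedged snippet of that effect (it relies on a private API, so the exact count may vary across PyTorch versions):

import torch

t = torch.randn(4)

# Holding the Python UntypedStorage wrapper adds a reference to the underlying
# StorageImpl, so the reported count is typically 2: one from the tensor and
# one from `storage`.
storage = t.untyped_storage()
print(torch._C._storage_Use_Count(storage._cdata))  # usually 2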

loss = fwd(tensor)
# delete the fwd stash to avoid our peek-in-fwd-stash heuristic in the bwd
ctx.fwd_stash = {}
loss.backward()

what are you checking here?

janeyx99 (Contributor Author):

The assert is on line 158.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Nov 2, 2024
Would be nice to replace the torch._C._storage_Use_Count call in pytorch/torchtune#1936, at least without needing to know about _cdata in OSS code.

Initially keeping it private as Tensor._use_count is also private.

In favor over #139109 in solving the same problem, as exposing an existing API is better than adding a new one (and this enables a more robust fix)

Pull Request resolved: #139426
Approved by: https://github.com/soulitzer
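As a sidenote, the commit message above implies the replacement will mirror Tensor._use_count on the storage object. A hedged sketch of what call sites might eventually look like (the storage method name is an assumption inferred from the message, not confirmed in this thread):

import torch

t = torch.randn(4)
storage = t.untyped_storage()

# Current path used in this PR (requires reaching into the private _cdata):
count_via_cdata = torch._C._storage_Use_Count(storage._cdata)

# Hypothetical newer path hinted at by the commit message (method name assumed):
# count_via_method = storage._use_count()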
@joecummings joecummings (Contributor) left a comment

4/5 dentists prefer "A more encompassing fix for offloading + ac" compared to other leading brands

@felipemello1 felipemello1 (Contributor) left a comment

Thanks for the fix! Quick question, just as a sanity check: were you able to run a model with offloading=True and confirm it worked as before?

@@ -10,6 +10,8 @@
from torch import nn
from torchtune.training import OffloadActivations

NUM_GPU_CYCLES_IN_ONE_SEC = 2000000000 # 2e9 is ~1s worth of GPU cycles

Suggested change:
- NUM_GPU_CYCLES_IN_ONE_SEC = 2000000000 # 2e9 is ~1s worth of GPU cycles
+ NUM_GPU_CYCLES_IN_ONE_SEC = 2_000_000_000 # 2e9 is ~1s worth of GPU cycles
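As context for this constant, here is one plausible, hedged sketch of how a cycle count like this is used in stream-race tests, assuming the private torch.cuda._sleep helper (which enqueues a kernel that spins for roughly the given number of cycles); this is not necessarily the exact mechanism used in the test:

import torch

NUM_GPU_CYCLES_IN_ONE_SEC = 2_000_000_000  # 2e9 is ~1s worth of GPU cycles

if torch.cuda.is_available():
    side_stream = torch.cuda.Stream()
    with torch.cuda.stream(side_stream):
        # Keep the side stream busy for ~1s so that cross-stream reuse of
        # freed memory has a chance to surface as a visible race.
        torch.cuda._sleep(NUM_GPU_CYCLES_IN_ONE_SEC)
    torch.cuda.synchronize()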

janeyx99 (Contributor Author):

Yes, I ran the repro script to confirm that memory usage + correctness were as expected. We also have a super small unit test to ensure correctness. I did not run a recipe though.

@janeyx99 janeyx99 merged commit 9eced21 into pytorch:main Nov 4, 2024
17 checks passed
@janeyx99 janeyx99 deleted the storage_use_count_fix branch November 4, 2024 15:58
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
@ebsmothers ebsmothers mentioned this pull request Nov 26, 2024
44 tasks

Successfully merging this pull request may close these issues.

OffloadActivations(use_streams=True) producing NaN gradients: a tensor deletion data race
7 participants