
Fix grad accum + FSDP CPU offload, pass None via CLI #1941

Merged: 1 commit into pytorch:main on Nov 1, 2024

Conversation

ebsmothers (Contributor) commented on Nov 1, 2024:

Fixes #1939

Two small fixes in this PR.

The first fixes a crash caused by not moving our grad scaler to CPU when CPU offloading is enabled. IMO it's cleanest to handle this directly in the utility by inferring the right device from the first parameter we see, rather than relying on the FSDP CPU offload flag.
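
A rough sketch of the idea (hypothetical code, not the torchtune utility itself): walk the model's parameters, infer the device from the first one we see, move the scaler there, and scale the gradients in place, so this works whether the parameters live on GPU or are CPU-offloaded by FSDP.

import torch
import torch.nn as nn

def scale_grads_sketch(model: nn.Module, scaler: torch.Tensor) -> None:
    # Infer the right device from the first parameter rather than trusting
    # the FSDP CPU-offload flag: with CPU offload enabled the parameters
    # (and their grads) live on CPU, so the scaler must follow them.
    device = None
    for p in model.parameters():
        if device is None:
            device = p.device
            scaler = scaler.to(device)
        if p.grad is not None:
            p.grad /= scaler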

The second one is a bit of a hack, but it makes it possible to pass some_config_field=None from the CLI and have it mean None in the Python sense. The tradeoff is that we can never use the literal string "None" in configs or CLI overrides, but it removes the confusion caused by OmegaConf expecting some_config_field=null and otherwise parsing None as the string "None".
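
A minimal illustration of the parsing behavior (a standalone snippet, not the PR's code; the real change lives in torchtune/config/_utils.py and is excerpted in the diff further down): rewrite the literal string "None" to YAML null before handing the override to OmegaConf.

from omegaconf import OmegaConf

override = "clip_grad_norm=None"
k, v = override.split("=", maxsplit=1)
if v == "None":
    v = "null"  # YAML null; OmegaConf parses this as Python None

cfg = OmegaConf.from_dotlist([f"{k}={v}"])
assert cfg.clip_grad_norm is None  # without the rewrite this would be the string "None"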

Test plan:

Fix 1:

tune run --nproc_per_node 2 full_finetune_distributed --config llama3/8B_full \
gradient_accumulation_steps=2 fsdp_cpu_offload=True

On main:

...
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

On this PR:

...
1|1|Loss: 3.1665189266204834:   0%| | 1/6500 [00:25<46:53:08, 25.97s/it]

Fix 2:

tune run full_finetune_single_device --config llama3/8B_full_single_device \
clip_grad_norm=None

On main:

...
    raise RuntimeError(
RuntimeError: Gradient clipping is not supported with optimizer in bwd.Please set clip_grad_norm=None, or optimizer_in_bwd=False.

On this PR:

...
1|5|Loss: 2.147994041442871:   0%| | 5/26001 [00:08<7:58:13,  1.10s/it]

pytorch-bot (bot) commented on Nov 1, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1941

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 216d1c3 with merge base f560cbb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Nov 1, 2024.
codecov-commenter commented:

Codecov Report

Attention: Patch coverage is 16.66667% with 5 lines in your changes missing coverage. Please review.

Project coverage is 65.98%. Comparing base (f560cbb) to head (216d1c3).

Files with missing lines             Patch %   Lines
torchtune/training/_grad_scaler.py     0.00%   4 Missing ⚠️
torchtune/config/_utils.py            50.00%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1941      +/-   ##
==========================================
- Coverage   68.39%   65.98%   -2.42%     
==========================================
  Files         311      311              
  Lines       16901    16907       +6     
==========================================
- Hits        11560    11156     -404     
- Misses       5341     5751     +410     

☔ View full report in Codecov by Sentry.

@@ -173,6 +173,11 @@ def _merge_yaml_and_cli_args(yaml_args: Namespace, cli_args: List[str]) -> DictC
         # key string to reflect this
         if k in yaml_kwargs and _has_component(yaml_kwargs[k]):
             k += "._component_"
 
+        # None passed via CLI will be parsed as string, but we really want OmegaConf null
+        if v == "None":
A contributor commented on the added line:

Should it be v.lower() == "none"? To avoid the None/none case.

ebsmothers (author) replied:

I might leave it as is. I know it's not a heavily used API in the library as of today, but e.g.

ac_mode (str): Activation checkpointing mode. ['none', 'full', 'selective']

Since we already have a case that's using 'none' as a string, I don't want to mess with that.
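
A tiny illustration (not from the PR) of the tradeoff being discussed here:

value = "none"  # a user intentionally passing the string 'none', e.g. ac_mode='none'

exact = (value == "None")                     # False: 'none' survives as a string (what the PR does)
case_insensitive = (value.lower() == "none")  # True: 'none' would also be nulled, breaking ac_mode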

felipemello1 (Contributor) left a review:

LGTM! Just a small comment

ebsmothers merged commit bc4acc1 into pytorch:main on Nov 1, 2024 (17 checks passed).
ebsmothers mentioned this pull request on Nov 26, 2024.
Labels: CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Linked issue this PR may close: clip_grad_norm=None doesn't work
4 participants