Set gloo process group for FSDP with CPU offload #2108
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2108
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit e4f00c4 with merge base 32e265d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
recipes/full_finetune_distributed.py
Outdated
if cfg.get("fsdp_cpu_offload", False):
    # Utilize all available CPU cores for intra-op parallelism. This provides ~2x
    # speed up when benchmarking fused AdamW on CPU
    training.set_torch_num_threads()
    process_group = "cuda:nccl,cpu:gloo"
n00b question: why not just make this the default every time, instead of "gloo" if cfg.device == "cpu" else "nccl"?
I actually wasn't sure myself why we do this. Well now I know. So yes, I think we can take your suggestion here
this seems to be low risk, since it was tested in the dcp PR already. Approving to unblock.
Addresses #1977.
As discussed in the issue (see this comment), FSDP's implementation of gradient clipping uses _NormPartial, which requires comms primitives (specifically all_reduce). This means that when running with CPU offloading we need to initialize the gloo process group to calculate the grad norm for DTensors on CPU. For simplicity, this PR enables it whenever CPU offloading is enabled, regardless of gradient clipping.
Test plan:
Added a test case for gradient clipping + CPU offload to test_full_finetune_distributed.py.
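For readers skimming the thread, the backend choice described above can be sketched as a small pure-Python helper. This is only an illustration, not the recipe's actual code: the function name `select_process_group_backend` and its parameters are hypothetical, and the only strings taken from this PR are "cuda:nccl,cpu:gloo", "gloo", and "nccl".

```python
def select_process_group_backend(device: str, fsdp_cpu_offload: bool) -> str:
    """Hypothetical helper: pick a backend string for the distributed process group.

    Assumption: `device` is a device type string such as "cpu" or "cuda",
    mirroring cfg.device in the review comment above.
    """
    if fsdp_cpu_offload:
        # With CPU offload, grad-norm computation runs all_reduce on CPU
        # DTensors, so a gloo backend is registered for CPU alongside nccl
        # for CUDA.
        return "cuda:nccl,cpu:gloo"
    # Previous behavior quoted in the review thread: one backend per device type.
    return "gloo" if device == "cpu" else "nccl"


if __name__ == "__main__":
    for device, offload in [("cuda", False), ("cuda", True), ("cpu", False)]:
        print(device, offload, select_process_group_backend(device, offload))
```

The returned string is in the device-to-backend form accepted by the `backend` argument of `torch.distributed.init_process_group`. Making the combined string the default in every case, as suggested in the review, would reduce the helper to a constant.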