
Faster intermediate checkpoints with DCP async save in TorchTune #2006

Merged: 9 commits into pytorch:main on Dec 13, 2024

Conversation

@saumishr (Contributor) commented Nov 14, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

This diff introduces DistributedCheckpointing-based asynchronous checkpointing and enables it for intermediate checkpoints.

Experiments

Full Finetune Distributed
Ranks | Model Type | Current | DCP Async
1 | Llama 3.2 11B | 33.51s | 7.05s
8 | Llama 3.2 Vision 90B | 454.68s | 22.28s

Changelog

What are the changes made in this PR?

  • Introduces a DistributedCheckpointing-based checkpointer
  • Adds the ability to checkpoint asynchronously for intermediate checkpoints
  • Adds the ability to load the latest intermediate checkpoint via DistributedCheckpointing
  • Cleans up older checkpoints to save storage

Usage

  • Config enable_async_checkpointing enables asynchronous checkpoint saving for intermediate checkpoints.
  • Config resume_from_checkpoint enables resuming from the latest intermediate checkpoint. No recipe state needs to be provided, as it's already saved in the distributed checkpoint. A minimal sketch of the underlying API is shown below.
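
For reference, here is a minimal sketch of the underlying PyTorch API this feature builds on (torch.distributed.checkpoint.async_save). The state dict contents and path are illustrative only; in the recipe this is wired up through the new DistributedCheckpointer rather than called directly.

import torch
import torch.distributed.checkpoint as dcp

# Illustrative training state; the recipe assembles model, optimizer and recipe state.
state_dict = {"model": {"weight": torch.randn(4, 4)}, "steps_run": torch.tensor(10)}

# async_save stages the state dict and returns a Future right away, so training
# can continue on the GPUs while the write to storage happens in the background.
checkpoint_future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt/epoch_0")

# Wait for the previous save to finish before starting the next one (or at shutdown).
checkpoint_future.result()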

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2006

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 74ae681 with merge base cdaece1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 14, 2024
@felipemello1 (Contributor)

I will review it tomorrow. Thanks for the PR!

else None
),
)

if self._resume_from_checkpoint and self._enable_async_checkpointing:
Contributor

I understand that this allowed for fast verification of the checkpoint time, and the improvements look really great. But I don't think we want the internals of how checkpointing works exposed in the recipe. How can the user exposure between DCP and the standard checkpointer be consolidated so that the checkpointer logic only takes a few lines of code in the recipe?

@saumishr (Contributor Author), Nov 15, 2024

I agree. Checkpointing in general needs a fair bit of refactoring; there is plenty of code duplication. I wanted to avoid a major refactor here since I believe @joecummings is planning this out already. Happy to help on that.

@@ -633,11 +744,17 @@ def save_checkpoint(

intermediate_checkpoint = epoch + 1 < self.total_epochs

# If async checkpointing is enabled, intermediate checkpoints will
# be saved asynchronously.
if intermediate_checkpoint and self._enable_async_checkpointing:
Contributor

Is there ever a case where we wouldn't want async checkpointing for intermediate checkpoints? Is the flag really needed? Also, it should be possible to merge the async save and full save logic into the same function so that state dict code doesn't proliferate everywhere.

Contributor Author

@pbontrager This flag was added temporarily to make this an opt-in feature, which allows us to get more validation before we make it the default. Also, in case of any bugs, devs can turn it off and fall back to the prior behavior. Currently async save and full save happen via different Checkpointers with different formats, but we can probably create a helper method to move this logic out of the recipe code. I will refactor this a bit more.

@@ -614,6 +671,60 @@ def _setup_data(

return sampler, dataloader

def save_checkpoint_async(
Contributor

Would the idea be to eventually use async checkpoint everywhere and consolidate the files at the end? Or will we always do the all gather at the end?

Contributor Author

By "end", I assume you mean the final checkpoint at the end of fine-tuning?

The final checkpoint format should be the one the user configures in their job config. Intermediate checkpoints are only for fault tolerance, so we can choose the Checkpointer that best minimizes wasted training time and GPU hours, which in this case is DCP (DistributedCheckpointing) with async save. We will, however, provide DCP as another Checkpointer option, and users can choose to get the final checkpoint in that format as well, though we probably won't enforce it. A use case I can think of is users who train with TorchTitan, where DCP is the default Checkpointer; they can then plug their checkpoints into TorchTune directly for fine-tuning instead of converting to torch.save or HF formats.

@RdoubleA (Contributor) left a comment

I am unclear on whether the DistributedCheckpointer class will replace the HF or Meta checkpointers or whether it should be used in tandem with them for fault-tolerant intermediate checkpoints. How can users easily convert between checkpoint formats? Would it make sense to integrate the APIs into each checkpointer just for the purposes of async intermediate checkpoints, and then have the first load / final save still remain in the original HF / Meta format?

Maybe I just need to see the north star for the checkpoint redesign (@joecummings)

"No intermediate checkpoint found in the specified directory. Please ensure that a checkpoint exists."
)

if self._is_rank_zero:
Contributor

you can use the torchtune.utils.logging.log_rank_zero utility so you don't have to keep checking for rank
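
For reference, a minimal usage sketch of that utility, assuming it takes a logger and a message as torchtune's logging helpers do (the exact import path may differ):

from torchtune import utils

logger = utils.get_logger("INFO")

# Logs only on rank zero, replacing the explicit `if self._is_rank_zero:` guards.
utils.log_rank_zero(logger, "Loaded the latest intermediate checkpoint.")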

Contributor

I feel that these if is_rank_zero checks are most of the code :P

@saumishr (Contributor Author), Nov 18, 2024

Yeah. DCP APIs are trainer-agnostic, so really all the code in the integration is going to be state dict init & updates, along with the tests :)

_, rank = training.get_world_size_and_rank()
self._is_rank_zero = rank == 0

def _get_latest_intermediate_checkpoint_path(self) -> Optional[str]:
Contributor

this could be generally useful for all other checkpointers, cc @joecummings to consider this for the redesign

os.path.join(last_checkpoint, self._metadata_file)
):
if self._is_rank_zero:
logger.info(
Contributor

Do we want to automatically remove the previous checkpoint? Some users may want to keep them around so they can evaluate model checkpoints over the course of training.

Contributor Author

Without this, we will be limiting the number of checkpoints we can take, as it's bound by the storage available on the host, which can run out quickly for large models. For example, Llama 3.2 Vision 90B is almost 150-200 GB in size.
For internal trainers, we keep only the last checkpoint, since for fault tolerance we always restore the latest one. We can possibly make this a config so users can keep all checkpoints if they want to evaluate intermediate ones, or keep only the latest if the checkpoints are large and taken frequently.

Contributor

Yeah, agreed with making it configurable. For this PR I would actually lean towards keeping all intermediate checkpoints around, mainly because that's what we do for our other checkpointers and we shouldn't have diverging behavior across different checkpoint formats. I'd like to see us add a field like save_last_n_checkpoints to all checkpointers in one go.

Contributor Author

Yes, exactly. It seems like we would need a set of checkpointing configs for the user to tune the performance and behavior. We should do it in one go across Checkpointers and get these configs documented. I will get it cleaned up from this PR.

Contributor

Before making changes, can we align on the config to avoid extra work?

The options seem to be:
  • freq_save_intermediate_ckpt = N, where N=0 means we never save intermediate checkpoints.
  • keep_only_last_intermediate_ckpt = True/False; if True, we don't keep the older ones.

Is that it? Probably needs better naming.

Contributor Author

@felipemello1 We decided to keep the DistributedCheckpointer behavior in this PR consistent with the current Checkpointers, which is to keep all checkpoints persisted for now. I believe when we introduce this feature across Checkpointers, just having keep_last_n_checkpoints should be sufficient (a rough sketch of the policy follows below):

  • None -> all checkpoints are kept
  • 0 -> no intermediate checkpoints
  • 1 -> only the latest valid checkpoint is persisted (probably the default)
  • N (N > 1) -> the latest N checkpoints are kept
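
To make these semantics concrete, here is a rough sketch of what such a cleanup pass could look like. The helper name, the epoch_<N> directory layout and the numeric-suffix sort are assumptions for illustration only, not code from this PR.

import os
import shutil
from typing import Optional

def prune_checkpoints(output_dir: str, keep_last_n: Optional[int]) -> None:
    # Hypothetical helper implementing the keep_last_n_checkpoints policy above.
    if keep_last_n is None:
        return  # None -> keep every checkpoint
    # Assumes one directory per intermediate checkpoint, named epoch_<N>.
    ckpts = sorted(
        (
            d
            for d in os.listdir(output_dir)
            if d.startswith("epoch_") and os.path.isdir(os.path.join(output_dir, d))
        ),
        key=lambda name: int(name.split("_")[-1]),
    )
    # 0 -> delete everything; N -> keep only the newest N.
    for stale in ckpts[: max(len(ckpts) - keep_last_n, 0)]:
        shutil.rmtree(os.path.join(output_dir, stale))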

@saumishr (Contributor Author)

I am unclear on whether the DistributedCheckpointer class will replace the HF or Meta checkpointers or whether it should be used in tandem with them for fault-tolerant intermediate checkpoints. How can users easily convert between checkpoint formats? Would it make sense to integrate the APIs into each checkpointer just for the purposes of async intermediate checkpoints, and then have the first load / final save still remain in the original HF / Meta format?

Maybe I just need to see the north star for the checkpoint redesign (@joecummings)

Absolutely. Happy to brainstorm on the future direction whenever the proposal is ready.
As the model size and number of GPUs grow, the cost of rank-0 checkpointing will grow dramatically. My view is that it's worth providing DistributedCheckpointer as another Checkpointer, especially since it's the default one in TorchTitan. Due to its distributed nature, it will improve the save and load times for the first and final checkpoints as well, which are GPU-blocking and can't be async. Depending on adoption and user feedback, we can decide whether to make it the default.

For intermediate checkpoints, in this PR I preferred the first suggestion of making it work in tandem with the other Checkpointers. Explicitly positioning DCP as the default Checkpointer and format for intermediate checkpoints, while users control the first-load and final-save formats (with DCP also as an option), is much easier to understand than having the same Checkpointer produce two different formats for final and intermediate saves.

I will write up some checkpointing suggestions that you and @joecummings can consider as you plan out the refactor.

Comment on lines 712 to 745
dcp_saver = DistributedCheckpointer(
checkpoint_dir=self._checkpointer._checkpoint_dir,
output_dir=self._checkpointer._output_dir,
)
Contributor

Just want to check my understanding here. We would normally define the checkpointer as part of the config to be instantiated, but here you basically use enable_async_checkpointing as a proxy for whether we're using the DistributedCheckpointer or not. Is the idea behind this to have the ability to save and load intermediate checkpoints with DistributedCheckpointer but still save the final checkpoint in the HF/Meta format?

Contributor Author

Yes, exactly. I was less inclined to have users configure two different Checkpointers in the same job config: DistributedCheckpointer for intermediate checkpoints and the HF/Meta checkpointer for first load and final save. I exposed async checkpointing as a feature under the enable_async_checkpointing config instead, for two reasons:

  • DistributedCheckpointer is the only checkpointer that supports async checkpointing.
  • Users will probably want to control the first-load and final-save checkpoints. Intermediate checkpoints are for fault tolerance, so we have the flexibility to enforce the Checkpointer that works best for that.

Basically, the end state I was thinking of is the following (a rough sketch follows below):

  • For first load and final save, users can choose from any of the following Checkpointers: HF, Meta or DistributedCheckpointer. No async checkpointing here, since checkpoint loads and final saves will always need to be synchronous; we cannot finish the job until the entire final checkpoint has been persisted.
  • For intermediate checkpoints, we use async save with DistributedCheckpointer by default.
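
A rough sketch of how that split could look in the recipe's save path; the function signature and checkpointer method names here are illustrative, not the exact code in this PR.

def save_checkpoint(
    state_dict: dict,
    epoch: int,
    total_epochs: int,
    user_checkpointer,        # HF, Meta or DCP checkpointer from the job config
    dcp_checkpointer=None,    # DistributedCheckpointer used only for intermediate saves
    enable_async_checkpointing: bool = False,
) -> None:
    # Intermediate checkpoints exist for fault tolerance only: save them with DCP
    # async and return to training as soon as the state dict has been staged.
    intermediate_checkpoint = epoch + 1 < total_epochs
    if intermediate_checkpoint and enable_async_checkpointing:
        dcp_checkpointer.save_checkpoint(state_dict, epoch=epoch, async_save=True)
        return
    # First load and final save use the user-configured checkpointer and stay
    # synchronous, so the job never exits before the final checkpoint is persisted.
    user_checkpointer.save_checkpoint(state_dict, epoch=epoch)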

Comment on lines 206 to 218
resume_from_checkpoint: bool = (
False if self._enable_async_checkpointing else self._resume_from_checkpoint
)
Contributor

I'm a bit confused by this... is it just because of the if statement for self._update_recipe_state (i.e. we do not need to explicitly load state when using DCP)? If so, I think we should just use a different variable, e.g. should_load_recipe_state = self._resume_from_checkpoint and not self._enable_async_checkpointing, to be more explicit.

Contributor Author

Yes, that's right. I agree, the current way of avoiding loading the recipe state is a bit sloppy. I will update the variables to make this more readable.

Comment on lines 1063 to 1304
else:
logger.error(
f"Checkpoint failed to save asynchronously to {checkpoint_path} with the exception {f.exception()}"
)
Contributor

I see a bunch of exceptions coming from rmtree when I run this on my machine. E.g.

exception calling callback for <Future at 0x7f7415a06a50 state=finished returned Metadata>
Traceback (most recent call last):
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/concurrent/futures/_base.py", line 340, in _invoke_callbacks
    callback(self)
  File "/data/users/ebs/ebs-torchtune-alt/torchtune/training/checkpointing/_checkpointer.py", line 1062, in callback
    shutil.rmtree(last_checkpoint)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: '__2_46.distcp'

(However I am still able to save the checkpoints successfully)

Contributor Author

Ohh, my bad. Thanks for catching that; I somehow did not notice this in my testing. Basically the checkpoint deletion should happen on rank 0 only; otherwise two different ranks may race to delete the same file and run into the error above. It's not a problem during save, since DCP prepares the global plan and distributes it across ranks, so every rank knows exactly what needs to be saved and there is no collision there.
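
For illustration, a minimal sketch of the rank-0-guarded cleanup described here; the function name and argument are illustrative, and in the PR this logic runs from the async-save completion callback.

import shutil
import torch.distributed as dist

def cleanup_previous_checkpoint(last_checkpoint: str) -> None:
    # Only rank 0 deletes. If every rank calls rmtree on the same directory, the
    # ranks race and raise FileNotFoundError for files another rank already removed,
    # as in the traceback above.
    if not dist.is_initialized() or dist.get_rank() == 0:
        shutil.rmtree(last_checkpoint, ignore_errors=True)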

Contributor Author

Since you suggested removing this functionality from this PR for now, this logic will get cleaned up. I will record this issue in my doc of suggestions for the refactor, and we can add this for all Checkpointers in one go.

state_dict,
storage_writer=FileSystemWriter(
checkpoint_path,
thread_count=8,
@ebsmothers (Contributor), Nov 16, 2024

Out of curiosity, how is this value determined? (Similar question above when you set it to 16 for async_save)

@saumishr (Contributor Author), Nov 17, 2024

This is set according to the available storage I/O bandwidth: it determines the number of parallel I/O threads used while writing the data to storage. For our internal storage like Manifold, or even S3, we know the available capacity and tune this value accordingly. That way we use the maximum I/O capacity with optimal performance without getting throttled.

For the TorchTune case, since we are writing to disk, we may have to run some experiments and tune it according to the available IOPS. Ideally, this should also be exposed to the user as a checkpointing config, so they can tune it for the best performance depending on the underlying storage: HDD, SSD or cloud storage like S3.

I used 16 since that is the default for Manifold integrations internally. For disks, this number can be higher. Also, it only affects the upload speed to storage, so it has virtually no effect on the time async checkpointing blocks training.
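
As a sketch of exposing that knob (thread_count is an existing FileSystemWriter argument; surfacing it as a user-facing checkpointing config is the suggestion here, not something this PR adds):

import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

# Illustrative values; in the recipe these would come from the checkpointing config.
checkpoint_path = "/tmp/ckpt/epoch_0"
io_thread_count = 16  # tune to the IOPS of the target storage: HDD, SSD or S3

state_dict = {"weight": torch.randn(4, 4)}  # stand-in for the real training state

# thread_count controls how many parallel threads write shards to storage. It only
# affects upload speed, not how long training is blocked by the async save.
writer = FileSystemWriter(checkpoint_path, thread_count=io_thread_count)
future = dcp.async_save(state_dict, storage_writer=writer)
future.result()  # block only when the result is actually needed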

f"Checkpoint is saved asynchronously to {checkpoint_path}"
)

for index in range(epoch):
Contributor

Also a nit, but do you really need a for loop here if this runs every epoch? (i.e. isn't it sufficient to clean up the epoch n-1 checkpoint after each epoch n save?)

Contributor Author

True, we can do that. I added this code to handle edge cases where a prior checkpoint folder did not get cleaned up because the user cancelled the job, or the job errored out, while the cleanup was happening asynchronously.

@ebsmothers (Contributor) left a comment

Thank you for this PR, the results look great! I left a bunch of questions (some probably pretty basic) but I don't have any huge concerns here. The main things needed in my mind are:

  1. Testing. One thing that we should definitely test is that resuming from an intermediate DCP checkpoint works as expected. I think we can do that by modifying this recipe test. And unit tests for the DistributedCheckpointer class would also be nice, you can check the unit tests for our other checkpointers in this file.
  2. Figuring out some of the UX points raised by @pbontrager. I think there are some larger points that @joecummings will be figuring out and we probably don't need to block this PR on them. But I would definitely like to figure out how we can get to a place where there is less checkpointing code in the recipe and more in reusable utilities.

@saumishr (Contributor Author) commented Nov 17, 2024

Thank you for this PR, the results look great! I left a bunch of questions (some probably pretty basic) but I don't have any huge concerns here. The main things needed in my mind are:

  1. Testing. One thing that we should definitely test is that resuming from an intermediate DCP checkpoint works as expected. I think we can do that by modifying this recipe test. And unit tests for the DistributedCheckpointer class would also be nice, you can check the unit tests for our other checkpointers in this file.
  2. Figuring out some of the UX points raised by @pbontrager. I think there are some larger points that @joecummings will be figuring out and we probably don't need to block this PR on them. But I would definitely like to figure out how we can get to a place where there is less checkpointing code in the recipe and more in reusable utilities.

Thanks @ebsmothers, @pbontrager, @RdoubleA for the reviews. Very helpful!
I will be updating the PR with a robust set of UTs now. I did not do that in the first iteration since the RFC implementation could change drastically after the reviews and I would end up refactoring the tests :). For the load path, I validated the loss values and the progress status in the recipe state after training restarts from a checkpoint. I will address some of the comments and update the PR with UTs now. Thank you!
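
As a rough illustration of the kind of roundtrip test being discussed, here is a hypothetical unit test that saves a small state dict with DCP and loads it back; it exercises the DCP APIs directly rather than the new DistributedCheckpointer class, whose exact interface the final tests would use instead.

import torch
import torch.distributed.checkpoint as dcp

def test_dcp_save_load_roundtrip(tmp_path):
    original = {"layer.weight": torch.randn(8, 8), "steps_run": torch.tensor(3)}
    dcp.save(original, checkpoint_id=str(tmp_path / "epoch_0"))

    # dcp.load restores in place into a pre-allocated state dict of matching shapes.
    restored = {"layer.weight": torch.empty(8, 8), "steps_run": torch.tensor(0)}
    dcp.load(restored, checkpoint_id=str(tmp_path / "epoch_0"))

    torch.testing.assert_close(restored["layer.weight"], original["layer.weight"])
    assert restored["steps_run"].item() == 3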

@felipemello1 (Contributor) left a comment

Just left small comments since the code will still be refactored. As I was reading through it, I was wondering whether the same utilities will also work for LoRA. I think it's great that you are focusing on a single recipe for now, but that's just food for thought as you design it. Thanks for the PR! :)

training.SEED_KEY: 0,
training.EPOCHS_KEY: 0,
training.TOTAL_EPOCHS_KEY: 0,
training.MAX_STEPS_KEY: 0,
Contributor

This problem existed before this PR, but we need to keep track of the LR scheduler state too. Currently, when we restart from a checkpoint, I believe we don't restart from the same LR. It's one of the issues in this list: #1551

Contributor Author

#2017 introduces the lr_scheduler. Once both of these PRs land, it should be straightforward to add the checkpointing support. We can do it as a follow-up.


@saumishr saumishr force-pushed the torchtune_dcp branch 5 times, most recently from 9b47975 to daaf4b8 on November 19, 2024 09:02
@saumishr saumishr changed the title [DCP][RFC] DCP async checkpointing in TorchTune for intermediate checkpoints [WIP] [DCP][RFC] DCP async checkpointing in TorchTune for intermediate checkpoints Nov 19, 2024
@saumishr saumishr changed the title [DCP][RFC] DCP async checkpointing in TorchTune for intermediate checkpoints [DCP][RFC] Faster intermediate checkpoints with DCP async save in TorchTune Nov 19, 2024
@facebook-github-bot

@saumishr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@felipemello1 (Contributor) commented Dec 9, 2024

@saumishr, there were considerable changes made to the output path of the files. Can you double-check that we can still resume from checkpoint even if we are using DCP? #2074

Also, please check the folder structure. The assumption is that the user can do cp path_with_ckpt path_with_cfgs and have it ready to upload to HF. But if there are 50 DCP-related files in there, this will look a bit ugly.

I think we also need a clear solution for going from an intermediate ckpt -> HF format. As we discussed, many of our users run eval on the intermediate ckpts. Do we have one?

cc: @joecummings

@facebook-github-bot

@saumishr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@joecummings (Contributor)

@saumishr Does it matter if the Facebook Internal tests are failing?

@saumishr (Contributor Author)

@joecummings There are no tests failing. A CI/CD job is complaining that the diff needs to be updated with the latest. I'm validating the async checkpoints with the recent refactor @felipemello1 had done. I will notify here once the validation is done and the PR can be merged. Thank you!

@joecummings (Contributor)

@joecummings There are no tests failing. A CI/CD job is complaining that the diff needs to be updated with the latest. I'm validating the async checkpoints with the recent refactor @felipemello1 had done. I will notify here once the validation is done and the PR can be merged. Thank you!

Once this validation is confirmed, I'm happy to merge this!

@facebook-github-bot

@saumishr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 28.48837% with 246 lines in your changes missing coverage. Please review.

Project coverage is 67.70%. Comparing base (9cfa288) to head (74ae681).
Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
...htune/training/checkpointing/_checkpoint_client.py | 0.00% | 119 Missing ⚠️
tests/recipes/test_full_finetune_distributed.py | 14.85% | 86 Missing ⚠️
torchtune/training/checkpointing/_checkpointer.py | 58.57% | 29 Missing ⚠️
recipes/full_finetune_distributed.py | 0.00% | 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2006      +/-   ##
==========================================
- Coverage   67.75%   67.70%   -0.06%     
==========================================
  Files         334      337       +3     
  Lines       19281    19665     +384     
==========================================
+ Hits        13064    13314     +250     
- Misses       6217     6351     +134     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@joecummings joecummings added the triage review This issue should be discussed in weekly review label Dec 13, 2024
@joecummings joecummings changed the title [DCP][RFC] Faster intermediate checkpoints with DCP async save in TorchTune Faster intermediate checkpoints with DCP async save in TorchTune Dec 13, 2024
@joecummings joecummings merged commit c2c6f4a into pytorch:main Dec 13, 2024
17 checks passed
@joecummings joecummings removed the triage review This issue should be discussed in weekly review label Dec 13, 2024
rahul-sarvam added a commit to sarvamai/torchtune that referenced this pull request Dec 18, 2024
* Llama 3.3 70B (pytorch#2124)

* Llama 3.3 readme updates (pytorch#2125)

* update configs (pytorch#2107)

Co-authored-by: Felipe Mello <[email protected]>

* Reduce logging output for distributed KD (pytorch#2120)

* Support Early Exit Loss and/or Layer Dropout (pytorch#1076)

Co-authored-by: ebsmothers <[email protected]>

* Update checkpointing directory (pytorch#2074)

Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: vancoyendall <[email protected]>

* pass correct arg (pytorch#2127)

Co-authored-by: Felipe Mello <[email protected]>

* update configs (pytorch#2128)

Co-authored-by: Felipe Mello <[email protected]>

* fix qat_lora_test (pytorch#2131)

Co-authored-by: Felipe Mello <[email protected]>

* guard ckpt imports (pytorch#2133)

Co-authored-by: Felipe Mello <[email protected]>

* [bug fix] add parents=True (pytorch#2136)

Co-authored-by: Felipe Mello <[email protected]>

* [bug fix] re-add model (pytorch#2135)

Co-authored-by: Felipe Mello <[email protected]>

* Update save sizes into GiB (pytorch#2143)

* [bug fix] remove config download when source is kaggle (pytorch#2144)

Co-authored-by: Felipe Mello <[email protected]>

* [fix] remove "with_suffix" (pytorch#2146)

Co-authored-by: Felipe Mello <[email protected]>

* DoRA fixes (pytorch#2139)



Co-authored-by: Mircea Mironenco <[email protected]>

* [Fix] Llama 3.2 Vision decoder_trainable flag fixed (pytorch#2150)

* Small readme, config updates (pytorch#2157)

* Using `FormattedCheckpointFiles` in configs (pytorch#2147)

* Move ``get_world_size_and_rank`` to utils (pytorch#2155)

* Faster intermediate checkpoints with DCP async save in TorchTune (pytorch#2006)

Co-authored-by: Saurabh Mishra <[email protected]>

* torchdata integration - multi-dataset and streaming support (pytorch#1929)

* Allow higher version of lm-eval (pytorch#2165)

* Using `FormattedCheckpointFiles` in configs... round 2 (pytorch#2167)

* [EZ] Fix set_torch_num_threads in multi-node. (pytorch#2164)

---------

Co-authored-by: Philip Bontrager <[email protected]>
Co-authored-by: ebsmothers <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Joe Cummings <[email protected]>
Co-authored-by: Mostafa Elhoushi <[email protected]>
Co-authored-by: vancoyendall <[email protected]>
Co-authored-by: Mircea Mironenco <[email protected]>
Co-authored-by: salman <[email protected]>
Co-authored-by: Saurabh Mishra <[email protected]>
Co-authored-by: Saurabh Mishra <[email protected]>
Co-authored-by: Andrew Ho <[email protected]>
Co-authored-by: Eugen Hotaj <[email protected]>
rahul-sarvam pushed a commit to sarvamai/torchtune that referenced this pull request Dec 23, 2024
rahul-sarvam pushed a commit to sarvamai/torchtune that referenced this pull request Dec 23, 2024