
Faster intermediate checkpoints with DCP async save in TorchTune #2006

Merged: 9 commits into pytorch:main on Dec 13, 2024

Conversation

@saumishr (Contributor) commented Nov 14, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

This diff introduces DistributedCheckpointing-based asynchronous checkpointing and enables it for intermediate checkpoints.

Experiments

Full Finetune Distributed
Ranks | Model Type | Current | DCP Async
1 | Llama 3.2 11B | 33.51s | 7.05s
8 | Llama 3.2 Vision 90B | 454.68s | 22.28s

Changelog

What are the changes made in this PR?

  • Introduces a DistributedCheckpointing-based checkpointer
  • Adds the ability to checkpoint asynchronously for intermediate checkpoints
  • Adds the ability to load the latest intermediate checkpoint via DistributedCheckpointing
  • Cleans up older checkpoints to save storage

Usage

  • Config enable_async_checkpointing enables asynchronous checkpoint saving for intermediate checkpoints.
  • Config resume_from_checkpoint enables resuming from the latest intermediate checkpoint. No recipe state needs to be provided, as it's already saved in the distributed checkpoint. A minimal sketch of the underlying API is shown below.
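
For reference, here is a minimal sketch of the underlying PyTorch API this feature builds on (torch.distributed.checkpoint.async_save). The state dict contents and path are illustrative only; in the recipe this is wired up through the new DistributedCheckpointer rather than called directly.

import torch
import torch.distributed.checkpoint as dcp

# Illustrative training state; the recipe assembles model, optimizer and recipe state.
state_dict = {"model": {"weight": torch.randn(4, 4)}, "steps_run": torch.tensor(10)}

# async_save stages the state dict and returns a Future right away, so training
# can continue on the GPUs while the write to storage happens in the background.
checkpoint_future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt/epoch_0")

# Wait for the previous save to finish before starting the next one (or at shutdown).
checkpoint_future.result()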

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2006

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 74ae681 with merge base cdaece1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 14, 2024
@felipemello1 (Contributor)

I will review it tomorrow. Thanks for the PR!

else None
),
)

if self._resume_from_checkpoint and self._enable_async_checkpointing:
Contributor

I understand that this allowed for fast verification of the checkpoint time, and the improvements look really great. But I don't think we want the internals of how checkpointing works exposed in the recipe. How can the user exposure between DCP and the standard checkpointer be consolidated so that the checkpointer logic only takes a few lines of code in the recipe?

@saumishr (Contributor Author), Nov 15, 2024

I agree. Checkpointing in general needs a fair bit of refactoring; there is plenty of code duplication. I wanted to avoid a major refactor here since I believe @joecummings is planning this out already. Happy to help on that.

@@ -633,11 +744,17 @@ def save_checkpoint(

intermediate_checkpoint = epoch + 1 < self.total_epochs

# If async checkpointing is enabled, intermediate checkpoints will
# be saved asynchronously.
if intermediate_checkpoint and self._enable_async_checkpointing:
Contributor

Is there ever a case where we wouldn't want async checkpointing for intermediate checkpoints? Is the flag really needed? Also, it should be possible to merge the async save and full save logic into the same function so that state dict code doesn't proliferate everywhere.

Contributor Author

@pbontrager This flag was added temporarily to make this an opt-in feature, which allows us to get more validation before we make it the default. Also, in case of any bugs, devs can turn it off and fall back to the prior behavior. Currently async save and full save happen via different Checkpointers with different formats, but we can probably create a helper method to move this logic out of the recipe code. I will refactor this a bit more.

@@ -614,6 +671,60 @@ def _setup_data(

return sampler, dataloader

def save_checkpoint_async(
Contributor

Would the idea be to eventually use async checkpoint everywhere and consolidate the files at the end? Or will we always do the all gather at the end?

Contributor Author

By "end", I assume you mean the final checkpoint at the end of fine-tuning?

The final checkpoint format should be the one the user configures in their job config. Intermediate checkpoints are only for fault tolerance, so we can choose the Checkpointer that best minimizes wasted training time and GPU hours, which in this case is DCP (DistributedCheckpointing) with async save. We will, however, provide DCP as another Checkpointer option, and users can choose to get the final checkpoint in that format as well, though we probably won't enforce it. A use case I can think of is users who train with TorchTitan, where DCP is the default Checkpointer; they can then plug their checkpoints into TorchTune directly for fine-tuning instead of converting to torch.save or HF formats.

@RdoubleA (Contributor) left a comment

I am unclear on whether the DistributedCheckpointer class will replace the HF or Meta checkpointers or whether it should be used in tandem with them for fault-tolerant intermediate checkpoints. How can users easily convert between checkpoint formats? Would it make sense to integrate the APIs into each checkpointer just for the purposes of async intermediate checkpoints, and then have the first load / final save still remain in the original HF / Meta format?

Maybe I just need to see the north star for the checkpoint redesign (@joecummings)

"No intermediate checkpoint found in the specified directory. Please ensure that a checkpoint exists."
)

if self._is_rank_zero:
Contributor

you can use the torchtune.utils.logging.log_rank_zero utility so you don't have to keep checking for rank
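
For reference, a minimal usage sketch of that utility, assuming it takes a logger and a message as torchtune's logging helpers do (the exact import path may differ):

from torchtune import utils

logger = utils.get_logger("INFO")

# Logs only on rank zero, replacing the explicit `if self._is_rank_zero:` guards.
utils.log_rank_zero(logger, "Loaded the latest intermediate checkpoint.")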

Contributor

I feel that these if is_rank_zero checks are most of the code :P

@saumishr (Contributor Author), Nov 18, 2024

Yeah. DCP APIs are trainer-agnostic, so really all the code in the integration is going to be state dict init & updates, along with the tests :)

_, rank = training.get_world_size_and_rank()
self._is_rank_zero = rank == 0

def _get_latest_intermediate_checkpoint_path(self) -> Optional[str]:
Contributor

this could be generally useful for all other checkpointers, cc @joecummings to consider this for the redesign

os.path.join(last_checkpoint, self._metadata_file)
):
if self._is_rank_zero:
logger.info(
Contributor

Do we want to automatically remove the previous checkpoint? Some users may want to keep them around so they can evaluate model checkpoints over the course of training.

Contributor Author

Without this, we will be limiting the number of checkpoints we can take, as it's bound by the storage available on the host, which can run out quickly for large models. For example, Llama 3.2 Vision 90B is almost 150-200 GB in size.
For internal trainers, we keep only the last checkpoint, since for fault tolerance we always restore the latest one. We can possibly make this a config so users can keep all checkpoints if they want to evaluate intermediate ones, or keep only the latest if the checkpoints are large and taken frequently.

Contributor

Yeah, agreed with making it configurable. For this PR I would actually lean towards keeping all intermediate checkpoints around, mainly because that's what we do for our other checkpointers and we shouldn't have diverging behavior across different checkpoint formats. I'd like to see us add a field like save_last_n_checkpoints to all checkpointers in one go.

Contributor Author

Yes, exactly. It seems like we would need a set of checkpointing configs for the user to tune the performance and behavior. We should do it in one go across Checkpointers and get these configs documented. I will get it cleaned up from this PR.

Contributor

Before making changes, can we align on the config to avoid extra work?

The options seem to be:
  • freq_save_intermediate_ckpt = N, where N=0 means we never save intermediate checkpoints.
  • keep_only_last_intermediate_ckpt = True/False; if True, we don't keep the older ones.

Is that it? Probably needs better naming.

Contributor Author

@felipemello1 We decided to keep the DistributedCheckpointer behavior in this PR consistent with the current Checkpointers, which is to keep all checkpoints persisted for now. I believe when we introduce this feature across Checkpointers, just having keep_last_n_checkpoints should be sufficient (a rough sketch of the policy follows below):

  • None -> all checkpoints are kept
  • 0 -> no intermediate checkpoints
  • 1 -> only the latest valid checkpoint is persisted (probably the default)
  • N (N > 1) -> the latest N checkpoints are kept
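
To make these semantics concrete, here is a rough sketch of what such a cleanup pass could look like. The helper name, the epoch_<N> directory layout and the numeric-suffix sort are assumptions for illustration only, not code from this PR.

import os
import shutil
from typing import Optional

def prune_checkpoints(output_dir: str, keep_last_n: Optional[int]) -> None:
    # Hypothetical helper implementing the keep_last_n_checkpoints policy above.
    if keep_last_n is None:
        return  # None -> keep every checkpoint
    # Assumes one directory per intermediate checkpoint, named epoch_<N>.
    ckpts = sorted(
        (
            d
            for d in os.listdir(output_dir)
            if d.startswith("epoch_") and os.path.isdir(os.path.join(output_dir, d))
        ),
        key=lambda name: int(name.split("_")[-1]),
    )
    # 0 -> delete everything; N -> keep only the newest N.
    for stale in ckpts[: max(len(ckpts) - keep_last_n, 0)]:
        shutil.rmtree(os.path.join(output_dir, stale))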

@saumishr (Contributor Author)

I am unclear on whether the DistributedCheckpointer class will replace the HF or Meta checkpointers or whether it should be used in tandem with them for fault-tolerant intermediate checkpoints. How can users easily convert between checkpoint formats? Would it make sense to integrate the APIs into each checkpointer just for the purposes of async intermediate checkpoints, and then have the first load / final save still remain in the original HF / Meta format?

Maybe I just need to see the north star for the checkpoint redesign (@joecummings)

Absolutely. Happy to brainstorm on the future direction whenever the proposal is ready.
As the model size and number of GPUs grow, the cost of rank-0 checkpointing will grow dramatically. My view is that it's worth providing DistributedCheckpointer as another Checkpointer, especially since it's the default one in TorchTitan. Due to its distributed nature, it will improve the save and load times for the first and final checkpoints as well, which are GPU-blocking and can't be async. Depending on adoption and user feedback, we can decide whether to make it the default.

For intermediate checkpoints, in this PR I preferred the first suggestion of making it work in tandem with the other Checkpointers. Explicitly positioning DCP as the default Checkpointer and format for intermediate checkpoints, while users control the first-load and final-save formats (with DCP also as an option), is much easier to understand than having the same Checkpointer produce two different formats for final and intermediate saves.

I will write up some checkpointing suggestions that you and @joecummings can consider as you plan out the refactor.

Comment on lines 712 to 745
dcp_saver = DistributedCheckpointer(
checkpoint_dir=self._checkpointer._checkpoint_dir,
output_dir=self._checkpointer._output_dir,
)
Contributor

Just want to check my understanding here. We would normally define the checkpointer as part of the config to be instantiated, but here you basically use enable_async_checkpointing as a proxy for whether we're using the DistributedCheckpointer or not. Is the idea behind this to have the ability to save and load intermediate checkpoints with DistributedCheckpointer but still save the final checkpoint in the HF/Meta format?

Contributor Author

Yes, exactly. I was less inclined to have users configure two different Checkpointers in the same job config: DistributedCheckpointer for intermediate checkpoints and the HF/Meta checkpointer for first load and final save. I exposed async checkpointing as a feature under the enable_async_checkpointing config instead, for two reasons:

  • DistributedCheckpointer is the only checkpointer that supports async checkpointing.
  • Users will probably want to control the first-load and final-save checkpoints. Intermediate checkpoints are for fault tolerance, so we have the flexibility to enforce the Checkpointer that works best for that.

Basically, the end state I was thinking of is the following (a rough sketch follows below):

  • For first load and final save, users can choose from any of the following Checkpointers: HF, Meta or DistributedCheckpointer. No async checkpointing here, since checkpoint loads and final saves will always need to be synchronous; we cannot finish the job until the entire final checkpoint has been persisted.
  • For intermediate checkpoints, we use async save with DistributedCheckpointer by default.
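
A rough sketch of how that split could look in the recipe's save path; the function signature and checkpointer method names here are illustrative, not the exact code in this PR.

def save_checkpoint(
    state_dict: dict,
    epoch: int,
    total_epochs: int,
    user_checkpointer,        # HF, Meta or DCP checkpointer from the job config
    dcp_checkpointer=None,    # DistributedCheckpointer used only for intermediate saves
    enable_async_checkpointing: bool = False,
) -> None:
    # Intermediate checkpoints exist for fault tolerance only: save them with DCP
    # async and return to training as soon as the state dict has been staged.
    intermediate_checkpoint = epoch + 1 < total_epochs
    if intermediate_checkpoint and enable_async_checkpointing:
        dcp_checkpointer.save_checkpoint(state_dict, epoch=epoch, async_save=True)
        return
    # First load and final save use the user-configured checkpointer and stay
    # synchronous, so the job never exits before the final checkpoint is persisted.
    user_checkpointer.save_checkpoint(state_dict, epoch=epoch)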

Comment on lines 206 to 218
resume_from_checkpoint: bool = (
False if self._enable_async_checkpointing else self._resume_from_checkpoint
)
Contributor

I'm a bit confused by this... is it just because of the if statement for self._update_recipe_state (i.e. we do not need to explicitly load state when using DCP)? If so, I think we should just use a different variable, e.g. should_load_recipe_state = self._resume_from_checkpoint and not self._enable_async_checkpointing, to be more explicit.

Contributor Author

Yes, that's right. I agree, the current way of avoiding loading the recipe state is a bit sloppy. I will update the variables to make this more readable.

Comment on lines 1063 to 1304
else:
logger.error(
f"Checkpoint failed to save asynchronously to {checkpoint_path} with the exception {f.exception()}"
)
Contributor

I see a bunch of exceptions coming from rmtree when I run this on my machine. E.g.

exception calling callback for <Future at 0x7f7415a06a50 state=finished returned Metadata>
Traceback (most recent call last):
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/concurrent/futures/_base.py", line 340, in _invoke_callbacks
    callback(self)
  File "/data/users/ebs/ebs-torchtune-alt/torchtune/training/checkpointing/_checkpointer.py", line 1062, in callback
    shutil.rmtree(last_checkpoint)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/shutil.py", line 752, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: '__2_46.distcp'

(However I am still able to save the checkpoints successfully)

Contributor Author

Ohh, my bad. Thanks for catching that; I somehow did not notice this in my testing. Basically the checkpoint deletion should happen on rank 0 only; otherwise two different ranks may race to delete the same file and run into the error above. It's not a problem during save, since DCP prepares the global plan and distributes it across ranks, so every rank knows exactly what needs to be saved and there is no collision there.
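
For illustration, a minimal sketch of the rank-0-guarded cleanup described here; the function name and argument are illustrative, and in the PR this logic runs from the async-save completion callback.

import shutil
import torch.distributed as dist

def cleanup_previous_checkpoint(last_checkpoint: str) -> None:
    # Only rank 0 deletes. If every rank calls rmtree on the same directory, the
    # ranks race and raise FileNotFoundError for files another rank already removed,
    # as in the traceback above.
    if not dist.is_initialized() or dist.get_rank() == 0:
        shutil.rmtree(last_checkpoint, ignore_errors=True)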

Contributor Author

Since you suggested removing this functionality from this PR for now, this logic will get cleaned up. I will record this issue in my doc of suggestions for the refactor, and we can add this for all Checkpointers in one go.

state_dict,
storage_writer=FileSystemWriter(
checkpoint_path,
thread_count=8,
@ebsmothers (Contributor), Nov 16, 2024

Out of curiosity, how is this value determined? (Similar question above when you set it to 16 for async_save)

@saumishr (Contributor Author), Nov 17, 2024

This is set according to the available storage I/O bandwidth: it determines the number of parallel I/O threads used while writing the data to storage. For our internal storage like Manifold, or even S3, we know the available capacity and tune this value accordingly. That way we use the maximum I/O capacity with optimal performance without getting throttled.

For the TorchTune case, since we are writing to disk, we may have to run some experiments and tune it according to the available IOPS. Ideally, this should also be exposed to the user as a checkpointing config, so they can tune it for the best performance depending on the underlying storage: HDD, SSD or cloud storage like S3.

I used 16 since that is the default for Manifold integrations internally. For disks, this number can be higher. Also, it only affects the upload speed to storage, so it has virtually no effect on the time async checkpointing blocks training.
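
As a sketch of exposing that knob (thread_count is an existing FileSystemWriter argument; surfacing it as a user-facing checkpointing config is the suggestion here, not something this PR adds):

import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

# Illustrative values; in the recipe these would come from the checkpointing config.
checkpoint_path = "/tmp/ckpt/epoch_0"
io_thread_count = 16  # tune to the IOPS of the target storage: HDD, SSD or S3

state_dict = {"weight": torch.randn(4, 4)}  # stand-in for the real training state

# thread_count controls how many parallel threads write shards to storage. It only
# affects upload speed, not how long training is blocked by the async save.
writer = FileSystemWriter(checkpoint_path, thread_count=io_thread_count)
future = dcp.async_save(state_dict, storage_writer=writer)
future.result()  # block only when the result is actually needed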

f"Checkpoint is saved asynchronously to {checkpoint_path}"
)

for index in range(epoch):
Contributor

Also a nit, but do you really need a for loop here if this runs every epoch? (i.e. isn't it sufficient to clean up the epoch n-1 checkpoint after each epoch n save?)

Contributor Author

True, we can do that. I added this code to handle edge cases where a prior checkpoint folder did not get cleaned up because the user cancelled the job, or the job errored out, while the cleanup was happening asynchronously.

@ebsmothers (Contributor) left a comment

Thank you for this PR, the results look great! I left a bunch of questions (some probably pretty basic) but I don't have any huge concerns here. The main things needed in my mind are:

  1. Testing. One thing that we should definitely test is that resuming from an intermediate DCP checkpoint works as expected. I think we can do that by modifying this recipe test. And unit tests for the DistributedCheckpointer class would also be nice, you can check the unit tests for our other checkpointers in this file.
  2. Figuring out some of the UX points raised by @pbontrager. I think there are some larger points that @joecummings will be figuring out and we probably don't need to block this PR on them. But I would definitely like to figure out how we can get to a place where there is less checkpointing code in the recipe and more in reusable utilities.

@saumishr (Contributor Author) commented Nov 17, 2024

Thank you for this PR, the results look great! I left a bunch of questions (some probably pretty basic) but I don't have any huge concerns here. The main things needed in my mind are:

  1. Testing. One thing that we should definitely test is that resuming from an intermediate DCP checkpoint works as expected. I think we can do that by modifying this recipe test. And unit tests for the DistributedCheckpointer class would also be nice, you can check the unit tests for our other checkpointers in this file.
  2. Figuring out some of the UX points raised by @pbontrager. I think there are some larger points that @joecummings will be figuring out and we probably don't need to block this PR on them. But I would definitely like to figure out how we can get to a place where there is less checkpointing code in the recipe and more in reusable utilities.

Thanks @ebsmothers, @pbontrager, @RdoubleA for the reviews. Very helpful!
I will be updating the PR with a robust set of UTs now. I did not do that in the first iteration since the RFC implementation could change drastically after the reviews and I would end up refactoring the tests :). For the load path, I validated the loss values and the progress status in the recipe state after training restarts from a checkpoint. I will address some of the comments and update the PR with UTs now. Thank you!
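
As a rough illustration of the kind of roundtrip test being discussed, here is a hypothetical unit test that saves a small state dict with DCP and loads it back; it exercises the DCP APIs directly rather than the new DistributedCheckpointer class, whose exact interface the final tests would use instead.

import torch
import torch.distributed.checkpoint as dcp

def test_dcp_save_load_roundtrip(tmp_path):
    original = {"layer.weight": torch.randn(8, 8), "steps_run": torch.tensor(3)}
    dcp.save(original, checkpoint_id=str(tmp_path / "epoch_0"))

    # dcp.load restores in place into a pre-allocated state dict of matching shapes.
    restored = {"layer.weight": torch.empty(8, 8), "steps_run": torch.tensor(0)}
    dcp.load(restored, checkpoint_id=str(tmp_path / "epoch_0"))

    torch.testing.assert_close(restored["layer.weight"], original["layer.weight"])
    assert restored["steps_run"].item() == 3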

@felipemello1 (Contributor) left a comment

Just left small comments since the code will still be refactored. As I was reading through it, I was wondering whether the same utilities will also work for LoRA. I think it's great that you are focusing on a single recipe for now, but that's just food for thought as you design it. Thanks for the PR! :)

training.SEED_KEY: 0,
training.EPOCHS_KEY: 0,
training.TOTAL_EPOCHS_KEY: 0,
training.MAX_STEPS_KEY: 0,
Contributor

This problem existed before this PR, but we need to keep track of the LR scheduler state too. Currently, when we restart from a checkpoint, I believe we don't restart from the same LR. It's one of the issues in this list: #1551

Contributor Author

#2017 introduces the lr_scheduler. Once both of these PRs land, it should be straightforward to add the checkpointing support. We can do it as a follow-up.


@saumishr saumishr force-pushed the torchtune_dcp branch 5 times, most recently from 9b47975 to daaf4b8 on November 19, 2024 09:02
@saumishr saumishr changed the title [DCP][RFC] DCP async checkpointing in TorchTune for intermediate checkpoints [WIP] [DCP][RFC] DCP async checkpointing in TorchTune for intermediate checkpoints Nov 19, 2024
@saumishr saumishr changed the title [DCP][RFC] DCP async checkpointing in TorchTune for intermediate checkpoints [DCP][RFC] Faster intermediate checkpoints with DCP async save in TorchTune Nov 19, 2024
@facebook-github-bot

@saumishr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@felipemello1 (Contributor) commented Dec 9, 2024

@saumishr, there were considerable changes made to the output path of the files. Can you double-check that we can still resume from checkpoint even if we are using DCP? #2074

Also, please check the folder structure. The assumption is that the user can do cp path_with_ckpt path_with_cfgs and have it ready to upload to HF. But if there are 50 DCP-related files in there, this will look a bit ugly.

I think we also need a clear solution for going from an intermediate ckpt -> HF format. As we discussed, many of our users run eval on the intermediate ckpts. Do we have one?

cc: @joecummings

@facebook-github-bot

@saumishr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@joecummings (Contributor)

@saumishr Does it matter if the Facebook Internal tests are failing?

@saumishr (Contributor Author)

@joecummings There are no tests failing. A CI/CD job is complaining that the diff needs to be updated with the latest. I'm validating the async checkpoints with the recent refactor @felipemello1 had done. I will notify here once the validation is done and the PR can be merged. Thank you!

@joecummings (Contributor)

@joecummings There are no tests failing. A CI/CD job is complaining that the diff needs to be updated with the latest. I'm validating the async checkpoints with the recent refactor @felipemello1 had done. I will notify here once the validation is done and the PR can be merged. Thank you!

Once this validation is confirmed, I'm happy to merge this!

@facebook-github-bot

@saumishr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 28.48837% with 246 lines in your changes missing coverage. Please review.

Project coverage is 67.70%. Comparing base (9cfa288) to head (74ae681).
Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
...htune/training/checkpointing/_checkpoint_client.py | 0.00% | 119 Missing ⚠️
tests/recipes/test_full_finetune_distributed.py | 14.85% | 86 Missing ⚠️
torchtune/training/checkpointing/_checkpointer.py | 58.57% | 29 Missing ⚠️
recipes/full_finetune_distributed.py | 0.00% | 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2006      +/-   ##
==========================================
- Coverage   67.75%   67.70%   -0.06%     
==========================================
  Files         334      337       +3     
  Lines       19281    19665     +384     
==========================================
+ Hits        13064    13314     +250     
- Misses       6217     6351     +134     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@joecummings joecummings added the triage review This issue should be discussed in weekly review label Dec 13, 2024
@joecummings joecummings changed the title [DCP][RFC] Faster intermediate checkpoints with DCP async save in TorchTune Faster intermediate checkpoints with DCP async save in TorchTune Dec 13, 2024
@joecummings joecummings merged commit c2c6f4a into pytorch:main Dec 13, 2024
17 checks passed
@joecummings joecummings removed the triage review This issue should be discussed in weekly review label Dec 13, 2024
rahul-sarvam added a commit to sarvamai/torchtune that referenced this pull request Dec 18, 2024
* Llama 3.3 70B (pytorch#2124)

* Llama 3.3 readme updates (pytorch#2125)

* update configs (pytorch#2107)

Co-authored-by: Felipe Mello <[email protected]>

* Reduce logging output for distributed KD (pytorch#2120)

* Support Early Exit Loss and/or Layer Dropout (pytorch#1076)

Co-authored-by: ebsmothers <[email protected]>

* Update checkpointing directory (pytorch#2074)

Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: vancoyendall <[email protected]>

* pass correct arg (pytorch#2127)

Co-authored-by: Felipe Mello <[email protected]>

* update configs (pytorch#2128)

Co-authored-by: Felipe Mello <[email protected]>

* fix qat_lora_test (pytorch#2131)

Co-authored-by: Felipe Mello <[email protected]>

* guard ckpt imports (pytorch#2133)

Co-authored-by: Felipe Mello <[email protected]>

* [bug fix] add parents=True (pytorch#2136)

Co-authored-by: Felipe Mello <[email protected]>

* [bug fix] re-add model (pytorch#2135)

Co-authored-by: Felipe Mello <[email protected]>

* Update save sizes into GiB (pytorch#2143)

* [bug fix] remove config download when source is kaggle (pytorch#2144)

Co-authored-by: Felipe Mello <[email protected]>

* [fix] remove "with_suffix" (pytorch#2146)

Co-authored-by: Felipe Mello <[email protected]>

* DoRA fixes (pytorch#2139)



Co-authored-by: Mircea Mironenco <[email protected]>

* [Fix] Llama 3.2 Vision decoder_trainable flag fixed (pytorch#2150)

* Small readme, config updates (pytorch#2157)

* Using `FormattedCheckpointFiles` in configs (pytorch#2147)

* Move ``get_world_size_and_rank`` to utils (pytorch#2155)

* Faster intermediate checkpoints with DCP async save in TorchTune (pytorch#2006)

Co-authored-by: Saurabh Mishra <[email protected]>

* torchdata integration - multi-dataset and streaming support (pytorch#1929)

* Allow higher version of lm-eval (pytorch#2165)

* Using `FormattedCheckpointFiles` in configs... round 2 (pytorch#2167)

* [EZ] Fix set_torch_num_threads in multi-node. (pytorch#2164)

---------

Co-authored-by: Philip Bontrager <[email protected]>
Co-authored-by: ebsmothers <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Joe Cummings <[email protected]>
Co-authored-by: Mostafa Elhoushi <[email protected]>
Co-authored-by: vancoyendall <[email protected]>
Co-authored-by: Mircea Mironenco <[email protected]>
Co-authored-by: salman <[email protected]>
Co-authored-by: Saurabh Mishra <[email protected]>
Co-authored-by: Saurabh Mishra <[email protected]>
Co-authored-by: Andrew Ho <[email protected]>
Co-authored-by: Eugen Hotaj <[email protected]>
rahul-sarvam pushed a commit to sarvamai/torchtune that referenced this pull request Dec 23, 2024
rahul-sarvam pushed a commit to sarvamai/torchtune that referenced this pull request Dec 23, 2024