Fix issues for saving checkpointing steps #9891
base: main
Conversation
@sayakpaul Please take a look at this PR, thanks for your help!
Thanks for your PR. Can you please modify a single file first so we can discuss the changes?
When I ran the DreamBooth Flux script without LoRA and it saved a checkpoint, it got stuck for a while and then broke. So I think all the scripts need these modifications. I can do only the Flux ones if you want.
Yeah, let's change a single file first and then we can discuss the changes.
Sure
@sayakpaul I've now made the modifications only to the Flux scripts.
if global_step % args.checkpointing_steps == 0:
    # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
Well, there is a better way to handle it:
if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:
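As a sketch of where that gate sits (assuming the usual diffusers-style training-loop names global_step, args, and accelerator; DistributedType comes from accelerate.utils):

import os
import shutil

from accelerate.utils import DistributedType

if global_step % args.checkpointing_steps == 0:
    # DeepSpeed's save_state is collective, so every rank must reach it;
    # gating on is_main_process alone leaves the other ranks hanging.
    if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:
        # Rotate old checkpoints on the main process only.
        if accelerator.is_main_process and args.checkpoints_total_limit is not None:
            checkpoints = [d for d in os.listdir(args.output_dir) if d.startswith("checkpoint")]
            checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
            if len(checkpoints) >= args.checkpoints_total_limit:
                num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
                for folder in checkpoints[:num_to_remove]:
                    shutil.rmtree(os.path.join(args.output_dir, folder))

        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)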
Thanks for your PR!
You can refer to the following scripts:
- https://github.com/a-r-r-o-w/cogvideox-factory/blob/main/training/cogvideox_image_to_video_lora.py
- https://github.com/a-r-r-o-w/cogvideox-factory/blob/main/training/cogvideox_text_to_video_sft.py
to see how we handle saving and loading from checkpoints when using DeepSpeed. Search for DistributedType.DEEPSPEED.
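The save-hook side of the pattern in those scripts looks roughly like this (a sketch, not the exact code; accelerator comes from the surrounding script, and the "transformer" subfolder name is illustrative):

import os

from accelerate.utils import DistributedType

def save_model_hook(models, weights, output_dir):
    # Every rank enters this hook under DeepSpeed, but only the main
    # process is handed the weights, so check both conditions.
    if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:
        for model in models:
            model.save_pretrained(os.path.join(output_dir, "transformer"))
            # weights can be empty on non-main ranks, so guard the pop.
            if weights:
                weights.pop()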
@sayakpaul I've changed it based on the reference.
Thanks! Left more comments. LMK if they're clear.
model.load_state_dict(load_model.state_dict())
except Exception:
elif isinstance(unwrap_model(model), (CLIPTextModelWithProjection, T5EncoderModel)):
We don't support fine-tuning the T5 model, so this seems wrong. It should just be CLIPTextModelWithProjection, no?
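With that change, the branch might reduce to something like this (a sketch; unwrap_model, model, and input_dir come from the surrounding hook):

elif isinstance(unwrap_model(model), CLIPTextModelWithProjection):
    load_model = CLIPTextModelWithProjection.from_pretrained(input_dir, subfolder="text_encoder")
    model.load_state_dict(load_model.state_dict())
    del load_model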
    try:
        load_model = T5EncoderModel.from_pretrained(input_dir, subfolder="text_encoder_2")
        model(**load_model.config)
        model.load_state_dict(load_model.state_dict())
    except Exception:
        raise ValueError(f"Couldn't load the model of type: ({type(model)}).")
else:
    raise ValueError(f"Unsupported model found: {type(model)=}")
Same for this.
try:
    load_model = CLIPTextModelWithProjection.from_pretrained(input_dir, subfolder="text_encoder")
    model(**load_model.config)

if not accelerator.distributed_type == DistributedType.DEEPSPEED:
We also need to handle the case when we're actually doing DeepSpeed training. Similar to:
https://github.com/a-r-r-o-w/cogvideox-factory/blob/d63a826f37758eccf226710f94f6c3a4d4ee7a25/training/cogvideox_text_to_video_sft.py#L385
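Following that reference, the load hook would branch along these lines (a sketch under assumptions: FluxTransformer2DModel stands in for the model class the script trains, and transformer is the module being trained in the outer scope):

from accelerate.utils import DistributedType
from diffusers import FluxTransformer2DModel

def load_model_hook(models, input_dir):
    if not accelerator.distributed_type == DistributedType.DEEPSPEED:
        # Non-DeepSpeed path: pop the tracked models and load weights in place.
        while len(models) > 0:
            model = models.pop()
            load_model = FluxTransformer2DModel.from_pretrained(input_dir, subfolder="transformer")
            model.load_state_dict(load_model.state_dict())
            del load_model
    else:
        # DeepSpeed path: the hook is not handed populated models, so load the
        # checkpointed weights directly into the module being trained.
        load_model = FluxTransformer2DModel.from_pretrained(input_dir, subfolder="transformer")
        transformer.load_state_dict(load_model.state_dict())
        del load_model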
@@ -1262,15 +1263,16 @@ def load_model_hook(models, input_dir):
    transformer_ = None
    text_encoder_one_ = None

    while len(models) > 0:
        model = models.pop()
+       if not accelerator.distributed_type == DistributedType.DEEPSPEED:
Same. We need to handle the case when we're doing DeepSpeed training. Reference:
https://github.com/a-r-r-o-w/cogvideox-factory/blob/d63a826f37758eccf226710f94f6c3a4d4ee7a25/training/cogvideox_text_to_video_lora.py#L396
@@ -1187,7 +1187,8 @@ def save_model_hook(models, weights, output_dir):
        raise ValueError(f"Wrong model supplied: {type(model)=}.")

    # make sure to pop weight so that corresponding model is not saved again
-   weights.pop()
+   if weights:
+       weights.pop()

def load_model_hook(models, input_dir):
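The guard matters because the hook can be invoked with an empty weights list (for example on non-main ranks under DeepSpeed, where the DeepSpeed engine handles the actual state), and an unconditional pop then fails. A standalone illustration:

weights = []  # what the hook can receive on a non-main rank
# weights.pop()  # would raise IndexError: pop from empty list
if weights:
    weights.pop()  # safe: pops only when a weight was actually passed in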
Seems like we're not handling the loading case appropriately here. I've repeated this multiple times now, but please refer to the changes linked above to get an idea of what is required.
In summary, we're not dealing with the changes required to load the state dict into the models being trained when DeepSpeed is enabled.
What does this PR do?
These modifications make it possible to save checkpoints during training. Without them, the run gets stuck for too long when saving state and eventually times out, because with DeepSpeed accelerator.save_state must be called on every rank rather than only the main process.
Fixes #2606 ("get stuck when save_state using DeepSpeed backend under training train_text_to_image_lora").
Also fixes a bug where weights.pop() was called on an empty list.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.