
Does it work with Spatiotemporal Skip Guidance? #118

Open

renschni opened this issue Dec 28, 2024 · 4 comments
@renschni

As the title says, combining the speed improvement with the STG quality improvement would be great. Has anyone already run tests to see whether it works?

@foreverpiano
Collaborator

foreverpiano commented Dec 28, 2024

I think it is compatible with any CFG-enhanced method.
I opened an issue about this earlier: junhahyung/STGuidance#14
They don't have a clean monkey-patch implementation yet; I will integrate the utils once that is available.
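If it helps, here is how the two guidance terms should compose, assuming the formulation from the STG paper; a minimal sketch, all names are illustrative and not taken from either repo:

```python
import torch

def combined_guidance(
    noise_uncond: torch.Tensor,   # prediction for the empty prompt (CFG branch)
    noise_text: torch.Tensor,     # prediction for the real prompt
    noise_perturb: torch.Tensor,  # prediction with selected blocks skipped (STG branch)
    cfg_scale: float,
    stg_scale: float,
) -> torch.Tensor:
    # CFG extrapolates away from the unconditional prediction; STG adds a
    # second extrapolation away from the block-skipped prediction.
    pred = noise_uncond + cfg_scale * (noise_text - noise_uncond)
    return pred + stg_scale * (noise_text - noise_perturb)
```

With stg_scale=0.0 this reduces to plain CFG, which is why the two should be able to coexist in one pipeline.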

@foreverpiano foreverpiano self-assigned this Dec 28, 2024
@renschni
Author

renschni commented Jan 3, 2025

I tried to understand what exactly STG does and where it is applied. My naive approach was to compare the pipeline scripts (pipeline_hunyuan_video.py) of the FastVideo and STGuidance repos. The first mention of STG is in the signature of `def __call__(`.
In the FastVideo version the signature ends with:
```python
    ...
    vae_ver: str = "88-4c-sd",
    enable_tiling: bool = False,
    n_tokens: Optional[int] = None,
    embedded_guidance_scale: Optional[float] = None,
    **kwargs,
):
```

whereas the STGuidance version adds three STG parameters:
```python
    ...
    vae_ver: str = "88-4c-sd",
    enable_tiling: bool = False,
    n_tokens: Optional[int] = None,
    embedded_guidance_scale: Optional[float] = None,
    stg_mode: Optional[str] = None,
    stg_block_idx: List[int] = [-1],
    stg_scale: float = 0.0,
    **kwargs,
):
```
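So if the port works, using it should only require passing the extra kwargs through, something like this (the pipeline name, the prompt, and the mode string are placeholders of mine; check the STGuidance README for the real values):

```python
# Hypothetical call into a patched FastVideo pipeline; only the stg_* kwargs
# are new, mirroring the STGuidance signature above.
video = pipeline(
    prompt="a red panda rolling down a grassy hill",  # placeholder prompt
    stg_mode="STG-R",    # assumed mode string; check the STGuidance README
    stg_block_idx=[-1],  # which transformer block(s) to perturb/skip
    stg_scale=2.0,       # 0.0 leaves STG disabled
)
```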

Then, further down the pipeline, the `# perform guidance` section is where I think the magic happens:

```python
# perform guidance
if self.do_classifier_free_guidance:
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + self.guidance_scale * (
        noise_pred_text - noise_pred_uncond
    )
elif not self.do_classifier_free_guidance and self.do_spatio_temporal_guidance:
    with torch.autocast(
        device_type="cuda", dtype=target_dtype, enabled=autocast_enabled
    ):
        noise_pred_perturb = self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)
            latent_model_input,  # [2, 16, 33, 24, 42]
            t_expand,  # [2]
            text_states=prompt_embeds,  # [2, 256, 4096]
            text_mask=prompt_mask,  # [2, 256]
            text_states_2=prompt_embeds_2,  # [2, 768]
            freqs_cos=freqs_cis[0],  # [seqlen, head_dim]
            freqs_sin=freqs_cis[1],  # [seqlen, head_dim]
            guidance=guidance_expand,
            return_dict=[stg_block_idx, stg_mode, True],
        )["x"]
    noise_pred = noise_pred_perturb + self._stg_scale * (
        noise_pred - noise_pred_perturb
    )
```
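Condensed down to just the math, the STG branch is the same extrapolation pattern as CFG, only against a different reference prediction. A rough sketch (function names are mine):

```python
import torch

def cfg_step(uncond: torch.Tensor, text: torch.Tensor, scale: float) -> torch.Tensor:
    # Classifier-free guidance: push away from the unconditional prediction.
    return uncond + scale * (text - uncond)

def stg_step(perturb: torch.Tensor, text: torch.Tensor, scale: float) -> torch.Tensor:
    # STG: same pattern, but the reference comes from a forward pass with the
    # selected transformer blocks skipped.
    return perturb + scale * (text - perturb)
```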

I couldn't see any difference between the get_guidance_scale_embedding functions of the two pipeline scripts, so it really does seem to be the math in the `# perform guidance` section that does it.

Of course, this property was also added:

```python
@property
def do_spatio_temporal_guidance(self):
    # return self._guidance_scale > 1 and self.transformer.config.time_cond_proj_dim is None
    return self._stg_scale > 1
```
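The property needs `self._stg_scale` to be set somewhere, presumably from the `stg_scale` argument of `__call__` before the denoising loop. A self-contained sketch of just the gating (the class and the wiring are my assumption; only the property body is quoted from their code):

```python
class StgGateSketch:
    """Illustrates the gate only; not the actual pipeline class."""

    def __init__(self, stg_scale: float = 0.0):
        # In the real pipeline, __call__ would presumably store its argument here.
        self._stg_scale = stg_scale

    @property
    def do_spatio_temporal_guidance(self) -> bool:
        return self._stg_scale > 1

assert not StgGateSketch(stg_scale=0.0).do_spatio_temporal_guidance
assert StgGateSketch(stg_scale=2.0).do_spatio_temporal_guidance
```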

I can't test it on my machine, but my naive approach would be simply to change the pipeline script, add the STG-related code blocks, and see what happens :)

@foreverpiano
Collaborator

Could you provide a few examples of how disabling certain guidance features affects the results? I want to check the differences before adding them to our repository. I'm waiting for their response before making any additions.

@renschni
Author

renschni commented Jan 5, 2025

The project team provides some very good examples on their project page. I find this one (Mochi 1, though) the most impressive. I still don't fully understand how "reducing" the guidance allows for such a jump in quality, but then again I don't fully understand transformer models either. I guess too much micromanagement is never good for any intelligence, be it natural or artificial...
As soon as I have built my new workstation (as I said, I can't test it currently), I will implement it locally and run tests with this repo, STG injected.

https://junhahyung.github.io/STGuidance/assets/circle/mochi/sample2.mp4
