Replies: 8 comments 6 replies
-
for Flux, the only text encoder you can train is a 768-dim CLIP-L model, and the only part of it that's used is the pooled embed, essentially an averaged value. i'm not sure training it will work very well in this case. obviously you can, but finetuning CLIP through transformer or unet gradients that don't directly describe what the CLIP model needs was never a great method.
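To make the "pooled embed only" point concrete, here is a minimal sketch using the standard `transformers` CLIP classes (the checkpoint name is illustrative; Flux bundles its own copy of the CLIP-L text encoder, and real pipelines wire this up internally): the 768-dim pooled output is the only piece of CLIP-L's output the diffusion transformer sees, while the per-token hidden states go unused.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint; any CLIP-L ("clip-vit-large-patch14") text tower
# has the same output shapes as the one shipped with Flux.
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

tokens = tokenizer("a photo of a cat", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens)

print(out.last_hidden_state.shape)  # [1, 77, 768] -- per-token states, not consumed here
print(out.pooler_output.shape)      # [1, 768]     -- the single pooled vector that gets used
```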
-
you can't really add to the vocabulary of T5. it uses SentencePiece, which has a fixed vocabulary.
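A quick way to see this (a sketch using the `transformers` T5 tokenizer; the checkpoint name is illustrative and the split shown in the comment is only an example): the SentencePiece model behind T5 has a fixed vocabulary, so an unseen word is broken into existing subword pieces rather than becoming a new token of its own.

```python
from transformers import T5TokenizerFast

# Illustrative checkpoint; Flux uses a T5-XXL variant, but the underlying
# SentencePiece vocabulary is the same fixed set across T5 sizes.
tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")

print(len(tokenizer))                    # fixed vocabulary size (~32k pieces)
print(tokenizer.tokenize("ohwxperson"))  # e.g. ['▁oh', 'wx', 'person'] -- split into
                                         # existing pieces, no new token is created
```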
-
Oh interesting, so you're saying something like this wouldn't work for T5/SentencePiece? EDIT: you also mention we can train CLIP-L; can we not train T5? Is it a matter of needing too much VRAM, or is it not physically possible to train?
-
Speaking only from my own experiments with Kolors as a layman, but even if it were trainable there doesn't seem to be much reason to train it. Every new model that uses T5 is using the same base T5 model that shipped with DeepFloyd, not a custom finetune. Training the Unet on its own seems to have the same effect that training both the Unet and CLIP has on SD 1.5 and SDXL models.
-
if you're asking about the origin of the CLIP models and whether they're frozen, i'd suggest just sticking with training the transformer. it's heavy enough already and enough of a mystery how to get good quality. why complicate things?
-
For the most part, finetuned checkpoints and LoRAs will train at least the first CLIP model alongside the Unet, though some specific LoRAs won't train the text encoder for a variety of reasons. That's because CLIP functions entirely differently from T5. T5 works more like an LLM: it weights elements across the image depending on the entire context of the prompt you give it, relative to the Unet weights, so there's not much reason to train T5. It already has a full vocabulary and interacts with the Unet in a far more refined and direct manner. At least that's my understanding. I'm not a data scientist, so feel free to correct any misunderstandings I might have.
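To show the per-token side of that comparison, here is a small sketch using the plain `transformers` T5 encoder (a small checkpoint is used to keep it light; Flux actually pairs its transformer with T5-XXL, and real pipelines wrap this differently): the encoder hands over one contextual hidden state per token rather than a single summary vector, which is what lets the whole sentence condition the image.

```python
import torch
from transformers import T5EncoderModel, T5TokenizerFast

# Small checkpoint for illustration only.
model_id = "google/t5-v1_1-base"
tokenizer = T5TokenizerFast.from_pretrained(model_id)
encoder = T5EncoderModel.from_pretrained(model_id)

prompt = "a red fox sitting on a mossy rock at dawn"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = encoder(**tokens)

# One hidden state per token: the full sequence reaches the diffusion
# transformer, so every word is conditioned on the whole sentence.
print(out.last_hidden_state.shape)  # [1, num_tokens, hidden_dim]
```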
-
I just want to say that you do NOT need to train the TE. Black Forest Labs hasn't trained a single TE. The reason you are getting concept bleeding is how you train. I managed to train two people into one LoRA without them bleeding together.

Flux does not like trigger words; it's not natural for it to understand them. T5 understands contextual meaning, and that is also how it feeds your tokens to the Unet. A trigger word does nothing more than confuse the model. The best results I got were when I used semi-long captions in natural language, also switching between different description styles. Flux can overtrain on certain captioning styles: if you constantly use the same sequence, length and style of sentences, it is very likely that Flux overtrains on that before it learns your subjects, and that causes them to bleed into each other (because it never correctly understood them). Training CLIP-L would not solve the issue entirely, because T5 will still do most of the work when feeding your prompt to the Unet.

Try incorporating your trigger word naturally into your caption in fluid language. Switch between vivid and professional descriptions of your subjects. If you have simple images with simple backgrounds, use short captions; if you have detailed scenes, add very detailed captions. Saying that it's the TE's fault for mixing up concepts would mean that Flux should theoretically be a scrambled mess, because they never trained any TE (neither did SD or any other T2I model); it's just that we are still way too used to SDXL garbo that needed tags to even remotely understand anything.
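As a purely hypothetical illustration of the "vary your captioning style" advice (none of this comes from any particular trainer; the file layout, function names and captions are made up), one way to avoid locking Flux onto a single caption format is to keep several differently styled captions per image and sample one each time the image is used:

```python
import json
import random
from pathlib import Path

def load_caption_variants(dataset_dir: str) -> dict[str, list[str]]:
    """Hypothetical layout: next to each image sits `<name>.captions.json`
    holding a list of captions in different styles and lengths."""
    variants = {}
    for path in Path(dataset_dir).glob("*.captions.json"):
        variants[path.stem.removesuffix(".captions")] = json.loads(path.read_text())
    return variants

def pick_caption(variants: list[str]) -> str:
    # Sampling a different style/length each time keeps the model from
    # overfitting to one sentence structure before it learns the subject.
    return random.choice(variants)

if __name__ == "__main__":
    example = {
        "img_0001": [
            "A candid photo of Anna laughing in a sunlit kitchen.",
            "Anna, a woman with short auburn hair, stands at a cluttered kitchen "
            "counter, laughing as morning light falls across her face.",
            "Portrait photograph, female subject (Anna), indoor kitchen setting, "
            "natural window lighting, relaxed expression.",
        ],
    }
    for name, variants in example.items():
        print(name, "->", pick_caption(variants))
```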
-
Hi,
Still testing Flux LoRA training, thank you for your work!
Question: does it also train the text encoder(s)? With kohya for SDXL, training the text encoder noticeably helped the LoRA learn for me.
Thank you
EDIT: also, perhaps we could test adding new tokens to the tokenizer and training the text encoder on those new tokens to see if learning is better? Or perhaps we could do a LoRA on just the DiT blocks (or whatever the rest of the model is called) while also simultaneously doing a light textual inversion.
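On the "add new tokens and train the text encoder on them" idea, here is a minimal, hypothetical sketch of a textual-inversion-style setup for the CLIP-L side (using the standard `transformers` classes; the placeholder token name and checkpoint are illustrative, and this is not how any particular trainer implements it): add a placeholder token, resize the embedding table, and optimize only the new embedding row.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint and placeholder token name.
model_id = "openai/clip-vit-large-patch14"
placeholder = "<my-concept>"

tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

# Register the new token and grow the embedding matrix by one row.
tokenizer.add_tokens(placeholder)  # returns the number of tokens actually added
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids(placeholder)

# Initialize the new embedding from a related existing word (a common trick).
embeds = text_encoder.get_input_embeddings().weight.data
init_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("person")[0])
embeds[new_token_id] = embeds[init_id].clone()

# Freeze everything except the embedding table; in a real training loop you
# would also zero the gradients of every row except `new_token_id` after backward.
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().requires_grad_(True)
optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=5e-4)
```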