Replies: 8 comments 6 replies
-
for Flux, the only text encoder you can train is a 768-dim CLIP-L model, and the only part of it that's used is the pooled embed, essentially an averaged value. i'm not sure training it will work very well in this case. obviously you can, but finetuning CLIP through transformer or unet gradients that don't directly describe what the CLIP model needs was never a great method.
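To make the "pooled embed only" point concrete, here is a minimal sketch using the standard `transformers` CLIP classes (the checkpoint name is illustrative; Flux bundles its own copy of the CLIP-L text encoder, and real pipelines wire this up internally): the 768-dim pooled output is the only piece of CLIP-L's output the diffusion transformer sees, while the per-token hidden states go unused.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint; any CLIP-L ("clip-vit-large-patch14") text tower
# has the same output shapes as the one shipped with Flux.
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

tokens = tokenizer("a photo of a cat", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens)

print(out.last_hidden_state.shape)  # [1, 77, 768] -- per-token states, not consumed here
print(out.pooler_output.shape)      # [1, 768]     -- the single pooled vector that gets used
```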
-
you can't really add to the vocabulary of T5. it uses SentencePiece, which has a fixed vocabulary.
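A quick way to see this (a sketch using the `transformers` T5 tokenizer; the checkpoint name is illustrative and the split shown in the comment is only an example): the SentencePiece model behind T5 has a fixed vocabulary, so an unseen word is broken into existing subword pieces rather than becoming a new token of its own.

```python
from transformers import T5TokenizerFast

# Illustrative checkpoint; Flux uses a T5-XXL variant, but the underlying
# SentencePiece vocabulary is the same fixed set across T5 sizes.
tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")

print(len(tokenizer))                    # fixed vocabulary size (~32k pieces)
print(tokenizer.tokenize("ohwxperson"))  # e.g. ['▁oh', 'wx', 'person'] -- split into
                                         # existing pieces, no new token is created
```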
-
Oh interesting, so you're saying something like this wouldn't work for T5/SentencePiece? EDIT: you also mention we can train CLIP-L; can we not train T5? Is it a matter of needing too much VRAM, or is it not physically possible to train?
-
Speaking only from my own experiments with Kolors as a layman, but even if it were trainable there doesn't seem to be much reason to train it. Every new model that uses T5 is using the same base T5 model that shipped with DeepFloyd, not a custom finetune. Training the Unet on its own seems to have the same effect that training both the Unet and CLIP has on SD 1.5 and SDXL models.
-
if you're asking about the origin of the CLIP models and whether they're frozen, i'd suggest just sticking with training the transformer. it's heavy enough already and enough of a mystery how to get good quality. why complicate things?
-
For the most part, finetuned checkpoints and LoRAs will train at least the first CLIP model alongside the Unet, though some specific LoRAs won't train the text encoder for a variety of reasons. That's because CLIP functions entirely differently from T5. T5 works more like an LLM: it weights elements across the image depending on the entire context of the prompt you give it, relative to the Unet weights, so there's not much reason to train T5. It already has a full vocabulary and interacts with the Unet in a far more refined and direct manner. At least that's my understanding. I'm not a data scientist, so feel free to correct any misunderstandings I might have.
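To show the per-token side of that comparison, here is a small sketch using the plain `transformers` T5 encoder (a small checkpoint is used to keep it light; Flux actually pairs its transformer with T5-XXL, and real pipelines wrap this differently): the encoder hands over one contextual hidden state per token rather than a single summary vector, which is what lets the whole sentence condition the image.

```python
import torch
from transformers import T5EncoderModel, T5TokenizerFast

# Small checkpoint for illustration only.
model_id = "google/t5-v1_1-base"
tokenizer = T5TokenizerFast.from_pretrained(model_id)
encoder = T5EncoderModel.from_pretrained(model_id)

prompt = "a red fox sitting on a mossy rock at dawn"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = encoder(**tokens)

# One hidden state per token: the full sequence reaches the diffusion
# transformer, so every word is conditioned on the whole sentence.
print(out.last_hidden_state.shape)  # [1, num_tokens, hidden_dim]
```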
-
I just want to say that you do NOT need to train the TE. Black Forest Labs hasn't trained a single TE. The reason you are getting concept bleeding is how you train. I managed to train two people into one LoRA without them bleeding together.

Flux does not like trigger words; it's not natural for it to understand them. T5 understands contextual meaning, and that is also how it feeds your tokens to the Unet. A trigger word does nothing more than confuse the model. The best results I got were when I used semi-long captions in natural language, also switching between different description styles. Flux can overtrain on certain captioning styles: if you constantly use the same sequence, length and style of sentences, it is very likely that Flux overtrains on that before it learns your subjects, and that causes them to bleed into each other (because it never correctly understood them). Training CLIP-L would not solve the issue entirely, because T5 will still do most of the work when feeding your prompt to the Unet.

Try incorporating your trigger word naturally into your caption in fluid language. Switch between vivid and professional descriptions of your subjects. If you have simple images with simple backgrounds, use short captions; if you have detailed scenes, add very detailed captions. Saying that it's the TE's fault for mixing up concepts would mean that Flux should theoretically be a scrambled mess, because they never trained any TE (neither did SD or any other T2I model); it's just that we are still way too used to SDXL garbo that needed tags to even remotely understand anything.
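As a purely hypothetical illustration of the "vary your captioning style" advice (none of this comes from any particular trainer; the file layout, function names and captions are made up), one way to avoid locking Flux onto a single caption format is to keep several differently styled captions per image and sample one each time the image is used:

```python
import json
import random
from pathlib import Path

def load_caption_variants(dataset_dir: str) -> dict[str, list[str]]:
    """Hypothetical layout: next to each image sits `<name>.captions.json`
    holding a list of captions in different styles and lengths."""
    variants = {}
    for path in Path(dataset_dir).glob("*.captions.json"):
        variants[path.stem.removesuffix(".captions")] = json.loads(path.read_text())
    return variants

def pick_caption(variants: list[str]) -> str:
    # Sampling a different style/length each time keeps the model from
    # overfitting to one sentence structure before it learns the subject.
    return random.choice(variants)

if __name__ == "__main__":
    example = {
        "img_0001": [
            "A candid photo of Anna laughing in a sunlit kitchen.",
            "Anna, a woman with short auburn hair, stands at a cluttered kitchen "
            "counter, laughing as morning light falls across her face.",
            "Portrait photograph, female subject (Anna), indoor kitchen setting, "
            "natural window lighting, relaxed expression.",
        ],
    }
    for name, variants in example.items():
        print(name, "->", pick_caption(variants))
```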
-
Hi,
Still testing Flux LoRA training, thank you for your work!
Question: does it also train the text encoder(s)? With kohya for SDXL, training the text encoder noticeably helped the LoRA learn for me.
Thank you
EDIT: also, perhaps we could test adding new tokens to the tokenizer and training the text encoder on those new tokens to see if learning is better? Or perhaps we could do a LoRA on just the DiT blocks (or whatever the rest of the model is called) while also simultaneously doing a light textual inversion.
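On the "add new tokens and train the text encoder on them" idea, here is a minimal, hypothetical sketch of a textual-inversion-style setup for the CLIP-L side (using the standard `transformers` classes; the placeholder token name and checkpoint are illustrative, and this is not how any particular trainer implements it): add a placeholder token, resize the embedding table, and optimize only the new embedding row.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint and placeholder token name.
model_id = "openai/clip-vit-large-patch14"
placeholder = "<my-concept>"

tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

# Register the new token and grow the embedding matrix by one row.
tokenizer.add_tokens(placeholder)  # returns the number of tokens actually added
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids(placeholder)

# Initialize the new embedding from a related existing word (a common trick).
embeds = text_encoder.get_input_embeddings().weight.data
init_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("person")[0])
embeds[new_token_id] = embeds[init_id].clone()

# Freeze everything except the embedding table; in a real training loop you
# would also zero the gradients of every row except `new_token_id` after backward.
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().requires_grad_(True)
optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=5e-4)
```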