Experimental Redux conditioning for Flux Lora training #1838
base: sd3
@recris Amazing work. Did you notice whether this solves the problem of training multiple concepts of the same class? For example, training two men at the same time, or training one man and having every other man turn into him. Does this solve that problem? Also, after training with vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2, you don't need to use Redux, right?
This has nothing to do with either of those issues. For multiple concepts you would need something like pivotal tuning, which is currently not supported either. This PR is only an attempt to improve overall quality in the presence of poorly captioned training data.
@recris Thanks, but do you still recommend vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2? And then we can use the trained LoRA without Flux Redux, right?
Please read the notes fully before posting - these are not "recommendations". This has hardly been tested in a comprehensive way and it is probably not ready for widespread use. That said, you can probably start with
Interesting concept! If you consider this a downside and want a stand-alone LoRA as output, you could try to (gradually?) remove the image conditioning from the model prediction, while still expecting the model to have learned to make the same prediction as if it were still conditioned. Similar to this concept: not using the base model as teacher, but the base model conditioned by Redux.
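The teacher/student idea in this comment can be sketched roughly as follows. This is a toy illustration only, not code from the PR or from OneTrainer: `student`, `teacher`, and `training_step_loss` are hypothetical names, the models are plain callables returning flat lists, and the "gradual" removal of the conditioning is left out.

```python
def distillation_loss(student_pred, teacher_pred):
    """Mean squared error between two flat prediction vectors."""
    assert len(student_pred) == len(teacher_pred)
    squared = [(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)]
    return sum(squared) / len(squared)


def training_step_loss(student, teacher, noisy_latent, text_cond, redux_cond):
    """Loss for one step of the (assumed) Redux-as-teacher setup.

    teacher: frozen base model, conditioned on text AND Redux tokens.
    student: model with the LoRA applied, conditioned on text only,
             which is all that is available at inference time.
    """
    target = teacher(noisy_latent, text_cond, redux_cond)
    prediction = student(noisy_latent, text_cond)
    return distillation_loss(prediction, target)
```

The point of the design is that the student never sees the Redux tokens, so whatever it learns remains reachable from a text prompt alone.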
This is what the |
After some additional testing, in my (subjective) experience I can set
I still don't think mashing together text and vision conditioning is the right approach. The correct way to do this would require concatenating the embeddings, then using a non-binary attention mask to control the influence of the Redux tokens. But this would require deeper changes to the Flux model code.
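One way to read "non-binary attention mask" is as an additive bias on the attention logits of the Redux keys. The sketch below is an assumption about what that could look like for a single query. It is not taken from the PR or the Flux model code, and `attention_weights` and `redux_weight` are hypothetical names.

```python
import math


def attention_weights(scores, is_redux, redux_weight):
    """Softmax over key scores with a soft (non-binary) mask on Redux keys.

    scores: raw attention logits for one query, one per key token.
    is_redux: parallel list of bools marking which keys are Redux tokens.
    redux_weight: value in (0, 1]; 1.0 leaves the Redux keys untouched.
    """
    # Adding log(w) to a logit multiplies its unnormalized weight by w.
    bias = math.log(redux_weight)
    logits = [s + bias if r else s for s, r in zip(scores, is_redux)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With two text keys and two Redux keys of equal score, `redux_weight=0.5` halves the share each Redux key gets relative to a text key, instead of switching the keys fully on or off as a binary mask would.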
If I understand your code correctly, the dropout removes the Redux conditioning. Therefore, in a dropout step you are training normally against the training images. I have now experimented a bit with this idea myself. On the left is the (only) training image; middle and right are samples of the LoRA, without Redux. Here is some experimental code: Nerogar/OneTrainer@master...dxqbYD:OneTrainer:redux
I am still running some tests; I had a configuration mistake which might have affected my earlier results. With Redux dropout my expectation was that it would "diversify" the captions seen during training, improving learning robustness; the concept here is similar to providing multiple captions for the same image with varying levels of detail and selecting one at random in each training step (for reference: #1643) |
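The per-step choice described here can be sketched as below. This is a minimal illustration under assumptions, not the PR's code: `pick_conditioning` is a hypothetical helper, and it follows the PR description's reading that a dropped step falls back to the text conditioning.

```python
import random


def pick_conditioning(text_cond, vision_cond, vision_cond_dropout, rng=random):
    """Return the conditioning to use for one training step.

    With probability vision_cond_dropout the vision (Redux) conditioning is
    dropped and the text conditioning is used instead.
    """
    if rng.random() < vision_cond_dropout:
        return text_cond
    return vision_cond
```

Over many steps this behaves like multi-caption training with two "captions" per image: the written one and the Redux embedding, sampled in a ratio set by `vision_cond_dropout`.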
I get the idea, but what I am saying is: you are doing "multi-caption training", And then you are done with training, and during inference you can only use a). You cannot access what the model has learned for b) because you don't have the embedding. That's why I'm proposing that there should only be one optimization target - the one you still have during inference. That's what I've done in my samples. The central line in my code is this: |
@dxqbYD I think what you're talking about is only valid for unknown things. For example, if we're training a specific person, then we naturally want to give the model his name manually. But for some known concepts, for example, training cars to be less broken, we don't need to manually tell the model that a car is a car. Siglip will do it for us perfectly. |
sorry, I don't understand how any of this is related to what I wrote before. |
I've changed my previous approach to controlling the strength of the Redux conditioning:
The number of Redux tokens in the conditioning seems to affect how usable the LoRA is with text prompts. From testing various downsampling sizes, I've noticed that beyond N=5 (25 tokens) it starts to perform noticeably worse (unless counteracted with dropout). Also, training with both text and Redux seems to have a negative performance impact due to the number of tokens being used, but there is also a significant amount of padding being added by default. I recommend lowering
This PR adds support for training Flux.1 LoRA using conditioning from the Redux image encoder.
Instead of relying on text captions to condition the model, why not use the image itself to provide a "perfect" caption instead?
Redux+SigLIP provide a T5 compatible embedding that generates images very close to the target. I thought this could be used instead of relying on text descriptions that may or may not match the concepts as understood by the base model.
To use this I've added the following new parameters:
redux_model_path: Safetensors file for the Redux model (downloadable from here)
… google/siglip-so400m-patch14-384)
vision_cond_downsample: controls downsampling of the Redux tokens. By default, Redux conditioning uses a 27x27 grid of tokens, which is a lot and has a very strong effect that prevents proper learning. Setting this parameter to N downsamples the tokens to an NxN grid, reducing the effect. (Disabled by default.)
vision_cond_dropout: probability of dropping the vision conditioning. During a training step this randomly chooses to ignore the vision conditioning and use the text conditioning instead. For example, 0.2 means it will use Redux 80% of the time and regular captions for the other 20%.

Experimental Notes:
vision_cond_ratio: I usually have to set it to 0.2 or lower before I start seeing meaningful differences in what gets learned.
vision_cond_dropout = 0.5 seems to work well enough; I noticed an improvement in the end result, with fewer "broken" images (bad anatomy, etc.) during inference.
The interpolation method behind vision_cond_ratio feels very crude and unsound to me; maybe there is a better approach?
vision_cond_downsample = 5 seems like a good place to start. Note: training now uses both text and Redux tokens simultaneously.

I don't expect this PR to be merged anytime soon; I had to make some sub-optimal code changes to make this work. I am just posting this for visibility, so that people can play with it and I can gather feedback.
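To make the effect of vision_cond_downsample concrete: a 27x27 grid is 729 Redux tokens, and N=5 reduces that to 25. The sketch below average-pools a flat token list down to an NxN grid. The pooling method is an assumption (the PR may resample differently), and `downsample_tokens` is a hypothetical name. It is written in pure Python for illustration, whereas real training code would operate on tensors.

```python
def downsample_tokens(tokens, grid, n):
    """Average-pool a flat list of grid*grid token vectors to n*n vectors.

    tokens: row-major list of equal-length float lists (one per token).
    grid: side length of the input grid (27 for default Redux output).
    n: side length of the output grid (e.g. 5).
    """
    assert len(tokens) == grid * grid
    dim = len(tokens[0])

    def bounds(i):
        # Window rule of adaptive average pooling: floor for the start,
        # ceil for the end, so windows overlap slightly when grid % n != 0.
        return (i * grid) // n, ((i + 1) * grid + n - 1) // n

    pooled = []
    for oy in range(n):
        y0, y1 = bounds(oy)
        for ox in range(n):
            x0, x1 = bounds(ox)
            window = [tokens[y * grid + x]
                      for y in range(y0, y1) for x in range(x0, x1)]
            pooled.append([sum(v[d] for v in window) / len(window)
                           for d in range(dim)])
    return pooled
```

For example, `downsample_tokens(tokens, 27, 5)` turns 729 token vectors into 25, which is the N=5 setting mentioned in the notes above.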