Replies: 6 comments 4 replies
-
You need --vae_cache_preprocess
-
I added
-
Can you try with just one GPU?
-
For what it's worth, 40 seconds is a lot more than I'd expect, especially for a LoRA on a 2B model. It should be more like 3-5 seconds per step at worst, and around 10 seconds per step when training from an S3 backend. Edit: make sure you're not using DoRA; it's slower.
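As a rough sanity check of the gap, here is a minimal sketch in plain Python. The 3-5 s and 10 s figures are just the ballpark expectations quoted above, and 40 s is the step time reported in this thread; nothing here is SimpleTuner-specific.

```python
# Compare the reported step time against the ballpark expectations above:
# ~3-5 s/step for a LoRA on a 2B model locally, ~10 s/step from an S3 backend.
reported_step_s = 40.0
expected_local_s = (3.0, 5.0)
expected_s3_s = 10.0

slowdown_vs_local = reported_step_s / expected_local_s[1]
slowdown_vs_s3 = reported_step_s / expected_s3_s

print(f"{slowdown_vs_local:.0f}x slower than the ~5 s/step local expectation")
print(f"{slowdown_vs_s3:.0f}x slower than the ~10 s/step S3 expectation")
# -> roughly 8x and 4x slower, which suggests a setup issue (e.g. DoRA
#    enabled or no pre-cached VAE latents) rather than normal overhead.
```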
-
Tested with 1 GPU (A8000 with 48GB); with batch size 10 it still takes ~40 seconds per iteration.
-
I finally found the problem: SimpleTuner has defaulted to use
-
Hi,
Thanks for this nice repo!
I have been trying to train a LoRA on SD3 using the multi-GPU setting. I am on 4 A6000 GPUs (48 GB each), and my dataset is 1000 1024x1024 images. I set the batch size to 10 with no gradient accumulation, and each iteration takes 40-50 seconds to complete.
This training speed seems dramatically slower than many of the other logs I've found in this repo's issues. Is this normal, or do I potentially have something set up wrong?
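For reference, a quick back-of-the-envelope throughput calculation from these numbers (plain Python, nothing SimpleTuner-specific; the 45 s step time is just the midpoint of the observed 40-50 s range):

```python
# Throughput implied by the setup described above:
# 4 GPUs x batch size 10 = 40 images per optimizer step, at 40-50 s/step.
gpus = 4
batch_per_gpu = 10
step_time_s = 45.0           # midpoint of the observed 40-50 s/iteration
dataset_size = 1000          # 1024x1024 images

images_per_step = gpus * batch_per_gpu
images_per_second = images_per_step / step_time_s
steps_per_epoch = dataset_size / images_per_step
epoch_time_min = steps_per_epoch * step_time_s / 60

print(f"{images_per_second:.2f} images/s, {steps_per_epoch:.0f} steps/epoch, "
      f"~{epoch_time_min:.0f} min/epoch")
# -> roughly 0.9 images/s, 25 steps/epoch, ~19 min per epoch at 45 s/step
```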
Best regards and thanks in advance.