
kohya_ss flux #5

Open
dimonnwc3 opened this issue Nov 13, 2024 · 5 comments
@dimonnwc3

I was trying to fine-tune a Flux model with the sd3-flux.1 branch by adding the KOHYA_REF=sd3-flux.1 env variable.

Although the container starts, training fails immediately with the following error: Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory

I checked the /usr/lib/x86_64-linux-gnu directory, and the libnvrtc.so and libnvrtc.so.12 files are missing for some reason.

Then I tried to mount x86_64-linux-gnu from the host as a volume by changing my docker-compose file:

    volumes:
        - aidock_workspace_dev:/workspace
        - /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu  # <- added this line

Fine-tuning starts working, but during startup and training it still shows some errors:

1.

ERROR: ld.so: object 'libtcmalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
This one happens on startup and recurs multiple times afterwards.

2.

E0000 00:00:1731487428.843877    2536 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1731487428.847053    2536 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

This one happens when fine-tuning starts.

It seems my workaround of mapping /usr/lib/x86_64-linux-gnu from the host is not correct, and there must be another way to fix the original error: Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory.

Any ideas how to make it work?
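For anyone hitting the same symptom, a quick diagnostic sketch for checking whether the library is actually visible inside the container (the paths below are the conventional Ubuntu/CUDA locations, not taken from this specific image):

```shell
# Check whether libnvrtc is registered with the dynamic linker.
if ldconfig -p 2>/dev/null | grep -q libnvrtc; then
    echo "libnvrtc: registered with the dynamic linker"
else
    echo "libnvrtc: NOT registered with the dynamic linker"
fi

# Also look in the usual install directories directly; errors for
# directories that don't exist in this image are suppressed.
find /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64 \
    -name 'libnvrtc.so*' 2>/dev/null || true
```

If neither check finds the library, mounting host directories over /usr/lib is unlikely to be a clean fix, since host libraries may not match the container's CUDA toolkit version.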

@robballantyne
Member

@dimonnwc3 I have built a new image for the flux branch. I have not had time to test it, but feel free to give it a try. If we're still having issues with the shared libs, I'll update the base image.

@dimonnwc3
Author

Thank you!

I tried the image with the sd3-flux.1 tag and noticed a minor issue: it starts with the correct branch, but then switches back to v24.1.7 if AUTO_UPDATE is enabled and KOHYA_REF is not defined.

After explicitly defining KOHYA_REF to make sure I'm on the right branch, I still get the same error after starting the training:

Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so. Error: libnvrtc.so: cannot open shared object file: No such file or directory
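As a side note on the branch-switching issue: pinning both variables in docker-compose should keep the container on the intended branch. A minimal sketch, assuming AUTO_UPDATE accepts a false value (the variable names come from this thread; check the image docs for the exact accepted values):

```yaml
environment:
  # Pin the branch explicitly so an auto-update cannot switch it back to v24.x.
  - KOHYA_REF=sd3-flux.1
  # Illustrative value: disable auto-update entirely.
  - AUTO_UPDATE=false
```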

@dimonnwc3
Author

After some debugging I found out that libnvrtc.so is not part of the base CUDA image; however, it is present in the devel variant.

Building a custom Docker image with the build arg IMAGE_BASE=ghcr.io/ai-dock/python:3.10-v2-cuda-12.1.1-devel-22.04 solves the issue.

Also, because of the devel base, the image is significantly larger.

@kostenickj

> After some debugging I found out that libnvrtc.so is not part of the base CUDA image; however, it is present in the devel variant.
>
> Building a custom Docker image with the build arg IMAGE_BASE=ghcr.io/ai-dock/python:3.10-v2-cuda-12.1.1-devel-22.04 solves the issue.
>
> Also, because of the devel base, the image is significantly larger.

Did you ever get this working? I'm also trying to use this for Flux. If so, would you mind sharing your Dockerfile?

@dimonnwc3
Author

> After some debugging I found out that libnvrtc.so is not part of the base CUDA image; however, it is present in the devel variant. Building a custom Docker image with the build arg IMAGE_BASE=ghcr.io/ai-dock/python:3.10-v2-cuda-12.1.1-devel-22.04 solves the issue. Also, because of the devel base, the image is significantly larger.
>
> Did you ever get this working? I'm also trying to use this for Flux. If so, would you mind sharing your Dockerfile?

I ended up building a custom image like this:

    docker build --build-arg KOHYA_BUILD_REF=d0cd9f5 --build-arg PYTORCH_VERSION=2.5.0 --build-arg IMAGE_BASE=ghcr.io/ai-dock/python:3.10-v2-cuda-12.1.1-devel-22.04 -t latest .
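For context on why the IMAGE_BASE build arg works: overriding it only takes effect because the project's Dockerfile parameterizes its FROM line. A generic sketch of that pattern (not the actual ai-dock Dockerfile):

```dockerfile
# Generic sketch of a parameterized base image (not the real ai-dock Dockerfile).
# `docker build --build-arg IMAGE_BASE=...` overrides the default at build time,
# which is how the runtime base can be swapped for the devel variant that ships
# libnvrtc.
ARG IMAGE_BASE=ghcr.io/ai-dock/python:3.10-v2-cuda-12.1.1-devel-22.04
FROM ${IMAGE_BASE}
# ...the project's own build steps follow...
```

Note that an ARG used in FROM must be declared before the FROM instruction, as above.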
