[WIP] fix Dockerfile.gpu #34

Open: 1 commit to be merged into base: master

Conversation

Hoda1394
Contributor

I have tried many different things to address issue #33. Of those attempts, this Dockerfile builds without error, but when I run the container, TensorFlow does not see the GPU and runs on the CPU.
The container image is available on Docker Hub as hodadock/kwyk:gpu_test

@satra, @kaczmarj - any idea how we can fix it?

@kaczmarj
Member

kaczmarj commented Jul 19, 2022

@Hoda1394 - what command are you using to run the container? i don't have any experience running a docker image with a gpu; i've only used apptainer/singularity with gpu.

a few potential problems come to mind (not saying that any of these are present here):

  1. the cuda/cudnn versions are not appropriate for the installed tensorflow version (though using the official tensorflow image should prevent this issue).
  2. the docker run command is not correct. i'm assuming there are some extra flags that need to be added to use gpu.
  3. the nvidia drivers are too old on the host system (possible, but unlikely because this container uses tensorflow 1.x, which has been around for several years now).

another point -- you can test whether a gpu is available with tf.test.is_gpu_available(). tensorflow 2.x also has tf.config.list_physical_devices("GPU") but not sure if 1.x has it.

another thought -- try validating that the official tensorflow image can use the gpu. so run the tensorflow/tensorflow:1.12.3-gpu-py3 image in a way that should use the gpu and test that it actually sees the gpu. if the container sees the gpu, the problem is somewhere in the dockerfile.
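The validation step above could be scripted roughly as follows. This is only a sketch: the helper name `gpu_check_command` is made up for illustration, `--gpus all` assumes Docker 19.03 or newer with the NVIDIA container toolkit installed on the host, and `--nv` is the Singularity/Apptainer flag for exposing the host NVIDIA driver. The script only builds and prints the command, it does not run it.

```python
import shlex


def gpu_check_command(image, runtime="docker"):
    """Build a command that asks TensorFlow inside `image` whether it sees a GPU.

    `gpu_check_command` is a hypothetical helper, not part of this project.
    """
    probe = "import tensorflow as tf; print(tf.test.is_gpu_available())"
    if runtime == "docker":
        # Assumption: Docker >= 19.03 with the NVIDIA container toolkit on the host.
        prefix = ["docker", "run", "--rm", "--gpus", "all", image]
    elif runtime == "singularity":
        # --nv makes the host NVIDIA driver libraries visible in the container.
        prefix = ["singularity", "exec", "--nv", "docker://" + image]
    else:
        raise ValueError("unknown runtime: " + runtime)
    return prefix + ["python", "-c", probe]


if __name__ == "__main__":
    for runtime in ("docker", "singularity"):
        cmd = gpu_check_command("tensorflow/tensorflow:1.12.3-gpu-py3", runtime)
        # shlex.quote keeps the python -c snippet intact when pasted into a shell.
        print(" ".join(shlex.quote(part) for part in cmd))
```

If the printed command reports True for the official image but False for the image built from this Dockerfile, the problem is most likely in the Dockerfile rather than in the run command or the host drivers.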

@Hoda1394
Contributor Author

Actually, I was running the Singularity conversion of this image with GPU support.
The GPU is visible inside the container, but TensorFlow can't see it. I tested with the official image, and tf.test.is_gpu_available() returns False. So it seems that the issue is related to the base image!
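For reference, a small report like the one below makes it easier to tell "TensorFlow was not built with CUDA" apart from "TensorFlow was built with CUDA but cannot find a device". It is a sketch, not the project's code: the function name `tensorflow_gpu_report` is made up, and it only relies on `tf.test.is_built_with_cuda()` and `tf.test.is_gpu_available()`, both of which were used in this thread.

```python
def tensorflow_gpu_report():
    """Collect basic facts about the TensorFlow build visible to this interpreter.

    Hypothetical helper for debugging; returns a plain dict.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable in this environment at all.
        return {"installed": False}

    report = {"installed": True, "version": tf.__version__}
    # False here means a CPU-only build was imported, whatever the host has.
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    # Guarded lookup, in case a release no longer provides this function.
    check = getattr(tf.test, "is_gpu_available", None)
    report["gpu_available"] = check() if check is not None else None
    return report


if __name__ == "__main__":
    for key, value in tensorflow_gpu_report().items():
        print(key, "=", value)
```

If `built_with_cuda` is False, no docker or singularity flag will help, because the imported package is a CPU-only build.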

@Hoda1394
Contributor Author

As additional info, when I run pip list | grep tensorflow inside the container, I get:

tensorflow          1.12.3                
tensorflow-gpu      1.12.0 

There are two versions of TensorFlow installed. I'm not sure whether this can cause the issue...

@kaczmarj
Member

> As additional info when I run pip list |grep tensorflow inside the container, I get
>
> tensorflow          1.12.3
> tensorflow-gpu      1.12.0
>
> there are two versions of TensorFlow installed. not sure if this can cause this issue...

this is probably the problem (or one of them!). can you try pip list with the base image? see which one is present. and see if the base image can see the gpu.
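The pip list check can be automated along these lines. It is a rough sketch: the function names are made up, and the sample text is the output pasted earlier in this thread. Both distributions provide the same importable `tensorflow` package, so having both installed leaves it unclear which build actually gets imported.

```python
def tensorflow_distributions(pip_list_output):
    """Return {name: version} for tensorflow and tensorflow-gpu rows of `pip list`.

    Hypothetical helper; expects the plain two-column `pip list` format.
    """
    found = {}
    for line in pip_list_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].lower() in ("tensorflow", "tensorflow-gpu"):
            found[parts[0].lower()] = parts[1]
    return found


def has_conflict(found):
    """True when both the CPU and the GPU distribution are installed."""
    return "tensorflow" in found and "tensorflow-gpu" in found


# Sample copied from the pip list output reported in this thread.
sample = """\
tensorflow          1.12.3
tensorflow-gpu      1.12.0
"""

found = tensorflow_distributions(sample)
print(found)
print("conflict:", has_conflict(found))
```

To use it on a real environment, pass it the captured text of `pip list` from inside the container instead of the sample string.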

@Hoda1394
Contributor Author

I tried this with the base image and saw both packages. When I start python and import tensorflow as tf, tf.__version__ returns 1.12.3. So it seems that tensorflow is being imported rather than tensorflow-gpu.
I tried to uninstall it inside the container, but I was not successful.

@kaczmarj
Member

i can reproduce this... it could be a problem with the 1.12.3-gpu-py3 docker image. why are we using such an old image anyway?

docker run --rm tensorflow/tensorflow:1.12.3-gpu-py3 python -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'
False

the 1.14.0-gpu-py3 image works.

docker run --rm tensorflow/tensorflow:1.14.0-gpu-py3 python -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'
True

we should probably use a newer image. i realize we used 1.12 in the project, but we can test if everything works correctly with 1.15 (the last release of the 1.x series).

@Hoda1394
Contributor Author

I tried removing tensorflow during the build, but tensorflow-gpu doesn't work properly without it. I have already tested TensorFlow 1.15 and got some other errors due to the version mismatch, so if we want to use TensorFlow 1.15, we may need to update the code.

@Hoda1394
Contributor Author

I will also try the 1.14.0-gpu-py3 image.

@kaczmarj
Member

feel free to post any errors you get when trying newer versions. paste the entire traceback and i can take a look
