getting the wrong GPUs #422
Comments
Try the active queue: `qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=4:shared:gtxtitans`
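Once the interactive session starts, a quick way to confirm what was actually handed out (a sketch only, assuming a standard Torque GPU setup where `$PBS_GPUFILE` is populated on the node):

```bash
# GPU indices Torque assigned to this job
cat "$PBS_GPUFILE"

# names of the cards visible on the node
nvidia-smi -L
```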
You shouldn't need to use the …
Using the active queue didn't fix it, although I just managed to get 'good' (aka conforming to my request) GPUs on …
This worked correctly for me when I included … I wonder why omitting the …
Oh, there are some problems with GPU spillover though. I was allocated … This seems to be a docker job that is using GPUs but didn't request them, or is still running after supposedly being killed by …
It's impossible to tell who is/was running that docker job, but they are processing data in @gideonite's directory.
I will look in a moment. I have been on the road all morning.
I was running a docker container in an active session on gpu-2-14, but the process should have stopped using GPU resources sometime yesterday evening. Perhaps I requested the node by running …
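If the container were in fact still running on gpu-2-14, something like the following would confirm it and release the devices (a sketch assuming the standard docker CLI; the container ID is a placeholder):

```bash
# containers still running on the node
docker ps

# stop the lingering one so it releases /dev/nvidia*
docker stop <container_id>
```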
I have a dim memory of seeing this before, where without the "nodes" stanza qsub does not do what is expected. I would need to locate the GitHub issue or Torque ticket that matches that part of my memory. As for the docker item, I'd have to investigate that as well if you feel the state of the card is wrong.
The docker flag seems to be working fine!
For the gpu-2-14 docker and nvidia GPU resources item, I can show via lsof that these processes appear to still have nvidia devices open.
Those are docker processes (note the root part), and one way to narrow it down, besides it being the only docker job on the system, is that the cwd is showing /proc/23613/cwd -> /tf_data, which is within the chroot. If we expand that docker instance a bit, we also see that ipython is associated with that one:
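A sketch of the kind of commands used to trace this, assuming the standard lsof and docker CLIs (23613 is the PID mentioned above):

```bash
# which processes still hold the nvidia devices open
lsof /dev/nvidia*

# list the host PIDs inside each running container and look for 23613
for c in $(docker ps -q); do
    echo "== container $c"
    docker top "$c"
done
```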
So I guess the question is: "why isn't this correct when you've both, I believe, requested 'shared' mode for the nvidia cards?" Or do I misunderstand that aside? @corcra please note this is NOT related to your item.
I have reproduced this as you folks have: leaving off "nodes=X" appears to result in the behavior. Now trying to remember where I remember this from.
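For clarity, the two forms being compared are roughly these (reconstructed from the commands quoted earlier in the thread; queue and property names are site-specific):

```bash
# without the nodes/ppn stanza: the GPU properties are not honored
qsub -I -q gpu -l gpus=4:gtxtitan:docker:shared

# with the nodes/ppn stanza: parsed as intended
qsub -I -q gpu -l nodes=1:ppn=1:gpus=4:gtxtitan:docker:shared
```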
Things pointing at …
No, I don't think it's complicating things. I think the syntax you've used at the start of this basically doesn't work properly, and we've talked about it before. I'm just trying to locate that conversation.
Ah, we may have noted something similar in #275 and Adaptive assigned me a bug number after confirming a resource parsing error. Let me see if I can spot anything on that. I don't recall ever seeing that bug being fixed.
They believe it's basically the same bug and that nodes=X is required at this time in that release. He is, however, checking what happened to the bug number for the one I reported many moons ago, as it seems to have fallen out of existence. My preference with "enforced syntax" is that the parser should tell you it's wrong and not just "do something random" ;) I know I'm weird that way.
This is confirmed back in their bug system but not addressed. Please use nodes=X in qsub resource requests.
I am confused by / failing to construct `qsub` commands that get the correct resources. For example, I ran `qsub -I -q gpu -l gpus=4:gtxtitan:docker:shared` and got this setup (on `gpu-1-5`, fwiw): the GPUs aren't shared, and aren't gtxtitans ... what's going on here? I need both non-exclusive and gtxtitan (or >gtx680 at least) to run TensorFlow, so this is problematic.
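A quick way to see whether an allocation actually matches such a request, once inside the job (the query fields are standard nvidia-smi; mapping "shared" to the Default compute mode is an assumption about how this cluster configures its cards):

```bash
# model and compute mode of each visible GPU; shared cards would normally
# show "Default" here, exclusive ones "Exclusive_Process"
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
```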