Yank processes not cleanly exiting #415
Comments
Why do you feel the GPUs are not in use? |
Because I checked for the running gpu jobs and found very few. |
How are you checking? As I do not see that. Removing username.
Double checking but that sure seems like a fair chunk. |
The program is called "yank" and I show it on lots of GPUs. |
Oddly enough, there are some GPUs that are running jobs from two different people, which shouldn't be happening:
|
Alright, I misinterpreted that and thought these were single jobs/gpus. Thank you, Paul. I will probably need to deal with this diplomatically ;) |
I show
It looks like this is related to #409 (comment) |
John, do you think it would be possible to keep 5-10 gpus usable or unoccupied by yank until Friday (NIPS deadline)? I really will not be needing any more, but just having 3, as is the case now, is a bit tight. |
I show processes by luirink on gpu-2-5. If the code is not exiting cleanly it will need to be fixed.
|
But basically @karalets the statement in this Git that "the cluster seems to prefer to let the gpus gather dust over giving them to me" is incorrect, and I'm still waiting to hear how you determined that so we can resolve the support you actually requested. I show yank consuming all the GPUs. |
I see your note @karalets. Ignore. If you want my method of parsing the qstat output, I can document it. |
@MehtapIsik: Can you halt one or more of your |
Paul @tatarsky, I must have made a mistake: as I said earlier, I was just counting the number of jobs, not the number of gpus each of these jobs was using. So please put my comments down to my ignorance of the cluster diagnostics. |
Noted your Git response already. Just an ordering issue. I am happy to share my method of determining this. |
I was doing 'qstat | grep gpu' which obviously was suboptimal. |
Yeah that doesn't work. |
My output was from a cheap quick way. There are others.
|
I'll look into why. @tatarsky: Have you seen threads like this that note that this might be due to the way in which
Can check the old Relion notes, in which I dimly recall something similar. |
The hydra
|
(@karalets: I'm in Boston, but @MehtapIsik is just upstairs in Z11 if you want to go find her in person!) |
There is probably a better way, but that's my cheap way. The spaces around "gpu" keep the grep from also matching host names, which contain gpu as well. |
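As an illustration of that filtering idea (a sketch, not @tatarsky's actual command): the snippet below checks qstat's Queue column rather than grepping anywhere in the line, so host names like gpu-2-5 cannot produce false matches. It assumes Torque's default six-column qstat output and a queue literally named "gpu", neither of which is confirmed in this thread.

```python
# Sketch: count jobs in the GPU queue by inspecting the Queue column of
# `qstat` output instead of grepping the whole line. Assumes the default
# Torque layout: Job ID, Name, User, Time Use, S, Queue.
import subprocess

def jobs_in_queue(queue_name="gpu"):
    """Return the qstat lines whose Queue field matches queue_name exactly."""
    out = subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout
    matches = []
    for line in out.splitlines():
        fields = line.split()
        # Skip headers/separators; the queue name is the last column.
        if len(fields) >= 6 and fields[-1] == queue_name:
            matches.append(line)
    return matches

if __name__ == "__main__":
    gpu_jobs = jobs_in_queue()
    print(f"{len(gpu_jobs)} running or queued jobs in the gpu queue")
```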
There's also this:
|
This page suggests:
I'm still investigating whether this requires we specify a specific flag to |
My Relion memory may not be relevant. That issue involved a wrapper script that was not properly providing mpirun with the nodefile, or was blocking its ability to get the one provided by the Torque environment. At least that is my brief review of the old issue. The MPI "tm" interface was indeed discussed. But I don't recall a flag for better exiting. Git #329 has all sorts of stuff in it. Probably of no relevance. |
Do you wish all luirink node processes killed off before I scrub the name from the Git page? Or does it help to leave some to debug why they are not dead? |
I'm changing the Git name so I can focus on that part. I will mention there is a Tesla sitting idle in |
Hmm. I may need to adjust that system's gpu queue oversubscription rule though. Let me check on that. |
Also there are a fair number of free gpus at the moment. |
Ah but slots/ram are a bit short. Let me see if I can tweak something there. |
I added the node property. I do not know enough about user reservations to try, on a Friday, to guarantee you get priority on those nodes. Perhaps next week. |
Hmm. And it may not be working as I expect. So if it doesn't seem to do what you need I'll have to revisit it. |
There are, however, now several free gpus on the cluster, but I don't know how long that will last. |
Thanks! Will debug over the weekend! Traveling back from Boston tonight. @karalets: This is a serious issue and needs to be addressed ASAP! Thanks for pointing it out! |
Starting to debug now. Thanks again, @tatarsky! @karalets: I notice that your job
This may be intended, so no worries if so, but it seems like it may be a sign something is wrong! |
Looks like the 'gtxdebug' and I've tried to replicate this problem on the GTX-680s, but am unable to. The jobs all terminate cleanly when Torque kills the master process. |
Seeing if I can spot some other titans to apply the tag. |
@tatarsky: If sometime this week you have an idea about how to reserve a node of GTX-TITANs for us to use to debug (via Torque), I can sit down with @MehtapIsik and interactively try to see if there are any issues with this specifically, or with her environment, that may be causing this problem. In the meantime, I am unable to reproduce, and can't debug further until GTX-TITAN nodes are free. @karalets: Will try debugging again tomorrow in case some nodes are free. |
BTW the tag is |
No, I'm wrong. He's got jobs on lots of the |
Well, also, as you note, he seems to be requesting four gpus, but nvidia-smi shows only 1 gpu in use, at least in a spot check of gpu-2-10.
And unless I'm reading this wrong:
|
@karalets has reserved all four GPUs on those nodes and locked all of the GPUs in thread-exclusive mode, even though he is only using one:
I'm guessing this is unintentional. @karalets: Might want to check what's going on with your code. |
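For context, here is a rough sketch of how one could spot that situation from a node. It assumes nvidia-smi's --query-gpu interface; the function name and the idea of flagging non-default-compute-mode GPUs that report zero utilization are illustrative, not taken from this thread.

```python
# Sketch: flag GPUs that are locked in an exclusive (non-Default) compute
# mode but report 0% utilization, which can indicate a process holding
# devices it is not actually using.
import subprocess

def idle_exclusive_gpus():
    """Return (index, compute_mode) for non-Default GPUs showing 0% utilization."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,compute_mode,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    flagged = []
    for line in out.strip().splitlines():
        index, mode, util = [field.strip() for field in line.split(",")]
        if mode != "Default" and int(util) == 0:
            flagged.append((index, mode))
    return flagged

if __name__ == "__main__":
    for index, mode in idle_exclusive_gpus():
        print(f"GPU {index}: {mode} compute mode but 0% utilization")
```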
Yep. Seeing the same. |
In the meantime, thanks for the weekend help, @tatarsky, and let's connect up during the week to further debug if needed! |
Fair enough. |
I've found a free GTX-TITAN-X node ( |
I have opened up some of them after reading this. |
Thanks! @karalets: Was the use of 1/4 GPUs expected? |
I sometimes reserve a bunch and use a variable amount when I am trying out new code. |
Ah, OK! Thanks for the clarification! |
I can reproduce this on |
I've tried this dozens of times, but I can't seem to consistently reproduce this problem. It happened once, but I don't seem to be able to get it to happen again. |
I suspect it mostly happens when queue jobs time out. |
I'm currently trying to harden the YANK code with an explicit call to
@tatarsky: Do you know what signal Torque sends when killing jobs that hit their resource limits? The following dump to
Are there other signals Torque might send that I should worry about too? I think it might send a |
I believe SIGTERM is correct. And it does indeed send SIGKILL after some period, which I dimly recall is also tunable but defaults to 60 seconds. There is also a command called
I believe it also logs when it sends the signal with something like this:
|
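One way to answer the which-signal question empirically (a throwaway sketch, not something used in this thread) is to submit a short-walltime job that simply logs whichever catchable signal it receives before the follow-up SIGKILL lands:

```python
# signal_probe.py -- sketch for checking which signal the batch system
# delivers at the walltime limit. Submit with a short walltime; SIGKILL
# cannot be caught, but SIGTERM/SIGINT/SIGUSR1/SIGUSR2 will be logged.
import signal
import sys
import time

def report(signum, frame):
    # signal.Signals(signum).name gives e.g. "SIGTERM" (Python 3.5+).
    print(f"received {signal.Signals(signum).name}", flush=True)
    sys.exit(0)

# Register the catchable candidates.
for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1, signal.SIGUSR2):
    signal.signal(sig, report)

# Outlive the requested walltime so the scheduler has to kill the job.
while True:
    time.sleep(60)
```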
"I suspect it mostly happens when queue jobs time out." I assume this means when the jobs hit a walltime limit and are killed. Is there a major problem in simply stating the walltime higher and letting the jobs complete without such an event? Or is the walltime limit being used to control the usage of the job. Its a subtle point, but why not set the walltime to a value that better matches the needs of the job. |
The actual jobs may take many days, but the walltime limit is being used to break the jobs into more queue-neighbor-friendly chunks. So it is a significant problem if our code doesn't cleanly exit when requested to do so!
Still tinkering with MPI.Abort() calls...
|
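A minimal sketch of the kind of hardening being described, assuming mpi4py is in use (YANK's actual implementation may differ): catch SIGTERM, do quick cleanup, and call Abort on the world communicator so every rank exits before the follow-up SIGKILL.

```python
# Sketch (not YANK's actual code): exit cleanly when the scheduler sends
# SIGTERM at the walltime limit, so no orphaned ranks linger on the GPUs
# until SIGKILL arrives.
import signal
import sys

from mpi4py import MPI

comm = MPI.COMM_WORLD

def terminate(signum, frame):
    # Keep this short: the scheduler follows up with SIGKILL after a delay
    # (roughly 60 seconds by default, per the comment above).
    sys.stderr.write(f"rank {comm.rank}: caught SIGTERM, aborting MPI job\n")
    sys.stderr.flush()
    # Abort() tears down every rank in the communicator, not just this one.
    comm.Abort(1)

signal.signal(signal.SIGTERM, terminate)

# ... the main simulation loop would run here ...
```

A more cautious variant would only set a flag in the handler and call Abort from the main loop, since making MPI calls directly inside a signal handler is not guaranteed to be safe.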
Hi,
I am trying to get some gpus in interactive mode and I am really having a hard time getting them. Normally this would be fine and I'd wait until the cluster clears up, but... the gpus are simply not really in use, so I am not competing against anybody to get them.
As such, the cluster seems to prefer to let the gpus gather dust over giving them to me. Shocker!
Is there any explanation for this? Or, to be actionable: can I do something to change that?
Best and thanks,
Theo