Average queue wait time is now over 10 hours! #405
Comments
Noted. I am trying to prepare a queue modification that would allow the Fuchs group's purchased nodes to act as batch and GPU nodes when idle. There is a detail involving the unlimited wall time of the batch queue, however, that I wanted to propose a change for, and that change is still being reviewed. One thing I could start now, if you have a moment: if I offline one of those nodes (since it's idle), can you validate that your code works properly on it via manual SSH?
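For reference, taking an idle node out of the queue for manual testing and putting it back afterwards amounts to toggling its offline flag; a minimal sketch, assuming a Torque/PBS-style `pbsnodes` command and using gg06 only because it is the node discussed in this thread:

```python
#!/usr/bin/env python3
"""Drain or restore a node for manual testing.

A sketch assuming a Torque/PBS-style `pbsnodes`; the node name is an example.
"""
import subprocess
import sys


def set_node_state(node: str, offline: bool) -> None:
    # `pbsnodes -o` marks the node offline; `pbsnodes -c` clears the offline flag.
    flag = "-o" if offline else "-c"
    subprocess.run(["pbsnodes", flag, node], check=True)


if __name__ == "__main__":
    node = sys.argv[1] if len(sys.argv) > 1 else "gg06"
    action = sys.argv[2] if len(sys.argv) > 2 else "offline"
    set_node_state(node, offline=(action == "offline"))
    # Print the node's current state so the change can be confirmed.
    subprocess.run(["pbsnodes", node], check=True)
```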
I could do that now!
OK. Please ssh manually to gg06. It has two GTX Titans.
Looks like you have the wrong CUDA version set as the default:
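The output that revealed the mismatch is not shown above. A quick check of this kind (a sketch, assuming the conventional /usr/local/cuda symlink and that nvcc and nvidia-smi are on the PATH) might look like:

```python
#!/usr/bin/env python3
"""Sanity-check the default CUDA toolkit on a GPU node.

Assumes the usual /usr/local/cuda symlink convention; adjust for the actual node.
"""
import os
import subprocess

# Resolve where the default CUDA symlink actually points.
cuda_link = "/usr/local/cuda"
if os.path.islink(cuda_link):
    print(f"{cuda_link} -> {os.path.realpath(cuda_link)}")
else:
    print(f"{cuda_link} is not a symlink (or is missing)")

# Report the compiler's version string and the GPUs the driver sees.
for cmd in (["nvcc", "--version"], ["nvidia-smi", "-L"]):
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(out.stdout.strip())
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"{cmd[0]} failed: {exc}")
```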
One second. That is correct.
OK. Fixed some rules. Try again.
Hmm. I'm actually seeing a regression somewhere related to the default /usr/local/cuda symlink. I'm looking into it now.
OK. I believe that is correct everywhere now. I noted a few were still pointing at 7-0 and I'm not sure why; I am investigating.
Found the rule issue, and I believe it is now fixed correctly. I will double-check after the next Puppet run, but please continue to test as desired on gg06.
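Since a few nodes had been left pointing at 7-0, one way to confirm the fix propagated after the next Puppet run is to read the symlink target on each node over SSH. This is only a sketch: the node names below are placeholders, and it assumes passwordless SSH to the GPU nodes.

```python
#!/usr/bin/env python3
"""Report the /usr/local/cuda symlink target across a set of GPU nodes.

The node list is hypothetical; substitute the real gg* hostnames.
"""
import subprocess

NODES = ["gg01", "gg02", "gg06"]  # placeholder hostnames for illustration

for node in NODES:
    result = subprocess.run(
        ["ssh", node, "readlink", "-f", "/usr/local/cuda"],
        capture_output=True, text=True,
    )
    target = result.stdout.strip() or result.stderr.strip()
    print(f"{node}: {target}")
```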
Seems to work now. Thanks!
OK. I will get an update on the rules for making these nodes able to handle overflow. Thank you for testing. I will likely announce a general "batch" test on this node as well.
Please note gg06 is back in the queue. I believe you got the data we need to proceed with the process, but the group has an important deadline.
So I've been watching this, and while we still don't have full agreement on how to share the added nodes, I am paying attention to the average Qtime. It's currently down to:
Work continues on the policies/config to allow sharing of the additional nodes that groups purchased. I'm currently making use of standing reservations when a deadline is upon the group in question. I am, however, leaving this issue open until I get a firmer final statement on some of those sharing policies.
I feel sorry for the poor sap who was waiting 351.68 hours (nearly 15 days) for their jobs to start...
So I wrote a script to analyze the job logs for that wait time, and I cannot locate the job shown there as the max. The longest I see is actually a GPU job of yours back on 5/16, which was in the queue for 115 hours. Still not good, but I can't find that one, which means my script is probably wrong. I'm trying to quantify the resource shortage for @juanperin as input to the discussions going on about sharing the added nodes.
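The script itself is not shown in the thread. A rough sketch of that kind of analysis, assuming Torque-style accounting logs whose "E" (job exit) records carry epoch fields such as qtime=, etime= and start=, is below; the log path is a placeholder. One possible source of disagreement between tools is whether wait is measured from qtime (submission) or etime (eligible time), which excludes time spent held.

```python
#!/usr/bin/env python3
"""Rough queue-wait analysis over Torque-style accounting logs.

Wait is taken here as start - qtime; swap in etime to ignore held time.
"""
import glob
import re
import sys

FIELD = re.compile(r"(\w+)=(\S+)")

waits = []  # (hours_waited, job_id)
pattern = sys.argv[1] if len(sys.argv) > 1 else "accounting/*"  # placeholder path
for path in glob.glob(pattern):
    with open(path, errors="replace") as fh:
        for line in fh:
            # Accounting records look like: date;type;job_id;key=value key=value ...
            parts = line.rstrip("\n").split(";")
            if len(parts) < 4 or parts[1] != "E":
                continue
            fields = dict(FIELD.findall(parts[3]))
            try:
                qtime = int(fields["qtime"])
                start = int(fields["start"])
            except (KeyError, ValueError):
                continue
            waits.append(((start - qtime) / 3600.0, parts[2]))

if waits:
    waits.sort(reverse=True)
    avg = sum(w for w, _ in waits) / len(waits)
    print(f"jobs: {len(waits)}  avg wait: {avg:.2f} h  "
          f"max: {waits[0][0]:.2f} h ({waits[0][1]})")
```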
This is getting to be pretty long: