Average queue wait time is now over 10 hours! #405

Open
jchodera opened this issue Apr 29, 2016 · 15 comments

@jchodera (Member)

This is getting to be pretty long:

```
[chodera@mskcc-ln1 ~/scripts]$ showstats

moab active for   14:00:12:17  stats initialized on Tue Mar  8 12:16:45 2016

Eligible/Idle Jobs:              2102/2102   (100.000%)
Active Jobs:                      673
Successful/Completed Jobs:     235236/235236 (100.000%)
Avg/Max QTime (Hours):          10.82/351.68
Avg/Max XFactor:                 0.13/704.36

Dedicated/Total ProcHours:      1.34M/3.97M  (33.660%)

Current Active/Total Procs:      1706/3344   (51.017%)

Avg WallClock Accuracy:          10.335%
Avg Job Proc Efficiency:         66.654%
Est/Avg Backlog:                19:37:00/1:11:55:15
```
@tatarsky (Contributor)

Noted. I am preparing a queue modification that would allow the nodes purchased by the Fuchs group to act as batch and gpu nodes when idle. However, there is a detail involving the unlimited wall time of the batch queue that I want to propose a change for, and that proposal is still being reviewed.

One item we could start now, if you have a moment: if I offline one of those nodes (as it's idle), can you validate that your code works properly on it via manual SSH?
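
(For reference, a minimal sketch of the offline/restore step on a Torque/Moab cluster, assuming Torque's pbsnodes is what manages node state here; the specific node, gg06, is named in the next comment.)

```
# Sketch only, assuming Torque's pbsnodes manages node state on this cluster.
pbsnodes -o gg06      # mark the node offline so the scheduler stops placing new jobs on it
pbsnodes -l           # list down/offline nodes to confirm the state change
# ...manual SSH testing on the node happens here...
pbsnodes -c gg06      # clear the offline flag and return the node to service
```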

@jchodera (Member, Author)

I could do that now!

@tatarsky (Contributor)

OK. Please ssh manually to gg06. It has two GTX Titans.

@jchodera (Member, Author)

Looks like you have the wrong CUDA version set as the default:

```
[chodera@mskcc-ln1 ~]$ which nvcc
/usr/local/cuda-7.5//bin/nvcc
[chodera@mskcc-ln1 ~]$ ssh gg06
Last login: Fri Apr 29 11:27:20 2016 from mskcc-ln1.fast
[chodera@gg06 ~]$ which nvcc
nvcc: Command not found.
[chodera@gg06 ~]$ ls -ltr /usr/local/cuda
lrwxrwxrwx 1 root root 19 Mar 10 15:13 /usr/local/cuda -> /usr/local/cuda-7.0
```
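
(For reference, a minimal sketch of a manual check and fix, assuming CUDA 7.5 is installed under /usr/local/cuda-7.5 on these nodes; as the follow-up comments note, the symlink on this cluster is managed by Puppet rules, so a hand edit would only be a temporary workaround.)

```
# Sketch only: verify and repoint the default CUDA toolkit symlink.
ls -l /usr/local/cuda                              # currently -> /usr/local/cuda-7.0 on gg06
sudo ln -sfn /usr/local/cuda-7.5 /usr/local/cuda   # repoint the default toolkit (needs root)
export PATH=/usr/local/cuda/bin:$PATH              # pick up nvcc from the default location
nvcc --version                                     # should now report release 7.5
```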

@tatarsky (Contributor)

One second; that is correct.

@tatarsky (Contributor)

OK. Fixed some rules. Try again.

@tatarsky (Contributor)

Hmm. I'm actually seeing a regression somewhere with the default /usr/local/cuda symlink. I'm checking into it now.

@tatarsky (Contributor)

OK. I believe the symlink is correct everywhere now. I noted a few nodes were still pointing at 7.0 and I'm not sure why; I am investigating.

@tatarsky (Contributor)

Found the rule issue, and I believe it is now fixed correctly. I will double-check after the next Puppet run, but please continue to test on gg06 as desired. When the other groups' concerns about unlimited batch walltime on these nodes have been reviewed, I will update everyone via a separate GitHub issue.
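
(For reference, the kind of Puppet rule that typically manages such a symlink is sketched below; this is a hypothetical illustration, not the actual manifest used on this cluster.)

```
# Hypothetical sketch of a Puppet file resource managing the default CUDA symlink.
file { '/usr/local/cuda':
  ensure => link,
  target => '/usr/local/cuda-7.5',
}
```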

@jchodera (Member, Author)

Seems to work now. Thanks!

@tatarsky (Contributor)

OK. I will get an update on the decision about letting these nodes handle overflow. Thank you for testing. I will likely announce a general "batch" test on this node as well.

@tatarsky (Contributor)

Please note gg06 is back in the queue. I believe you got the data we need to proceed with the process, but the owning group has an important deadline.

@tatarsky (Contributor)

So I've been watching this, and while we still don't have full agreement on how to share the added nodes, I am paying attention to the average QTime.

It's currently down to:

```
Avg/Max QTime (Hours):           6.34/351.68
```

Work continues on the policies/config to give the groups that purchased additional nodes appropriate access to them. For now I'm making use of standing reservations when a deadline is upon the group in question.

I am, however, leaving this open until I get a firmer final statement on some of those sharing policies.
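
(For reference, a standing reservation of the kind described above is typically declared in moab.cfg; the sketch below is hypothetical, with placeholder reservation, group, and host names rather than this cluster's actual configuration.)

```
# Hypothetical moab.cfg sketch: hold the purchasing group's node for that group
# while a deadline is active (all names here are placeholders).
SRCFG[fuchs_deadline] HOSTLIST=gg06
SRCFG[fuchs_deadline] GROUPLIST=fuchslab
SRCFG[fuchs_deadline] PERIOD=INFINITY
```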

@jchodera (Member, Author)

I feel sorry for the poor sap who was waiting 351.68 hours (15 days) for their jobs to start...

@tatarsky (Contributor)

So I wrote a script to analyze the job logs for that wait time, and I cannot locate the job shown there as the max. The longest I see is actually a gpu job of yours back on 5/16, which was in the queue for 115 hours. Still not good, but I can't find the 351-hour one.

That means my script is probably wrong, but I'm trying to quantify the resource shortages for @juanperin as input to the ongoing discussions about sharing the added nodes.
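
(For reference, a rough sketch of this kind of wait-time analysis, assuming standard Torque accounting logs under /var/spool/torque/server_priv/accounting/ whose "S" (job start) records carry qtime= and start= epoch timestamps; the path and file glob are assumptions about this cluster, not details from the thread.)

```
# Sketch only: compute per-job queue wait from Torque accounting "S" records
# (fields are semicolon-separated: date;record_type;job_id;key=value list).
awk -F';' '$2 == "S" {
    n = split($4, kv, " ")
    qtime = 0; start = 0
    for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^qtime=/) { sub(/^qtime=/, "", kv[i]); qtime = kv[i] + 0 }
        if (kv[i] ~ /^start=/) { sub(/^start=/, "", kv[i]); start = kv[i] + 0 }
    }
    if (qtime > 0 && start >= qtime) {
        wait = (start - qtime) / 3600          # hours spent waiting in the queue
        total += wait; count++
        if (wait > max) { max = wait; maxjob = $3 }
    }
}
END {
    if (count) printf "jobs=%d  avg_wait=%.2fh  max_wait=%.2fh (%s)\n", count, total / count, max, maxjob
}' /var/spool/torque/server_priv/accounting/2016*
```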
