Average queue wait time is now over 10 hours! #405

Open
jchodera opened this issue Apr 29, 2016 · 15 comments

@jchodera (Member)

This is getting to be pretty long:

```
[chodera@mskcc-ln1 ~/scripts]$ showstats

moab active for   14:00:12:17  stats initialized on Tue Mar  8 12:16:45 2016

Eligible/Idle Jobs:              2102/2102   (100.000%)
Active Jobs:                      673
Successful/Completed Jobs:     235236/235236 (100.000%)
Avg/Max QTime (Hours):          10.82/351.68
Avg/Max XFactor:                 0.13/704.36

Dedicated/Total ProcHours:      1.34M/3.97M  (33.660%)

Current Active/Total Procs:      1706/3344   (51.017%)

Avg WallClock Accuracy:          10.335%
Avg Job Proc Efficiency:         66.654%
Est/Avg Backlog:                19:37:00/1:11:55:15
```
@tatarsky (Contributor)

Noted. I am preparing a queue modification that would allow the nodes purchased by the Fuchs group to act as batch and gpu nodes when idle. However, there is a detail involving the unlimited wall time of the batch queue that I want to propose a change for, and that proposal is still being reviewed.

One item we could start now, if you have a moment: if I offline one of those nodes (as it's idle), can you validate that your code works properly on it via manual SSH?
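
(For reference, a minimal sketch of the offline/restore step on a Torque/Moab cluster, assuming Torque's pbsnodes is what manages node state here; the specific node, gg06, is named in the next comment.)

```
# Sketch only, assuming Torque's pbsnodes manages node state on this cluster.
pbsnodes -o gg06      # mark the node offline so the scheduler stops placing new jobs on it
pbsnodes -l           # list down/offline nodes to confirm the state change
# ...manual SSH testing on the node happens here...
pbsnodes -c gg06      # clear the offline flag and return the node to service
```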

@jchodera (Member, Author)

I could do that now!

@tatarsky (Contributor)

OK. Please ssh manually to gg06. It has two GTX Titans.

@jchodera (Member, Author)

Looks like you have the wrong CUDA version set as the default:

```
[chodera@mskcc-ln1 ~]$ which nvcc
/usr/local/cuda-7.5//bin/nvcc
[chodera@mskcc-ln1 ~]$ ssh gg06
Last login: Fri Apr 29 11:27:20 2016 from mskcc-ln1.fast
[chodera@gg06 ~]$ which nvcc
nvcc: Command not found.
[chodera@gg06 ~]$ ls -ltr /usr/local/cuda
lrwxrwxrwx 1 root root 19 Mar 10 15:13 /usr/local/cuda -> /usr/local/cuda-7.0
```
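
(For reference, a minimal sketch of a manual check and fix, assuming CUDA 7.5 is installed under /usr/local/cuda-7.5 on these nodes; as the follow-up comments note, the symlink on this cluster is managed by Puppet rules, so a hand edit would only be a temporary workaround.)

```
# Sketch only: verify and repoint the default CUDA toolkit symlink.
ls -l /usr/local/cuda                              # currently -> /usr/local/cuda-7.0 on gg06
sudo ln -sfn /usr/local/cuda-7.5 /usr/local/cuda   # repoint the default toolkit (needs root)
export PATH=/usr/local/cuda/bin:$PATH              # pick up nvcc from the default location
nvcc --version                                     # should now report release 7.5
```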

@tatarsky (Contributor)

One second; that is correct.

@tatarsky (Contributor)

OK. Fixed some rules. Try again.

@tatarsky (Contributor)

Hmm. I'm actually seeing a regression somewhere with the default /usr/local/cuda symlink. I'm checking into it now.

@tatarsky (Contributor)

OK. I believe the symlink is correct everywhere now. I noted a few nodes were still pointing at 7.0 and I'm not sure why; I am investigating.

@tatarsky (Contributor)

Found the rule issue, and I believe it is now fixed correctly. I will double-check after the next Puppet run, but please continue to test on gg06 as desired. When the other groups' concerns about unlimited batch walltime on these nodes have been reviewed, I will update everyone via a separate GitHub issue.
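
(For reference, the kind of Puppet rule that typically manages such a symlink is sketched below; this is a hypothetical illustration, not the actual manifest used on this cluster.)

```
# Hypothetical sketch of a Puppet file resource managing the default CUDA symlink.
file { '/usr/local/cuda':
  ensure => link,
  target => '/usr/local/cuda-7.5',
}
```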

@jchodera (Member, Author)

Seems to work now. Thanks!

@tatarsky (Contributor)

OK. I will get an update on the decision about letting these nodes handle overflow. Thank you for testing. I will likely announce a general "batch" test on this node as well.

@tatarsky (Contributor)

Please note gg06 is back in the queue. I believe you got the data we need to proceed with the process, but the owning group has an important deadline.

@tatarsky (Contributor)

So I've been watching this, and while we still don't have full agreement on how to share the added nodes, I am paying attention to the average QTime.

It's currently down to:

```
Avg/Max QTime (Hours):           6.34/351.68
```

Work continues on the policies/config to give the groups that purchased additional nodes appropriate access to them. For now I'm making use of standing reservations when a deadline is upon the group in question.

I am, however, leaving this open until I get a firmer final statement on some of those sharing policies.
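
(For reference, a standing reservation of the kind described above is typically declared in moab.cfg; the sketch below is hypothetical, with placeholder reservation, group, and host names rather than this cluster's actual configuration.)

```
# Hypothetical moab.cfg sketch: hold the purchasing group's node for that group
# while a deadline is active (all names here are placeholders).
SRCFG[fuchs_deadline] HOSTLIST=gg06
SRCFG[fuchs_deadline] GROUPLIST=fuchslab
SRCFG[fuchs_deadline] PERIOD=INFINITY
```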

@jchodera (Member, Author)

I feel sorry for the poor sap who was waiting 351.68 hours (15 days) for their jobs to start...

@tatarsky (Contributor)

So I wrote a script to analyze the job logs for that wait time, and I cannot locate the job shown there as the max. The longest I see is actually a gpu job of yours back on 5/16, which was in the queue for 115 hours. Still not good, but I can't find the 351-hour one.

That means my script is probably wrong, but I'm trying to quantify the resource shortages for @juanperin as input to the ongoing discussions about sharing the added nodes.
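
(For reference, a rough sketch of this kind of wait-time analysis, assuming standard Torque accounting logs under /var/spool/torque/server_priv/accounting/ whose "S" (job start) records carry qtime= and start= epoch timestamps; the path and file glob are assumptions about this cluster, not details from the thread.)

```
# Sketch only: compute per-job queue wait from Torque accounting "S" records
# (fields are semicolon-separated: date;record_type;job_id;key=value list).
awk -F';' '$2 == "S" {
    n = split($4, kv, " ")
    qtime = 0; start = 0
    for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^qtime=/) { sub(/^qtime=/, "", kv[i]); qtime = kv[i] + 0 }
        if (kv[i] ~ /^start=/) { sub(/^start=/, "", kv[i]); start = kv[i] + 0 }
    }
    if (qtime > 0 && start >= qtime) {
        wait = (start - qtime) / 3600          # hours spent waiting in the queue
        total += wait; count++
        if (wait > max) { max = wait; maxjob = $3 }
    }
}
END {
    if (count) printf "jobs=%d  avg_wait=%.2fh  max_wait=%.2fh (%s)\n", count, total / count, max, maxjob
}' /var/spool/torque/server_priv/accounting/2016*
```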
