Yank processes not cleanly exiting #415
Comments
Why do you feel the GPUs are not in use? |
Because I checked for the running gpu jobs and found very few. |
How are you checking? As I do not see that. Removing username.
Double checking but that sure seems like a fair chunk. |
The program is called "yank" and I show it on lots of GPUs. |
Oddly enough, there are some GPUs that are running jobs from two different people, which shouldn't be happening:
|
Alright, I misinterpreted that and thought these were single jobs/gpus. Thank you, Paul. I will probably need to deal with this diplomatically ;) |
I show
It looks like this is related to #409 (comment) |
John, do you think it would be possible to keep 5-10 gpus usable or unoccupied by yank until Friday (NIPS deadline)? I really will not be needing any more, but just having 3, as is the case now, is a bit tight. |
I show processes by luirink on gpu-2-5. If the code is not exiting cleanly it will need to be fixed.
|
But basically @karalets the statement in this Git that "the cluster seems to prefer to let the gpus gather dust over giving them to me" is incorrect, and I'm still waiting to hear how you determined that so we can resolve the support you actually requested. I show yank consuming all the GPUs. |
I see your note @karalets. Ignore. If you want my method of parsing the qstat output, I can document it. |
@MehtapIsik: Can you halt one or more of your |
Paul @tatarsky, I must have made a mistake: as I said earlier, I was just counting the number of jobs, not the number of gpus each of these jobs was using. So please put my comments down to my ignorance of the cluster diagnostics. |
Noted your Git response already. Just an ordering issue. I am happy to share my method of determining this. |
I was doing 'qstat | grep gpu' which obviously was suboptimal. |
Yeah that doesn't work. |
My output was from a cheap quick way. There are others.
|
I'll look into why. @tatarsky: Have you seen threads like this that note that this might be due to the way in which
Can check the old Relion notes, in which I dimly recall something similar. |
The hydra
|
(@karalets: I'm in Boston, but @MehtapIsik is just upstairs in Z11 if you want to go find her in person!) |
There is probably a better way, but that's my cheap way. The spaces around "gpu" keep the grep from also matching host names, which contain gpu as well. |
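As an illustration of that filtering idea (a sketch, not @tatarsky's actual command): the snippet below checks qstat's Queue column rather than grepping anywhere in the line, so host names like gpu-2-5 cannot produce false matches. It assumes Torque's default six-column qstat output and a queue literally named "gpu", neither of which is confirmed in this thread.

```python
# Sketch: count jobs in the GPU queue by inspecting the Queue column of
# `qstat` output instead of grepping the whole line. Assumes the default
# Torque layout: Job ID, Name, User, Time Use, S, Queue.
import subprocess

def jobs_in_queue(queue_name="gpu"):
    """Return the qstat lines whose Queue field matches queue_name exactly."""
    out = subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout
    matches = []
    for line in out.splitlines():
        fields = line.split()
        # Skip headers/separators; the queue name is the last column.
        if len(fields) >= 6 and fields[-1] == queue_name:
            matches.append(line)
    return matches

if __name__ == "__main__":
    gpu_jobs = jobs_in_queue()
    print(f"{len(gpu_jobs)} running or queued jobs in the gpu queue")
```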
There's also this:
|
This page suggests:
I'm still investigating whether this requires we specify a specific flag to |
My Relion memory may not be relevant. That issue involved a wrapper script that was not properly providing mpirun with the nodefile, or was blocking its ability to get the one provided by the Torque environment. At least that is my brief review of the old issue. The MPI "tm" interface was indeed discussed. But I don't recall a flag for better exiting. Git #329 has all sorts of stuff in it. Probably of no relevance. |
Do you wish all luirink node processes killed off before I scrub the name from the Git page? Or does it help to leave some to debug why they are not dead? |
I'm changing the Git name so I can focus on that part. I will mention there is a Tesla sitting idle in |
Hmm. I may need to adjust that system's gpu queue oversubscription rule though. Let me check on that. |
Also there are a fair number of free gpus at the moment. |
Ah but slots/ram are a bit short. Let me see if I can tweak something there. |
I added the node property. I do not know enough about user reservations to try, on a Friday, to guarantee you get priority on those nodes. Perhaps next week. |
Hmm. And it may not be working as I expect. So if it doesn't seem to do what you need I'll have to revisit it. |
There are, however, now several free gpus on the cluster, but I don't know how long that will last. |
Thanks! Will debug over the weekend! Traveling back from Boston tonight. @karalets: This is a serious issue and needs to be addressed ASAP! Thanks for pointing it out! |
Starting to debug now. Thanks again, @tatarsky! @karalets: I notice that your job
This may be intended, so no worries if so, but it seems like it may be a sign something is wrong! |
Looks like the 'gtxdebug' and I've tried to replicate this problem on the GTX-680s, but am unable to. The jobs all terminate cleanly when Torque kills the master process. |
Seeing if I can spot some other titans to apply the tag. |
@tatarsky: If sometime this week you have an idea about how to reserve a node of GTX-TITANs for us to use to debug (via Torque), I can sit down with @MehtapIsik and interactively try to see if there are any issues with this specifically, or with her environment, that may be causing this problem. In the meantime, I am unable to reproduce, and can't debug further until GTX-TITAN nodes are free. @karalets: Will try debugging again tomorrow in case some nodes are free. |
BTW the tag is |
No, I'm wrong. He's got jobs on lots of the |
Well, also, as you note, he seems to be requesting four gpus, but nvidia-smi shows only 1 gpu in use, at least in a spot check of gpu-2-10.
And unless I'm reading this wrong:
|
@karalets has reserved all four GPUs on those nodes and locked all of the GPUs in thread-exclusive mode, even though he is only using one:
I'm guessing this is unintentional. @karalets: Might want to check what's going on with your code. |
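For context, here is a rough sketch of how one could spot that situation from a node. It assumes nvidia-smi's --query-gpu interface; the function name and the idea of flagging non-default-compute-mode GPUs that report zero utilization are illustrative, not taken from this thread.

```python
# Sketch: flag GPUs that are locked in an exclusive (non-Default) compute
# mode but report 0% utilization, which can indicate a process holding
# devices it is not actually using.
import subprocess

def idle_exclusive_gpus():
    """Return (index, compute_mode) for non-Default GPUs showing 0% utilization."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,compute_mode,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    flagged = []
    for line in out.strip().splitlines():
        index, mode, util = [field.strip() for field in line.split(",")]
        if mode != "Default" and int(util) == 0:
            flagged.append((index, mode))
    return flagged

if __name__ == "__main__":
    for index, mode in idle_exclusive_gpus():
        print(f"GPU {index}: {mode} compute mode but 0% utilization")
```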
Yep. Seeing the same. |
In the meantime, thanks for the weekend help, @tatarsky, and let's connect up during the week to further debug if needed! |
Fair enough. |
I've found a free GTX-TITAN-X node ( |
I have opened up some of them after reading this. |
Thanks! @karalets: Was the use of 1/4 GPUs expected? |
I sometimes reserve a bunch and use a variable amount when I am trying out new code. |
Ah, OK! Thanks for the clarification! |
I can reproduce this on |
I've tried this dozens of times, but I can't seem to consistently reproduce this problem. It happened once, but I don't seem to be able to get it to happen again. |
I suspect it mostly happens when queue jobs time out. |
I'm currently trying to harden the YANK code with an explicit call to
@tatarsky: Do you know what signal Torque sends when killing jobs that hit their resource limits? The following dump to
Are there other signals Torque might send that I should worry about too? I think it might send a |
I believe SIGTERM is correct. And it does indeed send SIGKILL after some period, which I dimly recall is also tunable but defaults to 60 seconds. There is also a command called
I believe it also logs when it sends the signal with something like this:
|
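One way to answer the which-signal question empirically (a throwaway sketch, not something used in this thread) is to submit a short-walltime job that simply logs whichever catchable signal it receives before the follow-up SIGKILL lands:

```python
# signal_probe.py -- sketch for checking which signal the batch system
# delivers at the walltime limit. Submit with a short walltime; SIGKILL
# cannot be caught, but SIGTERM/SIGINT/SIGUSR1/SIGUSR2 will be logged.
import signal
import sys
import time

def report(signum, frame):
    # signal.Signals(signum).name gives e.g. "SIGTERM" (Python 3.5+).
    print(f"received {signal.Signals(signum).name}", flush=True)
    sys.exit(0)

# Register the catchable candidates.
for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1, signal.SIGUSR2):
    signal.signal(sig, report)

# Outlive the requested walltime so the scheduler has to kill the job.
while True:
    time.sleep(60)
```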
"I suspect it mostly happens when queue jobs time out." I assume this means when the jobs hit a walltime limit and are killed. Is there a major problem in simply stating the walltime higher and letting the jobs complete without such an event? Or is the walltime limit being used to control the usage of the job. Its a subtle point, but why not set the walltime to a value that better matches the needs of the job. |
The actual jobs may take many days, but the walltime limit is being used to break the jobs into more queue-neighbor-friendly chunks. So it is a significant problem if our code doesn't cleanly exit when requested to do so!
Still tinkering with MPI.Abort() calls...
|
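A minimal sketch of the kind of hardening being described, assuming mpi4py is in use (YANK's actual implementation may differ): catch SIGTERM, do quick cleanup, and call Abort on the world communicator so every rank exits before the follow-up SIGKILL.

```python
# Sketch (not YANK's actual code): exit cleanly when the scheduler sends
# SIGTERM at the walltime limit, so no orphaned ranks linger on the GPUs
# until SIGKILL arrives.
import signal
import sys

from mpi4py import MPI

comm = MPI.COMM_WORLD

def terminate(signum, frame):
    # Keep this short: the scheduler follows up with SIGKILL after a delay
    # (roughly 60 seconds by default, per the comment above).
    sys.stderr.write(f"rank {comm.rank}: caught SIGTERM, aborting MPI job\n")
    sys.stderr.flush()
    # Abort() tears down every rank in the communicator, not just this one.
    comm.Abort(1)

signal.signal(signal.SIGTERM, terminate)

# ... the main simulation loop would run here ...
```

A more cautious variant would only set a flag in the handler and call Abort from the main loop, since making MPI calls directly inside a signal handler is not guaranteed to be safe.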
Hi,
I am trying to get some gpus in interactive mode and I am really having a hard time getting them. Normally this would be fine and I'd wait until the cluster clears up, but... the gpus are simply not really in use, so I am not competing against anybody to get them.
As such, the cluster seems to prefer to let the gpus gather dust over giving them to me. Shocker!
Is there any explanation for this? Or, to be actionable: can I do something to change that?
Best and thanks,
Theo