Kubernetes Coexecution jobs are only removed under some circumstances #286

natefoo · 2021-10-08T20:57:04Z

Jobs are removed in cases such as:

When the user stops/deletes the job output(s)
When the job ends in error (because the Pulsar runner's fail_job() calls stop_job())

Circumstances where the job is not removed:

When the job finishes normally
When the job hits its (k8s) walltime and is killed by k8s

This last one is a source of job "loss" (the job is stuck in a non-terminal state) because Pulsar will never send a terminal status update. The runner should probably poll (as in galaxyproject/galaxy#9911) for this case.
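A rough sketch of the check such a poll could make, not the runner's actual code. It assumes the k8s walltime is enforced through the Job's `activeDeadlineSeconds`, in which case Kubernetes marks the Job with a `Failed` condition whose reason is `DeadlineExceeded`. The function name `terminal_state` and its return values are hypothetical; the input is the `status` section of a `batch/v1` Job as a plain dict (for example from `kubectl get job -o json`).

```python
from typing import Optional


def terminal_state(job_status: dict) -> Optional[str]:
    """Classify a batch/v1 Job status dict (hypothetical helper).

    Returns "complete", "walltime", "failed", or None if the Job has
    not reached a terminal condition yet.
    """
    # "conditions" may be absent or null while the Job is still running.
    for cond in job_status.get("conditions") or []:
        # Only conditions whose status is the string "True" are in effect.
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "complete"
        if cond.get("type") == "Failed":
            # Assumption: walltime is enforced via activeDeadlineSeconds,
            # which Kubernetes reports with reason "DeadlineExceeded".
            if cond.get("reason") == "DeadlineExceeded":
                return "walltime"
            return "failed"
    return None
```

A poller could call this on each cycle and, on any non-None result, mark the job terminal itself rather than wait for a message from Pulsar that will never arrive.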

The quickest and easiest (and IMO correct) solution would be to set the TTL in the template as described in the docs. But it would also be a good idea to call MessageCoexecutionPodJobClient.kill() for all jobs when their terminal message is received.
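For illustration, a minimal sketch of the TTL approach, assuming the job template is available as a plain dict in `batch/v1` Job form. `ttlSecondsAfterFinished` is the standard Kubernetes Job spec field; the helper name `with_ttl` and the example manifest are hypothetical and not taken from Pulsar.

```python
import copy


def with_ttl(job_manifest: dict, ttl_secs: int) -> dict:
    """Return a copy of a Job manifest with a finished-job TTL set.

    Hypothetical helper: Kubernetes deletes the Job (and its pods)
    ttl_secs seconds after it finishes, whether it succeeded or failed.
    """
    if ttl_secs < 0:
        raise ValueError("ttl_secs must be non-negative")
    # Deep copy so the caller's template is not mutated.
    manifest = copy.deepcopy(job_manifest)
    manifest.setdefault("spec", {})["ttlSecondsAfterFinished"] = ttl_secs
    return manifest


# Example manifest (illustrative only).
template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {"template": {"spec": {"restartPolicy": "Never"}}},
}
job = with_ttl(template, 300)
```

Because the TTL is handled by the cluster, it also covers jobs that finish normally or are killed at their walltime, the two cases listed above where nothing currently removes the job.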

natefoo mentioned this issue Oct 8, 2021

Support the k8s_job_ttl_secs_after_finished option as in the Galaxy Kubernetes runner #287

Merged
