Clean the job directory when new jobs are received #284

natefoo · 2021-09-28T17:26:05Z

I sometimes have to requeue jobs in Galaxy that finished remotely but were not finished properly on the Galaxy side. This is a problem if the job directory still exists on the Pulsar side and the job is sent to the same Pulsar it ran on previously. Pulsar attempts to resume staging in the input files but fails:

Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: 2021-09-28 17:22:41,228 INFO  [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=37992802]] Failed to execute action[Staging input 'dataset_61602712.dat' via FileAction[path=/galaxy-repl/main/files/061/602/dataset_61602712.dat,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5c6ea3db5fc7ab0c5&job_key=c0ffee&path=/galaxy-repl/main/files/061/602/dataset_61602712.dat&file_type=input] to /jetstream/scratch0/main/jobs/37992802/inputs/dataset_61602712.dat], retrying in 6.0 seconds.
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: Traceback (most recent call last):
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: return fun(*args, **kwargs)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/staging/pre.py", line 19, in <lambda>
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/client/action_mapper.py", line 465, in write_to_path
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: get_file(self.url, path)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/client/transport/curl.py", line 93, in get_file
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: c.perform()
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: pycurl.error: (33, "HTTP server doesn't seem to support byte ranges. Cannot resume.")

This may also be a more general problem: Pulsar does not know the file length and attempts to fetch past the end of the file. In other words, Pulsar should remove an existing job directory when it receives a new setup message for that job, and (as a separate issue) it should not attempt to resume past the file size when staging in.
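To make the two proposals concrete, here is a minimal sketch of what each could look like. It is not Pulsar's actual code: `reset_job_directory`, `resume_offset`, the `inputs` subdirectory layout, and the idea that the remote size is known ahead of time (for example from a `Content-Length` header) are all assumptions made for illustration.

```python
import os
import shutil
from typing import Optional


def reset_job_directory(staging_root: str, job_id: str) -> str:
    """Hypothetical helper: discard any leftover directory for job_id, then recreate it."""
    job_dir = os.path.join(staging_root, job_id)
    # A leftover directory means the job was set up here before, so drop
    # whatever was staged previously instead of trying to resume it.
    shutil.rmtree(job_dir, ignore_errors=True)
    os.makedirs(os.path.join(job_dir, "inputs"))
    return job_dir


def resume_offset(local_path: str, remote_size: Optional[int]) -> Optional[int]:
    """Hypothetical helper: choose where a transfer should start.

    Returns None if the local file already looks complete, 0 to start
    from scratch, or a positive byte offset to resume from.
    """
    if not os.path.exists(local_path):
        return 0
    if remote_size is None:
        # Unknown length: restart rather than risk a range past the end.
        return 0
    local_size = os.path.getsize(local_path)
    if local_size == remote_size:
        return None  # nothing left to fetch
    if local_size > remote_size:
        return 0  # local copy cannot be a prefix of the remote file
    return local_size
```

With something like the first helper called on every setup message, the resume path would never see stale files. The second helper covers the separate issue: a transfer would only ask for a byte range when the local file is strictly shorter than the remote one.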

gmauro · 2021-09-29T06:19:24Z

I have a cron job that deletes both successful and unsuccessful job directories, but I agree that a more structured approach is needed.

natefoo · 2021-09-29T18:11:31Z

Yeah, I have a cron job running tmpwatch for this, which is needed regardless.
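For readers without tmpwatch, the periodic cleanup mentioned here could be approximated with a short script run from cron. This is only a sketch: `prune_stale_job_dirs` is a hypothetical name, and it uses the modification time of each top-level job directory, which is simpler than what tmpwatch does (tmpwatch inspects individual files and uses access time by default).

```python
import os
import shutil
import time
from typing import List, Optional


def prune_stale_job_dirs(staging_root: str, max_age_seconds: float,
                         now: Optional[float] = None) -> List[str]:
    """Hypothetical helper: remove job directories not modified within max_age_seconds."""
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not disturb iteration.
    for entry in list(os.scandir(staging_root)):
        if not entry.is_dir(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            shutil.rmtree(entry.path, ignore_errors=True)
            removed.append(entry.path)
    return sorted(removed)
```

A cleanup like this is still needed even if Pulsar resets the job directory on setup, since directories for jobs that are never requeued would otherwise accumulate.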