Skip postprocessing POST retries when the file does not exist on the pulsar machine #298
Comments
Looking into this
This was useful for me when I had filesystem problems on the Pulsar side where the filesystem did eventually come back, but I agree that it is far more nuisance than help in the overwhelming majority of cases. I'd typically just prefer to fail and rerun the job for the rare occurrence of filesystem problems rather than have this happen for legitimate job failures.
I've been working on a fix for this (for a while, in the background). The current behaviour creates a nasty UX issue for many of our users: when any job run on Pulsar fails, the user has to wait an hour to get the failure message back. With AlphaFold, this also means an hour of Azure GPU time wasted! Perhaps we can add an additional check for a failed status before aborting the retry. Either way, I plan to make this configurable.
The issue is (partly) that in most cases, Pulsar is not really the arbiter of what has failed. It simply dutifully copies things back to Galaxy and then lets Galaxy decide. That said, failing to copy back outputs (after that long delay) is one of the things that does result in Pulsar informing Galaxy that the job failed. As @cat-bro said, I think we're best off just not retrying when the file does not exist, or at least having a separate configurable option for that case: you could have NFS attribute caching issues that would make you want to retry a few times, but not as extensively as you might when posting the file to Galaxy fails.
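A minimal sketch of the "separate configurable" idea, assuming a small, dedicated retry budget for missing files that is independent of the budget for posting to Galaxy. The function name `wait_for_file` and the parameters `missing_file_retries` and `delay_seconds` are hypothetical and are not taken from Pulsar's code or configuration.

```python
import os
import time


def wait_for_file(path, missing_file_retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Return True as soon as `path` exists, False once the small budget is spent.

    Hypothetical helper: the names and defaults are illustrative only.
    The file is checked at most `missing_file_retries + 1` times, which is
    enough to ride out short-lived NFS attribute caching delays without
    stalling for long on a job that simply produced no output.
    """
    for attempt in range(missing_file_retries + 1):
        if os.path.exists(path):
            return True
        # Sleep only if another check is still to come.
        if attempt < missing_file_retries:
            sleep(delay_seconds)
    return False
```

Keeping this budget separate means the (possibly much larger) retry count used for posting to a busy or restarting Galaxy never applies to a file that is not there.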
Yep, we're in agreement on that. I was suggesting that we could also check for a job-failed status before aborting a missing-file retry loop. That would still allow NFS issues (etc.) to be resolved for a successful job. Do you think there's a way to do that consistently? Or is there no good way for Pulsar to determine that based on the job working directory?
Setting max_retries to retry posts of output files to Galaxy is extremely useful, since Galaxy is sometimes restarting or too busy to receive the post. The retries also occur when the file does not exist on Pulsar, and this is not useful, because if the file does not exist upon the completion of a job it will not exist X retries later. Most often the output files are missing because a job has failed. Depending on the settings and the number of expected outputs, a user might have to wait over an hour to find out that their job has failed. Nonexistence of expected output files could be handled by a separate check, prior to the retry loop, and retries skipped in this case.
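The proposal above could look roughly like the following sketch. It is not Pulsar's actual implementation: `post_output`, `OutputMissingError`, the `post` callable and `interval_seconds` are hypothetical names, and it assumes that `max_retries` counts retries after the first attempt and that a failed post surfaces as a `ConnectionError`.

```python
import os
import time


class OutputMissingError(Exception):
    """Hypothetical error: the expected output file is not on disk."""


def post_output(path, post, max_retries=5, interval_seconds=2.0, sleep=time.sleep):
    """Post `path` using the caller-supplied `post` callable, retrying on failure.

    Existence is checked once, before the retry loop, so a missing file
    fails immediately instead of consuming every retry.
    """
    if not os.path.exists(path):
        raise OutputMissingError(path)

    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return post(path)
        except ConnectionError as exc:
            # Assumed failure mode: Galaxy is restarting or too busy.
            last_error = exc
            if attempt < max_retries:
                sleep(interval_seconds)
    raise last_error
```

With this shape, a failed job whose outputs were never written reports back after a single filesystem check per expected output, while transient problems on the Galaxy side still get the full retry budget.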