Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfer retries fix #329

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

neoformit
Copy link

Potentially resolves #298.
Tested manually on one of our dev pulsar nodes by @cat-bro.

@neoformit neoformit force-pushed the transfer-retries-fix branch from 492229d to f549bf9 Compare July 16, 2023 03:19
@neoformit
Copy link
Author

Rebased against master

@natefoo
Copy link
Member

natefoo commented Aug 7, 2023

Looks good to me, thanks! I suspect this would not result in a "green-but-actually-failed" job since Galaxy will still expect the output to exist in its own finish method, did you happen to include that in testing?

@neoformit
Copy link
Author

Haven't tested that Nate, @cat-bro is the only one who got this running on an actual Pulsar server.

Unfortunately, I can't figure out why the tests are failing. Apparently this change causes the output file transfer test to fail, which I guess must mean that the source file path doesn't exist in the test case, which doesn't make any sense to me... I spent some time looking at the tests and couldn't figure it out or debug it.

@mvdbeek I don't suppose you'd be able to take a look at this? I assume the explanation lies in https://github.com/galaxyproject/pulsar/blob/master/pulsar/client/test/check.py. Maybe there is a test that checks whether a transfer will retry when the source file is missing on the first attempt...?

@mvdbeek
Copy link
Member

mvdbeek commented Aug 8, 2023

I haven't forgotten about this, but I didn't think this was quite the right layer to fix this at. You'd probably want to parameterize this on the job state.

@neoformit
Copy link
Author

Sorry @mvdbeek I'm not sure what you mean. Only skip retrying if the job has failed and the output file doesn't exist? Isn't job state decided by Galaxy and not Pulsar anyway? 🤔

@mvdbeek
Copy link
Member

mvdbeek commented Aug 9, 2023

Isn't job state decided by Galaxy and not Pulsar anyway?

Pulsar knows the exit code, but you're right, we don't really know when that's a failure or not. I'll try getting back to it.

@neoformit
Copy link
Author

Thanks! I'm interested to see what you find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip postprocessing POST retries when the file does not exist on the pulsar machine
3 participants