
[Releaser] Implement a retry mechanism to survive random failures caused by GitHub #48

Open · kikmon opened this issue May 31, 2022 · 8 comments

kikmon commented May 31, 2022

I'm getting some random failures when publishing a package, and a retry fixes the issue.
This is causing a lot of noise in the pipeline, so would it be possible to add some retry policies to the releaser action?

The action call is very simple, pushing 3 small (2.5 MB) zip files.

Here's the kind of error I'm getting:
Post "https://uploads.github.com/repos/kikmon/huc/releases/67601582/assets?label=&name=huc-2022-05-31-Darwin.zip": http2: client connection force closed via ClientConn.Close
Traceback (most recent call last):
File "/releaser.py", line 187, in
check_call(cmd, env=env)
File "/usr/local/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)

Paebbels (Member)

@kikmon the link doesn't work. Can you post your pipeline/job-log link?

The infrastructure of GitHub is not very stable. From time to time we see a lot of issues with their network.

I'll check with @umarcor how to solve this problem.

Paebbels added the Releaser Action 'releaser' label on May 31, 2022
Paebbels (Member)

One question, just to be specific: is this about releasing to GitHub Releases or to PyPI?

kikmon commented May 31, 2022

Sorry about the missing log link. Here it is :)
https://github.com/kikmon/huc/runs/6663199807?check_suite_focus=true

It is just a simple GitHub release, no PyPI involved here.

Retries at the YAML level don't seem to be natively supported, so it would be really appreciated if the releaser action could be more resilient to infrastructure glitches :)

kikmon commented Jun 1, 2022

And I just got another error:
Post "https://uploads.github.com/repos/kikmon/huc/releases/68184973/assets?label=&name=huc-2022-06-01-Darwin.zip": read tcp 172.17.0.2:46068->140.82.113.14:443: read: connection reset by peer
Traceback (most recent call last):
It seems weird to have these errors happen so frequently.
Could it be related to adding multiple zip files to the release?

umarcor commented Jun 28, 2022

@kikmon, @epsilon-0, this is an annoying issue that has been bugging us since this Action was created. At first, we used the GitHub API through PyGithub. It failed very frequently. Then, we changed to using the GitHub CLI (459faf8). That reduced the frequency of failures, but they are still common. I believe it's because of the stability/reliability of the free infrastructure provided by GitHub. I find that small files rarely fail, but larger ones, which need to keep the connection alive for longer, are tricky.

A few months ago, GitHub added the feature to restart individual jobs in CI runs. Hence, the strategy I've been following is to have all the "assets" uploaded as artifacts, and then have a last job in the workflow which just picks them up and uploads them to the release through the releaser. When a failure occurs, only that job needs to be restarted.

Nonetheless, I of course want to improve the reliability of the releaser Action. I think a retry won't always work. Precisely because of the feature I explained in the previous paragraph, I do manually restart the CI in https://github.com/ghdl/ghdl. Sometimes it works, but rather frequently it is not effective: the infrastructure is unreliable for some minutes/hours and I need to wait until later, or until the next day, to restart. As a result, when implementing a retry strategy, we should consider that retrying multiple times within a few minutes might be worthless. Instead, large wait times should be used. That can be complex, because workflows might be running close to the 6 h limit, so there might not be time to wait until the API is stable again. We can either:

  • Allow users to provide the sequence of wait times through an option/input (see the sketch after this list),
  • and/or encourage a strategy based on using a sibling job for the releaser, which can be triggered.
    • Yet, I'm not sure whether the default token is able to trigger other workflows. When I last used/implemented this, 1-2 years ago, a PAT was required.
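
A minimal sketch (not the actual releaser implementation) of how a configurable wait-time sequence could wrap the upload call, assuming the releaser keeps going through subprocess.check_call as in the traceback above; the function name and the default wait values are invented for illustration:

import subprocess
import time

# Hypothetical defaults: wait 30 s, 2 min and 10 min between attempts.
DEFAULT_WAITS = [30, 120, 600]

def check_call_with_retries(cmd, env=None, waits=DEFAULT_WAITS):
    # The first attempt runs immediately (wait 0); each retry sleeps first.
    for attempt, wait in enumerate([0] + list(waits)):
        if wait:
            print(f"Upload failed, retrying in {wait} s (attempt {attempt + 1})...")
            time.sleep(wait)
        try:
            subprocess.check_call(cmd, env=env)
            return
        except subprocess.CalledProcessError:
            # Re-raise on the last attempt so the job still fails visibly.
            if attempt == len(waits):
                raise

Such a wrapper could then be dropped in where releaser.py currently calls check_call(cmd, env=env) (line 187 in the traceback above); whether the waits come from a fixed default or from a user-provided input is exactly the open question in the first bullet.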

kikmon commented Jul 7, 2022

Thanks for the explanations.
I think retrying a few times would be better than no retry at all, without going all the way up to the 6-hour limit :)
I'd be curious to see if it really helps.
Manually babysitting a flow is a bit annoying when trying to automate a pipeline. :)
There are many retry strategies that could be used, but what about exposing a few simple options, like the number of retries or the maximum amount of time to wait before failing for real? (A rough sketch of such options follows at the end of this comment.)
As for my case, the releaser part of the pipeline is already doing that:
it only fetches the artifacts from the previous jobs and then calls the releaser.
I've been exploring the wretry action, but it doesn't play nicely with the Releaser action syntax.
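
As a rough illustration of those "few simple options", here is a purely hypothetical sketch; the inputs retries and max-wait do not exist in the releaser action today, and it assumes the action forwards its inputs to the script as environment variables (Docker/JavaScript actions get INPUT_* variables automatically, a composite action would have to forward them explicitly via env:):

import os

def waits_from_inputs():
    # Hypothetical inputs, forwarded by the action as environment variables.
    retries = int(os.environ.get("INPUT_RETRIES", "0"))
    max_wait = int(os.environ.get("INPUT_MAX_WAIT", "600"))  # seconds
    # Exponential backoff: 30 s, 60 s, 120 s, ... capped at max_wait.
    return [min(30 * 2 ** i, max_wait) for i in range(retries)]

The resulting list could feed a retry wrapper like the one sketched in the comment above, so retries: 3 with max-wait: 600 would translate into waits of 30 s, 60 s and 120 s.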

Samrose-Ahmed

Seeing Post "https://uploads.github.com/repos/matanolabs/matano/releases/75878670/assets?label=&name=matano-macos-x64.sh": unexpected EOF in the action.

Example of failed build.

Paebbels (Member)

I'm open to accepting pull requests.

Please also see #82.
