
[Releaser] Implement a retry mechanism to survive random failures caused by GitHub #48

Open · kikmon opened this issue May 31, 2022 · 8 comments

kikmon commented May 31, 2022

I'm getting some random failures when publishing a package, and a retry fixes the issue.
This is causing a lot of noise in the pipeline, so would it be possible to add some retry policies to the releaser action?

The action call is very simple, pushing 3 small (2.5 MB) zip files.

Here's the kind of error I'm getting:
Post "https://uploads.github.com/repos/kikmon/huc/releases/67601582/assets?label=&name=huc-2022-05-31-Darwin.zip": http2: client connection force closed via ClientConn.Close
Traceback (most recent call last):
File "/releaser.py", line 187, in
check_call(cmd, env=env)
File "/usr/local/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)

Paebbels (Member)

@kikmon the link doesn't work. Can you post your pipeline/job-log link?

The infrastructure of GitHub is not very stable. From time to time we see a lot of issues with their network.

I'll check with @umarcor how to solve this problem.

Paebbels added the Releaser Action 'releaser' label on May 31, 2022
Paebbels (Member)

One question, just to be specific: is this about releasing to GitHub Releases or to PyPI?

kikmon commented May 31, 2022

Sorry about the missing log link. Here it is :)
https://github.com/kikmon/huc/runs/6663199807?check_suite_focus=true

It is just a simple GitHub release, no PyPI involved here.

Retries at the YAML level don't seem to be natively supported, so it would be really appreciated if the releaser action could be more resilient to infrastructure glitches :)

kikmon commented Jun 1, 2022

And I just got another error:
Post "https://uploads.github.com/repos/kikmon/huc/releases/68184973/assets?label=&name=huc-2022-06-01-Darwin.zip": read tcp 172.17.0.2:46068->140.82.113.14:443: read: connection reset by peer
Traceback (most recent call last):
It seems weird to have these errors happen so frequently.
Could it be related to adding multiple zip files to the release?

umarcor commented Jun 28, 2022

@kikmon, @epsilon-0, this is an annoying issue that has been bugging us since this Action was created. At first, we used the GitHub API through PyGithub. It failed very frequently. Then, we changed to using the GitHub CLI (459faf8). That reduced the frequency of failures, but they are still common. I believe it's because of the stability/reliability of the free infrastructure provided by GitHub. I find that small files rarely fail, but larger ones, which need to keep the connection alive for longer, are tricky.

A few months ago, GitHub added the feature to restart individual jobs in CI runs. Hence, the strategy I've been following is to have all the "assets" uploaded as artifacts, and then have a last job in the workflow which just picks them up and uploads them to the release through the releaser. When a failure occurs, only that job needs to be restarted.

Nonetheless, I of course want to improve the reliability of the releaser Action. I think a retry won't always work. Precisely because of the feature I explained in the previous paragraph, I do manually restart the CI in https://github.com/ghdl/ghdl. Sometimes it works, but rather frequently it is not effective: the infrastructure is unreliable for some minutes/hours and I need to wait until later, or until the next day, to restart. As a result, when implementing a retry strategy, we should consider that retrying multiple times within a few minutes might be worthless. Instead, large wait times should be used. That can be complex, because workflows might be running close to the 6 h limit, so there might not be time to wait until the API is stable again. We can either:

  • Allow users to provide the sequence of wait times through an option/input (see the sketch after this list),
  • and/or encourage a strategy based on using a sibling job for the releaser, which can be triggered.
    • Yet, I'm not sure whether the default token is able to trigger other workflows. When I last used/implemented this, 1-2 years ago, a PAT was required.
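
A minimal sketch (not the actual releaser implementation) of how a configurable wait-time sequence could wrap the upload call, assuming the releaser keeps going through subprocess.check_call as in the traceback above; the function name and the default wait values are invented for illustration:

import subprocess
import time

# Hypothetical defaults: wait 30 s, 2 min and 10 min between attempts.
DEFAULT_WAITS = [30, 120, 600]

def check_call_with_retries(cmd, env=None, waits=DEFAULT_WAITS):
    # The first attempt runs immediately (wait 0); each retry sleeps first.
    for attempt, wait in enumerate([0] + list(waits)):
        if wait:
            print(f"Upload failed, retrying in {wait} s (attempt {attempt + 1})...")
            time.sleep(wait)
        try:
            subprocess.check_call(cmd, env=env)
            return
        except subprocess.CalledProcessError:
            # Re-raise on the last attempt so the job still fails visibly.
            if attempt == len(waits):
                raise

Such a wrapper could then be dropped in where releaser.py currently calls check_call(cmd, env=env) (line 187 in the traceback above); whether the waits come from a fixed default or from a user-provided input is exactly the open question in the first bullet.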

kikmon commented Jul 7, 2022

Thanks for the explanations.
I think retrying a few times would be better than no retry at all, without going all the way up to the 6-hour limit :)
I'd be curious to see if it really helps.
Manually babysitting a flow is a bit annoying when trying to automate a pipeline. :)
There are many retry strategies that could be used, but what about exposing a few simple options, like the number of retries or the maximum amount of time to wait before failing for real? (A rough sketch of such options follows at the end of this comment.)
As for my case, the releaser part of the pipeline is already doing that:
it only fetches the artifacts from the previous jobs and then calls the releaser.
I've been exploring the wretry action, but it doesn't play nicely with the Releaser action syntax.
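
As a rough illustration of those "few simple options", here is a purely hypothetical sketch; the inputs retries and max-wait do not exist in the releaser action today, and it assumes the action forwards its inputs to the script as environment variables (Docker/JavaScript actions get INPUT_* variables automatically, a composite action would have to forward them explicitly via env:):

import os

def waits_from_inputs():
    # Hypothetical inputs, forwarded by the action as environment variables.
    retries = int(os.environ.get("INPUT_RETRIES", "0"))
    max_wait = int(os.environ.get("INPUT_MAX_WAIT", "600"))  # seconds
    # Exponential backoff: 30 s, 60 s, 120 s, ... capped at max_wait.
    return [min(30 * 2 ** i, max_wait) for i in range(retries)]

The resulting list could feed a retry wrapper like the one sketched in the comment above, so retries: 3 with max-wait: 600 would translate into waits of 30 s, 60 s and 120 s.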

Samrose-Ahmed

Seeing Post "https://uploads.github.com/repos/matanolabs/matano/releases/75878670/assets?label=&name=matano-macos-x64.sh": unexpected EOF in the action.

Example of failed build.

Paebbels (Member)

I'm open to accepting pull requests.

Please also see #82.
