duplicate hook numbers sending concurrent webhooks #997

Open
jbrockopp opened this issue Aug 26, 2024 · 4 comments
Labels
bug Indicates a bug

Comments

@jbrockopp
Contributor

Description

There is a known issue where webhooks sent in rapid succession can produce a duplicate build number error on the server:

duplicate key value violates unique constraint "builds_repo_id_number_key"

To minimize the impact of this issue, we introduced some retry logic in the webhook workflow:

#213 (comment)

NOTE: The retry logic did not eliminate the behavior completely, but it did reduce how often it occurs.

However, recently we started noticing a slightly different flavor of this error specific to hooks:

{"error":"unable to create webhook OrgName/RepoName/BuildNumber: ERROR: duplicate key value violates unique constraint \"hooks_repo_id_number_key\" (SQLSTATE 23505)"}

We believe this is the same kind of underlying problem: concurrent webhooks for the same repository are processed at the same time, and each attempts to insert a record into the database with the same number. After some exploration, we didn't find any evidence that the hooks table has the same retry logic we set up for the builds table:

https://github.com/go-vela/server/blob/main/api/webhook/post.go#L261-L271

We should add retry logic for the hooks table, like we did for the builds table, to account for this problem.
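
For reference, a minimal sketch of what that retry could look like in Go, assuming hypothetical `LastHookNumberForRepo` and `CreateHook` helpers (not the actual Vela database API; the real pattern for builds lives in the linked post.go):

```go
package webhook

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// Hook and Database are hypothetical stand-ins used only to make this
// sketch self-contained; they are not the real Vela types.
type Hook struct {
	RepoID int64
	Number int64
}

type Database interface {
	LastHookNumberForRepo(ctx context.Context, repoID int64) (int64, error)
	CreateHook(ctx context.Context, h *Hook) error
}

// retryCreateHook retries the hook insert a few times when the unique
// constraint on (repo_id, number) is violated, recomputing the next hook
// number on each attempt.
func retryCreateHook(ctx context.Context, db Database, h *Hook) error {
	const attempts = 3

	var err error
	for i := 0; i < attempts; i++ {
		// recompute the next hook number from the latest record for the repo
		last, lerr := db.LastHookNumberForRepo(ctx, h.RepoID)
		if lerr != nil {
			return lerr
		}
		h.Number = last + 1

		err = db.CreateHook(ctx, h)
		if err == nil {
			return nil
		}

		// only retry on the duplicate key error (SQLSTATE 23505);
		// anything else is returned immediately
		if !strings.Contains(err.Error(), "23505") {
			return err
		}

		// brief backoff before the next attempt
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
	}

	return fmt.Errorf("unable to create hook after %d attempts: %w", attempts, err)
}
```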

Value

  • users following the standard webhook workflow no longer face issues with Vela not triggering builds

Useful Information

  1. What is the output of vela --version?

v0.23.4

  2. What operating system is being used?

https://www.flatcar.org/

  3. Any other important details?

The reports we've received thus far have been limited, so we're still attempting to gather additional data and information. They describe roughly the same symptom: "~10% of the time we'll notice that when we merge a pull request the push does not get processed by Vela". After some exploration, we've identified that the affected repositories have automatic branch deletion configured:

https://docs.github.com/en/[email protected]/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-the-automatic-deletion-of-branches

On the GitHub side, we can confirm that this behavior causes two push events to be fired at the exact same time:

  • a push to the target branch for the PR
  • a push for the deletion event of the source branch from the PR

Also, these reports cover a more recent time window of "about the last month or so". Unfortunately, we're not able to verify exactly when this started because GitHub doesn't retain webhook deliveries that far back.
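
For illustration only, here is a small self-contained Go program that reproduces the shape of this race with an in-memory stand-in for the hooks table (none of these names are Vela code). Each of the two simultaneous deliveries reads the last hook number and then inserts last+1; because the read and the insert are not atomic, both typically pick the same number and one insert hits the unique constraint:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDuplicate mirrors the Postgres unique constraint error seen above.
var errDuplicate = errors.New(`duplicate key value violates unique constraint "hooks_repo_id_number_key" (SQLSTATE 23505)`)

// fakeStore stands in for the hooks table of a single repo, enforcing a
// unique constraint on the hook number. It is purely illustrative.
type fakeStore struct {
	mu    sync.Mutex
	hooks map[int64]bool
}

// lastNumber returns the highest hook number currently stored.
func (s *fakeStore) lastNumber() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var last int64
	for n := range s.hooks {
		if n > last {
			last = n
		}
	}
	return last
}

// create inserts a hook number, failing if it already exists.
func (s *fakeStore) create(number int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.hooks[number] {
		return errDuplicate
	}
	s.hooks[number] = true
	return nil
}

func main() {
	store := &fakeStore{hooks: map[int64]bool{41: true}}

	// Two deliveries land at the same time: the push to the target branch
	// and the push for the source branch deletion. Each reads the last hook
	// number, then inserts last+1; depending on scheduling, both usually
	// pick 42 and one insert fails with the duplicate key error.
	var wg sync.WaitGroup
	for _, event := range []string{"push (merge commit)", "push (branch deletion)"} {
		wg.Add(1)
		go func(event string) {
			defer wg.Done()
			next := store.lastNumber() + 1
			if err := store.create(next); err != nil {
				fmt.Printf("%s: %v\n", event, err)
				return
			}
			fmt.Printf("%s: created hook number %d\n", event, next)
		}(event)
	}
	wg.Wait()
}
```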

@jbrockopp jbrockopp added the bug Indicates a bug label Aug 26, 2024
@go-vela go-vela deleted a comment from Lxx-c Aug 26, 2024
@go-vela go-vela deleted a comment Aug 26, 2024
@jbrockopp
Contributor Author

@Lxx-c @NaiveShiki we've deleted both of your comments as they appear to be bot-driven activity rather than comments from real people.


However, if we misunderstood the intention of your comments, please feel free to contribute back to the discussion.

@go-vela go-vela locked as spam and limited conversation to collaborators Aug 26, 2024
@go-vela go-vela deleted a comment Aug 26, 2024
@ecrupper
Contributor

To add some additional context, I think the fact that this is a recent development points to this change as a potential culprit. Previously, delete pushes were promptly discarded; now we create a hook record for them.

I do wonder if this is yet another push (pun intended) toward a job queue model for handling webhook payloads... I know this is what GitHub recommends:

In order to respond in a timely manner, you may want to set up a queue to process webhook payloads asynchronously. Your server can respond when it receives the webhook, and then process the payload in the background without blocking future webhook deliveries.
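
As a rough sketch of that shape (an assumption about how it could be wired, not the actual Vela architecture), the handler below acknowledges each delivery immediately and pushes the payload onto an in-process queue that a single background worker drains:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// delivery carries the raw webhook payload plus the headers needed to
// process it later.
type delivery struct {
	event   string
	guid    string
	payload []byte
}

func main() {
	// buffered channel as a stand-in for a real queue (a database table,
	// Redis, etc.); if it fills up, the handler responds with a 503
	queue := make(chan delivery, 1024)

	// a single worker drains the queue, so records for a repository are
	// created sequentially instead of racing in concurrent handlers
	go func() {
		for d := range queue {
			// placeholder for the real processing: create the hook record,
			// compile the pipeline, publish the build, ...
			fmt.Printf("processing %s delivery %s (%d bytes)\n", d.event, d.guid, len(d.payload))
		}
	}()

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "unable to read payload", http.StatusBadRequest)
			return
		}

		d := delivery{
			event:   r.Header.Get("X-GitHub-Event"),
			guid:    r.Header.Get("X-GitHub-Delivery"),
			payload: body,
		}

		select {
		case queue <- d:
			// acknowledge right away; processing happens in the background
			w.WriteHeader(http.StatusAccepted)
		default:
			http.Error(w, "webhook queue full", http.StatusServiceUnavailable)
		}
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A single worker per instance serializes number assignment, but a multi-replica deployment would still need a shared queue or database-level protection, so a design like this would complement rather than replace the retry logic.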

@jbrockopp
Contributor Author

jbrockopp commented Sep 4, 2024

@ecrupper thanks for that link and extra context!

We received some additional feedback that this behavior may have been occurring for longer than we thought. From what we've gathered, the repos that had already opted in to GitHub's automatic branch deletion feature have been experiencing this issue for quite some time ("months"). So the "about a month or so" window from this most recent report was simply the first we had heard of it; other people started chiming in with their own experiences afterward.

Given all of this, we agree that the change you shared appears to be the most likely culprit: it was introduced in the v0.23 release of the Vela server, and the timelines we've gathered from these reports line up with when we upgraded internally to that version. We briefly discussed this internally and also agree that it further elevates the need for a different method of processing webhooks; following GitHub's recommendation sounds like the right path forward.

In the meantime, we are planning to add retry logic that follows the same pattern we used for the build number issue in the past. We acknowledge that this is a "bandaid fix", but it's a lower-effort change that should work around the issue and ideally prevent it from happening at all. That should give us enough time to build a better pattern for processing webhooks in a queue-based model, like GitHub recommends.

@jbrockopp jbrockopp self-assigned this Sep 4, 2024
@jbrockopp
Contributor Author

A PR has been added to reduce the frequency of this issue:

However, we aren't closing this ticket because the error can still occur, similar to:

#213

@jbrockopp jbrockopp removed their assignment Oct 28, 2024