Kernel gets stuck and does not compute any comment if too many simultaneous requests are sent #227
Comments
Note: The following contributors may be suitable for this task: whilefoo, gitcoindev, gentlementlegen
Thanks for sharing your research. It seems that a final solution isn't clear. I suppose that research can continue under this conversation thread. @whilefoo please look into this.
Basically, it seems that once the token has reached its limit, the calls all hang due to the use of `waitUntil`. As a temporary solution, I could suggest filtering out some events that we do not use.
This is easily reproducible on a local setup as well, which gives more information about the failures.
It seems the primary rate limit is 5000 requests per hour per org/repo, but we are hitting the secondary rate limit, which is triggered when too many requests happen at once (100 concurrent, or too many per minute). One solution would be to space out events: for example, if we receive 100 events at once we should process them one by one rather than all at once, keeping priority in mind too, since events that need an instant response (like commands) must be processed immediately while others (like text embeddings) can be processed later.
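A rough sketch of what spacing events out with a simple priority hint could look like; the names (`QueuedEvent`, `enqueueEvent`, `processEvent`) and the delay are hypothetical, not part of the kernel:

```ts
// Sketch: drain webhook events one at a time, with high-priority events
// (commands) jumping ahead and a pause between requests to avoid the
// secondary rate limit.
type QueuedEvent = {
  name: string;
  payload: unknown;
  priority: "high" | "low"; // e.g. commands vs. text embeddings
};

const queue: QueuedEvent[] = [];
let draining = false;

export function enqueueEvent(event: QueuedEvent) {
  if (event.priority === "high") {
    queue.unshift(event); // commands go to the front of the queue
  } else {
    queue.push(event);
  }
  if (!draining) void drain();
}

async function drain() {
  draining = true;
  while (queue.length) {
    const event = queue.shift()!;
    await processEvent(event); // the kernel's existing handling would go here
    // Space requests out so we never burst 100 calls at once.
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
  draining = false;
}

async function processEvent(event: QueuedEvent) {
  // Placeholder: fetch config, dispatch to plugins, etc.
  console.log(`processing ${event.name}`);
}
```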
But Worker plugins would still fail to execute as they use the same installation token passed from the kernel, wouldn't they? Our octokit uses the plugin-retry plugin, which should retry requests after the rate limit is over, but I think that is not what we are seeing here. @gentlementlegen Didn't we try running the kernel on Azure, or was that reverted?
Yes, it happens because whenever we receive an event we do the following:
- fetch the configuration for the organization/repository,
- fetch the manifest of every installed plugin,
- dispatch the event to the matching plugins (Action dispatch or Worker fetch).
On average one event involves ~10 calls to the GitHub API (and subsequently we use the same token in all the plugins, which can make tons of calls as well). Plus, this is multiplied by each external organization using our bot. The more plugins we have, the more calls will be made, and this happens for literally any event, which is why I was suggesting filtering out the events we do not use. We could have some event queue indeed, but we can't delay the calls for too long without the worker shutting down. And as you said, we need commands to still be responsive. I think we should find a way to avoid fetching the manifest for all the plugins on each run, which would help lower API calls.
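For that last point, a minimal sketch of caching plugin manifests between runs, assuming each plugin exposes a `manifest.json` at the repository root; the helper name and the TTL are illustrative, not the kernel's actual code:

```ts
// Sketch: keep plugin manifests in an in-memory cache so the kernel does not
// re-fetch them from GitHub on every event. A KV namespace would be more
// durable across Worker isolates, but the idea is the same.
import { Octokit } from "@octokit/rest";

type Manifest = { name: string; commands?: Record<string, unknown> };

const manifestCache = new Map<string, { manifest: Manifest; fetchedAt: number }>();
const CACHE_TTL_MS = 10 * 60 * 1000; // re-fetch at most every 10 minutes

export async function getManifest(octokit: Octokit, owner: string, repo: string): Promise<Manifest> {
  const key = `${owner}/${repo}`;
  const cached = manifestCache.get(key);
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
    return cached.manifest; // no API call at all
  }
  const { data } = await octokit.rest.repos.getContent({ owner, repo, path: "manifest.json" });
  const encoded = (data as { content: string }).content;
  const manifest = JSON.parse(atob(encoded)) as Manifest;
  manifestCache.set(key, { manifest, fetchedAt: Date.now() });
  return manifest;
}
```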
Yes, it seems to be, but the kernel should then work fine again after waiting for some time (when the limit resets); yet it seems to work right away when restarting the kernel instance, somehow. We have an Azure instance up and running, but only with a limited configuration.
This secondary rate limit seems like a mess to deal with. So is the plan now to make an event queue and handle events with a static delay timer between each, so we can avoid the rate limit?
Like @whilefoo said, the problem is with commands that need an immediate response: if you have 100+ events in the queue, they would take too long to be triggered. My suggestion as an immediate fix would be to filter out the events we do not use.
I suggested multiple priority queues, but thinking about this more I realized that it's not feasible, because you don't know an event's priority if you don't know the config.
We should definitely do that. I'm not sure about the credentials: they need to use the installation's token but they can't get it by themselves. The most obvious fix is to not store the config in GitHub but somewhere else, like a database. You could build queues on top of this by fetching the config from the database (each plugin would have a priority level) and putting the event in the queue for that priority level. Cloudflare Queues also have retries, so if the rate limit is hit you can schedule a retry after X time, and they also have a 30-minute time limit compared to 30 seconds on normal Workers. However, I think this idea won't be liked because it moves away from GitHub and creates a dependency on Cloudflare.
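A minimal sketch of what the Cloudflare Queues variant could look like, assuming a hypothetical `EVENT_QUEUE` binding and message shape; the retry delay is arbitrary:

```ts
// Sketch: the fetch handler only enqueues the webhook, and a queue consumer
// processes it with a longer time budget and built-in retries.
import type { MessageBatch, Queue } from "@cloudflare/workers-types";

interface KernelEvent {
  eventName: string;
  payload: unknown;
}

export default {
  // Producer: acknowledge GitHub immediately and defer all processing.
  async fetch(request: Request, env: { EVENT_QUEUE: Queue<KernelEvent> }): Promise<Response> {
    const payload = await request.json();
    const eventName = request.headers.get("x-github-event") ?? "unknown";
    await env.EVENT_QUEUE.send({ eventName, payload });
    return new Response("queued", { status: 202 });
  },

  // Consumer: if a rate limit (or any error) is hit, retry later instead of
  // losing the event.
  async queue(batch: MessageBatch<KernelEvent>): Promise<void> {
    for (const message of batch.messages) {
      try {
        await handleEvent(message.body); // the kernel's existing dispatch logic
        message.ack();
      } catch {
        message.retry({ delaySeconds: 60 });
      }
    }
  },
};

async function handleEvent(event: KernelEvent): Promise<void> {
  // Placeholder for config fetch + plugin dispatch.
}
```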
Okay, then I will start filtering events and link the changes here. For the credentials, each organization could create its own GitHub App and share the credentials with the plugins through environment variables. The queue seems to introduce a lot of complexity and fragile logic; I think we'd be better off avoiding it for now.
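A minimal sketch of such an event filter at the kernel's entry point; the ignored event names are placeholders, not a proposed list:

```ts
// Sketch: drop webhook events the kernel has no handler for, before any
// GitHub API call (config or manifest fetch) is made.
const IGNORED_EVENTS = new Set<string>([
  // Hypothetical examples of noisy events with no kernel handler.
  "status",
  "check_suite",
]);

export async function onWebhook(request: Request): Promise<Response> {
  const eventName = request.headers.get("x-github-event") ?? "";
  if (IGNORED_EVENTS.has(eventName)) {
    // Acknowledge immediately: no config fetch, no manifest fetch, no dispatch.
    return new Response("ignored", { status: 200 });
  }
  return dispatchToPlugins(eventName, await request.json());
}

async function dispatchToPlugins(eventName: string, payload: unknown): Promise<Response> {
  // Placeholder for the kernel's existing plugin dispatch logic.
  return new Response("ok");
}
```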
You're saying that each organization that uses our bot should create its own GitHub App and share the credentials with the plugin via environment variables?
Yes that would be my suggestion, but I do agree that it would add friction. Doesn't it feel dangerous that a third party can create its own plugin with our token elevations though? And yes it would count against our own token if all the requests they do in their plugin use our token. |
That was a concern from the start, but it is not that critical because they can't access other organizations with that token, only the one that installed the plugin, and that organization has to trust the plugin, otherwise they wouldn't install it. I understand now that our GitHub App would only be used to fetch configs and manifests and dispatch workflows, and the organization's App would be used for the plugin, which would alleviate the problem with rate limits. However, I feel like this would add too much friction.
In my mind, the following would happen:
- the kernel keeps using our GitHub App, but only to fetch configurations and manifests and to dispatch plugins;
- each organization creates its own GitHub App and shares its credentials with the plugins through environment variables;
- plugins make their GitHub API calls with the organization's own token instead of ours.

I think this would be beneficial for two main reasons:
- the rate limit would be spread across each organization's token instead of all counting against ours;
- third-party plugins would no longer run with our token's permissions.
Possible, but we haven't had to elevate permissions in a very long time. I think we have it mostly covered. Worst case scenario: if they aren't doing anything payment-related they can simply make a GitHub Action. The only secret sauce we should be focusing on is providing the infrastructure to essentially map any webhook event to a financial reward and to allow the distribution of that reward.
This is only true if we 1. accept their changes in a pull request to the kernel and/or 2. install that plugin on our repos.
New updates regarding the quota:
Setting it in the environment seems appropriate! Perhaps we can set an array of values so that any org/repo slug can be ignored.
Come to think of it though, we may even be able to deprecate the issues being opened in that repository, because now we simply aggregate them into a JSON object, although it is kind of nice to see the confirmation, when the link back occurs, that it is in the directory.
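A minimal sketch of an environment-driven ignore list, assuming a hypothetical `IGNORED_SLUGS` variable holding a JSON array of org or org/repo slugs:

```ts
// Sketch: skip events coming from ignored organizations or repositories,
// configured entirely through the environment.
interface Env {
  // e.g. IGNORED_SLUGS='["some-org/some-repo","another-org"]' (hypothetical)
  IGNORED_SLUGS?: string;
}

export function isIgnoredRepository(env: Env, owner: string, repo: string): boolean {
  const ignored: string[] = env.IGNORED_SLUGS ? JSON.parse(env.IGNORED_SLUGS) : [];
  const slug = `${owner}/${repo}`.toLowerCase();
  // Accept either a whole org ("some-org") or a single repo ("some-org/repo").
  return ignored.some(
    (entry) => entry.toLowerCase() === slug || entry.toLowerCase() === owner.toLowerCase()
  );
}
```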
But the consumer of a third-party plugin would have to install that plugin's app, so you would end up with many apps installed. I think we're still a long way from third-party plugins though, so we can think about this later.
I don't like the idea of installing custom apps for plugins. It's not a good approach |
It seems that lowering the amount of calls didn't really solve the problem; the kernel still gets stuck often (today particularly, because the GitHub servers seem to be partially down). The rate limit in the logs always shows 5k+ calls remaining (primary rate limit). When stuck, it usually stays at "trying to fetch configuration from" or "dispatching event for" and then nothing, meaning the octokit call never made it. However, no error or log is shown afterwards. The next changes I will try:
I tried what I mentioned above and the following:
The problem is the same.
Every time this happens, when I redeploy the kernel it works again for around 1 hour and then gets stuck again. I don't think this is due to the secondary rate limit either, because the average number of requests per second is low.
Would you say it's safe to blame Cloudflare then? Have you considered A/B testing the kernel on another platform like Azure or something?
@0x4007 Yes, I should definitely try with another service. I don't think Cloudflare is to blame though.
Is it possible that the 10 ms CPU limit is reached and Cloudflare shuts down the Worker? But it's weird that it only happens after 1 hour.
What happened
When under heavy load, the kernel sometimes stops forwarding events to plugins. We sometimes notice this when users try to invoke commands and nothing happens afterwards. This gets solved by redeploying, or by an event that happens to unstick the kernel.
After lots of tests, it seems to get stuck around these lines:
https://github.com/ubiquity-os/ubiquity-os-kernel/blob/development/src/github/utils/config.ts#L63-L66
What was expected
The kernel should be able to handle heavy traffic, either delaying requests or cancelling them, and should not get stuck perpetually.
How to reproduce
The best way I found to reproduce the issue is to post lots of comments at the same time. Here is a script that achieves this:
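A minimal sketch of such a flood script (the original is not reproduced here), assuming `@octokit/rest`, a token in `GITHUB_TOKEN`, and placeholder repository and issue values:

```ts
// Sketch: fire a large batch of comment creations at once, without spacing
// them out, to push the installation token into its rate limit.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function main() {
  const owner = "your-org"; // hypothetical
  const repo = "your-repo"; // hypothetical
  const issue_number = 1; // hypothetical

  const requests = Array.from({ length: 100 }, (_, i) =>
    octokit.rest.issues.createComment({
      owner,
      repo,
      issue_number,
      body: `/help load test ${i}`,
    })
  );
  const results = await Promise.allSettled(requests);
  const failed = results.filter((r) => r.status === "rejected").length;
  console.log(`${failed} of ${requests.length} requests failed`);
}

main().catch(console.error);
```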
Let this run for a while; after some time you will notice the bot no longer responds.
Further findings: it seems that it stops working once the request limit for the GitHub token has been reached. That's why commands like `/help` will not work even though the comment is received: the kernel cannot post the reply back to the issue. Likewise, any plugin that needs an Action dispatch won't run. However, plugins that run as Workers through `fetch` will work fine.

If we remove the `waitUntil` function, the Worker run throws the following error: `the script will never generate a response`, which gets silenced within `waitUntil` when used. This happens any time the `Octokit` instance is used, because the limit has been reached and no network call can be sent, resulting in a `403` error (thrown from the GitHub API side).
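A minimal sketch of how an explicit catch around the promise passed to `waitUntil` can surface these silenced failures in the logs; the handler names are illustrative, not the kernel's actual code:

```ts
// Sketch: rejections inside a promise handed to ctx.waitUntil are not
// reported to the client, so a failing Octokit call disappears silently.
// Catching explicitly makes the 403 visible in the Worker logs.
import type { ExecutionContext } from "@cloudflare/workers-types";

export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const work = handleWebhook(request).catch((error) => {
      // Without this catch the rejection is swallowed by waitUntil.
      console.error("event processing failed", error);
    });
    ctx.waitUntil(work);
    // Respond to GitHub immediately; processing continues in the background.
    return new Response("ok", { status: 200 });
  },
};

async function handleWebhook(request: Request): Promise<void> {
  // Placeholder for the kernel's real logic (config fetch, plugin dispatch).
  // If the token is rate limited, the GitHub API call here rejects with a 403.
}
```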