
Kernel gets stuck and does not compute any comment if too many simultaneous requests are sent #227

gentlementlegen opened this issue Dec 30, 2024 · 25 comments


@gentlementlegen (Member) commented Dec 30, 2024

What happened

When under heavy load, the kernel sometimes stops forwarding events to plugins. We typically notice this when users try to invoke commands and nothing happens afterwards. It gets resolved by redeploying, or by some later event that unsticks the kernel.

After lots of tests, it seems to get stuck around these lines:
https://github.com/ubiquity-os/ubiquity-os-kernel/blob/development/src/github/utils/config.ts#L63-L66

What was expected

The kernel should be able to handle heavy traffic, either delaying requests or cancelling them, and should not get stuck perpetually.

How to reproduce

The best way I found to reproduce the issue is to post lots of comments simultaneously. Here is a script that does so:

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({
  auth: "your gh token",
});

async function postCommentBatch({ owner, repo, issueNumber, batchSize = 10 }) {
  // Fire batchSize createComment requests concurrently to stress the kernel.
  const promises = Array(batchSize)
    .fill(null)
    .map(() =>
      octokit.rest.issues.createComment({
        owner,
        repo,
        issue_number: issueNumber,
        body: "/help",
      })
    );

  try {
    await Promise.all(promises);
    console.log(`Successfully posted batch of ${batchSize} comments`);
    return true;
  } catch (error) {
    if (error.status === 429) {
      // Rethrow rate-limit errors so the caller can back off before the next batch.
      throw error;
    }
    console.error("Error posting batch:", error.message);
    return false;
  }
}

async function postComments({
  owner,
  repo,
  issueNumber,
  timespan = 2000,
  delay = 100,
  batchSize = 10,
}) {
  const startTime = Date.now();
  const endTime = startTime + timespan;

  console.log(`Starting batch comment posting for ${timespan}ms with ${delay}ms delay between batches`);

  while (Date.now() < endTime) {
    try {
      const success = await postCommentBatch({
        owner,
        repo,
        issueNumber,
        batchSize,
      });

      if (success) {
        console.log(`Batch posted at ${new Date().toISOString()}`);
      }

      await new Promise((resolve) => setTimeout(resolve, delay));
    } catch (error) {
      if (error.status === 429) {
        console.error("Rate limit exceeded. Waiting before retrying...");
        await new Promise((resolve) => setTimeout(resolve, 1000));
      } else {
        console.error("Error in batch posting cycle:", error.message);
      }
    }
  }

  console.log("Finished posting all comment batches");
}

postComments({
  owner: "owner",
  repo: "repo",
  issueNumber: 3,
  timespan: 2000,
  delay: 100,
  batchSize: 10,
}).catch(console.error);

Let this run for a while and you will notice the bot stops responding.


Further findings: it seems the kernel stops working once the request limit for the GitHub token has been reached. That is why commands like /help do not work even though the comment event is received: the kernel cannot post the response comment back to the issue. Likewise, any plugin that needs an Action dispatched won't run. However, plugins that run as Workers invoked through fetch keep working fine.

If we remove the waitUntil call, the Worker run throws the following error: The script will never generate a response, which is silenced when waitUntil is used. This happens any time the Octokit instance is used, because the limit has been reached and no network call can be completed, resulting in a 403 error (thrown on the GitHub API side).
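
For context, a simplified sketch of the handler shape where this happens (dispatch and Env are placeholders, not the kernel's actual names; types come from @cloudflare/workers-types):

// Simplified shape of the Worker handler described above.
declare function dispatch(event: unknown): Promise<void>;
interface Env {}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const event = await request.json();
    // The webhook gets its response right away; if dispatch() later hangs or
    // rejects (e.g. on a 403 from GitHub), that only surfaces in the Worker logs.
    ctx.waitUntil(dispatch(event));
    return new Response("ok");
  },
};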

ubiquity-os-beta bot commented Dec 30, 2024

Note

The following contributors may be suitable for this task:

  • whilefoo (78% match): ubiquity/ubiquibot#479
  • gitcoindev (78% match): ubiquity-os/permit-generation#12
  • gentlementlegen (77% match): ubiquity-os/ubiquity-os-kernel#120

@0x4007 (Member) commented Dec 30, 2024

Thanks for sharing your research. It seems that a final solution isn't clear. I suppose that research can continue under this conversation thread. @whilefoo please look into this.

@gentlementlegen (Member Author) commented Dec 30, 2024

Basically, it seems that once the token reaches its limit, all the calls hang due to the use of waitUntil (which would otherwise just throw the error The script will never generate a response), because the GitHub API returns a 403 (secondary rate limit exceeded). However, even knowing the cause, I don't see how we can work around these limits. We could eventually look into lowering the number of calls we make, but that doesn't seem scalable.

As a temporary solution, I would suggest filtering out some events, like the workflow-related ones, because they fire a lot (once when a workflow starts, once when it ends), which would significantly lower API usage. Still, given that we are practically the only organization using the bot (plus your org and a few other users), this problem is bound to come back if we get more users.

@gentlementlegen (Member Author)

This is easily reproducible on a local setup as well, which gives more information about the failures. Even getToken fails with a 500 error once the token has exhausted its available quota, which means all subsequent requests fail as well. Interestingly enough, restarting the app solves the problem (same as with a deployed Worker).

@whilefoo (Contributor)

It seems the primary rate limit is 5000 requests per hour per org/repo, but we are hitting the secondary rate limit, which happens when too many requests are made at once: more than 100 concurrent requests, or too many per minute.
We will hit these limits even faster using GitHub-based storage.

One solution would be to space out events. For example, if we receive 100 events at once, we need to process them one by one rather than all at once, keeping priority in mind too: events that need an instant response, like commands, must be processed immediately, while others, like text embeddings, can be processed later.
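
A minimal sketch of that idea, assuming a hypothetical handleEvent entry point and a simple in-memory queue (not what the kernel currently does):

type QueuedEvent = { name: string; payload: unknown; priority: number };

const queue: QueuedEvent[] = [];

// Drain the queue one event at a time, highest priority first, with a small
// pause between events so we never burst into the secondary rate limit.
async function drainQueue(handleEvent: (event: QueuedEvent) => Promise<void>, delayMs = 500): Promise<void> {
  while (queue.length > 0) {
    queue.sort((a, b) => b.priority - a.priority);
    const event = queue.shift()!;
    await handleEvent(event); // sequential: one event in flight at a time
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}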

It seems the kernel stops working once the request limit for the GitHub token has been reached. That is why commands like /help do not work even though the comment event is received: the kernel cannot post the response comment back to the issue. Likewise, any plugin that needs an Action dispatched won't run. However, plugins that run as Workers invoked through fetch keep working fine.

But Worker plugins would still fail to execute since they use the same installation token passed from the kernel, wouldn't they?

Our Octokit uses the plugin-retry plugin, which should retry requests after the rate limit is over, but I think waitUntil still has a 30-second hard limit, so it would time out before it could retry.
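
For reference, a minimal sketch of combining plugin-retry with the throttling plugin so rate-limited calls either back off or fail fast instead of hanging (adding @octokit/plugin-throttling is an assumption on my part, not something the kernel does today):

import { Octokit } from "@octokit/core";
import { retry } from "@octokit/plugin-retry";
import { throttling } from "@octokit/plugin-throttling";

const KernelOctokit = Octokit.plugin(retry, throttling);

const octokit = new KernelOctokit({
  auth: "installation token", // placeholder
  throttle: {
    // Primary rate limit: retry once after the reset window.
    onRateLimit: (retryAfter, options, _octokit, retryCount) => retryCount < 1,
    // Secondary rate limit: fail fast instead of letting the call hang inside waitUntil.
    onSecondaryRateLimit: () => false,
  },
});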

@gentlementlegen Didn't we try running the kernel on Azure, or was that reverted?

@gentlementlegen (Member Author) commented Jan 1, 2025

Yes it happens because whenever we receive an event we do the following:

  1. fetch the configuration from the repository
  2. fetch the configuration from the organization
  3. for each plugin, fetch the manifest
  4. for each plugin that needs to be triggered, dispatch a workflow event

On average, one event involves ~10 calls to the GitHub API (and the same token is then used by all the plugins, which can make tons of calls as well). Plus, this is multiplied by each external organization using our bot.

The more plugins we have, the more calls are made, and this happens for literally any event, which is why I was suggesting filtering out workflow-related events: we don't use them, they get triggered at the start and at the end of each workflow, and we have around 4 workflows triggered per push on each of our repos. But this is obviously a short-term solution.

We could indeed have some event queue, but we can't delay the calls for too long without the Worker shutting down, and as you said, we need commands to stay responsive. I think we should find a way to avoid fetching the manifest of every plugin on every run, which would help lower API calls.
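
A minimal sketch of what that manifest caching could look like, assuming manifests rarely change (fetchManifest, the cache shape, and the TTL value are hypothetical; in a Worker this could also live in KV instead of isolate memory):

type Manifest = { name: string; description?: string };

// Cache manifests so repeated events don't re-fetch them from GitHub.
const manifestCache = new Map<string, { manifest: Manifest; fetchedAt: number }>();
const MANIFEST_TTL_MS = 10 * 60 * 1000; // refresh at most every 10 minutes

async function getManifestCached(
  pluginUrl: string,
  fetchManifest: (url: string) => Promise<Manifest>
): Promise<Manifest> {
  const cached = manifestCache.get(pluginUrl);
  if (cached && Date.now() - cached.fetchedAt < MANIFEST_TTL_MS) {
    return cached.manifest; // saves one GitHub API call per plugin per event
  }
  const manifest = await fetchManifest(pluginUrl);
  manifestCache.set(pluginUrl, { manifest, fetchedAt: Date.now() });
  return manifest;
}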

But Worker plugins would still fail to execute since they use the same installation token passed from the kernel, wouldn't they?

Yes, it seems to be, but then the kernel should work fine again after waiting for some time (when the limit resets); yet somehow it only works again right away when the kernel instance is restarted.

We have an Azure instance up and running, only configured for ubiquity-os-beta, and we do not use it at the moment, but it is available. Do you suggest this for longer runs?

@0x4007 (Member) commented Jan 1, 2025

This secondary rate limit seems like a mess to deal with. So is the plan now to make an event queue and handle events with a static delay between each one so we can avoid the rate limit?

@gentlementlegen (Member Author)

Like @whilefoo said, the problem is with commands that need an immediate response: if you have 100+ events in the queue, they would take too long to be triggered.

My suggestion as an immediate fix would be to filter out events we do not use, like the workflow ones, and probably to have plugins use their own credentials. I do not know whether their calls count towards our quota, but most likely they do.

@whilefoo (Contributor) commented Jan 3, 2025

Like @whilefoo said, the problem is with commands that need an immediate response: if you have 100+ events in the queue, they would take too long to be triggered.

I suggested multiple priority queues, but thinking about this more I realized it's not feasible, because you don't know the priority if you don't know the config. For example, an issue_comment.created event needs to be processed fast because it can be a command, but it can also be just a normal comment that gets processed in the background by text-embeddings, or by another plugin that needs it ASAP.

My suggestion as an immediate fix would be to filter out events we do not use, like the workflow ones, and probably to have plugins use their own credentials. I do not know whether their calls count towards our quota, but most likely they do.

We should definitely do that. I'm not sure about the credentials: plugins need to use the installation's token, but they can't obtain it by themselves.

The most obvious fix is to not store the config on GitHub but somewhere else, like a database. You could build queues on top of this by fetching the config from the database (each plugin would have a priority level) and putting the event in the queue for that priority level. Cloudflare Queues also have retries, so if the rate limit is hit, you can schedule a retry after X time, and they have a 30-minute time limit compared to 30 seconds on normal Workers. However, I think this idea won't be liked because it moves away from GitHub and creates a dependency on Cloudflare.
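
A rough sketch of that approach with Cloudflare Queues, just to show the wiring (the EVENT_QUEUE binding and handleEvent are assumptions; types come from @cloudflare/workers-types):

// handleEvent is a hypothetical kernel entry point.
declare function handleEvent(payload: unknown): Promise<void>;

interface Env {
  EVENT_QUEUE: Queue;
}

export default {
  // Producer: acknowledge the webhook immediately and enqueue it for later processing.
  async fetch(request: Request, env: Env): Promise<Response> {
    await env.EVENT_QUEUE.send(await request.json());
    return new Response("queued", { status: 202 });
  },

  // Consumer: gets a far larger time budget than a fetch handler, and failed
  // messages are redelivered by the queue instead of being lost.
  async queue(batch: MessageBatch<unknown>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        await handleEvent(msg.body);
        msg.ack();
      } catch {
        msg.retry(); // e.g. rate limit hit: let the queue retry later
      }
    }
  },
};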

@gentlementlegen (Member Author)

Okay then I will start filtering events, and link the changes here.

For the credentials, we share APP_ID and APP_PRIVATE_KEY in the organization, so the plugins should be able to authenticate themselves. We could also change the authentication logic in the SDK so that no changes in the plugins would be needed. I mentioned it because my concern is that outside organizations receive our token at the moment, don't they? I saw in the logs that ~3 different orgs have ubiquity-os installed.
If outside organizations use our own plugins, then the auth would be our bot (using APP_ID and APP_PRIVATE_KEY from our organization). But if they create their own plugins, it makes sense to me that they use their own credentials, otherwise their calls would count against our quota and they would have too many permissions.
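
A minimal sketch of what that plugin-side authentication could look like with the octokit App helper (the installation id is assumed to come from the event payload; whether the SDK would expose it exactly this way is an open question):

import { App } from "octokit";

// Authenticate the plugin with the shared app credentials and the installation id
// from the incoming event, instead of reusing the token forwarded by the kernel.
async function getInstallationOctokit(installationId: number) {
  const app = new App({
    appId: process.env.APP_ID!,
    privateKey: process.env.APP_PRIVATE_KEY!,
  });
  return app.getInstallationOctokit(installationId);
}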

The queue seems to introduce a lot of complexity and fragile logic; I think we'd be better off avoiding it for now.

@gentlementlegen (Member Author)

I realized that in my organization I never had these events triggered, and then understood that you can disable them directly within the GitHub App settings. I disabled them for our two bots:
(screenshot: GitHub App webhook event settings)

@whilefoo (Contributor) commented Jan 5, 2025

For the credentials, we share APP_ID and APP_PRIVATE_KEY in the organization, so the plugins should be able to authenticate themselves. We could also change the authentication logic in the SDK so that no changes in the plugins would be needed. I mentioned it because my concern is that outside organizations receive our token at the moment, don't they? I saw in the logs that ~3 different orgs have ubiquity-os installed. If outside organizations use our own plugins, then the auth would be our bot (using APP_ID and APP_PRIVATE_KEY from our organization). But if they create their own plugins, it makes sense to me that they use their own credentials, otherwise their calls would count against our quota and they would have too many permissions.

You're saying that each organization that uses our bot should create their own GitHub App and share its credentials with the plugin via environment variables?
I think that adds a lot of friction and setup on the part of the organization, and I'm not sure it fixes the rate limit. The rate limit is based on the installation token / org, so there's no difference between using our bot or their own bot. An installation token only has permissions on that organization, so a plugin would have the same level of permissions whether it uses our bot or the organization's own bot.

@gentlementlegen (Member Author)

Yes, that would be my suggestion, but I do agree that it would add friction. Doesn't it feel dangerous, though, that a third party can create its own plugin with our token's permissions?

And yes, it would count against our own token if all the requests their plugin makes use our token.

@whilefoo (Contributor) commented Jan 5, 2025

Yes, that would be my suggestion, but I do agree that it would add friction. Doesn't it feel dangerous, though, that a third party can create its own plugin with our token's permissions?

That was a concern from the start, but it is not that critical because they can't access other organizations with that token, only the one that installed the plugin, and that organization has to trust the plugin, otherwise they wouldn't install it.
The most damage they could do is force-push to a repo, but that's possible in either case (our GitHub App or the organization's own GitHub App).
There was an idea that third-party plugins would have to go through the kernel to perform actions on GitHub, so the kernel could restrict which operations are allowed.

I understand now that our GitHub App would only be used to fetch configs and manifests and to dispatch workflows, while the organization's App would be used by the plugin, which would alleviate the rate-limit problem. However, I feel like this would be too much friction.

@gentlementlegen (Member Author)

In my mind, the following would happen:

  • if an external organization just wishes to use our product, they only have to install the bot and link our plugins in their configuration; no extra step is needed
  • if they wish to develop their own custom plugin, they would have to provide authentication and use their own GitHub App credentials

I think this would be beneficial for two main reasons:

  • they can give different access levels to that plugin; imagine they create something that manages billing manager access, for example: our app does not have read or write access to this, so they would need their own token anyway
  • at the moment, they can post any comment on behalf of the ubiquity-os bot, which could allow them to post fraudulent reward links, for example, even within our repos I suppose, since the token would have access

@0x4007 (Member) commented Jan 6, 2025

  • they can give different access levels to that plugin; imagine they create something that manages billing manager access, for example: our app does not have read or write access to this, so they would need their own token anyway

Possible, but we haven't had to elevate permissions in a very long time; I think we have it mostly covered. Worst-case scenario: if they aren't doing anything payment related, they can simply make a GitHub Action. The only secret sauce we should be focusing on is providing the infrastructure to essentially map any webhook event to a financial reward and to allow the distribution of that reward.

  • at the moment, they can post any comment on behalf of the ubiquity-os bot, which could allow them to post fraudulent reward links, for example, even within our repos I suppose, since the token would have access

This is only true if we 1. accept their changes in a pull request to the kernel and/or 2. install that plugin on our repos.

@gentlementlegen (Member Author)

New updates regarding the quota:

  • @0x4007 noticed that the kernel runs when devpool-directory is updated, which happens quite often and is not needed
  • we could filter out this repo, but it might need to be hard-coded or set within the Worker environment directly

@0x4007 (Member) commented Jan 6, 2025

Setting it in the environment seems appropriate! Perhaps we can set an array of values so any org/repo slug can be ignored:

["ubiquity/devpool-directory"]

Come to think of it though, we may even be able to deprecate the issues being opened in that repository, because now we simply aggregate them into a JSON object, although it is kind of nice to see the confirmation that an issue is in the directory when the link-back occurs.
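
A sketch of how such an ignore list could be checked before the kernel fetches any configuration (the EVENT_IGNORE_LIST variable name and the payload shape are made up for illustration):

// e.g. EVENT_IGNORE_LIST='["ubiquity/devpool-directory"]' in the Worker environment
function isIgnoredRepository(
  env: { EVENT_IGNORE_LIST?: string },
  payload: { repository?: { full_name: string } }
): boolean {
  const ignored: string[] = JSON.parse(env.EVENT_IGNORE_LIST ?? "[]");
  const slug = payload.repository?.full_name;
  return slug !== undefined && ignored.includes(slug);
}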

@whilefoo (Contributor) commented Jan 6, 2025

  • if they wish to develop their own custom plugin, they would have to provide authentication and use their own GitHub App credentials

But the consumer of a third-party plugin would have to install that third party's App, so you would end up with a lot of apps installed. I think we're still a long way from third-party plugins though, so we can think about this later.

@0x4007 (Member) commented Jan 6, 2025

I don't like the idea of installing custom apps for plugins. It's not a good approach.

@gentlementlegen (Member Author)

It seems that lowering the number of calls didn't really solve the problem; the kernel still gets stuck often (today in particular, because GitHub's servers seem to be partially down). The rate limit in the logs always shows 5k+ calls remaining (primary rate limit). When stuck, it usually stays at "trying to fetch configuration from" or "dispatching event for" and then nothing, meaning the Octokit call never completed. However, no error or log is shown afterwards.

The next changes I will try:

  • add a wrapper with an explicit timeout to avoid getting stuck forever inside waitUntil (see the sketch after this list)
  • add the logger plugin to Octokit so we can have logs about the ongoing requests
  • explicitly bind Cloudflare's fetch to the Octokit instance
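
For the first item, a minimal sketch of the timeout wrapper (the 25-second default is an arbitrary value under the Worker's 30-second budget, and eventHandler is hypothetical):

// Reject if the wrapped promise does not settle within timeoutMs, so a hung
// Octokit call cannot keep the waitUntil task pending forever.
async function withTimeout<T>(promise: Promise<T>, timeoutMs = 25_000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Usage inside the fetch handler:
// ctx.waitUntil(withTimeout(eventHandler.dispatch(event)));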

@gentlementlegen (Member Author)

I tried what I mentioned above and the following:

  • disabling the retry plugin
  • removing the custom error logs we had in the Octokit plugins

The problem is the same: I can see that a GET request is being sent, but nothing happens afterwards.

request {
  method: 'GET',
  baseUrl: 'https://api.github.com',
  headers: {
    accept: 'application/vnd.github.v3+json',
    'user-agent': 'octokit-core.js/6.1.2 Cloudflare-Workers'
  },
  mediaType: { format: 'raw', previews: [] },
  request: {
    fetch: [Function: bound fetch],
    hook: [Function: bound bound register]
  },
  url: '/repos/{owner}/{repo}/contents/{path}',
  owner: 'ubiquity',
  repo: '.github-private',
  path: '.github/.ubiquity-os.config.yml'
}

Every time this happens, redeploying the kernel makes it work again for around 1 hour, and then it gets stuck again. I don't think this is due to the secondary rate limit either, because the traffic averages 0.06 requests per second, which is very low.

@0x4007 (Member) commented Jan 11, 2025

Would you say it's safe to blame Cloudflare then? Have you considered A/B testing the kernel on another platform like Azure or something?

@gentlementlegen (Member Author)

@0x4007 I should definitely try with another service, yes. I don't think Cloudflare itself is to blame, because requests using fetch work normally and events seem to be received just fine, but the fact that they use their own implementation of fetch could be part of the cause.

@whilefoo (Contributor)

Is it possible that the 10 ms CPU limit is reached and Cloudflare shuts down the Worker? But it's weird that it only happens after 1 hour.
