Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change CNI token refresh retry duration to one minute #9539

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

huutomerkki
Copy link

@huutomerkki huutomerkki commented Nov 28, 2024

Description

This PR changes the dynamic timing to constant once per minute refresh rate for the CNI token, as per the discussion in slack about the possible solution for the linked issue.

I have manually verified that following the reproduction steps with these changes included in the calico-node image fixes the issues described in #9235.

Related issues/PRs

Fixes #9235

Todos

  • [?] Tests
  • [?] Documentation
  • [?] Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@huutomerkki huutomerkki requested a review from a team as a code owner November 28, 2024 11:43
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Nov 28, 2024
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Nov 28, 2024
@CLAassistant
Copy link

CLAassistant commented Nov 28, 2024

CLA assistant check
All committers have signed the CLA.

@caseydavenport
Copy link
Member

@huutomerkki thanks for raising this PR!

I want to push a little bit on the aggressive timer here - for some of our larger scale users, this is going to result in a large increase in the number of TokenRequest API calls that Calico makes.

We have users running 5k node clusters (and even larger!), which will result 5k requests per 60 seconds, or 83 requests per second across the entire cluster! I think this is probably just not workable for that scale.

Some thoughts (in no particular order):

  • I think refreshing after a failed request is probably the ideal approach, but we need to be careful that we don't allow the CNI plugin to indefinitely propagate its own token. It would need to be some sort of privileged communication back from the CNI plugin to calico/node to trigger this, which might be tricky.
  • Perhaps just a larger timeout (say, every 1h instead of every 6h) would satisfy my concerns, but I think that wouldn't meet your goals IIRC.
  • Perhaps we just need this to be configurable, so that it can maintain its reasonable default of 6h but allow users with stricter requirements to configure it to be smaller. I don't love the idea of exposing config for this, as it's a fairly low level thing that most users don't need to care about, but sometimes that is required.

I think my preferred option would be to find a way to communicate back to the token refresher that it needs to generate a new token, but I think exposing such a channel is risky and has potential for abuse if we're not careful (imagine an attacker getting access and requesting non-stop token refresh attempts. We'd need a backstop to rate limit this for sure). This trigger would likely need to be something like a restricted UDS or file-based trigger that the CNI plugin can write to when needed in response to a permissions error - WDYT?

@huutomerkki
Copy link
Author

@huutomerkki thanks for raising this PR!

I want to push a little bit on the aggressive timer here - for some of our larger scale users, this is going to result in a large increase in the number of TokenRequest API calls that Calico makes.

We have users running 5k node clusters (and even larger!), which will result 5k requests per 60 seconds, or 83 requests per second across the entire cluster! I think this is probably just not workable for that scale.

Some thoughts (in no particular order):

* I think refreshing after a failed request is probably the ideal approach, but we need to be careful that we don't allow the CNI plugin to indefinitely propagate its own token. It would need to be some sort of privileged communication back from the CNI plugin to calico/node to trigger this, which might be tricky.

* Perhaps just a larger timeout (say, every 1h instead of every 6h) would satisfy my concerns, but I think that wouldn't meet your goals IIRC.

* Perhaps we just need this to be configurable, so that it can maintain its reasonable default of 6h but allow users with stricter requirements to configure it to be smaller. I don't love the idea of exposing config for this, as it's a fairly low level thing that most users don't need to care about, but sometimes that is required.

I think my preferred option would be to find a way to communicate back to the token refresher that it needs to generate a new token, but I think exposing such a channel is risky and has potential for abuse if we're not careful (imagine an attacker getting access and requesting non-stop token refresh attempts. We'd need a backstop to rate limit this for sure). This trigger would likely need to be something like a restricted UDS or file-based trigger that the CNI plugin can write to when needed in response to a permissions error - WDYT?

I agree, minute as the refresh rate is a lot for clusters of that size. In the case I have, the requested max refresh time is around 5 minutes, which would be around 10x the current refresh rate over the 24h, and I think I also should increase the rate around the expected expiry time if any of this PR would be going in.

Refreshing the token after a failed request would be the ideal solution in my mind as well, but I think implementing that might be out of my league without some major support.

Adding a config for this would fulfill my goal, and I think it would be something that I could implement.

@caseydavenport
Copy link
Member

Adding a config for this would fulfill my goal, and I think it would be something that I could implement.

Great, lets go with this as an approach for now and I will see about a more proactive refresh token in my spare time if I can manage it :)

For a config option, I suspect it would look something like this:

  • A new environment variable - e.g., CNI_TOKEN_REFRESH_DURATION
  • When unset, the existing behavior (1/4 of the token expiry) would be used.
  • When set, the variable would be respected instead.

Then we'd also want a corresponding change to the Installation API in the tigera/operator repository: https://github.com/tigera/operator/blob/master/api/v1/installation_types.go#L32-L211

This would be the end-user facing API that would configure the environment variable - we can talk about the specifics of what that looks like when we get there :)

@kashifest
Copy link
Contributor

@huutomerkki would you wish to continue to work on this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-pr-required Change is not yet documented release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CNI token refresh can take extra 6hrs to happen after expiration
5 participants