Implement partial resync after IP set update failures #9674

Draft
wants to merge 1 commit into
base: master

fasaxc
Member

@fasaxc fasaxc commented Jan 3, 2025

Description

Any failure to update an IP set must either be spurious (EBUSY) or caused by a problem with that IP set. Historically, we've done an expensive full resync if we hit either type of error.

This PR changes that:

  • After each of the first 3 failures, it retries without any resync. This should deal more cheaply with EBUSY-type errors.
  • After that, it does a partial resync of the IP sets that we were attempting to update. This covers all cases I can think of that can cause an IP set update to fail.
  • After the 6th retry it does a full resync as before, just in case.
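
As a rough sketch of the escalation described in the list above (not the code in this PR): the names `ResyncMode`, `resyncModeForRetry`, `maxPlainRetries` and `maxPartialRetries` are hypothetical, and the exact boundaries (retries 1 to 3 plain, 4 to 6 partial, 7 onwards full) are one reading of the thresholds given here.

```go
package main

import "fmt"

// ResyncMode is a hypothetical name for the recovery action taken
// before an IP set update is retried.
type ResyncMode int

const (
	NoResync      ResyncMode = iota // just retry; covers spurious errors such as EBUSY
	PartialResync                   // resync only the IP sets being updated
	FullResync                      // resync every IP set, as before this change
)

func (m ResyncMode) String() string {
	switch m {
	case NoResync:
		return "no resync"
	case PartialResync:
		return "partial resync"
	default:
		return "full resync"
	}
}

// Assumed thresholds, taken from the description above.
const (
	maxPlainRetries   = 3 // retries 1..3: no resync
	maxPartialRetries = 6 // retries 4..6: partial resync
)

// resyncModeForRetry maps a 1-based retry number to the recovery
// action to perform before that retry.
func resyncModeForRetry(retry int) ResyncMode {
	switch {
	case retry <= maxPlainRetries:
		return NoResync
	case retry <= maxPartialRetries:
		return PartialResync
	default:
		return FullResync
	}
}

func main() {
	for retry := 1; retry <= 8; retry++ {
		fmt.Printf("retry %d: %s\n", retry, resyncModeForRetry(retry))
	}
}
```

Keeping the policy in one pure function would make the thresholds easy to unit test independently of the dataplane.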

Related issues/PRs

Might help with CORE-10804

Todos

  • Tests
  • Documentation
  • Release note

Release Note

After failures to program IP sets, the dataplane now does a partial resync of only the affected IP sets. This should reduce the overhead of dealing with a transient failure.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@marvin-tigera added this to the Calico v3.30.0 milestone Jan 3, 2025
@marvin-tigera added the release-note-required and docs-pr-required labels Jan 3, 2025
@fasaxc added the docs-not-required label and removed the docs-pr-required label Jan 3, 2025
Labels: docs-not-required, release-note-required