Don't check every IPAM allocation on every sync. #9654

Open
caseydavenport wants to merge 6 commits into master from casey-ipam-gc-efficiency
Conversation

caseydavenport (Member) commented Dec 30, 2024

Description

Fixes #7841

The IPAM GC controller currently iterates over every IP address allocation in the cluster every time an IP address is assigned or released. This can result in excessive CPU usage on larger or more active clusters.

Instead, this PR limits the allocations we check to:

  • Those on nodes whose IPAM state has changed since the last sync.
  • Those on nodes that have had a pod deletion event since the last sync.

The idea is that this set should be much smaller than the full set of nodes, allowing us to scan the cluster incrementally instead of checking every allocation on every syncIPAM() call.

As a backstop, a periodic sync (every 5m by default) will still check every node. It's worth noting that this change on its own means we may rely on the periodic sync slightly more often to determine when IPAM allocations have leaked, since leaks can occur on nodes without the controller receiving the notifications it needs to notice them.
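A minimal Go sketch of the dirty-node idea plus the periodic backstop; every name here (ipamController, dirtyNodes, allocationsByNode, checkAllocation) is hypothetical, not the actual controller code:

```go
package ipamgc

// allocation is a stand-in for a single IPAM allocation record.
type allocation struct {
	ip, node string
}

type ipamController struct {
	dirtyNodes        map[string]struct{}     // nodes changed since the last sync
	allocationsByNode map[string][]allocation // in-memory IPAM cache
}

// markDirty is called from block-update and pod-deletion event handlers.
func (c *ipamController) markDirty(node string) {
	c.dirtyNodes[node] = struct{}{}
}

// syncIPAM checks only allocations on dirty nodes; the periodic (5m)
// backstop calls it with fullScan=true to check every node instead.
func (c *ipamController) syncIPAM(fullScan bool) {
	nodes := c.dirtyNodes
	if fullScan {
		nodes = make(map[string]struct{}, len(c.allocationsByNode))
		for n := range c.allocationsByNode {
			nodes[n] = struct{}{}
		}
	}
	for node := range nodes {
		for _, a := range c.allocationsByNode[node] {
			c.checkAllocation(a)
		}
		delete(c.dirtyNodes, node) // clean once checked
	}
}

// checkAllocation would hold the leak-detection logic.
func (c *ipamController) checkAllocation(a allocation) { /* ... */ }
```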

We might be able to enhance this further by tracking when individual allocations are dirty (and, for example, only clearing them after the grace period proves they are OK or leaked). But that's a larger change, and I hope this one does the trick.

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Improve CPU efficiency of IP address garbage collection controller

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

caseydavenport requested a review from a team as a code owner on December 30, 2024 19:35
marvin-tigera added this to the Calico v3.30.0 milestone on Dec 30, 2024
marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels on Dec 30, 2024
caseydavenport (Member, Author) commented Dec 30, 2024

This isn't perfect: it's possible, for example, that on larger clusters with lots of churn this will still end up iterating a LOT of IPAM allocations. We could probably make this even better if we tracked dirtiness at the allocation level rather than at the node level.

caseydavenport added the docs-not-required (Docs not required for this change) label on Dec 30, 2024
marvin-tigera removed the docs-pr-required (Change is not yet documented) label on Dec 30, 2024
caseydavenport (Member, Author) commented:

After writing this PR up, I've been thinking a bit more about this and I think I'd like to adjust the strategy here.

Here are the 4 main responsibilities of this IPAM controller:

  1. Periodic resync. This is meant to catch leaks, and so should scan the entire IPAM data set.
  2. Whenever one or more nodes are deleted, we clean up blocks that belonged to the deleted node (by scanning the whole IPAM data set).
  3. Whenever an IPAM block is updated, we scan the whole IPAM data set (indexed in a different way than the above) in order to generate IPAM metrics for Prometheus.
  4. When we see that a Node has more blocks than it needs, we clean them up.

For (1), we need to scan the whole IPAM data set. But we can do this relatively infrequently.

For (2), we don't necessarily need to scan the entire IPAM data set - we could just clean up the known affine blocks. Scanning the whole set has some advantages - we might catch a leak slightly sooner if it was in a borrowed IPAM block, but the full scan should catch that eventually anyway.
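As a rough illustration of that narrower option (a hypothetical sketch, not the controller's code), cleanup for deleted nodes could walk just the block-affinity map:

```go
package ipamgc

// releaseAffineBlocks releases only blocks whose affinity points at a
// node that no longer exists, instead of scanning the whole IPAM data
// set. All names and signatures here are illustrative.
func releaseAffineBlocks(blockAffinity map[string]string, liveNodes map[string]struct{}, release func(cidr string)) {
	for cidr, node := range blockAffinity {
		if _, ok := liveNodes[node]; !ok {
			// The affine node was deleted; this block is a candidate
			// for cleanup. Leaks in borrowed blocks are caught by the
			// periodic full scan instead.
			release(cidr)
		}
	}
}
```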

For (3), we do need to scan the entire IPAM data set, but we can probably stop doing it so frequently. Right now we do it on every batch of updates, but Prometheus is polling anyway, so there's really no value in doing this more frequently than the Prometheus poll timer. We could probably change this to every 15s (and only if something changed) without any real UX problem.

For (4), deleting unused IPAM blocks that are affine to existing nodes can be (and already is) performed without scanning the entire IPAM data set. We track empty blocks as updates come in, so we know exactly what the set of empty blocks is on demand. We can do an empty-block check every sync without much cause for concern, so no need to change this.
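For context on (4), here is a hypothetical sketch of that kind of incremental empty-block tracking (names are illustrative, not the actual implementation):

```go
package ipamgc

// blockTracker keeps the set of empty blocks up to date as block
// updates arrive, so "which blocks are empty?" needs no scan.
type blockTracker struct {
	emptyBlocks map[string]struct{} // keyed by block CIDR
}

// onBlockUpdate is called for each IPAM block create/update event.
func (t *blockTracker) onBlockUpdate(cidr string, numAllocated int) {
	if numAllocated == 0 {
		t.emptyBlocks[cidr] = struct{}{}
	} else {
		delete(t.emptyBlocks, cidr)
	}
}

// emptyBlockCIDRs returns the current set on demand.
func (t *blockTracker) emptyBlockCIDRs() []string {
	out := make([]string, 0, len(t.emptyBlocks))
	for cidr := range t.emptyBlocks {
		out = append(out, cidr)
	}
	return out
}
```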

I think I would propose:

  1. Scan the full IPAM data set to check for leaks on a periodic resync (i.e., every 5 minutes).
  2. Scan the full IPAM data set to check blocks for deleted nodes whenever one or more nodes are deleted (could be further optimized to check for particular nodes, but not necessary IMO).
  3. Only perform a metrics scan of the dataset if IPAM data has changed, and no more frequently than every 15s (sketched below).
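Item (3) could look something like this sketch (hypothetical names; the 15s interval is the value suggested above):

```go
package ipamgc

import "time"

// metricsGate runs the metrics scan only if IPAM data has changed,
// and no more often than once per interval.
type metricsGate struct {
	interval time.Duration // e.g. 15 * time.Second
	lastScan time.Time
	dirty    bool
}

// markChanged is called whenever an IPAM update arrives.
func (g *metricsGate) markChanged() { g.dirty = true }

// maybeScan invokes scan() only when something changed and the
// interval has elapsed; Prometheus polls on its own timer anyway,
// so scanning more often adds no visible value.
func (g *metricsGate) maybeScan(scan func()) {
	if !g.dirty || time.Since(g.lastScan) < g.interval {
		return
	}
	scan()
	g.dirty = false
	g.lastScan = time.Now()
}
```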

fasaxc (Member) commented Jan 3, 2025

@caseydavenport for (3) do we scan an in-memory cache, or do we refresh the whole list from the API server? Seems like we could certainly use a cache for that less critical work.

caseydavenport (Member, Author) commented:

@fasaxc we use an in-memory cache for all of the above! So at least we have that going 😅

caseydavenport (Member, Author) commented Jan 3, 2025

The problem with (3), I just realized, is that we rely on the full IPAM scan to determine which IPs are GC candidates / leaks, which then feeds into the metrics, so some metrics will end up being gated by the full IPAM scan interval anyway!

That's not really a problem per se (it's still accurate, and reflects our slower time to detection for IPAM GC), but it does break some assumptions in our tests.

I do think this is a good reason to keep the "only scan nodes that have changed" logic though. It maintains quick updates for our metrics, while also reducing the amount of work that happens on each event.

caseydavenport force-pushed the casey-ipam-gc-efficiency branch from 278f85f to 1036335 on January 3, 2025 22:30
caseydavenport (Member, Author) commented:

My most recent changes keep the "immediate" sync behavior, where we trigger a sync for each block update, each node deletion, and each pod deletion (in order to maintain quick metrics updates). Beyond that, I've gone back to the original approach:

  • Only iterate over nodes that have changed since the last sync.
  • When we receive an update, wait a little longer to see whether more updates can be added to the batch, so that we don't call syncIPAM as often (a sketch of this batching follows).
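A minimal sketch of that batching behavior (hypothetical, not the PR's actual code):

```go
package ipamgc

import "time"

// batchUpdates waits briefly after the first update so that any
// closely-following updates are folded into the same syncIPAM call.
func batchUpdates(updates <-chan struct{}, batchWait time.Duration, syncIPAM func()) {
	for range updates {
		// Got one update; absorb any more that arrive within batchWait.
		timer := time.NewTimer(batchWait)
	drain:
		for {
			select {
			case _, ok := <-updates:
				if !ok {
					break drain // channel closed; flush what we have
				}
				// Another update joined the batch; keep waiting.
			case <-timer.C:
				break drain // quiet for batchWait; sync now
			}
		}
		syncIPAM()
	}
}
```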

Linked issue: calico-kube-controller 3.26 high cpu usage frequent ipam syncs