We updated Calico in one of our larger clusters from 3.24.5 to 3.26.1, and since then kube-controllers has noticeably higher CPU usage and seems to perform many IPAM syncs.
In a cluster of about 200 nodes and 10,000 pods, 3.24 uses about 200m CPU while 3.26.1 uses a full CPU.
The underlying cause appears to be a change, probably introduced in 3.24.6, which made the allocationIsValid function actually use the cache and reduced the cache sync time from a few minutes to a couple of seconds.
So essentially a problem was fixed, but perhaps the high CPU usage is itself a fixable issue, or the sync frequency is too high.
A pprof profile showed that it spends a lot of time in defaultWorkloadEndpointConverter, called from checkAllocations.
// TODO: We're effectively iterating every allocation in the cluster on every execution. Can we optimize? Or at least rate-limit?
We previously had unintentional rate limiting, but now we're iterating everything for every sync. We do some batching, but it's clearly not enough.
I think we want to introduce "dirty" tracking here, so that we can reduce this loop to only checking the allocations that we know have changed since the last sync.
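A minimal sketch of what such dirty tracking could look like, assuming a hypothetical dirty set keyed by IPAM handle that is populated from informer event handlers. None of these types or names come from the actual controller code; this only illustrates restricting the per-sync scan to changed handles:

```go
// Hypothetical sketch of dirty tracking for the IPAM sync loop.
package ipam

import "sync"

type allocation struct {
	handle string
	// ... other fields elided
}

type ipamController struct {
	mu          sync.Mutex
	allocations map[string]*allocation // all known allocations, keyed by handle
	dirty       map[string]struct{}    // handles touched since the last sync
}

// markDirty is called from event handlers (pod add/update/delete, block
// updates, etc.) so that the next sync only revisits affected handles.
func (c *ipamController) markDirty(handle string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dirty[handle] = struct{}{}
}

// syncDirty checks only the allocations that changed since the last sync,
// instead of iterating every allocation in the cluster.
func (c *ipamController) syncDirty(check func(*allocation) bool) {
	// Swap out the dirty set under the lock so event handlers stay cheap
	// while the (potentially long) validity checks run outside it.
	c.mu.Lock()
	pending := c.dirty
	c.dirty = map[string]struct{}{}
	c.mu.Unlock()

	for handle := range pending {
		c.mu.Lock()
		a, ok := c.allocations[handle]
		c.mu.Unlock()
		if !ok {
			continue // allocation released since it was marked dirty
		}
		if !check(a) {
			// Invalid allocation: queue it for release/GC, as the real
			// controller would.
		}
	}
}
```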
Current Behavior
The difference from 3.24 seems to be that it uses the pod cache, which was changed in 3.24.6 (#7503).
3.24.5 instead logs the queries for each of the 9000 ipamhandles, which takes several minutes.
So the cache is now effective, which reduces the load on the API server.
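A rough sketch of the difference, with hypothetical function names (the real controller has its own pod cache and data path; only the shape of the lookup is meant to be illustrative): the old behaviour paid one API server round trip per handle, the new one answers from a local cache, trading network wait time for CPU time in the sync loop.

```go
// Hypothetical illustration of why the pod cache change matters.
package ipam

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Pre-3.24.6 pattern (as described above): one API server query per IPAM
// handle, so thousands of handles take several minutes per sync.
func podExistsViaAPI(ctx context.Context, cs kubernetes.Interface, ns, name string) (bool, error) {
	_, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

// Post-3.24.6 pattern: answer from a locally maintained cache, so the whole
// scan finishes in seconds but burns CPU on every sync instead of waiting
// on the network.
func podExistsViaCache(cache map[string]*v1.Pod, ns, name string) bool {
	_, ok := cache[fmt.Sprintf("%s/%s", ns, name)]
	return ok
}
```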