feat: Make cluster cache target the watch cache instead of etcd #617
Conversation
Signed-off-by: Antoni Zawodny <[email protected]>
@tosi3k just reading around about the watch cache, it looks like maybe the watch cache doesn't support pagination? kubernetes/kubernetes#102672 If that's still the case, are we going to end up falling back to etcd anyway since we set the page size? And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?
Watch cache not having support for pagination (i.e. disregarding the limit set in the list options) is indeed still the case.
Not really - with resourceVersion set to "0" the apiserver ignores the limit and serves the whole list from its watch cache in a single response instead of falling back to etcd.
I'd rather call it trading off some amount of memory for a large amount of CPU, with the latter being the scarcer resource at the scale of large Kubernetes clusters in general :). In conjunction with #616, that would be a short memory usage spike in the kube-apiserver only when the cluster cache is (re)initialized. In general, this will eventually be addressed by kubernetes/enhancements#3142 - that enhancement's enablement was unfortunately reverted last-minute in K8s 1.31 but will be available in 1.32.
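For readers following along, here is a minimal client-go sketch of the two list modes being discussed (illustrative only, not code from this PR; the kubeconfig loading and the 500-item page size are assumptions):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// ResourceVersion="" asks for a consistent (etcd-backed) read; the Limit is
	// honored, so the client pages through the result with Continue tokens.
	fromEtcd, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		ResourceVersion: "",
		Limit:           500,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("etcd-backed page, continue token:", fromEtcd.Continue)

	// ResourceVersion="0" lets the apiserver answer from its watch cache: the
	// Limit is ignored and the whole (possibly slightly stale) list comes back
	// in a single response.
	fromCache, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		ResourceVersion: "0",
		Limit:           500, // effectively a no-op for watch-cache reads
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("watch-cache list, items:", len(fromCache.Items))
}
```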
Should we explicitly remove the page size (limit) from the list options, then?
Makes sense, I'll just keep an eye on CPU use while testing, and we'll want to pay attention during Argo CD RCs for any complaints.
Good question - we don't want to drop it, since there's a possibility someone disabled the watch cache explicitly (I think there are some K8s distros that do so) by setting the apiserver's --watch-cache flag to false; in that case the list would be served from etcd, where the limit still applies.
Cool! Okay, next step is for me to do some testing and gather data. I can't make any promises on how soon that'll be, but I want to get this change into 2.13.
// We set the resource version to 0 below to proactively prevent the
// list API call from reaching etcd and make the server fetch the data
// from the watch cache instead.
opts.ResourceVersion = "0"
This is unclear to me.
When RV=0 is set, the kube-apiserver ignores the Limit option, so the whole pagination won't be effective anyway.
When is this function called? Is it only on initialization, or is it also used later?
With #616 submitted together with this PR, this would be called only on initialization, on "resource version too old", or in pathological scenarios like failing to extract the RV from a watch event, which would require relisting anyway.
Without #616, the list would still be executed every ~10 min by default (same as now).
IMHO it would make more sense to have both changes in a single PR, but I was asked to split them to facilitate some testing that @crenshaw-dev wants to do internally.
I'm assuming #616 will merge. With that assumption, I'm not even convinced we want this one to happen.
I'm not even convinced we want this one to happen.
Why not? Doesn't using the watch cache have its own benefits independent of eliminating the 10-minute re-list operations?
It does, but it comes with the cost of reads being potentially stale. I don't know Argo CD well enough to tell what consequences that may have.
At the same time, new features are landing in k8s 1.31 and more will land in 1.32 and later, where this will no longer matter (ConsistentListFromCache, WatchList/streaming list).
So let's start with the other PR for now and revisit this one in case of complaints (we know that the other PR on its own would effectively address the issues we observed).
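For reference, a hedged sketch of what the streaming list (WatchList) path looks like from a client, assuming an apiserver with the WatchList feature enabled; the metav1.ListOptions fields are the point here, everything else is illustrative:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// With WatchList enabled, the initial state is streamed as ADDED events over
	// a single watch instead of being fetched via a separate paginated LIST.
	sendInitial := true
	w, err := client.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{
		SendInitialEvents:    &sendInitial,
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		AllowWatchBookmarks:  true,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		// A bookmark annotated with "k8s.io/initial-events-end" separates the
		// initial state from the subsequent live updates.
		fmt.Println(ev.Type)
	}
}
```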
reads being potentially stale
What's the actual negative impact of this? If after doing a watch cache list we immediately handle all events that have happened since the returned resourceVersion, what risk have we introduced that didn't already exist?
let's start with the other PR for now
The other PR reduces ongoing etcd load and the likelihood of hitting etcd compaction errors during long list operations, but it doesn't eliminate the initial load or the possibility of compaction errors.
The other PR also eliminates the 10min re-list completely. My understanding is that the re-list helps ensure data integrity (for example, when events get missed) and is a pattern used in the Kubernetes informer code (so a general best practice). I'm up for an experiment, but I'm a little wary of completely eliminating the re-list or deviating too far from a pattern used by the k8s code we're trying to emulate.
What's the actual negative impact of this? If after doing a watch cache list we immediately handle all events that have happened since the returned resourceVersion, what risk have we introduced that didn't already exist?
In the majority of cases, not a big risk, but if the cache is visibly lagging (we've had an outage where we observed it lagging by ~1h or so), then it might be a bigger problem.
The other PR also eliminates the 10min re-list completely. My understanding is that the re-list helps ensure data integrity (for example, when events get missed) and is a pattern used in the Kubernetes informer code (so a general best practice). I'm up for an experiment, but I'm a little wary of completely eliminating the re-list or deviating too far from a pattern used by the k8s code we're trying to emulate.
Nope - core k8s controllers generally don't relist at all.
You might be mixing that up with the "resync" concept in the informer framework. A resync is not a relist - a resync calls the handler for all objects, but it iterates over them from the in-memory cache.
Historically (5+ years ago) we indeed relisted periodically due to bugs that were still happening, but our machinery has stabilized since then and we no longer do that for core components.
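To make the resync-vs-relist distinction concrete, a small shared-informer sketch (illustrative, not code from this repo; the 10-minute period mirrors the default mentioned above):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The 10-minute resync below does NOT hit the apiserver: every 10 minutes
	// UpdateFunc is replayed for each object already in the local store. A
	// relist (a LIST against the apiserver) happens only on startup or when the
	// watch breaks irrecoverably (e.g. "resource version too old").
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			fmt.Println("seen (possibly via resync):", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep the informer running
}
```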
we observed lagging by ~1h or so
Damn okay, that makes sense. I assume we'd expect to have to process a ton of events in that case.
But #616 wouldn't really solve that, it would just make it less likely by reducing the number of list operations, if I understand correctly.
core k8s controllers generally don't relist at all. You might be mixing that with the "resync" concept in the informer framework
I absolutely am confusing those two things, good catch.
In that case I'm much less worried about #616, and I do see why you might find it preferable to just do #616 and skip the watch cache change until ~1.32. I would like to test the changes independently and see where we land.
I'm not sure when I'll have time to test, and I'm very uncomfortable merging any performance changes without testing. So if anyone else has time / a good test environment, I'd appreciate the help.
But #616 wouldn't really solve that, it would just make it less likely by reducing the number of list operations, if I understand correctly.
#616 indeed isn't a fix for that - what it does is prevent us from going back in time across component restarts.
Lagging is one thing; going back in time is a more serious thing to face.
[Effectively an instance of https://github.com/kubernetes/kubernetes/issues/59848]
From a scalability POV you will certainly see an improvement with this change - but I'd prefer to stay on the safe side first and potentially consider this later only if really needed.
// We set the resource version to 0 below to proactively prevent the
// list API call from reaching etcd and make the server fetch the data
// from the watch cache instead.
opts.ResourceVersion = "0"
I don't think it's ever a good idea to list with resourceVersion=0. Even though this allows the API server to use its object cache, it causes a worse problem: it does not allow limit/pagination/chunking to be used at all, so all objects of that group/kind will be returned in the response, resulting in OOM conditions in the client dealing with the huge response.
We've been wrestling with this problem in workflows over the years, and the solution we landed on is to never allow resourceVersion=0 requests, even if it means hitting etcd (resourceVersion="").
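For comparison, the pattern described here (always paginating with resourceVersion="") can be expressed with client-go's pager helper; a hedged sketch, not the actual workflows code:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/pager"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// ResourceVersion is left empty so the apiserver serves a consistent,
	// etcd-backed read and honors the Limit; the pager follows Continue tokens,
	// so the client only ever holds one page (500 objects here) at a time.
	p := pager.New(func(ctx context.Context, opts metav1.ListOptions) (runtime.Object, error) {
		return client.CoreV1().Pods("").List(ctx, opts)
	})
	p.PageSize = 500

	count := 0
	err = p.EachListItem(context.TODO(), metav1.ListOptions{ResourceVersion: ""}, func(obj runtime.Object) error {
		if _, ok := obj.(*corev1.Pod); ok {
			count++
		}
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods listed page by page:", count)
}
```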
And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?
I'd rather call it trading off some amount of memory for a large amount of CPU, with the latter being the scarcer resource at the scale of large Kubernetes clusters in general :).
The tradeoff is not really about an increase in kube-apiserver CPU/memory. This will cause the application-controller to have to load all objects of every group/kind into memory when dealing with the non-paginated responses.
The tradeoff is not really about an increase in kube-apiserver CPU/memory. This will cause the application-controller to have to load all objects of every group/kind into memory when dealing with the non-paginated responses.
The tradeoff I mentioned was from the server's POV, not the client's, but I agree there's a risk of the client OOMing if the number of resources it tracks is large and the client isn't assigned enough memory. Transforming informers could alleviate the problem you describe, as they make it possible to reduce the memory footprint of the objects stored in the client's cache, but it would require considerably more work to validate them and migrate to them here, if feasible.
The problem statement is in argoproj/argo-cd#18838.
This PR addresses the problem partially - starting with delegating the K8s List calls to the watch cache, offloading etcd.
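As an illustration of the transforming informers mentioned above, a hedged sketch (not part of this PR) that strips managedFields, often a large share of each object's memory footprint, before objects enter the informer's cache:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The transform runs on every object before it is stored in the informer's
	// cache, so the dropped fields never occupy client memory.
	stripManagedFields := func(obj interface{}) (interface{}, error) {
		if accessor, ok := obj.(metav1.Object); ok {
			accessor.SetManagedFields(nil)
		}
		return obj, nil
	}

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	// SetTransform must be called before the informer is started.
	if err := podInformer.SetTransform(stripManagedFields); err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	// ... hand the informer's lister/store to the controller as usual, then
	// close(stop) on shutdown.
	close(stop)
}
```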