-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure ContainerHealthy condition is set back to True #15503
base: main
Are you sure you want to change the base?
Ensure ContainerHealthy condition is set back to True #15503
Conversation
Hi @SaschaSchwarze0. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
cc @dprotaso for review. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15503 +/- ##
==========================================
- Coverage 83.53% 83.48% -0.06%
==========================================
Files 219 219
Lines 17427 17456 +29
==========================================
+ Hits 14558 14573 +15
- Misses 2498 2507 +9
- Partials 371 376 +5 ☔ View full report in Codecov by Sentry. |
94af51c
to
1e89b8d
Compare
@dprotaso gentle ping, any objections on this one? It seems ok to me. |
@@ -117,6 +117,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision) | |||
} | |||
} | |||
|
|||
if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SaschaSchwarze0 hi, should we relax the condition?
We set the revision as containerUnhealthy when a "permanent" like failure is detected:
// If a container keeps crashing (no active pods in the deployment although we want some)
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
Resetting this should happen when we have at least one pod up (deployment.Status.AvailableReplicas>0) no ?
I am thinking of a bursty load scenario where not all pods become ready (eg. some new stay in pending state and some old recover from the old issue and become ready). However, then we keep the revision in false ready status if we don't reach the desired number of pods as set in the deployment replicas field.
Could this be true if we have a bursty load where deployment is set to replicas> currentScale>= minsCale >= 1 directly due to some autoscaling decision? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @skonto, sorry for the late response. I was out for some time.
Can you please help me and clarify what you're after. The code change I am making is to set container healthy to true. You now come up with a discussion and a related condition about when container healthy should be set to false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @SaschaSchwarze0 I am saying that a revision is serving traffic even when pods are not all ready right? So resetting this to true only when *deployment.Spec.Replicas == deployment.Status.ReadyReplicas
seems a bit too strict no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Correct, a revision may be serving traffic even if it does have unhealthy containers in some of the replicas.
Whether that means that it is okay to set ContainerHealthy to True if already one of the replicas is running fine, I do not know.
My proposed condition basically was when I think that ContainerHealthy should definitely be set to true (because all replicas are fully ready).
How strict or relaxed we can be, I do not know. At some point I tried to find a spec on the exact meaning of the conditions of the different resources. But I had not found something.
What your condition can btw lead to is the following:
- You have a revision with two replicas, one is not healthy, the other one is.
- The code in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L86 checks the first pod only. If that one is not healthy, it will set ContainerHealthy=False.
- I think, the code place here runs later. That one will notice that there are two replicas and one is ready, and set ContainerHealthy to True.
Would not happen with my proposed code because if one pod is not ready, there are not all replicas of the deployment ready. Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code because it leads to different results when one has different pod statuses depending on the order in which the pods are returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in general. Btw I am also trying to figure out the semantics. Maybe we should have at least one ready to reset to true (assuming we list all pods and we have containerHealthy=false). 🤔
Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code because it leads to different results when one has different pod statuses depending on the order in which the pods are returned.
This should happen only when all pods fail with probably the same reason, that is my understanding for the intention thus the check for available replicas=0.
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
@dprotaso any background info to add for the limitations here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics here are terrible, because distributed systems. Sorry!
What users want:
- When my application is able to handle requests,
Ready
should be true - If
Ready
is false, there's a clear condition as to why - If a container crashes on startup, users would like to be about to see the exit message
The real world laughs at this:
- Containers may pass an initial health check and then have a subsequent failure in application code (for example, if a backend dependency becomes unavailable)
- Kubernetes mostly assumes that nodes are interchangeable, but what if pods on one node have different behavior than pods on another node (e.g. network partition)
- There are a bunch of pre-container-executing failures that can happen, like failure to schedule, pull images, mount volumes, etc. I'm not sure that's relevant here, but worth considering in the big picture
The end result is that "does a container become ready" is not really deterministic (and we don't have a good diagnostic on whether failures are permanent or transient), so all we have here are heuristics.
My gut would be relax the check to closer to what Stavros is suggesting: container ready can become true if there is at least one ready container. While it's trivial to build counter-example containers for this rule, I think it's probably close enough for most users.
Sorry if all that is unsatisfying...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @skonto is suggesting:
if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas { | |
if deployment.Status.ReadyReplicas >= 1 { |
Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good discussion.
I think from the comments I conclude the following:
- There is currently no definition on the ContainerHealthy condition of a Revision.
- We acknowledge that there may be temporary glitches that we do not necessary want ContainerHealthy to turn False.
To address this, based on what I personally think and what I read as opinions from others, would we agree on the following?
- If there is no Pod, we do not change the current status.
- If any Pod is ready, we set it to True.
- If no Pod is ready and if there is at least one non-terminating Pod and if all non-terminating Pods have a container with LastTerminationState.Terminated != nil, then we set it to False.
- If neither (2) nor (3) apply, we also do not change the current status.
The code for (2) would be simple because it can be evaluated simply by looking at the deployment status:
if deployment.Status.AvailableReplicas > 0 {
set ContainerHealthy=true
}
The code for (3) is more complex as it requires a look at the Pods. I guess we would need to make https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L84-L117 more complex (= not just look at the first pod, though it can break out immediatly if a pod is ready). (2) can probably also be solved here.
This should be safe for scale up from 0 to 1. In that case, we will have one Pod, none are ready, but none have containers with restarts. We would not change the current status until this one Pod crashes.
ScaleDown from 1 to 0 also works. The one Pod may have a restart that can be long ago (OOM or similar), it will not be Ready while it terminates, but because it is terminating, that is ignored.
Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, more discussion in https://cloud-native.slack.com/archives/C04LMU0AX60/p1733333864902479
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @skonto is suggesting:
Is that right?
Yes. We tag a revision with containerHealthy=false because all pods are down for a specific condition. After that the only thing that can happen is this condition to be reverted to unblock the revision. If the condition is reverted for one pod it implies that it has been reverted for all pods independently of when those pods will become ready again. Afaik that was the initial design. So we just need a signal for the blocking condition not holding any more. Serving should handle traffic as usual depending on the number of pods that are ready at any given time (my assumption).
There can be the case that some pods are not scheduled yet and we do cover that. If there are no pods (some failure has not created the pods) we do cover that I think since we propagate the deployment status a few lines above (replicaset failures etc).
1e89b8d
to
f2356b7
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: SaschaSchwarze0 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
e0cc076
to
fb7e958
Compare
Co-authored-by: Matthias Diester <[email protected]>
fb7e958
to
3575484
Compare
I updated the PR based on the discussion in https://cloud-native.slack.com/archives/C04LMU0AX60/p1733333864902479. |
@@ -94,29 +93,47 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision) | |||
} | |||
} | |||
|
|||
// if a Pod is terminating already, we do not care it being not ready because this is expected to be the case | |||
if pod.DeletionTimestamp != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to report a potential error though during the exit no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually not anymore sure if this is needed. What I had in mind was the following:
- a Pod is created and runs fine
- at some point the user-container goes out of memory and is restarted, from now on it has
LastTerminationState.Terminated
set. - And now it runs fine, an hour, a day maybe.
- And then it gets scaled down.
- The loop runs.
The Pod will not be ready so it will check the container status. Idea was to prevent reporting a failure that happened long time ago.
Though, the only relevant case would be something like a scaleDown from 2 to 1 of the ready Pod that had this OOM long time ago which the other Pod must be down so that the deployment has zero available replicas. Is very unlikely to happen and definitely not a homogenous state of the workload across pods.
continue | ||
} | ||
|
||
// if a Pod is ready, then do not check it. The fact it is ready may not yet be reflected on the Deployment status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous code covered that already as if a pod is ready if t := status.LastTerminationState.Terminated;
would never be true for a container, no?
Not sure how often this inconsistency will happen (deployment status vs pod status) tbh to be sure about the benefits of the optimization but does not hurt either I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scenario I have in mind is a following:
- a Pod is created and runs fine
- at some point the user-container goes out of memory and is restarted, from now on it has
LastTerminationState.Terminated
set. - The container finshes restarting and becomes ready. Though, this may not yet be reflected in the deployment status due to the asynchronous nature of the controller.
- now this loop runs
The Pod will be Ready but if we never check this we would still iterate the container statuses. And there status.LastTerminationState.Terminated
is not nil
.
I could probably still get rid of it, if I change the loop to:
// Iterate the container status and report the first failure
for _, status := range pod.Status.ContainerStatuses {
if status.Name == resources.QueueContainerName || status.Ready {
continue
}
...
}
Note the added || status.Ready
at the beginning. This might actually be nicer because we then generally do not report a potentially days-ago failure on a ready container (which would matter if on a single pod multiple containers have restarts). Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thinking out loud.
If a pod is ready all containers are running. Let's say we are keep crashing/restarting at some point. The question is then if we ever are distinguishing between old and recent failures.
I think if we wanted to detect this reliably without reporting old failures, we would need a pod watcher, check this that explains it. This way we would know if we have old issues or that if some container is keep restarting but requires to track restart counts over sometime (keep state).
An implicit way to detect the same I suspect is to check if we are in waiting state when status.LastTerminationState.Terminated != nil
, check status.State.Waiting
.
Before a container is restarted there is a backoff time and container is in waiting stay with reason = "CrashLoopBackOff". If this holds then we are in a bad state.
Checking if a pod is ready (at the top) is also a way to avoid old terminations since if we are not ready something is wrong probably
and the LastTerminationState is recent (meaning it is related). So thinking again this your approach seems fine to me.
The question for any approach is if reconciliation will hit the time period where pods are ready only (imagine a crashloop scenario which lasts for some time). We have a test for this: TestContainerExitingMsg
but we don't test the scenario where the container is not exiting permanently but randomly. 🤔
Btw there is a KEP to make crashloopbackoff tunable kubernetes/kubernetes#57291 (comment) which may affect stuff long term.
|
||
podsLoop: | ||
for _, pod := range pods.Items { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that we will not be triggered enough times to go through a reconciliation phase where actually all pods are in the same status so choosing an arbitrary pod will not hurt us? And that would signal a more permanent issue?
Afaik a revision reconciliation will be triggered due to the updates of any of the following: "Certificate, PA, K8s Deployment". I expect multiple reconciliations that will capture eventually a permanent, blocking issue. 🤔 I guess we are shifting the semantics to capture less permanent errors (one failing pod will be enough to report anything)? Do we still want to go for:
// Arbitrarily grab the very first pod, as they all should be crashing
and make sure all crash (and only then report an issue)? 🤔 Or is it that with this PR we capture the error a bit earlier, due to some container exiting, and eventually they will be all crashing anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also worried (maybe not an issue?) about relying on the fact that on one hand we use an informer based lister for the deployments and direct api server call for pods using the go client, see for example. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scenario, I would probably be fine to not try to be as accurate as possible because it is just about the revision status.
What I tried is to properly handle the eventual consistency that Kubernetes has due to its persistence and asynchronous nature of informers and controllers. Reducing those considerations to the minimum to prevent crashes (the old code knew from the deployment spec that there should be more than zero replicas but still ensured that there is an item in the pod item list before accessing the first one) would also be okay for me here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also worried (maybe not an issue?) about relying on the fact that on one hand we use an informer based lister for the deployments and direct api server call for pods using the go client, kubernetes/client-go#1383 (comment). 🤔
Would be independent from my changes, but still interesting as I had not considered that an issue. I am not too firm into how Knative sets up the Kubernetes client. In case of controller-runtime, any object access will immediatly be cached unless you configured controller-runtime to not cache that object type.
If that pod list call really goes to Kubernetes API and if we go back to just looking at the first item, we could at least run the list call with limit=1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call is:
pods, err := c.kubeclient.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: metav1.FormatLabelSelector(deployment.Spec.Selector)})
This is not informer based afaik.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, Knative projects do **NOT**
use controller-runtime lib to create client or use it access the Kubernetes object.
Knative uses:
- Use
kubeclient
to access the kube core or app objects, if we know the info of groupkindversion. - Use
dynamicClient
to access the kube objects, if we do not know the info of groupkindversion, but need to figure out dynamically in the code. For example, https://github.com/knative/serving/blob/main/pkg/reconciler/autoscaling/kpa/scaler.go#L318 - Use the generated client, to access the CR, customized/created for knative. For example, to get the route, we use https://github.com/knative/serving/blob/main/pkg/reconciler/service/service.go#L248
@SaschaSchwarze0 Are you able to address the comments and wrap up the PR? We are looking forward with this fix. |
@SaschaSchwarze0 -- would you like some help getting this over the finish line? I can PR or commit to your branch if you won't have time to complete it; this is apparently also causing pain for @houshengbo and @yuzisun. |
I really hope this PR can be approved, since it has been causing issues to our env on weekly basis, during the past months. |
/lgtm |
@dprotaso gentle ping, @houshengbo the goal is to get this in for 1.17 that is coming soon but we can backport as well. |
I am now OK with the PR BTW. |
Hey FYI - I'm getting to this PR just trying to unblock dependency updates for the release first. |
@evankanderson @skonto @dprotaso I was reading most of the things that were commented, but did not have the time to make a conclusion out of them so far. Now, reading through all of them, I still think there needs to be a decision on the direction we want to go. I think there are two options (and the current PR is a mixture). (1) Mark ContainerHealthy true when all pods are healthy. This means one would need to loop through all pods like the PR currently does. And one should set ContainerHealthy to true when spec.replicas > 0 && spec.replicas == .status.readyReplicas (I think I had this at some point, but we changed that). The disadvantage of this is that one needs to retrieve all pods because there is no caching. Retrieving all pods has always been done so it would not be a change. (2) Mark ContainerHealthy true when one pod is healthy. This means that the condition to set this can stay at .status.readyReplicas > 0 the way it currently is. We can then recert the logic changes to check all pods but instead only look at 1. We can optimize the List call to run with My vote would be (2). ContainerHealthy would then most likely only be set to False for those "permanent" errors and not for things like an OOM that happens once per day assuming the KSvc's scale is >1. |
Tentatively I would vote for option (2) given current semantics. Here are my thoughts: As a side note, independently of this PR, one thing that we don't consider is the number of min replicas ready for characterizing a revision as healthy or not. For example in a scenario where minScale=2>1 and one pod keeps exiting what I observed is that the revision remains healthy and accepts traffic. However, initially, when we create the revision we will never become ready if we don't achieve the desired minScale. Is that ok? I don't think we have a strictly defined health condition for the revision. In any case the scope of the PR is the bug of never setting back the condition to True and I think we should only consider the fix for the bug here given current Knative semantics. We have:
When we decide to reset the In the future though we might want to allow flexibility for what is considered healthy for better UX and operations. Service is "Healthy" if:
The first two points might be useful for the admin offering Knative as a platfrom. Usually, users of the service only care about these SLOs in the last point. Knative could produce the right signal in the revision status to serve the personas above, especially the admin. External tooling could also detect status and report this info but can't handle/manage the traffic. |
Fixes #15487
Proposed Changes
This changes the Revision reconciler to contain a code path that changes the ContainerHealthy condition from False to True as the old code path is not active anymore (see linked issue). The criteria that has been chosen is whether the deployment has replicas and whether all of them are ready.
Release Note