Fix crash-looping pods taking a long time to terminate/clean #14607
Conversation
Hi @andrew-delph. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: andrew-delph. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from ee655e1 to 6e61d09
/test istio-latest-no-mesh
This is definitely not related to the changes. /test istio-latest-no-mesh
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## main #14607 +/- ##
==========================================
+ Coverage 84.20% 86.06% +1.85%
==========================================
Files 213 197 -16
Lines 16633 14936 -1697
==========================================
- Hits 14006 12854 -1152
+ Misses 2280 1774 -506
+ Partials 347 308 -39
☔ View full report in Codecov by Sentry.
/retest
/retest
routingState := rev.GetRoutingState()
if routingState == v1.RoutingStateActive {
	return autoscalingv1alpha1.ReachabilityReachable
If we want to make the RoutingState => Reachability mapping essentially a passthrough, I think we should pull that out into a separate PR to see what implications it has, since it's not clear to me whether this breaks anything.
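For illustration, a minimal sketch of what such a passthrough could look like, assuming the RoutingState constants in knative.dev/serving/pkg/apis/serving/v1 and the Reachability constants in knative.dev/serving/pkg/apis/autoscaling/v1alpha1; the mapping for the non-active states is an assumption, not this PR's actual diff.

// Hypothetical passthrough from RoutingState to Reachability, sketched from
// the snippet above. Assumes the usual aliases:
//   v1                  "knative.dev/serving/pkg/apis/serving/v1"
//   autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
func reachabilityFor(rev *v1.Revision) autoscalingv1alpha1.ReachabilityType {
	switch rev.GetRoutingState() {
	case v1.RoutingStateActive:
		return autoscalingv1alpha1.ReachabilityReachable
	case v1.RoutingStateReserve:
		return autoscalingv1alpha1.ReachabilityUnreachable
	default:
		// Pending or unset: leave the decision to the autoscaler.
		return autoscalingv1alpha1.ReachabilityUnknown
	}
}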
Another thing to note: with the pod informer, if you remove the changes in this file, I think CrashLooping pods will be marked unhealthy faster, so we toggle reachability to false. Then I think the autoscaler changes are unnecessary.
If we want to make the RoutingState => Reachability mapping essentially a passthrough, I think we should pull that out into a separate PR to see what implications it has, since it's not clear to me whether this breaks anything.
Should I still do this?
/ok-to-test
Force-pushed from 4547514 to f4f6b55
Take a look at my comment here (#14656 (comment)); I think it's worth revisiting some of the assumptions that led to the creation of this PR.
Secondly, I notice this PR introduces a regression. If a service has all the traffic pinned to a specific revision, then when we roll out a new revision it's immediately scaled to zero and considered 'failed'. It should scale to 1 (or initial scale) and then scale down afterwards, since it's unreachable.
cond := pa.Status.GetCondition("Active")
pa.Status.MarkInactive(cond.Reason, cond.Message)
Probably simpler to have an if block that prevents setting a PA as inactive if it already is inactive.
Curious: do you know offhand what reason/messages get overwritten? I'm wondering if it makes sense to pull this into a separate PR if it helps with surfacing error messages.
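Something like the following, as a rough sketch of that guard; it assumes the same pa.Status helpers as the snippet above and the knative.dev/pkg/apis condition accessors, not the exact code in this PR.

// Sketch: skip MarkInactive when the PA is already inactive, so the existing
// reason/message are not overwritten on every reconcile. Assumes cond is the
// knative.dev/pkg/apis Condition returned by GetCondition above.
cond := pa.Status.GetCondition("Active")
if cond != nil && !cond.IsFalse() {
	pa.Status.MarkInactive(cond.Reason, cond.Message)
}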
Thanks for getting back to me!
I think it was being overwritten by the changes made in this PR. I'll have to test that again, though. For starters, I will create the PR as you suggest.
I have created this PR for computeActiveCondition. I think it covers all the same cases: #14940
Force-pushed from c60c5cd to 0d10f98
Force-pushed from 0839950 to 60aaaee
Force-pushed from 60aaaee to be3ec90
Making changes to the PA active status has become difficult. This change breaks down some of the logic so that it is more readable. No changes to test cases were made, as it doesn't actually change the logical cases.
Force-pushed from be3ec90 to ee6c0d0
Force-pushed from 47347c4 to 7033fe1
@andrew-delph: The following tests failed; say /retest to rerun all failed tests.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -0,0 +1,4 @@
# Deadstart test image
I added an image here to help with this
Going to convert the test to use this image.
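For reference, a "deadstart" image only needs to exit before it can ever become ready, so the pod crash-loops. A minimal sketch of what such a test image's main.go could look like; this is an assumption about its contents, not the actual image added in this PR.

// Hypothetical main.go for a deadstart test image: exit immediately with a
// non-zero status so the container never becomes ready and the pod crash-loops.
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Println("deadstart: exiting before ever becoming ready")
	os.Exit(1)
}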
// This test case creates a service which can never reach a ready state.
// The service is then updated with a healthy image, and it is verified that
// the healthy revision is ready and the unhealthy revision is scaled to zero.
func TestDeadStartToHealthy(t *testing.T) {
I added a similar test here: #14909
But I was expecting it to fail without any fixes, and it actually passed. So then I realized that we already scale down crashing revisions when they are unreachable.
So we might not need any other changes beyond what's already in main, unless you've uncovered a scenario where that isn't the case.
The case here is still happening for me: #14656 (comment)
The rev-2 will scale down, but the first one stays stuck in restarting.
1. Unreachable PA becomes inactive PA: computeActiveCondition() will MarkInactive in the "Queued" case when the PA is Unreachable.
2. Adding DeadStart e2e tests. TestDeadStartToHealthy: creates a service which is never able to transition to a ready state because it exits immediately. The failed revision will scale to zero once the service is updated with a healthy revision. TestDeadStartFromHealthy: updates a healthy service with an image that can never reach a ready state. The healthy revision remains Ready, and the DeadStart revision does not scale down until ProgressDeadline is reached.
Force-pushed from 2e26b6a to fa32dee
I'm currently getting an issue when the progress deadline is reached, but I'm looking into it.
Pods that instantly crash do not scale to 0 until the progress deadline is reached for the deployment.
Proposed Changes
If the PA is not ready and unreachable, it is marked inactive instead of queued.
This will scale the deployment to 0 even if metrics cannot be retrieved.
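As a rough sketch of that behavior (not this PR's actual computeActiveCondition), assuming the PodAutoscaler status helpers MarkActive/MarkActivating/MarkInactive and the Reachability field on the PA spec; the function name, reasons, and messages below are illustrative.

// Sketch: an unreachable PA that is not ready is marked Inactive instead of
// being left Queued, which lets the deployment scale to zero even when
// metrics cannot be retrieved. Reasons and messages here are placeholders.
func markActiveCondition(pa *autoscalingv1alpha1.PodAutoscaler, ready bool) {
	switch {
	case ready:
		pa.Status.MarkActive()
	case pa.Spec.Reachability == autoscalingv1alpha1.ReachabilityUnreachable:
		// Previously this case stayed "Queued"; the proposal marks it
		// Inactive so the revision can scale down right away.
		pa.Status.MarkInactive("Unreachable", "The revision is not routable.")
	default:
		pa.Status.MarkActivating("Queued", "Requests are buffered while resources are provisioned.")
	}
}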