Fix crash-looping pods taking a long time to terminate/clean #14607
Conversation
Hi @andrew-delph. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: andrew-delph. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from ee655e1 to 6e61d09
/test istio-latest-no-mesh
This is definitely not related to the changes. /test istio-latest-no-mesh
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## main #14607 +/- ##
==========================================
+ Coverage 84.20% 86.06% +1.85%
==========================================
Files 213 197 -16
Lines 16633 14936 -1697
==========================================
- Hits 14006 12854 -1152
+ Misses 2280 1774 -506
+ Partials 347 308 -39
☔ View full report in Codecov by Sentry.
/retest
/retest
routingState := rev.GetRoutingState()
if routingState == v1.RoutingStateActive {
	return autoscalingv1alpha1.ReachabilityReachable
If we want to make the RoutingState => Reachability mapping essentially a passthrough, I think we should pull that out into a separate PR to see what implications it has, since it's not clear to me whether this breaks anything.
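For illustration, a minimal sketch of what such a passthrough could look like, assuming the RoutingState constants in knative.dev/serving/pkg/apis/serving/v1 and the Reachability constants in knative.dev/serving/pkg/apis/autoscaling/v1alpha1; the mapping for the non-active states is an assumption, not this PR's actual diff.

// Hypothetical passthrough from RoutingState to Reachability, sketched from
// the snippet above. Assumes the usual aliases:
//   v1                  "knative.dev/serving/pkg/apis/serving/v1"
//   autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
func reachabilityFor(rev *v1.Revision) autoscalingv1alpha1.ReachabilityType {
	switch rev.GetRoutingState() {
	case v1.RoutingStateActive:
		return autoscalingv1alpha1.ReachabilityReachable
	case v1.RoutingStateReserve:
		return autoscalingv1alpha1.ReachabilityUnreachable
	default:
		// Pending or unset: leave the decision to the autoscaler.
		return autoscalingv1alpha1.ReachabilityUnknown
	}
}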
Another thing to note: with the pod informer, if you remove the changes in this file, I think CrashLooping pods will be marked unhealthy faster, so we toggle reachability to false. Then I think the autoscaler changes are unnecessary.
If we want to make the RoutingState => Reachability mapping essentially a passthrough, I think we should pull that out into a separate PR to see what implications it has, since it's not clear to me whether this breaks anything.
Should I still do this?
/ok-to-test
Force-pushed from 4547514 to f4f6b55
Take a look at my comment here (#14656 (comment)); I think it's worth revisiting some of the assumptions that led to the creation of this PR.
Secondly, I notice this PR introduces a regression. If a service has all the traffic pinned to a specific revision, then when we roll out a new revision it's immediately scaled to zero and considered 'failed'. It should scale to 1 (or initial scale) and then scale down afterwards, since it's unreachable.
cond := pa.Status.GetCondition("Active")
pa.Status.MarkInactive(cond.Reason, cond.Message)
Probably simpler to have an if block that prevents setting a PA as inactive if it already is inactive.
Curious: do you know offhand what reason/messages get overwritten? I'm wondering if it makes sense to pull this into a separate PR if it helps with surfacing error messages.
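Something like the following, as a rough sketch of that guard; it assumes the same pa.Status helpers as the snippet above and the knative.dev/pkg/apis condition accessors, not the exact code in this PR.

// Sketch: skip MarkInactive when the PA is already inactive, so the existing
// reason/message are not overwritten on every reconcile. Assumes cond is the
// knative.dev/pkg/apis Condition returned by GetCondition above.
cond := pa.Status.GetCondition("Active")
if cond != nil && !cond.IsFalse() {
	pa.Status.MarkInactive(cond.Reason, cond.Message)
}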
Thanks for getting back to me!
I think it was being overwritten by the changes made in this PR. I'll have to test that again, though. For starters, I will create the PR as you suggest.
I have created this PR for computeActiveCondition. I think it covers all the same cases: #14940
Force-pushed from c60c5cd to 0d10f98
Force-pushed from 0839950 to 60aaaee
Force-pushed from 60aaaee to be3ec90
Making changes to the PA active status has become difficult. This change breaks down some of the logic so that it is more readable. No changes to test cases were made, as it doesn't actually change the logical cases.
Force-pushed from be3ec90 to ee6c0d0
Force-pushed from 47347c4 to 7033fe1
@andrew-delph: The following tests failed; say /retest to rerun all failed tests.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -0,0 +1,4 @@
# Deadstart test image
I added an image here to help with this
Going to convert the test to use this image.
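For reference, a "deadstart" image only needs to exit before it can ever become ready, so the pod crash-loops. A minimal sketch of what such a test image's main.go could look like; this is an assumption about its contents, not the actual image added in this PR.

// Hypothetical main.go for a deadstart test image: exit immediately with a
// non-zero status so the container never becomes ready and the pod crash-loops.
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Println("deadstart: exiting before ever becoming ready")
	os.Exit(1)
}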
// This test case creates a service which can never reach a ready state.
// The service is then updated with a healthy image, and it is verified that
// the healthy revision is ready and the unhealthy revision is scaled to zero.
func TestDeadStartToHealthy(t *testing.T) {
I added a similar test here: #14909
But I was expecting it to fail without any fixes, and it actually passed. So then I realized that we already scale down crashing revisions when they are unreachable.
So we might not need any other changes beyond what's already in main, unless you've uncovered a scenario where that isn't the case.
The case here is still happening for me: #14656 (comment)
The rev-2 will scale down, but the first one stays stuck in restarting.
1. Unreachable PA becomes inactive PA: computeActiveCondition() will MarkInactive in the "Queued" case when the PA is Unreachable.
2. Adding DeadStart e2e tests. TestDeadStartToHealthy: creates a service which is never able to transition to a ready state because it exits immediately. The failed revision will scale to zero once the service is updated with a healthy revision. TestDeadStartFromHealthy: updates a healthy service with an image that can never reach a ready state. The healthy revision remains Ready, and the DeadStart revision does not scale down until ProgressDeadline is reached.
Force-pushed from 2e26b6a to fa32dee
I'm currently getting an issue when the progress deadline is reached, but I'm looking into it.
Pods that instantly crash do not scale to 0 until the progress deadline is reached for the deployment.
Proposed Changes
If the PA is not ready and unreachable, it is marked inactive instead of queued.
This will scale the deployment to 0 even if metrics cannot be retrieved.
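As a rough sketch of that behavior (not this PR's actual computeActiveCondition), assuming the PodAutoscaler status helpers MarkActive/MarkActivating/MarkInactive and the Reachability field on the PA spec; the function name, reasons, and messages below are illustrative.

// Sketch: an unreachable PA that is not ready is marked Inactive instead of
// being left Queued, which lets the deployment scale to zero even when
// metrics cannot be retrieved. Reasons and messages here are placeholders.
func markActiveCondition(pa *autoscalingv1alpha1.PodAutoscaler, ready bool) {
	switch {
	case ready:
		pa.Status.MarkActive()
	case pa.Spec.Reachability == autoscalingv1alpha1.ReachabilityUnreachable:
		// Previously this case stayed "Queued"; the proposal marks it
		// Inactive so the revision can scale down right away.
		pa.Status.MarkInactive("Unreachable", "The revision is not routable.")
	default:
		pa.Status.MarkActivating("Queued", "Requests are buffered while resources are provisioned.")
	}
}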