Defective revision can lead to pods never being removed #13677

Closed
DavidR91 opened this issue Feb 6, 2023 · 8 comments
Labels
area/autoscale kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. triage/needs-user-input Issues which are waiting on a response from the reporter

Comments

DavidR91 commented Feb 6, 2023

In what area(s)?

/area autoscale

What version of Knative?

Repro'ed in
1.3
1.5
1.9

(This reproduces with Istio as networking + ingress 1.12.9. Apart from using the operator for installation, the configuration is very vanilla, but I can provide more details if useful.)

Expected Behavior

Deployments that pass their initial progress deadline but contain pods that start to crashloop should eventually be scaled down and removed.

(Note: Scale to zero assumed)

Actual Behavior

If there is buffered traffic for a revision of a service, and the service has passed its initial deployment progress deadline, Knative keeps the revision's deployment alive forever, with no obvious way to scale it down or remove it (the pods stay around in a crashlooping state).

Example use case encountered: a revision contains a pod configured with, e.g., the address of an external resource such as a database. The service works with this revision for some time, and then the external resource's address changes (the pod starts, but the container does not serve requests and eventually enters a restart loop). A new revision is created to fix this, but if there is any outstanding traffic for the old revision, the old defective pods are kept around and never scaled down.

In this situation the state of the PodAutoscaler becomes Ready=Unknown, Reason=Queued, with a status message to the effect of "Requests to the target are being buffered as resources are provisioned".

Removing the service is not a solution because the newest revision is correctly serving traffic.

Steps to Reproduce the Problem

  • Create a container that serves HTTP traffic correctly but ceases to start listening/functioning based on external criteria
    • Simple example is to sleep for 5 seconds and exit before the listener starts if the minute of the current hour is >30
  • Create a service + revision for the container
  • Send traffic to the service while the external criteria allows the container to operate
    • Make sure the service passes its initial deployment deadline (~10 mins)
  • Wait for it to scale back down to zero
  • Send traffic to the service now that the external criteria prevents it starting
  • The deployment will scale up and all the created pods will crashloop
  • Create a new revision that corrects the issue, and drive traffic to the service again
  • The service's new revision will start and serve traffic but the deployment and pods of the old defective revision will stick around with no clear way to remove them
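A minimal sketch of the kind of container described in the first step, assuming a Python image is acceptable for the repro. The names `should_fail`, `Handler`, and `main` are illustrative and not part of the original report; the listening port is read from the `PORT` environment variable, with 8080 as a fallback.

```python
import os
import sys
import time
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer


def should_fail(now: datetime) -> bool:
    """External criterion from the repro steps: fail when the minute is > 30."""
    return now.minute > 30


class Handler(BaseHTTPRequestHandler):
    """Answers every GET with a 200 so the revision looks healthy while it runs."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    if should_fail(datetime.now()):
        # Sleep, then exit before the listener starts, so the pod crashloops.
        time.sleep(5)
        sys.exit(1)
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), Handler).serve_forever()
```

The container entrypoint needs to call `main()`, for example by adding that call as the last line of the file. Deploying during the first half of an hour and sending traffic again in the second half should then walk through the remaining steps.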

NOTE: You can obviously delete the revision, but this is not a solution for services which have only a single revision (do we have to delete the entire service to kill these pods?). This bug is partly a question of whether Knative is actually designed to clean up this scenario, or whether resolving it rests on a human operator or an additional orchestrator.

@DavidR91 DavidR91 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 6, 2023
Member

dprotaso commented Feb 9, 2023

/triage accepted

@knative-prow knative-prow bot added the triage/accepted Issues which should be fixed (post-triage) label Feb 9, 2023
@dprotaso dprotaso added this to the v1.10.0 milestone Feb 10, 2023
@jsanin-vmw

/assign

@jsanin-vmw

/unassign

@jsanin-vmw

/assign

jsanin commented Feb 1, 2024

PR 14573 aims to fix this issue.

The proposed fix is based on the TimeoutSeconds field in the Revision. Once timeoutSeconds has elapsed, there should be no pending requests in the activator, so the Unreachable revision can scale down with no risk of requests going unprocessed.

The default value for timeoutSeconds is 300, so the pods of the failing revision will only scale down after this time. TimeoutSeconds can be changed, of course.
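The rule described above can be sketched as a small decision function. This is only an illustration of the idea, not the code from the PR: `RevisionState` and `may_scale_down` are hypothetical names, and the only value taken from the comment is the 300-second default.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_SECONDS = 300  # default timeoutSeconds quoted in the comment


@dataclass
class RevisionState:
    """Hypothetical, simplified view of a revision; not a Knative type."""
    reachable: bool
    seconds_since_unreachable: float
    timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS


def may_scale_down(rev: RevisionState) -> bool:
    """Return True once an unreachable revision has outlived its timeout.

    This sketch only covers the unreachable case; reachable revisions are
    left to the normal autoscaling path and always return False here.
    """
    if rev.reachable:
        return False
    # After timeout_seconds, no request buffered for this revision should
    # still be pending, so scaling down is assumed to be safe.
    return rev.seconds_since_unreachable >= rev.timeout_seconds
```

With the default, `may_scale_down(RevisionState(reachable=False, seconds_since_unreachable=301))` is true, while the same revision at 10 seconds is kept running.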

@dprotaso
Member

@DavidR91 I've been trying to reproduce this issue to verify the proposed fix - but in my testing I'm seeing the revision pod scale down once the activator times out the request.

Do you have a consistent way to trigger this issue? Can you confirm requests are being timed out by the activator?

@dprotaso dprotaso modified the milestones: v1.13.0, v1.14.0 Feb 17, 2024
@dprotaso dprotaso added triage/needs-user-input Issues which are waiting on a response from the reporter and removed triage/accepted Issues which should be fixed (post-triage) labels Feb 17, 2024
@dprotaso dprotaso removed this from the v1.14.0 milestone Feb 20, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2024
@dprotaso
Member

Closing this out due to lack of user input.

4 participants