Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix defective revision can lead to pods never being removed #14573

Closed

Conversation

jsanin-vmw
Copy link

@jsanin-vmw jsanin-vmw commented Oct 30, 2023

Fixes #13677

Proposed Changes

  • Requests in the activator times out by checking the spec.revisionTimeout field at the Revision
  • The idea here is to allow scale to 0 when revisionTime has passed, instead of allowing this only when the progressDeadline has passed.
  • After revisionTimeout seconds there are not requests queued at the activator, because they already timed out. So, there is no need to not allow scaling to 0
  • There another edge case when revisionTimeout is less than the StableWindow (when in panic mode) , that prevents a revision from scaling to 0. See changes in the autoscaler.go

Release Note

https://github.com/knative/serving/issues/13677 fixed

@knative-prow
Copy link

knative-prow bot commented Oct 30, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/API API objects and controllers labels Oct 30, 2023
@knative-prow knative-prow bot requested review from kauana and KauzClay October 30, 2023 21:12
@codecov
Copy link

codecov bot commented Oct 30, 2023

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (fd8f461) 85.99% compared to head (b570ea6) 85.74%.
Report is 180 commits behind head on main.

Files Patch % Lines
pkg/autoscaler/scaling/autoscaler.go 28.57% 3 Missing and 2 partials ⚠️
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go 0.00% 4 Missing ⚠️
pkg/reconciler/autoscaling/kpa/scaler.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14573      +/-   ##
==========================================
- Coverage   85.99%   85.74%   -0.26%     
==========================================
  Files         197      198       +1     
  Lines       14916    15143     +227     
==========================================
+ Hits        12827    12984     +157     
- Misses       1777     1833      +56     
- Partials      312      326      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jsanin-vmw jsanin-vmw force-pushed the js-defective-revision-13677-2 branch 2 times, most recently from 623d48b to 16007ba Compare October 30, 2023 21:54
@jsanin-vmw jsanin-vmw changed the title Js defective revision 13677 2 Fix defective revision can lead to pods never being removed Oct 30, 2023
@jsanin-vmw jsanin-vmw marked this pull request as ready for review October 30, 2023 22:07
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 30, 2023
@knative-prow knative-prow bot requested a review from ReToCode October 30, 2023 22:07

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
…adline

also added a condition in the Scale function in which in case we are not
receiving data, check if you had panicked and the stable windows has passed,
and continue to check again further conditions
@jsanin-vmw jsanin-vmw force-pushed the js-defective-revision-13677-2 branch from 16007ba to 905e516 Compare October 30, 2023 22:11
Version: "v1",
Resource: "revisions",
}
uRevision, err := ks.dynamicClient.Resource(gvr).Namespace(pa.Namespace).Get(ctx, pa.Name, metav1.GetOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to be fetching revisions when we're making scaling decisions - that'll overload the API server at scale and possible trigger client side rate limiting

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.
in case this proposal is viable, where is a good place in the code to fetch the revisionTimeout?

just to note that this is called only when we are scaling to 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this proposal is viable, where is a good place in the code to fetch the revisionTimeout?

Two options

  • we add a field to the Knative PodAutoscaler type
  • we propagate this using an annotation on the PodAutoscaler type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we already have an annotation

func (pa *PodAutoscaler) ProgressDeadline() (time.Duration, bool) {
// the value is validated in the webhook
return pa.annotationDuration(serving.ProgressDeadlineAnnotation)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oddly seems like it's a user knob that we propagate to the deployment - but we aren't setting it on all PodAutoscalers (eg. when the user doesn't set it)

ProgressDeadlineAnnotationKey = GroupName + "/progress-deadline"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Details here: #12743

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new annotation serving.knative.dev/revision-timeout for the pod autoscaler, this has the RevisionTimeout field on the RevisionSpec. With this change I am not fetching the revision from the API server

Comment on lines +173 to +174
// We are not receiving data, but we panicked before, and the panic stable window has gone, we want to continue.
// We want to un-panic and scale down to 0 if the rest of the conditions allow it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with when and how things panic - can you elaborate on this?

Why do we only do this when the scaler panics - why not do something like this when it doesn't panic?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario is something like this:

  • we have a failing revision (which is what was reported in the bug)
  • this revision got a burst in requests, entered in panic mode.
  • the stable window is 60secs by default.
  • the revision has a custom revisionTimeout less than 60secs ( by default the revisionTimeout is 300 secs). This means requests in the activator will timeout after revisionTimeout.
  • when requests timed out, metrics will be reported. As there are not requests pending, the autoscaler would want to scale down to 0, but because we are still in panic mode, this will be ignored and the desiredPanicPodCount will be applied instead.
  • no metrics will be reported anymore, this condition will always be true:
if err != nil 
   if errors.Is(err, metrics.ErrNoData)

a return invalidSR will be returned always and no change to the decider.Status.DesiredScale will be made

This causes the failing revision never will be scale down to 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when requests timed out, metrics will be reported. As there are not requests pending, the autoscaler would want to scale down to 0, but because we are still in panic mode, this will be ignored and the desiredPanicPodCount will be applied instead.

what is the value of desiredPanicPodCount ? I'm wondering in this scenario if we should opt of panic'ing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard to say, it depends on the calculations done in the function. In line 205 you can see this

	// We want to keep desired pod count in the  [maxScaleDown, maxScaleUp] range.
	desiredStablePodCount := int32(math.Min(math.Max(dspc, maxScaleDown), maxScaleUp))
	desiredPanicPodCount := int32(math.Min(math.Max(dppc, maxScaleDown), maxScaleUp))

dppc is calculated like (line 194)

	dspc := math.Ceil(observedStableValue / spec.TargetValue)
	dppc := math.Ceil(observedPanicValue / spec.TargetValue)

and these values comes from the metricClient (line 163)

observedStableValue, observedPanicValue, err = a.metricClient.StableAndPanicConcurrency(metricKey, now)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run a manual test again.
I setup a ksvc revision with containerConcurrency: 2. This is a failing revision, it will not start up.
This revision also has timeoutSeconds: 30, which means that request at the activator will timeout after 30 seconds.

I created 4 concurrent requests, which made the autoscaler to determine that 3 pods were needed.

For some seconds these were the values for

desiredStablePodCount = 3, desiredPanicPodCount = 3

I saw as well the message Operating in panic mode. during this period of time.

This block of code makes sure that a.maxPanicPods is the greater of desiredStablePodCount and desiredPanicPodCount. Which at this point is 3.

At some point the values changed to

desiredStablePodCount = 3, desiredPanicPodCount = 2

and finally to

desiredStablePodCount = 3, desiredPanicPodCount = 0

then it moved to

desiredStablePodCount = 2, desiredPanicPodCount = 0

but this decision is skipped due to
Skipping pod count decrease from 3 to 2. (see condition in the above code)

then, even though the values are:

desiredStablePodCount = 0, desiredPanicPodCount = 0

I still see the skipping Skipping pod count decrease from 3 to 0. Because we are in panic mode and we do not decrease values when we are in panic mode.
We are getting to desiredStablePodCount = 0 because there are no more requests at the activator (they timed out)

Without the changes in this PR. No more metrics will be received from now on for this revision. And this block of code will always be true.

	if err != nil {
		if errors.Is(err, metrics.ErrNoData) {
			logger.Debug("No data to scale on yet")
		} else {
			logger.Errorw("Failed to obtain metrics", zap.Error(err))
		}
		return invalidSR
	}

And the revision will remain in panic mode from now on.

The new condition aims for allowing another execution of the function if the revision panicked and the stable window has gone so the revision will scale down.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this change is needed. Let me know your thoughts about the above explanation.

Comment on lines 192 to 194
if revTimeout, err := ks.revisionTimeout(ctx, pa); err == nil && revTimeout > 0 {
activationTimeout = time.Duration(revTimeout) * time.Second
} else if progressDeadline, ok := pa.ProgressDeadline(); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary to fix the issue? I'd imagine failed revisions would scale down after ProgressDeadline is hit?

In our config map deployment we say that we wait ProgressDeadline until we considered a revision failed. These changes would break that existing behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes would break that existing behaviour.

This could be why the upgrade tests are failing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd imagine failed revisions would scale down after ProgressDeadline is hit?

yeah. The failed revision will scale down eventually. After 10min (progress deadline). This is what the issue is reporting: why pods on the failed revision are still around.

After revisionTimeout passed, there should no be any requests sit at the activator, hence no reason to keep pods around.

I think the current condition comparing progressDeadline should have been changed when the revisionTimeout field was introduced.

Copy link

@jsanin jsanin Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are new changes in this code. I check the revision-timeout when the revision is Unreachable. Could you review this again ?

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Scale down to zero if revision timeout has gone and the revision is unreachable
Copy link

linux-foundation-easycla bot commented Jan 31, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link

knative-prow bot commented Jan 31, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsanin-vmw
Once this PR has been reviewed and has the lgtm label, please assign davidhadas for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 31, 2024
@jsanin
Copy link

jsanin commented Feb 1, 2024

/retest

Copy link
Member

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to my PR comments - I'm having a hard time reproducing the original bug (see: #13677 (comment)) . When the activator times out the request I'm seeing bad revisions scale down after the progress deadline.

I don't think it's necessary to continue this PR until we circle back with the original reporter for more information.

Comment on lines +204 to +206
} else if revTimeout, ok := pa.RevisionTimeout(); ok && pa.CanFailActivationOnUnreachableRevision(now, revTimeout) {
logger.Info("Activation has timed out after revision-timeout ", revTimeout)
return desiredScale, true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After spending time thinking about this I don't think we should be failing the revision based on the desired request timeout. The user should be setting the progressdeadline annotation directly if they want to lower this value.

Also there are valid use cases where spinning up a new 'unreachable' revision is the norm. eg. I have all my traffic pointing to an older revision and I'm applying changes to my service spec.template.spec to rollout new changes by creating new revisions. In my manual testing the changes in this PR will cause the newer unreachable revisions to fail - when normally they should spin up and then scale to zero. The change here short-circuits that.

@dprotaso
Copy link
Member

Going to close this out - thanks for the patience and help 🙏 this is an area of the code I'm still becoming more familiar with

@dprotaso dprotaso closed this Feb 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/API API objects and controllers area/autoscale size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Defective revision can lead to pods never being removed
3 participants