Reject series/write requests when max pending request limit is hit #114

Merged
hczhu-db merged 2 commits into db_main from load-shedding
Dec 17, 2024

Collaborator

@hczhu-db hczhu-db commented Dec 15, 2024

This is to prevent the Receive server from being overloaded.

Tested in dev-aws-eu-west-1

[dev-aws-eu-west-1] [pantheon] [pantheon-db-rep0-0] > logs | rg pending
ts=2024-12-17T17:38:14.728273143Z caller=receive.go:272 level=info name=pantheon-db component=receive msg="set max pending gRPC write request in limiter" max_pending_requests=1000

@hczhu-db hczhu-db force-pushed the load-shedding branch 8 times, most recently from 174dd33 to d52aa34 Compare December 15, 2024 23:29
level.Info(logger).Log("msg", "set max pending gRPC write request in limiter", "max_pending_requests", conf.maxPendingGrpcWriteRequests)
}
limiter, err := receive.NewLimiterWithOptions(
conf.writeLimitsConfig,
Collaborator

@jnyi jnyi Dec 16, 2024

Consider reusing writeLimitsConfig so there are fewer interface changes?

Collaborator Author

@hczhu-db hczhu-db Dec 17, 2024

I agree that it'd be ideal to have this config field in writeLimitsConfig, but DB pods don't load writeLimitsConfig at all. Pantheon-writer pods load that config. I'll have to keep it this way.
The interface is not changed. receive.NewLimiter() stays the same. I added another function receive.NewLimiterWithOptions().

Collaborator

got it, that's fair, do you wanna add some unit tests for limiter to test the load shedding behavior?
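
The reply above describes keeping receive.NewLimiter() unchanged and adding receive.NewLimiterWithOptions() next to it. A minimal sketch of that backward-compatible constructor pattern follows. The signatures, the LimiterOptions struct, and the Limiter fields are hypothetical stand-ins for illustration; the real receive package takes different arguments that are not shown in this thread.

```go
package main

import "fmt"

// Limiter is a simplified stand-in for the real limiter type.
// Both fields are assumptions made for this sketch.
type Limiter struct {
	configPath         string
	maxPendingRequests int32
}

// LimiterOptions is a hypothetical options struct. New settings can be
// added here without touching existing callers of NewLimiter.
type LimiterOptions struct {
	// MaxPendingRequests of 0 disables the pending-request limit.
	MaxPendingRequests int32
}

// NewLimiterWithOptions is the extended constructor.
func NewLimiterWithOptions(configPath string, opts LimiterOptions) (*Limiter, error) {
	if opts.MaxPendingRequests < 0 {
		return nil, fmt.Errorf("max pending requests must be >= 0, got %d", opts.MaxPendingRequests)
	}
	return &Limiter{
		configPath:         configPath,
		maxPendingRequests: opts.MaxPendingRequests,
	}, nil
}

// NewLimiter keeps its original shape and delegates with default options,
// so callers that do not need the new setting compile unchanged.
func NewLimiter(configPath string) (*Limiter, error) {
	return NewLimiterWithOptions(configPath, LimiterOptions{})
}

func main() {
	oldStyle, _ := NewLimiter("limits.yaml")
	newStyle, _ := NewLimiterWithOptions("", LimiterOptions{MaxPendingRequests: 1000})
	fmt.Println(oldStyle.maxPendingRequests, newStyle.maxPendingRequests)
}
```

With this shape, a pod that never loads the limits config can still pass the pending-request cap through the options argument, which matches the constraint described in the reply.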

Collaborator

@jnyi jnyi left a comment

left a few comments, great job, i think we should also track remote write pending writes using limiter

// Value 0 disables the feature.
maxPendingRequests int32
pendingRequests atomic.Int32
maxPendingRequestLimitHit prometheus.Counter

LGTM. Shall we consider adding an alert around this metric?

Collaborator Author

Definitely once the counter is there.
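
The fields under review in this thread (maxPendingRequests, pendingRequests, maxPendingRequestLimitHit) suggest an atomic in-flight counter compared against a fixed cap. The sketch below shows one way such a counter could work. It is an assumption, not the code from this PR: the method names tryAdmit and done are hypothetical, and a plain atomic.Int64 stands in for the prometheus.Counter so the example has no external dependencies.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pendingLimiter caps the number of requests in flight.
type pendingLimiter struct {
	// Value 0 disables the feature.
	maxPendingRequests int32
	pendingRequests    atomic.Int32
	// limitHit stands in for the maxPendingRequestLimitHit counter metric.
	limitHit atomic.Int64
}

// tryAdmit reserves a slot for a new request. It returns false, and
// records a limit hit, when the cap is already reached.
func (l *pendingLimiter) tryAdmit() bool {
	if l.maxPendingRequests <= 0 {
		return true // feature disabled, nothing is tracked
	}
	if l.pendingRequests.Add(1) > l.maxPendingRequests {
		l.pendingRequests.Add(-1) // undo the reservation
		l.limitHit.Add(1)
		return false
	}
	return true
}

// done releases the slot taken by a successful tryAdmit.
func (l *pendingLimiter) done() {
	if l.maxPendingRequests > 0 {
		l.pendingRequests.Add(-1)
	}
}

func main() {
	l := &pendingLimiter{maxPendingRequests: 2}
	fmt.Println(l.tryAdmit(), l.tryAdmit(), l.tryAdmit())
	l.done()
	fmt.Println(l.tryAdmit(), l.limitHit.Load())
}
```

Incrementing first and rolling back on overflow keeps the check and the reservation in a single atomic step, so two concurrent requests cannot both take the last slot. An alert like the one discussed above could then fire on the rate of the limit-hit counter.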

@@ -1083,12 +1073,15 @@ func quorumReached(successes []int, successThreshold int) bool {

// RemoteWrite implements the gRPC remote write handler for storepb.WriteableStore.
func (h *Handler) RemoteWrite(ctx context.Context, r *storepb.WriteRequest) (*storepb.WriteResponse, error) {
if h.Limiter.ShouldRejectNewRequest() {
Collaborator

nice, let's add a unit test for this behavior?

Collaborator Author

I can do that in a follow-up PR while testing it in Dev. It's quite tricky to write a unit test for such a feature. I want to see how useful it is in Dev before spending time on it.

Collaborator

@jnyi jnyi left a comment

lgtm, thank you for doing this!

@hczhu-db hczhu-db merged commit 2fecf4d into db_main Dec 17, 2024
14 checks passed
@hczhu-db hczhu-db deleted the load-shedding branch December 17, 2024 19:39