Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The TSO request may have a high latency after the leader changes #8835

Open
rleungx opened this issue Nov 20, 2024 · 2 comments
Open

The TSO request may have a high latency after the leader changes #8835

rleungx opened this issue Nov 20, 2024 · 2 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@rleungx
Copy link
Member

rleungx commented Nov 20, 2024

Enhancement Task

If the TSO request fails, it will try to update the members to get the new leader.

pd/client/tso_dispatcher.go

Lines 403 to 416 in 41ec8dc

err = td.processRequests(stream, tsoBatchController, done)
// If error happens during tso stream handling, reset stream and run the next trial.
if err == nil {
// A nil error returned by `processRequests` indicates that the request batch is started successfully.
// In this case, the `tsoBatchController` will be put back to the pool when the request is finished
// asynchronously (either successful or not). This infers that the current `tsoBatchController` object will
// be asynchronously accessed after the `processRequests` call. As a result, we need to use another
// `tsoBatchController` for collecting the next batch. Do to this, we set the `tsoBatchController` to nil so that
// another one will be fetched from the pool at the beginning of the batching loop.
// Otherwise, the `tsoBatchController` won't be processed in other goroutines concurrently, and it can be
// reused in the next loop safely.
tsoBatchController = nil
} else {
exit := !td.handleProcessRequestError(ctx, bo, streamURL, cancel, err)

svcDiscovery.ScheduleCheckMemberChanged()

And there is a backoff, which the minimum time is 100ms

func (c *pdServiceDiscovery) updateMemberLoop() {
defer c.wg.Done()
ctx, cancel := context.WithCancel(c.ctx)
defer cancel()
ticker := time.NewTicker(memberUpdateInterval)
defer ticker.Stop()
bo := retry.InitialBackoffer(updateMemberBackOffBaseTime, updateMemberTimeout, updateMemberBackOffBaseTime)
for {
select {
case <-ctx.Done():
log.Info("[pd] exit member loop due to context canceled")
return
case <-ticker.C:
case <-c.checkMembershipCh:
}
failpoint.Inject("skipUpdateMember", func() {
failpoint.Continue()
})
if err := bo.Exec(ctx, c.updateMember); err != nil {
log.Error("[pd] failed to update member", zap.Strings("urls", c.GetServiceURLs()), errs.ZapError(err))
}
}
}

At the same time, the request can still be put into the channel and wait for handling:

c.getDispatcher().push(request)

And the request might be affected by the backoff because we need to wait for the stream to be re-established.

@rleungx rleungx added the type/enhancement The issue or PR belongs to an enhancement. label Nov 20, 2024
@AndreMouche
Copy link
Member

And the request might be affected by the backoff because we need to wait for the stream to be re-established.

I think this should meet expectations, if not waiting for re-connection, should it return an error directly?

@rleungx
Copy link
Member Author

rleungx commented Dec 20, 2024

Dispatching TSO requests and stream reconnection are asynchronized. If we return an error, it also needs to wait until the PD leader is switched and the stream is reconnected. We previously added a backoff to prevent PD from being overwhelmed by GetMember requests.

I think one way to mitigate the issue is to reduce the backoff interval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

No branches or pull requests

2 participants