Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make slow store filtering the highest priority in replica selector v2 #1267

Merged

Conversation

MyonKeminta
Copy link
Contributor

Make slow store filtering the highest priority in replica selector v2.

With some minor changes:

  • Add logs about health state change
  • Disable health status tick of region cache in replica_selector_test
  • Fix slow score resetting in test code
  • Special type storeSelectionScore and String method which might help in logging
  • A test about multiple replica in the same AZ, testing priority between label matching and retrying leader
  • Minor renaming, typo fixing and comment fixing

There are some behavior inconsistencies found between v1 and v2 and maybe we need some discussion. I want to finish this PR within April 8 if possible so that this can be included in release-8.1 branch.

The inconsistencies are all about replica read mode without stale read and labels are specified, in which cases this branch in calculateScore is executed:

https://github.com/MyonKeminta/client-go/blob/05eec3f6f7a5e49b7aff92862716253afa882da5/internal/locate/replica_selector.go#L367-L370

And what it causes are:

  1. In replica read mode without stale read, 1 store out of 3 has matching labels, and the first request to it failed. v2 retries on leader, but v1 may retry on any replica.
  2. In replica read mode without stale read, 2 stores (1 of which is the leader) out of 6 have matching labels, then v2 will always send the first request to the leader, while v1 won't.
  3. In replica read mode without stale read, 2 stores (both follower) out of 6 have matching labels, and requests to them both failed. Then the next attempt will always be sent to the leader in v2, while v1 won't.

@MyonKeminta MyonKeminta requested review from zyguan and cfzjywxk and removed request for zyguan April 7, 2024 10:36
@cfzjywxk
Copy link
Contributor

cfzjywxk commented Apr 7, 2024

/cc @crazycs520 PTAL

s.True(s.runCaseAndCompare(ca))

s.T().Logf("test case: stale read: %v, with label: %v, slow: true", staleRead, withLabel)
expectedFirstStore := 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the expectedFirstStore also be 3 if labels are not set and store 1 is slow with kv.ReplicaReadMixed?

Copy link
Contributor

@zyguan zyguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR LGTM.

I'm wondering whether one single strategy can meet all user cases: what if some users are latency insensitive but very care about cross AZ traffic, or RTT is known to be unacceptable when cross AZ. Maybe we need to rethink the UI of replica selector (tidb_replica_read) later, how to make it reasonable and user-friendly.

@MyonKeminta
Copy link
Contributor Author

MyonKeminta commented Apr 8, 2024

@crazycs520 @cfzjywxk @zyguan I'm considering removing TestMultiReplicaInOneAZ from this PR (which is irrelevant to the title of this PR) so that this PR can be merged within today hopefully. The problem revealed in that test case will be another topic.

FYI the failure message of that test case (code here)

    replica_selector_test.go:3000:
                Error Trace:    /home/jenkins/client-go/internal/locate/replica_selector_test.go:3000
                                                        /home/jenkins/client-go/internal/locate/replica_selector_test.go:2938
                                                        /home/jenkins/client-go/internal/locate/replica_selector_test.go:1884
                Error:          Not equal:
                                expected: []string{"{addr: store4, replica-read: true, stale-read: false}"}
                                actual  : []string{"{addr: store1, replica-read: true, stale-read: false}"}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,3 +1,3 @@
                                 ([]string) (len=1) {
                                - (string) (len=53) "{addr: store4, replica-read: true, stale-read: false}"
                                + (string) (len=53) "{addr: store1, replica-read: true, stale-read: false}"
                                 }
                Test:           TestMultiReplicaInOneAZ
                Messages:       enable_forwarding: false
                                version: v1
                                {
                                        req: Get
                                        read_type: mixed
                                        stale_read: false
                                        timeout: 0s
                                        busy_threshold_ms: 0
                                        label: id->1
                                        access_err:
                                        access_path: {addr: store1, replica-read: true, stale-read: false}
                                        resp_err:
                                        resp_region_err:
                                        backoff_cnt: 0
                                        backoff_detail:
                                        region_is_valid: true
                                }
                                sender: {rpcError:<nil>, replicaSelector: replicaSelector{state: accessFollower, cacheRegionIsValid: false, replicaStatus: [peer: 4, store: 1, isEpochStale: false, attempts: 1, rep
lica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 5, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-s
tate: reachable peer: 6, store: 3, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 8, store: 4, isEpochStale: false, attempts: 0, r
eplica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 9, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness
-state: reachable peer: 10, store: 6, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}
    replica_selector_test.go:3000:
                Error Trace:    /home/jenkins/client-go/internal/locate/replica_selector_test.go:3000
                                                        /home/jenkins/client-go/internal/locate/replica_selector_test.go:2938
                                                        /home/jenkins/client-go/internal/locate/replica_selector_test.go:1884
                Error:          Not equal:
                                expected: []string{"{addr: store4, replica-read: false, stale-read: true}"}
                                actual  : []string{"{addr: store1, replica-read: false, stale-read: true}"}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,3 +1,3 @@
                                 ([]string) (len=1) {
                                - (string) (len=53) "{addr: store4, replica-read: false, stale-read: true}"
                                + (string) (len=53) "{addr: store1, replica-read: false, stale-read: true}"
                                 }
                Test:           TestMultiReplicaInOneAZ
                Messages:       enable_forwarding: false
                                version: v1
                                {
                                        req: Get
                                        read_type: mixed
                                        stale_read: true
                                        timeout: 0s
                                        busy_threshold_ms: 0
                                        label: id->1
                                        access_err:
                                        access_path: {addr: store1, replica-read: false, stale-read: true}
                                        resp_err:
                                        resp_region_err:
                                        backoff_cnt: 0
                                        backoff_detail:
                                        region_is_valid: true
                                }
                                sender: {rpcError:<nil>, replicaSelector: replicaSelector{state: accessFollower, cacheRegionIsValid: false, replicaStatus: [peer: 4, store: 1, isEpochStale: false, attempts: 1, rep
lica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 5, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-s
tate: reachable peer: 6, store: 3, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 8, store: 4, isEpochStale: false, attempts: 0, r
eplica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 9, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness
-state: reachable peer: 10, store: 6, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}

@MyonKeminta MyonKeminta marked this pull request as ready for review April 8, 2024 04:14
@cfzjywxk cfzjywxk merged commit 642a09b into tikv:master Apr 8, 2024
10 checks passed
@MyonKeminta MyonKeminta deleted the m/promote-priority-of-slow-score-filtering branch April 8, 2024 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants