Performance/health feedback to client and the Unified Health Controller #16297

Closed
8 tasks done
MyonKeminta opened this issue Jan 5, 2024 · 0 comments · Fixed by #17008
Labels
type/enhancement The issue or PR belongs to an enhancement.

MyonKeminta commented Jan 5, 2024

Development Task

Currently, there are several ways to evaluate whether a TiKV node is healthy:

  • SlowScore, which evaluates whether raftstore is in an abnormally slow state. It can be reported to PD, which lets PD trigger the evict-slow-store-scheduler so that write requests are no longer sent to the store.
  • SlowTrend, which is an improvement on SlowScore.
  • gRPC HealthService, which is gRPC's built-in mechanism for indicating whether the service is available to serve requests.

All of these are managed in PdWorker.

We found that the evict-slow-store-scheduler mentioned above is quite helpful in solving the problem where some TiKV nodes with slow IO significantly affect the whole cluster's performance. However, problems can still occur if follower read is used in the cluster, since follower reads don't need to be processed on the leader and can therefore still reach the slow store.

To solve this, we want client-go to know whether each TiKV node is abnormal and to avoid sending follower read requests to the problematic ones. We are considering sending some of the health information to the client via kv responses, so that the client gets more timely information about the TiKV nodes' status, obtains it more efficiently, and can adjust its replica selection policy accordingly.
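
As a rough illustration of the client-side idea above, here is a minimal sketch of how a client might cache per-store health feedback and skip slow stores when choosing a follower read target. It is written in Rust for consistency with TiKV, although client-go itself is written in Go. Every name here (`HealthFeedback`, `StoreHealthCache`, `SLOW_SCORE_THRESHOLD`, `pick_follower_read_target`), the score scale and the threshold value are assumptions made for illustration, not the actual client-go design.

```rust
use std::collections::HashMap;

/// Hypothetical health feedback most recently received from a store.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct HealthFeedback {
    pub store_id: u64,
    /// Assumed scale: 1 means healthy, larger values mean slower.
    pub slow_score: u32,
}

/// Hypothetical threshold at or above which a store is treated as slow.
pub const SLOW_SCORE_THRESHOLD: u32 = 10;

/// Client-side cache of the latest feedback per store.
#[derive(Default)]
pub struct StoreHealthCache {
    feedback: HashMap<u64, HealthFeedback>,
}

impl StoreHealthCache {
    /// Records feedback, replacing any older entry for the same store.
    pub fn update(&mut self, fb: HealthFeedback) {
        self.feedback.insert(fb.store_id, fb);
    }

    /// A store with no feedback yet is assumed to be healthy.
    pub fn is_slow(&self, store_id: u64) -> bool {
        self.feedback
            .get(&store_id)
            .map_or(false, |fb| fb.slow_score >= SLOW_SCORE_THRESHOLD)
    }

    /// Picks the first follower that is not known to be slow.
    /// Falls back to the leader when every follower is slow.
    pub fn pick_follower_read_target(&self, leader: u64, followers: &[u64]) -> u64 {
        followers
            .iter()
            .copied()
            .find(|id| !self.is_slow(*id))
            .unwrap_or(leader)
    }
}
```

A real replica selector would also have to consider things like feedback expiry and load balancing across healthy followers; this sketch only shows the filtering step.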

We are planning to add a component named Unified Health Controller, which will be the unified entry point for managing and accessing the health status of a TiKV node. The SlowScore, SlowTrend and gRPC HealthService mentioned above should be moved into it, and PdWorker will still be responsible for updating them. The component itself should live outside PdWorker, so that we can access it from elsewhere and add information that is not appropriate to update in PdWorker (e.g., readpool stats).
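
To make the intended ownership model concrete, here is a minimal sketch of a health controller that lives outside any single worker and is shared through cloneable handles: one handle is held by the updater (PdWorker in the description above) and others by readers such as the kv service. The type and method names (`HealthController`, `set_slow_score`, `is_serving`, and so on) and the choice of atomics are assumptions for illustration only; the actual health_controller module may be structured quite differently.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;

/// Shared state behind every handle. Hypothetical layout.
struct Inner {
    slow_score: AtomicU64,
    serving: AtomicBool,
}

/// Cloneable handle to the node's health status. All clones observe the
/// same underlying state, so an updater and any number of readers can
/// each hold their own handle.
#[derive(Clone)]
pub struct HealthController {
    inner: Arc<Inner>,
}

impl HealthController {
    /// Starts in a healthy state: assumed minimum score 1, and serving.
    pub fn new() -> Self {
        HealthController {
            inner: Arc::new(Inner {
                slow_score: AtomicU64::new(1),
                serving: AtomicBool::new(true),
            }),
        }
    }

    /// Called by the component that computes the score.
    pub fn set_slow_score(&self, score: u64) {
        self.inner.slow_score.store(score, Ordering::Relaxed);
    }

    /// Called by any reader, e.g. when attaching feedback to a response.
    pub fn slow_score(&self) -> u64 {
        self.inner.slow_score.load(Ordering::Relaxed)
    }

    /// Marks whether the node should be reported as available to serve.
    pub fn set_serving(&self, serving: bool) {
        self.inner.serving.store(serving, Ordering::Relaxed);
    }

    pub fn is_serving(&self) -> bool {
        self.inner.serving.load(Ordering::Relaxed)
    }
}

impl Default for HealthController {
    fn default() -> Self {
        Self::new()
    }
}
```

Because the state sits behind an `Arc`, adding a new source of health information later (for example readpool stats) would only require giving that source its own handle, without routing the update through PdWorker.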

TiKV

PD

  • Make PDClient support getting the StoreStats field in GetStore. The field is already included in the RPC protocol but is not returned by the Go PDClient.

client-go

Other related work in the client-go repo: tikv/client-go#1104

TiDB

The dependencies of the TiDB repo need to be updated several times, but there isn't any major development task in the TiDB repo.
Ref: pingcap/tidb#51412

Next step

  • Including read path status (e.g. read pool busy)
  • TBD.
MyonKeminta added the type/enhancement label Jan 5, 2024
ti-chi-bot bot added a commit that referenced this issue Feb 2, 2024
…Service from PdWorker to it (#16456)

ref #16297

Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it

Signed-off-by: MyonKeminta <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Feb 20, 2024
…16498)

ref #16297

Support sending health feedback information to the client via BatchCommandResponse

Signed-off-by: MyonKeminta <[email protected]>
dbsid pushed a commit to dbsid/tikv that referenced this issue Mar 24, 2024
…Service from PdWorker to it (tikv#16456)

ref tikv#16297

Add module health_controller and move SlowScore, SlowTrend, HealthService from PdWorker to it

Signed-off-by: MyonKeminta <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: dbsid <[email protected]>
dbsid pushed a commit to dbsid/tikv that referenced this issue Mar 24, 2024
…ikv#16498)

ref tikv#16297

Support sending health feedback information to the client via BatchCommandResponse

Signed-off-by: MyonKeminta <[email protected]>
Signed-off-by: dbsid <[email protected]>
ti-chi-bot bot added a commit that referenced this issue May 16, 2024
…ling RPC (#17008)

close #16297

This PR makes TiKV support explicitly getting health feedback information by calling RPC. Both non-batched mode and batched mode (using BatchCommands stream) are supported.
There is some special behavior in batched RPC mode: the BatchCommandsResponse that contains the response to a health feedback request will always have feedback information attached (in the same way it is attached when not explicitly requested), and the attached information is the same as the information carried in each individual GetHealthFeedbackResponse.

Signed-off-by: MyonKeminta <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>