Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: consider leader score when evict leader #8912

Closed
wants to merge 5 commits into from

Conversation

lhy1024
Copy link
Contributor

@lhy1024 lhy1024 commented Dec 12, 2024

What problem does this PR solve?

Issue Number: Close #8895

What is changed and how does it work?

Check List

Tests

  • Unit test

Release note

None.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 12, 2024
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 12, 2024
@@ -385,6 +378,22 @@ func scheduleEvictLeaderOnce(r *rand.Rand, name string, cluster sche.SchedulerCl
return ops
}

func createOperatorWithSort(name string, cluster sche.SchedulerCluster, candidates *filter.StoreCandidates, region *core.RegionInfo) (*operator.Operator, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming it to CreateTransferLeaderOperatorToLowestScoreStore and moving it to operator.go?

pkg/schedule/filter/comparer.go Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 4 lines in your changes missing coverage. Please review.

Project coverage is 76.30%. Comparing base (75cdec4) to head (8a695a5).
Report is 1099 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8912      +/-   ##
==========================================
+ Coverage   74.91%   76.30%   +1.38%     
==========================================
  Files         416      465      +49     
  Lines       42103    70554   +28451     
==========================================
+ Hits        31543    53835   +22292     
- Misses       7810    13368    +5558     
- Partials     2750     3351     +601     
Flag Coverage Δ
unittests 76.30% <92.30%> (+1.38%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Copy link
Contributor

ti-chi-bot bot commented Jan 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: okJiang
Once this PR has been reviewed and has the lgtm label, please ask for approval from lhy1024, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

ti-chi-bot bot commented Jan 9, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-01-09 02:55:37.646585939 +0000 UTC m=+408680.935417644: ☑️ agreed by okJiang.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jan 9, 2025
leaderSchedulePolicy := conf.GetLeaderSchedulePolicy()
return func(a, b *core.StoreInfo) int {
// TODO: we should use the real time delta data to calculate the score.
sa := a.LeaderScore(leaderSchedulePolicy, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leader score is not very accurate, so it leads to the leader count of the lowest score goes up too much. How about considering the the running operators influence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: lhy1024 <[email protected]>
return NewBuilder(desc, ci, region, SkipOriginJointStateCheck).
SetLeader(targetStoreID).
SetLeaders(targetStoreIDs).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need targetStoreIDs, which is used in evict_leader.go before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an optimization, why we don't need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetLeaders checks learner and unhealthy peer. https://github.com/tikv/pd/blob/master/pkg%2Fschedule%2Foperator%2Fbuilder.go#L301-L301

SetLeader also check these. https://github.com/tikv/pd/blob/master/pkg%2Fschedule%2Foperator%2Fbuilder.go#L284-L284

SetLeaders sort target stores according to store id. This PR sort target stores according to score.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targetLeaderStoreIDs is used previously, but removed by this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we use it in evict leader scheduler previously to select targets. It is replaced with this pr.

Copy link
Member

@rleungx rleungx Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allowed multiple targets in the op step before, this PR changes it which might be slower?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only used in evict leader, after this pr there is no other scheduler using it. So I remove it.
If we will use it in the future, I will recover it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was O(n) previously and it is O(nlogn) now.
Considering that the number of sorts is usually two (three replicas) or four (five replicas), the difference is not too big.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhy1024
Copy link
Contributor Author

lhy1024 commented Jan 9, 2025

As @rleungx said, tikv/tikv#11063 let tikv pick the fastest store to transfer leader to and may lead to leader unbalanced when some store is slow.

@lhy1024 lhy1024 closed this Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has signed the dco. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evicting slow stores may trigger redundant scheduling
4 participants