Queueing pods belonging to the same podGroup as a unit #661
Conversation
Signed-off-by: kerthcet <[email protected]>
@kerthcet: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
✅ Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kerthcet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/kind documentation
An initial thought: the first commit is just a format change; see the 2nd one for the main change, 37ae422.
Signed-off-by: kerthcet <[email protected]>
02f6ef8 to 37ae422
@Huang-Wei can you help take a look?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
// ScheduleTimeoutSeconds defines the maximal time of members/tasks to wait before run the pod group;
ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
// MinMember defines the minimal number of members/tasks to run the pod group;
can you either change the indent back to tab, or 4 spaces?
For Pods belong to a same podGroup should be popped out of the scheduling queue one by one,
which requires an efficient and wise algorithm. This is quite different with plain Pods,
so we need to reimplement the QueueSort plugin.
For Pods within the same PodGroup, they should be dequeued one by one from the scheduling queue, necessitating an efficient and strategic algorithm. This requirement differs significantly from plain Pods, thus prompting the need for a reimplementation of the QueueSort plugin.
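A minimal sketch of such a reimplemented QueueSort comparator, in Go. The `groupTimeLookup` interface and `GetQueuedTimestamp` helper are hypothetical names, not the plugin's actual API; the idea is only that pods are ordered by priority, then by their group's queued timestamp, then by the group key, so members of one podGroup come out of the queue back-to-back.

```go
// Sketch only: hypothetical names, not the coscheduling plugin's real code.
package coscheduling

import (
	"time"

	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// groupTimeLookup is an assumed read-only view over the podGroup cache.
type groupTimeLookup interface {
	// GetQueuedTimestamp returns the group's queued timestamp and a sort key
	// ("<namespace>/<podGroup name>", or the pod's own key for plain Pods).
	GetQueuedTimestamp(pi *framework.QueuedPodInfo) (time.Time, string)
}

type Coscheduling struct {
	pgMgr groupTimeLookup
}

// Less orders pods by priority, then by the podGroup's queued timestamp,
// then by the group key, so members of one podGroup dequeue back-to-back.
// It only reads the cache; the timestamp is never mutated here.
func (cs *Coscheduling) Less(pi1, pi2 *framework.QueuedPodInfo) bool {
	p1, p2 := corev1helpers.PodPriority(pi1.Pod), corev1helpers.PodPriority(pi2.Pod)
	if p1 != p2 {
		return p1 > p2
	}
	t1, k1 := cs.pgMgr.GetQueuedTimestamp(pi1)
	t2, k2 := cs.pgMgr.GetQueuedTimestamp(pi2)
	if !t1.Equal(t2) {
		return t1.Before(t2)
	}
	return k1 < k2
}
```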
Basically, we need to watch for all the podGroups and maintain a cache for them
in coscheduling's core package:
- when podGroup created, we'll create a `queuedPodGroup` in the cache |
when a podGroup gets created, create a queuedPodGroup in the cache
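A rough sketch of that watch and cache, assuming the PodGroup API from `sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1` (import path assumed); the `queuedPodGroup` and `PodGroupManager` shapes mirror the proposal text and are illustrative only.

```go
// Sketch: keep the podGroup cache in sync with PodGroup objects via an informer.
package coscheduling

import (
	"sync"
	"time"

	"k8s.io/client-go/tools/cache"

	"sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// queuedPodGroup holds the queueing state for one podGroup.
type queuedPodGroup struct {
	Timestamp time.Time // set on first enqueue, renewed after a failed group attempt
}

// PodGroupManager caches queueing state for all podGroups.
type PodGroupManager struct {
	sync.RWMutex
	// key is "<namespace>/<podGroup name>", as suggested in review.
	podGroups map[string]*queuedPodGroup
}

func NewPodGroupManager() *PodGroupManager {
	return &PodGroupManager{podGroups: make(map[string]*queuedPodGroup)}
}

// RegisterEventHandlers creates/removes the cache entry together with the
// PodGroup object itself.
func (m *PodGroupManager) RegisterEventHandlers(informer cache.SharedIndexInformer) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if pg, ok := obj.(*v1alpha1.PodGroup); ok {
				m.Lock()
				defer m.Unlock()
				m.podGroups[pg.Namespace+"/"+pg.Name] = &queuedPodGroup{}
			}
		},
		DeleteFunc: func(obj interface{}) {
			if pg, ok := obj.(*v1alpha1.PodGroup); ok {
				m.Lock()
				defer m.Unlock()
				delete(m.podGroups, pg.Namespace+"/"+pg.Name)
			}
		},
	})
}
```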
// ...
// <new>
// key is the podGroup name, value is the queued podGroup info.
podGroups map[string]*queuedPodGroup
// ...
// <new>
// key is the podGroup name, value is the queued podGroup info.
podGroups map[string]*queuedPodGroup
(applies elsewhere in the code snippets)
type PodGroupManager struct {
// ...
// <new>
// key is the podGroup name, value is the queued podGroup info.
key should be <namespace name>/<podGroup name>
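For the namespaced key, a tiny illustrative helper; the label key used below is an assumption about how pods reference their podGroup, not verified against the plugin's exported constants.

```go
// Sketch: derive the "<namespace>/<podGroup name>" cache key for a pod.
package coscheduling

import v1 "k8s.io/api/core/v1"

// podGroupLabel is assumed here; use the plugin's exported label constant in real code.
const podGroupLabel = "scheduling.x-k8s.io/pod-group"

// groupKeyForPod returns the cache key for the pod's podGroup, or false for a plain Pod.
func groupKeyForPod(pod *v1.Pod) (string, bool) {
	name := pod.Labels[podGroupLabel]
	if name == "" {
		return "", false
	}
	return pod.Namespace + "/" + name, true
}
```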
type queuedPodGroup struct {
// Timestamp is the podGroup's queued time.
// - timestamp will be initialized when the first pod enqueues
// - timestamp will be renewed when a pod re-enqueues for failing the scheduling
could you reword a bit?
do you mean "... renewed once a pod that belongs to a previously failed podGroup gets re-enqueued"?
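One way the two transitions could read, sketched as code; the struct shape and method names are illustrative only, not the proposal's final wording.

```go
// Sketch of the two timestamp transitions, on the queuedPodGroup itself for brevity.
package coscheduling

import "time"

type queuedPodGroup struct {
	// Timestamp orders the group in the scheduling queue:
	//  - initialized when the group's first pod is enqueued;
	//  - renewed once the group has failed a scheduling attempt as a unit
	//    and its pods are re-enqueued.
	Timestamp time.Time
}

// initIfNeeded records the first-enqueue time for the group.
func (q *queuedPodGroup) initIfNeeded(now time.Time) {
	if q.Timestamp.IsZero() {
		q.Timestamp = now
	}
}

// renewAfterGroupFailure pushes the group behind newly queued groups once it
// has failed a scheduling attempt as a unit (optionally adding a backoff).
func (q *queuedPodGroup) renewAfterGroupFailure(now time.Time, backoff time.Duration) {
	q.Timestamp = now.Add(backoff)
}
```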
the former podGroup was rebuilt, instead of waiting for the latter podGroup scheduling first, we hope the pod recover in prior,
although this may lead to the latter podGroup failed in scheduling.
Another risk is we'll initialize or renew the timestamp in the `Less()` function, which will lead to the performance degradation in queueing, |
it's a big no. We should not mutate timestamp in a function that relies on timestamp to do sorting. It'd cause unstable sorting result and make the sorting behavior unpredictable and hard to debug.
to the same podGroup will be placed together as a unit.
3. Schedule the pods of the podGroup one by one, if anyone fails in the scheduling cycle, we'll set the status=queueingFailed in PostFilter,
and the podGroup will enter into the backoff queue if configured.
4. When the failed Pod re-enqueues, we'll renew the queuedPodGroup timestamp based on the `queueingFailed` status and also set the status to `queueing`, the new timestamp should be `now+backoffTime` to avoid breaking the podGroup who starts scheduling
do you mean to renew during QueueSort()? If so, we shouldn't do that. See a comment below.
I'd suggest renewing the timestamp at the end, when a group of pods has failed as a unit, instead of at the beginning when the group starts its next scheduling attempt.
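Following that suggestion, a PostFilter sketch that records the group-wide failure, and renews the timestamp, at the end of the failed attempt, keeping `Less()` read-only. The `groupFailureRecorder` interface, `MarkGroupFailed`, `GroupKeyForPod`, and the backoff field are assumptions, and the PostFilter signature is the one from the scheduler framework version assumed here.

```go
// Sketch only: records a group-wide queueing failure in PostFilter instead of
// mutating the timestamp during QueueSort.
package coscheduling

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// groupFailureRecorder is the slice of the podGroup manager this sketch needs.
type groupFailureRecorder interface {
	// GroupKeyForPod returns "<namespace>/<podGroup name>", or false for a plain Pod.
	GroupKeyForPod(pod *v1.Pod) (string, bool)
	// MarkGroupFailed sets status=queueingFailed and renews the group's
	// timestamp to now+backoff.
	MarkGroupFailed(key string, now time.Time, backoff time.Duration)
}

type Coscheduling struct {
	pgMgr        groupFailureRecorder
	groupBackoff time.Duration
}

// PostFilter runs when a member pod fails the scheduling cycle: the whole
// group fails this attempt, so its timestamp is renewed once, here at the
// end, rather than in Less().
func (cs *Coscheduling) PostFilter(_ context.Context, _ *framework.CycleState, pod *v1.Pod,
	_ framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {
	if key, ok := cs.pgMgr.GroupKeyForPod(pod); ok {
		cs.pgMgr.MarkGroupFailed(key, time.Now(), cs.groupBackoff)
	}
	return &framework.PostFilterResult{}, framework.NewStatus(framework.Unschedulable,
		"podGroup failed to be scheduled as a unit")
}
```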
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind document
What this PR does / why we need it:
Which issue(s) this PR fixes:
xref: #658
Special notes for your reviewer:
Does this PR introduce a user-facing change?