-
Notifications
You must be signed in to change notification settings - Fork 712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382
base: master
Are you sure you want to change the base?
Conversation
Will review it later. /cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402 |
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/release-team, saileshd1402. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this @Doris-xm! I left some initial comments for you.
Btw, I recommend that you could learn more about the concept of Training V2. This will help you update the manifests overlay in training-operator and kubeflow/manifests:)
FYI: KubeCon 2024 NA Talk by Andrey and Yuki
resources: | ||
- trainjobs/status | ||
verbs: | ||
- get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- get | |
- get | |
Please add a blank line in the end of file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your careful review!
- name: training-operator-v2-webhook-cert | ||
namespace: kubeflow-system | ||
options: | ||
disableNameSuffixHash: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disableNameSuffixHash: true | |
disableNameSuffixHash: true | |
Same as above
kind: Kustomization | ||
namespace: kubeflow | ||
resources: | ||
- ../../base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There does not exist a kustomization.yaml
file under ../../base
dir. So we may need to reference those resources in subdir explicitly.
Like:
- ../../base/crds | |
- ../../base/manager | |
- ../../base/rbac | |
- ../../base/webhook |
resources: | ||
- ../../base | ||
- kubeflow-training-roles.yaml | ||
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments to make this item consistent with:
training-operator/manifests/v2/overlays/standalone/kustomization.yaml
Lines 10 to 11 in be2e29e
# TODO (andreyvelich): JobSet should support kubeflow-system namespace. | |
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering for Kubeflow Platform installation, should we actually deploy JobSet controller in the kubeflow
namespace ?
@kannon92 also made it possible: kubernetes-sigs/jobset#719
cc @kubeflow/wg-manifests-leads
metadata: | ||
name: kubeflow-training-admin-v2 | ||
labels: | ||
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This label will conflict with:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true" |
Please rename it to rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin-v2
to avoid conflicts.
aggregationRule: | ||
clusterRoleSelectors: | ||
- matchLabels: | ||
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-training-admin: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true" | ||
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-training-admin: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
metadata: | ||
name: kubeflow-training-view-v2 | ||
labels: | ||
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
- apiGroups: | ||
- "" | ||
resources: | ||
- events | ||
verbs: | ||
- get | ||
- list | ||
- watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen any event resources in Training V2. Maybe we should remove them
training-operator/pkg/controller.v2/trainjob_controller.go
Lines 69 to 72 in be2e29e
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs,verbs=get;list;watch;create;update;patch;delete | |
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/status,verbs=get;update;patch | |
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/finalizers,verbs=get;update;patch | |
training-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go
Lines 106 to 114 in be2e29e
// +kubebuilder:rbac:groups=kubeflow.org,resources=pytorchjobs,verbs=get;list;watch;create;update;patch;delete | |
// +kubebuilder:rbac:groups=kubeflow.org,resources=pytorchjobs/status,verbs=get;update;patch | |
// +kubebuilder:rbac:groups=kubeflow.org,resources=pytorchjobs/finalizers,verbs=update | |
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;create;update;patch;delete | |
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;delete | |
// +kubebuilder:rbac:groups=autoscaling,resources=horizontalpodautoscalers,verbs=get;list;watch;create;update;patch;delete | |
// +kubebuilder:rbac:groups=scheduling.volcano.sh,resources=podgroups,verbs=get;list;watch;create;update;patch;delete | |
// +kubebuilder:rbac:groups=scheduling.x-k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;delete | |
// +kubebuilder:rbac:groups="",resources=events,verbs=get;list;watch;create;update;patch;delete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL👀 @kubeflow/wg-training-leads. Do you think we should reserve events
resource in the manifests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review! It seems event
resources are required for EventRecorder
used in TrainJobReconciler
to log events related to training jobs.
training-operator/pkg/controller.v2/trainjob_controller.go
Lines 53 to 58 in be2e29e
type TrainJobReconciler struct { log logr.Logger client client.Client recorder record.EventRecorder runtimes map[string]jobruntimes.Runtime }
However, the event
resource permissions are not included in the kubebuilder
annotations in the trainjob_controller.go
training-operator/pkg/controller.v2/trainjob_controller.go
Lines 69 to 71 in be2e29e
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs,verbs=get;list;watch;create;update;patch;delete // +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/status,verbs=get;update;patch // +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/finalizers,verbs=get;update;patch
I'm not sure I understand it correctly. Appreciate your clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM! However, in controller v2, we haven't called r.recorder.Eventf()
yet. We just initialize it in:
training-operator/pkg/controller.v2/trainjob_controller.go
Lines 53 to 67 in be2e29e
type TrainJobReconciler struct { | |
log logr.Logger | |
client client.Client | |
recorder record.EventRecorder | |
runtimes map[string]jobruntimes.Runtime | |
} | |
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runtimes map[string]jobruntimes.Runtime) *TrainJobReconciler { | |
return &TrainJobReconciler{ | |
log: ctrl.Log.WithName("trainjob-controller"), | |
client: client, | |
recorder: recorder, | |
runtimes: runtimes, | |
} | |
} |
Maybe we could add events
resource's permission to the Kubeflow Manifests when we call the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, events will be required in the SDK, like for V1: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L1309
We haven't implemented it yet in V2, but we will do it in the future.
- clustertrainingruntimes | ||
- trainingruntimes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also edit trainingruntimes
and clustertrainingruntimes
in the Kubeflow platform. I think, their perssmission granularity should be the same with trainjobs
:)
/rerun-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM. Thank you for your great contributions!
/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great contribution @willb!
/assign @kubeflow/wg-manifests-leads @kubeflow/release-team
@@ -0,0 +1,20 @@ | |||
apiVersion: kustomize.config.k8s.io/v1beta1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we name this overlay: kubeflow-platform
to make it consistent with the naming: https://www.kubeflow.org/docs/started/introduction/#what-is-kubeflow-platform.
resources: | ||
- ../../base | ||
- kubeflow-training-roles.yaml | ||
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering for Kubeflow Platform installation, should we actually deploy JobSet controller in the kubeflow
namespace ?
@kannon92 also made it possible: kubernetes-sigs/jobset#719
cc @kubeflow/wg-manifests-leads
- apiGroups: | ||
- "" | ||
resources: | ||
- events | ||
verbs: | ||
- get | ||
- list | ||
- watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, events will be required in the SDK, like for V1: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L1309
We haven't implemented it yet in V2, but we will do it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM!
@@ -0,0 +1,20 @@ | |||
apiVersion: kustomize.config.k8s.io/v1beta1 | |||
kind: Kustomization | |||
namespace: kubeflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I ask a question? Since we have defined the namespace for manager and webhook:
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests. | |
namespace: kubeflow-system |
Does kubeflow
namespace take any effect here? WDYT👀 @kubeflow/wg-training-leads @Doris-xm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reviewing. It seems that namespace: kubeflow
in the overlay won't override namespace: kubeflow-system
defined for manager and webhook. I'm wondering if we should unify the namespace as kubeflow
when installing them within Kubeflow Platform, aligning with other modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should make our namespace configurable and use kubeflow
namespace for Kubeflow Platform installation for now.
By default the Kubeflow Training V2 will use the kubeflow-system
namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Doris-xm FYI, I think what @andreyvelich means is that we should address these TODO
:
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests. | |
namespace: kubeflow-system |
training-operator/manifests/v2/base/webhook/kustomization.yaml
Lines 13 to 14 in 1dfa40c
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests. | |
namespace: kubeflow-system |
And specify their namespace in overlays
and kubeflow-platform
dir. Because we need to make namesapce configurable and install these components in different namspace:
- Standalone
training-operator
: Install all components inkubeflow-system
namespace - Kubeflow Platform: Install all components in
kubeflow
namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your detailed explanation! I've moved the namespace configuration from base
to overlays
. Appreciate your time!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your great contribution and your patience!
I think, we should also add namespace configuration here:
apiVersion: kustomize.config.k8s.io/v1beta1 | |
kind: Kustomization | |
resources: |
apiVersion: kustomize.config.k8s.io/v1beta1 | |
kind: Kustomization | |
resources: |
to install these components in kubeflow-system
namespace.
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
…ace: kubeflow-system Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
9518f7b
to
1f1b0c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for updating this:)
/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @saileshd1402
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Electronic-Waste The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests. | ||
namespace: kubeflow-system |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How this is going to work since we don't copy JobSet manifests in the Kubeflow Training ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the JobSet manifest is referred from a remote URL, so the namespace
configuration in the base or overlays cannot override the JobSet manifests, I think.
From the changlog, it seems that the latest release hasn’t included the changes in kubernetes-sigs/jobset/#719. We may need to manually modify the namespace setting for JobSet. I'm not sure if I need to deal with it in this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich I tried it with:
$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml -n kubeflow-test
namespace/jobset-system serverside-applied
customresourcedefinition.apiextensions.k8s.io/jobsets.jobset.x-k8s.io serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-manager-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-metrics-reader serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-proxy-role serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-manager-rolebinding serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-proxy-rolebinding serverside-applied
mutatingwebhookconfiguration.admissionregistration.k8s.io/jobset-mutating-webhook-configuration serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/jobset-validating-webhook-configuration serverside-applied
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
It does not work. Maybe we should modify the jobset manifests to make namespace configurable.
And just a question, do we really need to install all components in the same namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recently @kannon92 made a PR to support other namespaces in JobSet: kubernetes-sigs/jobset#719.
Probably, we should use kubeflow
namespace to install the JobSet operator within Kubeflow Platform if @kubeflow/wg-training-leads or @kubeflow/wg-manifests-leads don't have any other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this PR seems not to be included in the current release of jobset:
kubernetes-sigs/jobset@release-6.0...v0.7.2
In the manifest, the namespace is hardcoded into jobset-system:
https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. For me, I just don't know how to overlay namespace items outside of the .metadata
field if we only specify namespace field in kustomization.yaml. Here is the example of which namespace field lies outside of the .metadata
field:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the var references, similar to this example. We will tell Kustomize how to override the namespace value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. That's what I want to do with the JobSet manifests!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Doris-xm So now we can wait for the next release of JobSet. And make some appropriate changes to the manifest after that:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for the insightful discussion! I’ve learned a lot from this. Much appreciated!
Pull Request Test Coverage Report for Build 12742381626Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
What this PR does / why we need it:
This PR adds the manifests overlay for Kubeflow Training V2, allowing to install it within Kubeflow Platform.
Which issue(s) this PR fixes :
Fixes #2381
Checklist: