Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Doris-xm
Copy link

@Doris-xm Doris-xm commented Jan 9, 2025

What this PR does / why we need it:

This PR adds the manifests overlay for Kubeflow Training V2, allowing to install it within Kubeflow Platform.

Which issue(s) this PR fixes :
Fixes #2381

Checklist:

  • Docs included if any changes are user facing

@Electronic-Waste
Copy link
Member

Will review it later.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402
/ok-to-test
/rerun-all

@google-oss-prow google-oss-prow bot requested a review from a team January 9, 2025 15:25
Copy link

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/release-team, saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Will review it later.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402
/ok-to-test
/rerun-all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this @Doris-xm! I left some initial comments for you.

Btw, I recommend that you could learn more about the concept of Training V2. This will help you update the manifests overlay in training-operator and kubeflow/manifests:)

FYI: KubeCon 2024 NA Talk by Andrey and Yuki

resources:
- trainjobs/status
verbs:
- get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- get
- get

Please add a blank line in the end of file

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your careful review!

- name: training-operator-v2-webhook-cert
namespace: kubeflow-system
options:
disableNameSuffixHash: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
disableNameSuffixHash: true
disableNameSuffixHash: true

Same as above

kind: Kustomization
namespace: kubeflow
resources:
- ../../base
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There does not exist a kustomization.yaml file under ../../base dir. So we may need to reference those resources in subdir explicitly.

Like:

- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/webhook

resources:
- ../../base
- kubeflow-training-roles.yaml
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments to make this item consistent with:

# TODO (andreyvelich): JobSet should support kubeflow-system namespace.
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering for Kubeflow Platform installation, should we actually deploy JobSet controller in the kubeflow namespace ?
@kannon92 also made it possible: kubernetes-sigs/jobset#719
cc @kubeflow/wg-manifests-leads

metadata:
name: kubeflow-training-admin-v2
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This label will conflict with:

rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"

Please rename it to rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin-v2 to avoid conflicts.

aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-training-admin: "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Comment on lines 19 to 20
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-training-admin: "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

metadata:
name: kubeflow-training-view-v2
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Comment on lines 59 to 66
- apiGroups:
- ""
resources:
- events
verbs:
- get
- list
- watch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen any event resources in Training V2. Maybe we should remove them

// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/finalizers,verbs=get;update;patch

// +kubebuilder:rbac:groups=kubeflow.org,resources=pytorchjobs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=kubeflow.org,resources=pytorchjobs/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=kubeflow.org,resources=pytorchjobs/finalizers,verbs=update
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;delete
// +kubebuilder:rbac:groups=autoscaling,resources=horizontalpodautoscalers,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=scheduling.volcano.sh,resources=podgroups,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=scheduling.x-k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=events,verbs=get;list;watch;create;update;patch;delete

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL👀 @kubeflow/wg-training-leads. Do you think we should reserve events resource in the manifests?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review! It seems event resources are required for EventRecorder used in TrainJobReconciler to log events related to training jobs.

type TrainJobReconciler struct {
log logr.Logger
client client.Client
recorder record.EventRecorder
runtimes map[string]jobruntimes.Runtime
}

However, the event resource permissions are not included in the kubebuilder annotations in the trainjob_controller.go

// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=kubeflow.org,resources=trainjobs/finalizers,verbs=get;update;patch

I'm not sure I understand it correctly. Appreciate your clarification.

Copy link
Member

@Electronic-Waste Electronic-Waste Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM! However, in controller v2, we haven't called r.recorder.Eventf() yet. We just initialize it in:

type TrainJobReconciler struct {
log logr.Logger
client client.Client
recorder record.EventRecorder
runtimes map[string]jobruntimes.Runtime
}
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runtimes map[string]jobruntimes.Runtime) *TrainJobReconciler {
return &TrainJobReconciler{
log: ctrl.Log.WithName("trainjob-controller"),
client: client,
recorder: recorder,
runtimes: runtimes,
}
}

Maybe we could add events resource's permission to the Kubeflow Manifests when we call the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, events will be required in the SDK, like for V1: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L1309
We haven't implemented it yet in V2, but we will do it in the future.

Comment on lines 37 to 38
- clustertrainingruntimes
- trainingruntimes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also edit trainingruntimes and clustertrainingruntimes in the Kubeflow platform. I think, their perssmission granularity should be the same with trainjobs:)

@Electronic-Waste
Copy link
Member

/rerun-all

@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Jan 10, 2025
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. Thank you for your great contributions!

/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this great contribution @willb!
/assign @kubeflow/wg-manifests-leads @kubeflow/release-team

@@ -0,0 +1,20 @@
apiVersion: kustomize.config.k8s.io/v1beta1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name this overlay: kubeflow-platform to make it consistent with the naming: https://www.kubeflow.org/docs/started/introduction/#what-is-kubeflow-platform.

resources:
- ../../base
- kubeflow-training-roles.yaml
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering for Kubeflow Platform installation, should we actually deploy JobSet controller in the kubeflow namespace ?
@kannon92 also made it possible: kubernetes-sigs/jobset#719
cc @kubeflow/wg-manifests-leads

Comment on lines 59 to 66
- apiGroups:
- ""
resources:
- events
verbs:
- get
- list
- watch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, events will be required in the SDK, like for V1: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L1309
We haven't implemented it yet in V2, but we will do it in the future.

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM!

@@ -0,0 +1,20 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask a question? Since we have defined the namespace for manager and webhook:

# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system

Does kubeflow namespace take any effect here? WDYT👀 @kubeflow/wg-training-leads @Doris-xm

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reviewing. It seems that namespace: kubeflow in the overlay won't override namespace: kubeflow-system defined for manager and webhook. I'm wondering if we should unify the namespace as kubeflow when installing them within Kubeflow Platform, aligning with other modules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should make our namespace configurable and use kubeflow namespace for Kubeflow Platform installation for now.
By default the Kubeflow Training V2 will use the kubeflow-system namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm FYI, I think what @andreyvelich means is that we should address these TODO:

# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system

# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system

And specify their namespace in overlays and kubeflow-platform dir. Because we need to make namesapce configurable and install these components in different namspace:

  1. Standalone training-operator: Install all components in kubeflow-system namespace
  2. Kubeflow Platform: Install all components in kubeflow namespace

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your detailed explanation! I've moved the namespace configuration from base to overlays. Appreciate your time!

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your great contribution and your patience!

I think, we should also add namespace configuration here:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:

to install these components in kubeflow-system namespace.

Xinmin Du added 5 commits January 13, 2025 15:16
Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
…ace: kubeflow-system

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
@Doris-xm Doris-xm force-pushed the add-overlay-manifest-v2 branch from 9518f7b to 1f1b0c2 Compare January 13, 2025 07:17
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for updating this:)

/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @saileshd1402

@google-oss-prow google-oss-prow bot added the lgtm label Jan 13, 2025
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Electronic-Waste
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines -3 to -4
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this is going to work since we don't copy JobSet manifests in the Kubeflow Training ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the JobSet manifest is referred from a remote URL, so the namespace configuration in the base or overlays cannot override the JobSet manifests, I think.

From the changlog, it seems that the latest release hasn’t included the changes in kubernetes-sigs/jobset/#719. We may need to manually modify the namespace setting for JobSet. I'm not sure if I need to deal with it in this pr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I tried it with:

$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml -n kubeflow-test

namespace/jobset-system serverside-applied
customresourcedefinition.apiextensions.k8s.io/jobsets.jobset.x-k8s.io serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-manager-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-metrics-reader serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-proxy-role serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-manager-rolebinding serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-proxy-rolebinding serverside-applied
mutatingwebhookconfiguration.admissionregistration.k8s.io/jobset-mutating-webhook-configuration serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/jobset-validating-webhook-configuration serverside-applied
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.

It does not work. Maybe we should modify the jobset manifests to make namespace configurable.

And just a question, do we really need to install all components in the same namespace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recently @kannon92 made a PR to support other namespaces in JobSet: kubernetes-sigs/jobset#719.
Probably, we should use kubeflow namespace to install the JobSet operator within Kubeflow Platform if @kubeflow/wg-training-leads or @kubeflow/wg-manifests-leads don't have any other suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this PR seems not to be included in the current release of jobset:

kubernetes-sigs/jobset@release-6.0...v0.7.2

In the manifest, the namespace is hardcoded into jobset-system:

https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml

Copy link
Member

@Electronic-Waste Electronic-Waste Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. For me, I just don't know how to overlay namespace items outside of the .metadata field if we only specify namespace field in kustomization.yaml. Here is the example of which namespace field lies outside of the .metadata field:

截屏2025-01-14 10 28 31

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the var references, similar to this example. We will tell Kustomize how to override the namespace value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. That's what I want to do with the JobSet manifests!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm So now we can wait for the next release of JobSet. And make some appropriate changes to the manifest after that:)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all for the insightful discussion! I’ve learned a lot from this. Much appreciated!

@andreyvelich andreyvelich changed the title Add the manifests overlay for Kubeflow Training V2 KEP-2170: Add the manifests overlay for Kubeflow Training V2 Jan 14, 2025
@coveralls
Copy link

Pull Request Test Coverage Report for Build 12742381626

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12718184264: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Add Installation for Kubeflow Platform
5 participants