Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support addtional filter for extended resource pod #961

Closed
wants to merge 2 commits into from

Conversation

fishingfly
Copy link

@fishingfly fishingfly commented Jul 29, 2024

This PR is related to issue #960

How to use

containers:
        - name: kured
          image: ghcr.io/kubereboot/kured:1.16.0
          command:
            - /usr/bin/kured
            - --reboot-sentinel=/sentinel/reboot-required
+          - --drain-resource-name-with-prefixes=nvidia.com/gpu,nvidia.com/mig

By using drain-resource-name-with-prefixes=nvidia.com/gpu,nvidia.com/mig flag, we can only evict pods with nvidia gpu resourece or mig device resource.

@fishingfly fishingfly changed the title Support addtionnal filter for extended resource pod Support addtional filter for extended resource pod Jul 30, 2024
@yeahdongcn
Copy link

This looks super useful for GPU clusters.

@evrardjp
Copy link
Collaborator

evrardjp commented Aug 7, 2024

So the current pod filter is completely not okay for you?

@fishingfly
Copy link
Author

So the current pod filter is completely not okay for you?

Our scenario is in a gpu cluster and we only want to evict all gpu pods before the machine is restarted. The current pod filter is based on Label. If we use the current pod filter, the modification we need to make is: add a fixed label to all gpu pods. We do not want to modify the user's pods.

@evrardjp
Copy link
Collaborator

evrardjp commented Aug 8, 2024

"We do not want to modify the user's pods." That's the key phrase. Got it now.

I need to ponder a bit on the code now. Can't say when I will be able to do this though, so I won't block any progress happening in the meantime.

I have no strong opinion about the fact it should be included or not. FYI, I am naturally leaning to not include new features for ease of maintenance in the long run (saying no is temporary, saying yes is definitive!) , especially if it's a feature can be circumvented by training users. But I understand your position and your case. If the code can be written in a way that's simple to maintain, that's all good.

@maratrixx
Copy link

Look forward to accepting it soon. thx ~

Copy link

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants