[SDK] Snapshot users' workspace into distributed TrainJob workload #2347

Open
andreyvelich opened this issue Dec 10, 2024 · 3 comments

@andreyvelich
Member

andreyvelich commented Dec 10, 2024

What you would like to be added?

As we discussed earlier, we want to design an approach to snapshot a user's workspace into a TrainJob (e.g., a distributed ML workload): #2324 (comment).
To achieve this, we plan to generate a unique TrainJob ID on the client side before submitting the TrainJob to the Kubernetes control plane.
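A minimal sketch of what client-side ID generation could look like, assuming the ID doubles as the TrainJob's Kubernetes object name and therefore has to be a valid DNS label (lowercase alphanumerics and `-`, at most 63 characters). The function name `generate_trainjob_id` and the prefix-plus-random-suffix scheme are hypothetical, not part of the SDK:

```python
import re
import uuid


def generate_trainjob_id(prefix: str = "trainjob") -> str:
    """Return a unique, DNS-label-safe ID (hypothetical helper, not an SDK API)."""
    # Keep only lowercase alphanumerics and dashes in the user-supplied prefix.
    safe = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-") or "trainjob"
    # Make the name start with a letter so it is valid under the stricter naming rules.
    if not safe[0].isalpha():
        safe = "tj-" + safe
    # 12 hex characters from a random UUID make collisions very unlikely.
    suffix = uuid.uuid4().hex[:12]
    # Truncate the prefix so the full name stays within 63 characters.
    max_prefix = 63 - len(suffix) - 1
    safe = safe[:max_prefix].rstrip("-")
    return f"{safe}-{suffix}"


if __name__ == "__main__":
    print(generate_trainjob_id("My Fine-Tuning Job"))
```

Because the ID is known before submission, the same value can be reused as the location of the workspace snapshot.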

During the KubeCon 2024 demo, we demonstrated how workspace snapshotting might work: https://youtu.be/Lgy4ir1AhYw?t=458.
In this demo, we pushed the Python code files to S3 and then loaded them into the TrainJob using initContainers.
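The flow from the demo can be sketched roughly as follows. This is an illustration under stated assumptions, not the demo's actual code: the helper names, the `snapshots/<id>/workspace.tar.gz` key layout, the container image, and the volume name are all hypothetical. The upload itself is left out; an S3 client call such as boto3's `upload_fileobj` could take the returned bytes.

```python
import io
import tarfile
from pathlib import Path


def snapshot_workspace(workspace: str) -> bytes:
    """Pack every *.py file under `workspace` into an in-memory tar.gz archive."""
    root = Path(workspace)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(root.rglob("*.py")):
            if path.is_file():
                # Store paths relative to the workspace root.
                tar.add(str(path), arcname=path.relative_to(root).as_posix())
    return buf.getvalue()


def snapshot_key(trainjob_id: str) -> str:
    """Derive the object key from the TrainJob ID (hypothetical layout)."""
    return f"snapshots/{trainjob_id}/workspace.tar.gz"


def build_init_container(bucket: str, key: str, mount_path: str = "/workspace") -> dict:
    """Build an initContainer spec that downloads the snapshot before training starts.

    The image and volume name are placeholders; any image that ships the AWS CLI
    would work with this command.
    """
    return {
        "name": "workspace-snapshot",
        "image": "example.com/aws-cli:latest",  # placeholder image
        "command": [
            "aws", "s3", "cp",
            f"s3://{bucket}/{key}",
            f"{mount_path}/workspace.tar.gz",
        ],
        "volumeMounts": [{"name": "workspace", "mountPath": mount_path}],
    }
```

The training container would mount the same volume and unpack the archive before running the user's code.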

However, we can also consider other approaches, for instance:

  • Using a distributed cache.
  • Using kubectl cp.
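For the `kubectl cp` option, the SDK would need to copy the workspace into each training Pod once it is running. A rough sketch of building that command, assuming the Pod and container names are already known (the helper name and all values below are hypothetical, and `kubectl cp` also requires `tar` to be present in the target container):

```python
from typing import List, Optional


def kubectl_cp_command(
    local_path: str,
    pod: str,
    remote_path: str,
    namespace: str = "default",
    container: Optional[str] = None,
) -> List[str]:
    """Build the argv for copying a local path into a Pod with `kubectl cp`."""
    cmd = ["kubectl", "cp", local_path, f"{namespace}/{pod}:{remote_path}"]
    if container:
        # Select a specific container when the Pod runs more than one.
        cmd += ["-c", container]
    return cmd


if __name__ == "__main__":
    # Hypothetical Pod and container names, for illustration only.
    print(" ".join(kubectl_cp_command("./src", "my-trainjob-node-0", "/workspace/src",
                                      namespace="kubeflow", container="trainer")))
```

The resulting list can be passed to `subprocess.run`. Unlike the initContainer approach, this only works after the Pods have started, so the training command would have to wait for the copy to finish.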

Why is this needed?

This should streamline the experience of data scientists working with the Kubeflow Training Python SDK.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@andreyvelich
Member Author

cc @shravan-achar @akshaychitneni

@shravan-achar

Does this require a KEP?

@andreyvelich
Member Author

Yeah, we need to create a KEP, since we might require API changes for this feature.
