
[Feature] kuberay's helm chart should include redis + fault tolerance by default. #2691

Closed
1 of 2 tasks
kanwang opened this issue Dec 27, 2024 · 6 comments
Labels
enhancement (New feature or request), triage

Comments

@kanwang

kanwang commented Dec 27, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

I was chatting with Michael and Pradeep, and I think this is one of the things that would make an early-stage Ray POC easier. I know Ray removed Redis as the default option some time ago, but for us, successfully running Ray for the first time and then seeing it restart in the middle of training was not a great experience. We should probably consider including Redis + fault tolerance as part of the Helm chart and making it easier for users to enable.

Use case

I think it would be a great experience to have fault tolerance enabled by default (or at least to have an option to enable it), especially in a Kubernetes environment.
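For reference, this is roughly what enabling GCS fault tolerance on a RayCluster looks like by hand today, which the chart could template behind a single values flag. A minimal sketch only: the cluster name, image/Ray version, and the `redis:6379` address are placeholders, and it assumes an external Redis service is already deployed in the same namespace (Redis auth omitted).

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft            # hypothetical name
  annotations:
    ray.io/ft-enabled: "true"        # enable GCS fault tolerance
spec:
  rayVersion: "2.9.0"                # placeholder version
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            env:
              # Point the GCS at an external Redis so cluster metadata
              # survives a head-pod restart. Assumes a Redis Service named
              # "redis" exposing port 6379 in the same namespace.
              - name: RAY_REDIS_ADDRESS
                value: "redis:6379"
```

A values flag in the chart could toggle both the annotation and a bundled Redis, which is essentially what this issue is asking for.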

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kanwang added the enhancement (New feature or request) and triage labels on Dec 27, 2024
@kevin85421
Member

I think it would be a great experience to have fault tolerance enabled by default (or at least to have an option to enable it), especially in a Kubernetes environment.

Did you use Ray Serve?

@kanwang
Author

kanwang commented Dec 30, 2024

Yes, we use Ray Serve (RayService), but we have other use cases as well (using RayJob or RayCluster directly).

For this feature, I think it's important for RayService but also for RayCluster. When we started our Ray POC, we began with a RayCluster and ran distributed training jobs on it directly, so when the head restarted, all workloads were interrupted and we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.). We have since been able to address this, but it was a pretty bad POC experience: losing all job history when you don't yet know much about Ray makes people doubt whether Ray is the right choice.

@kevin85421
Member

kevin85421 commented Dec 30, 2024

When we started our Ray POC, we began with a RayCluster and ran distributed training jobs on it directly, so when the head restarted, all workloads were interrupted and we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.).

Ray currently lacks a robust solution for GCS fault tolerance in workloads other than Ray Serve. It's not recommended to use GCS fault tolerance with other workloads for now (ref).


we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.).

@nikitavemuri is working on the persistent dashboard. @nikitavemuri, would you mind sharing some GitHub issues or design docs for @kanwang to take a look? Thanks!

@kanwang
Author

kanwang commented Dec 30, 2024

@nikitavemuri is working on the persistent dashboard

that's great to know!

It's not recommended to use GCS fault tolerance with other workloads for now (ref).

Oh, interesting. Mind sharing any further details on why that is the case? For some production use cases I think we can definitely use RayJob, and cluster reliability is less of a concern there. For other use cases, we still want to use a pre-configured cluster and share it across multiple jobs. Is there any recommendation for that, e.g. something that lets us retry failed jobs when the failure was caused by a head/cluster restart?
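To illustrate the RayJob path I mean above, here is a rough sketch of the per-job pattern, where each job gets its own short-lived cluster so a head restart only affects that one job. The name, entrypoint, image, and sizes are placeholders, not something we run today.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sweep-job-example                 # hypothetical name
spec:
  entrypoint: python train.py --lr 0.01   # placeholder entrypoint
  shutdownAfterJobFinishes: true          # tear the cluster down when the job ends
  ttlSecondsAfterFinished: 600
  rayClusterSpec:
    rayVersion: "2.9.0"                   # placeholder version
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
    workerGroupSpecs:
      - groupName: workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
```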

@kevin85421
Member

For other use cases, we still want to use a pre-configured cluster and share it across multiple jobs. Is there any recommendation for that?

Currently, Ray does not recommend multiple users sharing the same Ray cluster. For example, Ray currently does not natively support rolling upgrades or priority scheduling. Additionally, our tests for GCS FT mainly focus on integration with Ray Serve, so I am not sure whether it works well with other workloads.

@kevin85421
Member

Closing this one because GCS FT is currently only for Ray Serve. @kanwang, feel free to DM me on Ray Slack if you have any further questions!
