
[Feature] kuberay's helm chart should include redis + fault tolerance by default. #2691

Closed
1 of 2 tasks
kanwang opened this issue Dec 27, 2024 · 6 comments
Labels
enhancement (New feature or request), triage

Comments

@kanwang

kanwang commented Dec 27, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

I was chatting with Michael and Pradeep, and I think this is one of the things that would make an early-stage Ray POC easier. I know Ray removed Redis as the default option some time ago, but for us, successfully running Ray for the first time and then seeing it restart in the middle of training was not a great experience. We should probably consider including Redis + fault tolerance as part of the Helm chart and making it easier for users to enable.

Use case

I think it would be a great experience to have fault tolerance enabled by default (or at least to have an option to enable it), especially in a Kubernetes environment.
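For reference, this is roughly what enabling GCS fault tolerance on a RayCluster looks like by hand today, which the chart could template behind a single values flag. A minimal sketch only: the cluster name, image/Ray version, and the `redis:6379` address are placeholders, and it assumes an external Redis service is already deployed in the same namespace (Redis auth omitted).

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft            # hypothetical name
  annotations:
    ray.io/ft-enabled: "true"        # enable GCS fault tolerance
spec:
  rayVersion: "2.9.0"                # placeholder version
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            env:
              # Point the GCS at an external Redis so cluster metadata
              # survives a head-pod restart. Assumes a Redis Service named
              # "redis" exposing port 6379 in the same namespace.
              - name: RAY_REDIS_ADDRESS
                value: "redis:6379"
```

A values flag in the chart could toggle both the annotation and a bundled Redis, which is essentially what this issue is asking for.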

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kanwang added the enhancement (New feature or request) and triage labels on Dec 27, 2024
@kevin85421
Member

I think it would be a great experience to have fault tolerance enabled by default (or at least to have an option to enable it), especially in a Kubernetes environment.

Did you use Ray Serve?

@kanwang
Author

kanwang commented Dec 30, 2024

Yes, we use Ray Serve (RayService), but we have other use cases as well (using RayJob or RayCluster directly).

For this feature, I think it's important for RayService but also for RayCluster. When we started our Ray POC, we began with a RayCluster and ran distributed training jobs on it directly, so when the head restarted, all workloads were interrupted and we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.). We have since been able to address this, but it was a pretty bad POC experience: losing all job history when you don't yet know much about Ray makes people doubt whether Ray is the right choice.

@kevin85421
Member

kevin85421 commented Dec 30, 2024

When we started our Ray POC, we began with a RayCluster and ran distributed training jobs on it directly, so when the head restarted, all workloads were interrupted and we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.).

Ray currently lacks a robust solution for GCS fault tolerance in workloads other than Ray Serve. It's not recommended to use GCS fault tolerance with other workloads for now (ref).


we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.).

@nikitavemuri is working on the persistent dashboard. @nikitavemuri, would you mind sharing some GitHub issues or design docs for @kanwang to take a look? Thanks!

@kanwang
Author

kanwang commented Dec 30, 2024

@nikitavemuri is working on the persistent dashboard

that's great to know!

It's not recommended to use GCS fault tolerance with other workloads for now (ref).

Oh, interesting. Mind sharing any further details on why that is the case? For some production use cases I think we can definitely use RayJob, and cluster reliability is less of a concern there. For other use cases, we still want to use a pre-configured cluster and share it across multiple jobs. Is there any recommendation for that, e.g. something that lets us retry failed jobs when the failure was caused by a head/cluster restart?
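To illustrate the RayJob path I mean above, here is a rough sketch of the per-job pattern, where each job gets its own short-lived cluster so a head restart only affects that one job. The name, entrypoint, image, and sizes are placeholders, not something we run today.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sweep-job-example                 # hypothetical name
spec:
  entrypoint: python train.py --lr 0.01   # placeholder entrypoint
  shutdownAfterJobFinishes: true          # tear the cluster down when the job ends
  ttlSecondsAfterFinished: 600
  rayClusterSpec:
    rayVersion: "2.9.0"                   # placeholder version
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
    workerGroupSpecs:
      - groupName: workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
```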

@kevin85421
Member

For other use cases, we still want to use a pre-configured cluster and share it across multiple jobs. Is there any recommendation for that?

Currently, Ray does not recommend multiple users sharing the same Ray cluster. For example, Ray currently does not natively support rolling upgrades or priority scheduling. Additionally, our tests for GCS FT mainly focus on integration with Ray Serve, so I am not sure whether it works well with other workloads.

@kevin85421
Member

Closing this one because GCS FT is currently only for Ray Serve. @kanwang, feel free to DM me on Ray Slack if you have any further questions!
