[Feature] kuberay's helm chart should include redis + fault tolerance by default. #2691
Comments
Did you use Ray Serve?
Yes, we use Ray Serve (RayService), but we have other use cases as well (using RayJob or RayCluster directly). I think this feature is important for RayService, but also for RayCluster. When we started our Ray POC, we began with RayCluster and ran distributed training jobs directly on it, so whenever the head restarted, all workloads were interrupted and we lost historical information about past jobs (job arguments for parameter sweeps, job logs, job metrics, etc.). We have been able to address these issues by now, but it made for a pretty bad POC experience: losing all job history when you don't yet know Ray well makes people doubt whether Ray is the right choice.
Ray currently lacks a robust solution for GCS fault tolerance in workloads other than Ray Serve. It's not recommended to use GCS fault tolerance with other workloads for now (ref).
@nikitavemuri is working on the persistent dashboard. @nikitavemuri, would you mind sharing some GitHub issues or design docs for @kanwang to take a look at? Thanks!
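For anyone landing on this thread, here is a minimal sketch of what wiring up GCS fault tolerance for the cluster behind a RayService roughly looks like today. It is illustrative only: it assumes an external Redis reachable at `redis:6379`, and the exact fields (the `ray.io/ft-enabled` annotation versus the newer `gcsFaultToleranceOptions` field) depend on your KubeRay version, so check the KubeRay docs for your release.

```yaml
# Illustrative sketch only: enabling GCS fault tolerance on the RayCluster behind a RayService.
# Assumes an external Redis reachable at redis:6379; verify field names against your KubeRay version
# (newer KubeRay releases also provide a spec.gcsFaultToleranceOptions field instead of the annotation).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft
  annotations:
    ray.io/ft-enabled: "true"            # opt this cluster into GCS fault tolerance
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            env:
              - name: RAY_REDIS_ADDRESS  # GCS persists its state to this Redis instance
                value: "redis:6379"
```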
That's great to know!
Oh, interesting. Mind sharing more details on why that is the case? For some production use cases, we can definitely use RayJob, and cluster reliability is less of a concern there. For other use cases, we still want to use a pre-configured cluster and share it across multiple jobs. Is there any recommendation for that, e.g. some way to retry failed jobs when the failure was caused by a head/cluster restart?
Currently, Ray does not recommend multiple users sharing the same Ray cluster. For example, Ray currently does not natively support rolling upgrades or priority scheduling. Additionally, our tests for GCS FT mainly focus on integration with Ray Serve, so I am not sure whether it works well with other workloads.
Closing this one because GCS FT is currently only for Ray Serve. @kanwang, feel free to DM me on Ray Slack if you have any further questions!
Search before asking
Description
I was chatting with Michael and Pradeep, and I think this is one of the things that would make an early-stage Ray POC easier. I know Ray removed Redis as the default option some time ago, but for us, successfully running Ray for the first time and then seeing it restart in the middle of training was not a great experience. Should we consider including Redis + fault tolerance as part of the Helm chart, and making it easier for users to enable?
Use case
I think it would be a great experience to have fault tolerance enabled by default (or at least to have an option to enable it), especially in a Kubernetes environment.
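As an illustration of the request, here is a hypothetical `values.yaml` fragment for the KubeRay Helm chart. The `gcsFaultTolerance` key and the bundled Redis option shown below do not exist in the current chart; they are only a sketch of what an opt-in switch could look like.

```yaml
# Hypothetical values.yaml sketch -- these keys are NOT in the current kuberay Helm chart;
# they illustrate what an opt-in "Redis + GCS fault tolerance" switch might look like.
gcsFaultTolerance:
  enabled: true                  # would add ray.io/ft-enabled and RAY_REDIS_ADDRESS to the head pod
  redis:
    deploy: true                 # optionally deploy a small Redis alongside the cluster
    address: ""                  # or point at an existing external Redis, e.g. "redis:6379"
    passwordSecret:
      name: redis-password-secret
      key: password
```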
Related issues
No response
Are you willing to submit a PR?