
[Dashboard] Decoupling dashboard and dashboard lifetime from Ray Cluster #46444

Open · Superskyyy opened this issue Jul 5, 2024 · 4 comments

Labels: dashboard (Issues specific to the Ray Dashboard), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@Superskyyy (Contributor) commented Jul 5, 2024

Description

With Ray starting to support the virtual cluster (vCluster) concept, and with advanced multi-cluster-per-user setups appearing, the Ray dashboard components should no longer be bound to a single Ray cluster's lifetime: that coupling makes multi-tenant sharing and telemetry data persistence complex to implement. It also means the dashboard goes down together with the head node (fate-sharing), making it difficult to backtrack what happened (and what was executing) during a major incident. @liuxsh9 @Bye-legumes @nemo9cby

Use case

Decoupling would bring the following benefits:

  1. The dashboard can optionally read from a persistent history server (an observability database) instead of pulling directly from a running GCS, with GCS/HA Redis writing to the persistence store (see the sketch after this list).
  2. Dashboard-side overhead can no longer accidentally bring down the head node.
  3. Users can attach their own external monitoring platforms, the same way as the job dashboard, to manage large numbers of clusters.
  4. Each user gets their own dashboard, which can span multiple physical clusters or vClusters.
  5. The dashboard remains available even after a cluster is preempted or shut down.
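A minimal sketch of what benefit 1 could look like, assuming a pluggable read path behind the dashboard. All class and method names below are hypothetical illustrations, not existing Ray internals:

```python
# Hypothetical sketch only: the dashboard depends on a narrow data-source
# interface, so a persisted history store can stand in for the live GCS
# and the dashboard can outlive the cluster. None of these names exist
# in Ray today.
from abc import ABC, abstractmethod
from typing import Dict, List


class DashboardDataSource(ABC):
    """Where the dashboard gets cluster state from."""

    @abstractmethod
    def list_jobs(self) -> List[Dict]: ...

    @abstractmethod
    def list_tasks(self, job_id: str) -> List[Dict]: ...


class LiveGcsSource(DashboardDataSource):
    """Today's behavior: pull directly from the running GCS."""

    def __init__(self, gcs_address: str):
        self.gcs_address = gcs_address

    def list_jobs(self) -> List[Dict]:
        raise NotImplementedError("query the live GCS")

    def list_tasks(self, job_id: str) -> List[Dict]:
        raise NotImplementedError("query the live GCS")


class HistoryServerSource(DashboardDataSource):
    """Proposed: read events that GCS/HA Redis already persisted to a
    store, so the data survives head-node failure."""

    def __init__(self, store_uri: str):
        self.store_uri = store_uri  # placeholder persistence endpoint

    def list_jobs(self) -> List[Dict]:
        raise NotImplementedError("query the persistence store")

    def list_tasks(self, job_id: str) -> List[Dict]:
        raise NotImplementedError("query the persistence store")
```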
Superskyyy added the enhancement and triage labels on Jul 5, 2024
@Bye-legumes (Contributor) commented Jul 5, 2024

Similar issue: #45940.
Maybe we can decouple it in this way, so that we get persistent storage and the dashboard can control the tasks:
[architecture diagram attached; not rendered here]
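The diagram itself isn't visible here, but the "dashboard can control the tasks" direction resembles what Ray's HTTP Job Submission SDK already exposes, so a decoupled dashboard process could drive any cluster over that boundary. A sketch of the idea (the address is a placeholder, and this is not necessarily the design in the diagram):

```python
# Sketch: a dashboard living outside the cluster controls jobs over HTTP
# via Ray's Job Submission SDK, instead of running inside the head node.
# The address is a placeholder for whichever cluster is being managed.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://head-node.example.com:8265")

job_id = client.submit_job(
    entrypoint="python my_script.py",
    runtime_env={"working_dir": "./"},
)
print(client.get_job_status(job_id))
client.stop_job(job_id)  # the "control" direction of the arrow
```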

anyscalesam added the dashboard label on Jul 8, 2024
@yucai (Contributor) commented Jul 11, 2024

Should we have an abstraction layer in front of the database, so that different DB solutions can be used?
@anyscalesam, @Bye-legumes, I heard ByteDance already has a solution for this; kindly share it with me if you have one. Thanks a lot!
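One possible shape for that abstraction layer, as a sketch: the dashboard/history server depends on a narrow interface, and each DB solution plugs in behind it. The names are hypothetical; only a local SQLite backend is fleshed out below as an illustration, with Redis/MySQL/etc. implementing the same two methods:

```python
# Hypothetical abstraction-layer sketch; these names are not Ray APIs.
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class EventStore(ABC):
    """Narrow interface the dashboard/history server would depend on."""

    @abstractmethod
    def write_event(self, cluster_id: str, event: Dict) -> None: ...

    @abstractmethod
    def read_events(self, cluster_id: str) -> Iterator[Dict]: ...


class SqliteEventStore(EventStore):
    """Minimal single-node backend; other DB solutions would implement
    the same interface behind the dashboard."""

    def __init__(self, path: str = "events.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events (cluster_id TEXT, body TEXT)"
        )

    def write_event(self, cluster_id: str, event: Dict) -> None:
        self.conn.execute(
            "INSERT INTO events VALUES (?, ?)", (cluster_id, json.dumps(event))
        )
        self.conn.commit()

    def read_events(self, cluster_id: str) -> Iterator[Dict]:
        rows = self.conn.execute(
            "SELECT body FROM events WHERE cluster_id = ?", (cluster_id,)
        )
        return (json.loads(body) for (body,) in rows)
```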

@anyscalesam (Contributor) commented Jul 15, 2024

Let's grab time to chat more about this. cc @alanwguo

UPDATE: focus on getting the Export API working first, which is the natural prerequisite to this. REP in progress with @MissiontoMars @nikitavemuri

anyscalesam added the P1 label and removed the triage label on Jul 19, 2024
@andremoeller commented Dec 6, 2024

We're running into a similar problem, but without vCluster. We're using KubeRay to run RayJobs, which use an ephemeral RayCluster for the duration of the job. After the job finishes, the cluster is destroyed, so the dashboard is only accessible while the job is running, which isn't especially useful. We'd have to keep the cluster and head node alive just to have the dashboard up for each job.

We'd prefer to be able to dump dashboard state to Google Cloud Storage, so that job and task information stays accessible after the run (a stopgap sketch follows below).

Related: ray-project/kuberay#2615
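Until something like this lands, one workaround we've considered: snapshot task/actor state with Ray's state API as the last step of the RayJob entrypoint, before KubeRay tears the cluster down, and upload it to a bucket. This assumes the google-cloud-storage client; the bucket name and object path are placeholders, and the exact fields returned depend on the Ray version:

```python
# Workaround sketch, not a supported feature: dump queryable state before
# the ephemeral RayCluster goes away. Bucket/object names are placeholders.
import dataclasses
import json

import ray
from google.cloud import storage  # pip install google-cloud-storage
from ray.util.state import list_actors, list_tasks

ray.init()  # running inside the RayJob entrypoint, on the live cluster

snapshot = {
    # TaskState/ActorState are dataclasses in recent Ray releases;
    # adjust the serialization if that changes in your version.
    "tasks": [dataclasses.asdict(t) for t in list_tasks(limit=10_000)],
    "actors": [dataclasses.asdict(a) for a in list_actors(limit=10_000)],
}

bucket = storage.Client().bucket("my-ray-history")  # placeholder bucket
bucket.blob("rayjob-example/state.json").upload_from_string(
    json.dumps(snapshot, default=str)
)
```

This only preserves whatever the state API can still see at teardown time, so it's a far cry from a real history server, but it keeps post-mortem job/task information around.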
