-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use HTTP probes for Ray readiness and liviness probes #2360
base: master
Are you sure you want to change the base?
Conversation
@@ -271,7 +256,7 @@ func initLivenessAndReadinessProbe(rayContainer *corev1.Container, rayNodeType r | |||
SuccessThreshold: utils.DefaultLivenessProbeSuccessThreshold, | |||
FailureThreshold: utils.DefaultLivenessProbeFailureThreshold, | |||
} | |||
rayContainer.LivenessProbe.Exec = &corev1.ExecAction{Command: []string{"bash", "-c", strings.Join(commands, " && ")}} | |||
rayContainer.LivenessProbe.HTTPGet = &corev1.HTTPGetAction{Path: healthCheckPath, Port: healthCheckPort} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using HTTP probes means we can only query 1 endpoint per probe now. For head pod this would /api/gcs_healthz
and for worker pod it would be api/local_raylet_healthz
. I'm not sure if not health checking api/local_raylet_healthz
in the head pod is problematic, it would depend on what whether /api/gcs_healthz
incorporates raylet health in some way as well
We also face this issue when the workload is high. |
@andrewsykim do we still need this PR after #2353 has been merged? |
I think we should still consider use of HTTP probes, they are significantly ligher weight. I haven't root caused the issue I'm seeing, but increasing the timeout did not fully resolve the issue I'm seeing where exec probes cause high load |
I have encountered some bizarre behavior with the exec probes that I think would be solved with http probes. The biggest issue may actually be a k8s bug though. The
|
I'm still in favor of this change and I would like to see it merged for v1.3. However, using http probes means we can only probe 1 HTTP endpoint per container. Specifically for the Head pod, it means probing only the dashboard endpoint and not the raylet agent. Are w okay with that change? @kevin85421 @joshhvulcan |
I think an http probe on the dashboard would be sufficient for the failures we have experienced. |
865a01a
to
e7cdb70
Compare
Signed-off-by: Andrew Sy Kim <[email protected]>
e7cdb70
to
d106ab9
Compare
PR updated, PTAL |
A side benefit is that this does not force custom images to include wget |
Consider consolidating health check endpoints in Ray Core |
Why are these changes needed?
HTTP probes are considered lighter-weight than exec probes. However, exec probes have the advantage of doing multiple health checks. In KubeRay, we use exec probes to execute "wget" commands against multiple endpoints. Use of exec probes seems to be causing some issues, as shown in #2264 and from KubeRay scalability testing.
This PR explores using HTTP probes instead. This PR needs more consideration as using HTTP probes means we can only health check 1 end point per probe. Marking WIP for now until that quesiton is resolved.
Related issue number
Checks