[Feature]: Continuous health checks on services with reactions like pod-replacements #1349

devidw · 2024-06-21T13:43:32Z

Problem

The only reliable way to know that a service is healthy is to test it, i.e. to run a health check that performs a minimal version of the task the service is supposed to do.

Since this is an end-to-end test, it's a good indicator that the service is really healthy and can take load.

If something goes wrong in the network, the health check will fail. Standard restart policies that only trigger on container exit do not account for this kind of failure.
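
As a rough illustration of such a check, here is a minimal sketch of an HTTP probe that treats network errors the same way as a bad response. The function name `is_healthy`, the `opener` parameter, and the choice of "any 2xx status means healthy" are assumptions made for this example; none of this is an existing dstack API.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if the service answers the request with an HTTP 2xx status.

    `opener` is injectable so the probe can be exercised without a real network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts, and HTTP error
        # statuses raised by urlopen: failures that a restart-on-exit policy
        # never sees because the container itself keeps running.
        return False
```

A real check would send a request that exercises the service's actual task (for example a tiny inference call) rather than just fetching a URL.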

Solution

Because of this, it would be extremely helpful to have health check support in dstack, along with configuration options for how to react to changes in health status.

To react, it would help to have a config option for how many failed checks mark a service as unhealthy, for example 3 failed checks.

Then one reaction could be to try to restart the pod

Another reaction could be to remove the pod and replace it with a new one

Basically the idea is to always ensure the configured number of replicas is really healthy
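
To make the idea more concrete, the failure threshold and the two reactions could be combined roughly like this. This is only a sketch: `ReplicaHealth`, `Reaction`, and `max_restarts` are hypothetical names, and both counting consecutive failures and trying a restart before a replacement are assumptions, not something dstack provides or that this issue specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Reaction(Enum):
    NONE = "none"
    RESTART = "restart"
    REPLACE = "replace"


@dataclass
class ReplicaHealth:
    """Tracks health check results for a single replica (hypothetical helper)."""

    failure_threshold: int = 3   # failed checks in a row before reacting
    max_restarts: int = 1        # restarts to attempt before replacing
    consecutive_failures: int = 0
    restarts: int = 0

    def record(self, healthy: bool) -> Reaction:
        """Record one check result and return the reaction it triggers."""
        if healthy:
            self.consecutive_failures = 0
            return Reaction.NONE
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return Reaction.NONE
        # Threshold reached: start counting again after reacting.
        self.consecutive_failures = 0
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return Reaction.RESTART
        return Reaction.REPLACE
```

With the defaults, three failed checks in a row trigger a restart, and if the same replica later fails three more checks in a row it is replaced. A replacement replica would start with a fresh `ReplicaHealth`.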

Workaround

I created https://github.com/devidw/gingo to perform health checks and then restart, add, or remove pods based on the health status of pods in a configured cluster.

It can be extended by writing other connectors; currently it only has a RunPod one.
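
For illustration only, a connector-based design along these lines might look like the sketch below. This is not gingo's actual interface: `Connector`, `InMemoryConnector`, and `reconcile` are hypothetical names invented for this example, and the in-memory connector stands in for a real provider.

```python
from typing import Callable, List, Protocol


class Connector(Protocol):
    """Hypothetical provider interface; a real one would call a cloud API."""

    def list_pods(self) -> List[str]: ...
    def add_pod(self) -> str: ...
    def remove_pod(self, pod_id: str) -> None: ...


class InMemoryConnector:
    """Toy connector that keeps pods in a list, for demonstration only."""

    def __init__(self, pods: List[str]) -> None:
        self.pods = list(pods)
        self._next = len(self.pods)

    def list_pods(self) -> List[str]:
        return list(self.pods)

    def add_pod(self) -> str:
        pod_id = f"pod-{self._next}"
        self._next += 1
        self.pods.append(pod_id)
        return pod_id

    def remove_pod(self, pod_id: str) -> None:
        self.pods.remove(pod_id)


def reconcile(connector: Connector, desired: int,
              is_unhealthy: Callable[[str], bool]) -> None:
    """Remove unhealthy pods, then add pods until `desired` are present."""
    for pod_id in connector.list_pods():
        if is_unhealthy(pod_id):
            connector.remove_pod(pod_id)
    while len(connector.list_pods()) < desired:
        connector.add_pod()


connector = InMemoryConnector(["pod-0", "pod-1", "pod-2"])
reconcile(connector, desired=3, is_unhealthy=lambda pod_id: pod_id == "pod-1")
print(connector.list_pods())  # ['pod-0', 'pod-2', 'pod-3']
```

The sketch only scales up to the desired count; it does not remove surplus healthy pods.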

Would you like to help us implement this feature by sending a PR?

No

peterschmidt85 · 2024-06-21T13:46:56Z

@devidw, gonna discuss this with the team next week and get back to you with an update on when we can support this. Stay tuned.

peterschmidt85 · 2024-07-22T01:51:49Z

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 · 2024-08-05T01:52:04Z

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

github-actions · 2024-10-08T01:58:40Z

This issue is stale because it has been open for 30 days with no activity.

github-actions · 2024-11-08T01:57:44Z

This issue is stale because it has been open for 30 days with no activity.

github-actions · 2024-12-09T02:07:01Z

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 · 2025-01-13T14:26:35Z

Is duplicated by #2181

devidw added the feature label Jun 21, 2024

peterschmidt85 mentioned this issue Jun 24, 2024

[Roadmap] Q3 2024 #1350

Closed

42 tasks

peterschmidt85 added the stale label Jul 22, 2024

peterschmidt85 closed this as not planned Aug 5, 2024

peterschmidt85 reopened this Sep 7, 2024

peterschmidt85 removed the stale label Sep 7, 2024

peterschmidt85 mentioned this issue Oct 3, 2024

[Roadmap] Q4 2024 #1782

Closed

49 tasks

github-actions bot added the stale label Oct 8, 2024

peterschmidt85 removed the stale label Oct 8, 2024

github-actions bot added the stale label Nov 8, 2024

peterschmidt85 removed the stale label Nov 8, 2024

github-actions bot added the stale label Dec 9, 2024

peterschmidt85 removed the stale label Dec 9, 2024

r4victor added the no-stale label Dec 23, 2024

peterschmidt85 closed this as completed Jan 13, 2025
