Frontend health should not depend on Database #529
base: main
Conversation
The health of a service should not depend on the upstream services it depends on. If CosmosDB is not reachable, the frontend would be scaled down, which could cause an even more catastrophic failure.
We are using the default handler, which requires using the default registry everywhere. This was discovered because Gauge metrics were not registered and did not show up in the metrics output. The tests are also adapted to use a global metrics emitter, to avoid panics caused by re-registering metrics in the global registry.
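For context, a minimal sketch of the "default handler plus default registry" pattern described above, assuming the frontend uses prometheus/client_golang; the metric name frontend_health, its values, and the port are placeholders rather than the repository's actual names:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// frontendHealth lives in the default registry, which is the same registry
// promhttp.Handler() serves, so the gauge shows up in the /metrics output.
var frontendHealth = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "frontend_health",
	Help: "Health of the frontend: 1 healthy, 0.5 degraded, 0 unhealthy.",
})

func init() {
	// MustRegister panics if the same collector is registered twice, which is
	// why tests have to share one global emitter instead of re-registering.
	prometheus.MustRegister(frontendHealth)
}

func main() {
	frontendHealth.Set(1.0)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}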
Could you help me understand the motivation behind the changes in this file? Not strongly opposed, just curious about the background.
I just wanted to learn more about how the frontend is implemented and stumbled over that part.
Not depending on databases and other external services for health comes from experience: it can cause cascading failures, which can be even harder to recover from.
I can see that in the context of multiple microservices, but I think a ReadinessProbe defined against a database connection test is normal usage: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes
We don't want this pod to accept any requests if it cannot connect to the database.
It's personal preference then. I prefer pods to keep running and to handle errors in the application rather than have them restart when the database connection has issues.
Feel free to discard and close this PR.
This health check is used for the readiness and liveness probes alike, right?
Liveness: including a DB check in a liveness probe is not the right thing to do, in my opinion. It will rarely solve the issue if the DB has a problem, and it can have a negative impact on the DB.
Readiness: what is the expectation of ARM towards an RP? Is it preferable to have the RP answering with an error, or not answering at all? If the DB is not working, it would affect all pods, and a readiness probe including a DB check would empty the endpoint list of the service. I prefer a service to remain accessible and answer with a proper error message, but I'm not going to die on that hill.
tl;dr: I think Jan's change is the right thing to do. We can revisit whether the liveness / readiness probes can/should check different things; a sketch of splitting them follows below.
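A minimal sketch of splitting the two probes, assuming plain net/http; the Frontend stub, its fields, the /healthz and /readyz routes, and the port are placeholders standing in for the repository's real types:

package main

import (
	"context"
	"log"
	"net/http"
)

// Frontend is a stand-in for the real frontend type; the readiness and DB
// checks mirror the shape of CheckReady / DBConnectionTest in the quoted diff.
type Frontend struct {
	ready  func() bool
	dbPing func(ctx context.Context) error
}

// Liveness only proves the process is up and serving HTTP; no dependency checks.
func (f *Frontend) Liveness(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// Readiness gates on internal state only; a failing DB ping is logged (and could
// drive a metric) but does not take the pod out of the Service endpoints.
func (f *Frontend) Readiness(w http.ResponseWriter, r *http.Request) {
	if !f.ready() {
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	if err := f.dbPing(r.Context()); err != nil {
		log.Printf("database test failed: %v", err)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	f := &Frontend{
		ready:  func() bool { return true },
		dbPing: func(ctx context.Context) error { return nil },
	}
	http.HandleFunc("/healthz", f.Liveness)
	http.HandleFunc("/readyz", f.Readiness)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

The readiness handler still runs the DB test so the failure stays visible in logs and metrics, but a database outage no longer empties the Service's endpoint list.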
Also: I'm not sure how we are going to define scrape targets for Prometheus. In general, Prometheus will not scrape non-ready pods when using a ServiceMonitor, as they will not show up in the endpoint list of a service.
@tony-schndr what scrape-config approach are we going to leverage?
@geoberle I'm not certain at this time; I have only had success with Prometheus annotations using Istio's metric merging feature. ServiceMonitor can't connect because Azure Managed Prometheus does not have the certificate to satisfy strict mTLS. I'm going to raise this with the team after recharge/f2f; maybe there is something I'm missing.
I'm ok to move the DB check into a startup probe and definitely agree the liveness probe has no need to check the database.
I can see how the RP responding with a 500/Internal Server Error is preferable and we can alert on that.
Agree with the other reviewers here: we should never validate dependencies in order to serve /healthz endpoints. Customers should get a 5xx, not a tcp i/o connection timeout, when our database is down, since the liveness probe determines whether we are an Endpoint in the Service (if I am understanding our architecture correctly?).
dbConErr := f.dbClient.DBConnectionTest(request.Context())
if !f.CheckReady() {
	// The frontend itself is not ready: report unhealthy.
	writer.WriteHeader(http.StatusInternalServerError)
	healthStatus = 0.0
} else if dbConErr != nil {
	// The database is unreachable: stay healthy, but log and report a degraded metric.
	writer.WriteHeader(http.StatusOK)
	f.logger.Error(fmt.Sprintf("Database test failed: %v", dbConErr))
	healthStatus = 0.5
} else {
	writer.WriteHeader(http.StatusOK)
	healthStatus = 1.0
}
This actually doesn't solve the cascading error either.
We need to turn this problem on its head: we should adopt a pattern of alerting on the Frontend generally for high rates of 5xx responses on all endpoints. /healthz should be "dumb" IMO, because it is structurally relevant for both kube (RestartPolicy) and TCP (Service Endpoint) configuration. If our golang server is online (that is, the PID is up and our service is capable of handling TCP traffic), we should return 200 here, with no ifs or other logic. To clarify what I mean: is it valuable for us to spam the kube-apiserver to kick our Frontend Pods when the DB is down? Is it valuable for us to return tcp: connection i/o timeout on the Frontend when the DB is down? I think we can agree: no.
So what do we do instead? Using middleware to expose metrics on all requests and responses regardless of endpoint will enable us to determine "oh, half our endpoints are returning 5xx? Interesting - are those the endpoints that rely on OCM? DB? EventGrid? Entra? etc" and we can triage from there. I think that implementing https://github.com/slok/go-http-metrics (or similar middleware) is a great place for us to start tracking these requests/responses/error rates in a general way.
TL;DR - I'm +1 to remove all validation on /healthz - it should unconditionally return 200. It is a test of our service's PID/TCP configuration, not of our dependencies or of the accuracy/correctness of our service.
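A minimal sketch of that direction, assuming the slok/go-http-metrics middleware linked above with its documented Prometheus recorder (the routes and port are placeholders): every request and response, 5xx included, is recorded regardless of endpoint, while /healthz stays unconditional.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	metricsprom "github.com/slok/go-http-metrics/metrics/prometheus"
	"github.com/slok/go-http-metrics/middleware"
	middlewarestd "github.com/slok/go-http-metrics/middleware/std"
)

func main() {
	// Record request/response metrics (duration, size, status code) for every
	// endpoint into the default Prometheus registry.
	mdlw := middleware.New(middleware.Config{
		Recorder: metricsprom.NewRecorder(metricsprom.Config{}),
	})

	mux := http.NewServeMux()
	// "Dumb" health endpoint: unconditionally 200 while the process is serving traffic.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.Handle("/metrics", promhttp.Handler())

	// Wrap the whole mux so per-endpoint 5xx rates can be scraped and alerted on.
	log.Fatal(http.ListenAndServe(":8080", middlewarestd.Handler("", mdlw, mux)))
}

With something like this in place, alerts can be driven by the recorded status-code labels per handler rather than by the health endpoint itself.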
What this PR does
Jira:
Link to demo recording:
Special notes for your reviewer