ROX-17469: implemented sli/alerts for central api latencies #117

pepedocs · 2023-07-20T09:31:44Z

What: Implemented SLI/SLO/Alerts for Central HTTP/GRPC APIs latencies.

The implementation is as follows:

SLIs

A static SLI rule central:xxx:rate10m:p90:sli is created for the 1h and 28d window per API mainly for the error budget exhaustion alerts mentioned below.
SLIs for monitoring purposes can be simply derived from this SLI if desired.

Alerts

An error budget exhaustion/consumption alert is created per threshold (e.g. 90%, 70%, 50%) per API.
An error budget burn rate alert with threshold 50% is created per API.

Additional Info

Alerts will ignore SLIs that do not have enough samples so it won't fire for new central instances.
Long-running GRPC APIs are ignored from the SLI.
Some of the HTTP API paths are ignored from the SLI due to the fact that they do not meet the < 100ms criteria for most of the time. We can either include them and increase the 100ms threshold to 400ms as I've seen them reach 300ms+ or create another SLI for them. I prefer the first option if desired.

path!~"/api/extensions/scannerdefinitions|/api/graphql|/sso/|/|/api/cli/download/"

I have intentionally skipped the GraphQL API latencies as they need to be profiled first before we can create an SLI for them. Looking at the graphs, most of their requests do not even have enough data points therefore making the histogram_quantile query useless. I believe they need to be handled differently, so for now I've left them out for another story/ticket.

Todo for this ticket (waiting for suggestions)

Some of the HTTP API paths are ignored from the SLI due to the fact that they do not meet the < 100ms criteria

Todo for another ticket

Implement GRPC API latency SLI/SLO

@kylape @stehessel Can you please review?

stehessel

Separate SLIs for each API type.

I think this is fine, but we should be careful with increasing the SLI cardinality, otherwise we will end up with a huge matrix of SLI metrics. I agree with skipping the p99 SLI for this reason, although I remember that others felt strongly about these (+ even more like p100 ...).

Alerting

On a more general note, I would like to see the alerting based on error budgets and burn rates - see central:slo:availability:error_budget_exhaustion and central:slo:availability:burnrate1h as examples. I see two strategies on how to achieve that for latency:

Time based SLI

The SLO period (28 days) is divided into time chunks, in which the SLI is either met or violated. This is how we calculate the availability SLI. In the case of availability, it means if the service is determined to be up or down. For latency, we could use 10 minute time chunks similar to what you have defined, and count how much time the SLI is met. This allows us to transcribe the latency into the "9 notation":

99% of the time, the 10 minute latency p90 must be < 100ms. We are currently at 99.5% over the last month. The error budget is at 50%.

As a side note, we could even roll the latency SLI into the availability SLI. The interpretation would be: "If the service latency was too high for the last 10 minutes, it was unavailable." However, I would argue against this because that is not how availability is defined in our SLAs, and it would complicate our metrics definition.

Counter based SLI

We measure p90 over the entire SLO period of 28 days. The transcription would be:

p90 must be < 100ms. We are currently at p90=50ms over the last 28 days. The error budget is at 50%.