
ROX-17469: implemented sli/alerts for central api latencies #117

Open
pepedocs wants to merge 1 commit into master from latency-slis

Conversation


@pepedocs pepedocs commented Jul 20, 2023

What: Implemented SLIs/SLOs/alerts for Central HTTP/gRPC API latencies.

The implementation is as follows:

SLIs

  • A static SLI recording rule, central:xxx:rate10m:p90:sli, is created for the 1h and 28d windows per API, mainly for the error budget exhaustion alerts mentioned below (see the sketch after this list).
  • SLIs for monitoring purposes can simply be derived from this rule if desired.
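
A rough sketch of the gRPC SLI recording rule, simplified from the diff further down (the full rule also excludes long-running methods):

```yaml
# Simplified from this PR's diff: the SLI is 1 when the 10-minute p90 latency
# of Central's gRPC handlers is below 100ms, and 0 otherwise.
- expr: |
    (
      histogram_quantile(0.9,
        sum by(le, namespace, grpc_service, grpc_method) (
          rate(grpc_server_handling_seconds_bucket{container="central"}[10m])
        )
      ) > 0
    ) < bool 0.1
  record: central:grpc_server_handling_seconds:rate10m:p90:sli
```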

Alerts

  • An error budget exhaustion/consumption alert is created per threshold (e.g. 90%, 70%, 50%) per API.
  • An error budget burn rate alert with a threshold of 50% is created per API (see the sketch after this list).
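
A sketch of how the burn rate rule and alert fit together (the recording rule is copied from the diff below; the alert expression and the gRPC burn-rate record name are illustrative assumptions):

```yaml
# Burn rate over the last hour, expressed relative to the 1% error budget
# (recording rule taken from the diff below).
- expr: |
    central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_rate1h / 0.01
  record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:burn_rate1h

# Illustrative alert shape only; the gRPC burn-rate record name and the exact
# threshold expression are assumptions, not the literal rules from this PR.
- alert: Central latency burn rate for GRPC API
  expr: |
    central:grpc_server_handling_seconds:rate10m:p90:burn_rate1h > 0.5
  annotations:
    message: "Latency burn rate for central's GRPC API. Current burn rate per hour: {{ $value | humanize }}."
```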

Additional Info

  1. Alerts ignore SLIs that do not have enough samples, so they won't fire for new Central instances.
  2. Long-running gRPC APIs are excluded from the SLI.
  3. Some of the HTTP API paths are excluded from the SLI because they do not meet the < 100ms criterion most of the time. We can either include them and raise the 100ms threshold to 400ms (I've seen them reach 300ms+), or create another SLI for them. I prefer the first option.
  • path!~"/api/extensions/scannerdefinitions|/api/graphql|/sso/|/|/api/cli/download/"
  4. I have intentionally skipped the GraphQL API latencies, as they need to be profiled before we can create an SLI for them. Looking at the graphs, most of their requests do not even have enough data points, which makes the histogram_quantile query useless. I believe they need to be handled differently, so for now I've left them out for another story/ticket.

Todo for this ticket (waiting for suggestions)

  1. Decide how to handle the HTTP API paths that are excluded from the SLI because they do not meet the < 100ms criterion (see Additional Info above).

Todo for another ticket

  1. Implement GraphQL API latency SLI/SLO

@kylape @stehessel Can you please review?

Contributor

@stehessel stehessel left a comment


Separate SLIs for each API type.

I think this is fine, but we should be careful with increasing the SLI cardinality, otherwise we will end up with a huge matrix of SLI metrics. I agree with skipping the p99 SLI for this reason, although I remember that others felt strongly about these (+ even more like p100 ...).

Alerting

On a more general note, I would like to see the alerting based on error budgets and burn rates - see central:slo:availability:error_budget_exhaustion and central:slo:availability:burnrate1h as examples. I see two strategies on how to achieve that for latency:

  1. Time based SLI

The SLO period (28 days) is divided into time chunks, in which the SLI is either met or violated. This is how we calculate the availability SLI. In the case of availability, it means if the service is determined to be up or down. For latency, we could use 10 minute time chunks similar to what you have defined, and count how much time the SLI is met. This allows us to transcribe the latency into the "9 notation":

99% of the time, the 10 minute latency p90 must be < 100ms. We are currently at 99.5% over the last month. The error budget is at 50%.
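
A minimal sketch of how this could be evaluated, assuming the binary 10-minute SLI recorded in this PR:

```promql
# Fraction of 10-minute windows over the SLO period in which the p90 latency
# target was met (the SLI is 1 when met, 0 when violated); compare against 0.99.
avg_over_time(central:grpc_server_handling_seconds:rate10m:p90:sli[28d])
```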

As a side note, we could even roll the latency SLI into the availability SLI. The interpretation would be: "If the service latency was too high for the last 10 minutes, it was unavailable." However, I would argue against this because that is not how availability is defined in our SLAs, and it would complicate our metrics definition.

  2. Counter based SLI

We measure p90 over the entire SLO period of 28 days. The transcription would be:

p90 must be < 100ms. We are currently at p90=50ms over the last 28 days. The error budget is at 50%.
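
Roughly, assuming the same gRPC histogram used in this PR:

```promql
# Hypothetical counter-based variant: the same p90 query as the SLI rule, but
# computed over the full 28-day SLO window instead of 10-minute chunks.
histogram_quantile(0.9,
  sum by(le, namespace, grpc_service, grpc_method) (
    rate(grpc_server_handling_seconds_bucket{container="central"}[28d])
  )
)
```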

Other random thoughts

  • The Central traffic is actually quite low in many cases (<1 rps). In some cases it may also be zero (e.g. no sensor is connected, nobody is using the UI, ...). This makes it hard to get good statistics for the percentiles.
  • Looking at the GraphQL latencies, I unfortunately don't think 100ms is a reasonable target. I believe that is because some of the UI queries actually scale with the data (e.g. the number of deployments in the cluster), so they may take quite some time. As far as I know the UI is rendered synchronously. Obviously hanging UIs are a bad user experience, but with regard to SLIs, that may be expected for some queries. Perhaps you could talk to the UI team (https://redhat-internal.slack.com/archives/C7ERNFL0M) about what the general latency expectation is.
  • The gRPC metrics as defined include internal endpoints (e.g. PingService), which are lumped in with actual Sensor/roxctl requests. We may choose to ignore this for simplicity, but it's something to be aware of when interpreting the data. Concretely, I expect this to skew the latencies toward the lower end.

@pepedocs pepedocs force-pushed the latency-slis branch 2 times, most recently from 560fdc6 to 842cc9a on July 28, 2023 11:11
@pepedocs pepedocs marked this pull request as ready for review August 3, 2023 10:30
@pepedocs pepedocs requested a review from a team as a code owner August 3, 2023 10:30

# - Queries the error budget exhaustion (or consumption) for the whole slo window (28d).
- expr: |
(1 - central:grpc_server_handling_seconds:rate10m:p90:sli) / 0.01
Contributor


Can you define a scalar recording rule for the target (0.01 / 0.99)? That allows us to change the value in a single place.
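
For example, something along these lines (the record name is a hypothetical placeholder):

```yaml
# Hypothetical single-place definition of the latency error budget target.
- expr: vector(0.01)
  record: central:slo:latency:p90:error_budget_target

# The exhaustion expression could then reference it instead of the literal 0.01:
# (1 - central:grpc_server_handling_seconds:rate10m:p90:sli)
#   / on() group_left() central:slo:latency:p90:error_budget_target
```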

- alert: Central latency burn rate for GRPC API
annotations:
message: "Latency burn rate for central's GRPC API. Current burn rate per hour: {{ $value | humanize }}."
expr: |
Contributor

@stehessel stehessel Aug 23, 2023


Why did you choose 0.5 as the burn rate threshold? That seems very low to me. Note that by definition, a burn rate of 1 means that the full error budget will be consumed after 28 days.

For slow burns we already have the alerts based on the total consumption. I'd keep this one for high burn alerts.
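
(For illustration: at a constant burn rate B the 28-day budget lasts 28d / B, so a sustained burn rate of 2 exhausts it after 14 days, while a sustained 0.5 only consumes half the budget over the whole window.)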

@@ -454,6 +454,78 @@ spec:
)
record: central:sli:availability:extended_avg_over_time28d

# - Queries the 90th percentile of central's handled GRPC/HTTP API requests latencies over the last 10 minutes.
- expr: |
(histogram_quantile(0.9, sum by(le, namespace, grpc_service, grpc_method) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1
Contributor


TIL, I didn't know about < bool. We can probably use that elsewhere as well to simplify the PromQL.
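
For reference, a minimal illustration of the bool modifier:

```promql
# Without "bool" a comparison filters the vector; with "bool" it returns 0/1.
up < 1        # keeps only series where up == 0 (with their original value)
up < bool 1   # returns 1 for series where up == 0, and 0 otherwise
```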

# - Queries the error rate, i.e. the ratio of central:xxx:rate10m:p90 instances that were
# equal to 0 to the total number of central:xxx:rate10m:p90 instances within a period.
- expr: |
1 - central:grpc_server_handling_seconds:rate10m:p90:sli
Contributor


Is there a reason you changed the order in the naming compared to existing metrics? E.g. central:grpc_server_handling_seconds:rate10m:p90:sli vs central:sli:grpc_server_handling_seconds:rate10m:p90.

central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_rate1h / 0.01
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:burn_rate1h

# - A sample count filter that ignores central:xxx:rate10m:p90 instances that have fewer samples than the expected sample count.
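
(The filter expression itself is not shown in this excerpt; a hypothetical sketch of such a filter, assuming a 1h window evaluated at 10-minute resolution, i.e. 6 expected samples:)

```promql
# Hypothetical sample count filter: keep only SLI series that have the expected
# number of samples in the window (6 x 10-minute evaluations per hour).
central:grpc_server_handling_seconds:rate10m:p90:sli
  and
count_over_time(central:grpc_server_handling_seconds:rate10m:p90:sli[1h]) >= 6
```
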
Contributor


How does the filter treat periods where there is no incoming traffic (and the base metrics are therefore undefined)?

@@ -533,6 +605,11 @@ spec:
severity: critical
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.rhacs_instance_id }}"
rhacs_org_name: "{{ $labels.rhacs_org_name }}"
Contributor


We can't add these labels here in general, because the values originate in Central, and if Central itself is down, they don't exist.

labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
Contributor


The instance id is not the same as the namespace. The namespace is rhacs-{instance_id}.
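
(One possible way to derive it inside the rule expression, assuming the rhacs-{instance_id} convention above:)

```promql
# Hypothetical sketch: strip the "rhacs-" prefix from the namespace label to
# obtain the instance id.
label_replace(
  central:grpc_server_handling_seconds:rate10m:p90:sli,
  "rhacs_instance_id", "$1", "namespace", "rhacs-(.*)"
)
```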

@@ -454,6 +454,78 @@ spec:
)
record: central:sli:availability:extended_avg_over_time28d

# - Queries the 90th percentile of central's handled GRPC/HTTP API requests latencies over the last 10 minutes.
- expr: |
(histogram_quantile(0.9, sum by(le, namespace, grpc_service, grpc_method) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1
Contributor


The incoming gRPC calls are already very sparse for most Centrals. I think we should consider consolidating them if they roughly do the same thing latency-wise. So I would sum here over grpc_service / grpc_method.
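
For example, something like this (a sketch of the suggested consolidation, not a tested expression):

```promql
# Aggregate across gRPC services/methods before taking the percentile, so the
# p90 covers all (non-excluded) gRPC traffic per Central.
histogram_quantile(0.9,
  sum by(le, namespace) (
    rate(grpc_server_handling_seconds_bucket{container="central"}[10m])
  )
)
```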

@stehessel
Contributor

In general looks good to me, but I think we should plan some time to observe the metrics in stage and maybe iterate the values if needed.
