ROX-17469: implemented sli/alerts for central api latencies #117
Conversation
Separate SLIs for each API type.
I think this is fine, but we should be careful about increasing the SLI cardinality; otherwise we will end up with a huge matrix of SLI metrics. I agree with skipping the p99 SLI for this reason, although I remember that others felt strongly about these (and even more extreme ones like p100 ...).
Alerting
On a more general note, I would like to see the alerting based on error budgets and burn rates; see central:slo:availability:error_budget_exhaustion and central:slo:availability:burnrate1h as examples. I see two strategies for achieving that for latency (rough sketches of both follow the list):
- Time-based SLI
The SLO period (28 days) is divided into time chunks, in which the SLI is either met or violated. This is how we calculate the availability SLI; in the case of availability, it means whether the service is determined to be up or down. For latency, we could use 10 minute time chunks similar to what you have defined, and count for how much of the time the SLI is met. This allows us to transcribe the latency into the "9 notation":
99% of the time, the 10 minute latency p90 must be < 100ms. We are currently at 99.5% over the last month. The error budget is at 50%.
As a side note, we could even roll the latency SLI into the availability SLI. The interpretation would be: "If the service latency was too high for the last 10 minutes, it was unavailable." However, I would argue against this, because that is not how availability is defined in our SLAs, and it would complicate our metrics definition.
- Counter-based SLI
We measure p90 over the entire SLO period of 28 days. The transcription would be:
p90 must be < 100ms. We are currently at p90=50ms over the last 28 days. The error budget is at 50%.
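Rough recording-rule sketches of both options, with a 99% / 100ms target assumed and hypothetical record names, just to make the trade-off concrete:

```yaml
# Option 1, time-based (sketch): fraction of 10m windows over 28d in which the
# p90 target was met, reusing the existing 0/1 latency SLI.
- expr: |
    avg_over_time(central:grpc_server_handling_seconds:rate10m:p90:sli[28d])
  record: central:slo:grpc_latency:compliance28d
# Error budget exhaustion against a 99% "time met" objective.
- expr: |
    (1 - central:slo:grpc_latency:compliance28d) / (1 - 0.99)
  record: central:slo:grpc_latency:error_budget_exhaustion

# Option 2, counter-based (sketch): p90 computed over the whole 28d SLO window.
# In practice this would probably be built from shorter-range recording rules
# rather than a raw 28d rate.
- expr: |
    histogram_quantile(0.9, sum by (le, namespace) (
      rate(grpc_server_handling_seconds_bucket{container="central"}[28d])
    ))
  record: central:grpc_server_handling_seconds:rate28d:p90
```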
Other random thoughts
- The Central traffic is actually quite low in many cases (<1 rps). In some cases it may also be zero (e.g. no sensor is connected, nobody is using the UI, ...). This makes it hard to get good statistics for the percentiles.
- Looking at the graphql latencies, I don't think 100ms is a reasonable target, unfortunately. I believe that is because some of the UI queries actually scale with the data (e.g. the number of deployments in the cluster), so they may take quite some time. As far as I know, the UI is rendered synchronously. Obviously hanging UIs are a bad user experience, but with regard to SLIs, that may be expected for some queries. Perhaps you could talk to the UI team (https://redhat-internal.slack.com/archives/C7ERNFL0M) about what the general latency expectation is.
- The grpc metrics as defined include internal endpoints (e.g. PingService), which are lumped in with actual Sensor/roxctl requests. We may choose to ignore this for simplicity, but it is something to be aware of when interpreting the data. Concretely, I expect this to skew the latencies toward the lower end; a possible service-level filter is sketched below.
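For reference, one way to drop the internal traffic at the service level rather than per method (a sketch only; the exact grpc_service values are assumptions and would need to be checked against real label values):

```
rate(grpc_server_handling_seconds_bucket{
  container="central",
  grpc_service!~".*PingService|.*MetadataService"
}[10m])
```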
  # - Queries the error budget exhaustion (or consumption) for the whole slo window (28d).
  - expr: |
      (1 - central:grpc_server_handling_seconds:rate10m:p90:sli) / 0.01
Can you define a scalar recording rule for the target (0.01 / 0.99)? That allows us to change the value in a single place.
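Something along these lines, perhaps (a sketch; the record names here are hypothetical):

```yaml
# Single place to define the latency SLO target.
- expr: |
    vector(0.99)
  record: central:slo:latency:p90_target
# Exhaustion rules can then reference the target instead of a hard-coded 0.01.
- expr: |
    (1 - central:grpc_server_handling_seconds:rate10m:p90:sli)
      / (1 - scalar(central:slo:latency:p90_target))
  record: central:grpc_server_handling_seconds:rate10m:p90:error_budget_exhaustion
```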
  - alert: Central latency burn rate for GRPC API
    annotations:
      message: "Latency burn rate for central's GRPC API. Current burn rate per hour: {{ $value | humanize }}."
    expr: |
Why did you choose 0.5 as the burn rate threshold? That seems very low to me. Note that, by definition, a burn rate of 1 means that the full error budget will be consumed after 28 days.
For slow burns we already have the alerts based on the total consumption; I'd keep this one for high-burn alerts.
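For comparison, a high-burn sketch (the alert name and the 14x threshold are assumptions, loosely following multi-window burn-rate guidance: 14x sustained over 1h burns roughly 2% of a 28d budget; it also assumes a gRPC burn-rate rule mirroring the HTTP rate10m:p90:burn_rate1h one):

```yaml
- alert: CentralGRPCLatencyHighBurnRate
  expr: |
    central:grpc_server_handling_seconds:rate10m:p90:burn_rate1h > 14
  for: 5m
  labels:
    severity: critical
  annotations:
    message: "High latency burn rate for central's GRPC API: {{ $value | humanize }}x per hour."
```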
@@ -454,6 +454,78 @@ spec:
      )
    record: central:sli:availability:extended_avg_over_time28d

  # - Queries the 90th percentile of central's handled GRPC/HTTP API requests latencies over the last 10 minutes.
  - expr: |
      (histogram_quantile(0.9, sum by(le, namespace, grpc_service, grpc_method) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1
TIL, I didn't know about < bool. We can probably use that elsewhere as well to simplify the PromQL.
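For anyone else who hadn't seen it: the bool modifier turns a comparison into a 0/1 result instead of a filter, e.g. (assuming a plain p90 recording rule like the one below exists):

```
central:grpc_server_handling_seconds:rate10m:p90 < 0.1        # filters: keeps only series under 100ms
central:grpc_server_handling_seconds:rate10m:p90 < bool 0.1   # maps: 1 if under 100ms, 0 otherwise
```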
  # - Queries the error rate or the ratio of the instances of central:xxx:rate10m:p90
  #   were equal to 0 over the total instances of central:xxx:rate10m:p90 within a period.
  - expr: |
      1 - central:grpc_server_handling_seconds:rate10m:p90:sli
Is there a reason you changed the order in the naming compared to the existing metrics? E.g. central:grpc_server_handling_seconds:rate10m:p90:sli vs central:sli:grpc_server_handling_seconds:rate10m:p90.
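For illustration, the same rule with the order flipped to match the existing central:sli:* metrics (naming only; the expression is taken unchanged from this PR):

```yaml
- expr: |
    (histogram_quantile(0.9, sum by(le, namespace, grpc_service, grpc_method) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1
  record: central:sli:grpc_server_handling_seconds:rate10m:p90
```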
      central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_rate1h / 0.01
    record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:burn_rate1h

  # - A sample count filter that ignores central:xxx:rate10m:p90 instances that has samples less than the expected sample count.
How does the filter treat periods where there is no incoming traffic (and the base metrics are therefore undefined)?
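One way to pin the intended behaviour down explicitly would be to count "no traffic" as "met", e.g. (a sketch; the up{job="central"} selector and the aggregation to the namespace level are assumptions):

```
# 1 per namespace if every method-level SLI was met; falls back to 1 for
# namespaces where Central is scraped but no latency series exists at all.
min by (namespace) (central:grpc_server_handling_seconds:rate10m:p90:sli)
  or
(sum by (namespace) (up{job="central"}) * 0 + 1)
```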
@@ -533,6 +605,11 @@ spec:
      severity: critical
      namespace: "{{ $labels.namespace }}"
      rhacs_instance_id: "{{ $labels.rhacs_instance_id }}"
      rhacs_org_name: "{{ $labels.rhacs_org_name }}"
We can't add these labels here in general, because the values originate in Central, and if Central itself is down, they don't exist.
    labels:
      service: central
      namespace: "{{ $labels.namespace }}"
      rhacs_instance_id: "{{ $labels.namespace }}"
The instance id is not the same as the namespace. The namespace is rhacs-{instance_id}.
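If the id should still be attached at alert time, it could presumably be derived from the namespace instead of copied verbatim, e.g. (a sketch assuming the rhacs-{instance_id} convention holds for all instances):

```yaml
    labels:
      service: central
      namespace: "{{ $labels.namespace }}"
      # Strip the rhacs- prefix to recover the instance id from the namespace.
      rhacs_instance_id: '{{ $labels.namespace | reReplaceAll "^rhacs-" "" }}'
```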
  - expr: |
      (histogram_quantile(0.9, sum by(le, namespace, grpc_service, grpc_method) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1
The incoming gRPC calls are already very sparse for most Centrals. I think we should consider consolidating them if they roughly do the same thing latency-wise, so I would sum here over grpc_service/grpc_method.
In general this looks good to me, but I think we should plan some time to observe the metrics in stage and iterate on the values if needed.
What: Implemented SLI/SLO/alerts for Central HTTP/gRPC API latencies.
The implementation is as follows:
SLIs
- central:xxx:rate10m:p90:sli is created for the 1h and 28d windows per API, mainly for the error budget exhaustion alerts mentioned below.
Alerts
- Error budget exhaustion alerts at (90%, 70%, 50%) per API.
Additional Info
- The HTTP SLI excludes the following paths: path!~"/api/extensions/scannerdefinitions|/api/graphql|/sso/|/|/api/cli/download/"
Todo for this ticket (waiting for suggestions)
Todo for another ticket
@kylape @stehessel Can you please review?