Kubernetes Pod not healthy rules is not precise #217

yiyu0x · 2021-05-06T08:27:18Z

In section 5.1.17, "Kubernetes Pod not healthy", of this page, the description reads "Pod has been in a non-ready state for longer than 15 minutes." and the rule below it is:

expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0

But I think the correct rule is:

expr: sum_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) == 15

Iskaldr · 2022-02-14T09:17:11Z

I agree.
The current rule unfortunately also fires when a freshly (re-)deployed pod takes longer than 1 minute to become ready. In that case the subquery [15m:1m] contains only a single sample, for that one minute, with value 1, and that alone is enough to satisfy min_over_time(...) > 0.

The proposed rule ensures that the pod has existed for the full 15 minutes and prevents the rule from firing prematurely.
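
To make the difference concrete, here is a rough Python sketch that mimics how the two expressions treat a 15-minute window of one-minute samples. It is only a simulation, not PromQL: the helper names are hypothetical, `None` stands for "no sample yet" (the pod did not exist), and it assumes exactly 15 evaluation steps fall inside the window. How many steps Prometheus really places in a [15m:1m] subquery depends on step alignment, which this sketch does not model.

```python
from typing import Optional, Sequence

Sample = Optional[int]  # None = no sample at that step (pod did not exist yet)


def min_over_time(samples: Sequence[Sample]) -> Optional[int]:
    """Minimum over the samples that exist; missing steps are ignored."""
    present = [s for s in samples if s is not None]
    return min(present) if present else None


def sum_over_time(samples: Sequence[Sample]) -> Optional[int]:
    """Sum over the samples that exist; missing steps are ignored."""
    present = [s for s in samples if s is not None]
    return sum(present) if present else None


def current_rule_fires(samples: Sequence[Sample]) -> bool:
    """Mimics: min_over_time(...[15m:1m]) > 0"""
    value = min_over_time(samples)
    return value is not None and value > 0


def proposed_rule_fires(samples: Sequence[Sample], expected: int = 15) -> bool:
    """Mimics: sum_over_time(...[15m:1m]) == 15 (assumes 15 steps per window)."""
    value = sum_over_time(samples)
    return value is not None and value == expected


# 1 = pod is Pending/Unknown/Failed at that step, 0 = pod is in another phase.
fresh_pod = [None] * 14 + [1]   # created a minute ago, still Pending
stuck_pod = [1] * 15            # unhealthy for the whole window
recovered_pod = [1] * 5 + [0] * 10  # unhealthy at first, then fine

for name, window in [("fresh", fresh_pod), ("stuck", stuck_pod), ("recovered", recovered_pod)]:
    print(name, current_rule_fires(window), proposed_rule_fires(window))
```

With these inputs, only the fresh pod is treated differently: the min-based check fires on its single sample, while the sum-based check stays quiet until all 15 steps report an unhealthy phase.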
