
Operator: unhealthy ingesters not leaving the ring #15702

Open
aleert opened this issue Jan 11, 2025 · 3 comments · May be fixed by #15703
@aleert (Contributor) commented Jan 11, 2025

Describe the bug
After network issues within our clusters we found that ingesters were not able to rejoin the ring, so we had to remove them manually.
There are multiple issues describing this behavior, e.g. #8615 and #14847.

The suggested fix would be to add the autoforget_unhealthy flag to the ingester config by default, as there seem to be no downsides to it.
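For reference, a minimal sketch of the corresponding upstream Loki setting that the operator would need to render into the generated configuration (`autoforget_unhealthy` is the existing upstream ingester option; it currently defaults to false):

```yaml
# Sketch: the upstream Loki ingester setting this issue proposes the
# operator should enable by default; it currently defaults to false.
ingester:
  autoforget_unhealthy: true
```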

Expected behavior
Unhealthy ingesters leave the ring after a timeout.

Environment:
Kubernetes 1.27

@aleert linked a pull request Jan 11, 2025 that will close this issue
@xperimental (Collaborator) commented:

Hi @aleert,

Can you provide more information on how to reproduce the issue?

I have recently tried to reproduce a very similar report, but for me the ingesters instantly became healthy again after the network issues were removed.

@aleert (Contributor, Author) commented Jan 14, 2025

@xperimental ok, so it is not exactly what happened to our clusters, but I was able to reproduce similar behavior. It turns out the problem is scaling the ingester replica count down while experiencing network issues. You can use the following steps to reproduce it:

1. Create a lokistack with the following template (a fuller manifest sketch follows this list):

   ```yaml
   template:
     ingester:
       replicas: 3
   ```

2. Disable the network for the ingesters. I used the following NetworkPolicy, as we use the calico CNI:

   ```yaml
   apiVersion: crd.projectcalico.org/v1
   kind: NetworkPolicy
   metadata:
     name: test-policy
     namespace: loki
   spec:
     selector: app.kubernetes.io/component=='ingester'
     ingress:
     - action: Allow
       protocol: TCP
       source:
         selector: loki.grafana.com/gossip == 'true'
       destination:
         ports:
         - 22
     egress:
     - action: Allow
       protocol: TCP
       source:
         selector: loki.grafana.com/gossip == 'true'
       destination:
         ports:
         - 22
   ```

3. Wait until the ingesters become UNHEALTHY.
4. Change the ingester replicas to 2 or 1 and wait until the pods are deleted.
5. Remove the network policy. The deleted ingesters will now be stuck in the UNHEALTHY state. They can be fixed by scaling back up to 3 or 2, so that a pod with the same name as the unhealthy ingester appears; however, these ingesters will not become healthy or be forgotten on their own.
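For completeness, here is a sketch of a full minimal LokiStack manifest for step 1. Everything except `template.ingester.replicas` (the name, namespace, size profile, storage secret, and storage class) is a placeholder assumption:

```yaml
# Sketch of a minimal LokiStack for the reproduction; all values except
# template.ingester.replicas are placeholder assumptions.
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack-dev          # placeholder name
  namespace: loki
spec:
  size: 1x.demo                # smallest size profile, for test clusters
  storage:
    secret:
      name: loki-objectstore   # placeholder object-storage secret
      type: s3
  storageClassName: standard   # placeholder storage class
  template:
    ingester:
      replicas: 3              # the replica count from step 1
```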
(screenshot attached)

@xperimental (Collaborator) commented:

Thanks for the update. I'll try to reproduce this soon with the new information.
