
Operator: unhealthy ingesters not leaving the ring #15702

Open
aleert opened this issue Jan 11, 2025 · 3 comments · May be fixed by #15703
@aleert (Contributor) commented Jan 11, 2025

Describe the bug
After network issues within our clusters we found that ingesters were not able to rejoin the ring, so we had to remove them manually.
There are multiple issues describing this behavior, e.g. #8615 and #14847.

The suggested fix would be to add the autoforget_unhealthy flag to the ingester config by default, as there seem to be no downsides to it.
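For reference, a minimal sketch of the corresponding upstream Loki setting that the operator would need to render into the generated configuration (`autoforget_unhealthy` is the existing upstream ingester option; it currently defaults to false):

```yaml
# Sketch: the upstream Loki ingester setting this issue proposes the
# operator should enable by default; it currently defaults to false.
ingester:
  autoforget_unhealthy: true
```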

Expected behavior
Unhealthy ingesters leave the ring after a timeout.

Environment:
Kubernetes 1.27

@aleert linked a pull request Jan 11, 2025 that will close this issue
@xperimental (Collaborator) commented:

Hi @aleert,

Can you provide more information on how to reproduce the issue?

I have recently tried to reproduce a very similar report, but for me the ingesters instantly became healthy again after the network issues were removed.

@aleert (Contributor, Author) commented Jan 14, 2025

@xperimental ok, so it is not exactly what happened to our clusters, but I was able to reproduce similar behavior. It turns out the problem is scaling the ingester replica count down while experiencing network issues. You can use the following steps to reproduce it:

1. Create a lokistack with the following template (a fuller manifest sketch follows this list):

   ```yaml
   template:
     ingester:
       replicas: 3
   ```

2. Disable the network for the ingesters. I used the following NetworkPolicy, as we use the calico CNI:

   ```yaml
   apiVersion: crd.projectcalico.org/v1
   kind: NetworkPolicy
   metadata:
     name: test-policy
     namespace: loki
   spec:
     selector: app.kubernetes.io/component=='ingester'
     ingress:
     - action: Allow
       protocol: TCP
       source:
         selector: loki.grafana.com/gossip == 'true'
       destination:
         ports:
         - 22
     egress:
     - action: Allow
       protocol: TCP
       source:
         selector: loki.grafana.com/gossip == 'true'
       destination:
         ports:
         - 22
   ```

3. Wait until the ingesters become UNHEALTHY.
4. Change the ingester replicas to 2 or 1 and wait until the pods are deleted.
5. Remove the network policy. The deleted ingesters will now be stuck in the UNHEALTHY state. They can be fixed by scaling back up to 3 or 2, so that a pod with the same name as the unhealthy ingester appears; however, these ingesters will not become healthy or be forgotten on their own.
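For completeness, here is a sketch of a full minimal LokiStack manifest for step 1. Everything except `template.ingester.replicas` (the name, namespace, size profile, storage secret, and storage class) is a placeholder assumption:

```yaml
# Sketch of a minimal LokiStack for the reproduction; all values except
# template.ingester.replicas are placeholder assumptions.
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack-dev          # placeholder name
  namespace: loki
spec:
  size: 1x.demo                # smallest size profile, for test clusters
  storage:
    secret:
      name: loki-objectstore   # placeholder object-storage secret
      type: s3
  storageClassName: standard   # placeholder storage class
  template:
    ingester:
      replicas: 3              # the replica count from step 1
```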
(screenshot attached)

@xperimental (Collaborator) commented:

Thanks for the update. I'll try to reproduce this soon with the new information.
