
TiDB operator hangs after setting the TiDB's status.report-status configuration to false #6013

kos-team opened this issue Dec 26, 2024 · 1 comment

@kos-team (Contributor)

Bug Report

What version of Kubernetes are you using?
Client Version: v1.31.1
Kustomize Version: v5.4.2

What version of TiDB Operator are you using?
v1.6.0

What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was initialized, we set status.report-status to false in spec.tidb.config and applied the change.

After the TiDB Operator successfully reconfigures the TiDB cluster, it loses connectivity to the cluster, mistakenly concludes that the cluster is unhealthy, and repeatedly tries to run failover. The failover spawns new pods, but the operator cannot contact the new pods either.

The health check fails at

} else if tc.TiDBAllPodsStarted() && !tc.TiDBAllMembersReady() {

which repeatedly triggers the Failover function.
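
For context, the member readiness feeding this check comes from TiDB's HTTP status API, which is served on the status port (10080 by default) and is disabled entirely when report-status = false. Below is a minimal sketch of that dependency with assumed names; it is not the operator's actual probe code.

// Sketch: the operator treats a TiDB member as ready only if its HTTP status
// endpoint responds. With report-status = false, TiDB never starts the status
// server, so every probe fails, TiDBAllMembersReady() stays false while all
// pods are started, and the failover branch above fires on every reconcile.
package main

import (
	"fmt"
	"net/http"
	"time"
)

const statusPort = 10080 // TiDB's default status port; closed when report-status = false

func tidbMemberHealthy(podAddr string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/status", podAddr, statusPort))
	if err != nil {
		return false // connection refused: the status server was never started
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// With report-status = false this prints "false" for every member, so the
	// operator keeps spawning failover pods that are equally unreachable.
	fmt.Println(tidbMemberHealthy("test-cluster-tidb-0.test-cluster-tidb-peer"))
}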

How to reproduce

  1. Deploy a TiDB cluster, for example:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: '
      [performance]

      tcp-keep-alive = true
      '
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
  2. Set status.report-status to false in spec.tidb.config:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: '
      [performance]

      tcp-keep-alive = true

      [status]

      report-status = false
      '
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0

What did you expect to see?
We expected either that TiDB restarts and the new configuration takes effect, or that the TiDB Operator rejects the change, since the operator's operations depend on TiDB's HTTP status API.

What did you see instead?
The last pod terminated and restarted. After that, the TiDB Operator could not connect to the cluster and hung.

@csuzhangxc (Member)

We're planning to add a better webhook back in the upcoming TiDB Operator v2, and this verification may be implemented in the webhook.
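
As an illustration only of the kind of check such a webhook could perform (the types and field names below are stand-ins, not the operator's real v1alpha1 API or the v2 webhook interface):

// Sketch of an admission-time validation that rejects a TidbCluster spec whose
// TiDB config disables the status server, since the operator relies on it for
// health checks and failover. The config type is a hypothetical stand-in.
package main

import (
	"errors"
	"fmt"
)

// tidbStatusConfig mirrors the [status] section carried in spec.tidb.config.
type tidbStatusConfig struct {
	ReportStatus *bool // status.report-status; nil means "use the default (true)"
}

func validateTiDBStatusConfig(cfg tidbStatusConfig) error {
	if cfg.ReportStatus != nil && !*cfg.ReportStatus {
		return errors.New("status.report-status must not be false: the operator " +
			"needs TiDB's HTTP status API for health checks and failover")
	}
	return nil
}

func main() {
	off := false
	if err := validateTiDBStatusConfig(tidbStatusConfig{ReportStatus: &off}); err != nil {
		fmt.Println("admission denied:", err) // what a validating webhook would return
	}
}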
