
TiDB operator hangs after setting the TiDB's status.report-status configuration to false #6013

kos-team opened this issue Dec 26, 2024 · 1 comment

@kos-team (Contributor)

Bug Report

What version of Kubernetes are you using?
Client Version: v1.31.1
Kustomize Version: v5.4.2

What version of TiDB Operator are you using?
v1.6.0

What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was initialized, we set status.report-status to false in spec.tidb.config and applied the change.

After the TiDB Operator successfully reconfigures the TiDB cluster, it loses connectivity to the cluster, mistakenly concludes that the cluster is unhealthy, and repeatedly tries to run failover. The failover spawns new pods, but the operator cannot contact the new pods either.

The health check fails at

} else if tc.TiDBAllPodsStarted() && !tc.TiDBAllMembersReady() {

which repeatedly triggers the Failover function.
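
For context, the member readiness feeding this check comes from TiDB's HTTP status API, which is served on the status port (10080 by default) and is disabled entirely when report-status = false. Below is a minimal sketch of that dependency with assumed names; it is not the operator's actual probe code.

// Sketch: the operator treats a TiDB member as ready only if its HTTP status
// endpoint responds. With report-status = false, TiDB never starts the status
// server, so every probe fails, TiDBAllMembersReady() stays false while all
// pods are started, and the failover branch above fires on every reconcile.
package main

import (
	"fmt"
	"net/http"
	"time"
)

const statusPort = 10080 // TiDB's default status port; closed when report-status = false

func tidbMemberHealthy(podAddr string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/status", podAddr, statusPort))
	if err != nil {
		return false // connection refused: the status server was never started
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// With report-status = false this prints "false" for every member, so the
	// operator keeps spawning failover pods that are equally unreachable.
	fmt.Println(tidbMemberHealthy("test-cluster-tidb-0.test-cluster-tidb-peer"))
}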

How to reproduce

  1. Deploy a TiDB cluster, for example:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: '
      [performance]

      tcp-keep-alive = true
      '
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
  2. Set status.report-status to false in spec.tidb.config:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: '
      [performance]

      tcp-keep-alive = true

      [status]

      report-status = false
      '
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0

What did you expect to see?
We expected either that TiDB restarts and the new configuration takes effect, or that the TiDB Operator rejects the change, since the operator's operations depend on TiDB's HTTP status API.

What did you see instead?
The last pod terminated and restarted. After that, the TiDB Operator could not connect to the cluster and hung.

@csuzhangxc (Member)

We're planning to add a better webhook back in the upcoming TiDB Operator v2, and this verification may be implemented in the webhook.
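
As an illustration only of the kind of check such a webhook could perform (the types and field names below are stand-ins, not the operator's real v1alpha1 API or the v2 webhook interface):

// Sketch of an admission-time validation that rejects a TidbCluster spec whose
// TiDB config disables the status server, since the operator relies on it for
// health checks and failover. The config type is a hypothetical stand-in.
package main

import (
	"errors"
	"fmt"
)

// tidbStatusConfig mirrors the [status] section carried in spec.tidb.config.
type tidbStatusConfig struct {
	ReportStatus *bool // status.report-status; nil means "use the default (true)"
}

func validateTiDBStatusConfig(cfg tidbStatusConfig) error {
	if cfg.ReportStatus != nil && !*cfg.ReportStatus {
		return errors.New("status.report-status must not be false: the operator " +
			"needs TiDB's HTTP status API for health checks and failover")
	}
	return nil
}

func main() {
	off := false
	if err := validateTiDBStatusConfig(tidbStatusConfig{ReportStatus: &off}); err != nil {
		fmt.Println("admission denied:", err) // what a validating webhook would return
	}
}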
