PD is corrupting `etcd` database on restart #8547

faelau · 2024-08-19T15:30:02Z

Bug Report

If you restart a PD pod, you receive the following panic:

[2024/08/19 15:16:25.624 +00:00] [WARN] [server.go:297] ["exceeded recommended request limit"] [max-request-bytes=157286400] [max-request-size="157 MB"] [recommended-request-bytes=10485760] [recommended-request-size="10 MB"]
2024-08-19 15:16:25.624904 W | pkg/fileutil: check file permission: directory "/var/lib/pd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
[2024/08/19 15:16:25.636 +00:00] [PANIC] [backend.go:173] ["failed to open database"] [path=/var/lib/pd/member/snap/db] [error="invalid database"]
panic: failed to open database
goroutine 251 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0x2?, {0x0?, 0x0?, 0xc0001364a0?})
	/root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0006f52b0, {0xc0012b9980, 0x2, 0x2})
	/root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc001299f80?, {0x304f490?, 0x16?}, {0xc0012b9980, 0x2, 0x2})
	/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x51
go.etcd.io/etcd/mvcc/backend.newBackend({{0xc001299f80, 0x1a}, 0x5f5e100, 0x2710, {0x30191e2, 0x5}, 0x233333333, 0xc000053980, 0x0})
	/root/go/pkg/mod/go.etcd.io/[email protected]/mvcc/backend/backend.go:173 +0x35c
go.etcd.io/etcd/mvcc/backend.New(...)
	/root/go/pkg/mod/go.etcd.io/[email protected]/mvcc/backend/backend.go:151
go.etcd.io/etcd/etcdserver.newBackend({{0x7ffd58f397d8, 0xe}, {0x0, 0x0}, {0x0, 0x0}, {0xc0003086c0, 0x1, 0x1}, {0xc000308480, ...}, ...})
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:53 +0x3b0
go.etcd.io/etcd/etcdserver.openBackend.func1()
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:74 +0x45
created by go.etcd.io/etcd/etcdserver.openBackend in goroutine 1
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:73 +0x106

The PVC uses the `cephfs.csi.ceph.com` provisioner. The cluster is running on microk8s.

Digging a bit deeper and inspecting the etcd database with bbolt results in the following error:

$ ./go/bin/bbolt page --all --format-value=redacted db
cannot read number of pages: the Meta Page has wrong (unexpected) magic
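For anyone who wants to check the same thing without installing bbolt, here is a minimal sketch that looks at the two meta pages directly. It rests on my understanding of the bbolt on-disk layout (a 16-byte page header, then a 32-bit magic `0xED0CDAED`, a 32-bit version `2`, and the 32-bit page size), and it assumes a little-endian host and a 4096-byte page size. The function names are hypothetical helpers, not part of bbolt or PD.

```python
import struct

# Assumed bbolt constants (not taken from the report above).
BOLT_MAGIC = 0xED0CDAED   # marker expected at the start of each meta record
BOLT_VERSION = 2          # on-disk format version
PAGE_HEADER_SIZE = 16     # id (8) + flags (2) + count (2) + overflow (4)


def read_meta(data: bytes, page_offset: int) -> dict:
    """Decode magic, version and page size of the meta page at page_offset."""
    start = page_offset + PAGE_HEADER_SIZE
    if len(data) < start + 12:
        # Too short to contain the three fields; report it as invalid.
        return {"offset": page_offset, "magic_ok": False,
                "version_ok": False, "page_size": None}
    # Little-endian is an assumption; bbolt writes in host byte order.
    magic, version, page_size = struct.unpack_from("<III", data, start)
    return {"offset": page_offset,
            "magic_ok": magic == BOLT_MAGIC,
            "version_ok": version == BOLT_VERSION,
            "page_size": page_size}


def check_meta_pages(data: bytes, page_size: int = 4096) -> list:
    """Check both meta pages (page 0 and page 1) of a bbolt file image."""
    return [read_meta(data, 0), read_meta(data, page_size)]


def check_file(path: str, page_size: int = 4096) -> list:
    """Read the first two pages of the file at path and check them."""
    with open(path, "rb") as f:
        return check_meta_pages(f.read(2 * page_size), page_size)
```

Running `check_file("db")` on a copy of `/var/lib/pd/member/snap/db` should show whether one or both meta pages have lost the magic. If both report `magic_ok: False`, that would match the bbolt error above.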

What did you do?

Create a new TidbCluster:

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: surrealdb
spec:
  version: v8.2.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    replicas: 1
    maxFailoverCount: 0
    mountClusterClientSecret: true
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    evictLeaderTimeout: 1m
    replicas: 3
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config:
      storage:
        reserve-space: "0MB"
      rocksdb:
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 5
    service:
      type: ClusterIP
    config: {}

Restart the PD pod (e.g. by draining a node while updating Kubernetes).
PD panics on startup.

What did you expect to see?

PD does not corrupt the etcd database.

What did you see instead?

A panic of the PD container because the etcd database is corrupted.

What version of PD are you using (`pd-server -V`)?

[root@surrealdb-pd-0 /]# ./pd-server -V
Release Version: v8.2.0
Edition: Community
Git Commit Hash: c0ee2cd6c2eea7ad9372cc5bd00f6774abad6834
Git Branch: HEAD
UTC Build Time:  2024-07-04 09:39:38

faelau · 2024-10-28T10:35:10Z

Some news on this.

The panic only happens when the PVCs are mounted with the CephFS kernel driver. If the PVCs are mounted with the FUSE driver, the panic does not happen.

The same thing happened a few days ago with other software that uses BoltDB, so this may be an upstream issue in boltdb/etcd.

Maybe an issue should be opened there as well?
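To tell which CephFS client backs a given mount (the kernel versus FUSE distinction above), one option is to look at the filesystem type in `/proc/mounts`. This is a rough sketch, not something taken from PD or the CSI driver: it assumes the kernel client shows up with type `ceph` and the FUSE client with a type starting with `fuse` (typically `fuse.ceph-fuse`), and `ceph_mount_kind` is a hypothetical helper name.

```python
def ceph_mount_kind(mounts_text: str, mount_point: str) -> str:
    """Classify the mount at mount_point as 'kernel', 'fuse', 'other' or 'not-mounted'.

    mounts_text is the content of /proc/mounts. The fstype strings checked
    here are assumptions about typical CephFS setups.
    """
    kind = "not-mounted"
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) < 3 or fields[1] != mount_point:
            continue
        fstype = fields[2]
        if fstype == "ceph":
            kind = "kernel"
        elif fstype.startswith("fuse"):
            kind = "fuse"
        else:
            kind = "other"
        # Keep scanning: a later entry for the same path shadows earlier ones.
    return kind
```

Inside the PD container this could be called as `ceph_mount_kind(open("/proc/mounts").read(), "/var/lib/pd")`.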

faelau added the type/bug The issue is confirmed as a bug. label Aug 19, 2024

github-project-automation bot added this to Questions and Bug Reports Aug 29, 2024

github-project-automation bot moved this to Need Triage in Questions and Bug Reports Aug 29, 2024

jebter added the severity/major label Aug 30, 2024

ti-chi-bot bot added may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 may-affects-8.1 labels Aug 30, 2024

jebter added the impact/panic label Oct 23, 2024

ti-chi-bot bot added the affects-8.5 This bug affects the 8.5.x(LTS) versions. label Nov 1, 2024
