PD is corrupting etcd database on restart #8547

Open
faelau opened this issue Aug 19, 2024 · 1 comment

faelau commented Aug 19, 2024

Bug Report

Restarting a PD pod produces the following panic:

[2024/08/19 15:16:25.624 +00:00] [WARN] [server.go:297] ["exceeded recommended request limit"] [max-request-bytes=157286400] [max-request-size="157 MB"] [recommended-request-bytes=10485760] [recommended-request-size="10 MB"]
2024-08-19 15:16:25.624904 W | pkg/fileutil: check file permission: directory "/var/lib/pd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
[2024/08/19 15:16:25.636 +00:00] [PANIC] [backend.go:173] ["failed to open database"] [path=/var/lib/pd/member/snap/db] [error="invalid database"]
panic: failed to open database
goroutine 251 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0x2?, {0x0?, 0x0?, 0xc0001364a0?})
	/root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0006f52b0, {0xc0012b9980, 0x2, 0x2})
	/root/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc001299f80?, {0x304f490?, 0x16?}, {0xc0012b9980, 0x2, 0x2})
	/root/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x51
go.etcd.io/etcd/mvcc/backend.newBackend({{0xc001299f80, 0x1a}, 0x5f5e100, 0x2710, {0x30191e2, 0x5}, 0x233333333, 0xc000053980, 0x0})
	/root/go/pkg/mod/go.etcd.io/[email protected]/mvcc/backend/backend.go:173 +0x35c
go.etcd.io/etcd/mvcc/backend.New(...)
	/root/go/pkg/mod/go.etcd.io/[email protected]/mvcc/backend/backend.go:151
go.etcd.io/etcd/etcdserver.newBackend({{0x7ffd58f397d8, 0xe}, {0x0, 0x0}, {0x0, 0x0}, {0xc0003086c0, 0x1, 0x1}, {0xc000308480, ...}, ...})
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:53 +0x3b0
go.etcd.io/etcd/etcdserver.openBackend.func1()
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:74 +0x45
created by go.etcd.io/etcd/etcdserver.openBackend in goroutine 1
	/root/go/pkg/mod/go.etcd.io/[email protected]/etcdserver/backend.go:73 +0x106

The PVC uses the cephfs.csi.ceph.com provisioner. The cluster is running on MicroK8s.

Inspecting the etcd database with bbolt to dig a bit deeper gives the following error:

$ ./go/bin/bbolt page --all --format-value=redacted db
cannot read number of pages: the Meta Page has wrong (unexpected) magic
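
The "wrong magic" message can be cross-checked by hand. The following is a minimal sketch, not bbolt's own code: it assumes a little-endian host, a 4096-byte page size, and a 16-byte page header in front of the meta fields (magic, version, page size). The helper names `read_meta` and `inspect_db` are hypothetical. It decodes the first fields of both meta pages and compares the magic against 0xED0CDAED, the value bbolt writes.

```python
import struct

BOLT_MAGIC = 0xED0CDAED   # magic value bbolt writes into each meta page
PAGE_HEADER_SIZE = 16     # assumed: id (8) + flags (2) + count (2) + overflow (4)
ASSUMED_PAGE_SIZE = 4096  # assumption: typical page size on x86-64 Linux


def read_meta(buf: bytes, page_index: int, page_size: int = ASSUMED_PAGE_SIZE) -> dict:
    """Decode magic, version and stored page size from one meta page."""
    start = page_index * page_size + PAGE_HEADER_SIZE
    chunk = buf[start:start + 12]
    if len(chunk) < 12:
        # The file is too short to contain this meta page at all.
        return {"page": page_index, "truncated": True, "magic_ok": False}
    # Assumption: fields are stored in little-endian byte order.
    magic, version, stored_page_size = struct.unpack("<III", chunk)
    return {
        "page": page_index,
        "truncated": False,
        "magic": magic,
        "version": version,
        "page_size": stored_page_size,
        "magic_ok": magic == BOLT_MAGIC,
    }


def inspect_db(path: str) -> list:
    """Read the first two pages of a database file and decode both meta pages."""
    with open(path, "rb") as f:
        buf = f.read(2 * ASSUMED_PAGE_SIZE)
    return [read_meta(buf, i) for i in (0, 1)]
```

Calling `inspect_db("db")` on a copy of `/var/lib/pd/member/snap/db` would show whether one or both meta pages lost their magic, which may help tell a partially written file from one that was zeroed or truncated.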

What did you do?

  1. Create a new TidbCluster:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: surrealdb
spec:
  version: v8.2.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    replicas: 1
    maxFailoverCount: 0
    mountClusterClientSecret: true
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    evictLeaderTimeout: 1m
    replicas: 3
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config:
      storage:
        reserve-space: "0MB"
      rocksdb:
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 5
    service:
      type: ClusterIP
    config: {}
  2. Restart the PD pod (e.g. by draining a node while upgrading Kubernetes)
  3. Observe the panic

What did you expect to see?

PD should not corrupt the etcd database.

What did you see instead?

The PD container panics because the etcd database is corrupted.

What version of PD are you using (pd-server -V)?

[root@surrealdb-pd-0 /]# ./pd-server -V
Release Version: v8.2.0
Edition: Community
Git Commit Hash: c0ee2cd6c2eea7ad9372cc5bd00f6774abad6834
Git Branch: HEAD
UTC Build Time:  2024-07-04 09:39:38

faelau commented Oct 28, 2024

I have some news on this.

The panic only happens when the PVCs are mounted with the Ceph kernel driver. When the PVCs are mounted with the FUSE driver, the panic does not occur.

This also happened a few days ago with another program that uses BoltDB, so it seems to be some kind of upstream issue in BoltDB/etcd?
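
To confirm which driver backs a given mount, the filesystem type listed in `/proc/mounts` on the node can be checked. The following is a rough sketch under assumptions: that a kernel CephFS mount is listed with type `ceph` and a FUSE mount with a type starting with `fuse` (for example `fuse.ceph-fuse`). The function name `classify_mount` and the mount path in the usage note are hypothetical.

```python
def classify_mount(mounts_text: str, target: str) -> str:
    """Return 'kernel', 'fuse', 'other' or 'not-mounted' for a mount point.

    mounts_text is the content of /proc/mounts; target is the mount point path.
    """
    result = "not-mounted"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[1] != target:
            continue
        fstype = fields[2]
        # Later entries override earlier ones: the last mount on a path wins.
        if fstype == "ceph":            # assumption: kernel CephFS client
            result = "kernel"
        elif fstype.startswith("fuse"):  # assumption: e.g. fuse.ceph-fuse
            result = "fuse"
        else:
            result = "other"
    return result
```

On a node, this could be used as `classify_mount(open("/proc/mounts").read(), "/path/to/pvc/mount")`, with the path replaced by the actual mount point of the PD volume.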

Maybe an issue there should also be opened?

ti-chi-bot added the affects-8.5 label ("This bug affects the 8.5.x (LTS) versions.") on Nov 1, 2024
Status: Need Triage