
After long-running stress testing, threads wal-raft-executor-112680774442680320_0 and basekv-range-mutator show high CPU that never drops back down #94

Open
masterOcean opened this issue Jul 8, 2024 · 3 comments


@masterOcean

masterOcean commented Jul 8, 2024

After long-running stress testing, the wal-raft-executor-112680774442680320_0 thread shows high CPU that never drops back down.
Cluster of 3 nodes (32C, 64G each; nodes 20, 54 and 124), 350k clients, each publishing a 40 KB body every 10 s, with a sleep of about 3 minutes every 10-12 hours. After roughly 2 days, the wal-raft-executor-112680774442680320_0 thread on node 20 and the wal-raft-executor-112680774434029568_0 thread on node 54 show high CPU that never drops back down; at the same time the basekv-range-mutator thread CPU is also high and will not come down.
During this period the cluster stays healthy: warn.log and error.log contain no errors and the GC logs look normal. The range id carried in the thread name can be found in the balancer logs.
Node 20 CPU screenshot: [image]

Node 20 retain.store-fd6e1d50-7308-4146-84fd-5fa62de36212.log

2024-06-30 20:23:07.191  INFO [bg-task-executor-7] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2784, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e, 542e442a-3748-4ec4-b6db-eda13ad225e6], learner=[]}] result: true
2024-06-30 22:08:53.690  INFO [bg-task-executor-2] --- [KVRangeBalanceController.java:169] Balancer[ReplicaCntBalancer] run command: ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2788, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e], learner=[]}
2024-07-01 09:55:06.882  INFO [bg-task-executor-3] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2844, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e], learner=[]}] result: true
2024-07-02 17:55:13.775  INFO [bg-task-executor] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2856, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e, 542e442a-3748-4ec4-b6db-eda13ad225e6], learner=[]}] result: true

Node 54 CPU screenshot: [image]

Node 54 inbox.store-0a40673e-7e57-47d6-8fa9-e69a2305152e.log

2024-07-03 19:27:35.164  INFO [bg-task-executor] --- [KVRangeBalanceController.java:169] Balancer[ReplicaCntBalancer] run command: ChangeConfigCommand{toStore=0a40673e-7e57-47d6-8fa9-e69a2305152e, kvRangeId=112680774434029568_0, expectedVer=3640, voters=[62837868-8a27-4d5c-9bc3-1a155fc63a66, e8a84d42-8292-489e-a241-9ce716d14e07, 0a40673e-7e57-47d6-8fa9-e69a2305152e], learner=[]}

BifroMQ

  • Version: [3.1.1]
  • Deployment: [Cluster]

To Reproduce
Stress-test clients: 350k clients, each sending a 40 KB body, QoS 0 message every 8.5 s, with a sleep of more than 3 minutes every 10-12 hours (a minimal publisher sketch follows the client details below).
*** PUB Client ***:

  • MQTT Connection:
    • ClientIdentifier:
    • etc...
  • MQTT Pub:
    • Topic:
    • QoS: [0]
    • Retain: [false]
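For context, the setup above corresponds roughly to the following per-connection publisher loop. This is only a minimal sketch using the Eclipse Paho MQTT v3 client; the broker address, client id and topic are placeholder values, not taken from the actual test.

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class PubLoadClient {
    public static void main(String[] args) throws Exception {
        // Placeholder broker address and client id; the real test ran ~350k such connections.
        MqttClient client = new MqttClient("tcp://broker-host:1883", "load-client-0001",
                new MemoryPersistence());
        MqttConnectOptions opts = new MqttConnectOptions();
        opts.setCleanSession(true);
        client.connect(opts);

        byte[] payload = new byte[40 * 1024];          // ~40 KB body
        while (true) {
            MqttMessage msg = new MqttMessage(payload);
            msg.setQos(0);                             // QoS 0
            msg.setRetained(false);                    // Retain = false
            client.publish("load/test/topic", msg);    // placeholder topic
            Thread.sleep(8_500);                       // one message every 8.5 s per connection
        }
    }
}
```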

Expected behavior

Logs

Configurations

OS (please complete the following information):

JVM:

  • Version: [17]

Performance Related

  • HOST:
    • Cluster node count: [3]
    • CPU: [32]
    • Memory: [64]
  • Network:
    • Bandwidth: [5Gbps]
    • Latency: []
  • Load:
    • PUB count: [350000]
    • SUB count: [0]
    • PUB QPS per connection: [0.12msg/s]
    • Payload size: [40KB]

Additional context
Add any other context about the problem here.

@popduke
Member

popduke commented Jul 12, 2024

We could not reproduce the behaviour you describe from the reproduce information you provided. Suggestions: 1) add complete, reliably reproducible steps to the issue description, or 2) if the problem still exists after stopping the stress test and restarting, provide the complete data directories of all three nodes for diagnosis.

@masterOcean
Author

> We could not reproduce the behaviour you describe from the reproduce information you provided. Suggestions: 1) add complete, reliably reproducible steps to the issue description, or 2) if the problem still exists after stopping the stress test and restarting, provide the complete data directories of all three nodes for diagnosis.

Data files
Link: https://pan.baidu.com/s/1K2gkC2vtzGz2ykbsFPYSAA?pwd=y5cg
Extraction code: y5cg

@popduke
Member

popduke commented Jul 19, 2024

The relevant metrics in your data (basekv_meta_ver) show that the inbox store and retain store ranges have gone through several thousand management (config) version changes, and the replicas' progress is out of sync; the thread burning CPU is most likely the leader continuously retrying synchronization. In this situation you need to check whether there is a problem with the communication quality between the nodes. In addition, 3.2.1 contains some storage-engine stability optimizations; we recommend running the same scenario against it.
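As a starting point for the inter-node communication check suggested above, the following is only an illustrative probe (not a BifroMQ tool or API): it measures TCP connect round trips from the current host to the other cluster nodes; the host names and port are placeholders.

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class NodeLinkProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder peers; substitute the real node addresses and a reachable port.
        String[] peers = {"node-20:1883", "node-54:1883", "node-124:1883"};
        int rounds = 20;
        for (String peer : peers) {
            String host = peer.split(":")[0];
            int port = Integer.parseInt(peer.split(":")[1]);
            long worstMs = 0, totalMs = 0;
            for (int i = 0; i < rounds; i++) {
                long start = System.nanoTime();
                try (Socket s = new Socket()) {
                    s.connect(new InetSocketAddress(host, port), 1000); // 1 s connect timeout
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                worstMs = Math.max(worstMs, elapsedMs);
                totalMs += elapsedMs;
            }
            System.out.printf("%s avg=%dms worst=%dms%n", peer, totalMs / rounds, worstMs);
        }
    }
}
```

Consistently high or spiky connect times between the nodes would point at the link-quality issue mentioned above; ping/iperf give the same picture at the OS level.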
