Rebalancing an upsert table causing high GC and failure to reconnect to ZK #14301

dang-stripe · 2024-10-24T22:24:07Z

Follow-up to apache/helix#2951, which provides more detail.

We rebalanced an upsert table in low-disk mode. This led to high GC on a server, which then kept trying to reconnect to ZK. The server does not recover until we manually restart it.

@Jackie-Jiang had a theory that this might be tied to the metadata manager for old partitions not being released even after all of their segments were dropped, which leaves a large, empty concurrent hash map on the heap and causes the GC pressure.
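
To make the theory concrete, here is a minimal Java sketch of the kind of cleanup it implies. It is not Pinot's actual code: `PartitionManagerRegistry`, `PartitionManager`, and all method names are hypothetical stand-ins. The relevant JDK behaviour is that a `java.util.concurrent.ConcurrentHashMap` does not shrink its internal table when entries are removed, so a manager that is emptied but still referenced keeps its full-size table reachable. The sketch drops the manager itself once its last segment is gone, so the whole map becomes collectable.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; these are NOT Pinot's real classes or method names.
final class PartitionManagerRegistry {

  // Stand-in for a per-partition upsert metadata manager.
  static final class PartitionManager {
    // Primary key -> name of the segment currently holding that key.
    final Map<String, String> primaryKeyToSegment = new ConcurrentHashMap<>();
    // Segments currently tracked for this partition.
    final Set<String> segments = ConcurrentHashMap.newKeySet();
  }

  private final Map<Integer, PartitionManager> managers = new ConcurrentHashMap<>();

  // Registers a segment and its primary keys, creating the manager on first use.
  void addSegment(int partitionId, String segmentName, Iterable<String> primaryKeys) {
    managers.compute(partitionId, (id, manager) -> {
      if (manager == null) {
        manager = new PartitionManager();
      }
      manager.segments.add(segmentName);
      for (String primaryKey : primaryKeys) {
        manager.primaryKeyToSegment.put(primaryKey, segmentName);
      }
      return manager;
    });
  }

  // Drops a segment; releases the whole manager once no segments remain.
  void removeSegment(int partitionId, String segmentName) {
    managers.computeIfPresent(partitionId, (id, manager) -> {
      manager.segments.remove(segmentName);
      manager.primaryKeyToSegment.values().removeIf(segmentName::equals);
      // Returning null removes the mapping, so the emptied (but still
      // full-capacity) key map is no longer reachable from the registry.
      return manager.segments.isEmpty() ? null : manager;
    });
  }

  // Number of partitions that still hold a manager on heap.
  int trackedPartitions() {
    return managers.size();
  }
}
```

Under these assumptions, `trackedPartitions()` falling to zero after every segment of a rebalanced-away partition has been removed would be the signal that nothing is left pinned on heap.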

Jackie-Jiang · 2024-11-01T06:49:44Z

cc @klsince @tibrewalpratik17

tibrewalpratik17 · 2024-11-06T05:55:17Z

#11626 might also be related: there we saw similar behaviour after a long GC pause, where the helix-pending-messages metric spikes and does not recover.

dang-stripe mentioned this issue Oct 24, 2024

Frequent ZK session ID mismatches after GC leading to Helix messages treated as no-op apache/helix#2951

Closed

Jackie-Jiang added upsert bug labels Nov 1, 2024
