
Consistent "failed to flush commits before termination" error during consumer group shutdown #614

Open
Tasyp opened this issue Dec 13, 2024 · 2 comments

Comments

@Tasyp

Tasyp commented Dec 13, 2024

I have the following setup: two consumers of different topics inside the same consumer group, distributed across 3 nodes and using partition_assignment_strategy=callback_implemented.
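For reference, the subscribers are started roughly like this (a minimal sketch; the client name, topics, callback module, and option values are placeholders rather than my actual configuration):

%% Minimal sketch of the subscriber setup; names are placeholders.
Config = #{
  client          => my_brod_client,
  group_id        => <<"my-group">>,
  topics          => [<<"topic-a">>, <<"topic-b">>],
  cb_module       => my_group_subscriber_cb,
  message_type    => message,
  group_config    => [{partition_assignment_strategy, callback_implemented}],
  consumer_config => [{begin_offset, earliest}]
},
{ok, _Pid} = brod_group_subscriber_v2:start_link(Config).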

Everything works great, but one thing worries me: during a shutdown, I can consistently see the following message logged on different nodes:

group_subscriber_v2 *group-id* failed to flush commits before termination :timeout

This is logged at error level, so I treat it as abnormal behaviour.

This seems to be a safety mechanism to prevent the call to the group coordinator from hanging forever:

ok = flush_offset_commits(GroupId, Coordinator),
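My understanding is that the shutdown path makes a bounded call to the coordinator and logs the error when that call times out; roughly this pattern (a sketch only, the request term and the timeout value are assumptions rather than brod's actual internals):

%% Sketch of a bounded "flush commits" call during terminate. The request
%% term and the timeout value are assumptions, not brod's real internals.
flush_offset_commits(GroupId, Coordinator) ->
    try gen_server:call(Coordinator, commit_offsets, 5000) of
        ok -> ok
    catch
        exit:{timeout, _} ->
            logger:error("group_subscriber_v2 ~s failed to flush commits "
                         "before termination ~p", [GroupId, timeout]),
            ok
    end.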

Could it be related to the use of the callback-implemented partition assignment strategy? For example, the original group leader has already shut down, a new one is elected, it starts doing preparatory work, and that is when the flush-offsets call comes in.

Are there any logs/other information I could provide to simplify the investigation?

@zmstone
Contributor

zmstone commented Dec 14, 2024

Maybe it's because the coordinator process is in the middle of a rebalance, e.g. calling the member process to revoke assignments or assign partitions.
Do you happen to shut down all 3 nodes around the same period? If so, do you observe all three nodes logging the same, or is it always the second and third node logging it?

You can add some debug logs to the assignments_revoked and assign_partitions calls to confirm whether it is indeed a deadlock (see the sketch at the end of this comment).
There is no quick solution for the deadlock; the best I can think of is to make the two calls async, which would require some state-machine redesign for the coordinator.
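For example, something along these lines in the callback module (a minimal sketch; the callback arities and the do_revoke/do_assign helpers are placeholders for your existing logic, not exact brod signatures):

%% Sketch: entry/exit debug logs around the two callbacks suspected of
%% blocking against the coordinator. Arities and the do_revoke/do_assign
%% helpers are placeholders for the existing callback bodies.
assignments_revoked(State) ->
    logger:debug("assignments_revoked: enter"),
    Result = do_revoke(State),
    logger:debug("assignments_revoked: done"),
    Result.

assign_partitions(Members, TopicPartitions, State) ->
    logger:debug("assign_partitions: enter, ~p members", [length(Members)]),
    Assignments = do_assign(Members, TopicPartitions, State),
    logger:debug("assign_partitions: done"),
    Assignments.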

@Tasyp
Author

Tasyp commented Dec 18, 2024

Do you happen to shut down all 3 nodes around the same period?

Yes, it is done via a Kubernetes rolling update, so not exactly at the same time but close enough to overlap.

If so, do you observe all three nodes logging the same, or is it always the second and third node logging it?

~90+% of the time it is the second and third node. It is very rare that all 3 fail to flush commits.
