
Consistent "failed to flush commits before termination" error during consumer group shutdown #614

Open
Tasyp opened this issue Dec 13, 2024 · 2 comments

Comments

@Tasyp

Tasyp commented Dec 13, 2024

I have the following setup: two consumers of different topics inside the same consumer group, distributed across 3 nodes and using partition_assignment_strategy=callback_implemented.
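For reference, the subscribers are started roughly like this (a minimal sketch; the client name, topics, callback module, and option values are placeholders rather than my actual configuration):

%% Minimal sketch of the subscriber setup; names are placeholders.
Config = #{
  client          => my_brod_client,
  group_id        => <<"my-group">>,
  topics          => [<<"topic-a">>, <<"topic-b">>],
  cb_module       => my_group_subscriber_cb,
  message_type    => message,
  group_config    => [{partition_assignment_strategy, callback_implemented}],
  consumer_config => [{begin_offset, earliest}]
},
{ok, _Pid} = brod_group_subscriber_v2:start_link(Config).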

Everything works great, but one thing worries me: during a shutdown, I can consistently see the following message logged on different nodes:

group_subscriber_v2 *group-id* failed to flush commits before termination :timeout

This is logged at error level, so I treat it as abnormal behaviour.

This seems to be a safety mechanism to prevent the call to the group coordinator from hanging forever:

ok = flush_offset_commits(GroupId, Coordinator),
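My understanding is that the shutdown path makes a bounded call to the coordinator and logs the error when that call times out; roughly this pattern (a sketch only, the request term and the timeout value are assumptions rather than brod's actual internals):

%% Sketch of a bounded "flush commits" call during terminate. The request
%% term and the timeout value are assumptions, not brod's real internals.
flush_offset_commits(GroupId, Coordinator) ->
    try gen_server:call(Coordinator, commit_offsets, 5000) of
        ok -> ok
    catch
        exit:{timeout, _} ->
            logger:error("group_subscriber_v2 ~s failed to flush commits "
                         "before termination ~p", [GroupId, timeout]),
            ok
    end.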

Could it be related to the use of the callback-implemented partition assignment strategy? For example, the original group leader has already shut down, a new one is elected, it starts doing preparatory work, and that is when the flush-offsets call comes in.

Are there any logs/other information I could provide to simplify the investigation?

@zmstone
Contributor

zmstone commented Dec 14, 2024

Maybe it's because the coordinator process is in the middle of a rebalance, e.g. calling the member process to revoke assignments or assign partitions.
Do you happen to shut down all 3 nodes around the same period? If so, do you observe all three nodes logging the same, or is it always the second and third node logging it?

You can add some debug logs to the assignments_revoked and assign_partitions calls to confirm whether it is indeed a deadlock (see the sketch at the end of this comment).
There is no quick solution for the deadlock; the best I can think of is to make the two calls async, which would require some state-machine redesign for the coordinator.
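For example, something along these lines in the callback module (a minimal sketch; the callback arities and the do_revoke/do_assign helpers are placeholders for your existing logic, not exact brod signatures):

%% Sketch: entry/exit debug logs around the two callbacks suspected of
%% blocking against the coordinator. Arities and the do_revoke/do_assign
%% helpers are placeholders for the existing callback bodies.
assignments_revoked(State) ->
    logger:debug("assignments_revoked: enter"),
    Result = do_revoke(State),
    logger:debug("assignments_revoked: done"),
    Result.

assign_partitions(Members, TopicPartitions, State) ->
    logger:debug("assign_partitions: enter, ~p members", [length(Members)]),
    Assignments = do_assign(Members, TopicPartitions, State),
    logger:debug("assign_partitions: done"),
    Assignments.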

@Tasyp
Author

Tasyp commented Dec 18, 2024

Do you happen to shut down all 3 nodes around the same period?

Yes, it is done via a Kubernetes rolling update, so not exactly at the same time but close enough to overlap.

If so, do you observe all three nodes logging the same, or is it always the second and third node logging it?

~90+% of the time it is the second and third node. It is very rare that all 3 fail to flush commits.
