I have the following setup: two consumers of different topics inside the same consumer group, distributed among 3 nodes and using partition_assignment_strategy=callback_implemented.
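For context, the subscribers are started roughly like this. This is a minimal sketch, not the exact production config; the client name, callback module, group id and topics are placeholders, and the config keys should be checked against the brod version in use.

```erlang
%% Minimal sketch of the setup described above. kafka_client, my_subscriber_cb,
%% the group id and the topics are placeholders.
start_subscribers() ->
    GroupConfig = [{partition_assignment_strategy, callback_implemented}],
    Common = #{client       => kafka_client,
               group_id     => <<"my-group">>,
               cb_module    => my_subscriber_cb,
               group_config => GroupConfig},
    %% two subscribers, different topics, same consumer group
    {ok, _} = brod:start_link_group_subscriber_v2(Common#{topics => [<<"topic-a">>]}),
    {ok, _} = brod:start_link_group_subscriber_v2(Common#{topics => [<<"topic-b">>]}),
    ok.
```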
Everything works great, but there is one thing that worries me: during a shutdown, I consistently see the following statement printed on different nodes:
group_subscriber_v2 *group-id* failed to flush commits before termination :timeout
This is logged as an error, so I treat it as a sign of abnormal execution.
This seems to be a safety mechanism to prevent the call to the group coordinator from hanging forever (brod/src/brod_group_subscriber_v2.erl, line 404 at commit e18151c).
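For illustration only, this is the general shape of such a guard; it is a sketch, not the actual brod source, and the commit_offsets call message and the 5000 ms timeout are assumptions.

```erlang
%% Sketch only, not the brod source: flush offset commits via a blocking call
%% with a finite timeout, and log an error instead of hanging forever when the
%% coordinator cannot answer in time (e.g. because it is busy in a rebalance).
flush_commits_or_log(Coordinator, GroupId) ->
    try
        ok = gen_server:call(Coordinator, commit_offsets, 5000)
    catch
        exit:{timeout, _} ->
            logger:error("group_subscriber_v2 ~s failed to flush commits "
                         "before termination :timeout", [GroupId])
    end.
```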
Could it be related to the use of the callback-implemented partition assignment strategy? For example: the original group leader is already shut down, a new one is elected and starts doing preparatory work, and that's when the call to flush offsets comes in.
Are there any logs/other information I could provide to simplify the investigation?
Maybe it's because the coordinator process is in the middle of a rebalance, e.g. calling the member process to revoke assignments or assign partitions.
Do you happen to shut down all 3 nodes around the same time? If so, do you observe all three nodes logging the same thing, or is it always the second and third node logging it?
You can add some debug logs to the assignments_revoked and assign_partitions calls to confirm whether it's indeed a deadlock.
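On the assign_partitions side, a hedged sketch of that kind of instrumentation in the callback module could look like the following (the exact callback signature and return shape should be checked against the brod version in use; do_assign/3 is a placeholder for the actual assignment logic). The assignments_revoked handling in the subscriber module could be instrumented the same way.

```erlang
%% Hedged illustration: entry/exit debug logs around the custom partition
%% assignment callback, to see whether the call ever returns during shutdown.
%% do_assign/3 stands in for the real assignment logic.
assign_partitions(Members, TopicPartitions, CbConfig) ->
    logger:debug("assign_partitions enter: ~p members, ~p topic-partitions",
                 [length(Members), length(TopicPartitions)]),
    Assignments = do_assign(Members, TopicPartitions, CbConfig),
    logger:debug("assign_partitions return"),
    Assignments.
```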
There is no quick solution for the deadlock. The best I can think of is to make the two calls async, but that would require some state-machine redesign for the coordinator.
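A generic illustration of the difference (not brod code): a blocking gen_server:call from one process to another can deadlock if the callee is, at the same time, blocked in a call back to the caller, whereas a cast plus a later ack message lets both sides keep processing.

```erlang
%% Generic sketch of sync vs. async request handling between two gen_servers.
%% The message names are illustrative only.
-module(async_revoke_sketch).
-behaviour(gen_server).
-export([start_link/0, revoke_sync/2, revoke_async/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

%% Synchronous: the caller is blocked until the member replies or the call
%% times out; this is the shape that can deadlock.
revoke_sync(MemberPid, Timeout) ->
    gen_server:call(MemberPid, assignments_revoked, Timeout).

%% Asynchronous: fire the request; the ack arrives later as a plain message,
%% so the caller never blocks on the member.
revoke_async(MemberPid) ->
    gen_server:cast(MemberPid, {assignments_revoked, self()}).

init([]) -> {ok, #{}}.

handle_call(assignments_revoked, _From, State) ->
    {reply, ok, State}.

handle_cast({assignments_revoked, From}, State) ->
    %% do the revocation work here, then acknowledge without blocking the caller
    From ! assignments_revoked_ack,
    {noreply, State}.

handle_info(_Msg, State) ->
    {noreply, State}.
```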