Remove cluster state version downgrade fallback #32297

vekterli · 2024-08-29T09:31:06Z

@geirst please review

Overlapping cluster controller leader intervals (caused by the old leader not yet knowing it has been deposed) carry an inherent race condition where an old state version and a newer state version are published concurrently. To avoid this, we want content nodes to only accept strictly increasing version numbers (for the lifetime of a process; these are currently not durably stored on content nodes). On the cluster controllers themselves, the version number is backed by a ZooKeeper quorum, ensuring that it is durably stored where it matters the most.

A content node only observing strictly increasing version numbers is an invariant that holds unless an explicit fallback is triggered, where we can still accept an older version.

This fallback was intended to be a "failsafe" if ZooKeeper state on the cluster controllers was lost, but its implementation depended on information that is not actually present in all CC RPCs, meaning that it could kick in even when not intended, thus rendering the race condition protection void.

The CC RPC in question is not easily extensible, so instead remove the fallback entirely. This has the bonus of content nodes actually being able to rely on the version invariant internally. The downside is that content node and distributor processes must be restarted to accept a lower state version upon ZK state loss, but in that case you probably have bigger problems.
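For illustration only, the invariant described above can be sketched as a small in-memory gate. This is not the actual content node code: `StateVersionGate` and `try_accept` are hypothetical names, and treating a re-sent identical version as rejected is an assumption of this sketch.

```python
from typing import Optional


class StateVersionGate:
    """Hypothetical per-process tracker of the last accepted state version."""

    def __init__(self) -> None:
        # Kept in memory only, mirroring the note that versions are not
        # durably stored on content nodes: a restart resets the gate.
        self._last_accepted: Optional[int] = None

    def try_accept(self, version: int) -> bool:
        """Accept a version only if it is strictly greater than the last one."""
        if self._last_accepted is not None and version <= self._last_accepted:
            # No downgrade fallback: older (or equal) versions are refused.
            return False
        self._last_accepted = version
        return True


gate = StateVersionGate()
# A new leader publishes 11 while a deposed leader still publishes 10.
results = [gate.try_accept(v) for v in (9, 11, 10, 11, 12)]
print(results)  # [True, True, False, False, True]
```

Because the gate has no downgrade path, the only way to accept a lower version in this sketch is to construct a fresh `StateVersionGate`, which corresponds to restarting the process.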


geirst

👍

vekterli requested a review from geirst August 29, 2024 09:31

geirst approved these changes Aug 29, 2024


vekterli merged commit d235a77 into master Aug 29, 2024
3 checks passed

vekterli deleted the vekterli/remove-state-version-downgrade-fallback branch August 29, 2024 13:09
