Remove cluster state version downgrade fallback #32297
Merged
@geirst please review
To avoid inherent race conditions with overlapping cluster controller leader intervals (caused by the old leader not yet knowing it has been deposed), where both an old state version and a newer state version are published concurrently, we want to accept only strictly increasing version numbers. This applies for the lifetime of a process, since these version numbers are currently not durably stored on content nodes. On the cluster controllers themselves, the version number is backed by a ZooKeeper quorum, ensuring that it is durably stored where it matters the most.
That a content node only observes strictly increasing version numbers is an invariant that holds unless an explicit fallback is triggered, in which case an older version can still be accepted.
This fallback was intended as a "failsafe" in case ZooKeeper state on the cluster controllers was lost, but its implementation depended on information that is not actually present in all cluster controller (CC) RPCs. It could therefore kick in even when not intended, rendering the race condition protection void.
The CC RPC in question is not easily extensible, so instead remove the fallback entirely. This has the bonus of content nodes actually being able to rely on the version invariant internally. The downside is that content node and distributor processes must be restarted to accept a lower state version upon ZooKeeper state loss, but in that case you probably have bigger problems.
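
For illustration only, the strictly-increasing acceptance rule described above could be sketched as follows. This is a minimal Python model and not the actual content node code: the `StateVersionGate` class, its `try_accept` method, and the version numbers used are all hypothetical names and values chosen for the example.

```python
from typing import Optional


class StateVersionGate:
    """Hypothetical per-process gate that only accepts strictly increasing versions."""

    def __init__(self) -> None:
        # Held in memory only (an assumption mirroring "not durably stored"),
        # so a freshly started process has no memory of earlier versions.
        self._highest_accepted: Optional[int] = None

    def try_accept(self, version: int) -> bool:
        """Return True and record the version only if it is strictly newer."""
        if self._highest_accepted is not None and version <= self._highest_accepted:
            # Stale or duplicate version, e.g. from a deposed leader. No fallback path.
            return False
        self._highest_accepted = version
        return True


# Overlapping leaders: the new leader publishes 11 while the old one re-sends 10.
gate = StateVersionGate()
results = [gate.try_accept(v) for v in (10, 11, 10, 12)]

# After a process restart (modelled here as a new gate), a lower version is accepted.
restarted_gate = StateVersionGate()
accepted_after_restart = restarted_gate.try_accept(3)
```

In this sketch the stale version 10 arriving after 11 is rejected, and the only way to accept a lower version is to start with a fresh gate, which corresponds to restarting the process.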