Add partial support for cancelling async write mutex requests #6486
While we can't cancel the actual wait on the write mutex, we can dequeue specific Transactions which are waiting for their turn to write, and only block when the DB itself is destroyed. This makes it so that individual Transactions with cancelled async writes can be cleaned up while the write lock is held.
This is done by changing the async write queue in DB::AsyncCommitHelper from arbitrary callbacks to a queue of Transaction instances. It holds unowned Transactions to avoid ever having the final reference to the DB held on the worker thread. This required some adjustments to the locking to ensure that we're holding a lock whenever we need a pointer to remain valid. To avoid lock order inversions, all of the calls to AsyncCommitHelper have to be done without a lock held. We only make those calls from the Transaction's thread, so that didn't actually cause many problems.
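As a rough illustration of the queueing idea above, here is a minimal sketch of a queue that holds unowned pointers and lets a waiter be dequeued before its turn arrives. The names `FakeTransaction`, `PendingWriteQueue`, `enqueue`, `cancel` and `grant_next` are hypothetical and are not the actual DB::AsyncCommitHelper interface; the real code also has to deal with the write mutex itself and with DB teardown, which this sketch leaves out.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a Transaction; not the real class.
struct FakeTransaction {
    bool got_write_turn = false;
};

// Hypothetical, simplified model of a queue of pending write requests.
// It stores raw (unowned) pointers, so every access to a queued pointer
// happens while m_mutex is held.
class PendingWriteQueue {
public:
    // Called from the transaction's thread to request a turn to write.
    void enqueue(FakeTransaction* tr)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push_back(tr);
    }

    // Called from the transaction's thread to cancel a pending request.
    // Returns true if the request was still queued and has been removed.
    bool cancel(FakeTransaction* tr)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = std::find(m_pending.begin(), m_pending.end(), tr);
        if (it == m_pending.end())
            return false;
        m_pending.erase(it);
        return true;
    }

    // Called by the worker once it is able to hand out the next turn.
    // The pointer is dereferenced under the lock, so a concurrent cancel()
    // cannot complete (and let the caller free tr) while it is in use here.
    FakeTransaction* grant_next()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_pending.empty())
            return nullptr;
        FakeTransaction* tr = m_pending.front();
        m_pending.pop_front();
        tr->got_write_turn = true;
        return tr;
    }

private:
    std::mutex m_mutex;
    std::deque<FakeTransaction*> m_pending;
};
```

In this sketch a transaction that successfully calls `cancel()` can be destroyed immediately afterwards, because the queue no longer holds a pointer to it.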
The changes to BowlOfStonesSemaphore scoping in sync tests are to fix a pre-existing race condition in those tests which tsan is now complaining about. The semaphore was often being captured by something which outlived it, and could theoretically be destroyed before the call to `pthread_cond_signal()` returned. I doubt this ever caused any actual problems, but it could explain extremely rare crashes in sync tests.
This fixes the same problem as #6413, but without the part where we'd sometimes end up closing the DB from the async commit helper thread, as that really didn't work.
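The semaphore lifetime issue described above can be sketched roughly as follows. `SimpleSemaphore` and `run_with_safe_scoping` are hypothetical names and this is not the real BowlOfStonesSemaphore or the actual test code; the sketch only shows the general scoping pattern of declaring the semaphore so that it outlives everything that captures it, and the actual changes in the tests may look different.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical minimal semaphore, similar in spirit to a test helper.
class SimpleSemaphore {
public:
    void add_stone()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_count;
        m_cv.notify_one();
    }

    void get_stone()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_count > 0; });
        --m_count;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_count = 0;
};

bool run_with_safe_scoping()
{
    // Declared in the outer scope, so it is destroyed only after the
    // worker that captures it has finished.
    SimpleSemaphore sem;
    bool signalled = false;
    {
        std::thread worker([&] {
            signalled = true;
            sem.add_stone();
        });
        sem.get_stone();
        // Joining here guarantees that add_stone() has fully returned
        // before sem goes out of scope.
        worker.join();
    }
    return signalled;
}
```

If the semaphore were instead declared in a scope that ends as soon as `get_stone()` returns, the waiter could destroy it while the signalling thread was still inside `add_stone()`.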