Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd: make CommitQueue.Add non-blocking #5153

Closed

Conversation

cfromknecht
Copy link
Contributor

Alternative to #5117

This PR takes a slightly different approach, by making the CommitQueue API non-blocking. This allows
the test to run with intentionally conflicting transactions and succeed without relying on timing assumptions.

Thanks to @bhandras for hunting down the root cause :)

@bhandras bhandras self-requested a review March 29, 2021 21:06
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @cfromknecht, thanks for jumping on this too. I like the condition variable based design very much. Did a few rounds of payments benchmark and it seems like the performance difference is negligible but there is a race condition when the context is canceled (iuuc) that prevents shutdown.

// transaction in our queue.
c.queueCond.L.Lock()
for c.freeCount > 0 || c.queue.Front() == nil {
c.queueCond.Wait()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a few runs of make itest etcd=1 icase=async_payments_benchmark to see if there's any performance difference between this and the RWMutex based queue implementation and this seems to deadlock. I've uploaded the profile: https://paste.ubuntu.com/p/J3DZCDwCCQ/

To me it seems like that the problem might be that even though the context is canceled, we never signal the queueCond. We only signal when waiting for the queue to "stop".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch @bhandras! I was able to figure it out, see the second commit :)

Currently we wait for a transaction to be applied by waiting for either
the done signal or the context to be canceled. When the context is
canceled, it's possible for executeErr to still be modified by the
concurrent execution of the transaction. We fix this by only
reading/returning the value when the transaction has successfully
executed.

This doesn't appear to have been caught by our unit tests because Add
would block while executing non-conflicting transactions. In the
subsequent commits that make Add non-blocking, this causes
TestAbortContext to fail since it returned nil (the value read from
executeErr prematurely).
Forks the passed context and adds an additional cancel function that is
called when the database is closing. This ensure the commit queue is
properly exited.
@cfromknecht cfromknecht force-pushed the etcd-commit-queue-v2 branch from 3cb6f38 to 635a6aa Compare April 1, 2021 01:05
@cfromknecht cfromknecht requested a review from bhandras April 7, 2021 23:09
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK LGTM, thanks @cfromknecht 🍻

// At this point it is safe to execute the "unblocked" tx, as no
// blocked tx will be read from the queue until the freeCount is
// decremented back to 0.
go func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this goroutine normally? It's handy for the commit queue unit test though. Did a few benchmarks with/without and performance difference is negligible.

@cfromknecht cfromknecht requested a review from halseth April 21, 2021 20:05
// on this "unblocked" tx. Increment our free counter before
// unlocking so that the mainLoop stops pulling off blocked
// transactions from the queue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra line

func (c *commitQueue) signalUntilShutdown() {
for {
select {
case <-time.After(time.Millisecond):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why so rapidly? 😮

Shouldn't a single signal be enough to unlock the loop?

queueMu sync.RWMutex
queueCond *sync.Cond
queue *list.List
freeCount uint32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should document these a bit?

}

// Execute the next blocked transaction.
top()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we don't hold the mutex here now, how can we be sure that no other commit comes in while this is executing? I guess as long as it doesn't conflict that's okay, and any conflict would go back on the queue=

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only want to make sure there are no competing commits for the same keys, and since this (top) element already increased ref counts any subsequent txns that touch those keys are queued. Transactions that do not have conflicting keys are executed without queuing up (etcd does allow multi writers)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also one thing that may not be immediately clear is that we are OK with conflicting transactions, since the STMs will retry them and eventually one of them will succeed first followed by the other. The commit queue makes these conflicts less likely (reducing the number of retries to at most 2).

@bhandras
Copy link
Collaborator

bhandras commented Aug 6, 2021

Closing this as a derivative solution has been implemented in: #5513

@bhandras bhandras closed this Aug 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants