etcd: fix dereferencing issue in commit queue causing contention and change design to be more scalable #5513

Merged: 3 commits, Aug 6, 2021

Collaborator

@bhandras bhandras commented Jul 14, 2021

This commit splits the commit queue changes from #5392.
The original idea behind the commit queue was to enable seamless concurrency for non-blocking transactions, while queuing up all txns so that the number of individual retries is minimized. The old design did not implement the queuing part correctly, which may have caused contention on certain keys and slowed things down considerably. This PR attempts to fix these issues by making the queue design simpler and more scalable.

Design changes heavily inspired by #5153

Rebased on: #5579

Collaborator

@guggero guggero left a comment

Great work, LGTM 💯

Only nits and questions at this point.

@bhandras bhandras force-pushed the commit_queue_fix branch 4 times, most recently from bbd8e1a to f806408 Compare July 27, 2021 14:31
func (c *commitQueue) Stop() {
	// Signal the queue's condition variable to ensure the mainLoop reliably
	// unblocks to check for the exit condition.
	c.queueCond.Signal()

Member

In the past we've found that at times this initial signal gets sort of "swallowed", causing us to add a loop around it, like in these areas:

Member

Don't we also need to cancel the main context, or send on some quit channel here?

Collaborator

Ah, I wasn't aware of this. I suggested removing the loop in a previous review cycle 😅

Collaborator Author

I tried to look up any references about this, but didn't succeed. Is this a known issue with Go condition variables? Maybe we did the loops because of other circumstances on our side?

Member

> Is this a known issue with Go condition variables? Maybe we did the loops because of other circumstances on our side?

Good question... we could just be using the API wrong, and then ended up settling on this workaround as it seemed to make a difference in practice. Otherwise, we would see behavior where things would never shut down without this type of loop added. I guess I view it as more of a defensive thing, but we should def try to repro it in the Go playground or something.
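One possible explanation for a "swallowed" signal, offered here as an illustration rather than a diagnosis of lnd's code: `sync.Cond.Signal` only wakes a goroutine that is already waiting, so a signal sent before the main loop reaches `Wait` has no effect. The minimal Go sketch below avoids that by recording the exit condition under the lock and having the loop check it before every `Wait`. The `worker` type, its `quit` flag, and its `done` channel are hypothetical names, not taken from `commit_queue.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical stand-in for a queue with a cond-driven main loop.
type worker struct {
	mx   sync.Mutex
	cond *sync.Cond
	quit bool          // exit condition, guarded by mx
	done chan struct{} // closed when mainLoop returns
}

func newWorker() *worker {
	w := &worker{done: make(chan struct{})}
	w.cond = sync.NewCond(&w.mx)
	go w.mainLoop()
	return w
}

func (w *worker) mainLoop() {
	defer close(w.done)

	w.mx.Lock()
	defer w.mx.Unlock()

	// Check the predicate before every Wait. If Stop ran before this
	// goroutine got here, quit is already true and Wait is never called,
	// so it does not matter that the earlier Signal woke nobody.
	for !w.quit {
		w.cond.Wait()
	}
}

// Stop records the exit condition under the lock, signals once, and then
// blocks until the main loop has returned.
func (w *worker) Stop() {
	w.mx.Lock()
	w.quit = true
	w.mx.Unlock()

	w.cond.Signal()
	<-w.done
}

func main() {
	w := newWorker()
	w.Stop()
	fmt.Println("stopped")
}
```

With the state change made under the same mutex the waiter holds while checking the predicate, a single `Signal` should be enough in this sketch; whether the retry loops elsewhere were needed for this reason or another one is not something this example can settle.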

Member

FWIW, it doesn't seem to be causing any issues in the itests (which are actually all green on this PR!!!), so I guess let's just monitor it and see if it triggers any issues re shutdown.

Collaborator Author

sgtm

Member

@Roasbeef Roasbeef left a comment

LGTM 🎨

Member

Roasbeef commented Aug 5, 2021

Just needs a rebase now to fix the release notes conflict.

This commit fixes an issue where subsequent transaction retries may have
changed the read/write sets inside the STM, which in turn left junk
references to these keys in the CommitQueue. The leftover keys potentially
conflicted with subsequent transactions, queueing them up and causing
throughput degradation.

This commit builds on the ideas of @cfromknecht in lnd/5153. The
addition is that the design is now simpler and more robust by queueing
up everything, but allowing maximal parallelism where txns don't block.
Furthermore, the commit makes CommitQueue.Done() private, essentially
removing the need to understand the queue externally.
@bhandras bhandras merged commit 254d9be into lightningnetwork:master Aug 6, 2021
@bhandras bhandras deleted the commit_queue_fix branch September 12, 2023 15:27