wip: etcd fixes and performance improvements #5392
An idea to fix the mission control store could be to completely eliminate the use of the cursor, but instead extend the …
Great work on this, looks like you were able to identify and address quite a number of bottlenecks.
I can't say I fully understand the whole code around STM just yet, so take my review with a grain of salt. But the changes look good as far as I can tell.
channeldb/payment_control.go (Outdated)

	}
-	err = htlcBucket.Put(htlcAttemptInfoKey, htlcInfoBytes)
+	err = htlcsBucket.Put(append(htlcAttemptInfoKey, htlcIDBytes...), htlcInfoBytes)
We need to be careful when using append() with global variables, as sometimes that can modify their value. See https://trstringer.com/golang-append/
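To illustrate the gotcha with a minimal sketch (the variable name mirrors the code above, but this is not the PR's code): appending to a global byte slice can write into its shared backing array whenever that slice has spare capacity, so copying the prefix into a fresh slice first is the safe pattern.

```go
package main

// Hypothetical global key prefix, mirroring the pattern discussed above.
var htlcAttemptInfoKey = []byte("ai")

// unsafeKey appends directly to the global slice. Here cap == len, so
// append happens to allocate a new array, but if the global ever gained
// spare capacity, two calls could alias the same backing array and the
// second call would clobber the first result.
func unsafeKey(id []byte) []byte {
	return append(htlcAttemptInfoKey, id...)
}

// safeKey copies the prefix into a fresh slice first, so the global's
// backing array can never be written to.
func safeKey(id []byte) []byte {
	key := make([]byte, 0, len(htlcAttemptInfoKey)+len(id))
	key = append(key, htlcAttemptInfoKey...)
	return append(key, id...)
}
```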
Thanks for the review @guggero !! :)
channeldb/payment_control.go (Outdated)

@@ -449,23 +442,22 @@ func (p *PaymentControl) updateHtlcKey(paymentHash lntypes.Hash,
		return fmt.Errorf("htlcs bucket not found")
	}
-	htlcBucket := htlcsBucket.NestedReadWriteBucket(htlcIDBytes)
-	if htlcBucket == nil {
+	if htlcsBucket.Get(append(htlcAttemptInfoKey, htlcIDBytes...)) == nil {
I think this change would actually be bad for bbolt's performance. This would mean all HTLCs would be in the same top-level bucket, which makes write operations very memory intensive with bbolt when there are many entries, as it needs to copy its whole btree when adding new leaves.
But I can definitely see why this would help etcd. Not sure what other options we have. Though if this is a performance bottleneck, then it's definitely worth brainstorming a bit more.
Or perhaps even decide that etcd performance has precedence here, if the performance hit in bbolt is not too dramatic.
The variable names may be a bit misleading, but the htlcs are in the payment's own bucket (where the key is the payment hash). Before it was the same but there were three sub buckets per htlc attempt.
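As a sketch of the flattened layout described here (illustrative names, not the branch's actual code), each attempt's record can live directly in the payment's bucket under a composite <prefix><attempt-id> key instead of a nested sub-bucket per attempt:

```go
package main

import "encoding/binary"

// Illustrative short key prefix; the real key names differ.
var htlcAttemptInfoKey = []byte("ai")

// htlcKey builds a flat composite key <prefix><attemptID>, so all of a
// payment's HTLC records sit side by side in the payment bucket and can
// be read with a single prefix scan instead of per-attempt bucket hops.
func htlcKey(prefix []byte, attemptID uint64) []byte {
	key := make([]byte, 0, len(prefix)+8)
	key = append(key, prefix...)
	var id [8]byte
	binary.BigEndian.PutUint64(id[:], attemptID)
	return append(key, id[:]...)
}
```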
channeldb/payments.go (Outdated)

@@ -67,31 +67,31 @@ var (
	// paymentSequenceKey is a key used in the payment's sub-bucket to
	// store the sequence number of the payment.
-	paymentSequenceKey = []byte("payment-sequence-key")
+	paymentSequenceKey = []byte("s") // "payment-sequence-key"
I like the idea of shortening the keys. Though perhaps we should always use two characters of which at least one is not in the hex alphabet to make it a bit more clear when just looking at the keys?
For example:
ci -> payment-creation-info
ai -> htlc-attempt-info
si -> htlc-settle-info
fi -> htlc-fail-info
pf -> payment-fail-info
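Spelled out as Go (purely illustrative; these are not the constants that ended up in the code):

```go
package main

// Two-character keys as suggested above; each includes at least one
// character outside the hex alphabet ('i', 'p') so they stand out from
// raw hex data when inspecting the DB by hand.
var (
	paymentCreationInfoKey = []byte("ci") // payment-creation-info
	htlcAttemptInfoKey     = []byte("ai") // htlc-attempt-info
	htlcSettleInfoKey      = []byte("si") // htlc-settle-info
	htlcFailInfoKey        = []byte("fi") // htlc-fail-info
	paymentFailInfoKey     = []byte("pf") // payment-fail-info
)
```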
Yeah, maybe two-char names are better for readability. I wonder how much we'd save by shortening our keys, since I think bbolt also serializes the keys and is not dictionary based. Maybe I'm wrong though.
Isn't all this a breaking change? Also IIRC, for etcd we hashed these keys down anyway.
Yes, it is a breaking change, but there are some huge gains even in etcd. With the etcd driver we currently hash the bucket path and then append the key and a postfix indicating whether this is a bucket or value key.
So something like this: hash(hash(parent + b) + child + b) + key + b/v. The last part is where these long names that I renamed would be, for each payment times each HTLC.
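A rough sketch of that key construction (simplified and hypothetical; the actual etcd driver differs in details such as the hash function and marker bytes):

```go
package main

import "crypto/sha256"

// flatKey sketches how a nested bucket path can be flattened into a
// single etcd key: each bucket level is hashed together with its
// parent's hash and a bucket marker, then the leaf key and a value
// marker are appended verbatim. Long leaf key names therefore inflate
// every stored key, which is why shortening them helps.
func flatKey(buckets [][]byte, key []byte) []byte {
	var parent [32]byte
	for _, bucket := range buckets {
		h := sha256.New()
		h.Write(parent[:])
		h.Write(bucket)
		h.Write([]byte("b")) // bucket marker
		copy(parent[:], h.Sum(nil))
	}
	out := append([]byte{}, parent[:]...)
	out = append(out, key...)
	return append(out, 'v') // value marker
}
```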
Another thing is that if we flatten the payment's htlc bucket, that is also a change that needs a migration, so we could at least be efficient about breaking our schema :)
About flattening: to decide whether or not a payment is settled, we now iterate the htlc bucket (multiple times during the payment lifecycle, actually) and then for each HTLC we also get the attempt, settle and fail info keys.
Which means without flattening, and assuming the above hashed structure, we'd need num(htlcs) * (3 + 1) roundtrips to the DB per payment every time we need to calculate the payment status. The plus 1 comes from getting into the attempt bucket for the HTLC, where the key is a global counter, so we have no way to know which ones to prefetch (keys are guaranteed to be monotonic but not sequential).
channeldb/payment_control.go (Outdated)

	// Set a new upper bound in the DB every 1000 payments to avoid
	// conflicts on the sequence when using etcd.
	if p.currPaymentSeq == p.storedPaymentSeq {
Cool, I see how this should speed things up considerably! Worst case here is that we just "waste" 1k sequences if we restart lnd directly after setting a new upper bound. But that probably shouldn't matter too much if we have gaps in the payment sequences.
Yep, wasting an interval is not a huge issue, since we only require these IDs to be monotonic. With etcd there was huge contention on this sequence key, causing all payment transactions to queue up and be retried in the STM. This was one of our major bottlenecks.
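The batching can be sketched like this (illustrative only; the field and callback names are hypothetical, not lnd's actual implementation):

```go
package main

// paymentSeq hands out monotonic sequence numbers, persisting only a
// new upper bound every batchSize allocations so that concurrent
// payment transactions don't all contend on one DB key.
type paymentSeq struct {
	curr, stored uint64
	batchSize    uint64
	// persist stores the new upper bound; in lnd this would be a DB
	// write inside a transaction (hypothetical callback here).
	persist func(upperBound uint64) error
}

// next returns the next sequence number, reserving a fresh batch in
// the DB only when the current one is exhausted. On restart, at most
// one batch of IDs is "wasted", which is fine since the IDs only need
// to be monotonic, not gapless.
func (p *paymentSeq) next() (uint64, error) {
	if p.curr == p.stored {
		if err := p.persist(p.stored + p.batchSize); err != nil {
			return 0, err
		}
		p.stored += p.batchSize
	}
	p.curr++
	return p.curr, nil
}
```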
routing/missioncontrol_store.go (Outdated)

	return kvdb.Update(b.db, func(tx kvdb.RwTx) error {
	b.queueMx.Lock()
	defer b.queueMx.Unlock()
	b.queue.PushBack(rp)
Hmm, so the "results" are never actually read from the store, only on startup? Pathfinding always uses its internal memory representation?
Exactly. Also, I was thinking that maybe the whole design of the mission control store is wrong, since we only use the keys for ordering, to be able to delete old entries. I was thinking about a circular-list-based design where we can deterministically overwrite old entries, so no deletion would be needed, but I didn't really expand on the idea yet.
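The circular-buffer idea could look roughly like this (a concept sketch only, not code from this branch):

```go
package main

// ringStore keeps the most recent N payment results, overwriting the
// oldest entry deterministically instead of deleting old keys with a
// cursor. The slot to write is just an index modulo the capacity, so
// no range scans or deletes are ever needed.
type ringStore struct {
	slots []string // payment results, serialized (illustrative)
	next  int      // index of the slot to overwrite next
	count int      // number of slots currently populated
}

func newRingStore(capacity int) *ringStore {
	return &ringStore{slots: make([]string, capacity)}
}

// add stores a result, silently evicting the oldest one once the ring
// is full.
func (r *ringStore) add(result string) {
	r.slots[r.next] = result
	r.next = (r.next + 1) % len(r.slots)
	if r.count < len(r.slots) {
		r.count++
	}
}
```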
I ran this branch (commit 12e2fc025d8d100df9238fa7dd2a0ced32ecbddc) again on the benchmark rig. Still seeing performance going down:
The benchmark eventually crashes. Some of the errors in the log:
You need to compact and defrag, since with that many payments we'll keep a lot of history in etcd. Or you need to set autocompact and defrag.
Also interesting that I'm not able to reproduce the same degradation as you (ptal, results included above). What version of etcd did you use? Is this the same gcloud instance?
How to configure this exactly?
I use the code that you supplied to me earlier: https://github.com/bottlepay/lightning-benchmark/compare/etcd-improvements-test. Running on the same gcloud instance. Maybe you can give me a patch on top of that so that I am surely running the right config?
Okay, did some more tests, and it seems like after 50K payments degradation was also apparent on my setup. Fortunately the cause was simple to pinpoint, and it seems that my simplistic fix to the MC store design was not sufficient. Deleting a high number of entries in a foreach cursor is still bad for performance if we do it every second. For a simple test I just commented out those parts, and the perf degradation is completely non-existent, which is good news.
There are a few tunings I did that I think also help with the benchmark, although I'm not sure how much yet since I had limited time today to test this.
It seemed that on my machine Alice's etcd kept using about 25% CPU, going up to 60-70%, so less than one "whole core" amount. Seems like the benchmark is faster when …
Put up a fix (last commit) which I think is better than my original idea. The gist is that it computes the difference and stores that, rather than iterating the range. More efficient in etcd. For reference, benchmark on my end:
Question about these etcd parameters:
How do you know for sure that the settings above guarantee that an error like …
kvdb/interface.go (Outdated)

	// Prefetch will prefetch keys (exact match) and ranges (prefix match),
	// so that subsequent fetches for those keys and keys in ranges don't
	// go to the DB.
	Prefetch(keys []string, ranges []string)
To me this seems to be an interface that is pretty etcd-specific. Isn't it possible to define a more abstract interface for prefetching at the kvdb level?
kvdb/etcd/stm.go (Outdated)

@@ -732,6 +942,72 @@ func (s *stm) OnCommit(cb func()) {
	s.onCommit = cb
}

// Prefetch will prefetch the passed keys and prefixes in one transaction.
// Keys and prefixes that we already have will be skipped.
func (s *stm) Prefetch(keys []string, prefixes []string) {
I must say that I am a little intimidated by the STM code. It is almost as if logic that is normally part of a (battle-tested) database server is now implemented in lnd itself. Do you think it is solid enough to trust it with critical Lightning data? It feels risky, but maybe I am not familiar enough with etcd.
Yeah, since we needed to bend the usual implementation to fit our patterns/interfaces, we ended up having to write more custom logic here than originally anticipated. I also get somewhat squeamish when we need to make significant changes in this area. FWIW, if we want to rely on them, there are tools like Jepsen that can be used to find consistency-violating issues in distributed databases. A run of the tool a few years ago actually found a number of issues in the etcd logic as well as some of the concurrency APIs (leasing, locking, etc).
I am also getting stable results with the changes above, even without defrag. It went up to 225k payments, but then this happened in the logs:
(repeating) I thought that the disk was full, but that doesn't appear to be the case. Maybe this is because defrag isn't running?
Very cool set of optimizations!
There's definitely a lot we can do here to improve the underlying data model of the KV interface as well as implementations of certain major data stores. I think we should be careful here though to not over invest in what may be incremental performance improvements from the PoV of external performance, and also to not force the interfaces to grow in order to accommodate some of these optimizations (like the caller needing to know what to prefetch, etc).
I'm also apprehensive to force migrations to re-order the bucket structures for bolt, etc., when lately we can avoid them mainly by just relying on TLV blobs appended to certain values.
With that said, I think a lot of what's in this PR are easy wins including: the optimization of the borked check (runs each time we need to update a commitment, may very well just be in memory), the batch allocation of sequence numbers, delayed mission control syncing (as it's non-critical really), and the improved commit queue for etcd.
kvdb/interface.go (Outdated)

	// Prefetch will prefetch keys (exact match) and ranges (prefix match),
	// so that subsequent fetches for those keys and keys in ranges don't
	// go to the DB.
	Prefetch(keys []string, ranges []string)
Hmm, this seems to be leaking too much information about the underlying semantics of the database to the caller. Why not just hide it further underneath the existing API? Also not clear if this extended bucket is useful for anything other than the etcd optimizations implemented in this PR.
Same goes for the methods below as well, which forces a distinction between range/bucket/value keys.
IIRC, the version compaction is auto, but the defrag needs to happen manually? A few weeks ago we defragged the Pool database, and reclaimed gigabytes in one swoop.
One of those things where you kinda just need to "set it high enough to hopefully never happen"...
I ran again with defrag in the background, but got this:
Doesn't sound good to me.
Also let's update this to etcd 3.5 so we inherit their bug fixes and performance optimizations?
I am testing on etcd 3.5. Defrag worked a few times (in the endless loop provided above), but then started failing:
Did run some new tests using the bottlepay benchmark suite. These are interesting because a lot has changed since this PR was started, and now all state is in the remote DB. Median performance is a stable ~31 TPS, sometimes fluctuating upwards to a maximum of ~68 TPS. Machine: I did use etcd 3.5, autocompact every 1000 revisions.
Defragged
I've also been working on a benchmark variation that includes a router node. The payments will take the A->B->C route. Perhaps it is interesting to take a look at that too, to spot any bottlenecks in the forwarding logic. https://github.com/bottlepay/lightning-benchmark/tree/etcd-updates-router
Got similar performance. According to …
The last two commits are still WIP, added for some additional testing. The benchmark results above are without those two.
Not bad, those results with a routing node in the path.
I ran the commit …
That was fast. Curious what happens when you let it run to 1M payments, which should only take around an hour on your box it seems. Another interesting thing would be to run it with replicated etcd, which may be closer to a production setup?
This commit adds a new generic Cache to kvdb which is capable of prefetching large buckets recursively while skipping arbitrary buckets at the same time. This mechanism lets us use such a cache for buckets that don't grow indefinitely but are accessed very frequently, essentially reducing DB operations to just the puts and deletes.
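The idea can be sketched as follows (purely illustrative; the real kvdb cache operates on the bucket interfaces, not on string maps):

```go
package main

// bucketCache sketches the generic cache described above: prefetch a
// bucket subtree into memory once, skip listed sub-buckets, and serve
// subsequent reads from memory so that only puts and deletes need to
// hit the DB. Purely illustrative, not the lnd type.
type bucketCache struct {
	vals map[string][]byte   // cached key/value pairs by bucket path
	skip map[string]struct{} // bucket paths excluded from caching
}

func newBucketCache(skipBuckets ...string) *bucketCache {
	c := &bucketCache{
		vals: make(map[string][]byte),
		skip: make(map[string]struct{}),
	}
	for _, b := range skipBuckets {
		c.skip[b] = struct{}{}
	}
	return c
}

// preload records a fetched key/value pair unless its bucket path is in
// the skip set (the fetch itself would be a recursive DB range read).
func (c *bucketCache) preload(bucketPath, key string, val []byte) {
	if _, ok := c.skip[bucketPath]; ok {
		return
	}
	c.vals[bucketPath+"/"+key] = val
}

// get serves a read from memory; a miss would fall through to the DB.
func (c *bucketCache) get(bucketPath, key string) ([]byte, bool) {
	v, ok := c.vals[bucketPath+"/"+key]
	return v, ok
}
```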
Closing this PR as most of the fixes have been merged already and the rest can be tracked separately.
Edit:
- rebased on #5516 (1st commit)
- rebased on #5515 (2nd commit)
- rebased on #5514 (3rd commit)
- rebased on #5513 (4th-6th commits)
- rebased on: kvdb+channeldb: extend kvdb with Prefetch for prefetching buckets in one go and speed up payment control by prefetching payments on hot paths #5640 (merged)
The results below are somewhat outdated since now all data is in the remote DB.

Description:
This PR adds a few performance improvements to LND's etcd kvdb wrapper and fixes a few of the most important hotspots. The performance improvement is pretty dramatic and could possibly be further improved. Some of the changes require a migration, which is not part of the PR until ACKed.
Test 1:
make itest etcd=1 icase=async_payments_benchmark
master
bhandras:etcd_improvements
For reference the same tests with bbolt:
master
bhandras:etcd_improvements
Test 2:
Bottlepay benchmark (bottlepay/lightning-benchmark#6)
lnd-etcd-notls with etcd 3.5 (machine is a Hetzner AX41-NVMe dedicated server):
It's possible that there's still some amount of degradation over time. After better caching and prefetching at hotspots, the main culprits seemed to be the payment sequence and the mission control store. In my tests, temporarily removing those components completely eliminated all degradation (fluctuations still remained).