Run async event processing in parallel #2491

Open
TheBlueMatt opened this issue Aug 11, 2023 · 6 comments

Comments

@TheBlueMatt
Collaborator

I think we can Just Do This (tm) with some additional documentation on events and the handlers. CC https://github.com/orgs/lightningdevkit/discussions/2381

@MaxFangX
Contributor

MaxFangX commented Nov 7, 2024

What would it take to do this? I'm looking at the process_events_body! macro in channelmanager.rs:

for (event, action_opt) in pending_events {
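    // Note: each event is handled to completion (serially) before the
    // next loop iteration begins.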
    $event_to_handle = event;
    $handle_event;
    if let Some(action) = action_opt {
        post_event_actions.push(action);
    }
}

Looks like we'd need to change the macro so it somehow works for both async / blocking, and slap a futures::future::join_all (or equivalent) in there?
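Something like this, maybe? A rough free-standing sketch of that loop with hypothetical signatures (the real macro would still need to support the blocking variant):

use futures::future::join_all;

// Hypothetical free-function version of the loop above: run all handler
// futures concurrently instead of awaiting each one in turn, while still
// collecting the post-event actions.
async fn process_events_in_parallel<E, A, F, Fut>(
    pending_events: Vec<(E, Option<A>)>,
    handle_event: F,
) -> Vec<A>
where
    F: Fn(E) -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    let mut post_event_actions = Vec::new();
    let mut handler_futures = Vec::new();
    for (event, action_opt) in pending_events {
        handler_futures.push(handle_event(event));
        if let Some(action) = action_opt {
            post_event_actions.push(action);
        }
    }
    // Await all handlers at once; this resolves only after every event
    // has been handled.
    join_all(handler_futures).await;
    post_event_actions
}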

As a temporary workaround, would it be okay to spawn off all event processing into tasks (using the blocking EventHandler interface), and just prevent the channel manager from being repersisted until all processing is complete?

@tnull
Contributor

tnull commented Nov 7, 2024

What would it take to do this? I'm looking at the process_events_body! macro in channelmanager.rs:
...
Looks like we'd need to change the macro so it somehow works for both async / blocking, and slap a futures::future::join_all (or equivalent) in there?

Note we already started doing this for OnionMessenger events in #3060. We might be able to follow a similar pattern in general/for other event types, but doing so in a safe way would likely take more steps than just firing-and-forgetting tasks.

As a temporary workaround, would it be okay to spawn off all event processing into tasks (using the blocking EventHandler interface), and just prevent the channel manager from being repersisted until all processing is complete?

Since we added fallible event handling in #2995, we expect the event handler to actually return a Result. So even if you got all the persistence-related issues right (which isn't trivial at all), you can't just spawn tasks and return their results in arbitrary order without, for instance, having the wrong events replayed.
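Roughly, the queue semantics look like this (a simplified sketch, not the actual implementation):

use std::collections::VecDeque;

// Events are drained strictly in order; a handler failure leaves the
// failed event and everything after it queued, to be replayed on the
// next processing pass.
fn drain_events<E>(
    queue: &mut VecDeque<E>,
    mut handler: impl FnMut(&E) -> Result<(), ()>,
) {
    while let Some(event) = queue.front() {
        if handler(event).is_err() {
            // Stop here without dropping anything; a retry will replay
            // this event first.
            return;
        }
        queue.pop_front();
    }
}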

Remind me, what type of events / what in particular makes this a bottleneck for you?

@TheBlueMatt
Collaborator Author

just prevent the channel manager from being repersisted until all processing is complete?

Sadly, no. Right now PaymentSent events block monitor persistence to ensure that you've had a chance to see+persist the payment preimage (proof of payment) before it is irrevocably deleted in the monitor.
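Conceptually, the handler has to do something like the following before returning (hypothetical trait and names, purely to illustrate the ordering):

// Stand-in for whatever durable storage the application uses.
trait PaymentStore {
    fn persist_preimage(&self, preimage: &[u8; 32]) -> Result<(), ()>;
}

fn handle_payment_sent(preimage: [u8; 32], store: &impl PaymentStore) -> Result<(), ()> {
    // Durably record the proof of payment first; only once the handler
    // returns may monitor persistence proceed and the preimage be
    // irrevocably deleted from the monitor.
    store.persist_preimage(&preimage)
}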

@MaxFangX
Contributor

MaxFangX commented Nov 9, 2024

Heya! Just saw this.

Remind me, what type of events / what in particular makes this a bottleneck for you?

Event handling in general is pretty hot for our LSP, which connects to possibly thousands of user nodes. HTLCIntercepted in particular is a big bottleneck, since if an HTLC comes in and the user isn't online, we have to start the user's node, wait for the user node BDK + LDK sync and reconnection to the LSP, and only then can we proceed to forward the HTLC (or possibly open a JIT channel, which takes even longer). This entire process takes about 15 seconds, which is 15 seconds that the LSP can't be processing payments for any other user (or any other events, for that matter). @tnull

Since we added fallible event handling in #2995 ...

I took a look at the updated code! Very helpful improvement. I think the replay behavior should be independent between events. For example, if our LSP gets events for users A, B, C in the queue, failure to handle the event for user A shouldn't block the handling of the event for users B and C. FWIW, earlier discussions about whether events needed to be processed in order didn't turn up any examples.

Full concurrency

In the long term we'd love some interface where LDK feeds us events from the queue even before the channel manager persists them, with the contract being "do your best to durably persist, then handle these events". That would allow for concurrency between batches of events in addition to concurrency within batches. Or some other construction which allows for full concurrency between batches. But even if we only get within-batch concurrency in the near term, that's a major improvement over what exists today.
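As a very rough sketch of the shape we have in mind (hypothetical names, not an existing or proposed LDK API):

struct Event; // placeholder for lightning::events::Event

trait EagerEventHandler {
    // Called as soon as a batch is generated, before the ChannelManager is
    // repersisted. The application does its best to durably persist the
    // events, then handles them, and must tolerate replays after a crash.
    fn handle_unpersisted_batch(&self, events: Vec<Event>) -> Result<(), ()>;
}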

@tnull
Contributor

tnull commented Nov 11, 2024

Event handling in general is pretty hot for our LSP, which connects to possibly thousands of user nodes. HTLCIntercepted in particular is a big bottleneck, since if an HTLC comes in and the user isn't online, we have to start the user's node, wait for the user node BDK + LDK sync and reconnection to the LSP, and only then can we proceed to forward the HTLC (or possibly open a JIT channel, which takes even longer). This entire process takes about 15 seconds, which is 15 seconds that the LSP can't be processing payments for any other user (or any other events, for that matter). @tnull

Mhh, seems to me that would be better/easier solved by introducing a 'proper' HTLC interception interface (cf. #2855), rather than by parallelizing event handling?

I took a look at the updated code! Very helpful improvement. I think the replay behavior should be independent between events. For example, if our LSP gets events for users A, B, C in the queue, failure to handle the event for user A shouldn't block the handling of the event for users B and C.

The main expected reason for event handling to fail is a persistence failure, in particular with remote persistence. It therefore makes sense to assume that if one event fails, we can't make progress on any of them, and instead need to exit and eventually (currently right away) retry without losing any events.

@phlip9

phlip9 commented Nov 13, 2024

Remind me, what type of events / what in particular makes this a bottleneck for you?

The HTLC intercept is for sure the biggest issue, so we're moving that into a separate persisted queue for later handling. But several other events (PaymentSent, PaymentFailed, PaymentClaimable, PaymentClaimed, OpenChannelRequest, SpendableOutputs) also cause at least one persist (~5 ms each) and/or an external API call (~1 ms to ~100 ms). They have to update internal payment accounting, broadcast txs, do some internal DB lookups, etc.

After removing our previously crash-unsafe event handling (spawning a task to handle each event and immediately returning), we're back to handling each event serially, one after the other. Since we're persisting events serially, we'll take ~5 ms * num_pending_events per process_pending_events_async call. Realistically, for the short/medium term this is probably OK.

Maybe a short-term improvement would be passing the whole Vec<Event> batch to the handler, so we can persist the necessary events in a single batch and handle the other events inline. Then our handler takes ~5ms.
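For example (all names hypothetical, just to sketch the shape):

struct Event; // placeholder for lightning::events::Event

trait EventStore {
    // Persist a whole batch in a single write, so the ~5ms persistence
    // cost is paid once rather than once per event.
    fn persist_batch(&self, events: &[Event]) -> Result<(), ()>;
}

fn handle_event_batch(events: Vec<Event>, store: &impl EventStore) -> Result<(), ()> {
    // One durable write covering the whole batch...
    store.persist_batch(&events)?;
    // ...then handle each event inline with the slow I/O already done.
    for _event in &events {
        // application-specific handling: payment accounting, broadcasts, ...
    }
    Ok(())
}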

Though it might be a little tricky to integrate with the fallible handling :D
