Run async event processing in parallel #2491

Open
TheBlueMatt opened this issue Aug 11, 2023 · 6 comments

Comments

@TheBlueMatt
Collaborator

I think we can Just Do This (tm) with some additional documentation on events and the handlers. CC https://github.com/orgs/lightningdevkit/discussions/2381

@MaxFangX
Contributor

MaxFangX commented Nov 7, 2024

What would it take to do this? I'm looking at the process_events_body! macro in channelmanager.rs:

for (event, action_opt) in pending_events {
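    // Note: each event is handled to completion (serially) before the
    // next loop iteration begins.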
    $event_to_handle = event;
    $handle_event;
    if let Some(action) = action_opt {
        post_event_actions.push(action);
    }
}

Looks like we'd need to change the macro so it somehow works for both async / blocking, and slap a futures::future::join_all (or equivalent) in there?
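Something like this, maybe? A rough free-standing sketch of that loop with hypothetical signatures (the real macro would still need to support the blocking variant):

use futures::future::join_all;

// Hypothetical free-function version of the loop above: run all handler
// futures concurrently instead of awaiting each one in turn, while still
// collecting the post-event actions.
async fn process_events_in_parallel<E, A, F, Fut>(
    pending_events: Vec<(E, Option<A>)>,
    handle_event: F,
) -> Vec<A>
where
    F: Fn(E) -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    let mut post_event_actions = Vec::new();
    let mut handler_futures = Vec::new();
    for (event, action_opt) in pending_events {
        handler_futures.push(handle_event(event));
        if let Some(action) = action_opt {
            post_event_actions.push(action);
        }
    }
    // Await all handlers at once; this resolves only after every event
    // has been handled.
    join_all(handler_futures).await;
    post_event_actions
}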

As a temporary workaround, would it be okay to spawn off all event processing into tasks (using the blocking EventHandler interface), and just prevent the channel manager from being repersisted until all processing is complete?

@tnull
Contributor

tnull commented Nov 7, 2024

What would it take to do this? I'm looking at the process_events_body! macro in channelmanager.rs:
...
Looks like we'd need to change the macro so it somehow works for both async / blocking, and slap a futures::future::join_all (or equivalent) in there?

Note we already started doing this for OnionMessenger events in #3060. We might be able to follow a similar pattern in general/for other event types, but doing so in a safe way would likely take more steps than just firing-and-forgetting tasks.

As a temporary workaround, would it be okay to spawn off all event processing into tasks (using the blocking EventHandler interface), and just prevent the channel manager from being repersisted until all processing is complete?

Since we added fallible event handling in #2995, we expect the event handler to actually return a Result. So even if you got all the persistence-related issues right (which isn't trivial at all), you can't just spawn tasks and return their results in arbitrary order without, for instance, having the wrong events replayed.
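Roughly, the queue semantics look like this (a simplified sketch, not the actual implementation):

use std::collections::VecDeque;

// Events are drained strictly in order; a handler failure leaves the
// failed event and everything after it queued, to be replayed on the
// next processing pass.
fn drain_events<E>(
    queue: &mut VecDeque<E>,
    mut handler: impl FnMut(&E) -> Result<(), ()>,
) {
    while let Some(event) = queue.front() {
        if handler(event).is_err() {
            // Stop here without dropping anything; a retry will replay
            // this event first.
            return;
        }
        queue.pop_front();
    }
}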

Remind me, what type of events / what in particular makes this a bottleneck for you?

@TheBlueMatt
Collaborator Author

just prevent the channel manager from being repersisted until all processing is complete?

Sadly, no. Right now PaymentSent events block monitor persistence to ensure that you've had a chance to see+persist the payment preimage (proof of payment) before it is irrevocably deleted in the monitor.
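Conceptually, the handler has to do something like the following before returning (hypothetical trait and names, purely to illustrate the ordering):

// Stand-in for whatever durable storage the application uses.
trait PaymentStore {
    fn persist_preimage(&self, preimage: &[u8; 32]) -> Result<(), ()>;
}

fn handle_payment_sent(preimage: [u8; 32], store: &impl PaymentStore) -> Result<(), ()> {
    // Durably record the proof of payment first; only once the handler
    // returns may monitor persistence proceed and the preimage be
    // irrevocably deleted from the monitor.
    store.persist_preimage(&preimage)
}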

@MaxFangX
Contributor

MaxFangX commented Nov 9, 2024

Heya! Just saw this.

Remind me, what type of events / what in particular makes this a bottleneck for you?

Event handling in general is pretty hot for our LSP, which connects to possibly thousands of user nodes. HTLCIntercepted in particular is a big bottleneck, since if an HTLC comes in and the user isn't online, we have to start the user's node, wait for the user node BDK + LDK sync and reconnection to the LSP, and only then can we proceed to forward the HTLC (or possibly open a JIT channel, which takes even longer). This entire process takes about 15 seconds, which is 15 seconds that the LSP can't be processing payments for any other user (or any other events, for that matter). @tnull

Since we added fallible event handling in #2995 ...

I took a look at the updated code! Very helpful improvement. I think the replay behavior should be independent between events. For example, if our LSP gets events for users A, B, C in the queue, failure to handle the event for user A shouldn't block the handling of the event for users B and C. FWIW, earlier discussions about whether events needed to be processed in order didn't turn up any examples.

Full concurrency

In the long term we'd love some interface where LDK feeds us events from the queue even before the channel manager persists them, with the contract being "do your best to durably persist, then handle these events". That would allow for concurrency between batches of events in addition to concurrency within batches. Or some other construction which allows for full concurrency between batches. But even if we only get within-batch concurrency in the near term, that's a major improvement over what exists today.
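As a very rough sketch of the shape we have in mind (hypothetical names, not an existing or proposed LDK API):

struct Event; // placeholder for lightning::events::Event

trait EagerEventHandler {
    // Called as soon as a batch is generated, before the ChannelManager is
    // repersisted. The application does its best to durably persist the
    // events, then handles them, and must tolerate replays after a crash.
    fn handle_unpersisted_batch(&self, events: Vec<Event>) -> Result<(), ()>;
}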

@tnull
Contributor

tnull commented Nov 11, 2024

Event handling in general is pretty hot for our LSP, which connects to possibly thousands of user nodes. HTLCIntercepted in particular is a big bottleneck, since if an HTLC comes in and the user isn't online, we have to start the user's node, wait for the user node BDK + LDK sync and reconnection to the LSP, and only then can we proceed to forward the HTLC (or possibly open a JIT channel, which takes even longer). This entire process takes about 15 seconds, which is 15 seconds that the LSP can't be processing payments for any other user (or any other events, for that matter). @tnull

Mhh, seems to me that would be better/easier solved by introducing a 'proper' HTLC interception interface (cf. #2855), rather than by parallelizing event handling?

I took a look at the updated code! Very helpful improvement. I think the replay behavior should be independent between events. For example, if our LSP gets events for users A, B, C in the queue, failure to handle the event for user A shouldn't block the handling of the event for users B and C.

The main expected reason for event handling to fail is a persistence failure, in particular with remote persistence. It therefore makes sense to assume that if one event fails, we can't make progress on any of them, and instead need to exit and eventually (currently right away) retry without losing any events.

@phlip9

phlip9 commented Nov 13, 2024

Remind me, what type of events / what in particular makes this a bottleneck for you?

The HTLC intercept is for sure the biggest issue, so we're moving that into a separate persisted queue for later handling. But several other events (PaymentSent, PaymentFailed, PaymentClaimable, PaymentClaimed, OpenChannelRequest, SpendableOutputs) also cause at least one persist (~5 ms each) and/or an external API call (~1 ms to ~100 ms). They have to update internal payment accounting, broadcast txs, do some internal DB lookups, etc.

After removing our previously crash-unsafe event handling (spawning a task to handle each event and immediately returning), we're back to handling each event serially, one after the other. Since we're persisting events serially, we'll take ~5 ms * num_pending_events per process_pending_events_async call. Realistically, for the short/medium term this is probably OK.

Maybe a short-term improvement would be passing the whole Vec<Event> batch to the handler, so we can persist the necessary events in a single batch and handle the other events inline. Then our handler takes ~5ms.
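For example (all names hypothetical, just to sketch the shape):

struct Event; // placeholder for lightning::events::Event

trait EventStore {
    // Persist a whole batch in a single write, so the ~5ms persistence
    // cost is paid once rather than once per event.
    fn persist_batch(&self, events: &[Event]) -> Result<(), ()>;
}

fn handle_event_batch(events: Vec<Event>, store: &impl EventStore) -> Result<(), ()> {
    // One durable write covering the whole batch...
    store.persist_batch(&events)?;
    // ...then handle each event inline with the slow I/O already done.
    for _event in &events {
        // application-specific handling: payment accounting, broadcasts, ...
    }
    Ok(())
}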

Though it might be a little tricky to integrate with the fallible handling :D
