Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make event handling fallible #2995

Merged

Conversation

tnull
Copy link
Contributor

@tnull tnull commented Apr 15, 2024

Closes #2490,
Closes #2097.

Previously, we would require our users to handle all events successfully inline or panic will trying to do so. If they would exit the EventHandler any other way we'd forget about the event and wouldn't replay them after restart.

Here, we implement fallible event handling, allowing the user to return Err(()) which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).

TODO:

  • Add test coverage for replay behavior on Err(()).

@tnull tnull marked this pull request as draft April 15, 2024 09:12
@TheBlueMatt
Copy link
Collaborator

Previously, we would require our users to handle all events successfully inline or panic will trying to do so

I believe our recommendation was always to simply loop trying to handle the event until they succeed, which is basically what we're doing here for them :)

As to the code here, I think we should make more clear in the interface the event will be replayed, eg by making the error variant a unit struct called ReplayEvent or so. Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping. Otherwise concept lgtm.

@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 5a45ebe to 34d7a9b Compare May 27, 2024 10:03
@codecov-commenter
Copy link

codecov-commenter commented May 27, 2024

Codecov Report

Attention: Patch coverage is 63.91753% with 70 lines in your changes missing coverage. Please review.

Project coverage is 89.75%. Comparing base (0cfe55c) to head (e617a39).

Files Patch % Lines
lightning/src/onion_message/messenger.rs 58.06% 21 Missing and 5 partials ⚠️
lightning/src/util/async_poll.rs 34.61% 14 Missing and 3 partials ⚠️
lightning-background-processor/src/lib.rs 80.48% 5 Missing and 11 partials ⚠️
lightning/src/chain/chainmonitor.rs 46.15% 6 Missing and 1 partial ⚠️
lightning/src/chain/channelmonitor.rs 40.00% 3 Missing ⚠️
lightning/src/events/mod.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2995      +/-   ##
==========================================
- Coverage   89.79%   89.75%   -0.05%     
==========================================
  Files         121      121              
  Lines      100826   100916      +90     
  Branches   100826   100916      +90     
==========================================
+ Hits        90537    90576      +39     
- Misses       7614     7658      +44     
- Partials     2675     2682       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See-also my above comments.

lightning-background-processor/src/lib.rs Outdated Show resolved Hide resolved
lightning-background-processor/src/lib.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 34d7a9b to 772a851 Compare June 3, 2024 12:43
@tnull
Copy link
Contributor Author

tnull commented Jun 3, 2024

Rebased and included a fixup to introduce ReplayEvent unit struct error variant. But still ironing out some details regarding the approach.

Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping.

Could you clarify which flag you are referring to exactly? Do you mean just continueing the loop in case any of the event processors fails? If we do that, we should probably make sure it only happens once (and then maaaybe with an exponential back-off?) as otherwise the entire run loop might busy-wait if event handling keeps failing, e.g., due to a persistence failure?

@TheBlueMatt
Copy link
Collaborator

Could you clarify which flag you are referring to exactly?

The event_persist_notifier flag.

as otherwise the entire run loop might busy-wait

I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)

@tnull
Copy link
Contributor Author

tnull commented Jun 3, 2024

The event_persist_notifier flag.

I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?

I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)

Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?

@TheBlueMatt
Copy link
Collaborator

I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?

Sure, they should all do a similar thing.

Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?

Hmm, as long as the user has an async...anything handling the event that's failing we should yield at least once during the BP loop. We could add an explicit yield (in the form of a ~1ms sleep), I guess?

@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 772a851 to 4debd21 Compare June 5, 2024 09:09
@tnull
Copy link
Contributor Author

tnull commented Jun 5, 2024

Rebased to resolve conflicts after #3060 landed, have yet to adjust MultiFuturePoller though.

@G8XSU G8XSU self-requested a review June 6, 2024 17:40
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from fd89e2a to 43a4a04 Compare July 2, 2024 13:11
@tnull tnull marked this pull request as ready for review July 2, 2024 13:22
@tnull
Copy link
Contributor Author

tnull commented Jul 2, 2024

Now added logic to retain failed events in the OnionMessenger event queues to have them replayed upon next invocation. To do this I adjusted MultiFuturePoller to collect and return the actual event handling results.

Two observations:

  1. It's a bit unfortunate that this requires us to clone the queues as we can only remove the events from the queues after they have been successfully processed (but we have the same issue in CM/CM).
  2. While we replay events, we of course still won't persist the OnionMessenger events in any way. While I understand this is a deliberate design choice for DoS protection, it seems also like an API contract violation as events might get lost if the node restarts/crashes between event generation and the user handling it.

The event_persist_notifier flag.

Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?

Generally this is still missing some test coverage, but should be ready for another round of review apart from that.

@TheBlueMatt
Copy link
Collaborator

Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?

Ah, indeed, you're right. We should, however, add something similar in chainmonitor.

lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
/// An error type that may be returned to LDK in order to safely abort event handling if it can't
/// currently succeed (e.g., due to a persistence failure).
///
/// LDK will ensure the event is persisted and will eventually be replayed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be so cut-and-dry here cause its not really true - we may well (persist and) replay the event, but depending on the event type we may not. I'm not really sure how to accurately communicate this to users, though, at least short of documenting it for each individual event type (#2097).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah, this comment predates the discussion we had offline. I think I'll see to pick up #2097 as part of this PR also.

Copy link
Contributor Author

@tnull tnull Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now took a stab at addressing #2097. Let me know if you think we should add more fine-grained context to some variants.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks great, though we also really need to add some additional context around events that are "robust" vs not - eg you could open a channel, have it closed and restart without ever persisting, implying a "lost" DiscardFunding. Doesn't have to happen in this PR, though is an implied part of #2097.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by the DiscardFunding example, as it seems to me it would indeed be persisted across restart if it was ever generated? At least all codepaths I see that lead to finish_close_channel seem to set DoPersist or notify_on_drop?

Could it be that you're referring to the issue described in #2508, which however means DiscardFunding wouldn't get generated in the first place?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to what happens if the ChannelManager (or any other thing) is not persisted. #2508 does come up here, but this generally applies to all events.

Copy link
Contributor Author

@tnull tnull Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, may be worth opening another issue to that end?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I don't think it needs to be addressed here, we should just not close #2097 until we address it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now opened #3186 to track this specifically as #2097 didn't mention this super clearly, IMO. Will let #2097 close with the changes in this PR.

@tnull tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from 7dc55e5 to 188edca Compare July 3, 2024 08:48
@tnull
Copy link
Contributor Author

tnull commented Jul 3, 2024

Ah, indeed, you're right. We should, however, add something similar in chainmonitor.

Agreed. Now added a fixup that has ChainMonitor trigger its event_notifier when any of the ChannelMonitor::process_pending_events calls fails.

@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 188edca to 985056c Compare July 8, 2024 13:51
@tnull
Copy link
Contributor Author

tnull commented Jul 8, 2024

Added a simple BackgroundProcessor test to check events are being replayed and also added a commit documenting the failure-to-handle/persistence behavior of all event variants (i.e., addressed #2097). I now also dropped the serialization logic for OnionMessageIntercepted and OnionMessagePeerConnected, as we don't ever write these events.

@tnull tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from d20af7c to 6e83263 Compare July 8, 2024 14:01
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Just a few comments.

lightning/src/util/async_poll.rs Outdated Show resolved Hide resolved
lightning/src/onion_message/messenger.rs Show resolved Hide resolved
lightning/src/onion_message/messenger.rs Show resolved Hide resolved
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from 291f42e to 456bcf5 Compare July 17, 2024 08:21
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Feel free to squash, IMO.

lightning/src/onion_message/messenger.rs Show resolved Hide resolved
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 456bcf5 to 1c2478d Compare July 17, 2024 17:20
@tnull
Copy link
Contributor Author

tnull commented Jul 17, 2024

LGTM. Feel free to squash, IMO.

Squashed without further changes.

jkczyz
jkczyz previously approved these changes Jul 17, 2024
@TheBlueMatt
Copy link
Collaborator

The Make event handling fallible commit itself doesn't build, so check_commits is failing.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits and one real comment, but only worth fixing cause you have to rewrite some commits anyway.

lightning/src/util/async_poll.rs Outdated Show resolved Hide resolved
lightning/src/util/async_poll.rs Outdated Show resolved Hide resolved
lightning/src/events/mod.rs Show resolved Hide resolved
Copy link
Contributor Author

@tnull tnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Make event handling fallible commit itself doesn't build, so check_commits is failing.

Squashed the MultiResultFuturePoller commit and added further fixups addressing the nits.

lightning/src/events/mod.rs Show resolved Hide resolved
lightning/src/util/async_poll.rs Outdated Show resolved Hide resolved
lightning/src/util/async_poll.rs Outdated Show resolved Hide resolved
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 1c2478d to cbcd88f Compare July 18, 2024 07:05
This is a minor refactor that will allow us to access the individual
event queue Mutexes separately, allowing us to drop the locks earlier
when processing them individually.
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from cbcd88f to 258853a Compare July 18, 2024 07:06
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

tnull added 4 commits July 18, 2024 15:54
Previously, we would require our users to handle all events
successfully inline or panic will trying to do so. If they would exit
the `EventHandler` any other way we'd forget about the event and
wouldn't replay them after restart.

Here, we implement fallible event handling, allowing the user to return
`Err(())` which signals to our event providers they should abort event
processing and replay any unhandled events later (i.e., in the next
invocation).
Previously, we would just fire-and-forget in `OnionMessenger`'s event
handling. Since we now introduced the possibility of event handling
failures, we here adapt the event handling logic to retain any
events which we failed to handle to have them replayed upon the next
invocation of `process_pending_events`/`process_pending_events_async`.
@tnull tnull force-pushed the 2024-04-fallible-event-handler branch from 258853a to e617a39 Compare July 18, 2024 13:54
@tnull
Copy link
Contributor Author

tnull commented Jul 18, 2024

Squashed fixups without further changes:

> git diff-tree -U2  258853aed e617a394e
>

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to land this, but probably needs some small followups.

lightning/src/onion_message/messenger.rs Show resolved Hide resolved
lightning-background-processor/src/lib.rs Show resolved Hide resolved
lightning/src/events/mod.rs Show resolved Hide resolved
@TheBlueMatt TheBlueMatt merged commit 2bfddea into lightningdevkit:main Jul 18, 2024
17 of 19 checks passed
@TheBlueMatt TheBlueMatt mentioned this pull request Jul 18, 2024
@tnull tnull mentioned this pull request Jul 19, 2024
tnull added a commit that referenced this pull request Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fallible event handling Document regenerated-on-restart events
4 participants