-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support intercepting onion messages for offline peers #2973
Support intercepting onion messages for offline peers #2973
Conversation
@@ -921,6 +921,17 @@ where | |||
} | |||
msgs | |||
} | |||
|
|||
fn enqueue_event(&self, event: Event) { | |||
const MAX_EVENTS_BUFFER_SIZE: usize = (1 << 10) * 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to get some thoughts on this event buffer size (currently 0.34MB). I can also add logging if an event is dropped, though I believe that should be rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should define it as N / ::core::mem::size_of::<Event>()
, but 1MiB seems fine? I don't think we're worried about a machine running with this that is super duper memory constrained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that exclude the majority of the bytes in the onion packet, which are on the heap? Are we trying to limit the number of events or the total size of those events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valentinewallace Not sure if this was missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, I thought we were trying to limit the total size of the events. I'm not sure I'm understanding @TheBlueMatt's suggestion :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, no, @jkczyz is right, I was too used to serialization buffers.
@@ -983,6 +1065,13 @@ where | |||
e.get_mut().enqueue_message(onion_message); | |||
log_trace!(logger, "Forwarding an onion message to peer {}", next_node_id); | |||
}, | |||
_ if self.intercept_oms_for_offline_peers => { | |||
self.enqueue_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: we could avoid introducing a lock order dependency between message_recipients
and pending_events
by pushing the pending events at the end of the method. I didn't bother but lmk if that seems worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know nobody has responded to your comment, but I think it should be fine as is. Any thoughts, @TheBlueMatt, @jkczyz?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, should be fine.
Basically LGTM, but we should figure out how to pipe the events through for default users. Generally our users don't expect to actually hook up to events providers manually, they expect to do it via the BP, so maybe we need to take an Optional OnionMessenger to the BP? |
I believe that was already done when we started generating the |
Looks good to me, too! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks great! 🚀
Just a few points I would love to discuss!
match message_recipients.entry(*peer_node_id) { | ||
hash_map::Entry::Occupied(mut e) if e.get().is_connected() => { | ||
e.get_mut().enqueue_message(message); | ||
Ok(()) | ||
}, | ||
_ => Err(SendError::InvalidFirstHop(*peer_node_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was giving a thought on how we are handling this situation. It might be possible that the peer is known, but just not connected at the instant. Sending an InvalidFirstHop
error might confuse users since it reads “The first hop is not a peer and …”. Maybe we could introduce a new Error type for cases like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why is that confusing? At this stage we don't have any way of knowing whether the peer is known or not. I think "not a peer" captures that pretty well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not sure if the peer's known or not, InvalidFirstHop
could work. But in this code snippet, we're checking if the peer's connected with if e.get().is_connected()
, not if it's known. So, something like a PeerNotConnected
error would make more sense for the error message. Still, I'm cool with whatever you think works best!
/// LDK will not rate limit how many [`Event::OnionMessageForOfflinePeer`]s | ||
/// are generated, so it is the caller's responsibility to limit how many |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, that's correct, but we do have a hard cap on the total number of events we can store (enforced by enqueue_events
). Do you think it would be a good idea to update the docs to clarify that users can't set an infinitely high cap on the number of events they can receive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The events queue should never get close to filling up, if it is we have bigger problems and the user's offer implementation is probably broken :) Not sure what you mean about users setting a cap on the number of events they can receive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that makes sense! The upper cap by enqueue_events
is more like a fail-safe, than a hard cap, so it makes sense to not mention it here!
Not sure what you mean about users setting a cap on the number of events they can receive.
So I was trying to point to this line
"it is the caller's responsibility to limit how many..."
But considering the context, I don't think we need to make any changes here.
Thanks for clarifying it! ✌️
6a3e37c
to
343ca62
Compare
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #2973 +/- ##
==========================================
+ Coverage 89.35% 91.37% +2.02%
==========================================
Files 117 118 +1
Lines 96335 109415 +13080
Branches 96335 109415 +13080
==========================================
+ Hits 86080 99980 +13900
+ Misses 8033 6972 -1061
- Partials 2222 2463 +241 ☔ View full report in Codecov by Sentry. |
343ca62
to
313b89d
Compare
} | ||
|
||
match message_recipients.entry(*peer_node_id) { | ||
hash_map::Entry::Occupied(mut e) if e.get().is_connected() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to hit the ConnectionNeeded
path, too? Say a peer comes back online, the user gets the OnionMessagePeerConnected
event, but the peer goes offline before forward_onion_message
is called. When the user calls forward_onion_message
, it errors and consumes the OnionMessage
, which is now lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it should be rare that a peer is disconnected by the time we go to forward, but it is currently possible. As mentioned in the PR description, planning to address this in follow-up per the race condition TODO here: #2950
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for pointing that out. How would the ping
/pong
solution prevent storing data for a non-existent peer? (e.g., when a malicious actor asks us to forward a message to a made up node id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand your question, but it's up to the LDK user's event handling code to decide which peers it should store OM forwards for and which should be discarded. If the user decides an OM is worth storing and later calls forward_onion_message
when the outbound peer comes back online, then we would go through the ping/pong flow to let them know when the OM is officially safe to delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think I misunderstood the proposed solution. I thought LDK would manage whether to forget the message (in OnionMessenger
) based on whether the ping/pong completed. But, IIUC, the code in this PR would remain as is and LDK would instead surface some event when a peer comes online and has completed a ping/pong. Is that correct?
So, in the scenario I gave, the user would simply not store the message in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code in this PR would remain as is and LDK would instead surface some event when a peer comes online and has completed a ping/pong. Is that correct?
So, in the scenario I gave, the user would simply not store the message in the first place?
Yes to both!
@@ -921,6 +921,17 @@ where | |||
} | |||
msgs | |||
} | |||
|
|||
fn enqueue_event(&self, event: Event) { | |||
const MAX_EVENTS_BUFFER_SIZE: usize = (1 << 10) * 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that exclude the majority of the bytes in the onion packet, which are on the heap? Are we trying to limit the number of events or the total size of those events?
@@ -175,6 +175,8 @@ where | |||
message_router: MR, | |||
offers_handler: OMH, | |||
custom_handler: CMH, | |||
intercept_oms_for_offline_peers: bool, | |||
pending_events: Mutex<Vec<Event>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a separate queue if these aren't persisted? If we instead use message_recipients
for buffering -- with a new variant -- then we can re-use outbound_buffer_full
and generate events in the same manner as ConnectionNeeded
. Otherwise, we are left with two different systems for buffering messages and generating events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I believe with this approach an attacker may be able to fragment our heap by continually sending bogus forwards to a new fake peer each time (where each fake forward results in a VecDeque
allocation). I think my other reasoning is that because the events generated may all be bogus, they shouldn't be allowed to count towards outbound_buffer_full
, since currently all of that traffic is known to be more legitimate (either an outbound OM initiated by the LDK user or a forward to a peer that is connected).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I believe with this approach an attacker may be able to fragment our heap by continually sending bogus forwards to a new fake peer each time (where each fake forward results in a
VecDeque
allocation).
Yeah, though this seems to be an issue already with onion messages, FWIW. Processing an OnionMessage
involves allocating a Vec<u8>
for Packet::hop_data
.
I think my other reasoning is that because the events generated may all be bogus, they shouldn't be allowed to count towards
outbound_buffer_full
, since currently all of that traffic is known to be more legitimate (either an outbound OM initiated by the LDK user or a forward to a peer that is connected).
Right, though this makes me wonder if the user should specify which peers it supports forwarding to. See other comment about the race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Yeah, I suppose specifying which peers support forwarding adds a bit of user to the burden. They are already doing this when processing events, so no sense forcing them to duplicate it here an risk inconsistencies.
313b89d
to
925b529
Compare
Addressed feedback. |
@@ -921,6 +921,17 @@ where | |||
} | |||
msgs | |||
} | |||
|
|||
fn enqueue_event(&self, event: Event) { | |||
const MAX_EVENTS_BUFFER_SIZE: usize = (1 << 10) * 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valentinewallace Not sure if this was missed.
925b529
to
1b40d56
Compare
Thanks @jkczyz, addressed feedback. |
@@ -983,6 +1065,13 @@ where | |||
e.get_mut().enqueue_message(onion_message); | |||
log_trace!(logger, "Forwarding an onion message to peer {}", next_node_id); | |||
}, | |||
_ if self.intercept_oms_for_offline_peers => { | |||
self.enqueue_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, should be fine.
self.enqueue_event( | ||
Event::OnionMessagePeerConnected { peer_node_id: *their_node_id } | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it could be a problem if we enqueue a bunch of Event::OnionMessageForOfflinePeer
but end up dropping an Event::OnionMessagePeerConnected
for the corresponding peer because of a full buffer. Maybe events for a different peer cause the buffer to fill up, for instance. Should we unconditionally enqueue the latter type of events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned #2973 (comment), the events queue should never get close to filling up. So I'm not sure it's worth adding special handling for event dropping cases?
1b40d56
to
bcb55fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall! I was about to ask a question about peers disconnecting prior to the message getting forwarded, but realized you already addressed it: #2973 (comment)
Should be good to squash imo.
Feel free to squash |
Docs will be added in upcoming commits.
Useful if we are in the mode of interception OMs for offline peers, so users know when to re-inject intercepted OMs.
Will be used in the next commit when another config parameter is added.
Previously we derived Debug, but that caused a lot of unreadable encrypted bytes to be printed.
bcb55fe
to
a5ada64
Compare
Squashed. |
_ if self.intercept_messages_for_offline_peers => { | ||
self.pending_events.lock().unwrap().push( | ||
Event::OnionMessageIntercepted { | ||
peer_node_id: next_node_id, message: onion_message | ||
} | ||
); | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we log_trace
here like the other arms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I omitted it because of our past discussions of avoiding adding logs that would allow someone to adversarially blow up our logs on disk (there are some pre-existing ones in this method, ofc). So I'm not sure it's a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I thought that was only applicable for higher-level logging like log_error
. Will let @TheBlueMatt provide guidance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, happy to add it in follow-up too, no reason to hold the PR up.
&Event::OnionMessageIntercepted { ref peer_node_id, ref message } => { | ||
37u8.write(writer)?; | ||
write_tlv_fields!(writer, { | ||
(0, peer_node_id, required), | ||
(2, message, required), | ||
}); | ||
}, | ||
&Event::OnionMessagePeerConnected { ref peer_node_id } => { | ||
39u8.write(writer)?; | ||
write_tlv_fields!(writer, { | ||
(0, peer_node_id, required), | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, these are never persisted by LDK, but this lets us check the buffer size. Clients could persist them, I suppose, but would probably want to persists the parts rather than the enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
As part of implementing async payments, we need a way to store onion message forwards on behalf of often-offline next-hop peers. This allows payers to send a
held_htlc_available
onion message to the mobile recipient, where the recipient's counterparty holds onto this onion message until the recipient comes back online.Rather than storing onion messages internally in LDK, we generate events on OM interception and later signal to the user to re-inject these onion messages for forwarding.
This implements basic offline peer OM interception support. As a follow-up, we should support indicating to users when an intercepted OM has successfully been forwarded and is safe to delete, see #2950.
Partially addresses #2950, partially addresses #2298.