Do not unnecessarily retransmit `commitment_signed` in dual funding #1214

TheBlueMatt · 2024-12-09T19:47:07Z

On reconnection in the middle of the dual-funding flow, if both nodes have exchanged the initial commitment_signed and node A had sent its (initial) tx_signatures but node B never received them, both nodes should send a channel_reestablish with next_funding_txid set and a next_commitment_number of 1 (as they've already received the commitment transaction for commitment number 0).

The spec indicates in this case that both nodes should retransmit their commitment_signed, however, as this is only gated on next_funding_txid and not the next_commitment_number field. This may cause implementations which assume that each new commitment_signed is for a new state to fail and potentially fail the channel.

Instead, we should rely both the presence of next_funding_txid and next_commitment_number being zero to decide if we need to resend our commitment_signed. Sadly, we cannot rely on just next_commitment_number as that is used to request a force-closure in a non-standard way to work around implementations not honoring the error message.

On reconnection in the middle of the dual-funding flow, if both nodes have exchanged the initial `commitment_signed` and node A had sent its (initial) `tx_signatures` but node B never received them, both nodes should send a `channel_reestablish` with `next_funding_txid` set and a `next_commitment_number` of 1 (as they've already received the commitment transaction for commitment number 0). The spec indicates in this case that both nodes should retransmit their `commitment_signed`, however, as this is only gated on `next_funding_txid` and not the `next_commitment_number` field. This may cause implementations which assume that each new `commitment_signed` is for a new state to fail and potentially fail the channel. Instead, we should rely both the presence of `next_funding_txid` *and* `next_commitment_number` being zero to decide if we need to resend our `commitment_signed`. Sadly, we cannot rely on just `next_commitment_number` as that is used to request a force-closure in a non-standard way to work around implementations not honoring the `error` message.

Sending a `channel_reestablish` with `next_commitment_number` of zero is used in practice to request a peer fail a channel and broadcast the latest state (for implementations which continue to ignore the `error` message). Because its used in practice we should document it to avoid accidentally writing up incompatible things in the future.

t-bast · 2024-12-11T16:41:44Z

02-peer-protocol.md

+        - if `next_commitment_number` is zero:
+          - MUST retransmit its `commitment_signed` for that funding transaction.


I notice that you didn't add a requirement that we MUST NOT retransmit commitment_signed if next_commitment_number = 1. That's a good thing, because this wouldn't be backwards-compatible, since eclair and cln currently always send next_commitment_number = 1 and will always retransmit commit_sig if next_funding_txid is included.

We previously discussed whether it was worth avoiding unnecessarily retransmitting commit_sig in that case. We decided that since it was simple to ignore a spurious retransmission, it wasn't worth the extra effort. But if you feel that it's important to be cleaner here and avoid that retransmission, we can definitely add it, WDYT @ddustin @niftynei?

In order to introduce this without breaking too many nodes, I think we should deploy that change using the following steps:

as the receiver of channel_reestablish, accept next_commitment_number = 0 if next_funding_txid is set

(once enough nodes have upgraded with the behavior above) as the sender of channel_reestablish, set next_commitment_number = 0 when asking to retransmit commit_sig

(once enough nodes have upgraded with the behavior above) as the receiver of channel_reestablish, don't retransmit commit_sig if next_commitment_number = 1

If we decide to go down that road, we must add the same mechanism for splicing: retransmitting commit_sig when next_funding_txid is set should be gated on next_commitment_number being equal to the current commitment_number. I'm implementing this in eclair to verify that this doesn't have any unintended side effects.

Actually, that was a bit of an oversight. I don't really see why you'd want to spuriously retransmit when its so trivial to avoid. I'm not sure about the current dual-funding implementation but in LDK more generally if you retransmit a commitment_signed post-open we'll definitely think its invalid and FC the channel.

In terms of upgrade, I don't see why we wouldn't just stop sending the spurious retransmit. Most nodes that support dual-funding will upgrade quickly but generally nodes should in practice handle the spurious retransmit for a bit, but I don't really think it needs to be around that long :)

We need to handle backwards-compatibility, because there are two implementations that already shipped with the current retransmit behavior and it's actively used on mainnet. It doesn't have to take a year, but we must coordinate releases with cln if we make this change and allow a grace period where we retransmit.

I don't really see why you'd want to spuriously retransmit when its so trivial to avoid.

It's even more trivial to ignore the spurious retransmit and not have to deal with thinking about whether to retransmit or not (and always send commit_sig before tx_signatures), which is what we chose to currently implement...to be honest it's unclear to me why it's a real problem for you (apart from "we can avoid it")?

Right, okay, so then we're on the same page. One remaining question - what does CLN/eclair do today if they get a next commit number of 0? Can we skip the first step of your upgrade process above?

It's better than I thought for the dual-funding case: eclair will accept next_commitment_number = 0, so we can indeed skip one step of the upgrade.

It's for splicing that this is an issue, because that next_commitment_number will currently be interpreted as our peer being late, but that's only a problem for our own backwards-compat between eclair and phoenix since splicing isn't officially supported yet, so we'll bite the bullet and deal with it.

Presumably you're not splicing with a next commit number of 0, though, that would imply the first version never even got signed :)

What I meant is that for splicing this is slightly more complex:

our channel is at local_commitment_number = N

if we disconnect, when reconnecting, we would set next_commitment_number = N+1

we start splicing and disconnect after sending our commit_sig but before receiving the remote commit_sig

on reconnection, if we want to avoid spurious retransmits, we need to set next_funding_txid and next_commitment_number = N, right? Because we're missing our peer's commit_sig for commitment N that spends the new splice transaction

with what we currently implemented, we didn't make any change to the channel_reestablish logic and would be sending next_commitment_number = N + 1, and if we sent next_commitment_number = N our peer would currently think we're late (because we're not looking at whether next_funding_txid is set in that case, which would remove the confusion)

So we need to more carefully manage our deployment to ensure we're not force-closing channels in the process 😅

As was discussed during this week's spec meeting, since this only matters if peers disconnect in the middle of the signing flow, which should be extremely rare for server nodes, we probably don't need to care that much about phasing deployments between eclair and cln, and can directly implement this PR. We need a sign-off from cln though, @ddustin @niftynei does that sound good to you?

t-bast · 2024-12-11T16:42:52Z

02-peer-protocol.md

+  - if `next_funding_txid` is not set, and `next_commitment_number` is zero:
+    - MUST immediately fail the channel and broadcast any relevant latest commitment
+      transaction.


Is that still a thing? Is lnd still not sending an error when they want their peer to force-close? @Roasbeef what's the status on that, I thought we already settled that a long time ago when discussing #934?

They were definitely still doing it post-934, we started doing it (in addition to an error) to get LND to FC when we ask maybe a year ago and they were still doing it then AFAIU.

@ziggie1984 this is the discussion we had about lnd's behavior: when restoring from an SCB, it looks like lnd is sending a channel_reestablish with next_commitment_number = 0. When that happens, does the sender need its peer to immediately force-close, or will lnd send an error when receiving the remote channel_reestablish to request a force-close? The latter would be best, as it is spec compliant and wouldn't require adding this new requirement.

LND will respond to a normal establishment msg with an error so there is no need for the other implementation to immediately force close when receiving the establishment-msg with a next_commitment_number == 0 , but I think having this code part in place does also not violate the spec am I correct ? Meaning that we can take it out of the Spec but if LND still wants to force-close if the next_commitment_number is not equal to the local height we have for this channel then I think that's ok as well.

That wasn't quite the question, though - will lnd force-close a channel when it receives an error message from its peer (as the spec mandates)? Otherwise, AFAIU, other implementations have to implement a hack to send a reestablish with next_commitment_number = 0 specifically to request the peer force-close the channel.

Triggering a force-close with a channel-reestablishment (setting next_commitment_number = 0) msg seems more robust rather then always force closing when an error is received wdyt ?

Can somebody form CLN look whether CLN still sends an error if there are gossip sync issue ?

I don't think this is the case, as we haven't seen such error from our node with cln peers.

yes

Good, thanks for clarifying that. Since the SCB-restoring node sends error, we don't need to add this requirement @TheBlueMatt: eclair, cln and ldk will force-close when receiving this error, which is spec-compliant. If lnd wants to immediately force-close when receiving channel_reestablish with next_commitment_number = 0, they may do so but that doesn't need to become standard.

Just one last question: the error sent by lnd isn't "internal error", is it? This is the only error on which we don't force-close, because it was a transient one on previous lnd versions...

Triggering a force-close with a channel-reestablishment (setting next_commitment_number = 0) msg seems more robust rather then always force closing when an error is received wdyt ?

I disagree, and a good counter-example to that is that dual-funding has valid reasons to set next_commitment_number = 0 that does not request a force-close.

It's a much better spec to force-close on errors. It's the responsibility of implementation to not send error when force-closing isn't required: I don't see why that cannot be correctly implemented? Especially since we now have the warning message for soft errors.

Good point, haven't looked into dual-funding, that then needs to change on our side and we need to make it more strict.

Just one last question: the error sent by lnd isn't "internal error", is it? This is the only error on which we don't force-close, because it was a transient one on previous lnd versions...

We sent the following error string:

unable to resume channel, recovery required

There seems to be no error codes for these error msgs:
https://github.com/lightning/bolts/blob/c41536829c1461fe5109183ce8a7f88f5c81348b/01-messaging.md#the-error-and-warning-messages

We sent the following error string:

Perfect, we'll definitely force-close on that one!

There seems to be no error codes for these error msgs:

Because there shouldn't need to be any error code, since the behavior should be to:

only send error when you want your peer to force-close

force-close when receiving error, whatever it contains

But implementations have had to deviate from the spec because lnd (and based on what you're saying, cln as well) was sending errors for non-fatal issues that actually didn't require a force-close, and it made people lose many sats in on-chain fees...

…umber` As pointed out in lightning/bolts#1214, when reconnecting a partially signed `interactive-tx` session, we should set `next_commitment_number` to the current commitment number if we haven't received our peer's `commit_sig`, which tells them they need to retransmit it. That's not what we're currently doing: we're currently setting this value to the next commitment number, regardless of whether or not we have received our peer's `commit_sig`. And we always retransmit our `commit_sig` if our peer is setting `next_funding_txid`, even if they have already received it. More importantly, if our peer behaves correctly and sends us the current commitment number, we will think that they're late and will halt, waiting for them to send `error`. This commit fixes that by allowing our peers to use the current commitment number when they set `next_funding_txid`. Note that this doesn't yet make us spec-compliant, but in order to guarantee backwards-compatibility, we must first deploy that change before we can start removing spurious `commit_sig` retransmissions.

We fully implement lightning/bolts#1214 to stop retransmitting `commit_sig` when our peer has already received it. We also correctly set `next_commitment_number` to let our peer know whether we have received their `commit_sig` or not.

As pointed out in lightning/bolts#1214, when reconnecting a partially signed `interactive-tx` session, we should set `next_commitment_number` to the current commitment number if we haven't received our peer's `commit_sig`, which tells them they need to retransmit it. More importantly, if our peer behaves correctly and sends us the current commitment number, we must not think that they're late and halt, waiting for them to send `error`. This commit fixes that by allowing our peers to use the current commitment number when they set `next_funding_txid`. Note that we keep retransmitting our `commit_sig` regardless of the value our peer set in `next_commitment_number`, because we need to wait for them to have an opportunity to upgrade. In a future commit we will stop sending spurious `commit_sig`.

We fully implement lightning/bolts#1214 to stop retransmitting `commit_sig` when our peer has already received it. We also correctly set `next_commitment_number` to let our peer know whether we have received their `commit_sig` or not.

TheBlueMatt added 2 commits December 9, 2024 19:46

TheBlueMatt mentioned this pull request Dec 9, 2024

Follow-ups to PR 3137 lightningdevkit/rust-lightning#3423

Open

t-bast mentioned this pull request Dec 11, 2024

Lightning Specification Meeting 2024/12/16 #1213

Open

19 tasks

t-bast reviewed Dec 11, 2024

View reviewed changes

t-bast mentioned this pull request Dec 12, 2024

Add more splice RBF reconnection tests ACINQ/eclair#2964

Merged

t-bast mentioned this pull request Dec 12, 2024

Reestablish partially signed splice with current remote_commitment_number ACINQ/eclair#2965

Open

t-bast mentioned this pull request Dec 13, 2024

Remove spurious interactive-tx commit_sig retransmission ACINQ/eclair#2966

Draft

t-bast mentioned this pull request Dec 13, 2024

Correctly set next_commitment_number during splice reconnect ACINQ/lightning-kmp#736

Draft

t-bast mentioned this pull request Dec 13, 2024

Remove spurious interactive-tx commit_sig retransmission ACINQ/lightning-kmp#737

Draft

ziggie1984 mentioned this pull request Dec 18, 2024

[bug]: Remove Soft-Error Handling during ChannelLink reestablishment. lightningnetwork/lnd#9370

Open

t-bast mentioned this pull request Jan 2, 2025

Channel Splicing (feature 62/63) #1160

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not unnecessarily retransmit `commitment_signed` in dual funding #1214

Do not unnecessarily retransmit `commitment_signed` in dual funding #1214

TheBlueMatt commented Dec 9, 2024

t-bast Dec 11, 2024

TheBlueMatt Dec 11, 2024

TheBlueMatt Dec 11, 2024

t-bast Dec 12, 2024

t-bast Dec 12, 2024

TheBlueMatt Dec 12, 2024

t-bast Dec 12, 2024 •

edited

Loading

TheBlueMatt Dec 12, 2024

t-bast Dec 12, 2024

t-bast Dec 18, 2024

t-bast Dec 11, 2024

TheBlueMatt Dec 11, 2024

t-bast Dec 16, 2024

ziggie1984 Dec 17, 2024 •

edited

Loading

TheBlueMatt Dec 17, 2024

ziggie1984 Dec 18, 2024

t-bast Dec 18, 2024

t-bast Dec 18, 2024

ziggie1984 Dec 18, 2024

t-bast Dec 18, 2024

		- if `next_commitment_number` is zero:
		- MUST retransmit its `commitment_signed` for that funding transaction.

Do not unnecessarily retransmit commitment_signed in dual funding #1214

Are you sure you want to change the base?

Do not unnecessarily retransmit commitment_signed in dual funding #1214

Conversation

TheBlueMatt commented Dec 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

t-bast Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ziggie1984 Dec 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Do not unnecessarily retransmit `commitment_signed` in dual funding #1214

Do not unnecessarily retransmit `commitment_signed` in dual funding #1214

t-bast Dec 12, 2024 •

edited

Loading

ziggie1984 Dec 17, 2024 •

edited

Loading