[bug]: data loss and penalty txs after doing reboot -f while lnd had been stopped 30 minutes earlier
xmrk-btc opened this issue on Mar 30, 2024 · 2 comments
Labels: P2 (should be fixed if one has time), safety (general label for issues/PRs related to the safety of using the software), security (general label for issues/PRs related to the security of the software)
Not sure whether this is a bug in lnd, ZFS, or some other software or hardware (disks...). Reporting it because it was mentioned in #8166. Note that I already wrote about this on Slack and elsewhere, but there are some updates here, most notably tests with diskchecker.pl.
How it occurred
A few days before, I had run lncli --deletepayments. I had also started running Liquid a week before the data loss, storing its blockchain on sda, pruned to 30 GB.
lncli stop
installed updates; this took 30-60 minutes
sync
reboot (did not work; systemd became unresponsive)
reboot -f
lnd started; it took much longer than usual, the logs showed rescans of the last 1000 blocks when starting each channel manager, the machine with bitcoind kept reading from disk, and channels stayed inactive or were being closed.
stopped lnd after 15 minutes
saw a penalty transaction, stopped lnd, ran zpool scrub, got no errors.
started lnd again, ran it for 20 minutes, stopped it again.
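A minimal sketch of that command sequence in shell form, for reference; the exact apt invocation is an assumption, only "installing updates" is stated above:
# lncli stop                   # stop lnd cleanly before maintenance
# apt update && apt upgrade    # assumed update method; took 30-60 minutes
# sync                         # flush dirty pages to disk
# reboot                       # hung, systemd unresponsive
# reboot -f                    # forced reboot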
Environment
lnd 0.17.3, 64-bit, running on a Ryzen laptop with 32 GB RAM - I will refer to it as lnd-laptop.
channel.db was around 20 GB, so most of it should have been cached
boltdb
three disks:
sda - internal SSD. Some sectors are unreadable. Stores / and /home using btrfs. Both were mounted without the discard option - this plus Liquid probably killed the disk. The apt update mentioned above was on sda.
sdb, sdc - external USB-connected SSDs; they store lnd_data as a ZFS mirror (RAID1), using the ZFS option sync=standard (never changed this); see the verification sketch after this list.
Debian 12
ZFS kernel module from bookworm-backports. Version zfs-2.1.12-0-g86783d7d9-dist was running before the restart; it was upgraded to zfs-2.2.2-0-g494aaaed8-dist just before the fatal reboot.
lnd data encrypted by native ZFS encryption
bitcoind runs on another machine (call it bitcoin-laptop), an 8 GB RAM laptop with 2 HDDs (this machine was not restarted). Not pruned, with txindex. bitcoind's chainstate directory is on ZFS, using 4 GB of lnd-laptop's memory as L2ARC: lnd-laptop exposes a 4 GB ramdisk via iSCSI, and bitcoin-laptop connects and uses that iSCSI device as L2ARC.
lnd-laptop has a damaged internal SSD (sda); this probably caused systemd to stop responding. I also got strange errors (SIGSEGV) from lnd when running it later for SCB recovery.
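For completeness, a minimal sketch of how the ZFS settings above and the iSCSI-backed L2ARC could be inspected; the pool/dataset names (tank, tank/lnd_data) and the target address are placeholders, not the real names from this setup. On lnd-laptop:
# zfs get sync,encryption,compression tank/lnd_data   # expect sync=standard and encryption enabled
# zpool status tank                                   # should show sdb/sdc as a mirror plus the last scrub result
And on bitcoin-laptop, attaching the exported ramdisk as L2ARC:
# iscsiadm -m discovery -t sendtargets -p lnd-laptop
# iscsiadm -m node -p lnd-laptop --login
# zpool add tank cache /dev/disk/by-path/<iscsi-device>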
Misc
compacting channel.db with chantools - no error (see the chantools sketch after this list)
channel.backup was up to date - it contained all open channels, even those opened less than 1000 blocks ago, and it lives on the same ZFS filesystem. So this was not a filesystem-wide rollback.
I had a similar problem in September 2023 - the main symptom was that lnd did not start, but I also suffered a smaller data loss. I was using the same 2 USB disks as today, with a different computer (a Raspberry Pi 4 then).
my channel with Blockstream Store probably did not suffer data loss - my LocalHtlcIndex (as seen via chantools dumpchannels) was the same as what the peer reported when my node connected during recovery. I assume their node is always online, so it is strange that there would have been no update for a week.
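A minimal sketch of the chantools invocations referenced above (compaction and channel dump), run against a copy of channel.db while lnd is stopped; the flag names and paths are from memory and may differ between chantools versions:
# chantools compactdb --sourcedb ~/.lnd/data/graph/mainnet/channel.db --destdb ./channel-compacted.db
# chantools dumpchannels --channeldb ./channel-compacted.db   # shows per-channel state incl. LocalHtlcIndex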
Tests
on the same ZFS pool that suffered the data loss, just with compression turned off, using a 500 MB test file
diskchecker.pl on lnd-laptop with ZFS kernel module ver. 2.2.2; did sync; reboot -f and the verify was OK (the test procedure is sketched after this section).
diskchecker.pl on the old RPi 4 with 4 GB RAM, ZFS v2.1.5-1ubuntu6~22.04.2. Tried checking the write cache while the test was running (sdparm --get=WCE /dev/sd?); this caused some trouble because sdparm froze and writing stopped. Did reboot -f (without even doing sync) shortly after, and the verify was OK.
repeated the same test on the RPi but without sdparm; also OK
sdparm returns (on the RPi, so the disks are renamed):
# sdparm --get=WCE /dev/sda
/dev/sda: Samsung SSD 870 EVO 500G 0
WCE not found in Caching (SBC) mode page
# sdparm --get=WCE /dev/sdb
/dev/sdb: 6iY �b(HJ�C��O%� 0959
mode sense (10): transport: Host_status=0x03 [DID_TIME_OUT]
Driver_status=0x00 [DRIVER_OK]
WCE 1 [cha: y, def: 1]
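For context, a minimal sketch of the diskchecker.pl procedure used in these tests, assuming the script's standard usage; host names and paths are placeholders. On a second machine that is not power-cycled, start the listener:
# diskchecker.pl -l
On the machine under test, start writing the test file, force the reboot from another terminal while the create is still running, then verify after the machine comes back up:
# diskchecker.pl -s helper-host create /lnd_data/testfile 500
# reboot -f
# diskchecker.pl -s helper-host verify /lnd_data/testfile
The per-disk write cache can be checked with sdparm --get=WCE /dev/sdX (as above) and, if desired, disabled with sdparm --clear=WCE --save /dev/sdX.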
EDIT (changed referenced issue, you are right @Roasbeef)
Thanks for reporting this issue; it ties strongly to #3287. With a safety mode in place we could avoid situations like this in the future when our peer is honest and reports the true state of the channel to us, limiting the damage mostly to a data-loss case where all the channels need to be force-closed, but at least we avoid lnd broadcasting an old state.
@ziggie1984 how would that issue (no connections until start-up) address this? IIUC the latest proposal there, we'd still broadcast everything before connecting to peers. If we broadcast an old commitment to try to claim a timed-out HTLC, then the same event would occur.
Instead I think the safe-mode (#3287) feature makes more sense here, as that would disable any/all transaction broadcasts or go-to-chain decisions until a user go-ahead is acknowledged.