[bug]: data loss and penalty txs after doing reboot -f while lnd had been stopped 30 minutes earlier #8607

Open
xmrk-btc opened this issue Mar 30, 2024 · 2 comments
Labels: P2 (should be fixed if one has time), safety (General label for issues/PRs related to the safety of using the software), security (General label for issues/PRs related to the security of the software)

Comments

@xmrk-btc

Not sure whether this is a bug in lnd, zfs, or some other software or hardware (disks...). Reporting because it was mentioned in #8166. Note that I already wrote about this on Slack and elsewhere, but there are some updates here, most notably tests with diskchecker.pl.

How it occurred

  1. A few days before, I ran lncli --deletepayments. I had also started running liquid a week before the data loss, storing its blockchain on sda, pruned to 30GB.
  2. lncli stop
  3. installing updates, this took 30-60 minutes
  4. sync
  5. reboot (does not work, systemd got unresponsive)
  6. reboot -f
  7. lnd starts, but startup takes much longer than usual. The lnd logs show rescans of the last 1000 blocks whenever a channel manager starts, the machine running bitcoind keeps reading from disk, and channels stay inactive or are being closed.
  8. stopped lnd after 15 minutes
  9. saw penalty transaction, stopped lnd, did zpool scrub, got no error.
  10. started lnd again, running for 20 minutes, stopped again

Environment

  • lnd 0.17.3, 64-bit, running on a Ryzen laptop with 32 GB RAM - will refer to it as lnd-laptop.
  • channel.db was around 20GB, so most of it should be cached
  • boltdb
  • three disks:
    • sda - internal SSD. Some sectors are unreadable. Stores / and /home using btrfs. Both were mounted without discard option - this plus liquid probably killed the disk. The mentioned apt update was on sda.
    • sdb, sdc - external USB connected SSD disks, store lnd_data as ZFS RAID1, using zfs option sync=standard (never changed this)
  • Debian 12
  • zfs kernel module from bookworm-backports. Version zfs-2.1.12-0-g86783d7d9-dist running before the restart, upgraded to zfs-2.2.2-0-g494aaaed8-dist just before the fatal reboot.
  • lnd data encrypted by native ZFS encryption
  • bitcoind on another machine (call it bitcoin-laptop), 8 GB RAM laptop with 2 HDDs. (This machine was not restarted). Not pruned, with txindex. Bitcoind's chainstate directory on ZFS, using 4GB of lnd-laptop's memory as l2arc: lnd-laptop exposes 4GB ramdisk via iSCSI, and bitcoin-laptop connects and uses that iSCSI device as l2arc.
  • lnd-laptop has a damaged internal SSD (sda), which probably caused systemd to stop responding. I also got strange errors (SIGSEGV) from lnd when running it later for SCB recovery.

Misc

  • compacting channel.db with chantools - no error
  • channel.backup was up to date - it contained all opened channels, even those opened less than 1000 blocks ago, and is located on the same ZFS filesystem. So this is not a filesystem-wide rollback.
  • had a similar problem in September 2023 - the main problem was that lnd did not start, but I also suffered a smaller data loss. I was using the same 2 USB disks as today, with a different computer (Raspberry Pi 4 then).
  • my channel with Blockstream Store probably did not suffer data loss - my LocalHtlcIndex (as seen by doing chantools dumpchannels) was the same as what peer reported when my node connected while doing recovery. I assume their node is always online, so it is strange there would be no update for a week.

Tests

  • on the same zfs pool that suffered data loss, just turned off compression, 500 MB testfile
  • tested using diskchecker.pl, see https://brad.livejournal.com/2116715.html
  • diskchecker.pl on lnd-laptop with zfs kernel module ver. 2.2.2, did sync; reboot -f and verify was ok.
  • diskchecker.pl on old RPi 4 with 4GB RAM, zfs v2.1.5-1ubuntu6~22.04.2. Tried checking write cache while test was running (sdparm --get=WCE /dev/sd?), this caused some problem because sdparm froze and writing stopped. Did reboot -f (without even doing sync) shortly after and verify was ok.
  • repeated the same test on RPi but without sdparm, also ok
  • sdparm returns (on rpi, so disks are renamed)
# sdparm --get=WCE /dev/sda
   /dev/sda: Samsung   SSD 870 EVO 500G  0
WCE not found in Caching (SBC) mode page
# sdparm --get=WCE /dev/sdb
   /dev/sdb: 6iY  �b(HJ�C��O%�  0959
mode sense (10): transport: Host_status=0x03 [DID_TIME_OUT]
Driver_status=0x00 [DRIVER_OK]

WCE           1  [cha: y, def:  1]
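
The write-then-verify idea behind these tests can be sketched roughly as follows. This is a simplified illustration, not diskchecker.pl itself: the real tool records acknowledgements on a second machine, whereas this sketch uses a local journal file, which is only meaningful if the journal lives on storage other than the disk under test. The block size and the function names (make_block, write_blocks, verify_blocks) are assumptions made for the example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # bytes per record; an arbitrary choice for this sketch


def make_block(index: int) -> bytes:
    """Build a deterministic block whose content depends only on its index."""
    seed = hashlib.sha256(str(index).encode()).digest()
    return (seed * (BLOCK_SIZE // len(seed) + 1))[:BLOCK_SIZE]


def write_blocks(data_path: str, journal_path: str, count: int) -> None:
    """Write `count` blocks, fsync each one, then journal its index as acknowledged."""
    with open(data_path, "wb") as data, open(journal_path, "w") as journal:
        for index in range(count):
            data.write(make_block(index))
            data.flush()
            os.fsync(data.fileno())  # ask the OS to push the block to stable storage
            # Only after fsync returns is the block recorded as acknowledged.
            journal.write(f"{index}\n")
            journal.flush()
            os.fsync(journal.fileno())


def verify_blocks(data_path: str, journal_path: str) -> list:
    """Return the indices that were acknowledged but are missing or corrupt on disk."""
    with open(journal_path) as journal:
        acked = [int(line) for line in journal if line.strip()]
    bad = []
    with open(data_path, "rb") as data:
        for index in acked:
            data.seek(index * BLOCK_SIZE)
            if data.read(BLOCK_SIZE) != make_block(index):
                bad.append(index)
    return bad
```

The intended use is to start write_blocks, force the reboot while it is running, and call verify_blocks afterwards. A non-empty result would mean a write was reported as durable but did not survive, which is the failure these tests were looking for.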

@xmrk-btc xmrk-btc added bug Unintended code behaviour needs triage labels Mar 30, 2024
@xmrk-btc xmrk-btc changed the title [bug]: data loss and penalty txs after doing reboot -f while lnd was stopped 30 minutes earlier [bug]: data loss and penalty txs after doing reboot -f while lnd had been stopped 30 minutes earlier Mar 30, 2024
@ziggie1984 ziggie1984 added safety General label for issues/PRs related to the safety of using the software security General label for issues/PRs related to the security of the software and removed bug Unintended code behaviour needs triage labels Mar 30, 2024
@ziggie1984
Collaborator

ziggie1984 commented Mar 30, 2024

EDIT (changed referenced issue, you are right @Roasbeef)

Thanks for reporting this issue. It ties strongly to #3287: with a safety mode in place, we can avoid situations like this in the future when our peer is honest and reports the true state of the channel to us. That would limit the damage mostly to a data-loss case where all the channels need to be force-closed, but at least we would avoid LND broadcasting an old state.

@Roasbeef
Member

Roasbeef commented Apr 1, 2024

@ziggie1984 how would that issue (no connections until start up) address this? IIUC the latest proposal there, we'd still broadcast everything before connecting to peers. If we broadcast an old commitment to try to claim a well timed-out HTLC, then the same event would occur.

Instead, I think the safe mode feature (#3287) makes more sense here, as it would disable all transaction broadcasts and go-to-chain decisions until the user gives an explicit go-ahead.
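
To make the idea concrete, here is a minimal sketch of such a gate. It is not lnd code and not the design proposed in #3287: the ChannelState and SafeMode names, the height fields, and the rule that combines the user go-ahead with the peer-reported state are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelState:
    """Hypothetical view of one channel at startup (not an lnd type)."""
    channel_id: str
    local_height: int  # commitment height found in the local database
    peer_reported_height: Optional[int] = None  # height the peer reports on reconnect, if known


class SafeMode:
    """Blocks go-to-chain actions until the operator acknowledges startup."""

    def __init__(self) -> None:
        self.acknowledged = False

    def acknowledge(self) -> None:
        """Record the operator's explicit go-ahead."""
        self.acknowledged = True

    def may_broadcast(self, chan: ChannelState) -> bool:
        # Nothing goes on chain before the operator's explicit go-ahead.
        if not self.acknowledged:
            return False
        # Assuming an honest peer, a higher peer-reported height means the
        # local state is stale, and publishing it would invite a penalty.
        if (chan.peer_reported_height is not None
                and chan.peer_reported_height > chan.local_height):
            return False
        return True
```

In this sketch a channel whose peer has not reported a height yet is still allowed to broadcast once the operator has acknowledged. Whether that case should stay blocked as well is a design decision the sketch leaves open.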

@saubyk saubyk added the P2 should be fixed if one has time label Apr 15, 2024