[bug]: data loss and penalty txs after doing reboot -f while lnd had been stopped 30 minutes earlier
xmrk-btc opened this issue on Mar 30, 2024 · 2 comments
Labels: P2 (should be fixed if one has time), safety (general label for issues/PRs related to the safety of using the software), security (general label for issues/PRs related to the security of the software)
Not sure whether this is a bug in lnd, ZFS, or some other software or hardware (disks...). Reporting it because it was mentioned in #8166. Note that I already wrote about this on Slack and elsewhere, but there are some updates here, most notably tests with diskchecker.pl.
How it occurred
A few days before, I had run lncli --deletepayments. I had also started running Liquid a week before the data loss, storing its blockchain on sda, pruned to 30 GB.
lncli stop
installed updates; this took 30-60 minutes
sync
reboot (did not work; systemd became unresponsive)
reboot -f
lnd started; it took much longer than usual, the logs showed rescans of the last 1000 blocks when starting each channel manager, the machine with bitcoind kept reading from disk, and channels stayed inactive or were being closed.
stopped lnd after 15 minutes
saw a penalty transaction, stopped lnd, ran zpool scrub, got no errors.
started lnd again, ran it for 20 minutes, stopped it again.
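A minimal sketch of that command sequence in shell form, for reference; the exact apt invocation is an assumption, only "installing updates" is stated above:
# lncli stop                   # stop lnd cleanly before maintenance
# apt update && apt upgrade    # assumed update method; took 30-60 minutes
# sync                         # flush dirty pages to disk
# reboot                       # hung, systemd unresponsive
# reboot -f                    # forced reboot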
Environment
lnd 0.17.3, 64-bit, running on a Ryzen laptop with 32 GB RAM - I will refer to it as lnd-laptop.
channel.db was around 20 GB, so most of it should have been cached
boltdb
three disks:
sda - internal SSD. Some sectors are unreadable. Stores / and /home using btrfs. Both were mounted without the discard option - this plus Liquid probably killed the disk. The apt update mentioned above was on sda.
sdb, sdc - external USB-connected SSDs; they store lnd_data as a ZFS mirror (RAID1), using the ZFS option sync=standard (never changed this); see the verification sketch after this list.
Debian 12
ZFS kernel module from bookworm-backports. Version zfs-2.1.12-0-g86783d7d9-dist was running before the restart; it was upgraded to zfs-2.2.2-0-g494aaaed8-dist just before the fatal reboot.
lnd data encrypted by native ZFS encryption
bitcoind runs on another machine (call it bitcoin-laptop), an 8 GB RAM laptop with 2 HDDs (this machine was not restarted). Not pruned, with txindex. bitcoind's chainstate directory is on ZFS, using 4 GB of lnd-laptop's memory as L2ARC: lnd-laptop exposes a 4 GB ramdisk via iSCSI, and bitcoin-laptop connects and uses that iSCSI device as L2ARC.
lnd-laptop has a damaged internal SSD (sda); this probably caused systemd to stop responding. I also got strange errors (SIGSEGV) from lnd when running it later for SCB recovery.
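For completeness, a minimal sketch of how the ZFS settings above and the iSCSI-backed L2ARC could be inspected; the pool/dataset names (tank, tank/lnd_data) and the target address are placeholders, not the real names from this setup. On lnd-laptop:
# zfs get sync,encryption,compression tank/lnd_data   # expect sync=standard and encryption enabled
# zpool status tank                                   # should show sdb/sdc as a mirror plus the last scrub result
And on bitcoin-laptop, attaching the exported ramdisk as L2ARC:
# iscsiadm -m discovery -t sendtargets -p lnd-laptop
# iscsiadm -m node -p lnd-laptop --login
# zpool add tank cache /dev/disk/by-path/<iscsi-device>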
Misc
compacting channel.db with chantools - no error (see the chantools sketch after this list)
channel.backup was up to date - it contained all open channels, even those opened less than 1000 blocks ago, and it lives on the same ZFS filesystem. So this was not a filesystem-wide rollback.
I had a similar problem in September 2023 - the main symptom was that lnd did not start, but I also suffered a smaller data loss. I was using the same 2 USB disks as today, with a different computer (a Raspberry Pi 4 then).
my channel with Blockstream Store probably did not suffer data loss - my LocalHtlcIndex (as seen via chantools dumpchannels) was the same as what the peer reported when my node connected during recovery. I assume their node is always online, so it is strange that there would have been no update for a week.
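A minimal sketch of the chantools invocations referenced above (compaction and channel dump), run against a copy of channel.db while lnd is stopped; the flag names and paths are from memory and may differ between chantools versions:
# chantools compactdb --sourcedb ~/.lnd/data/graph/mainnet/channel.db --destdb ./channel-compacted.db
# chantools dumpchannels --channeldb ./channel-compacted.db   # shows per-channel state incl. LocalHtlcIndex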
Tests
on the same ZFS pool that suffered the data loss, just with compression turned off, using a 500 MB test file
diskchecker.pl on lnd-laptop with ZFS kernel module ver. 2.2.2; did sync; reboot -f and the verify was OK (the test procedure is sketched after this section).
diskchecker.pl on the old RPi 4 with 4 GB RAM, ZFS v2.1.5-1ubuntu6~22.04.2. Tried checking the write cache while the test was running (sdparm --get=WCE /dev/sd?); this caused some trouble because sdparm froze and writing stopped. Did reboot -f (without even doing sync) shortly after, and the verify was OK.
repeated the same test on the RPi but without sdparm; also OK
sdparm returns (on the RPi, so the disks are renamed):
# sdparm --get=WCE /dev/sda
/dev/sda: Samsung SSD 870 EVO 500G 0
WCE not found in Caching (SBC) mode page
# sdparm --get=WCE /dev/sdb
/dev/sdb: 6iY �b(HJ�C��O%� 0959
mode sense (10): transport: Host_status=0x03 [DID_TIME_OUT]
Driver_status=0x00 [DRIVER_OK]
WCE 1 [cha: y, def: 1]
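For context, a minimal sketch of the diskchecker.pl procedure used in these tests, assuming the script's standard usage; host names and paths are placeholders. On a second machine that is not power-cycled, start the listener:
# diskchecker.pl -l
On the machine under test, start writing the test file, force the reboot from another terminal while the create is still running, then verify after the machine comes back up:
# diskchecker.pl -s helper-host create /lnd_data/testfile 500
# reboot -f
# diskchecker.pl -s helper-host verify /lnd_data/testfile
The per-disk write cache can be checked with sdparm --get=WCE /dev/sdX (as above) and, if desired, disabled with sdparm --clear=WCE --save /dev/sdX.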
EDIT (changed referenced issue, you are right @Roasbeef)
Thanks for reporting this issue; it ties strongly to #3287. With a safety mode in place we could avoid situations like this in the future when our peer is honest and reports the true state of the channel to us, limiting the damage mostly to a data-loss case where all the channels need to be force-closed, but at least we avoid lnd broadcasting an old state.
@ziggie1984 how would that issue (no connections until start-up) address this? IIUC the latest proposal there, we'd still broadcast everything before connecting to peers. If we broadcast an old commitment to try to claim a timed-out HTLC, then the same event would occur.
Instead I think the safe-mode (#3287) feature makes more sense here, as that would disable any/all transaction broadcasts or go-to-chain decisions until a user go-ahead is acknowledged.