Qubes Backup hangs if there is an I/O error #7567
Comments
Does backup work if all of the VMs you are trying to back up are stopped? That should not be necessary, but it might point at the cause of the problem. |
I'll try it |
Tried it. No change. Is there a log of what has been backed up? I tried doing a "verify" from the restore tool to see what got backed up and what did not, in order to get a hint at which VM was causing it, but that led to a new, larger concern: New Even Larger Concern: |
Raising priority as it causes data loss. |
My understanding is that the backup verification tool checks only for errors during the restore process; it does not check for missing data. @marmarek, is this accurate? |
The verification tool should check for truncated VM images at least. |
Did you recently significantly increase some VM's storage capacity? Backing up very sparse data (e.g. a VM with hundreds of gigabytes of free space) from LVM will cause the backup system to spend most of its time seemingly doing nothing: https://forum.qubes-os.org/t/solved-backup-qubes-tarwriter-extremly-slow/7456 |
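A quick way to see how sparse a given VM's volume actually is, assuming the default thin-LVM layout with a `qubes_dom0` volume group (the volume group name is an assumption here):

```
# Show nominal size vs. how much data is actually allocated in the thin pool;
# a large lv_size with a tiny data_percent means mostly sparse space.
sudo lvs --units g -o lv_name,lv_size,data_percent qubes_dom0
```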
That would explain it!!!! This is definitely a (UX, at least) bug. |
Yes I did. After your post, I removed every VM that I believe had a sparse filesystem, and it still only got to 15665920. Running top shows no resource consumption, and ps does not show any tarwriter in the process listing. Is there any way to figure out what VM it's hanging on? I'm really hoping for a log somewhere. |
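For what it's worth, one hedged place to look for backup-related messages in dom0 is the qubesd journal (whether the relevant messages end up there depends on how the backup was started):

```
# Search the current boot's qubesd log for backup-related lines (run in dom0).
sudo journalctl -b -u qubesd --no-pager | grep -i backup
```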
I found qvm-backup, so to try to debug it instead of using the GUI, I did:
Any ideas? Obviously this is not the cause of the problem, as it happens instantly whereas the problem occurs about 15 GB into the backup, but the backup preparation error is preventing me from figuring out what's going on.
|
The backup profile must be passed as the basename of a .conf file under
But
|
Results: Most backups seem to look like this:
However, the final backup, during which the hang happened, looked like this:
And then it just hangs there. To be fair, the VM it hangs on has 28G of incompressible data (and 7G of free space) on its private volume. Note that the "handle_streams returned" line showed up before the "Sending file" and "Removing file" lines for vm20/private.img, as opposed to vm19 and some others I checked. Also, I see that it split the files for some of the larger templates, so with 28G of data I would expect it to be trying to split this one too. Finally, I left the string "vm-sys-backup-private" in to draw attention to the fact that the backup system backed up the backup qube (the qube that the backup data passes through on its way to being saved) right before it started vm20, in case that is relevant. |
I removed that one VM from the list to back up and the backup worked. I can probably fix that one by creating a new appVM, qvm-copying the files over to it, and erasing the original appVM. Assuming that process works, is there any reason to keep debugging the original appVM to uncover what the actual problem was? Interestingly, that appVM is one I had backed up before, and to the best of my knowledge I haven't touched it since the last backup. |
Just to confirm, you already tried creating the backup without compression, right? (That's always the first thing I try when there's a problem backing up VMs containing large amounts of incompressible data.) |
Ah, here's the issue I was thinking of: |
If you have enough disk space you could keep that VM around, in case someone comes up with an idea how to find the bug. |
One possibility would be to attach a Python debugger to qubesd, but a simpler solution might be to just run |
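A hedged sketch of the debugger idea, assuming py-spy were installed in dom0 (it is not there by default, and this is not necessarily the command suggested above):

```
# Dump the Python stack traces of the running qubesd daemon to see where the
# backup code is blocked.
sudo py-spy dump --pid "$(pgrep -o -f qubesd)"
```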
I have now, and got the same result
There are not. There are no "gzip" or "tar" strings (other than strings containing "start" or "target") in the output of "ps aux" when it is locked up. Nothing related to the backup seems to be using the CPU when it's locked up. I'm guessing that the "Finished sending thread" log message means the tar/gzip process has ended? |
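For reference, matching exact process names avoids the "start"/"target" false hits that a plain grep over ps aux output produces:

```
# List any tar or gzip processes by exact name, with their full command lines.
pgrep -a -x tar
pgrep -a -x gzip
```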
Looks like I found the root cause! Trying to qvm-copy files over to the new VM generated the error: with dmesg in the VM: and dom0 dmesg: implying the NVMe is bad. So we should be able to reduce this to some UX issues like:
Actually, instead of just changing the documentation for qubes-verify to make it clear it doesn't actually verify it was written, could we make qubes-verify actually verify the hash of the backup? |
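A rough sketch of the general idea only, not of how qvm-backup actually implements verification (the file names are examples):

```
# Record a checksum when the backup is written, then re-check it against the
# bytes that actually landed on the target disk.
sha256sum /mnt/backup/qubes-backup.bin > /mnt/backup/qubes-backup.bin.sha256
sha256sum -c /mnt/backup/qubes-backup.bin.sha256
```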
@ddevz Thanks for tracking this down! An I/O error can cause all sorts of problems, but a hang should not be one of them unless something else goes wrong. This may well happen, as most software is not tested for its handling of I/O errors. Nevertheless, backup verification really should verify that the backup was actually written, as restoring a backup is time-consuming and an untested backup provides a false sense of security. |
Maybe the I/O error even causes an exception, but something in the backup system swallows it like in #7411 (comment):
|
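As a loose shell analogue of that failure mode (the real backup code is Python, and the device and file names here are made up): without pipefail, an I/O error in an early stage of a pipeline is silently lost and the run still looks successful.

```
# With pipefail set, a failure anywhere in the pipeline is reported instead of
# being masked by the exit status of the last stage.
set -o pipefail
sudo dd if=/dev/qubes_dom0/vm-example-private bs=4M status=none \
  | gzip > /mnt/backup/example.img.gz \
  || echo "backup pipeline failed with status $?"
```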
Sure thing. Another thought: does Qubes have a way to notify the user of S.M.A.R.T./disk errors? There is a package, "smart-notifier", that might be able to do it, but it says it is "only for gtk", and I believe Qubes uses Xfce. Of course, that would be adding yet another package to dom0 whose security you'd have to worry about. |
Just use |
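In the meantime, manual checks one could run from dom0, assuming smartmontools is installed there (the device name is an example, and this is not necessarily the tool suggested above):

```
# Overall SMART health of the NVMe drive, plus any storage errors in the
# kernel log.
sudo smartctl -H /dev/nvme0
sudo dmesg --level=err,crit | grep -iE 'nvme|i/o error'
```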
I'm not sure whether I suffered from this particular bug when upgrading from Qubes 4.0 to 4.1.1 (I already lost over 200 GiB of data, including over 2 years' worth of work on some projects; I found out too late, when the fresh install had already been done over the old one), but it's closely related:

I made a 900+ GiB backup onto a 3.5" SATA hard drive attached to a USB 3.0 adapter using the sys-usb VM and mounted on a disposable VM. I knew it would take several hours, so I unchecked the compress option to save CPU. When I checked the next day, the Qubes backup tool showed a success message, but I noticed the hard drive motor was stopped (that hard drive never stops its motor when it's connected to an internal SATA port). The backup file was slightly under 800 GiB, but at that moment it seemed reasonable because, despite the backup not being compressed, actual written sparse zeroes on some VMs could have been left out. I made the risky decision of not verifying the backup's integrity, as it would have required a similar number of hours, and I was overly confident in my computer forensic skills in the event of damaged sectors.

At the restore stage an I/O error popped up and half of my VMs showed 0 bytes of Disk Usage in the Qube Manager. When doing an emergency recovery I found out that all of those 0-byte VMs had an "Unexpected EOF" error in all of their chunks when decrypting them with scrypt. One VM's chunks were readable up to the 490th. My (unproven) theory is that at some point either the disposable VM the hard drive was mounted on or sys-usb sent it the message to sleep (thus stopping the motor), and that caused an I/O error when trying to write past that partially written VM's 490th chunk.

In order to prevent further data losses, I would suggest comparing the expected uncompressed size of the logical volumes to the actual written bytes before compressing their images by piping |
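A hedged sketch of just the byte-counting part of that suggestion (the volume path is an example, and this is not how the backup tool itself works):

```
# Compare a volume's nominal size to the number of bytes a full read of it
# actually returns; a shortfall would indicate a truncated or failed read.
lv=/dev/qubes_dom0/vm-work-private
expected=$(sudo blockdev --getsize64 "$lv")
actual=$(sudo dd if="$lv" bs=4M status=none | wc -c)
echo "expected=$expected actual=$actual bytes"
```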
@AlbertGoma please file a new issue for this. |
@DemiMarie Done: #7797 |
I can confirm this issue on Qubes R4.1 with a USB VM setup. When doing a backup, I had hardware issues: the hard disk became too hot. The VM having the external USB drive mounted (
Meanwhile Qubes dom0 Maybe the simplest way to reproduce this would be to run a backup and then unplug the external USB backup disk while the backup is running.
This issue is being closed because:
If anyone believes that this issue should be reopened, please leave a comment saying so. |
Qubes OS release
4.1
Brief summary
Qubes Backup now hangs after writing about 15 GB:
First time: 15570100 backed up (found with ls -s)
Second time: 15676416 backed up (found with ls -s)
Third time: 15676416 backed up (found with ls -s)
This used to work, producing backup files of 183G.
After it hangs, disk activity stops.
I'm guessing that it's hanging when it gets to a particular VM.
Is there a debug log for qubes-backup somewhere to look for errors?
Steps to reproduce
This happens reliably; it has happened on 3 out of 3 sequential backup attempts.
I see similar symptoms (but presumably from a different cause, since that reporter was not using LVM, whereas lvdisplay shows that I am) in the report at: #7411