Old notes 2
- cloud storage:
- reasonably cheap per byte (~$1/GB/year)
- unlimited in size (restricted only by your budget)
- unlikely to suffer data loss because it has internal redundancy
- restricted in bandwidth (maybe 100kB/s write, 1000kB/s read) and latency (~200ms?)
- private, but not strongly private - so encryption is a good idea
- no server-side smarts
- tends to want to write whole files atomically
consequences: we can write any amount of data, but we can't write it very fast, so we need to avoid reading back existing data, and we must be able to interrupt and resume a backup easily
Therefore a good backup program needs to:
- support both local near-line storage, and cloud storage to recover from local hardware failure
- write moderately large chunks, without choppy operations to read back previous state
- assume that any single run may never complete, and be able to pick up from an interruption and keep going
backup programs in general:
- simplicity is highly desirable: what’s not there can’t fail
- make it easy to verify that what’s stored there is self-consistent
- recover well from bugs
- need a format for escaping or encoding filenames within the index file (see the encoding sketch after this list)
- check handling of non-regular files (symlinks)
- check sorting of unicode names
- store names relative to the top of the backup tree
- cope with files changing while being backed up
- layers
- we have layers, and then blocks within each layer
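As a sketch of one possible index encoding (the format and helper names are illustrative assumptions, not a settled design): raw byte filenames could be percent-encoded into the text index, and sorted by raw bytes so the order doesn't depend on locale or Unicode normalization.

```python
# Sketch only: one possible way to encode arbitrary byte filenames in a
# text index, and to sort them stably. Names and format are illustrative.
import json
import urllib.parse


def encode_name(raw: bytes) -> str:
    """Percent-encode a raw filename so it survives a JSON/text index."""
    # Keep most printable bytes readable; escape the rest (and '%').
    return urllib.parse.quote_from_bytes(raw, safe="/._-")


def decode_name(encoded: str) -> bytes:
    return urllib.parse.unquote_to_bytes(encoded)


def sort_key(raw: bytes) -> bytes:
    # Sort by raw bytes, not by locale or normalized Unicode, so the
    # order is stable across platforms.
    return raw


if __name__ == "__main__":
    names = [b"b.txt", "caf\u00e9.txt".encode(), b"a\xff-not-utf8"]
    index = [encode_name(n) for n in sorted(names, key=sort_key)]
    print(json.dumps(index, indent=2))
    assert [decode_name(e) for e in index] == sorted(names, key=sort_key)
```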
All files actually stored are compressed and then gpg-encrypted.
There is layer start metadata: the UTC time the layer started being recorded, and the UTC start time of the previous backup layer, if this is an incremental backup. Only files modified on or after the time of the previous layer will be included.
Each block includes a tarball of actual files, and an index listing the files in that tarball with their mtime, ctime, and hash. Each real file is stored in a single block, so the tarballs can grow to at most twice the size of any actual file. (Perhaps, for easier atomic transfer, we should split large files, continuing them in a second tarball - that depends on whether we want to assume the storage allows resuming an interrupted transfer, but I think S3 does.) There is also an index of deletions, listing files present in the previous layer that are no longer present on disk.
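A minimal sketch of what a per-block index could look like under these assumptions (field names, JSON layout, and the choice of SHA-256 are all illustrative):

```python
# Sketch of a per-block index: the files in the block's tarball (with
# mtime, ctime, hash) plus a list of deletions. Field names illustrative.
import hashlib
import json
import os
from dataclasses import dataclass, asdict


@dataclass
class IndexEntry:
    path: str        # relative to the top of the backup tree
    mtime: float
    ctime: float
    sha256: str


def entry_for(top: str, rel_path: str) -> IndexEntry:
    # Regular files only; symlinks etc. would need their own handling.
    full = os.path.join(top, rel_path)
    st = os.lstat(full)
    h = hashlib.sha256()
    with open(full, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return IndexEntry(rel_path, st.st_mtime, st.st_ctime, h.hexdigest())


def write_block_index(path: str, entries: list, deletions: list) -> None:
    # The whole index is written in one shot, matching storage that
    # prefers whole-file atomic writes.
    doc = {"files": [asdict(e) for e in entries], "deleted": deletions}
    with open(path, "w") as f:
        json.dump(doc, f, indent=2, sort_keys=True)
```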
To start a new block within a layer: read the index of the previous block (if any), and get the last file stored within that block. Seek through the filesystem to the point lexically just after that file, and resume recording files (still subject to the layer's modification-time cutoff) from there.
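A sketch of how that resumption could work, assuming the JSON index layout above; the walk yields paths in lexical order and skips everything at or before the last recorded file:

```python
# Sketch: find where the previous block stopped and resume from the path
# lexically just after it. Assumes the index layout sketched above.
import json
import os


def last_path_in_index(index_path: str) -> str:
    with open(index_path) as f:
        files = json.load(f)["files"]
    return files[-1]["path"] if files else ""


def walk_in_order(top: str):
    """Yield relative file paths under `top` in lexical order.
    (A real implementation would stream this instead of sorting in memory.)"""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), top))
    yield from sorted(paths)


def resume_walk(top: str, previous_index=None):
    start_after = last_path_in_index(previous_index) if previous_index else ""
    for rel in walk_in_order(top):
        if rel > start_after:   # strictly after the last stored file
            yield rel
```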
To record deletions, we'll also need to read back the indexes of all underlying layers. (That suggests perhaps we want a stack, not a linear chain...)
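And a sketch of how the deletion index could be computed from the underlying layers' indexes (again assuming the illustrative index format above, with indexes passed in the order they were recorded so later deletions override earlier additions):

```python
# Sketch: compute deletions for a new layer as "recorded before, but no
# longer on disk". Assumes the index format sketched above.
import json
import os


def recorded_paths(index_paths):
    present = set()
    for index_path in index_paths:
        with open(index_path) as f:
            doc = json.load(f)
        present.update(entry["path"] for entry in doc["files"])
        present.difference_update(doc.get("deleted", []))
    return present


def deletions(top, underlying_indexes):
    previously_present = recorded_paths(underlying_indexes)
    return sorted(p for p in previously_present
                  if not os.path.lexists(os.path.join(top, p)))
```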
metadata: time layer recording started, time before which files will be excluded, path after which files are included
store files in order by path, modified after that time, whose path comes after the given point; once the layer gets too big, finish it and start a new one; also store an index of the files included in that layer
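Putting those notes together, a sketch of the writing loop (the size cap and the argument names are illustrative only):

```python
# Sketch of the writing loop: take files in path order, filter by mtime
# and by the resume point, and close the current block once it reaches a
# size cap.
import os
import tarfile

BLOCK_CAP_BYTES = 64 * 1024 * 1024   # illustrative cap


def write_blocks(top, paths_in_order, exclude_before, start_after, out_dir):
    block_num, block_bytes, tar = 0, 0, None
    for rel in paths_in_order:
        full = os.path.join(top, rel)
        st = os.lstat(full)
        if rel <= start_after or st.st_mtime < exclude_before:
            continue                       # outside this layer's window
        if tar is None or block_bytes >= BLOCK_CAP_BYTES:
            if tar:
                # finish the current block; its index would be written here too
                tar.close()
            block_num += 1
            block_bytes = 0
            tar = tarfile.open(
                os.path.join(out_dir, "block-%06d.tar.gz" % block_num), "w:gz")
        tar.add(full, arcname=rel)         # a whole file goes into one block
        block_bytes += st.st_size
    if tar:
        tar.close()
```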
mission: Real filesystems don't generally have self-consistent quiescent points. It doesn't make sense to try to restore a backup at a particular moment in time, because the filesystem probably never had exactly that state. Systems that think in terms of transactional snapshots tend to have trouble with needing to store a whole snapshot to accomplish anything. With large disks and intermittent network connections, it can be hard to ever finish a backup. If you've written 1GB of data, you ought to be able to restore most of that data, regardless of how much more remains to be written. So instead Blanket accumulates over time a set of files that partially cover the filesystem. Replaying all of these files in the order they were recorded lets you restore the filesystem up to that point. Each tarball covers a contiguous subsequence of the ordered list of files.
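A sketch of the restore side of that idea: replay the tarballs in the order they were recorded, applying each block's files and then its deletions (block and index naming are assumptions carried over from the sketches above):

```python
# Sketch: restore by replaying blocks in recording order. Later blocks
# overwrite earlier ones and deletion entries remove files, so the result
# is each file as it was most recently recorded.
import json
import os
import tarfile


def replay(block_tarballs_in_order, index_files_in_order, dest):
    for tarball, index_path in zip(block_tarballs_in_order,
                                   index_files_in_order):
        with tarfile.open(tarball, "r:*") as tar:
            # A real restore should sanitize member paths before extracting.
            tar.extractall(dest)
        with open(index_path) as f:
            for rel in json.load(f).get("deleted", []):
                target = os.path.join(dest, rel)
                if os.path.lexists(target):
                    os.remove(target)
```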
- copes well with interrupted or partial backups; as long as one whole file is transferred, it can be restored
- copes with a dumb server (including S3)
- can cap the amount of space used for backups, and keep as many previous increments as will fit within that cap
- simple backup format allowing manual recovery
- names: blanket? always? replicity?
## desires
- as much as possible, something that can just run from a cron job and never need maintenance
- cope with multiple interrupted short runs: get away from the idea of a single run that must complete before anything can be restored
- can interrupt or reboot client machine, reconnect to server
- ideally would not count on even uploading a single file in one transaction, but rather be able to resume uploading that single file; but this seems to put some constraints on the storage format and perhaps it's not really worthwhile
- quickly verify that what was uploaded is self-consistent and correct, ideally without downloading all of it - some contradiction there - might be able to ask S3 for the hash of the files? (see the sketch after this list)
- minimum assumptions about capabilities of the store: don't count on being able to represent all filenames or being able to store permissions
- restore some (multiple?) subset of files or directories without scanning through the whole archive
- use librsync to store deltas between file versions (later?)
- restore by meshing together multiple damaged or partial backups from different servers
- when restoring, if some files already have the right hash, don't bother reading them
- ui abstraction so it can get a gui later
- sign/encrypt data files, through gpg or something else
- relatively simple storage
- never require uploading the whole filesystem to make progress
- interrupted or in-progress backup mustn't prevent restore operations
- multiple increments would be nice, so that you can get back previous states
- a way to garbage-collect old unwanted increments, without rewriting the whole archive or making a new full backup
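On the verification point above: one hedged possibility is to record a hash for each uploaded object and compare it against what the store reports, without downloading the data. For S3 this could use the object's ETag via boto3, with the caveat that the ETag is a plain MD5 only for single-part uploads; this is a sketch under those assumptions, not a settled design.

```python
# Sketch only: verify uploaded blocks against hashes recorded locally,
# without downloading the data. Assumes boto3 and that objects were
# uploaded in a single part (multipart ETags are not plain MD5).
import boto3


def verify_block(bucket: str, key: str, expected_md5: str) -> bool:
    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    etag = head["ETag"].strip('"')
    return etag == expected_md5
```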
rsync: requires a smart counterparty; does a full filesystem scan on both machines at startup; stores files unpacked, so it can't encrypt and relies on the destination's filesystem capabilities
duplicity:
- for decent performance, it sometimes requires doing a full backup; but in fact you may often have some files that never change, and copying them all the way up again seems redundant
- until the first full backup completes, you can't restore anything?
- can resume backups but this seems to cause some glitches
file mtimes:
- if we could trust file mtimes, we could avoid a lot of trouble with reading indexes for the old backups, but they're probably not ultimately trustworthy
- also, files are probably fairly often touched but not changed, so if we rely on mtimes the lack of rsync-style delta compression will stick out more
- perhaps it's reasonable to think the clock does not skew by so much that one backup overlaps with another?
have "layers" of backups and do garbage collection based on that? so the daily backups would contain all files changed since the last weekly backup. exclude other backups from the same level from consideration when deciding whether a file needs to be backed up. but this doesn't seem to totally fit the rather freeform and emergent approach discussed in other places. do you have to tell it the level each time? have an explicit garbage-collection option to remove some or all layers, perhaps layers prior to a given date. do that by finding files that still exist and that are only referenced from those layers, and rewriting them into a smaller pack (see the sketch below). we could even do this locally by just looking at mtimes/ctimes, assuming we trust them, which is probably not quite safe enough.
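A rough sketch of that repacking step, reusing the illustrative index format from the earlier sketches (which layers count as "old" versus "kept" is decided elsewhere):

```python
# Sketch: garbage-collect old layers by finding files that still exist on
# disk and are referenced only by the layers being removed; those are the
# files that would be rewritten into a smaller replacement pack.
import json
import os


def files_only_in(old_indexes, kept_indexes):
    def paths(indexes):
        out = set()
        for index_path in indexes:
            with open(index_path) as f:
                out.update(e["path"] for e in json.load(f)["files"])
        return out
    return paths(old_indexes) - paths(kept_indexes)


def to_repack(top, old_indexes, kept_indexes):
    # Only files that still exist need to be carried forward; everything
    # else in the old layers can simply be dropped with them.
    return sorted(p for p in files_only_in(old_indexes, kept_indexes)
                  if os.path.lexists(os.path.join(top, p)))
```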
need to distinguish "not changed in this layer" from "deleted in this layer"