Old notes 2
- cloud storage:
- reasonably cheap per byte (~$1/GB/year)
- unlimited in size (restricted only by your budget)
- unlikely to suffer data loss because it has internal redundancy
- restricted in bandwidth (maybe 100kB/s write, 1000kB/s read) and latency (~200ms?)
- private, but not strongly private - so encryption is a good idea
- no server-side smarts
- tends to want to write whole files atomically
consequences: we can write any amount of data, but we can't write it very fast, so we need to avoid reading back existing data, and we must be able to interrupt and resume a backup easily
Therefore a good backup program needs to:
- support both local near-line storage, and cloud storage to recover from local hardware failure
- write moderately large chunks, without choppy operations to read back previous state
- assume that any single run may never complete, and be able to pick up from an interruption and keep going
backup programs in general:
- simplicity is highly desirable: what’s not there can’t fail
- make it easy to verify that what’s stored there is self-consistent
- recover well from bugs
- need a format for escaping or encoding filenames within the index file (see the encoding sketch after this list)
- check handling of non-regular files (symlinks)
- check sorting of unicode names
- store names relative to the top of the backup tree
- cope with files changing while being backed up
- layers
- we have layers, and then blocks within each layer
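As a sketch of one possible index encoding (the format and helper names are illustrative assumptions, not a settled design): raw byte filenames could be percent-encoded into the text index, and sorted by raw bytes so the order doesn't depend on locale or Unicode normalization.

```python
# Sketch only: one possible way to encode arbitrary byte filenames in a
# text index, and to sort them stably. Names and format are illustrative.
import json
import urllib.parse


def encode_name(raw: bytes) -> str:
    """Percent-encode a raw filename so it survives a JSON/text index."""
    # Keep most printable bytes readable; escape the rest (and '%').
    return urllib.parse.quote_from_bytes(raw, safe="/._-")


def decode_name(encoded: str) -> bytes:
    return urllib.parse.unquote_to_bytes(encoded)


def sort_key(raw: bytes) -> bytes:
    # Sort by raw bytes, not by locale or normalized Unicode, so the
    # order is stable across platforms.
    return raw


if __name__ == "__main__":
    names = [b"b.txt", "caf\u00e9.txt".encode(), b"a\xff-not-utf8"]
    index = [encode_name(n) for n in sorted(names, key=sort_key)]
    print(json.dumps(index, indent=2))
    assert [decode_name(e) for e in index] == sorted(names, key=sort_key)
```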
All files actually stored are compressed and then gpg-encrypted.
There is layer start metadata: the UTC time the layer started being recorded, and the UTC start time of the previous backup layer, if this is an incremental backup. Only files modified on or after the time of the previous layer will be included.
Each block includes a tarball of actual files, and an index listing the files in that tarball with their mtime, ctime, and hash. Each real file is stored in a single block, so the tarballs can grow to at most twice the size of any actual file. (Perhaps, for easier atomic transfer, we should split large files, continuing them in a second tarball - that depends on whether we want to assume the storage allows resuming an interrupted transfer, but I think S3 does.) There is also an index of deletions, listing files present in the previous layer that are no longer present on disk.
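A minimal sketch of what a per-block index could look like under these assumptions (field names, JSON layout, and the choice of SHA-256 are all illustrative):

```python
# Sketch of a per-block index: the files in the block's tarball (with
# mtime, ctime, hash) plus a list of deletions. Field names illustrative.
import hashlib
import json
import os
from dataclasses import dataclass, asdict


@dataclass
class IndexEntry:
    path: str        # relative to the top of the backup tree
    mtime: float
    ctime: float
    sha256: str


def entry_for(top: str, rel_path: str) -> IndexEntry:
    # Regular files only; symlinks etc. would need their own handling.
    full = os.path.join(top, rel_path)
    st = os.lstat(full)
    h = hashlib.sha256()
    with open(full, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return IndexEntry(rel_path, st.st_mtime, st.st_ctime, h.hexdigest())


def write_block_index(path: str, entries: list, deletions: list) -> None:
    # The whole index is written in one shot, matching storage that
    # prefers whole-file atomic writes.
    doc = {"files": [asdict(e) for e in entries], "deleted": deletions}
    with open(path, "w") as f:
        json.dump(doc, f, indent=2, sort_keys=True)
```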
To start a new block within a layer: read the index of the previous block (if any), and get the last file stored within that block. Seek through the filesystem to the point lexically just after that file, and resume recording files (still subject to the layer's modification-time cutoff) from there.
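A sketch of how that resumption could work, assuming the JSON index layout above; the walk yields paths in lexical order and skips everything at or before the last recorded file:

```python
# Sketch: find where the previous block stopped and resume from the path
# lexically just after it. Assumes the index layout sketched above.
import json
import os


def last_path_in_index(index_path: str) -> str:
    with open(index_path) as f:
        files = json.load(f)["files"]
    return files[-1]["path"] if files else ""


def walk_in_order(top: str):
    """Yield relative file paths under `top` in lexical order.
    (A real implementation would stream this instead of sorting in memory.)"""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), top))
    yield from sorted(paths)


def resume_walk(top: str, previous_index=None):
    start_after = last_path_in_index(previous_index) if previous_index else ""
    for rel in walk_in_order(top):
        if rel > start_after:   # strictly after the last stored file
            yield rel
```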
To record deletions, we'll also need to read back the indexes of all underlying layers. (That suggests perhaps we want a stack, not a linear chain...)
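And a sketch of how the deletion index could be computed from the underlying layers' indexes (again assuming the illustrative index format above, with indexes passed in the order they were recorded so later deletions override earlier additions):

```python
# Sketch: compute deletions for a new layer as "recorded before, but no
# longer on disk". Assumes the index format sketched above.
import json
import os


def recorded_paths(index_paths):
    present = set()
    for index_path in index_paths:
        with open(index_path) as f:
            doc = json.load(f)
        present.update(entry["path"] for entry in doc["files"])
        present.difference_update(doc.get("deleted", []))
    return present


def deletions(top, underlying_indexes):
    previously_present = recorded_paths(underlying_indexes)
    return sorted(p for p in previously_present
                  if not os.path.lexists(os.path.join(top, p)))
```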
metadata: time layer recording started, time before which files will be excluded, path after which files are included
store files in order by path, modified after that time, whose path comes after the given point; once the layer gets too big, finish it and start a new one; also store an index of the files included in that layer
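Putting those notes together, a sketch of the writing loop (the size cap and the argument names are illustrative only):

```python
# Sketch of the writing loop: take files in path order, filter by mtime
# and by the resume point, and close the current block once it reaches a
# size cap.
import os
import tarfile

BLOCK_CAP_BYTES = 64 * 1024 * 1024   # illustrative cap


def write_blocks(top, paths_in_order, exclude_before, start_after, out_dir):
    block_num, block_bytes, tar = 0, 0, None
    for rel in paths_in_order:
        full = os.path.join(top, rel)
        st = os.lstat(full)
        if rel <= start_after or st.st_mtime < exclude_before:
            continue                       # outside this layer's window
        if tar is None or block_bytes >= BLOCK_CAP_BYTES:
            if tar:
                # finish the current block; its index would be written here too
                tar.close()
            block_num += 1
            block_bytes = 0
            tar = tarfile.open(
                os.path.join(out_dir, "block-%06d.tar.gz" % block_num), "w:gz")
        tar.add(full, arcname=rel)         # a whole file goes into one block
        block_bytes += st.st_size
    if tar:
        tar.close()
```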
mission: Real filesystems don't generally have self-consistent quiescent points. It doesn't make sense to try to restore a backup at a particular moment in time, because the filesystem probably never had exactly that state. Systems that think in terms of transactional snapshots tend to have trouble with needing to store a whole snapshot to accomplish anything. With large disks and intermittent network connections, it can be hard to ever finish a backup. If you've written 1GB of data, you ought to be able to restore most of that data, regardless of how much more remains to be written. So instead Blanket accumulates over time a set of files that partially cover the filesystem. Replaying all of these files in the order they were recorded lets you restore the filesystem up to that point. Each tarball covers a contiguous subsequence of the ordered list of files.
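A sketch of the restore side of that idea: replay the tarballs in the order they were recorded, applying each block's files and then its deletions (block and index naming are assumptions carried over from the sketches above):

```python
# Sketch: restore by replaying blocks in recording order. Later blocks
# overwrite earlier ones and deletion entries remove files, so the result
# is each file as it was most recently recorded.
import json
import os
import tarfile


def replay(block_tarballs_in_order, index_files_in_order, dest):
    for tarball, index_path in zip(block_tarballs_in_order,
                                   index_files_in_order):
        with tarfile.open(tarball, "r:*") as tar:
            # A real restore should sanitize member paths before extracting.
            tar.extractall(dest)
        with open(index_path) as f:
            for rel in json.load(f).get("deleted", []):
                target = os.path.join(dest, rel)
                if os.path.lexists(target):
                    os.remove(target)
```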
- copes well with interrupted or partial backups; as long as one whole file is transferred, it can be restored
- copes with a dumb server (including S3)
- can cap the amount of space used for backups, and keep as many previous increments as will fit within that cap
- simple backup format allowing manual recovery
- names: blanket? always? replicity?
## desires
- as much as possible, something that can just run from a cron job and never need maintenance
- cope with multiple interrupted short runs: get away from the idea of a single run that must complete before anything can be restored
- can interrupt or reboot client machine, reconnect to server
- ideally would not count on even uploading a single file in one transaction, but rather be able to resume uploading that single file; but this seems to put some constraints on the storage format and perhaps it's not really worthwhile
- quickly verify that what was uploaded is self-consistent and correct, ideally without downloading all of it - some contradiction there - might be able to ask S3 for the hash of the files? (see the sketch after this list)
- minimum assumptions about capabilities of the store: don't count on being able to represent all filenames or being able to store permissions
- restore some (multiple?) subset of files or directories without scanning through the whole archive
- use librsync to store deltas between file versions (later?)
- restore by meshing together multiple damaged or partial backups from different servers
- when restoring, if some files already have the right hash, don't bother reading them
- ui abstraction so it can get a gui later
- sign/encrypt data files, through gpg or something else
- relatively simple storage
- never require uploading the whole filesystem to make progress
- interrupted or in-progress backup mustn't prevent restore operations
- multiple increments would be nice, so that you can get back previous states
- a way to garbage-collect old unwanted increments, without rewriting the whole archive or making a new full backup
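On the verification point above: one hedged possibility is to record a hash for each uploaded object and compare it against what the store reports, without downloading the data. For S3 this could use the object's ETag via boto3, with the caveat that the ETag is a plain MD5 only for single-part uploads; this is a sketch under those assumptions, not a settled design.

```python
# Sketch only: verify uploaded blocks against hashes recorded locally,
# without downloading the data. Assumes boto3 and that objects were
# uploaded in a single part (multipart ETags are not plain MD5).
import boto3


def verify_block(bucket: str, key: str, expected_md5: str) -> bool:
    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    etag = head["ETag"].strip('"')
    return etag == expected_md5
```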
rsync: requires a smart counterparty; does a full filesystem scan on both machines at startup; stores files unpacked, so it can't encrypt and relies on the destination's filesystem capabilities
duplicity:
- for decent performance, it sometimes requires doing a full backup; but in fact you may often have some files that never change, and copying them all the way up again seems redundant
- until the first full backup completes, you can't restore anything?
- can resume backups but this seems to cause some glitches
file mtimes:
- if we could trust file mtimes, we could avoid a lot of trouble with reading indexes for the old backups, but they're probably not ultimately trustworthy
- also, files are probably fairly often touched but not changed, so if we rely on mtimes the lack of rsync-style delta compression will stick out more
- perhaps it's reasonable to think the clock does not skew by so much that one backup overlaps with another?
have "layers" of backups and do garbage collection based on that? so the daily backups would contain all files changed since the last weekly backup. exclude other backups from the same level from consideration when deciding whether a file needs to be backed up. but this doesn't seem to totally fit the rather freeform and emergent approach discussed in other places. do you have to tell it the level each time? have an explicit garbage-collection option to remove some or all layers, perhaps layers prior to a given date. do that by finding files that still exist and that are only referenced from those layers, and rewriting them into a smaller pack (see the sketch below). we could even do this locally by just looking at mtimes/ctimes, assuming we trust them, which is probably not quite safe enough.
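A rough sketch of that repacking step, reusing the illustrative index format from the earlier sketches (which layers count as "old" versus "kept" is decided elsewhere):

```python
# Sketch: garbage-collect old layers by finding files that still exist on
# disk and are referenced only by the layers being removed; those are the
# files that would be rewritten into a smaller replacement pack.
import json
import os


def files_only_in(old_indexes, kept_indexes):
    def paths(indexes):
        out = set()
        for index_path in indexes:
            with open(index_path) as f:
                out.update(e["path"] for e in json.load(f)["files"])
        return out
    return paths(old_indexes) - paths(kept_indexes)


def to_repack(top, old_indexes, kept_indexes):
    # Only files that still exist need to be carried forward; everything
    # else in the old layers can simply be dropped with them.
    return sorted(p for p in files_only_in(old_indexes, kept_indexes)
                  if os.path.lexists(os.path.join(top, p)))
```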
need to distinguish "not changed in this layer" from "deleted in this layer"