WORKBENCH / DISPATCH 004 / 4

Moving data you can't afford to lose

The boring disciplines behind a data migration you can actually prove: irreplaceability order, checksum gates, offsite-restore tests, uid preservation, and verifying by function.

  • zfs
  • rsync
  • restic
  • nfs
STATUS · SHIPPED

Last month I moved about three-quarters of a terabyte off an old machine onto a new ZFS mirror: photos, roughly ten years of media, and the data and databases behind about a dozen services. Copying is the easy part. The work is proving it — that every byte arrived, the databases still open, and the services come back. These are the checks I ran, in order.

Copy in irreplaceability order. First the data that exists nowhere else — photos, archive, originals — verified and offsite, before the bulk I could re-rip in a weekend. If a step fails early, it fails on replaceable data.

A size match is not a copy. du measures allocated disk space, not content: block size, compression, directory metadata, and hardlink handling differ between filesystems, so identical data can report two different sizes on two machines. Compare content, not size: rsync -c (checksum) or a byte-for-byte comparison. The checksum pass over the full archive returned zero differences.
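
For anyone who wants the same kind of content check without rsync, here is a minimal Python sketch. It is not the rsync -c pass described above, just an illustration of the idea: hash every regular file on both sides and report what is missing, extra, or different. The function names are made up for this example, and it assumes both trees are mounted locally and that symlinks and special files can be ignored.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its digest."""
    digests: dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            if full.is_symlink() or not full.is_file():
                continue  # symlinks and special files are out of scope here
            digests[full.relative_to(root).as_posix()] = file_digest(full)
    return digests


def compare_trees(source: Path, dest: Path) -> dict[str, list[str]]:
    """Report files missing from dest, extra in dest, or differing in content."""
    src, dst = tree_digests(source), tree_digests(dest)
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "extra": sorted(dst.keys() - src.keys()),
        "changed": sorted(p for p in src.keys() & dst.keys() if src[p] != dst[p]),
    }
```

A clean copy returns three empty lists. Two files of identical size but different content land in "changed", which is exactly the case a size comparison misses.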

Restore one file before you trust the backup. After the offsite restic copy finished, I restored a single 450 MB video and byte-compared it against the source. Identical. A backup you’ve never restored from is unverified.
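
A byte comparison of one restored file is small enough to sketch. The version below is an assumption about how one might do it in Python, not the tool used here: it reads both files in chunks and returns the offset of the first differing byte, or None when they are identical. The name first_difference is hypothetical.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other, the offset returned is the
    length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                # Find the exact position inside this pair of chunks.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended together with no difference
            offset += len(chunk_a)
```

Returning an offset rather than a bare True or False helps when the answer is "different": a mismatch at the very end usually means truncation, while one in the middle points at corruption.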

Preserve uids or Postgres won’t start. Postgres data is owned by a specific user id; a naive copy squashes ownership and Postgres refuses to start. I synced with rsync -aHX --numeric-ids over a share that preserves ownership, then checked it: wrote a file as that uid, confirmed it landed as that uid. Thousands of database-owned files moved across with ownership intact.
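
Ownership can also be audited after the sync. The sketch below is an illustration under stated assumptions, not the check described above: it assumes both trees are visible on one machine with numeric ids intact, and it compares the (uid, gid) of every file and directory. The function names are invented for the example.

```python
import os
from pathlib import Path


def owner_of(path: Path) -> tuple[int, int]:
    """Return (uid, gid) for a path without following symlinks."""
    info = os.lstat(path)
    return info.st_uid, info.st_gid


def ownership_mismatches(source: Path, dest: Path) -> list[tuple]:
    """List (relative path, source owner, dest owner) for every mismatch.

    A path missing from dest is reported with None as its dest owner.
    """
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(source):
        for name in dirnames + filenames:
            src_path = Path(dirpath) / name
            rel = src_path.relative_to(source)
            src_owner = owner_of(src_path)
            try:
                dst_owner = owner_of(dest / rel)
            except FileNotFoundError:
                dst_owner = None
            if src_owner != dst_owner:
                mismatches.append((rel.as_posix(), src_owner, dst_owner))
    return sorted(mismatches, key=lambda item: item[0])
```

An empty list means every path kept its numeric owner and group. Anything else is worth fixing before the database is started against the new location.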

Cut services over with a quiesce. Don’t copy data out from under a running app. Dump the database, stop the containers, run a final sync of the now-static files, repoint storage, restart. That last sync is consistent because nothing is writing. The source stays in place until the end, so any step can be reversed.
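
The claim "nothing is writing" can be spot-checked before the final sync. One possible approach, sketched here as an assumption rather than a step from this migration: record the size and modification time of every file, wait a little, record again, and compare. It only looks at metadata, so it will miss a write that preserves both size and mtime, and it complements stopping the containers rather than replacing it. The function names are hypothetical.

```python
import os
from pathlib import Path


def tree_snapshot(root: Path) -> dict[str, tuple[int, int]]:
    """Map each file's relative path to (size, mtime in nanoseconds)."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            info = os.lstat(full)
            key = full.relative_to(root).as_posix()
            snapshot[key] = (info.st_size, info.st_mtime_ns)
    return snapshot


def changed_paths(before: dict, after: dict) -> list[str]:
    """Paths added, removed, or modified between two snapshots."""
    every_path = before.keys() | after.keys()
    return sorted(p for p in every_path if before.get(p) != after.get(p))
```

Take one snapshot after the containers stop and another a minute later. If changed_paths returns anything, something is still writing and the final sync would not be consistent.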

Verify by function, not “container up.” A green status light proves little. Make the service do its job: the resolver resolving a real domain, the cameras reconnecting to their stations, a document going through scan → sync → file, the music server reporting its real track count, the devices rediscovering themselves on the network.
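
Functional checks like these are easy to collect into one pass. The sketch below is illustrative only: run_checks and resolves are made-up names, and resolves goes through the system resolver, so it exercises the migrated resolver only if the machine running it is configured to use that resolver. Each other service would need its own check written against whatever it exposes.

```python
import socket
from typing import Callable


def resolves(hostname: str) -> bool:
    """True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def run_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every check and return the names that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed
```

A call might look like run_checks({"resolver": lambda: resolves("example.com")}), with example.com standing in for whatever real domain matters. Every check runs even after one fails, so a single pass shows everything that is broken.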

Two things I didn’t plan for. A disk in the new mirror failed mid-copy; the migration continued untouched, because the irreplaceable data was already offsite and the source was only read, never written. And before deleting anything that looked like a duplicate, I confirmed it was one by comparing sizes and file types instead of trusting memory. Some things that look redundant hold the only copy.

The goal was never speed. It was a copy I could prove.


Postscript — the disk.

The failed mirror member (a WD 8 TB) sat degraded until the migration finished, then I cold-swapped it. No identify LED on the enclosure, so: power down, read the serials off the labels, pull the one that isn’t the survivor.

The resilver only had to copy the ~685 GB in use and finished in about an hour — but it logged 18 checksum errors on each disk and three unrecoverable ones, flagged as data corruption. The damaged object was a single piece of ZFS metadata (<metadata>:<0x3d>), not a file: corruption from the window when the mirror ran on one disk with no second copy to repair from. A scrub re-read every block with both disks present — repaired 0B in 00:57:45 with 0 errors, then “No known data errors” — and zpool clear reset the counters. Nothing lost.
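
As a sanity check on that scrub time, the implied read rate is simple arithmetic. The sketch assumes the scrub covered roughly the same ~685 GB the resilver copied and that GB means 10**9 bytes; both are assumptions, not figures from the pool's own output.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def throughput_mb_per_s(bytes_read: float, seconds: float) -> float:
    """Average rate in decimal megabytes per second."""
    return bytes_read / seconds / 1e6


elapsed = hms_to_seconds("00:57:45")        # 3465 seconds
rate = throughput_mb_per_s(685e9, elapsed)  # assumes ~685 GB = 685 * 10**9 bytes
print(f"{elapsed} s, about {rate:.0f} MB/s")  # prints "3465 s, about 198 MB/s"
```

A scrub on a mirror should read both copies in parallel, so this is best read as a rough per-disk figure. Just under 200 MB/s is a believable sequential rate for a large spinning disk, consistent with the scrub having read all the data in use rather than skipping some of it.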

The corruption banner was real. It didn’t matter, because the originals were still on the old machine and a verified copy was already offsite. A degraded destination only costs you the data it holds the sole copy of.

BUILT.