WORKBENCH / DISPATCH 003 / 3

Draining a server with the lights on

Retiring a flaky, load-bearing box in an afternoon: DNS, public ingress, home automations and all, with nobody noticing.

  • docker-swarm
  • dns
  • cloudflare-tunnel
  • observability
STATUS · SHIPPED

I’d been putting off killing that machine for months.

Not because it was hard. Because it was scary. One box, quietly load-bearing: the primary DNS for the whole house, the public front door for a handful of sites, the brain running my home automations, the feed for the cameras. If it stuttered, everything downstream felt it. So it sat there, fragile, having already crashed on me once, and I kept finding reasons not to touch it.

Last week I drained it completely. In an afternoon. Nothing went down that anyone noticed.

I want to write down how, because the how is the interesting part. The fear was never really about the work. It was about how much depended on that one box. And a dependency like that is something you can take apart deliberately, one piece at a time.

Here is the shape of it. Every critical job moved to a calmer host: DNS to a small, boring Raspberry Pi, the automations and cameras to the machine that already runs the smart home, the public tunnel to a cloud box that never sleeps. Then the old machine left the cluster, left the orchestrator, and powered off. Its disk is intact, sitting cold, in case I ever want it back.

I did most of it with Claude at the keyboard. Not doing the thinking for me. Holding the method when I’d have been tempted to rush.

A few things made it calm instead of tense.

Move under the cover of redundancy. DNS is the classic touch-it-and-the-house-goes-dark service. So I never let it go dark. A second resolver answered the entire time, and I only moved the primary once the backup was carrying the load. The rule wasn’t “be careful.” It was “keep one resolver answering, always.”
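
That rule is simple enough to turn into a guard. What follows is a minimal sketch, not the script from that afternoon: it assumes two resolver addresses (the ones in the usage note are documentation placeholders) and a probe name of example.com, and the function names are made up for illustration. It refuses to let you touch one resolver unless a different one is answering right now.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary id, echoed back by the server


def build_query(name: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Terminating zero label, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def resolver_answers(server: str, name: str = "example.com",
                     timeout: float = 2.0) -> bool:
    """True if `server` returns a NOERROR reply with at least one answer."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
    except OSError:  # covers timeouts and unreachable hosts
        return False
    if len(reply) < 12:
        return False
    reply_id, flags, _, answers, _, _ = struct.unpack("!HHHHHH", reply[:12])
    return reply_id == QUERY_ID and (flags & 0x000F) == 0 and answers > 0


def safe_to_touch(target: str, resolvers: list[str],
                  probe=resolver_answers) -> bool:
    """The invariant: some resolver other than `target` must be answering."""
    return any(probe(r) for r in resolvers if r != target)
```

Called as safe_to_touch("192.0.2.53", ["192.0.2.53", "192.0.2.54"]) before each step on the primary, it answers one question only: if this resolver goes away now, is another one still answering?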

Check what a thing does before you kill it. One container looked like dead weight. Its config was a stale clone of services that had moved away months ago, so deleting it seemed obvious. Then its logs showed it was the live public front door, still routing real traffic. We relocated it instead. Better to check what a thing actually does before deciding it does nothing.

Relocate, don’t rebuild. Most of these services were stateless once their data lived on shared storage. Moving one was a thirty-second repoint, not a rebuild: stop it here, point it there, start it, check. Each step reversible, and the old box stayed on until the end, so undo was always one command away.
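
The stop, point, start, check loop has a shape worth writing down. This is a sketch of that shape only, with hypothetical names (Step, run_reversible) and no real orchestrator commands in it: each step carries its own undo and its own check, and a failed check unwinds everything done so far, newest first.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    do: Callable[[], None]      # make the change
    undo: Callable[[], None]    # put it back
    check: Callable[[], bool]   # is the service healthy after the change?


def run_reversible(steps: list[Step]) -> bool:
    """Run steps in order; on the first failed check, undo in reverse order."""
    done: list[Step] = []
    for step in steps:
        step.do()
        done.append(step)
        if not step.check():
            for finished in reversed(done):
                finished.undo()
            return False
    return True
```

The sketch does not handle a do() that raises, and it assumes every undo works. The second assumption is why the old box stayed powered on until the end.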

Steal the address, skip the politics. The textbook DNS move means editing the router and waiting for every device to pick up the change. Instead, the new host took over the old one’s IP once the old box had powered down. No router change, no waiting. As far as the clients were concerned, nothing moved.

Wire the alarms first. Before any of this, I built the part I’d put off for a year: monitoring. Two real failures had slipped past me recently. A machine that crashed and sat dead for half a day. A storage array running degraded on a single disk, found by chance. So the first job wasn’t moving anything. It was making the system able to tell me, on my phone, when something is wrong.
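
Both of those failures were silent. A dead machine cannot report its own death, so the check has to run somewhere else and look for silence. The sketch below shows that idea and nothing more. It is not the real monitoring setup: it assumes each host records a heartbeat timestamp somewhere a watcher can read, the names and threshold are invented, and delivery to a phone is left out.

```python
import time
from typing import Optional


def silent_hosts(last_seen: dict[str, float], max_age_s: float,
                 now: Optional[float] = None) -> list[str]:
    """Return the hosts whose last heartbeat is older than max_age_s seconds."""
    if now is None:
        now = time.time()
    return sorted(host for host, ts in last_seen.items() if now - ts > max_age_s)
```

Anything this returns is worth a notification. Half a day of silence would have tripped it within one threshold, not by chance.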

A few smaller things, for anyone who has been here. A single config file mounted into a container keeps serving the old version after you change it, because the mount pins the original file rather than the path, and many editors save by swapping a new file into place; you have to recreate the container, not restart it. Two managers is the one cluster size that is worse than one, because losing either one breaks the quorum. And a service can sit there reporting “down” while doing its job perfectly well, if its name doesn’t match what the dashboard expects.
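
The manager arithmetic is worth seeing once. Swarm managers agree through Raft, which needs a strict majority of managers to be reachable. The function names here are mine, for illustration; the arithmetic is just majority voting.

```python
def quorum(managers: int) -> int:
    """Votes needed for a strict majority among `managers` voting nodes."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be lost while a majority remains."""
    return managers - quorum(managers)


for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 5 3 2
```

Two managers tolerate exactly as many failures as one, which is none, while doubling the number of machines whose loss stalls the cluster. Three is the first size that buys anything.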

My favourite of the afternoon was a test that failed three times in a row, insisting DNS was down. It wasn’t. I had fat-fingered the test. We checked by hand, found the typo, and the panic went away. The thing doing the checking can be wrong too.

Here is what I actually want to keep from it.

A migration this size used to mean a weekend, a knot in my stomach, and a rollback plan I didn’t fully trust. This time it was an afternoon of small, reversible steps, each one checked before the next. The calm didn’t come from being brave. It came from method, and from a partner that held the method when I’d have rushed: keep the invariant, check before you delete, verify before you believe, leave the door open until you’re sure.

There is a version of working with AI that is all speed and noise. This wasn’t that. It was slower, quieter, a lot more boring. The kind of boring that lets you finally do the thing you’ve been putting off, and find out it was never as frightening as it looked.

The server is off now. Nobody noticed.

You don’t need to be fearless to take on the big migration. You need to be reversible.

BUILT.