WORKBENCH / DISPATCH 002 / 5

Blueprint: Migrating a 70-Device Zigbee Network to a New Coordinator Stack — Without Re-Pairing

A field-tested runbook for moving Zigbee2MQTT from an ember (EFR32) coordinator to a zstack (CC2652P) one in place — including the undocumented channel_mask trap, a socket-contention root cause, and what an AI pair-operator actually contributed. With three postscripts: the beta-firmware auto-update ambush, the week-long hunt that ended in a wrong-board-variant firmware and a Wi-Fi AP squatting on the mesh's spectrum, and a 72-hour soak that finally turned the claim into proof.

  • zigbee2mqtt
  • zigbee-herdsman
  • zstack
  • home-assistant
  • mqtt
STATUS · SHIPPED
Hand-drafted blueprint-style schematic of a Zigbee mesh network: dozens of small device glyphs connected by fine hairline routes converging on a new central coordinator drawn in green, while the old coordinator sits crossed out in ochre at the edge of the sheet

Our Zigbee mesh had been sick for months: devices getting stuck, new pairings hanging, removals failing, and a network-wide sluggishness that came and went. It was around 70 devices (lights, plugs, buttons, climate sensors, mmWave presence radars) on an EFR32MG24-based Ethernet coordinator running Zigbee2MQTT’s ember driver.

Today we moved the whole network to a CC2652P-based coordinator (an SMLIGHT SLZB-06, also over Ethernet) on the zstack driver — in place, preserving the network, with almost no re-pairing. The official Zigbee2MQTT docs call cross-stack restore unsupported: “results might vary.” They did vary. This is the blueprint of what actually happened, including the one-line fix that made it work.

The setup

Two changes, one session:

  1. Rehome Zigbee2MQTT onto a new VM host (its data had been living on an NFS share; now it’s on local NVMe with hourly snapshots).
  2. Swap the coordinator: ember/EFR32 → zstack/CC2652P, using Z2M’s backup/restore so devices keep their network.

Both coordinators are reached over TCP, so there is no USB passthrough anywhere. The MQTT broker and base_topic stayed identical, which means every Home Assistant entity ID survives untouched. That’s the quiet superpower of the Z2M architecture: HA couples to MQTT topics, not to where Z2M runs or what radio it speaks through.

Finding #1: the secret second client

During pre-flight recon we found something nobody had put in the plan: Home Assistant still carried an enabled ZHA config entry pointing at the same TCP coordinator socket Zigbee2MQTT was using. Zero devices, never completed setup — but every HA restart it would poke the coordinator while Z2M held it.

We got live proof during the migration: the moment the old Z2M instance disconnected, something grabbed the socket, and the new instance crash-looped on the EZSP handshake until the ZHA entry was deleted.

Rule: one coordinator, one client. If you migrated from ZHA to Z2M years ago, check that the old config entry is actually gone. A TCP coordinator makes this failure mode much easier to create than a USB stick ever did.
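
A minimal Python sketch of that recon check, assuming Home Assistant keeps its config entries as JSON in .storage/core.config_entries under the config directory, with a data.entries list whose items carry a domain field. The helper names are hypothetical, and the sample dictionary is made up for illustration; verify the file layout on your own install before trusting a clean result.

```python
import json
from pathlib import Path


def load_config_entries(ha_config_dir: str) -> dict:
    """Read Home Assistant's config-entry store (assumed path and layout)."""
    path = Path(ha_config_dir) / ".storage" / "core.config_entries"
    return json.loads(path.read_text(encoding="utf-8"))


def find_entries_for_domain(config_entries: dict, domain: str = "zha") -> list:
    """Return every config entry registered for the given integration domain."""
    entries = config_entries.get("data", {}).get("entries", [])
    return [entry for entry in entries if entry.get("domain") == domain]


# Illustrative sample, not taken from a real install.
sample = {
    "data": {
        "entries": [
            {"domain": "mqtt", "title": "broker"},
            {"domain": "zha", "title": "leftover", "disabled_by": None},
        ]
    }
}

leftovers = find_entries_for_domain(sample)
for entry in leftovers:
    print("still configured:", entry["domain"], entry["title"])
```

Anything this returns for zha while Z2M owns the coordinator is a candidate second client and worth deleting in the HA UI.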

Phase 1: moving the host

Mostly routine — stop, copy data, redeploy elsewhere — with one lesson worth paying for:

Take your backup after stopping Zigbee2MQTT, and checksum it. Z2M rewrites database.db and coordinator_backup.json on shutdown. Our first tar, taken while it was still running, silently differed from the post-stop state. md5sum on both ends caught it.
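
The same checksum discipline can be sketched in Python. We used md5sum on the command line; this sketch swaps in SHA-256 via hashlib, which serves the same purpose here. The function names are ours, not part of any Z2M tooling.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(source_path: str, copied_path: str) -> bool:
    """True when both files hash identically, i.e. the copy is byte-for-byte intact."""
    return file_digest(source_path) == file_digest(copied_path)
```

Run it once against the archive on the old host and once against the copy on the new host, after Z2M has been stopped, and compare the two digests.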

Verification that the move was clean: same coordinator, same network, only the client host changed — devices answered a state read round-trip within a minute.

Phase 2: the cross-stack restore (and the trap)

The happy path: stop Z2M, point serial: at the new coordinator with adapter: zstack, power off the old coordinator, start. zigbee-herdsman detects a blank adapter plus a valid backup and restores the network — PAN ID, extended PAN, network key, frame counter — onto the new radio.

What actually happened:

z2m: Error: network commissioning timed out - most likely network
with the same panId or extendedPanId already exists nearby

Crash loop, every ~70 seconds. The “network nearby” was our own live mesh — forty mains-powered routers still beaconing. Z2M wasn’t restoring; it was trying to form a brand-new network with the same parameters, and colliding with itself.

Why? We read the herdsman source inside the container image. The zstack startup strategy requires the backup to exactly match the configuration before it will restore — and that comparison includes the channel list, compared as packed bitmasks:

  • Our config said channel: 25.
  • The backup — written by the ember driver — contained channel_mask: [11, 12, ..., 26]. The full scan mask. All sixteen channels.

[25] !== [11..26], so herdsman concluded “configuration does not match backup” and silently fell through to forming a new network. The error message about a conflicting PAN nearby is two steps removed from the actual cause. Nothing in the logs says “your channel mask is why.”
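
To make the mismatch concrete, here is a small Python illustration of comparing channel lists as packed bitmasks. It is a sketch of the idea only, not herdsman’s actual code; pack_channels is a hypothetical helper that sets one bit per channel number.

```python
def pack_channels(channels) -> int:
    """Pack a list of 2.4 GHz Zigbee channels (11..26) into a bitmask, one bit per channel."""
    mask = 0
    for channel in channels:
        if not 11 <= channel <= 26:
            raise ValueError(f"not a 2.4 GHz Zigbee channel: {channel}")
        mask |= 1 << channel
    return mask


configured = pack_channels([25])             # what the config asked for
backed_up = pack_channels(range(11, 27))     # what the ember-written backup held

print(hex(configured))          # 0x2000000
print(hex(backed_up))           # 0x7fff800
print(configured == backed_up)  # False, so the backup is treated as a different network
```

Channel 25 is inside the backup’s mask, but an exact-match comparison does not care about that: the two integers differ, so the restore path is skipped.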

The fix is one line. Edit coordinator_backup.json:

"channel_mask": [25]

Set advanced.log_level: debug for the next start and you can watch the decision flip:

(stage-1) adapter is not configured / not commissioned
(stage-2) configuration matches backup
determined startup strategy: restoreBackup
...
zigbee-herdsman started (restored)

Same PAN, same key, channel 25, frame counter carried over. The mesh never knew anything changed.
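
If you would rather script the edit than hand-edit JSON, a minimal Python sketch follows. It assumes only what this post relies on: coordinator_backup.json is a JSON object with a top-level channel_mask list. The function names and the .orig suffix are our own choices. Run it with Z2M stopped.

```python
import json
import shutil
from pathlib import Path


def pin_channel_mask(backup: dict, channel: int) -> dict:
    """Return a copy of the backup whose channel_mask holds only the given channel."""
    if not 11 <= channel <= 26:
        raise ValueError(f"not a 2.4 GHz Zigbee channel: {channel}")
    patched = dict(backup)  # shallow copy; only the top-level key is replaced
    patched["channel_mask"] = [channel]
    return patched


def patch_backup_file(path: str, channel: int) -> None:
    """Patch the backup file in place, keeping the untouched original next to it."""
    source = Path(path)
    shutil.copy2(source, source.with_name(source.name + ".orig"))
    backup = json.loads(source.read_text(encoding="utf-8"))
    patched = pin_channel_mask(backup, channel)
    source.write_text(json.dumps(patched, indent=2), encoding="utf-8")
```

Keeping the .orig copy means a wrong guess about the channel costs nothing: put the original back and try again.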

Two more cross-stack footnotes:

  • Unplug the old coordinator before the first start. Its radio keeps the network alive independent of any host software. Two coordinators with identical restored parameters is not an experiment you want.
  • The restored coordinator IEEE came out byte-reversed relative to the original. It sounds alarming; it’s benign. The radio, the database entry, and all new bind targets are self-consistent, and existing device bindings deliver by network address (0x0000) anyway. The only fallout is a possible stale duplicate “bridge” device in HA’s MQTT discovery.

Results

  • ~90% of live devices worked immediately — no re-pairing, no renaming, no HA changes.
  • Routes rebuilt over ~20 minutes (the first test wave looked scary at 1-of-6; twenty minutes later the core of the house answered).
  • Several devices the old coordinator had lost for days came back on their own.
  • A device that had never successfully paired on the ember coordinator — one of the reasons for this migration — paired on the first attempt.
  • The stragglers were exactly the devices that were already dead before the migration: flat batteries and wall-switched lamps, not mesh victims.

Appendix: the flooders

Part of the original instability diagnosis was “at least one chatty device.” Measurement (a 45-second MQTT sample, counted per topic) found three Tuya ZY-M100 mmWave presence radars each pushing 30–100 messages per minute — and diffing consecutive payloads showed they were identical, or differed only in link quality. The firmware just re-broadcasts its full state about once a second. The advertised knobs (detection_delay, sensitivity) changed nothing.
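
The measurement itself is simple enough to sketch in Python. This version works on messages that have already been captured as (topic, payload) pairs, so it makes no assumption about which MQTT client collected them. The function names are ours; linkquality is the field we ignored when diffing, and the topic names in any input are whatever your devices are called.

```python
import json
from collections import Counter


def messages_per_minute(samples, window_seconds: float) -> dict:
    """Scale per-topic message counts from a sampling window to a per-minute rate."""
    counts = Counter(topic for topic, _payload in samples)
    return {topic: count * 60 / window_seconds for topic, count in counts.items()}


def redundant_fraction(payloads, ignore=("linkquality",)) -> float:
    """Fraction of consecutive payload pairs that are identical once ignored keys are dropped."""
    stripped = [
        {key: value for key, value in json.loads(payload).items() if key not in ignore}
        for payload in payloads
    ]
    pairs = list(zip(stripped, stripped[1:]))
    if not pairs:
        return 0.0
    return sum(previous == current for previous, current in pairs) / len(pairs)
```

A device whose redundant fraction sits near 1.0 at a high message rate is re-sending state rather than reporting changes, which is the signature we saw on the radars.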

What worked: Z2M’s per-device debounce option —

debounce: 2
debounce_ignore:
  - presence

— which collapsed 50 msgs/min to 6–9 on the MQTT side while keeping presence transitions instant. The radio-side chatter remains (only replacing the hardware fixes that), but the CC2652P absorbs it without drama. It was the previous driver/radio combination that couldn’t.

Credit where it’s due

Two nudges:

To the Zigbee2MQTT team and Koenkk’s ecosystem: the fact that a not-officially-supported cross-stack migration comes down to one JSON field is a testament to how well the open coordinator backup format, the adapter abstraction, and the debug logging are built. Per-device options like debounce, the frontend, the discovery integration — this project carries an absurd amount of the smart-home world on volunteer shoulders. Sponsor it if you rely on it.

On working with an AI pair-operator: this migration was executed interactively with Claude Code driving recon and cutover — and the decisive moment was not automation, it was diagnosis: when the restore crash-looped, it read the herdsman source straight out of the container image, traced the strategy decision to the packed-channel-list comparison, and proposed the one-line backup edit with a debug-log verification plan. Checksum discipline, a live log monitor during the soak, and payload-diffing the flooders came from the same place. The human contribution was judgment: what to risk, when to cut over, which physical plugs to pull. That division of labor felt right.

The blueprint, condensed

  1. Recon first: confirm nothing else talks to your coordinator socket (looking at you, leftover ZHA entries).
  2. Update Z2M, then stop it and back up the data dir. Checksum the copy.
  3. New coordinator: flash current Z-Stack coordinator firmware, Ethernet mode, reserved IP, before touching the network.
  4. Edit the serial: section: set adapter: zstack and the new port. Set coordinator_backup.json channel_mask to your actual channel.
  5. Power off the old coordinator. Start. Verify restoreBackup in debug logs.
  6. Wait for routes (minutes, not seconds). Test mains routers first, then battery devices.
  7. Re-pair only what stays dead — original friendly names mean HA entities survive re-pairing too.
  8. Measure your chattiest devices; debounce the unfixable ones.
  9. Keep the old coordinator as a cold spare. Never power both.

Total downtime for the radio cutover: about 30 minutes, most of it deliberate verification. The network came back healthier than it went down.

Postscript, two days later: the firmware ambush

The network came back healthier than it went down — and stayed that way for exactly two days. Then, on a quiet Tuesday afternoon, every light in the house stopped listening. Sensors kept reporting; commands went nowhere. It felt like the migration had failed after all. It hadn’t. This is the sequel, and it has different villains.

The symptom signature worth memorizing

Every outbound command failed instantly with Z-Stack error MAC_BAD_STATE (0x19), while inbound traffic streamed in undisturbed — a thousand sensor reports per half hour, zero command deliveries. That asymmetry is the tell:

RX works, TX dead-on-arrival = the radio chip’s firmware is wedged. Not antenna, not interference, not routing. Interference looks like MAC_NO_ACK (transmits that nobody acknowledges). MAC_BAD_STATE means the MAC layer refuses to even try. We burned a soft chip reset, a full coordinator reboot, a physical power-cycle, and a transmit_power config experiment confirming that — nothing cleared it, because the broken code just booted right back up.

The villain: auto-updated beta firmware

The SLZB-06 ships with “Zigbee firmware automatic update” enabled. What the toggle doesn’t say: SMLIGHT pushes their development builds through that channel. The build our coordinator had auto-installed — and the one we “fixed” it with by reflashing to “the latest version” — were both tagged Dev firmware (Beta) in exactly one place: the fine print of the manual firmware picker, three clicks deep. They’re experimental UART/DMA-optimization builds. The device’s own log page was showing “Firmware crash detected” with a downloadable crash dump.

So an unattended process flashed prerelease radio firmware under a live production mesh, the firmware crashed, and the failure mode it left behind survived every reset in the book.

Rule: a Zigbee coordinator is infrastructure, not a phone. Turn firmware auto-update off. Update deliberately, in a maintenance window, with Z2M stopped and a fresh backup verified — and only flash builds not tagged Dev/Beta. The newest number in the list is not the most stable; on our device the latest stable coordinator build (Koenkk’s public 20250321 release) was a year older than the betas being auto-pushed.

The recovery — and the two lines that make next time boring

Flashing the stable build fixed the radio and wiped the chip’s network config, which dropped us straight back into Phase 2 of this very post: herdsman refused the backup, tried to form a fresh network, and collided with our own routers. Same crash loop, new decade.

This time the durable fix wasn’t editing the backup — it was admitting the config had been coasting on defaults. Z2M had been running with no explicit network identity at all, so every NV wipe turned into a forensic exercise. Two lines in configuration.yaml, values read straight out of coordinator_backup.json:

advanced:
  pan_id: 6754
  ext_pan_id: [0, 18, 75, 0, 56, 167, 198, 7]

With the identity pinned, commissioning succeeded in eight seconds — the coordinator re-established the same network and the routers simply accepted it. Devices started re-announcing on their own within minutes.
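
Reading those values out of the backup can be sketched in Python. This rests on two assumptions you should verify against your own file: that the backup stores pan_id and extended_pan_id as hex strings, and that the extended PAN bytes go into the config in the same order as they appear in the string. The sample dictionary below is constructed from the config values shown above under those assumptions; it is not a copy of our backup.

```python
def identity_from_backup(backup: dict) -> dict:
    """Convert the backup's hex-string identity fields into config-style values.

    Assumes hex-string encoding and unchanged byte order (verify both).
    """
    pan_id = int(backup["pan_id"], 16)
    ext_pan_id = list(bytes.fromhex(backup["extended_pan_id"]))
    if len(ext_pan_id) != 8:
        raise ValueError("extended PAN ID must be exactly 8 bytes")
    return {"pan_id": pan_id, "ext_pan_id": ext_pan_id}


# Illustrative input, derived from the config values in this post.
sample_backup = {"pan_id": "1a62", "extended_pan_id": "00124b0038a7c607"}
identity = identity_from_backup(sample_backup)
print(identity["pan_id"])      # 6754
print(identity["ext_pan_id"])  # [0, 18, 75, 0, 56, 167, 198, 7]
```

If the first start after pinning still refuses the backup, compare the values in the debug log against the file before changing anything else; byte order is the first thing to suspect.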

One last Z2M behavior worth knowing before you need it: when herdsman commissions a network it considers new, Z2M resets database.db — suddenly every device is “Entity unknown”. Don’t panic and don’t re-pair: it writes database.db.backup first. Stop Z2M, copy the backup over database.db, start — every name, every Home Assistant entity, back as if nothing happened.
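
A cautious version of that restore, as a Python sketch. It assumes the two file names described above sit in the same Z2M data directory; the function name and the timestamped safety copy are our own additions. Run it only with Z2M stopped.

```python
import shutil
import time
from pathlib import Path


def restore_database(data_dir: str):
    """Copy database.db.backup over database.db, keeping the reset file as a safety copy.

    Returns the path of the safety copy, or None if there was no database.db to keep.
    """
    data = Path(data_dir)
    current = data / "database.db"
    backup = data / "database.db.backup"
    if not backup.is_file():
        raise FileNotFoundError(f"no backup found at {backup}")

    kept = None
    if current.exists():
        kept = data / f"database.db.reset-{int(time.time())}"
        shutil.copy2(current, kept)  # never destroy the current state outright

    shutil.copy2(backup, current)
    return kept
```

Copying rather than moving leaves database.db.backup in place, so a second attempt is still possible if the first start goes wrong.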

Postscript amendments to the blueprint

  1. Disable the coordinator’s firmware auto-update. Day one. Before it’s load-bearing.
  2. Only flash firmware not tagged Dev/Beta — check the manual picker’s fine print, because nothing else will tell you.
  3. Pin pan_id and ext_pan_id in your Z2M config (copy them from coordinator_backup.json). This is the difference between a two-minute recovery and an afternoon of source-diving the next time anything wipes the chip.
  4. MAC_BAD_STATE on every TX while RX flows = reflash the radio firmware. Stop trying resets; the problem boots back up with it.
  5. If Z2M ever greets you with “Entity unknown” across the board: database.db.backup has your network. Restore it before you re-pair anything.

The diagnosis arc this time was the same human–AI division of labor as the migration itself — the AI traced the herdsman strategy decision in the container source again, drove the coordinator’s web UI to find the (Beta) tags the dashboard hides, and caught the database backup before it was overwritten; the human pulled the plugs and made the call on what to flash. The mesh is now on boring stable firmware, with its identity written down, and auto-update off. Which is to say: it’s finally infrastructure.

Postscript II, one day later: everything was lying to us

The “boring stable firmware” lasted eighteen hours. Then the mesh started flapping: perfect after every restart, degrading within the hour, commands timing out to devices that were happily publishing. We spent a day chasing it through every layer — and what we found rewrites the lessons list, because three independent faults had been stacked on top of each other the whole week, each one masking the others.

Fault one: the wrong board variant — from the vendor’s own updater

Koenkk’s official firmware ships two builds per release: CC1352P2_CC2652P_launchpad_* for TI’s reference boards, and CC1352P2_CC2652P_other_* for third-party sticks. The difference isn’t cosmetic — it’s which GPIO pins drive the 20 dBm power amplifier and the RF switch. Koenkk’s hardware table maps every SMLIGHT device to the other build.

SMLIGHT’s own firmware picker had served us the launchpad build.

Run the wrong variant and you get a radio that mostly receives but transmits through a mis-driven front-end: link quality halved at the exact moment of the flash, commands flaky in proportion to load, transmit_power settings doing nothing comprehensible (they tune a PA the firmware can’t actually reach). Nothing in any log says “wrong board variant.” The version number — identical between both builds — reassures you everything is fine.

Rule: before flashing any Z-Stack firmware, check Koenkk’s hardware table for your exact board’s variant — even when the file comes from the device vendor’s own updater. Especially then.
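
A pre-flash guard for this can be sketched in a few lines of Python. It relies only on the two filename prefixes named above; the function names are hypothetical, and the expected variant must come from your own reading of the hardware table, since the code cannot know your board.

```python
import re

# Matches the two build families named in this post.
_VARIANT_PATTERN = re.compile(r"^CC1352P2_CC2652P_(launchpad|other)_")


def firmware_variant(filename: str):
    """Return 'launchpad' or 'other' for a recognised filename, else None."""
    match = _VARIANT_PATTERN.match(filename)
    return match.group(1) if match else None


def check_variant(filename: str, expected: str) -> None:
    """Raise if the file is not the variant the hardware table says the board needs."""
    found = firmware_variant(filename)
    if found != expected:
        raise ValueError(f"{filename}: found variant {found!r}, expected {expected!r}")
```

The check is deliberately dumb. Its value is that it forces the question "which variant does my board need?" to be answered before the flash, not after a week of halved link quality.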

Fault two: a Wi-Fi AP squatting on the mesh’s doorstep

With firmware finally provably correct and symptoms persisting, we stopped trusting opinions and measured the spectrum: a raw ZNP energy scan across all 16 Zigbee channels, coordinator-side, z2m stopped.

ch23: 226 ████████████████████████████
ch24: 255 ███████████████████████████████   ← meter pegged
ch25:  32 ████                              ← our channel, next door

A Wi-Fi access point on Wi-Fi channel 11 — re-rolled there by a router firmware update earlier in the week — was saturating Zigbee channels 23–24 from a few meters away, desensing our channel-25 receiver. One channel-exclusion in the Wi-Fi controller later: 226/255 → 0/1. The wall of noise simply vanished.

Rule: when a mesh degrades in both directions (RX link quality AND TX reliability), scan the spectrum before blaming software. And remember that Wi-Fi auto-channel can re-roll whenever a router reboots or updates, so yesterday’s clean spectrum is not today’s.
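
The channel arithmetic behind that collision is worth having on hand. The Python sketch below uses the standard 2.4 GHz center frequencies (Zigbee channel k at 2405 + 5·(k − 11) MHz, Wi-Fi channel n at 2407 + 5·n MHz for channels 1 to 13). The 22 MHz Wi-Fi width and 2 MHz Zigbee width are nominal figures chosen for the sketch, and nominal overlap says nothing about how far a strong nearby transmitter desenses adjacent channels.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee channel (11..26)."""
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Wi-Fi channel (1..13)."""
    return 2407 + 5 * channel


def overlapped_zigbee_channels(wifi_channel: int,
                               wifi_width_mhz: float = 22.0,
                               zigbee_width_mhz: float = 2.0) -> list:
    """Zigbee channels whose nominal band overlaps the given Wi-Fi channel's band."""
    reach = (wifi_width_mhz + zigbee_width_mhz) / 2
    wifi_center = wifi_center_mhz(wifi_channel)
    return [
        channel for channel in range(11, 27)
        if abs(zigbee_center_mhz(channel) - wifi_center) < reach
    ]


print(overlapped_zigbee_channels(11))  # [21, 22, 23, 24]
```

Wi-Fi channel 11 nominally covers Zigbee 21 through 24, which puts its upper edge directly beside channel 25. That matches what the scan showed: 23 and 24 saturated, 25 elevated.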

Fault three: our own instruments

The most expensive lesson of the week wasn’t in the radio at all. Over two days, four separate measurement bugs sent us chasing ghosts:

  • docker logs --since silently returned empty after the host’s unclean reboot (while --tail worked) — making an ongoing device storm look like total calm.
  • A health probe’s JSON payload lost its quotes through three layers of shell escaping, arriving as invalid JSON that z2m silently ignored — so our “command round-trip probe” had never probed anything, and an entire afternoon of “still broken” verdicts measured nothing.
  • A 9-device parallel test panel was itself congesting the adapter queue it was trying to measure.
  • And z2m’s internal retry queue replayed commands seeded by an always-open dashboard tab — every ~28 seconds, forever, surviving the tab’s closure — so the “mysterious poller” we hunted across four hosts was a ghost of our own UI session.

Rule: validate every automated probe against a known-good manual run before trusting its verdicts. When a system looks impossibly quiet — or impossibly broken — suspect the instrument first. We now keep one rule of thumb: no diagnostic conclusion from a probe that hasn’t been seen working at least once.
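
For the quoting bug specifically, one way to take the shell out of the picture is sketched below in Python: build the payload with json.dumps, check that it parses before sending, and hand it to the publisher as a single argument rather than through a quoted shell string. The function names are ours. The command list assumes the mosquitto_pub client with its -h, -t and -m options and the usual zigbee2mqtt/<friendly_name>/set topic; the sketch only builds the command and does not run it.

```python
import json


def build_probe_payload(state: str = "ON") -> str:
    """Serialize the probe command with a real JSON encoder, never by hand."""
    return json.dumps({"state": state})


def is_valid_json_object(text: str) -> bool:
    """True only if the text parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def probe_command(host: str, friendly_name: str, payload: str) -> list:
    """Build an argument list for the publisher; refuses payloads that are not valid JSON."""
    if not is_valid_json_object(payload):
        raise ValueError(f"refusing to publish invalid JSON: {payload!r}")
    topic = f"zigbee2mqtt/{friendly_name}/set"
    return ["mosquitto_pub", "-h", host, "-t", topic, "-m", payload]


print(is_valid_json_object('{state: ON}'))          # False: what our escaping produced
print(is_valid_json_object(build_probe_payload()))  # True
```

Passing the list to subprocess.run without shell=True keeps the payload byte-for-byte intact, which is exactly the property three layers of escaping destroyed.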

The ending: a deliberate, scoped rebuild

After a week of archaeology, the human made the right call: stop excavating, start fresh — but scoped. Not 70 devices: the two rooms that matter daily. New network identity (generated keys instead of z2m’s publicly-known default, fresh PAN, channel 20 straight from the scan data), groups created before pairing, every device factory-reset and named to the standard at join time, remote reset-pulses scripted through the smart relays that feed the fixtures.

Nineteen devices joined in one evening. The first night’s metrics: zero MAC errors, zero-to-three failed transmissions per half hour. The old network — across three coordinators and three years — never produced a single zero-failure interval in its life.

Final amendments to the blueprint

  1. Verify the board variant against Koenkk’s hardware table before every radio flash. Vendor updaters serve wrong builds too.
  2. Energy-scan before and after RF suspicion — it’s one raw ZNP frame, and it turns “I think it’s interference” into a number.
  3. Exclude your Zigbee channel’s Wi-Fi neighbors in the Wi-Fi controller, not just today’s channels — auto-channel will re-roll onto them eventually.
  4. Generate a real network key. The z2m default is public knowledge; a rebuild is your chance to retire it.
  5. When you rebuild, scope it. Two rooms done solid beats seventy devices done anxious. The deferred rooms join the proven network later.
  6. Write down what your instruments looked like when they worked. Half of debugging is knowing whether you can trust the needle.

The mesh is now nineteen devices on a clean channel with real keys, growing room by room on a foundation that was measured — not assumed — to be sound. The other fifty devices wait in an archived tarball, names and all, for their batches.

It took a week, three coordinators, two firmware vendors, one Wi-Fi AP, and four broken instruments to learn what the last line of this post already said: it’s finally infrastructure. This time we have the graphs to prove it.

Postscript III, three days later: the graphs came in

Every previous ending in this post was written on hope — “came back healthier,” “finally infrastructure,” “graphs to prove it” — and every one of them was proven premature within hours. So this last postscript waited for data instead of declaring victory.

After the scoped rebuild, we ran a 72-hour soak: a script on the host probes a known-good device every 30 minutes — an actual command round-trip, the one signal that can’t be faked by a quiet network — and logs link quality, transmit failures, and MAC errors alongside it.

The 72-hour scoreboard:

  • 144 ticks. Zero transmit failures. Not one round-trip probe came back dead across three days.
  • Max MAC errors: 2 per half hour (the alert threshold was 50). Most intervals: zero.
  • Link quality median ~213 — against 148 on the old network’s best-ever day, and 76 on the wrong-variant firmware.
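
Reducing a soak log to a scoreboard like this takes only a few lines. The Python sketch below assumes each tick was logged as a record with probe_ok, mac_errors and lqi fields; those field names and the record shape are hypothetical, not the format our script actually wrote.

```python
from statistics import median


def summarize(ticks: list) -> dict:
    """Reduce per-tick soak records to tick count, failures, worst MAC errors and median LQI."""
    return {
        "ticks": len(ticks),
        "tx_failures": sum(1 for tick in ticks if not tick["probe_ok"]),
        "max_mac_errors": max((tick["mac_errors"] for tick in ticks), default=0),
        "median_lqi": median(tick["lqi"] for tick in ticks) if ticks else None,
    }


def breaches(ticks: list, mac_error_threshold: int = 50) -> list:
    """Ticks that would have raised an alert: a dead probe or MAC errors at or over the threshold."""
    return [
        tick for tick in ticks
        if not tick["probe_ok"] or tick["mac_errors"] >= mac_error_threshold
    ]
```

The median is used rather than the mean so that one bad interval cannot flatter or wreck the link-quality figure.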

And it wasn’t a quiet three days. The new network survived, on its own data and without intervention, three real infrastructure events: a host reboot, a RAID-mirror drive replacement under the hypervisor, and a network blip that briefly cut the coordinator off-net. Each time it came back by itself — the one moment the coordinator vanished, z2m correctly crash-looped until the link returned, then reconnected on the next retry. That’s not luck; that’s the pinned-identity config and the generated keys doing exactly the job they were added for.

The thing that finally worked wasn’t a cleverer fix than the dozen before it. It was scoping down and measuring. Two rooms, rebuilt deliberately, then watched with an instrument we’d verified actually worked — after a week of instruments that didn’t. The remaining fifty devices are still in their archived tarball, names and all, waiting for their batches; they’ll join a network that has now earned the word we kept using prematurely.

Infrastructure. This time the graphs agree.

BUILT.