How I Updated Security Rules Across a Switch Fleet Without a Single Outage

The change that keeps network engineers up at night

Every network engineer knows the specific flavor of dread that comes with modifying an access-control list on a remote switch. You’re connected to that device over the very management path you’re about to change. Get the order of operations wrong, and you lock yourself — and everyone else — out of a switch sitting in a building hundreds or thousands of miles away. Now multiply that across an entire fleet, where some devices already have the new rule, some have the old one, and some have neither.

That was the situation on a client project this past week. A block of trusted management addresses had to move to a new range on a large fleet of switches spread over many sites. The old way to handle this is exactly what you’d imagine: an engineer logs into each device by hand, checks the current state, types the change, verifies it, saves it, and moves to the next one. It’s slow, it’s mind-numbing, and the boredom is precisely what produces the mistake that takes a site offline.

If there’s one lesson I keep coming back to, it’s that the dangerous changes are the ones worth automating first — not because they’re frequent, but because the cost of a human slip is so high.

Building the safety in first

So I built a tool to do it, and I built the safety guarantees before I built anything else.

The first principle was simple: never create a window where a device can lock you out. The tool always adds the new permission before it removes the old one, and it only removes the old rule after it has confirmed the new one is in place and working. There is never a moment where a switch has no valid path back in.
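To make that ordering concrete, here is a minimal Python sketch of the add-verify-remove sequence. It is an illustration, not the tool itself: `FakeSwitch`, `migrate`, and the rule methods are hypothetical names, and the in-memory set stands in for a real device session. A real check would also confirm that a login from the new range actually works, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSwitch:
    """In-memory stand-in for a device session (hypothetical, for illustration)."""
    allowed: set = field(default_factory=set)

    def add_rule(self, cidr: str) -> None:
        self.allowed.add(cidr)

    def remove_rule(self, cidr: str) -> None:
        self.allowed.discard(cidr)

    def has_rule(self, cidr: str) -> bool:
        return cidr in self.allowed


def migrate(switch, old: str, new: str) -> str:
    """Add the new range, read it back, and only then remove the old one."""
    if not switch.has_rule(new):
        switch.add_rule(new)
    # Verify by reading state back instead of trusting that the add succeeded.
    if not switch.has_rule(new):
        return "aborted: new rule not confirmed, old rule left in place"
    if switch.has_rule(old):
        switch.remove_rule(old)
    return "migrated"
```

The point of the early return is that a failed or unconfirmed add leaves the old rule exactly where it was, so the existing path back into the device is never given up on a guess.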

The second principle was that nothing destructive happens until a human has seen exactly what would happen. The tool defaults to a dry-run. It sweeps each site, discovers which switches are actually reachable, reads the current state of each one, and produces a clear report: which devices are already compliant, which need the change, and which are in some unexpected state that deserves a human’s eyes. Green means leave it alone. Amber means it needs work. Nothing changes on the network until someone looks at that report and flips the switch to live.
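The dry-run report boils down to a read-only classification of each device. The sketch below shows one way that could look in Python. The function names and the bucket labels are my own for illustration, and the mapping of states to buckets is an assumption: new rule only is compliant, old rule still present needs the change, and anything else is flagged for a human.

```python
def classify(rules: set, old: str, new: str) -> str:
    """Classify one device from the rules read off it. Makes no changes."""
    if new in rules and old not in rules:
        return "compliant"      # green: leave it alone
    if old in rules:
        return "needs_change"   # amber: old rule still present
    return "unexpected"         # neither rule found: needs a human's eyes


def dry_run_report(fleet: dict, old: str, new: str) -> dict:
    """Group device names by status. `fleet` maps name -> set of rules read."""
    report = {"compliant": [], "needs_change": [], "unexpected": []}
    for name in sorted(fleet):
        report[classify(fleet[name], old, new)].append(name)
    return report


if __name__ == "__main__":
    # Made-up example data, not figures from the real rollout.
    fleet = {
        "sw1": {"10.1.0.0/24"},
        "sw2": {"10.0.0.0/24"},
        "sw3": set(),
    }
    print(dry_run_report(fleet, old="10.0.0.0/24", new="10.1.0.0/24"))
```

Because nothing in this path writes to a device, it can be run as often as anyone likes before the live run is approved.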

The third principle was that the tool should be safe to run a hundred times. It keeps a running ledger of what it has already done, so a device that’s already been brought into compliance simply gets skipped on the next pass. You can stop halfway through, come back the next day, and pick up exactly where you left off. And because that ledger can live on a shared drive, a whole team can run the same tool from different laptops and see each other’s progress instead of stepping on one another.
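A ledger like that can be very simple. Here is a rough Python sketch, assuming a JSON file on disk. The file format, the `"done"` marker, and the `run_pass` and `remediate` names are hypothetical, and a ledger shared by several people at once would also need file locking or atomic writes, which this sketch leaves out.

```python
import json
from pathlib import Path


def load_ledger(path: Path) -> dict:
    """Return the saved ledger, or an empty one on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def run_pass(devices, ledger_path: Path, remediate) -> list:
    """Remediate every device not yet marked done; return the ones touched."""
    ledger = load_ledger(ledger_path)
    touched = []
    for name in devices:
        if ledger.get(name) == "done":
            continue  # already compliant on an earlier pass: skip it
        remediate(name)
        ledger[name] = "done"
        # Save after every device so an interrupted run resumes cleanly.
        ledger_path.write_text(json.dumps(ledger, indent=2))
        touched.append(name)
    return touched
```

Writing the ledger after each device, rather than once at the end, is what makes it safe to stop halfway: the next pass skips everything already recorded and starts with the first device that is not.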

How it actually went

We didn’t trust it blindly. The rollout followed the same pattern I’d use for any high-stakes change: dry-run first, then a single canary site, then production. At the canary site the tool found twenty reachable switches, confirmed thirteen were already compliant, and cleanly remediated the seven that weren’t, leaving the other thirteen untouched. The production run went the same way. No lockouts. No outages. No 2 a.m. phone calls.

What used to be a multi-day, all-hands, hold-your-breath manual slog became a job you kick off, watch a report from, and trust. The engineers got their time back, and — more importantly — the work got safer, because the part most likely to cause an outage is now handled identically every single time.

This is the work I love

This is exactly the kind of problem I think automation should be pointed at: not the flashy stuff, but the repetitive, high-consequence work where consistency matters more than cleverness. Take a task a skilled human does well but slowly, encode the safety rules they carry in their head, and let the machine handle the tedium without ever getting bored or careless.

If you’ve got a process like this — something repetitive, risky, and important enough that you’ve been doing it by hand because you don’t trust a script with it — that’s exactly the kind of thing I build. Take a look at what an AI assistant tailored to your operation can do, or get in touch and tell me about the change that keeps you up at night.
