Writing on developer experience, systems thinking, and the mistakes behind both - covering AI workflows, continuous improvement, and the mental models that drive better decisions.
"A normal accident is where everyone tries very hard to play safe, but unexpected interaction of two or more failures causes a cascade of failures." - Charles Perrow, Normal Accidents (1984, revised 2012)
Instead of just asking "Will this work?", we should ask "When this breaks, what does it take down with it?"
The Trigger: Someone comes home late from a night out, hungry, and starts frying up a midnight snack. The kitchen is cluttered with hoarder-vibe stacks of newspapers, and the smoke detector has a recently deceased battery. The cascade isn't just the fire; it's everything positioned to feed it.
The Stress Shift: If the cook falls asleep mid-fry, the safety net shifts to the smoke detector to alert the snoozer. But no battery means no wakey, so the responsibility shifts again to the housemate's nose, which only starts to smell trouble once a small fire has already taken hold.
The Ripple: Walking into a flaming kitchen is stressful at the best of times, but if that person panics and throws water on a grease fire, the problem becomes a ceiling-high fireball. The human response just became the worst part of the cascade.
When I was 5, a family friend gave me a Commodore 64 program that told you what day of the week you were born on. It only asked for two digits of the year. Even at that age, I remember thinking: this won't work for people born in 2000.
To save money on expensive memory, early programmers used two digits for years (like "99"), and this tiny shortcut got baked into almost every system on Earth. When the year 2000 hit, it caused a domino effect: computers calculated "00 minus 99" as negative 99, so invoices showed massive debts, bank interest calculations vanished, and safety systems in power plants glitched.
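A minimal sketch of the arithmetic behind that shortcut, with illustrative function names rather than anything from a real legacy system:

```python
def years_elapsed(start_yy: int, end_yy: int) -> int:
    """Naive two-digit-year subtraction, the way many legacy systems did it."""
    return end_yy - start_yy

# An account opened in 1999 and billed in 2000:
print(years_elapsed(99, 0))    # -99 "years" of interest, fees, or warranty

def years_elapsed_fixed(start_yyyy: int, end_yyyy: int) -> int:
    """The common fix: store and subtract four-digit years."""
    return end_yyyy - start_yyyy

print(years_elapsed_fixed(1999, 2000))  # 1
```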
That's the pattern. Something looks small at first, but when we take a step back we start to see how many other systems depend on it.
In 1978, NASA scientist Donald Kessler warned that as more satellites go up, collisions become inevitable. Each collision creates debris. That debris hits other satellites, creating more debris. A chain reaction that could make entire orbital bands unusable.
It's already happening slowly. In 2009, a defunct Russian satellite hit an operational Iridium communications satellite at 42,000 km/h, producing thousands of fragments. Those fragments are still up there, crossing other orbits. In August 2024, a Chinese rocket broke apart in low Earth orbit, adding another 700+ pieces to the problem.
As of early 2026, there are over 11,800 satellites spinning around Earth, and more than half of them belong to Starlink. Across the fleet, a satellite has to swerve to dodge something roughly every two minutes, which only works if every piece of software and every operator performs perfectly, every time.
If a solar storm or a software bug freezes just one satellite so it can't move, the domino effect starts immediately. Even with efforts to clean up space junk and force old satellites to drop out of orbit, experts warn that the crash-and-smash chain reaction has likely already begun; it's just happening in slow motion.
This is cascade failure thinking at planetary scale. Tightly coupled (orbits intersect), interactively complex (thousands of objects, unpredictable fragments), and the safety measures themselves add more objects to the system.
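A toy sketch of that feedback loop, with made-up constants rather than real orbital mechanics: collision probability grows with the square of the object count, and every collision adds fragments that raise next year's count.

```python
def simulate_cascade(years: int = 50,
                     satellites: int = 11_800,
                     debris: int = 40_000,
                     k: float = 2e-9,        # made-up collision constant
                     fragments: int = 300) -> None:
    for year in range(years):
        objects = satellites + debris
        collisions = int(k * objects ** 2)   # quadratic feedback: more objects, more hits
        satellites = max(satellites - collisions, 0)
        debris += collisions * fragments
        if year % 10 == 0:
            print(f"year {year:2d}: {objects:,} objects, {collisions} collisions")

simulate_cascade()
```

The point isn't the numbers; it's that the growth looks slow and unremarkable right up until it isn't.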
Trace the whole chain. When something breaks, don't stop at "Login page is broken". Find what else happened around the time the problem started; chances are it's a link in the chain. For example: someone upgraded a server in AWS, the upgrade triggered an alert that looked like an IP address hammering the server, that IP got blocked, and because it was our internal IP the block caused the login error. Instead of staying stuck on the login issue, we traced back through the potentially related events and found the root cause.
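A rough sketch of that habit in code: gather every change, alert, and action from a window before the first symptom and read them in order. The event list here is hypothetical; in practice it comes from your deploy log, cloud audit trail, and alerting history.

```python
from datetime import datetime, timedelta

events = [
    ("2024-05-01T01:05", "deploy",   "web server upgraded"),
    ("2024-05-01T01:12", "alert",    "high request rate from 10.0.3.7"),
    ("2024-05-01T01:14", "firewall", "blocked 10.0.3.7"),
    ("2024-05-01T01:16", "incident", "login page returning 502"),
]

incident_start = datetime.fromisoformat("2024-05-01T01:16")
window = timedelta(minutes=30)

# Everything that happened in the half hour before the symptom, in order.
for ts, source, message in events:
    when = datetime.fromisoformat(ts)
    if incident_start - window <= when <= incident_start:
        print(f"{ts}  [{source:8}] {message}")
```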
Hold a Pre-Mortem. Before you launch, pretend the project has already failed and is a total disaster. Now, work backward. This forces you to spot the dominos before they start falling, rather than just cleaning up the mess afterward.
Practice Small Chaos. Try deleting a database in a test environment and then restoring from backup. Instead of just timing how long recovery takes, also use the application like a user would while it's happening. Do they see a raw "database connection failed" error or a friendly downtime page? Go through this disaster recovery process and I'm sure you'll find gaps in it, along with ways to streamline and automate.
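A minimal sketch of the user-facing half of that drill, assuming a hypothetical staging URL and leaving the break-and-restore commands to whatever your own stack uses:

```python
import time
import requests

APP_URL = "https://staging.example.com/login"  # hypothetical test environment

def check_user_experience() -> None:
    """While the database is down, see what a real user would see."""
    resp = requests.get(APP_URL, timeout=5)
    if "maintenance" in resp.text.lower():
        print("friendly downtime page shown")
    else:
        print(f"raw error leaked to users: HTTP {resp.status_code}")

# 1. Break it: drop the test database with your own tooling.
start = time.monotonic()
# 2. While it's broken, look at the app the way a user would.
check_user_experience()
# 3. Restore from backup with your own tooling, then measure how long it took.
print(f"drill time so far: {time.monotonic() - start:.0f}s")
```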
Signal-Based Adaptive Orchestration (SBAO). When you chain AI agents together, one hallucination becomes fact for the next agent, and next thing you know everything is confident nonsense. Cascade failure thinking matters here because it pushes you to build the signals that tell you when to bring in more AI agents, or more humans, to check the output before bad data cascades into a real problem.
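A minimal sketch of that signal idea, assuming a hypothetical confidence score per step; this isn't a real framework, just the gate that stops doubt flowing downstream:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    text: str
    confidence: float  # 0.0 - 1.0, however your stack estimates it

def human_review(step_index: int, draft: str) -> str:
    print(f"step {step_index}: low confidence, escalating to a human")
    return input("corrected output: ")

def run_chain(steps: list[Callable[[str], StepResult]],
              task: str,
              threshold: float = 0.7) -> str:
    current = task
    for i, step in enumerate(steps):
        result = step(current)
        if result.confidence < threshold:
            # Escalate before the next agent treats this output as fact.
            current = human_review(i, result.text)
        else:
            current = result.text
    return current
```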
Second-order thinking. Cascade thinking asks "what breaks next?" Second-order thinking asks "what happens when we fix it?" If Server A fails and you divert all traffic to Server B, will the increased load cause a hardware failure there too? The fix becomes the next domino.
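A back-of-the-envelope version of that question, with illustrative numbers: before diverting, check whether B has the headroom to absorb A's load at all.

```python
def safe_to_failover(b_capacity_rps: float,
                     b_current_rps: float,
                     a_current_rps: float,
                     headroom: float = 0.8) -> bool:
    """Only divert if B stays under 80% of capacity after absorbing A's traffic."""
    return (b_current_rps + a_current_rps) <= b_capacity_rps * headroom

# B handles 600 req/s of a 1,000 req/s capacity; A would add 500 more.
print(safe_to_failover(1000, 600, 500))  # False: the fix would push B to 110%
```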
Antifragility. Cascade thinking is about how systems break. Antifragility is about designing systems that get stronger from the small failures that would otherwise cascade. They're two sides of the same coin.
Swiss cheese model (James Reason). Each layer of defence has holes. A cascade happens when the holes line up. In software development, the developer misunderstands the requirement, the tester doesn't think of edge case scenarios, the code reviewer doesn't spot the flaw in logic. The combination causes a bug to slip through into production.
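A rough way to put numbers on the holes lining up, with made-up miss rates: if the layers really are independent, escapes multiply down to almost nothing; if they share the same blind spot, they don't.

```python
# Illustrative miss rates per layer of defence.
layers = {
    "requirements review": 0.20,
    "testing":             0.10,
    "code review":         0.15,
}

escape = 1.0
for miss_rate in layers.values():
    escape *= miss_rate

print(f"independent layers: {escape:.1%} of bugs slip through")  # 0.3%
# Correlated holes (everyone missing the same edge case) make the
# real number far worse than this multiplication suggests.
```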