There is a particular kind of satisfaction in encountering a system that simply works. Not the flashy, buzzword-laden architecture diagram pinned to a conference talk slide, but the quiet kind -- the service that has been running in production for three years, processing millions of requests, and has needed exactly two deploys in the last six months, both of them minor dependency bumps.
I have been thinking about what separates these systems from the rest. After two decades of building software -- some of it terrible, some of it decent, a vanishingly small amount of it genuinely good -- I have arrived at an observation that will surprise no one and yet somehow still fails to penetrate most engineering cultures: the most reliable systems are the most boring ones.
This is not a new insight. It echoes Dan McKinley's "Choose Boring Technology" from 2015, Dijkstra's axioms about simplicity, and the quiet wisdom of every senior engineer who has ever said "have you considered just using Postgres?" in a design review. But I want to explore something adjacent: not just the choice of boring technology, but the architecture of boredom itself.
The Shape of Reliability
Reliable systems share a recognizable shape. They tend to be narrow in scope, deep in handling, and wide in their tolerance for the unexpected. If you squint, they look like rectangles: defined edges, predictable dimensions, no surprises.
Unreliable systems, by contrast, look like fractals. Every feature introduces a new edge case. Every integration spawns three more configuration options. The surface area expands faster than the team's ability to reason about it.
"The primary cause of software failure is complexity. The primary source of complexity is the attempt to make software do many things."
— adapted from John Gall, Systemantics

Consider a concrete example. I maintain a service that processes webhook events from a payment provider. Its job is simple: receive the event, validate the signature, transform the payload into our internal format, and write it to the database. That is all it does.
The original version, written by a well-meaning team three years before I inherited it, did considerably more. It also sent email notifications, updated a cache layer, triggered downstream analytics events, and published to a message queue for "future consumers" that never materialized. It failed approximately once a week.
What I Removed
The refactor was almost entirely subtractive:
- Split email notifications out into a separate service triggered by database changes
- Eliminated the cache layer entirely (the database was fast enough)
- Moved analytics to an async consumer reading from the database's change stream (a sketch of that consumer follows this list)
- Deleted the message queue integration and its 400 lines of retry logic
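The analytics change is worth one concrete illustration. Below is a minimal sketch of that consumer, with the caveat that the events table, the analytics_cursor table, and sendAnalytics are hypothetical names, and that it polls the table rather than tailing a true change stream; the real service was more careful, but the shape is the same: read past a stored cursor, forward the rows, advance the cursor.

```typescript
// A minimal sketch of the analytics consumer, assuming a hypothetical
// events table with an auto-incrementing id, an analytics_cursor table
// seeded with a single row (last_id = 0), and a stand-in sendAnalytics.
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the environment

async function sendAnalytics(payload: unknown): Promise<void> {
  // Stand-in for the real analytics client.
  console.log('analytics event', payload);
}

async function pollOnce(): Promise<void> {
  const client = await pool.connect();
  try {
    // Where did we stop last time?
    const cursor = await client.query('SELECT last_id FROM analytics_cursor LIMIT 1');
    const lastId: number = cursor.rows[0]?.last_id ?? 0;

    // Fetch anything newer, in insertion order.
    const { rows } = await client.query(
      'SELECT id, payload FROM events WHERE id > $1 ORDER BY id LIMIT 100',
      [lastId]
    );

    for (const row of rows) {
      await sendAnalytics(row.payload);
      // Advance the cursor only after the event is handled, so a crash
      // replays at-least-once rather than dropping events.
      await client.query('UPDATE analytics_cursor SET last_id = $1', [row.id]);
    }
  } finally {
    client.release();
  }
}

// Run forever; a restart simply resumes from the stored cursor.
setInterval(() => { pollOnce().catch(console.error); }, 5_000);
```

Nothing here depends on the webhook service being up, which is exactly why it could be deleted from that service in the first place.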
The result was a service that went from 2,200 lines to 380 lines. It has not failed once in eighteen months.
```typescript
// The entire webhook handler. That's it.
export async function handleWebhook(
  req: WebhookRequest,
  db: Database
): Promise<WebhookResult> {
  // 1. Validate signature
  const isValid = verifySignature(
    req.body,
    req.headers['x-webhook-signature'],
    config.webhookSecret
  );
  if (!isValid) {
    return { status: 401, body: 'Invalid signature' };
  }

  // 2. Transform
  const event = transformPayload(req.body);

  // 3. Persist (idempotent upsert)
  await db.upsertEvent(event);

  return { status: 200, body: 'OK' };
}
```
A note on idempotency. The upsertEvent call is keyed on the webhook provider's event ID. If we receive the same event twice -- which happens more often than you would expect -- it simply overwrites the existing record. No special deduplication logic. No distributed locks. Just a unique constraint on the event ID column.
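To make that concrete, here is a minimal sketch of what an upsertEvent like this can look like against Postgres. The webhook_events table and its columns are assumptions for illustration, not the real schema; the only load-bearing detail is the unique constraint on provider_event_id.

```typescript
// A minimal sketch of an idempotent upsert, assuming a hypothetical
// webhook_events table with a unique constraint on provider_event_id.
import { Pool } from 'pg';

const pool = new Pool();

export async function upsertEvent(event: {
  providerEventId: string;
  type: string;
  payload: unknown;
}): Promise<void> {
  await pool.query(
    `INSERT INTO webhook_events (provider_event_id, type, payload)
     VALUES ($1, $2, $3)
     ON CONFLICT (provider_event_id)
     DO UPDATE SET type = EXCLUDED.type, payload = EXCLUDED.payload`,
    [event.providerEventId, event.type, JSON.stringify(event.payload)]
  );
}
```

Receiving the same event twice rewrites the same row. The retry safety lives in the database, not in the service.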
Constraints as Architecture
The interesting thing about that refactor was not what I built. It was what I chose not to build. Every feature I removed was a constraint I imposed: this service will not send emails, will not manage caches, will not publish events. Each constraint simplified the failure modes. Each simplified failure mode made the service more predictable. Predictability is the foundation of reliability.
This is counterintuitive for many engineers, especially early in their careers. We are trained to build. We are rewarded for building. The pull request that adds a feature gets celebrated; the pull request that removes one gets questioned. "Why are we losing functionality?"
But functionality is not free. Every feature has a maintenance cost, a cognitive cost, and a failure cost. The calculus is rarely explicit, but it is always there.
"Every line of code you write is a liability. Every line of code you delete is an asset."
The Twelve-Factor Illusion
I want to be careful here, because I am not arguing against good engineering practices. The twelve-factor app methodology, for instance, contains genuine wisdom: store config in the environment, treat logs as event streams, maximize robustness with fast startup and graceful shutdown.
But I have watched teams adopt twelve-factor as a checklist -- ticking boxes without understanding the underlying philosophy. They containerize everything because "that's what you do" without asking whether their single Go binary actually benefits from Docker. They implement health checks that check nothing meaningful. They set up log aggregation pipelines that nobody reads.
```yaml
# I have seen this exact health check in production.
# It tells you nothing.
healthcheck:
  test: ["CMD", "echo", "healthy"]
  interval: 30s
  timeout: 10s
  retries: 3

# Compare with something that actually verifies the service:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health/deep"]
  interval: 30s
  timeout: 10s
  retries: 3
```
The difference between the two is the difference between architecture and cargo-culting. The first says "I have a health check" to satisfy a linter. The second says "I have verified that my service can reach its database, resolve DNS, and respond to HTTP" because that information actually affects operational decisions.
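For the curious, here is a minimal sketch of what can sit behind an endpoint like that, assuming Express and the same Postgres pool as the earlier sketches; the /health/deep path, the DNS target, and the specific checks are illustrative, not prescriptive.

```typescript
// A minimal sketch of a "deep" health endpoint. The path, the DNS
// lookup target, and the checks themselves are hypothetical examples
// of verifying the dependencies the service actually needs.
import express from 'express';
import { promises as dns } from 'dns';
import { Pool } from 'pg';

const app = express();
const pool = new Pool();

app.get('/health/deep', async (_req, res) => {
  try {
    // Can we reach the database?
    await pool.query('SELECT 1');
    // Can we resolve an external name we depend on?
    await dns.lookup('api.payment-provider.example');
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    // Any failure means the container should be reported unhealthy.
    res.status(503).json({ status: 'degraded', error: String(err) });
  }
});

app.listen(3000);
```

The specifics matter less than the property: a 503 from this endpoint corresponds to something an operator would actually act on.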
What I Actually Check For
After years of production incidents -- some of them mine, many of them inherited -- I have developed a short list of architectural properties I look for in systems that need to be reliable. None of them are glamorous:
- Single responsibility at the service level. Not the class level, not the function level. The service. One service, one job. If you cannot describe what a service does in one sentence without using "and," it is probably two services.
- Idempotent operations by default. Every write operation should be safe to retry. This is easy with upserts and hard with side effects, which is why minimizing side effects matters.
- Shallow dependency trees. If Service A calls Service B which calls Service C, you have a fragility chain. Each link compounds your failure probability: three services at 99.9% availability each leave the caller with roughly 99.7%, before retries and timeouts enter the picture. Keep it flat.
- Graceful degradation as a first-class feature. Not "what happens when the cache is down" but "the system is designed to work without the cache from day one" (see the sketch after this list).
- Observable by default. Structured logs, meaningful metrics, and traces that actually help you find problems. Not dashboards. Dashboards are for showing executives. Logs are for debugging at 3am.
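And since "designed to work without the cache from day one" is easier to show than to describe, here is a minimal sketch of the read path that property implies, with a hypothetical in-memory cache standing in for whatever the real one would be.

```typescript
// A minimal sketch of cache-as-optimization: every read path works with
// the cache gone, because the database is the source of truth. The
// cache, table, and key scheme here are hypothetical.
import { Pool } from 'pg';

const pool = new Pool();
const cache = new Map<string, { value: unknown; expiresAt: number }>();

export async function getEvent(eventId: string): Promise<unknown> {
  // Best-effort cache read; a miss (or a wiped cache) is not an error.
  const hit = cache.get(eventId);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value;
  }

  // The database path is the real path, exercised on every cold read.
  const { rows } = await pool.query(
    'SELECT payload FROM webhook_events WHERE provider_event_id = $1',
    [eventId]
  );
  const value = rows[0]?.payload ?? null;

  // Populate the cache opportunistically; failing to cache costs latency,
  // never correctness.
  cache.set(eventId, { value, expiresAt: Date.now() + 60_000 });
  return value;
}
```

Delete the cache and the function still returns correct answers; only the latency changes. That is the property the graceful-degradation bullet above is asking for.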
The Human Element
There is one more thing I have learned that has nothing to do with technology: reliable systems are built by teams that are allowed to be bored.
I mean this literally. Teams under constant pressure to ship features, demonstrate velocity, and justify their headcount will inevitably produce complex systems. Complexity is the natural byproduct of haste. Simplicity requires time -- time to think, time to refactor, time to say "actually, we should not build that."
The best engineering manager I ever worked for had a policy: every team member got one day per sprint -- she called it "maintenance day" -- to do nothing but improve existing systems. No features, no tickets, no standups. Just look at the codebase and make it better. Delete dead code. Improve error messages. Write the test you skipped last month.
Her team's services had the lowest incident rate in the company. This was not a coincidence.
"The purpose of software engineering is not to produce code. It is to produce systems that work, and continue working, long after the original authors have moved on."
I think about that framing often. "Systems that work, and continue working." It reframes every technical decision from "what is the most elegant solution" to "what is the most survivable solution." Elegance is nice. Survival is mandatory.
So the next time you are in a design review and someone proposes adding Kafka, or splitting the monolith, or introducing a new caching layer, ask the boring question: "What happens if we just... don't?" You might be surprised how often the answer is: everything works fine.
And that, I have come to believe, is the highest compliment you can pay a system. It works fine. It has always worked fine. Nobody talks about it. Nobody has to.
That is the quiet architecture of reliability. And it is beautiful.