Wait, what? A robust system should be designed not to fail, right?
That’s why engineers often wrap their code in retries, caching, exceptions, and other mechanisms to handle every problem. But let’s pause and think about that for a moment.
Imagine this: your code tries to connect to a database, fails twice, and succeeds on the third attempt. Does it really matter whether it retries once or three times? Should there be a retry at all? A database should be available and scalable enough to handle the requests it receives. Maybe those retries are just kicking the can down the road, hiding a bigger scalability problem. And when that problem hits, it might come with a major outage or, worse, data loss. By not failing fast and loudly, you’ve essentially set the stage for a bigger disaster in the future.
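If you decide a retry is still worth having, one way to keep it from hiding the problem is to bound it and make every failed attempt visible. The Python sketch below is only an illustration under assumptions: `call_with_loud_retries`, `RetriesExhausted`, and the `operation` callable are hypothetical names, and a real database driver would raise its own exception types rather than the built-in `ConnectionError`.

```python
import logging
import time

logger = logging.getLogger("db")


class RetriesExhausted(Exception):
    """Raised when every attempt failed (hypothetical name for this sketch)."""


def call_with_loud_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Run operation() with a bounded number of retries, logging every failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except ConnectionError as exc:
            # Every failed attempt is logged, so monitoring can alert on the rate.
            logger.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                # Fail loudly instead of returning a fallback value.
                raise RetriesExhausted(
                    f"gave up after {max_attempts} attempts"
                ) from exc
            time.sleep(delay_seconds)
        else:
            if attempt > 1:
                # A success that needed retries is still a signal worth recording.
                logger.warning("succeeded only after %d attempts", attempt)
            return result
```

With this shape, a connection that succeeds on the third try still leaves warnings in the logs, and a connection that never succeeds surfaces as an exception rather than a silent fallback.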
Here’s another common scenario: your backend service catches an exception, and the handler returns a default value to the client. No big deal, right? Except, over time, you notice users are churning because they aren’t getting the experience they expected. When you finally start digging into the root cause, it takes far too long because everything seems to be “working.” To make matters worse, the exception was never logged. All this because you didn’t want to send a 500 error to a few clients, and now a much larger audience is affected.
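As a rough illustration of the alternative to the scenario above, a handler can log the exception with its traceback and return an explicit 500 instead of a default value. This is a minimal, framework-free Python sketch; `handle_request`, `load_recommendations`, and the response shape are assumptions made for the example, not part of any particular web framework.

```python
import logging

logger = logging.getLogger("recommendations")


def handle_request(load_recommendations, user_id):
    """Return (status_code, body) for a hypothetical recommendations endpoint."""
    try:
        items = load_recommendations(user_id)
    except Exception:
        # logger.exception logs at ERROR level and includes the traceback,
        # so the failure is visible to whoever is watching the logs.
        logger.exception("failed to load recommendations for user %s", user_id)
        # Return an explicit error rather than a default that looks like success.
        return 500, {"error": "internal error"}
    return 200, {"items": items}
```

The client sees an honest failure, and the error log entry gives monitoring something concrete to alert on.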
Failing fast and loudly makes your system scream for attention when things go wrong. With solid monitoring, you’ll be able to trigger an incident quickly, minimizing Mean Time to Identify (MTTI), and focus on resolving the root cause fast.
The mindset should be:
- Accept that the products and services we build will fail, no matter how hard we try to avoid it.
- Failing quietly is far more costly than failing loudly, because when the symptoms are vague, finding the root cause is harder.
- Be thoughtful about client-facing failures, but it’s better to frustrate a small percentage of users than face a larger, system-wide outage. Whether you like it or not, your clients help debug your system.
The key is fast identification and recovery. Have meaningful logging and alerts in place so that when things go wrong, you know about it immediately and can fix it fast.