Monday Morning Quarterbacking the AWS outage last Wednesday

Mon 30 November 2020
misc

I've seen a lot of wharrgarbl in the last few days that boils down to "AWS must do better!" in relation to US-East-1 falling over last Wednesday.

Long story short, it was a systemic failure where a lot of services that depended on an internal service (Kinesis) died when Kinesis failed.

The inevitable conclusion by sundry bloggers is that Amazon is not delivering what people are expecting, which is not quite the same thing as not delivering what they are contracted to deliver.

To paraphrase Vijay Gill, though, you own your redundancy. There's no such thing as a one-size-fits-all solution or level of fault tolerance that is appropriate for every use case.

It was a business decision (as often as not made by non-business people, which is as bad as engineering decisions made by non-engineering people) to ignore the possibility of region-wide failures: to not analyze the probability of one happening based on the SLAs (which, shockingly, Amazon consistently beats, only to have people start expecting that of them), and to not decide whether guarding against it is worth the cost (operational complexity, FTEs, etc.). Multi-cloud and multi-region require effort. Effort requires resources. Resources require some combination of time and money. Is the juice worth the squeeze?
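
For what it's worth, the back-of-the-envelope version of that analysis fits in a few lines of Python. Everything in this sketch is a made-up assumption for illustration: the availability figures, the hourly loss, and the yearly cost of running multi-region are placeholders, not numbers from any actual AWS SLA or bill. The function names are hypothetical too.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_annual_loss(availability: float, loss_per_hour: float) -> float:
    """Expected yearly cost of being down, assuming a flat loss per hour of outage."""
    return expected_downtime_hours(availability) * loss_per_hour


def redundancy_is_worth_it(
    single_region_availability: float,
    multi_region_availability: float,
    loss_per_hour: float,
    extra_cost_per_year: float,
) -> bool:
    """True if the outage losses avoided exceed the extra yearly cost of redundancy."""
    avoided = expected_annual_loss(
        single_region_availability, loss_per_hour
    ) - expected_annual_loss(multi_region_availability, loss_per_hour)
    return avoided > extra_cost_per_year


if __name__ == "__main__":
    # Hypothetical small shop: loses 200 per hour of downtime,
    # multi-region would cost an extra 50,000 per year.
    print(redundancy_is_worth_it(0.999, 0.9999, 200, 50_000))

    # Hypothetical large business: loses 100,000 per hour of downtime,
    # same extra cost for multi-region.
    print(redundancy_is_worth_it(0.999, 0.9999, 100_000, 50_000))
```

A real analysis would also have to account for reputational damage, contractual penalties, and the fact that outages are not spread evenly across the year, but even this crude version makes the trade-off explicit instead of implicit.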

In a lot of cases it’s not. And that’s fine. Joe’s Muffler Shop and Crab Shack having its e-commerce site offline for a period of time during which half the internet seemed to be down doesn’t cause much in the way of lost business opportunity or reputational damage. But you don’t get to whine about not getting a Mercedes when you paid for a Toyota and pocketed the savings in both capex and opex.