In case you didn’t hear, Amazon’s S3 service took a massive header yesterday.

In the aftermath of various email threads, Slack chatter, and a few snarky comments in text messages and in person, I sat down last night with an adult beverage to contemplate where things went wrong. Pretty far-reaching consequences for a single service going down, right?

After some noodling, I came to the conclusion that the problem isn’t that S3 is not up to scratch. Quite the opposite actually. I contend that S3 is too good.

Let me explain. Amazon offers a service level agreement (SLA) on their S3 product. It pays out at a 10% discount if S3 is up less than 99.9% of the time, and at a 25% discount if it’s up less than 99% of the time, on a monthly basis.
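That tier structure is simple enough to express as a lookup; a minimal sketch (the function name is mine, and this reflects the tiers as described above, not the full legal text of the SLA):

```python
def s3_sla_credit(monthly_uptime_pct: float) -> int:
    """Service-credit percentage owed for a given monthly uptime,
    per the two tiers described above."""
    if monthly_uptime_pct < 99.0:
        return 25   # worse than two 9s: 25% credit
    if monthly_uptime_pct < 99.9:
        return 10   # worse than three 9s: 10% credit
    return 0        # SLA met: no credit

print(s3_sla_credit(99.95), s3_sla_credit(99.5), s3_sla_credit(98.0))
```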

What’s 99.9% uptime? It means 43 minutes and 12 seconds of unplanned outage per month. S3 can be reasonably expected to be engineered to be slightly more reliable than that, since it’s never in anyone’s best interests come review time to have been the reason your boss had to pay out more than a nominal amount of SLA credits. But no service should be engineered to be grossly more reliable than it needs to be to meet commitments - that’s just wasting money and results in a different kind of negative review.

There’s plenty of evidence to support the notion that Amazon is doing just-enough-to-get-by on their datacenters, from anecdotal accounts of them intentionally running their datacenters at 40°C (Google published a paper that said they could get away with 30°C, and if 30 saves you money then 40 ought to save you more, right?) to setting the roof on fire doing careless hot work on a datacenter that already had a roof membrane to other forms of just-in-time git-r-done infrastructure building. From a big-picture and business perspective this absolutely seems to be the Right Thing to do, and on the surface at least doesn’t seem to bust their SLA.

I know what you’re thinking. S3 doesn’t go down nearly often enough or long enough to add up to 43 minutes a month; there must be some other factor at play that causes things to be like this.

There is. It’s called “luck”. As they say on the investment fund advertisements, “past performance does not guarantee future results”.

When customer expectations are out of whack with design goals, a fundamentally untenable situation is going to evolve. Business decisions about how much effort to put into resiliency in one’s software may be out of sync with the underlying reality.

In short, we shouldn’t be asking ourselves why S3 went down yesterday, we should be asking ourselves why it doesn’t go down more often.

I could suggest that Amazon deliberately detune their reliability to more closely match their design parameters, but that would only cause customer flight to services that don’t.

Product idea: a network appliance that uses dummynet, NIST Net, tcng (does the latter even work with current kernels?), or something similar to create a *cough* curated S3 experience that matches Amazon’s contractual commitments. One could put this box between developers and/or their OT&E environment and the Internet in order to provide conditioned availability and guard against complacency.
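The conditioning could be approximated crudely in software alone; a minimal sketch (the name and the coin-flip failure model are mine, not any real appliance’s) that fails calls just often enough to match a target availability:

```python
import random

def conditioned_call(fn, availability: float = 0.999, rng=random):
    """Invoke fn, but fail with probability (1 - availability),
    simulating a service exactly as reliable as its SLA promises."""
    if rng.random() >= availability:
        raise ConnectionError("simulated S3 outage")
    return fn()
```

Wrap your S3 client calls in this during testing and your code gets to experience three-nines reliability on demand, instead of whatever Amazon happens to deliver.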

Or one could just decide that paying for a 3 9s service and getting something actually closer to 4 9s most years is A-OK, and that on balance the right response to an S3 outage is to just knock off for the day and hit the pub. After all, human intervention won’t be required when S3 returns since your developers wrote resilient code that can deal with the connection to S3 going away and coming back without getting vapor locked, right?
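“Resilient” here might look something like a retry wrapper with exponential backoff; a minimal sketch (the parameters and the callable-wrapping shape are illustrative, not any particular SDK’s API):

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 0.5,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff.
    Re-raises the last error once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```

Code written this way rides out a transient outage and picks up where it left off when the service returns, no pager required.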

In all seriousness, how much reliability your application needs is a business decision, not a technical one, and adding extra 9s is super costly. There’s no guarantee you can do better yourself than $CLOUDPROVIDER does, and the odds are pretty good that you’ll actually do worse.

But if you pay for a 3 9s service, actually have a 4 9s service delivered, and start to rely upon it as if it’s designed to be that good? No sympathy.