A few thoughts on yesterday's six-hour Facebook outage

  • Tue 05 October 2021
  • misc

Monday Night Football started airing in 1970. 51 years later, you'd think that Tuesday Morning Quarterbacking would have replaced Monday Morning Quarterbacking in the popular vernacular, but it hasn't. With that in mind, here are a few observations on the still-emerging news related to yesterday's multi-hour Facebook outage.

  • It looks like the outage went way further than just the DNS servers. Majdi and I think the working title of at least a section of the post mortem should be "Tell me that you haven't read RFC 2182 without saying you haven't read RFC 2182".

  • Vijay Gill quoted an offhand observation I made over 20 years ago -- out of band communications are a saving throw when you totally f*ck up. Recovery efforts were hurt by the inability to quickly roll back the changes that had been pushed.

  • Recovery efforts were further hampered by dependencies that nobody had thought of. Since we can't guarantee that large complex systems will never suffer a catastrophic failure, we can at least improve our odds of minimizing time to restore service by conducting black start exercises -- What happens if the network isn't there? What happens if the DNS isn't there? How do we come back when our usual external dependencies are broken?

  • Automation makes everything faster, smoother, better - including shooting your foot off.
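
One way to make the black start question concrete is a tabletop walk over a dependency map: mark what is gone, then see what can still be brought up and in what order. The sketch below is only an illustration under stated assumptions. The component names and their dependencies are hypothetical, not a description of Facebook's systems, and `black_start_order` is a made-up helper rather than an existing tool.

```python
def black_start_order(deps, unavailable=frozenset()):
    """Simulate a cold start over a dependency map.

    deps maps each component to the components it needs before it can start.
    unavailable is the set of components assumed to be gone for the exercise.
    Returns (order, stuck): the order things can come up in, and what never starts.
    """
    started = set()
    order = []
    pending = {name for name in deps if name not in unavailable}
    progress = True
    while pending and progress:
        progress = False
        for name in sorted(pending):  # sorted() copies, so mutating pending is safe
            if all(dep in started for dep in deps[name]):
                started.add(name)
                order.append(name)
                pending.discard(name)
                progress = True
    return order, sorted(pending)


# Hypothetical dependency map: every name here is an assumption for illustration.
deps = {
    "oob_console": [],            # out of band access, needs nothing else
    "config_tool": ["dns"],       # tool used to push or roll back changes
    "backbone": ["config_tool"],  # network comes back once config is fixed
    "dns": ["backbone"],          # DNS is reachable only over the backbone
    "badge_access": ["dns"],      # physical access system looks things up in DNS
}

order, stuck = black_start_order(deps)
print("comes up:", order)
print("stuck:", stuck)

# Same map, but the config tool is reachable over out of band access instead.
fixed = dict(deps, config_tool=["oob_console"])
print(black_start_order(fixed))
```

In the first map the config tool, the backbone and DNS wait on each other, so only the out of band console comes up. Moving the config tool onto the out of band path breaks the loop, which is the kind of dependency an exercise is meant to surface before the real outage does.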