M*A*S*H as a metaphor for the Internet Spottiness of August 12th


  • Sun 24 August 2014
  • misc

Widespread Internet outages and slowdowns that occurred on Tuesday August 12th were covered in IEEE Spectrum in one of the better articles I've read from a getting-it-right perspective.

People who read this blog of course are capable of absorbing a little more technical nuance.

Routers on the Internet that handle large amounts of traffic are generally not composed entirely of general-purpose CPUs - they are split into what's called the "control plane" and the "data plane".

The "control plane" is where detailed information about network topology lives, in the form of routing protocol tables (stuff like BGP, OSPF, IS-IS). It is also where the command line interpreter lives, through which the network manager interacts with the router. The control plane is a general-purpose CPU (usually Intel, MIPS, or PowerPC). Memory for general-purpose CPUs is fairly inexpensive, though how much one can pile on is often constrained by the number of memory sockets or traces on the CPU card. The control plane is capable of forwarding IP packets, but not fast (it is often called into action for things like replying to pings, generating max-hops-exceeded messages for traceroute, etc).

The "data plane" contains the distilled set of best routes to each destination that the router knows about, perhaps including a default, or catch-all, route. The data plane is fast but stupid - it is made of application-specific integrated circuits (ASICs) and a special kind of memory called TCAM (ternary content-addressable memory). This stuff is expensive, and generally non-expandable except by replacing the board it lives on. On the plus side, the task of the data plane is fairly straightforward - send the packet to the proper exit interface based on its destination address, and do so fast.
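For readers who think better in code, here is a minimal sketch of the data plane's one job: given a destination address, pick the most specific matching route and return its exit interface. The prefixes and interface names are made up for illustration, and the linear scan is only a stand-in; real TCAM compares against all entries at once in hardware.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> exit interface name.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"): "uplink0",    # default (catch-all) route
    ipaddress.ip_network("10.0.0.0/8"): "ge-0/0/1",
    ipaddress.ip_network("10.1.0.0/16"): "ge-0/0/2",
}


def lookup(fib, destination):
    """Return the exit interface of the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    # Collect every prefix that contains the destination address.
    matches = [net for net in fib if addr in net]
    if not matches:
        return None
    # Longest prefix wins: a /16 beats a /8, which beats the /0 default.
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]
```

With this table, `lookup(FIB, "10.1.2.3")` picks the /16 entry, while an address outside 10.0.0.0/8 falls through to the default route.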

Conceptually, the data plane's role is very much like the M*A*S*H signpost: http://www.mash4077tv.com/features/prop_spotlight_signpost/

What happened in these last days, when the global routing table crossed 512k routes, is akin to nailing one too many signs on the signpost, causing it to blow over in the wind. The map is still in perfect shape and is available if you jump through hoops, but the quick reference is gone. The fallback technique is to go see Radar O'Reilly when you need directions (IP_Input on the control plane), and when he gets around to it, Colonel Potter will handle the matter personally.
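The overflow can be sketched as a toy model, under some loud assumptions: the `Router` class, its `fast_capacity` parameter, and the "fast"/"slow" labels are all invented here, and real hardware handles a full table in vendor-specific ways. The point is only that the control plane keeps every route while the data plane keeps what fits, and anything that did not fit gets punted to the slow path.

```python
import ipaddress


class Router:
    """Toy model: complete route table in software, size-limited table in hardware."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity  # hypothetical hardware entry limit
        self.rib = {}  # control plane: every route the router knows
        self.fib = {}  # data plane: only the routes that fit

    def add_route(self, prefix, interface):
        net = ipaddress.ip_network(prefix)
        self.rib[net] = interface
        # Install in hardware only while there is room (or to update an entry).
        if net in self.fib or len(self.fib) < self.fast_capacity:
            self.fib[net] = interface

    def forward(self, destination):
        """Return (path, interface); path is 'fast', 'slow', or 'drop'."""
        addr = ipaddress.ip_address(destination)
        matches = [net for net in self.rib if addr in net]
        if not matches:
            return ("drop", None)
        best = max(matches, key=lambda net: net.prefixlen)
        # Routes that never made it into hardware are handled by the CPU.
        path = "fast" if best in self.fib else "slow"
        return (path, self.rib[best])
```

With `fast_capacity=2` and three routes added, traffic for the first two destinations stays on the fast path and traffic for the third is handed to the control plane, which still knows the right answer but gets to it much more slowly.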

This is why what people were seeing was often not 100% failure, but more like 98%+ packet loss (annoyingly, enough to keep monitoring software confused in many cases as to the true nature of things).
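A back-of-the-envelope sketch of why the loss lands near, but not at, 100%: if everything is punted to a CPU that can only forward a small fraction of the offered load, that fraction gets through and the rest is dropped. The function name and the packet rates below are purely illustrative assumptions, not measurements from the event.

```python
def loss_fraction(offered_pps, slow_path_pps):
    """Fraction of packets dropped when all traffic is punted to a CPU
    that can forward at most slow_path_pps packets per second."""
    if offered_pps <= 0:
        return 0.0
    # The CPU forwards what it can; everything beyond that is lost.
    forwarded = min(offered_pps, slow_path_pps)
    return 1.0 - forwarded / offered_pps


# Illustrative numbers only: 1,000,000 pps offered, CPU manages 20,000 pps.
example_loss = loss_fraction(1_000_000, 20_000)
```

Under those made-up rates the loss works out to roughly 98%, which leaves just enough packets getting through to make a monitoring probe succeed now and then.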