When networks don’t work as expected I often end up being the guy who gets called. Doesn’t matter whether it’s a friend’s house or stuff from my group at the office, I enjoy a good technical puzzle and am generally happy to help out.
Without going into a lot of preamble about the original underlying problem at work (involving performance characteristics of a CDN), I was at the point of gathering packet dumps to look at for clues. After all, the packet dump doesn’t lie… or does it?
The dumps had truncated TCP packets with a snaplen of 1500 bytes. Interesting. We run a jumbo-clean backbone… but ifconfig says the MTU on the interface in question is set to 1500 bytes. With that as the MTU on one end, we shouldn’t be negotiating a TCP MSS of more than 1460. What gives?
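For anyone wondering where the 1460 comes from, here is a minimal sketch of the arithmetic, assuming IPv4 and no IP or TCP options (20 bytes of header each). The function name `mss_for_mtu` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: largest TCP payload that fits in one packet at a given MTU.
# Assumes IPv4 with no IP options (20 bytes) and no TCP options (20 bytes).
mss_for_mtu() {
    mtu=$1
    ip_header=20
    tcp_header=20
    echo $((mtu - ip_header - tcp_header))
}

mss_for_mtu 1500   # prints 1460
mss_for_mtu 9000   # prints 8960 (a typical jumbo MTU)
```

Either way, a 14k segment doesn't fit in a single packet at an MTU of 1500.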
Maybe ifconfig is fibbing. Linux stack, y’know? Anything is possible. OK, maximum smoke - let’s see what we can get with a snaplen of 64k.
tcpdump -n -s 65535 -i eth0 -G 300 -w "hls-in-%Y%m%dT%H%M%S.pcap" host foo or host bar
That got me files that were twice as big per five-minute block, and they weren’t giving me errors about truncation… but the packets being handed to me carried TCP segments of between 14k and 19k bytes. Weird.
I called up a colleague who could run packet collection on the other end in an attempt to figure out exactly what was being put onto the wire. Maybe things were being fragmented in some kind of unholy way and this was just a reassembly. But you’d expect to see a more constant size for the reassembled packet when moving decent-sized files, not a range of sizes.
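One way to pick the suspicious packets out of a saved capture is to filter on frame length. This is a rough sketch, assuming untagged Ethernet (14-byte header, no 802.1Q tag, FCS not captured). The helper name `oversize_filter` and the file name `capture.pcap` are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter matching frames too long for the given MTU.
# Assumes untagged Ethernet: 14-byte header, no 802.1Q tag, FCS not captured.
oversize_filter() {
    mtu=$1
    # The pcap primitive "greater N" matches packets of length >= N, hence the +1.
    echo "greater $((mtu + 14 + 1))"
}

oversize_filter 1500   # prints: greater 1515

# Against a saved capture (file name is a placeholder):
#   tcpdump -n -r capture.pcap "$(oversize_filter 1500)"
```

Anything that filter prints is bigger than what the interface's MTU says could have crossed the wire in one frame.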
Before we could get the packets going though, Wes dragged John to my cube and since John keeps up with the Wireshark developers and goes to their conferences, he knew exactly what was up… and no, it wasn’t fragmentation.
I knew that the NIC was fairly state of the art. The server is a fairly recent Dell 710. Linux tells me that the NIC is a <a href=http://www.broadcom.com/collateral/pb/5709C-PB02-R.pdf>Broadcom NetXtreme II BCM5709</a>. The datasheet says:
TCP processing engine
- Full Fast-path TCP processing
- Support for IPv4 and IPv6
What do I expect of a “TCP processing engine with Full Fast-path TCP processing”? Well, I expect IP and TCP checksum offloading. I’m pretty sure I expect fragment reassembly. In fact further down the list of features it claims as much. You can tell it’s a fairly studly NIC since it has an iSCSI controller in it.
What I wasn’t expecting is that the NIC is actually a fully functional TCP endpoint: it will re-factor several small TCP packets into one big TCP packet with the equivalent layer 4 payload and pass that off to the system. (ethtool calls this large receive offload, or LRO; generic receive offload, GRO, is the Linux kernel’s software version of the same trick.) Here in the future <a href=http://en.wikipedia.org/wiki/Interrupt_coalescing>interrupt coalescing</a> (waiting until you have N packets or X microseconds have passed before issuing an interrupt) is well-developed. Receive polling for Ethernet interfaces (just checking every M microseconds for pending traffic) has been around since the days of the Intel PRO/100 and the DEC Tulip.
The gains from these mechanisms (and window scaling) are such that efforts to raise end-to-end MTU on the Internet at large <a href=http://staff.psc.edu/mathis/MTU/>have been substantially blunted</a>. Consequently I’m a little dubious about purported gains from refactoring several small packets into one big one before pushing it up the stack.
What’s interesting, though, is that this happens regardless of whether you are the endpoint or not. If one is running a sniffer, say for an IDS or similar security appliance, <a href=http://blog.securityonion.net/2011/10/when-is-full-packet-capture-not-full.html>the packets your sniffer sees are not the same ones that are on the wire</a>. Needless to say, the security guys are not impressed. :)
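To see whether a given interface is doing this, `ethtool -k` (lowercase k) lists the offload features and their state. Below is a small sketch that filters that output down to the segmentation and coalescing features. The helper name `active_coalescing` is made up, and the sample text is illustrative rather than output from the machine in this story.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k` output on stdin and print the
# segmentation/coalescing features that are currently switched on.
active_coalescing() {
    grep -E '^(tcp-segmentation|generic-segmentation|generic-receive|large-receive)-offload: on'
}

# Illustrative sample only, not captured from a real host.
sample='Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]'

# Prints the tcp-segmentation-offload and generic-receive-offload lines.
printf '%s\n' "$sample" | active_coalescing

# On a real host: ethtool -k eth0 | active_coalescing
```

If that prints anything, the sizes in your capture may not match what actually crossed the wire.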
The good news is you can turn it off.
for i in rx tx sg tso ufo gso gro lro; do ethtool -K eth0 "$i" off; done
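If you would rather leave checksum offload and scatter-gather alone, a narrower variant touches only the segmentation and coalescing features. This is a sketch, not a tested recipe for this NIC: the function name `disable_coalescing` and its `dry-run` argument are invented for illustration, and whether this subset is enough for your sniffer is something to verify against the wire.

```shell
#!/bin/sh
# Hypothetical helper: switch off only segmentation/coalescing offloads.
# Usage: disable_coalescing IFACE [dry-run]
# With "dry-run" the ethtool commands are printed instead of executed.
disable_coalescing() {
    iface=$1
    mode=${2:-apply}
    for feature in tso gso gro lro; do
        if [ "$mode" = "dry-run" ]; then
            echo "ethtool -K $iface $feature off"
        else
            # Needs root; a driver may refuse features it does not support.
            ethtool -K "$iface" "$feature" off ||
                echo "could not change $feature on $iface" >&2
        fi
    done
}

# Show what would be run for eth0 without touching the NIC.
disable_coalescing eth0 dry-run
```

Running it without `dry-run` applies the changes. They do not survive a reboot unless you add them to your network configuration.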
The mystery is why it defaults to on in the desktop version of Ubuntu 13.10. I wonder which other NICs do this.
Putting on my <a href=http://en.wikipedia.org/wiki/Grey_hat>gray hat</a> for a moment, I wonder what kind of other stuff could be loaded on the NIC if one were willing to tweak the firmware… some kind of tiny piece of software to do one’s bidding when triggered remotely, completely unbeknownst to the host operating system and immune to its countermeasures.
And that, my friends, is what happens when peripherals get too smart for their own good…