Back on August 11, 2016, Heroku experienced increased routing latency in the EU region of the common runtime. While the official follow-up report describes what happened and what we've done to avoid this in the future, we found the root cause to be puzzling enough to require a deep dive into Linux networking.
The following is a write-up by SRE member Lex Neva (what's SRE?) and routing engineer Fred Hebert (now Heroku alumni) of an interesting Linux networking "gotcha" they discovered while working on incident 930.
The Incident
Our monitoring systems paged us about a rise in latency levels across the board in the EU region of the Common Runtime. We quickly saw that the...