Sockets in a Bind

Back on August 11, 2016, Heroku experienced increased routing latency in the EU region of the Common Runtime. While the official follow-up report describes what happened and what we've done to avoid this in the future, we found the root cause to be puzzling enough to require a deep dive into Linux networking.

The following is a write-up by SRE member Lex Neva and routing engineer Fred Hebert (now a Heroku alumnus) of an interesting Linux networking "gotcha" they discovered while working on incident 930.

The Incident

Our monitoring systems paged us about a rise in latency levels across the board in the EU region of the Common Runtime. We quickly saw that the usual causes didn’t apply: CPU usage was normal, packet rates were entirely fine, memory usage was green as a summer field, request rates were low, and socket usage was well within the acceptable range. In fact, when we compared the EU nodes to their US counterparts, all metrics were at a nicer level than the US ones, except for latency. How to explain this?

One of our engineers noticed that connections from the routing layer to dynos were getting the POSIX error code EADDRINUSE, which is odd.

For a server socket, which is bound with bind() and put into listening mode with listen(), EADDRINUSE indicates that the port specified is already in use. But we weren’t talking about a server socket; this was the routing layer acting as a client, connecting to dynos to forward an HTTP request to them. Why would we be seeing EADDRINUSE?

TCP/IP Connections

Before we get to the answer, we need a little bit of review about how TCP works.

Let’s say we have a program that wants to connect to some remote host and port over TCP. It will tell the kernel to open the connection, and the kernel will choose a source port to connect from. That’s because every TCP connection is uniquely specified by a set of 4 pieces of data:

( <SOURCE-IP> : <SOURCE-PORT> , <DESTINATION-IP> : <DESTINATION-PORT> )

No two connections can share this same set of 4 items (called the “4-tuple”). This means that any given host (<SOURCE-IP>) can only connect to any given destination (<DESTINATION-IP>:<DESTINATION-PORT>) at most 65536 times concurrently, which is the total number of possible values for <SOURCE-PORT>. Importantly, it’s okay for two connections to use the same source port, provided that they are connecting to a different destination IP and/or port.
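The uniqueness rule can be sketched in a few lines of C. This is only an illustration, not kernel code: `struct four_tuple` and `tuples_conflict` are hypothetical names made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration type: the four values that together identify
 * one TCP connection. */
struct four_tuple {
  uint32_t src_ip;
  uint16_t src_port;
  uint32_t dst_ip;
  uint16_t dst_port;
};

/* Two connections collide only when all four fields are equal. Sharing a
 * source port is fine as long as the destination IP or port differs. */
bool tuples_conflict(struct four_tuple a, struct four_tuple b) {
  return a.src_ip == b.src_ip && a.src_port == b.src_port &&
         a.dst_ip == b.dst_ip && a.dst_port == b.dst_port;
}
```

With this model, a connection from source port 40000 to port 80 and another from source port 40000 to port 443 on the same destination host do not conflict, because their destination ports differ.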

Usually a program will ask Linux (or any other OS) to automatically choose an available source port to satisfy the rules. If no port is available (because 65536 connections to the given destination (<DESTINATION-IP>:<DESTINATION-PORT>) are already open), then the OS will respond with EADDRINUSE.

This is a little complicated by a feature of TCP called “TIME_WAIT”. When a given connection is closed, the TCP specification declares that both ends should wait a certain amount of time before opening a new connection with the same 4-tuple. This is to avoid the possibility that delayed packets from the first connection might be misconstrued as belonging to the second connection.

Generally this TIME_WAIT waiting period lasts for only a minute or two. This means that even if fewer than 65536 connections are currently open to a given destination IP and port, there still may not be a source port available for a new connection if enough connections were closed recently. The usable number can be lower still: Linux tries to select source ports randomly until it finds an available one, and with enough source ports used up, it may not find a free one before it gives up.

Port exhaustion in Heroku’s routing layer

So why would we see EADDRINUSE in connections from the routing layer to dynos? According to our understanding, such an error should not happen. It would indicate that 65536 connections from a specific routing node were being made to a specific dyno, and that theoretical limit on concurrent connections is far more than a single dyno could ever hope to handle.

We could easily see from our application traffic graphs that no dyno was coming close to this theoretical limit. So we were left with a concerning mystery: how was it possible that we were seeing EADDRINUSE errors?

We wanted to prevent the incident from ever happening again, and so we continued to dig - taking a dive into the internals of our systems.

Our routing layer is written in Erlang, and the most likely candidate was its virtual machine’s TCP calls. Digging through the VM’s network layer, we got down to the sock_connect call, which is mostly a portable wrapper around the Linux connect() syscall.

Seeing this, it seemed that nothing in there was out of place to cause the issue. We’d have to go deeper, into the OS itself.

After digging and reading many documents, one of us noticed this bit in the now well-known blog post Bind before connect:

Bind is usually called for listening sockets so the kernel needs to make sure that the source address is not shared with anyone else. It's a problem. When using this techique [sic] in this form it's impossible to establish more than 64k (ephemeral port range) outgoing connections in total. After that the attempt to call bind() will fail with an EADDRINUSE error - all the source ports will be busy.

[...]

When we call bind() the kernel knows only the source address we're asking for. We'll inform the kernel of a destination address only when we call connect() later.

This passage seems to be describing a special case where a client wants to make an outgoing connection with a specific source IP address. We weren’t doing that in our Erlang code, so this still didn’t seem to fit our situation well. But the symptoms matched so well that we decided to check for sure whether the Erlang VM was doing a bind() call without our knowledge.

We used strace to determine the actual system call sequence being performed. Here’s a snippet of strace output for a connection to 10.11.12.13:80:

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.11.12.13")}, 16) = 0

To our surprise, bind() was being called! The socket was being bound to a <SOURCE-IP>:<SOURCE-PORT> of 0.0.0.0:0. Why?

This instructs the kernel to bind the socket to any IP and any port. This seemed a bit useless to us, as the kernel would already select an appropriate <SOURCE-IP> when connect() was called, based on the destination IP address and the routing table.

This bind() call seemed like a no-op. But critically, this call required the kernel to select the <SOURCE-PORT> right then and there, without having any knowledge of the other 3 parts of the 4-tuple: <SOURCE-IP>, <DESTINATION-IP>, and <DESTINATION-PORT>. The kernel would therefore have only 65536 possible choices and might return EADDRINUSE, as per the bind() manpage:

EADDRINUSE (Internet domain sockets) The port number was specified as zero in the socket address structure, but, upon attempting to bind to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range ip(7).
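The early port selection is easy to observe with getsockname(). The sketch below is our own illustration, not part of the original investigation, and `local_port` and `port_before_connect` are hypothetical helper names. It never calls connect(), so the kernel has no destination to work with. On Linux we would expect the unbound socket to report port 0 and the socket bound to 0.0.0.0:0 to report an already-chosen ephemeral port.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the local port reported by getsockname(), or -1 on error. */
static int local_port(int fd) {
  struct sockaddr_in addr;
  socklen_t len = sizeof(addr);
  if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) return -1;
  return ntohs(addr.sin_port);
}

/* Creates a TCP socket, optionally binds it to 0.0.0.0:0, and returns the
 * local port assigned so far (0 means "none yet"), or -1 on error.
 * connect() is never called, so no destination is known to the kernel. */
int port_before_connect(int do_bind) {
  int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
  if (fd < 0) return -1;

  if (do_bind) {
    struct sockaddr_in any;
    memset(&any, 0, sizeof(any));
    any.sin_family = AF_INET;
    any.sin_port = htons(0);                  /* port 0: kernel picks one */
    any.sin_addr.s_addr = htonl(INADDR_ANY);  /* source IP 0.0.0.0 */
    if (bind(fd, (struct sockaddr *)&any, sizeof(any)) < 0) {
      close(fd);
      return -1;
    }
  }

  int port = local_port(fd);
  close(fd);
  return port;
}
```

Calling port_before_connect(0) and port_before_connect(1) side by side shows the difference: only the bound socket has committed to a source port before any destination is known.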

Unbeknownst to us, we had been operating for a very long time with a far lower tolerance threshold than expected -- the ephemeral port range was effectively a limit on how much traffic each routing layer instance could tolerate, whereas we had thought no such limitation existed.
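To get a feel for how small that limit can be, here is a minimal sketch that counts the ports in the range published at /proc/sys/net/ipv4/ip_local_port_range. The function names are hypothetical, and the sample value "32768 60999" used below is only an example of what the file may contain, not a measurement from our routing nodes.

```c
#include <stdio.h>

/* Parses text in the "<low> <high>" format used by
 * /proc/sys/net/ipv4/ip_local_port_range and returns how many ports the
 * inclusive range contains, or -1 if the text cannot be parsed. */
int ephemeral_port_count(const char *range_text) {
  int low, high;
  if (sscanf(range_text, "%d %d", &low, &high) != 2) return -1;
  if (low < 0 || high > 65535 || low > high) return -1;
  return high - low + 1;
}

/* Reads the live range from procfs (Linux only); returns -1 on failure. */
int current_ephemeral_port_count(void) {
  char buf[64];
  FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
  if (f == NULL) return -1;
  char *line = fgets(buf, sizeof(buf), f);
  fclose(f);
  return line ? ephemeral_port_count(buf) : -1;
}
```

For the example range 32768 to 60999, ephemeral_port_count returns 28232, well under half of the 65536 figure used in the reasoning above.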

The Fix

Reading further in Bind before connect yields the fix: just set the SO_REUSEADDR socket option before the bind() call. In Erlang this is done by simply passing {reuseaddr, true}.

At this point we thought we had our answer, but we had to be sure. We decided to test it.

We first wrote a small C program that exercised the current limit:

#include <stdlib.h>  /* atoi(); <sys/types.h> is not required for sockets on Linux */
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(int argc, char **argv) {
  /* usage: ./connect_with_bind <num> <dest1> <dest2> ... <destN>
   *
   * Opens <num> connections to port 80, round-robining between the specified
   * destination IPs.  Then it opens the same number of connections to port
   * 443.
   */

  int i;
  int fds[131072];
  struct sockaddr_in sin;
  struct sockaddr_in dest;

  memset(&sin, 0, sizeof(struct sockaddr_in));

  sin.sin_family = AF_INET;
  sin.sin_port = htons(0);  // source port 0 (kernel picks one)
  sin.sin_addr.s_addr = htonl(INADDR_ANY);  // source IP 0.0.0.0

  for (i = 0; i < atoi(argv[1]); i++) {
    memset(&dest, 0, sizeof(struct sockaddr_in));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(80);

    // round-robin between the destination IPs specified
    dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

    fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    bind(fds[i], (struct sockaddr *)&sin, sizeof(struct sockaddr_in));
    connect(fds[i], (struct sockaddr *)&dest, sizeof(struct sockaddr_in));
  }

  sleep(5);

  fprintf(stderr, "GOING TO START CONNECTING TO PORT 443\n");

  for (i = 0; i < atoi(argv[1]); i++) {
    memset(&dest, 0, sizeof(struct sockaddr_in));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(443);
    dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

    fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    bind(fds[i], (struct sockaddr *)&sin, sizeof(struct sockaddr_in));
    connect(fds[i], (struct sockaddr *)&dest, sizeof(struct sockaddr_in));
  }

  sleep(5);
}

We increased our file descriptor limit and ran this program as follows:
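We do not spell out here how the limit was raised; running `ulimit -n` in the shell before launching the program is the usual route. As an alternative sketch, a program could raise its own soft limit with getrlimit() and setrlimit(). `raise_fd_limit` is a hypothetical helper, not part of the test program.

```c
#include <sys/resource.h>

/* Raises the soft RLIMIT_NOFILE to `wanted`, capped at the hard limit.
 * Never lowers the limit. Returns the resulting soft limit, or 0 on error. */
rlim_t raise_fd_limit(rlim_t wanted) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) < 0) return 0;

  /* An unprivileged process cannot exceed the hard limit. */
  if (lim.rlim_max != RLIM_INFINITY && wanted > lim.rlim_max) {
    wanted = lim.rlim_max;
  }

  if (wanted > lim.rlim_cur) {
    lim.rlim_cur = wanted;
    if (setrlimit(RLIMIT_NOFILE, &lim) < 0) return 0;
  }
  return lim.rlim_cur;
}
```

A run that opens 65536 sockets at a time needs a limit somewhat above that number, since stdin, stdout, and stderr also count. If the hard limit is lower, the helper returns the capped value and the caller can decide whether to continue.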

./connect_with_bind 65536 10.11.12.13 10.11.12.14 10.11.12.15

This program attempted to open 65536 connections to port 80 on the three IPs specified. Then it attempted to open another 65536 connections to port 443 on the same IPs. If only the 4-tuple were in play, we should be able to open all of these connections without any problem.

We ran the program under strace while monitoring ss -s for connection counts. As expected, we began seeing EADDRINUSE errors from bind(). In fact, we saw these errors even before we’d opened 65536 connections. The Linux kernel does source port allocation by randomly selecting a candidate port and then checking the N following ports until it finds an available port. This is an optimization to prevent it from having to scan all 65536 possible ports for each connection.
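The test program ignores return values, which is why strace was needed to see the failures. A variant could report them in-process instead. The helper below is a sketch with a hypothetical name, `try_bind`, which hands back the errno from bind() so that the caller can count EADDRINUSE results.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Binds fd to the given IPv4 address and port (both in host byte order).
 * Returns 0 on success, or the errno value set by bind() on failure. */
int try_bind(int fd, uint32_t ip, uint16_t port) {
  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(ip);

  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
    return errno;  /* e.g. EADDRINUSE when no suitable port is free */
  }
  return 0;
}
```

Inside the connection loops, try_bind(fds[i], INADDR_ANY, 0) would replace the bare bind() call, and comparing its result against EADDRINUSE gives a failure count without needing strace.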

Once that baseline was established, we added the SO_REUSEADDR socket option. Here are the changes we made:

--- connect_with_bind.c 2016-12-22 10:29:45.916723406 -0500
+++ connect_with_bind_and_reuse.c   2016-12-22 10:31:54.452322757 -0500
@@ -17,6 +17,7 @@
   int fds[131072];
   struct sockaddr_in sin;
   struct sockaddr_in dest;
+  int one = 1;

   memset(&sin, 0, sizeof(struct sockaddr_in));

@@ -33,6 +34,7 @@
     dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

     fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+    setsockopt(fds[i], SOL_SOCKET, SO_REUSEADDR, &one, sizeof(int));
     bind(fds[i], (struct sockaddr *)&sin, sizeof(struct sockaddr_in));
     connect(fds[i], (struct sockaddr *)&dest, sizeof(struct sockaddr_in));
   }
@@ -48,6 +50,7 @@
     dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

     fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+    setsockopt(fds[i], SOL_SOCKET, SO_REUSEADDR, &one, sizeof(int));
     bind(fds[i], (struct sockaddr *)&sin, sizeof(struct sockaddr_in));
     connect(fds[i], (struct sockaddr *)&dest, sizeof(struct sockaddr_in));
   }

We ran it like this:

./connect_with_bind_and_reuse 65536 10.11.12.13 10.11.12.14 10.11.12.15

Our expectation was that bind() would stop returning EADDRINUSE. The new program confirmed this fairly rapidly, and showed us once more that there can be quite a gap between what theory leads you to expect and what happens in practice.

Knowing this, all we had to do was confirm that the {reuseaddr, true} option on the Erlang side would work, and a quick strace of a node performing the call confirmed that the appropriate setsockopt() call was being made.

Giving Back

It was quite an eye-opening experience to discover this unexpected connection limitation in our routing layer. The patch to Vegur, our open-sourced HTTP proxy library, was deployed a couple of days later, preventing this issue from ever biting us again.

We hope that by sharing our experience here, we might save you from similar bugs in your systems.
