This week, Heroku made Router 2.0 generally available, bringing features like HTTP/2, performance improvements and reliability enhancements out of the beta program!
Throughout the Router 2.0 beta, our engineering team addressed several bugs, all fairly straightforward with one exception involving Puma-based applications. A small subset of Puma applications would experience increased response times upon enabling the Router 2.0 flag, reflected in customers’ Heroku dashboards and router logs. After a thorough investigation of the router and a dive into Puma’s server code, we realized what we had stumbled upon was not actually a Router 2.0 performance issue. The root cause was a bug in Puma! This blog takes a deep dive into that investigation, including some tips for avoiding the bug on the Heroku platform while a fix in Puma is being developed. If you’d like a shorter ride (a.k.a. the TL;DR), skip to The Solution section of this blog. For the full story and all the technical nitty-gritty, read on.
Reproduction
The issue of long response times first surfaced through a customer support ticket for an application running a Puma + Rails web server. As the customer reported, in high-load scenarios the performance differences between Router 2.0 and the legacy router were disturbingly stark. An application scaled to 2 Standard-1X dynos would handle 30 requests per second just fine through the legacy router. Through Router 2.0, the same traffic would produce very long tail response times (95th and 99th percentiles). Under enough load, throughput would drop and requests would fail with H12: Request Timeout. The impact was immediate upon enabling the http-routing-2-dot-0 feature flag:
At first, our team of engineers had difficulty reproducing the above, despite running a similarly configured Puma + Rails app on the same framework and language versions. We consistently saw good response times from our app.
Then we tried varying the Rails application’s internal response time. We injected some artificial server lag of 200 milliseconds and that’s when things really took off:
This was quite the realization! In staging environments, Router 2.0 is subject to automatic load tests that run continuously at varied request rates, body sizes, protocol versions, etc. These request rates routinely reach much higher levels than 30 requests per second. However, the target applications of these load tests did not include a Heroku app running Puma + Rails with any significant server-side lag.
Investigation 🔍
With a reproduction in-hand, we were now in a position to investigate the high response times. We spun up our test app in a staging environment and started injecting a steady load of 30 requests per second.
Our first thought was that perhaps the legacy router was faster at forwarding requests to the dyno because its underlying TCP client managed connections in a way that played nicer with the Puma server. We hopped on a router instance and began dumping netstat connection states for one of our Puma app’s web dynos:
Connections from legacy router → dyno
root@router.1019708 | # netstat | grep ip-10-1-38-72.ec2:11059
tcp 0 0 ip-10-1-87-57.ec2:28631 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:30717 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:15205 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:17919 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:24521 ip-10-1-38-72.ec2:11059 TIME_WAIT
Connections from Router 2.0 → dyno
root@router.1019708 | # netstat | grep ip-10-1-38-72.ec2:11059
tcp 0 0 ip-10-1-87-57.ec2:24630 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:22476 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:38438 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38444 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31034 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38448 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:41882 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:23622 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31060 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31042 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:23648 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31054 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:23638 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38436 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:31064 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:22492 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:38414 ip-10-1-38-72.ec2:11059 TIME_WAIT
tcp 0 0 ip-10-1-87-57.ec2:42218 ip-10-1-38-72.ec2:11059 ESTABLISHED
tcp 0 0 ip-10-1-87-57.ec2:41880 ip-10-1-38-72.ec2:11059 TIME_WAIT
In the legacy router case, it seemed like there were fewer connections sitting in TIME_WAIT. This TCP state is a normal stop point along the lifecycle of a connection: it means the remote host (dyno) has sent a FIN indicating the connection should be closed, and the local host (router) has sent back the final ACK, acknowledging the close.
The connection hangs out in TIME_WAIT for some time, with the exact value varying among operating systems; on Linux it is typically 60 seconds. Once that timeout is hit, the socket is reclaimed and the router is free to reuse the address + port combination for a new connection.
With this understanding, we formed a hypothesis that the Router 2.0 HTTP client was churning through connections really quickly. Perhaps the new router was opening connections and forwarding requests at a faster rate than the legacy router, thus overwhelming the dyno.
Router 2.0 is written in Go and relies upon the language’s standard HTTP package. Some research turned up various tips for configuring Go’s http.Transport to avoid connection churn. The main recommendation involved tuning MaxIdleConnsPerHost. Without explicitly setting this configuration, the default value of 2 is used.
type Transport struct {
// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int
...
}
const DefaultMaxIdleConnsPerHost = 2
The problem with a low cap on idle connections per host is that it forces Go to close connections more often. For example, if this value were set higher, say to 10, our HTTP transport would keep up to 10 idle connections for this dyno in the pool and only start closing connections when the 11th went idle. With the cap at 2, the transport closes more connections, which also means opening more connections to our dyno. This could put strain on the dyno, as it requires Puma to spend more time handling connections and less time answering requests.
We wanted to test our hypothesis, so we set MaxIdleConnsPerHost: 100 on the Router 2.0 transport in staging. The connection distribution did change, and Router 2.0 connections were now more stable than before:
root@router.1020195 | # netstat | grep 'ip-10-1-2-62.ec2.:37183'
tcp 0 0 ip-10-1-34-185.ec:36350 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:11956 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:51088 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:60876 ip-10-1-2-62.ec2.:37183 ESTABLISHED
To our dismay, this had zero positive effect on our tail response times. We were still seeing the 99th percentile at well over 2 seconds for a Rails endpoint that should only take about 200 milliseconds to respond.
We tried changing some other configurations on the Go HTTP transport, but saw no improvement. After several rounds of updating a config, waiting for the router artifact to build, and then waiting for the deployment to our staging environment, we began to wonder—can we reproduce this issue locally?
Going local
Fortunately, we already had a local integration test setup for running requests through Router 2.0 to a dyno. We typically utilize this setup for verifying features and fixes, rarely for assessing performance. We subbed out our locally running “dyno” for a Puma server with a built-in 200ms lag on its /fixed endpoint.
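A bare-bones Rack app served by Puma is enough to stand in for the lagging test app. Here’s a hypothetical sketch of that setup (our actual test app was a Rails app, so treat this as an approximation):
# config.ru: minimal stand-in for a slow upstream (hypothetical example)
run lambda { |env|
  if env["PATH_INFO"] == "/fixed"
    sleep 0.2 # simulate 200 milliseconds of server-side work
    [200, { "content-type" => "text/plain" }, ["ok\n"]]
  else
    [404, { "content-type" => "text/plain" }, ["not found\n"]]
  end
}
Running it with something like puma -p 3000 config.ru gives a local Puma server with the same 200ms lag. We then fired off 200 requests over 10 different connections with hey: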
❯ hey -q 200 -c 10 -host 'purple-local-staging.herokuapp.com' http://localhost:80/fixed
Summary:
Total: 8.5804 secs
Slowest: 2.5706 secs
Fastest: 0.2019 secs
Average: 0.3582 secs
Requests/sec: 23.3090
Total data: 600 bytes
Size/request: 3 bytes
Response time histogram:
0.202 [1] |
0.439 [185] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.676 [0] |
0.912 [0] |
1.149 [0] |
1.386 [0] |
1.623 [0] |
1.860 [0] |
2.097 [1] |
2.334 [6] |■
2.571 [7] |■■
Latency distribution:
10% in 0.2029 secs
25% in 0.2038 secs
50% in 0.2046 secs
75% in 0.2086 secs
90% in 0.2388 secs
95% in 2.2764 secs
99% in 2.5351 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0003 secs, 0.2019 secs, 2.5706 secs
DNS-lookup: 0.0002 secs, 0.0000 secs, 0.0034 secs
req write: 0.0003 secs, 0.0000 secs, 0.0280 secs
resp wait: 0.3570 secs, 0.2018 secs, 2.5705 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0175 secs
Status code distribution:
[200] 200 responses
As you can see, the 95th percentile of response times is over 2 seconds, just as we had seen while running this experiment on the platform. We were now starting to worry that the router itself was inflating the response times. We tried targeting Puma directly at localhost:3000, bypassing the router altogether:
❯ hey -q 200 -c 10 http://localhost:3000/fixed
Summary:
Total: 8.3314 secs
Slowest: 2.4579 secs
Fastest: 0.2010 secs
Average: 0.3483 secs
Requests/sec: 24.0055
Total data: 600 bytes
Size/request: 3 bytes
Response time histogram:
0.201 [1] |
0.427 [185] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.652 [0] |
0.878 [0] |
1.104 [0] |
1.329 [0] |
1.555 [0] |
1.781 [0] |
2.007 [0] |
2.232 [2] |
2.458 [12] |■■■
Latency distribution:
10% in 0.2017 secs
25% in 0.2019 secs
50% in 0.2021 secs
75% in 0.2026 secs
90% in 0.2042 secs
95% in 2.2377 secs
99% in 2.4433 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0002 secs, 0.2010 secs, 2.4579 secs
DNS-lookup: 0.0001 secs, 0.0000 secs, 0.0016 secs
req write: 0.0001 secs, 0.0000 secs, 0.0012 secs
resp wait: 0.3479 secs, 0.2010 secs, 2.4518 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0003 secs
Status code distribution:
[200] 200 responses
Wow! These results suggested the issue was reproducible with any ol’ Go HTTP client and a Puma server. We next wanted to test out a different client. The load injection tool hey is written in Go, just like Router 2.0, so we tried ab, which is written in C:
❯ ab -c 10 -n 200 http://127.0.0.1:3000/fixed
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 3000
Document Path: /fixed
Document Length: 3 bytes
Concurrency Level: 10
Time taken for tests: 8.538 seconds
Complete requests: 200
Failed requests: 0
Total transferred: 35000 bytes
HTML transferred: 600 bytes
Requests per second: 23.42 [#/sec] (mean)
Time per request: 426.911 [ms] (mean)
Time per request: 42.691 [ms] (mean, across all concurrent requests)
Transfer rate: 4.00 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 204 409 34.6 415 434
Waiting: 204 409 34.7 415 434
Total: 205 410 34.5 415 435
Percentage of the requests served within a certain time (ms)
50% 415
66% 416
75% 416
80% 417
90% 417
95% 418
98% 420
99% 429
100% 435 (longest request)
Another wow! The longest request took about 400 milliseconds, much lower than the 2 seconds above. Had we just stumbled upon some fundamental incompatibility between Go’s standard HTTP client and Puma? Not so fast.
A deeper dive into the ab
documentation surfaced this option:
❯ ab -h
Usage: ab [options] [http[s]://]hostname[:port]/path
Options are:
...
-k Use HTTP KeepAlive feature
That’s different from hey, which enables keepalives by default. Could that be significant? We re-ran ab with -k:
❯ ab -k -c 10 -n 200 http://127.0.0.1:3000/fixed
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 3000
Document Path: /fixed
Document Length: 3 bytes
Concurrency Level: 10
Time taken for tests: 8.564 seconds
Complete requests: 200
Failed requests: 0
Keep-Alive requests: 184
Total transferred: 39416 bytes
HTML transferred: 600 bytes
Requests per second: 23.35 [#/sec] (mean)
Time per request: 428.184 [ms] (mean)
Time per request: 42.818 [ms] (mean, across all concurrent requests)
Transfer rate: 4.49 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 6
Processing: 201 405 609.0 202 2453
Waiting: 201 405 609.0 202 2453
Total: 201 406 609.2 202 2453
Percentage of the requests served within a certain time (ms)
50% 202
66% 203
75% 203
80% 204
90% 2030
95% 2242
98% 2267
99% 2451
100% 2453 (longest request)
Now the output looked just like the hey
output. Next, we ran hey
with keepalives disabled:
❯ hey -disable-keepalive -q 200 -c 10 http://localhost:3000/fixed
Summary:
Total: 8.3588 secs
Slowest: 0.4412 secs
Fastest: 0.2091 secs
Average: 0.4115 secs
Requests/sec: 23.9269
Total data: 600 bytes
Size/request: 3 bytes
Response time histogram:
0.209 [1] |
0.232 [3] |■
0.255 [1] |
0.279 [0] |
0.302 [0] |
0.325 [0] |
0.348 [0] |
0.372 [0] |
0.395 [0] |
0.418 [172] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.441 [23] |■■■■■
Latency distribution:
10% in 0.4140 secs
25% in 0.4152 secs
50% in 0.4160 secs
75% in 0.4171 secs
90% in 0.4181 secs
95% in 0.4187 secs
99% in 0.4344 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0011 secs, 0.2091 secs, 0.4412 secs
DNS-lookup: 0.0006 secs, 0.0003 secs, 0.0017 secs
req write: 0.0001 secs, 0.0000 secs, 0.0011 secs
resp wait: 0.4102 secs, 0.2035 secs, 0.4343 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0002 secs
Status code distribution:
[200] 200 responses
Again, no long tail response times, and the median values were comparable to the first run with ab.
Even better, this neatly explained the performance difference between Router 2.0 and the legacy router. Router 2.0 adds support for HTTP keepalives by default, in line with the HTTP/1.1 spec. In contrast, the legacy router closes connections to dynos after each request. Keepalives usually improve performance by reducing time spent in TCP operations for both the router and the dyno. Yet the opposite was true for a dyno running Puma.
Diving deep into Puma
Note that we suggest reviewing this brief Puma architecture document if you’re unfamiliar with the framework and want to get the most out of this section. To skip the code review, you may fast-forward to The Solution.
This finding was enough of a smoking gun to send us deep into the Puma server code, where we homed in on the process_client method. Let’s take a look at that code with a few details in mind:
- Each Puma thread can only handle a single connection at a time. A client is a wrapper around a connection.
- The handle_request method handles exactly one request. It returns false when the connection should be closed and true when it should be kept open. A client with keepalive enabled will end up in the true condition on line 470.
- fast_check is only false once we’ve processed @max_fast_inline requests serially off the connection and when there are more connections waiting to be handled.
- For some reason, even when the number of connections exceeds the max number of threads, @thread_pool.backlog > 0 is oftentimes false.
- Altogether, this means the loop below usually keeps executing until we’re able to bail out when handle_request returns false.
Code snippet from puma/lib/puma/server.rb in Puma 6.4.2.
When does handle_request actually return false? That is also based on a bunch of conditional logic; the core of it is in the prepare_response method. Basically, if force_keep_alive is false, handle_request will return false. (This is not exactly true. It’s more complicated than that, but the details aren’t important for this discussion.)
Code snippet from puma/lib/puma/request.rb in Puma 6.4.2.
The last piece of the puzzle: max_fast_inline defaults to 10. That means Puma will process at least 10 requests serially off a single connection before handing the connection back to the reactor. Requests that may have come in a full second ago just sit in the queue, waiting for their turn. This directly explains the 10 * 200ms = 2 seconds of added response time for our longest requests!
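For context, max_fast_inline is a standard option in Puma’s configuration DSL, so lowering it is a one-line change. A hypothetical config/puma.rb tweak (assuming Puma 6.x) looks like this:
# config/puma.rb
# Serve at most one request per connection before handing the client
# back to the reactor (Puma's default is 10). Hypothetical tuning example.
max_fast_inline 1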
We figured setting max_fast_inline=1 might fix this issue, and it does sometimes. However, under sufficient load, response times climb even with this setting. The problem is the other two OR’ed conditions circled in blue and red above: sometimes the number of busy threads is less than the max, and sometimes there are no new connections to accept on the socket. These decisions are made at a single point in time while the state of the server is constantly changing, so they are subject to race conditions: other threads are concurrently accessing these variables and taking actions that modify their values.
The Solution
After reviewing the Puma server code, we came to the conclusion that the simplest and safest way to bail out of processing requests serially would be to flat-out disable keepalives. Explicitly disabling keepalives in the Puma server means handing the client back to the reactor after each request. This is how we ensure requests are served in order.
After confirming these results with the Heroku Ruby language owners, we opened a GitHub issue on the Puma project and a pull request to add an enable_keep_alives option to the Puma DSL. When set to false, keepalives are completely disabled. The option will be released soon, likely in Puma 6.5.0.
We then re-ran our load tests with keepalives disabled in Puma (enable_keep_alives false) and Router 2.0 enabled on the app:
# config/puma.rb
...
enable_keep_alives false
The response times and throughput improved, as expected. Additionally, after disabling Router 2.0, the response times stayed the same:
Moving forward
Keeping keepalives
Keeping connections alive reduces time spent in TCP operations. Under sufficient load and scale, avoiding this overhead cost can positively impact apps’ response times. Additionally, keepalives are the de facto standard in HTTP/1.1 and HTTP/2. Because of this, Heroku has chosen to move forward with keepalives as the default behavior for Router 2.0.
Since we raised this issue on the Puma project, there has already been movement to fix the problematic keepalive behavior in the Puma server. Heroku engineers remain active participants in discussions around these efforts and are committed to solving this problem. Once a full fix is available, customers will be able to upgrade their Puma versions and use keepalives safely, without risk of long response times.
Disabling keepalives as a stopgap
In the meantime, we have provided another option for disabling keepalives when using Router 2.0. The following labs
flag may be used in conjunction with Router 2.0 to disable keepalives between the router and your web dynos:
heroku labs:enable http-disable-keepalive-to-dyno -a my-app
Note that this flag has no effect when using the legacy router as keepalives between the legacy router and dyno are not supported. For more information, see Heroku Labs: Disabling Keepalives to Dyno for Router 2.0.
Other options for Puma
You may find that your Puma app does not need keepalives disabled in order to perform well while using Router 2.0. We recommend testing and tuning other configuration options, so that your app can still benefit from persistent connections between the new router and your dyno:
- Increase the number of threads. More threads means Puma is better able to handle concurrent connections.
- Increase the number of workers. This has a similar effect, since each worker is a separate process running its own set of threads.
- Decrease the max_fast_inline number. This limits the number of requests served serially off a connection before handling queued requests. (A combined example follows this list.)
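Putting these together, a hypothetical config/puma.rb might look like the following. The values are placeholders to tune against your own app and dyno size, not recommendations:
# config/puma.rb (illustrative values only; hypothetical)
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))           # more worker processes
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads                           # more threads per worker
max_fast_inline 1  # serve fewer requests serially before checking the queue
With more workers and threads, Puma has more capacity to hold keepalive connections open while still draining the request queue.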
Other languages & frameworks
Our team also wanted to see if this same issue would present in other languages or frameworks. We ran load tests against apps built with the top languages and frameworks on the Heroku platform, injecting 200 milliseconds of server-side lag into each. Here are those results.
Language/Framework | Router | Web dynos | Server-side lag | Throughput | P50 Response Time | P95 Response Time | P99 Response Time |
---|---|---|---|---|---|---|---|
Puma | Legacy | 2 Standard-1X | 200 ms | 30 rps | 215 ms | 287 ms | 335 ms |
Puma with keepalives | Router 2.0 | 2 Standard-1X | 200 ms | 23 rps | 447 ms | 3,455 ms | 5,375 ms |
Puma without keepalives | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 215 ms | 271 ms | 335 ms |
NodeJS | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
NodeJS | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Python | Legacy | 4 Standard-1X | 200 ms | 30 rps | 223 ms | 607 ms | 799 ms |
Python | Router 2.0 | 4 Standard-1X | 200 ms | 30 rps | 223 ms | 607 ms | 735 ms |
PHP | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 367 ms | 431 ms |
PHP | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 367 ms | 431 ms |
Java | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Java | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Go | Legacy | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
Go | Router 2.0 | 2 Standard-1X | 200 ms | 30 rps | 207 ms | 207 ms | 207 ms |
These results indicate the issue is unique to Puma, with Router 2.0 performance comparable to the legacy router in other cases.
Conclusion
We were initially surprised by this keepalive behavior in the Puma server. Funnily enough, we believe Heroku’s significance in the Puma/Rails world and the fact that the legacy router does not support keepalives may have been factors in this bug persisting for so long. Reports of it had popped up in the past (see Issue 3443, Issue 2625 and Issue 2331), but none of these prompted a foolproof fix. Setting enable_keep_alives false does completely eliminate the problem, but it is not the default. Now, Puma maintainers are taking a closer look at the problem and benchmarking potential fixes in a fork of the project. The intention is to fix the balancing of requests without closing TCP connections to the Puma server.
Our Heroku team is thrilled that we were able to contribute in this way and help move the Puma/Rails community forward. We’re also excited to release Router 2.0 as GA, unlocking new features like HTTP/2 and keepalives to your dynos. We encourage our users to try out this new router! For advice on how to go about that, see Tips & Tricks for Migrating to Router 2.0.