Routing and Web Performance on Heroku: a FAQ

Hi. I'm Adam Wiggins, cofounder and CTO of Heroku.

Heroku has been my life’s work. Millions of apps depend on us, and I take that responsibility very personally.

Recently, Heroku has faced criticism from the hacker community about how our HTTP router works, and about web performance on the platform in general. I’ve read all the public discussions, and have spent a lot of time over the past month talking with our customers about this subject.

The concerns I've heard from you span past, present, and future.

The past: some customers have hit serious problems with poor web performance and insufficient visibility on their apps, and have been left very frustrated as a result. What happened here? The present: how do you know if your app is affected, and if so what should you do? And the future: what is Heroku doing about this? Is Heroku a good place to run and scale an app over the long term?

To answer these questions, we’ve written a FAQ, found below. It covers what happened, why the router works the way that it does, whether your app is affected by excessive queue time, and what the solution is.

As to the future, here’s what we’re doing. We’re ramping up hands-on migration assistance for all users running on our older stack, Bamboo, or running a non-concurrent backend on our new stack, Cedar. (See the FAQ for why this is the fix.) We’re adding new features such as 2X dynos to make it easier to run concurrent backends for large Rails apps. And we're making performance and visibility a bigger area of product attention, starting with some tools we've already released in the last month.

If you have a question not answered by this FAQ, post it as a comment here, on Hacker News, or on Twitter. I’ll attempt to answer all such questions posted in the next 24 hours.

To all our customers who experienced real pain from this: we're truly sorry. After reading this FAQ, I hope you feel we're taking every reasonable step to set things right, but if not, please let us know.

Adam


Overview

Q. Is Heroku’s router broken?

A. No. Hundreds of pages could be written on this topic; we address the key points in the Routing technology section below. Summary: the current version of the router was designed to provide the optimum combination of uptime, throughput, and support for modern concurrent backends. It works as designed.

Q. So what’s this whole thing about then?

A. Since early 2011, high-volume Rails apps that run on Heroku and use single-threaded web servers sometimes experienced severe tail latencies and poor utilization of web backends (dynos). Lack of visibility into app performance, including incorrect queue time reporting prior to the New Relic update in February 2013, made diagnosing these latencies (by customers, and even by Heroku’s own support team) very difficult.

Q. What types of apps are affected?

A. Rails apps running on Thin, with six or more dynos, and serving 1k reqs/min or more are the most likely to be affected. The impact becomes more pronounced as such apps use more dynos, serve more traffic, or have large request time variances.
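
One way to see why traffic and request time variance matter so much is a textbook queueing model. The sketch below is not Heroku's own analysis: it assumes requests arrive as a Poisson process, that random routing splits them evenly across dynos, and that each single-threaded dyno behaves like an independent M/G/1 queue, so the Pollaczek-Khinchine formula gives the mean queue time. The function name and the example request mixes (50 ms and 3,000 ms requests) are hypothetical.

```python
def mean_queue_time_ms(reqs_per_min, dynos, request_mix):
    """Pollaczek-Khinchine mean wait for one single-threaded dyno (M/G/1).

    request_mix is a list of (probability, duration_ms) pairs describing
    how long requests take. All inputs here are illustrative assumptions.
    """
    lam = reqs_per_min / 60000.0 / dynos  # arrivals per ms at each dyno
    mean = sum(p * s for p, s in request_mix)
    second_moment = sum(p * s * s for p, s in request_mix)
    rho = lam * mean  # fraction of time the dyno is busy
    if rho >= 1.0:
        raise ValueError("dyno is overloaded; the queue grows without bound")
    return lam * second_moment / (2.0 * (1.0 - rho))


# Hypothetical app at the thresholds mentioned above: 6 dynos, 1k reqs/min.
steady = [(1.0, 109.0)]                 # every request takes 109 ms
mixed = [(0.98, 50.0), (0.02, 3000.0)]  # same 109 ms mean, rare slow requests
steady_wait = mean_queue_time_ms(1000, 6, steady)
mixed_wait = mean_queue_time_ms(1000, 6, mixed)
print(f"steady: {steady_wait:.0f} ms, mixed: {mixed_wait:.0f} ms")
```

Under these assumptions both mixes have the same average request time, yet the predicted mean queue time goes from roughly 24 ms to roughly 360 ms once 2% of requests are slow. Raising reqs_per_min pushes both numbers up further.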

Q. How can I tell if my app is affected?

A. Add the free version of New Relic (heroku addons:add newrelic) and install the latest version of the newrelic_rpm gem, then watch your queue time. Average queue times above 40ms are usually indicative of a problem.

Some apps with lower request volume may be affected if they have extremely high request time variances (e.g., HTTP requests lasting 10+ seconds) or make callbacks like this OAuth example.

Q. What’s the fix?

A. Switch to a concurrent web backend like Unicorn or Puma on JRuby, which allows the dyno to manage its own request queue and avoid blocking on long requests.

This requires that your app be on our most current stack, Cedar.

Q. Can you give me some help with this?

A. Certainly. We’ve already emailed self-migration instructions, along with a way to reach us for direct assistance, to all customers with apps running on Thin with more than six dynos.

If you haven’t received the email and want help making the switch, contact us about migrating to Cedar or migrating to Unicorn.

Routing technology

Q. Why does the router work the way that it does?

A. The Cedar router was built with two goals in mind: (1) to support the new world of concurrent web backends which have become the standard in Ruby and all other language communities; and (2) to handle the throughput and availability needs of high-traffic apps.

Read detailed documentation of Heroku’s HTTP routing.

Q. Even with concurrent web backends, wouldn’t a single global request queue still use web dynos more efficiently?

A. Probably, but it comes with trade-offs for availability and performance. The Heroku router favors availability, stateless horizontal scaling, and low latency through individual routing nodes. Per-app global request queues require a sacrifice on one or more of these fronts. See Kyle Kingsbury’s post on the CAP theorem implications for global request queueing.

After extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections.
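
The trade-off can be explored with a small simulation. This is a minimal sketch under stated assumptions, not a model of Heroku's actual router: arrivals are Poisson, the request mix (98% at 50 ms, 2% at 3,000 ms) is hypothetical, every worker is assumed equally fast, and a global request queue is modeled as a single pool of workers behind one FIFO queue. It measures queue time only; it says nothing about the availability and router-scaling costs discussed above. All function names are illustrative.

```python
import heapq
import random


def make_requests(count, reqs_per_min, slow_fraction=0.02, seed=1):
    """Return a list of (arrival_ms, service_ms) pairs, sorted by arrival."""
    rng = random.Random(seed)
    now = 0.0
    requests = []
    for _ in range(count):
        now += rng.expovariate(reqs_per_min / 60000.0)  # Poisson arrivals
        service = 3000.0 if rng.random() < slow_fraction else 50.0
        requests.append((now, service))
    return requests


def simulate_mean_queue_time_ms(num_dynos, workers_per_dyno, requests, seed=0):
    """Route each request to a uniformly random dyno; return the mean wait.

    Each dyno serves its own FIFO queue with workers_per_dyno workers.
    """
    rng = random.Random(seed)
    # One min-heap per dyno: the times at which its workers become free.
    dynos = [[0.0] * workers_per_dyno for _ in range(num_dynos)]
    total_wait = 0.0
    for arrival, service in requests:
        free_at = dynos[rng.randrange(num_dynos)]
        start = max(arrival, free_at[0])  # wait for the earliest free worker
        heapq.heapreplace(free_at, start + service)
        total_wait += start - arrival
    return total_wait / len(requests)


requests = make_requests(count=20000, reqs_per_min=1000)
# Random routing to 6 single-threaded dynos.
single = simulate_mean_queue_time_ms(6, 1, requests)
# Random routing to 6 dynos running 3 workers each.
concurrent = simulate_mean_queue_time_ms(6, 3, requests)
# One shared queue feeding 6 workers (a global request queue).
global_queue = simulate_mean_queue_time_ms(1, 6, requests)
print(f"single-threaded: {single:.1f} ms")
print(f"concurrent:      {concurrent:.1f} ms")
print(f"global queue:    {global_queue:.1f} ms")
```

Under these assumptions, queueing theory predicts that random routing to single-threaded dynos has by far the largest queue time, while both of the other configurations stay small. Note that the 6 x 3 case has three times the total workers of the other two; it mirrors running several Unicorn workers per dyno rather than a like-for-like capacity comparison.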

Q. So does that mean you aren’t working on improving HTTP performance?

A. Not at all. We're always looking for new ways to make HTTP requests on Heroku faster, more reliable, and more efficient. For example, we’ve been experimenting with backpressure routing, in which web dynos signal to the router that they are overloaded.

You, our customers, have told us that it’s not routing algorithms you ultimately care about, but rather overall web performance. You want to serve HTTP requests as quickly as possible, for fast page loads or API calls for your users. And you want to be able to quickly and easily diagnose performance problems.

Performance and visibility are what matter, and that’s what we’ll work on. This will include ongoing improvements to dynos, the router, visibility tools, and our docs.

Retrospective

Q. Did the Bamboo router degrade?

A. Yes. Our older router was designed and built during the early years of Heroku to support the Aspen and later the Bamboo stack. These stacks did not support concurrent backends, and thus the router was designed with a per-app global request queue. This worked as designed originally, but then degraded slowly over the next two years as we scaled out the router nodes.

Q. Were the docs wrong?

A. Yes, for Bamboo. They were correct when written, but fell out of date starting in early 2011. Until February 2013, the documentation described the Bamboo router as sending only one connection at a time to any given web dyno.

Q. Why didn’t you update Bamboo docs in 2011?

A. At the time, our entire product and engineering team was focused on our new product, Cedar. Being so focused on the future meant that we slipped on stewardship of our existing product.

Q. Was the "How It Works" section of the Heroku website wrong?

A. Yes. Like the docs, the How It Works section of our website described the router as tracking which dynos were tied up by long HTTP requests. This was accurate when written, but fell out of date starting in early 2011. Unlike the docs, the homepage was completely rewritten in June of 2011, and it no longer referenced tracking of long requests.

Q. Was the queue time metric in New Relic wrong?

A. Yes, for the same 2011-2013 period covered in the previous questions. The metric was transmitted to the New Relic instrumentation in the app via a set of HTTP headers set by the Heroku router. The root cause was the same as the Bamboo router degradation: the code didn't change, but scaling out the router nodes caused the data to become increasingly inaccurate and eventually useless. With New Relic's help, we fixed this in February 2013 by calculating queue time using a different method.

Q. Why didn’t Heroku take action on this until Rap Genius went public?

A. We’re sorry that we didn’t act sooner on the customer complaints we received via support tickets and other channels. We didn’t understand the magnitude of the confusion and frustration caused by the out-of-date Bamboo docs, incorrect queue time information in New Relic, and the general lack of visibility into web performance on the platform. The huge response to the Rap Genius post showed us that this touched a nerve in our community.

The Future

Q. What are we doing to make things right from here forward?

A. We’ve been working with many of our customers to get their queue times down, get them accurate visibility into their app’s performance, and make sure their app is fast and running on the right number of dynos. So far, the results are good.

Q. What about everyone else?

A. If we haven’t been in touch yet, here’s what we’re doing for you:

  • Migration assistance: We’ll give you hands-on help migrating to a concurrent backend, either individually or in online workshops. This includes the move to Cedar if you’re still on Bamboo. If you’re running a multi-dyno app on a non-concurrent backend and haven’t received an email, drop us a line about Thin to Unicorn or Bamboo to Cedar.
  • 2X dynos: We’re fast-tracking the launch of 2X dynos, to provide double the memory and allow for double (or more) Unicorn concurrency for large Rails apps. This is already in private beta, in use by several hundred customers, and will be available in public beta shortly.
  • New visibility tools: We’re putting more focus on bringing you new performance visibility features, such as the log2viz dashboard, CPU and memory use logging, and HTTP request IDs. We’ll be working to do much more on this front to make sure that you can diagnose performance problems when they happen and know what to do about them.

Want something else not mentioned here? Let us know.
