Routing Performance Update

Over the past couple of years Heroku customers have occasionally reported unexplained latency on Heroku. There are many causes of latency—some of them have nothing to do with Heroku—but until this week, we failed to see a common thread among these reports. We now know that our routing and load balancing mechanism on the Bamboo and Cedar stacks created latency issues for our Rails customers, which manifested themselves in several ways, including:

  • Unexplainable, high latencies for some requests
  • Mismatch between reported queuing and service time metrics and the observed reality
  • Discrepancies between documented and observed behaviors

For applications running on the Bamboo stack, the root cause of these issues is the nature of routing on the Bamboo stack coupled with gradual, horizontal expansion of the routing cluster. On the Cedar stack, the root cause is the fact that Cedar is optimized for concurrent request routing, while some frameworks, like Rails, are not concurrent in their default configurations.

We want Heroku to be the best place to build, deploy and scale web and mobile applications. In this case, we’ve fallen short of that promise. We failed to:

  • Properly document how routing works on the Bamboo stack
  • Understand the service degradation being experienced by our customers and take corrective action
  • Identify and correct confusing metrics reported from the routing layer and displayed by third party tools
  • Clearly communicate the product strategy for our routing service
  • Provide customers with an upgrade path from non-concurrent apps on Bamboo to concurrent Rails apps on Cedar
  • Deliver on the Heroku promise of letting you focus on developing apps while we worry about the infrastructure

We are immediately taking the following actions:

  • Improving our documentation so that it accurately reflects how our service works across both Bamboo and Cedar stacks
  • Removing incorrect and confusing metrics reported by Heroku or partner services like New Relic
  • Adding metrics that let customers determine queuing impact on application response times
  • Providing additional tools that developers can use to augment our latency and queuing metrics
  • Working to better support concurrent-request Rails apps on Cedar

The remainder of this blog post explains the technical details and history of our routing infrastructure, the intent behind the decisions we made along the way, the mistakes we made and what we think is the path forward.

How routing works on the Bamboo stack

In 2009, Heroku introduced the Bamboo stack. It supported only one language, one web framework and one embedded webserver. These were: Ruby (MRI 1.8), Rails (2.x) and Thin, respectively.

The Bamboo stack does not support concurrency. On Bamboo, a single process can serve only one request at a time. To support this architecture, Heroku’s HTTP router was designed to queue requests at the router level. This enabled it to efficiently distribute requests to all available dynos.

The Bamboo router never used a global per-application request queue. The router is a clustered service where each node in the cluster maintains its own per-application request queue. This is less efficient than routing with a global request queue, but it is a reasonable compromise as long as the cluster is small.

To see why, let’s look at a simplistic example. In the two diagrams below, requests are coming in through three router nodes and being passed to two dynos. The majority of requests take 50ms, while a rare slow request takes 5000ms. In the first diagram, you can see how a slow request, coming in to Router 1, is passed to Dyno 1. Until Dyno 1 is finished with that request, Router 1 will not send any more requests to that dyno. However, Routers 2 and 3 may still send requests to that dyno.

Meanwhile, as illustrated in the next diagram, because Routers 2 and 3 are not aware that Dyno 1 is busy, they may still queue up one request each for Dyno 1. These requests are delayed until Dyno 1 finishes processing the slow request.
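The two-dyno example above can be replayed with a small model. The Python sketch below is an illustration under stated assumptions, not Heroku's router code: the name `simulate_cluster`, the request tuple layout, and the rule that a router picks the dyno it believes will be free first (ties go to the lowest index) are all hypothetical.

```python
def simulate_cluster(requests, routers, dynos):
    """Replay requests through router nodes that each keep a private view.

    requests: list of (arrival_ms, router_index, duration_ms), sorted by arrival.
    Returns the time in ms each request waits before a dyno starts serving it.
    """
    true_free_at = [0] * dynos                        # when each dyno is really free
    believed = [[0] * dynos for _ in range(routers)]  # per-router private view (assumption)
    waits = []
    for arrival, router, duration in requests:
        view = believed[router]
        # The router picks the dyno it *believes* frees up first.
        dyno = min(range(dynos), key=view.__getitem__)
        start = max(arrival, true_free_at[dyno])
        waits.append(start - arrival)
        true_free_at[dyno] = start + duration
        # A router only learns about requests it dispatched itself.
        view[dyno] = start + duration
    return waits


# One slow request via Router 1, then two fast ones via Routers 2 and 3.
three_routers = simulate_cluster([(0, 0, 5000), (10, 1, 50), (20, 2, 50)], routers=3, dynos=2)
# The same traffic funneled through a single router.
one_router = simulate_cluster([(0, 0, 5000), (10, 0, 50), (20, 0, 50)], routers=1, dynos=2)
print(three_routers)  # [0, 4990, 5030]
print(one_router)     # [0, 0, 40]
```

In this model the two fast requests wait roughly five seconds behind the slow one when they arrive through routers that never saw it, and at most 40ms when a single router handles all three.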

The inefficiency in request routing gets worse as the number of routers increases. This is essentially what’s been happening with Rails apps running on the Bamboo stack. Our routing cluster remained small for most of Bamboo’s history, which masked this inefficiency. However, as the platform grew, it was only a matter of time before we had to scale out and address the associated challenges.

Routing on Cedar

As part of the new Cedar stack, we chose to evolve our router design to achieve the following:

  • Support additional HTTP features like long polling and chunked responses
  • Support multi-threaded and multi-process runtimes like JVM, Node.js, Unicorn and Puma
  • Stateless architecture to optimize for reliability and scalability

Additionally, to meet the scalability requirements of Cedar, we chose to remove the queuing logic and switch to random assignment. This new routing design was released exclusively on Cedar and was significantly different from the old design. Importantly, we intended customers to get the new routing behavior only when they deployed applications to Cedar.
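To make the difference between queuing and random assignment concrete, here is a minimal Python sketch. It is a toy model rather than the actual router: `simulate`, `global_queue` and `random_assignment` are hypothetical names, each dyno is assumed to serve one request at a time, and a shared queue is modeled as "send the request to whichever dyno frees up first".

```python
import random


def simulate(arrivals, durations, dynos, choose):
    """Return the wait in ms for each request under a dyno-selection policy.

    choose(free_at, i) returns the dyno index for request i, where free_at[d]
    is the time at which dyno d will have finished its current backlog.
    """
    free_at = [0] * dynos
    waits = []
    for i, (arrival, duration) in enumerate(zip(arrivals, durations)):
        dyno = choose(free_at, i)
        start = max(arrival, free_at[dyno])
        waits.append(start - arrival)
        free_at[dyno] = start + duration
    return waits


def global_queue(free_at, i):
    # Models a single shared queue: the next request goes to the first dyno to free up.
    return min(range(len(free_at)), key=free_at.__getitem__)


def random_assignment(rng):
    # Models random routing: the router ignores dyno state entirely.
    def choose(free_at, i):
        return rng.randrange(len(free_at))
    return choose


arrivals = [0, 10, 20, 30]       # ms
durations = [5000, 50, 50, 50]   # one slow request followed by three fast ones
print(simulate(arrivals, durations, 2, global_queue))  # [0, 0, 40, 80]
print(simulate(arrivals, durations, 2, random_assignment(random.Random(1))))
```

With the shared queue, the fast requests all avoid the dyno stuck on the slow request. Under random assignment, any fast request that lands on the busy dyno waits for the slow request to finish, so the total wait for this workload can never be lower than in the queued case.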

Degradation of Bamboo routing

In theory, customers who had relied on the behavior of Bamboo routing could continue to use the Bamboo stack until they were ready to migrate to Cedar. Unfortunately, that is not what happened. As traffic on Heroku grew, we added new nodes to the routing cluster, rendering the per-node request queues less and less efficient, until Bamboo was effectively performing random load balancing.
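A rough back-of-the-envelope calculation shows why per-node queues drift toward random behavior as the cluster grows. The sketch below assumes, purely for illustration, that an app's requests are spread uniformly across router nodes; the function name `unaware_probability` is hypothetical.

```python
from fractions import Fraction


def unaware_probability(routers: int) -> Fraction:
    """Chance that the router handling a new request did not dispatch the
    request currently occupying a given dyno, and so cannot know it is busy.

    Assumes requests are spread uniformly across router nodes.
    """
    if routers < 1:
        raise ValueError("a cluster needs at least one router")
    return Fraction(routers - 1, routers)


for n in (1, 3, 10, 100):
    print(n, unaware_probability(n))
```

With a single router the probability is 0, with three routers it is 2/3, and it approaches 1 as nodes are added. At that point a router's own queue says almost nothing about which dynos are actually busy.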

We did not document this evolution for our customers, nor did we update our reporting to match the changing behavior. As a result, customers were presented with confusing metrics. Specifically, our router logs captured the service time and the depth of the per-app request queue and presented them to customers, who in turn relied on these metrics to determine scaling needs. However, as the cluster grew, the time-and-depth metric for an individual router was no longer a relevant way to determine latency in your app.

As a result, customers experienced what was effectively random load balancing applied to their Bamboo applications. This was not caused by an explicit change to the Bamboo routing code. Nor was it related to the new routing logic on Cedar. It was a pure side-effect of the expansion of the routing cluster.

No path for concurrent Rails apps on Cedar

We launched Cedar in beta in May 2011 with support for Node.js and Ruby on Rails. Our documentation recommends the use of Thin, which is a single-threaded, evented web server. In theory, an evented server like Thin can process multiple concurrent requests, but doing this successfully depends on the code you write and the libraries you use. Rails, in fact, does not yet reliably support concurrent request handling. This leaves Rails developers unable to leverage the additional concurrency capabilities offered by the Cedar stack, unless they move to a concurrent web server like Puma or Unicorn.

Rails apps deployed to Cedar with Thin can rather quickly end up with request queuing problems. Because the Cedar router no longer does any queuing on behalf of the app, requests queued at the dyno must wait until the single Rails process works its way through the queue. Many customers have run into this issue and we failed to take action and provide them with a better approach to deploying Rails apps on Cedar.
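The effect of dyno-level queuing can be sketched with a short model. This is an illustration under assumptions, not a measurement of any real app: `dyno_latencies` is a hypothetical helper, and `workers` stands in for the number of requests a dyno can serve at once (1 for a single Rails process, more for a multi-process server such as Unicorn).

```python
import heapq


def dyno_latencies(arrivals, durations, workers=1):
    """Return end-to-end latency in ms (queue wait plus service time) for
    requests delivered to one dyno with the given number of workers."""
    free_at = [0] * workers          # min-heap of times at which each worker is free
    latencies = []
    for arrival, duration in zip(arrivals, durations):
        worker_free = heapq.heappop(free_at)   # earliest available worker
        start = max(arrival, worker_free)
        finish = start + duration
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


arrivals = [0, 10, 20]        # ms
durations = [5000, 50, 50]    # one slow request followed by two fast ones
print(dyno_latencies(arrivals, durations, workers=1))  # [5000, 5040, 5080]
print(dyno_latencies(arrivals, durations, workers=2))  # [5000, 50, 90]
```

In this toy workload a single process makes both fast requests wait behind the slow one, while a second worker lets them complete in well under 100ms.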

Next Steps

To reiterate, here is what we are doing now:

  • Improving our documentation so that it accurately reflects how our service works across both Bamboo and Cedar stacks
  • Removing incorrect and confusing metrics reported by Heroku or partner services like New Relic
  • Adding metrics that let customers determine queuing impact on application response times
  • Providing additional tools that developers can use to augment our latency and queuing metrics
  • Working to better support concurrent-request Rails apps on Cedar

If you have thoughts or questions, please comment below or reach out to me directly at jesperj@heroku.com.

Bamboo Routing Performance

Yesterday, one of our customers let us know about significant performance issues they have experienced on Heroku. They raised an important issue and I want to let our community know about it. In short, Ruby on Rails apps running on Bamboo have experienced a degradation in performance over the past 3 years as we have scaled.

We failed to explain how our product works. We failed to help our customers scale. We failed our community at large. I want to personally apologize, and commit to resolving this issue.

Our goal is to make Heroku the best platform for all developers. In this case, we did not succeed. But we will make it right. Here’s what we are working on now:

  • Posting an in-depth technical review tomorrow
  • Quickly providing more visibility into your app’s queue of web requests
  • Improving our documentation and website to accurately reflect our product
  • Giving you tools to understand and improve the performance of your apps
  • Working closely with our customers to develop long-term solutions

I am committing to listening to you, acting quickly to meet your needs and making sure Heroku is a platform that you trust for all of your applications. If you have additional concerns, please let me know. My email address is oren.teich@heroku.com.

Oren Teich, GM, Heroku

Waza 2013 - Keynote Speakers

Waza (技) 2013 is less than a month away and we are excited to have a full lineup of speakers who will be talking about their perspectives on art and technique. In between the talks, take part in a unique blend of conversation and craft through hands-on workshops led by artisans teaching their trades, from origami creations to take-home woodblock prints and even a hand-crafted and dyed quilt. Join this celebration of skill and making at Waza 2013.

Waza Keynotes:

Michael Lopp: Rands in Repose

Michael has been blogging since 2002 as his alter-ego Rands in Repose.

Our favorite recent quote: "Engineers don’t hate process. They hate process that can’t defend itself." We couldn’t agree more.

Kirby Ferguson: Everything is a Remix

Kirby is a filmmaker, storyteller, and remixer. He is known for his fantastic video series Everything is a Remix.

Favorite quote: “We are not self-made. We are dependent on one another. Admitting this to ourselves isn't an embrace of mediocrity and derivativeness, it's a liberation from our misconceptions.”

He’s been featured at TED.

Speaker Lineup:

In addition to these fantastic keynotes, we have a full day of sessions that include:

Afterparty, sponsored by GitHub

Waza doesn’t end after the last speaker leaves the stage. All attendees are invited to join Heroku and GitHub at the Waza afterparty. Live music, drinks, and a great time are on the schedule.

Don’t miss out - Register Today!

Cross-Site Request Forgery Vulnerability Resolution

On Friday, January 18, security researcher Benjamin Manns notified Heroku of a security vulnerability related to our add-ons program. At a high level, the vulnerability could have resulted in disclosing our Cross-Site Request Forgery tokens (these tokens are used to prevent browser hijacking) to third parties.

We quickly addressed the vulnerability and on Sunday, we deployed a patch to remediate the issue. We also reviewed our code for related vulnerabilities and conducted a review of our audit logs to determine the impact of the vulnerability. We found no instances of this issue being exploited.

We wish to thank Mr. Manns for his work and commitment to responsible disclosure. You can access his write-up here: http://www.benmanns.com/posts/security-vulnerability-found-in-heroku-and-rails-form-tag/

We would also like to reaffirm our commitment to the security and integrity of our customers’ data and code. Nothing is more important to us.

Oren Teich, Chief Operating Officer

Dataclips 2.0 – Unlock the value of your data

An organization's data is its most valuable asset. Unfortunately, that data is usually trapped inside a database that only a privileged handful of people can access. Too often reports are manually generated and their results pasted into emails; dashboards get built but rapidly become outdated and never answer the right questions.

We have so many great tools for collaborating around our source code, so why is data still in the dark ages? At Heroku Postgres, we believe that your data should flow like water. The most up-to-date data should be available any time you have a decision to make. Instead of being trapped in disparate systems, you should be able to move data smoothly between development, staging, and production. It should flow across apps, between teams, and between services.

That’s why we built Dataclips, a tool for sharing live query results. Think of it as pastebin for SQL, or gist for your data. Each dataclip is a sharable handle to a live query and is available in a variety of formats.

Read more about Dataclips 2.0, and how it can help you gain better insight into your data, over on the Heroku Postgres blog.
