Tuesday Postmortem

Tuesday was not a good day for Heroku and as a result it was not a good day for our customers. I want to take the time to explain what happened, how we addressed the problem, and what we’re doing in the future to keep it from happening again.

Over the past few weeks we have seen unprecedented growth in the rate of new applications being added to the platform. This growth has exacerbated a problem with our internal messaging systems that we’ve known about and been working to address. Unfortunately, the projects that we have underway to address the problem were planned based on previous growth rates and are not yet complete.

A slowdown in our internal messaging systems triggered a previously unknown bug in our distributed routing mesh. This bug caused the routing mesh to fail. After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service. This incompatibility forced us to move back to the newer routing mesh code, which required us to perform a “hot patch” of the production system to fix the initial bug. This patch was successful and all applications were returned to service.

As a result of the problems we have seen over the past couple of weeks, culminating with yesterday’s outage, we have reprioritized our ongoing projects. Several engineers have been dedicated to making short-term changes to the platform with an eye toward incrementally improving the stability of our messaging systems as quickly as possible.

The first of these projects was deployed to our production systems last night and is already making an impact. One of our operations engineers, Ricardo Chimal, Jr., has been working for some time on improving the way we target messages between components of our platform. We completed internal testing of these changes yesterday and they were deployed to our production cloud last night at 19:00 PDT (02:00 UTC).

After these changes were deployed, we immediately saw a dramatic improvement in the CPU utilization of our messaging system. The graph above, generated by Cloudkick (one of the tools we use to manage our infrastructure), shows a roughly 5x improvement on one of our messaging servers from this first batch of changes. Ricardo’s excellent work is already making a big impact, and we expect this progress to continue as additional improvements are rolled out over the coming days.

View our official reason for outage statement here:
http://status.heroku.com/incident/93

We know that you rely on Heroku to run your businesses and that we let you down yesterday. We’re sorry for the trouble yesterday’s problems caused you. We appreciate your faith in us; we’re not going to rest until we’ve lived up to it.
