As an SRE (Service Reliability Engineer) at Heroku, one of the things I’m exposed to is how much work happens behind the scenes to create what we call “non-events” for you, our users. A non-event is turning something that would typically create work for an application hosted on traditional infrastructure into something the user won’t even notice. We put a lot of energy into this because we believe in letting our users run apps instead of managing infrastructure. We make this investment because we know that every hour you spend managing infrastructure is an hour not spent building or maintaining your application. You need to be able to iterate quickly to maintain a competitive advantage, and that’s harder to do if you’re also managing infrastructure.
Two examples of these non-events from recent weeks are the “Shellshock” security flaw and Amazon having to reboot a large number of instances due to a security vulnerability in their hypervisor. This post is about what happened behind the scenes at Heroku to shield our users.
First, let’s start with the Shellshock security flaw. Shellshock, known more formally as CVE-2014-6271, is a vulnerability in the Bash shell that could allow an attacker to remotely execute arbitrary code. What made this one particularly severe was how widespread it was -- Bash is present on nearly every UNIX-like system, from Jane and John Doe’s MacBook Air, to their home router, to a Heroku dyno. When we were made aware of this vulnerability on Wednesday, September 24th, we responded immediately. Thanks to the fast response from our Security team and multiple engineering teams, we deployed multiple patches to thousands of servers within twelve hours. Our customers’ apps became more secure before many of them had even heard of the vulnerability, and that’s the way we like it.
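For the curious, the bug boiled down to Bash executing code smuggled in after a function definition in an environment variable. A widely circulated one-liner demonstrated the check (a sketch; `shellshock-check` is just an arbitrary marker string, not anything from our tooling):

```shell
# Shellshock check: export a function definition in an environment
# variable with an extra command appended after the closing brace.
# A vulnerable bash also executes the trailing `echo vulnerable`;
# a patched bash ignores it and runs only the intended command.
env x='() { :;}; echo vulnerable' bash -c "echo shellshock-check"
```

On a patched system the output is just the marker string; on a vulnerable Bash an extra `vulnerable` line appears first.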
Next, we had to ensure that Amazon’s mass reboots were not going to disrupt or degrade your applications. For the unaware, Amazon announced two weeks ago that they would be rebooting large numbers of instances to patch a then-unnamed security vulnerability. Of course, Amazon was not alone; many infrastructure providers responded to this vulnerability in the same manner. Reboots are a necessary part of maintenance and the ongoing operation of a service, which is why we ensure that our platform is erosion-resistant.
When Amazon originally reached out to us on Wednesday, September 24th, we did not yet know which of our instances, or how many, were going to be rebooted. Rather than wait and risk getting caught off-guard, we put together a plan that would allow us to survive the loss of each availability zone over four to five days in both the US and EU regions. In addition to the technical plan, we also made sure that our on-call engineers and incident commanders were ready for a long weekend.
When we finally learned which instances were going to be rebooted, we were able to scale this plan back significantly, even avoiding the maintenance window entirely for some services by preemptively evacuating the availability zones scheduled for maintenance. In the end, we did have one minor logging incident related to the Amazon reboots, but thanks to our careful planning the maintenance window was largely a non-event for most of our customers.
One of the major benefits of running your applications on Heroku versus the alternatives is the non-events. On Heroku, non-events are built in, but on other platforms you still need a team of people with the time and expertise to absorb surprises like Shellshock and the AWS mass reboots. They must manage your infrastructure, track security vulnerabilities, carry a pager, and turn surprises into non-events for you. With Heroku, all of this is done for you -- allowing you to focus on building great experiences for your users.