Heroku Behind the Curtain: Patching the glibc Security Hole

If you’re a developer, it’s unlikely you’ve ever said, “I wish I could spend a whole day patching critical security holes in my infrastructure!” (If you do, we’re hiring.) And if you’re running a business, it’s unlikely you’ve ever said, “Yes! I would like my developers to lose a day’s worth of feature-building on security patches!”

At Heroku, we believe you shouldn’t have to spend the time required to patch, test, and deploy security fixes. Because of that, some of Heroku’s most important features are ones you never see: we keep our platform reliable and secure for your apps so you don’t have to.

Recently, Google Security and Red Hat both discovered a high-severity bug (CVE-2015-7547) in a fundamental system library, glibc, which is in common use across the internet. If a server running a vulnerable version of the library makes a DNS request to a malicious resolver, that resolver can potentially execute code on the system making the request.
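
To make the attack surface concrete, here is a minimal Python sketch (an illustration only, not Heroku code) of how ordinary application code reaches the affected path: on Linux, CPython’s socket.getaddrinfo() is a thin wrapper over glibc’s getaddrinfo(), which performs the DNS lookup where the bug lives.

```python
# Illustration only: ordinary name resolution reaches the vulnerable code path.
# On Linux, CPython's socket.getaddrinfo() calls into glibc's getaddrinfo(),
# which performs the DNS query that the glibc bug affects.
import socket

def resolve(hostname):
    # A lookup like this goes through glibc's stub resolver and the DNS
    # servers configured in /etc/resolv.conf. A malicious or compromised
    # resolver could answer an unpatched glibc with a crafted, oversized
    # response and potentially execute code in this process.
    return socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)

if __name__ == "__main__":
    for *_, sockaddr in resolve("example.com"):
        print(sockaddr)
```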

What do we do when a security vulnerability lands?

Heroku took the glibc issue very seriously. We’ve done a lot of work to make sure our dyno containers are secure and we do everything possible to keep our customers safe.

Our first step in any security incident is an immediate assessment by our security team. They work with our engineering teams to determine how big a risk a given vulnerability poses to us. In this case, the potential for remote code execution meant we considered it a high-priority patch for any system running glibc and querying DNS. That’s pretty much all of them.

We patched our entire runtime fleet for both our Common Runtime and Private Spaces platforms. We also patched our Cedar stack image, to ensure that all the code you’re running in your dynos stays safe. Last and most complicated, we patched our Data platform (Postgres and Redis) while keeping your data safe and available.

How do we do this with a minimum of downtime?

We have standard practices for rolling out changes such as upgrades, new features, or security patches. These practices vary depending on the platform we’re applying the changes to.

For the Common Runtime and Private Spaces, this process is built into our infrastructure. When we push a new software version, we build a new base image (the image dynos live on top of) and cycle your dynos onto fresh instances running the new image.

An automated Continuous Integration (CI) pipeline rebuilds our base images every time we update one of our components, pulling in the latest Ubuntu base image each time. The new base image is automatically used by our automated tests and by new runtimes in staging. We then cut a release based on it and trigger the upgrade: first to our existing staging fleet, then to a small subset of production, then to the entire production fleet, with tests running between each stage.
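
As a rough sketch of that promotion flow (not our actual tooling; deploy_to and run_smoke_tests are hypothetical helpers), the logic looks something like this:

```python
# A minimal sketch of the staged rollout described above; not our actual
# tooling. deploy_to() and run_smoke_tests() are hypothetical helpers.
STAGES = [
    "staging-fleet",       # existing staging runtimes
    "production-canary",   # a small subset of production
    "production-fleet",    # the entire production fleet
]

def roll_out(base_image, deploy_to, run_smoke_tests):
    """Promote a new base image one stage at a time, testing between stages."""
    for stage in STAGES:
        deploy_to(stage, base_image)
        if not run_smoke_tests(stage):
            # A failure stops the rollout before the change reaches the
            # next, larger group of runtimes.
            raise RuntimeError(f"tests failed at {stage}; halting rollout")
```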

In this case, we ensured that the latest base image contained the patched version of glibc and manually triggered our normal staged update and testing process.

What about the dynos themselves? They are security-hardened Linux containers that are isolated from the base image they run on. Dynos are composed of a few pieces: your code (which we call a slug), your language-specific buildpack, and what we call the stack image. The stack image provides basic system resources like glibc, and we make sure it carries the latest security patches. In this case, we updated our stack image at the same time we updated our base image, as soon as we had a patched version of glibc. As dynos cycle, they pick up the new stack image.
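
If you’re curious which glibc build a process inside a dyno has actually loaded, one quick check from Python uses glibc’s own gnu_get_libc_version() via ctypes. Note that it reports the upstream version string (for example “2.19”); distributions usually backport security fixes without changing that string, so the package changelog remains the authoritative record of a specific patch.

```python
# Check which glibc build this process is linked against, using glibc's own
# gnu_get_libc_version(). It returns the upstream version string (e.g. "2.19");
# distro security patches are usually backported without changing it, so the
# package changelog is the authoritative record of a specific fix.
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.gnu_get_libc_version.restype = ctypes.c_char_p

print("glibc version:", libc.gnu_get_libc_version().decode())
```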

This dyno cycling takes 24 hours by default, and if we feel the need to move more quickly, we can force a faster refresh. In this case we waited for our normal 24-hour cycle rather than forcing a refresh, which would have restarted everything in quick succession and caused disruption customers could notice. Our assessment of the vulnerability was that the risk was not high enough to justify potentially affecting running apps.
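
As a toy illustration of that trade-off (not Heroku’s actual scheduler), spreading restarts randomly across a window smooths the disruption, while shrinking the window speeds up the rollout:

```python
# A toy illustration of the trade-off above; not Heroku's scheduler.
import random

def schedule_restarts(dyno_ids, window_hours=24):
    """Assign each dyno a random restart offset (in hours) within the window."""
    return {dyno: round(random.uniform(0, window_hours), 2) for dyno in dyno_ids}

# Normal cycling: restarts spread over 24 hours, so disruption is gradual.
print(schedule_restarts(["web.1", "web.2", "worker.1"]))
# A forced refresh is the same idea with a much smaller window: faster
# protection, but every dyno restarts within a short span of time.
print(schedule_restarts(["web.1", "web.2", "worker.1"], window_hours=1))
```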

What about data?

Postgres, Redis, and data stores in general are more complicated to update. They hold irreplaceable customer information that we can’t simply architect around the way we do for our runtimes. We need to be both available and secure, patching servers while keeping your data flowing.

To solve this problem for Heroku Postgres, we have follower databases, which automatically receive a copy of all the data written to your main database. When we need to update quickly, we can create a new, already-patched follower first and then promote that follower to be the main database. This does cause a short period of downtime, which is why we let you set maintenance windows so you can anticipate interruptions. In this case, based on our assessment of the vulnerability, most customers received their updates within their expected maintenance window.
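
A simplified sketch of that changeover might look like the following, where create_follower, replication_lag_seconds, and promote are hypothetical stand-ins for the real orchestration (on Postgres, a follower’s lag can be derived from pg_last_xact_replay_timestamp()):

```python
# A simplified sketch of the follower changeover; not our actual tooling.
# create_follower(), replication_lag_seconds(), and promote() are hypothetical.
import time

def patch_via_follower(primary, create_follower, replication_lag_seconds,
                       promote, max_lag_seconds=1.0):
    """Replace a primary database with a follower built on patched hosts."""
    follower = create_follower(primary)       # provisioned on the patched image
    while replication_lag_seconds(follower) > max_lag_seconds:
        time.sleep(5)                         # let the follower catch up
    # The promotion itself is the brief window of downtime mentioned above,
    # which is why it is scheduled inside your maintenance window.
    return promote(follower)
```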

For Heroku Redis, we have a similar story, including maintenance windows. Again, most customers received their updates in their expected maintenance window.

What about Heroku itself?

A lot of Heroku runs on our own platform! We use the same tools that keep our customers secure and stable for our own systems. We did schedule some maintenance to patch our own internal databases using the follower changeover process described above. This maintenance affected our API, deployment, and orchestration systems for a few minutes per database. It didn’t affect our routers or runtimes, so your customers could still reach your apps even while that maintenance was underway.

Keep calm, carry on

The need to patch the occasional security hole is serious and unavoidable. Before this glibc issue, there was last year’s GHOST issue, and Heartbleed before that. At Heroku, we believe these patches shouldn’t disrupt your flow. We work very hard to handle security issues and other platform maintenance with minimal impact on you and your apps, so you can carry on with your work without distraction.

Originally published: March 21, 2016
