Over the past few weeks, Heroku proactively updated our entire Redis fleet with a version of Redis not vulnerable to CVE-2018-11218. This was an embargoed vulnerability, so we did this work without notifying our customers about the underlying cause. As always, our goal was to update all Heroku Key-Value Store instances well before the embargo expired.
As a Data Infrastructure Engineer at Heroku, I wanted to share how we manage large fleet operations such as this one. The most important aspect of our job is keeping customers safe from security vulnerabilities, while also minimizing disruption and downtime. Those two objectives are often at odds with each other, so we work hard to reduce the impact of these kinds of updates.
When patching a security vulnerability or performing any other fleet-wide operation, there are three main concepts we care about: designing infrastructure components to be immutable, having and following a well-defined operations process, and being aware of and mitigating employee fatigue and burnout.
I'll talk about each of these in more detail, but it's important to first understand how high availability works with Heroku Key-Value Store.
Heroku Key-Value Store and High Availability
All paid Heroku Key-Value Store plans include a High Availability (HA) feature. When the primary Redis instance fails, it is automatically replaced with a replica, called a standby. Standbys replicate from the primary asynchronously. This means that any developer using a paid Heroku Key-Value Store plan gets HA Redis without having to do any setup or ongoing operations.
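To make the asynchrony concrete, here is a minimal sketch of how a standby's replication state can be observed with the open-source redis-py client. The fields come from Redis's standard `INFO replication` output, but the host name is hypothetical and the snippet is purely illustrative, not how Heroku's control plane actually monitors standbys.

```python
# Illustrative only: inspect a standby's replication state with redis-py.
# The host name is hypothetical; credentials and TLS are omitted for brevity.
import redis

standby = redis.Redis(host="standby.example.internal", port=6379)

info = standby.info("replication")
print(info["role"])                  # "slave": this instance follows a primary
print(info["master_link_status"])    # "up" when the replication link is healthy
print(info["slave_repl_offset"])     # how far this standby has replicated
```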
How Does Heroku Patch a Security Vulnerability?
Immutable Infrastructure Helps Us Scale
The Heroku Data infrastructure team uses the principle of immutable infrastructure. In a traditional mutable server infrastructure, engineers and administrators can SSH into servers and change configurations or upgrade packages manually. These servers are “mutable” because they can change once created.
With immutable infrastructure, servers are never modified after they're deployed. If we need to change something, we build new servers from a common image and deploy them as replacements.
While mutable infrastructure may work well for smaller fleets, it is not a viable option for us. Heroku manages millions of databases. At that scale, it is not feasible to manage customizations on specific servers unless they are fully automated through our control planes. Additionally, if we find a change that would benefit one customer, we want to make sure all our customers get the benefit of that change.
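As a rough illustration of that replace-don't-modify pattern (and not Heroku's actual control plane code), a replacement flow might look like the sketch below; the `Server` type and the provisioning stub are hypothetical.

```python
# Hypothetical sketch of immutable replacement: servers are never patched in
# place; a new one is built from a vetted image and the old one is retired.
from dataclasses import dataclass
import uuid

@dataclass
class Server:
    server_id: str
    image_id: str

def provision_from_image(image_id: str) -> Server:
    # Stub for a control-plane call that boots a fresh server from the image.
    return Server(server_id=str(uuid.uuid4()), image_id=image_id)

def replace_server(old: Server, new_image_id: str) -> Server:
    new = provision_from_image(new_image_id)  # build new, never mutate old
    # ...health checks and traffic cutover would happen here...
    # The old server is simply de-provisioned; nobody SSHes in to patch it.
    return new
```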
A Well-Defined Operations Process Minimizes Errors
It is important to have a well-defined process and a clear understanding of our automation code when performing operations on many servers at once. One small error could impact thousands of customers and leave their applications with more downtime than necessary.
Here's how we perform a fleet-wide Redis patch:
- We create a new server image that includes all the changes we need to roll out (security patches, OS upgrades, configuration updates, etc.). We test this image, review the results, and then flag it for release.
- We replace all High Availability standbys with new ones that use the updated image.
- Once all standbys are patched:
  - We schedule maintenance for all customers on paid database plans. These maintenances are scheduled in each customer's next maintenance window, with at least three business days' warning. Customers can wait for the scheduled maintenance or run it ahead of the scheduled time if desired.
  - We ensure the standby is in sync with the primary, so that we do not lose data due to faulty replication. We monitor the correct functioning of each High Availability standby at all times during its lifetime, and page an operator in the rare event of a replication failure that goes unresolved for too long.
  - We break the replication link. The standby is no longer following the primary and is now capable of accepting writes (a sketch of these promotion steps follows this list).
  - We push a release. As an add-on provider, Heroku Key-Value Store updates the config vars on your application to point at the new primary.
  - We de-provision the old primary and create a new standby.
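To make the cutover concrete, here is a minimal sketch of the sync check and promotion using the open-source redis-py client. The host names are hypothetical, the offset comparison is simplified (in practice writes are quiesced and the check is more careful), and the release push and de-provisioning are only noted in comments because they go through Heroku's control plane and add-on integration, not a script like this.

```python
# Illustrative promotion sketch, not Heroku's actual tooling.
import redis

primary = redis.Redis(host="primary.example.internal", port=6379)
standby = redis.Redis(host="standby.example.internal", port=6379)

# 1. Ensure the standby has caught up with the primary before cutting over.
primary_offset = primary.info("replication")["master_repl_offset"]
standby_offset = standby.info("replication")["slave_repl_offset"]
assert standby_offset >= primary_offset, "standby is lagging; do not promote"

# 2. Break the replication link: the standby stops following the primary
#    and can now accept writes.
standby.execute_command("REPLICAOF", "NO", "ONE")

# 3. Push a release so the app's config vars point at the new primary
#    (done through the Heroku add-on integration, not shown here).
# 4. De-provision the old primary and create a new standby from the image.
```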
Customers receive one email to notify them about the upcoming maintenance and another to notify them of its completion. Once all HA Redis instances are updated, we gradually replace all our hobby plan databases with new ones running on the updated image.
A Focus on Employee Health Helps Us Respond When Needed Most
Fleet rolls generally result in more pages for the on-call operator to triage. This is due to the increased likelihood of hitting edge cases we have never seen before and therefore never automated. We work hard to fix these edge cases before they cause problems, but it's also important to keep an eye on the health of the on-call operator for when they do.
We have a few mechanisms to help reduce fatigue and burnout on our team:
- Our fleet roll code only schedules replacement operations during the current on-call operator's business hours. This limits burnout by reducing the risk of the fleet roll waking them up at night.
- Our control plane software automatically manages all fleet rolls. We determine a safe concurrency level to prevent too much activity from happening all at once.
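As a rough sketch of how both mechanisms can be combined, the loop below only schedules work during a business-hours window and caps how many replacements run at once. The window, concurrency level, and `replace_one` callable are hypothetical, not our actual control plane code.

```python
# Hypothetical pacing logic for a fleet roll: business hours only, bounded concurrency.
import concurrent.futures
from datetime import datetime

MAX_CONCURRENT_REPLACEMENTS = 5          # assumed safe concurrency level
BUSINESS_HOURS = range(9, 17)            # 9:00-17:00, operator's local time

def in_business_hours(now: datetime) -> bool:
    return now.weekday() < 5 and now.hour in BUSINESS_HOURS

def roll_fleet(servers, replace_one):
    """Replace servers in small batches, only while an operator is on duty."""
    pending = list(servers)
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=MAX_CONCURRENT_REPLACEMENTS
    ) as pool:
        while pending and in_business_hours(datetime.now()):
            batch = pending[:MAX_CONCURRENT_REPLACEMENTS]
            pending = pending[MAX_CONCURRENT_REPLACEMENTS:]
            list(pool.map(replace_one, batch))  # wait for the batch to finish
```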
Outcome
In the end, we successfully completed a fleet roll for all Heroku Key-Value Store customers within five weeks of the vulnerability notification. We beat the public announcement by a week too.
We invest heavily in our people and process to make these kinds of updates both possible and predictable. In fact, we follow a similar process for the rest of Heroku Data infrastructure, including Heroku Postgres and Apache Kafka on Heroku. It's all part of our commitment to provide you with the best possible service.
If we do our job well, you won't notice any of this effort, but we take pride in that too.