How We Found and Fixed a Filesystem Corruption Bug

As part of our commitment to security and support, we periodically upgrade the stack image, so that we can install updated package versions, address security vulnerabilities, and add new packages to the stack. Recently we had an incident during which some applications running on the Cedar-14 stack image experienced higher than normal rates of segmentation faults and other “hard” crashes for about five hours. Our engineers tracked down the cause of the error to corrupted dyno filesystems caused by a failed stack upgrade. The sequence of events leading up to this failure, and the technical details of the failure, are unique, and worth exploring.

Background

Heroku runs application processes in dynos, which are lightweight Linux containers, each with its own, isolated filesystem. Our runtime system composes the container’s filesystem from a number of mount points. Two of these mount points are particularly critical: the /app mount point, which contains a read-write copy of the application, and the / mount point, which contains the container’s stack image, a prepared filesystem with a complete Ubuntu installation. The stack image provides applications running on Heroku dynos with a familiar Linux environment and a predictable list of native packages. Critically, the stack image is mounted read-only, so that we can safely reuse it for every dyno running on the same stack on the same host.

Heroku Filesystem Diagram

Given the large number of customer dynos we host, the stack upgrade process is almost entirely automated, and it’s designed so that a new stack image can be deployed without interfering with running dynos so that our users aren’t exposed to downtime on our behalf. We perform this live upgrade by downloading a disk image of the stack to each dyno host and then reconfiguring each host so that newly-started dynos will use the new image. We write the newly-downloaded image directly to the data directory our runtime tools use to find images to mount, so that we have safety checks in the deployment process, based on checksum files, to automatically and safely skip the download if the image is already present on the host.

Root Causes

Near the start of December, we upgraded our container tools. This included changing the digest algorithms and filenames used by these safety checks. We also introduced a latent bug: the new version of our container tools didn't consider the checksum files produced by previous versions. They would happily install any disk image, even one that was already present, as long as the image had not yet been installed under the new tools.

We don’t often re-deploy an existing version of a stack image, so this defect might have gone unnoticed and would eventually have become irrelevant. We rotate hosts out of our runtime fleet and replace them with fresh hosts constantly, and the initial setup of a fresh host downloads the stack image using the same tools we use to roll out upgrades, which would have protected those hosts from the defect. Unfortunately, this defect coincided with a second, unrelated problem. Several days after the container tools upgrade, one of our engineers attempted to roll out an upgrade to the stack image. Issues during this upgrade meant that we had to abort the upgrade, and our standard procedure to ensure that all container hosts are running the same version when we abort an upgrade involves redeploying the original version of the container.

During redeployment, the safety check preventing our tools from overwriting existing images failed, and our container tools truncated and overwrote the disk image file while it was still mounted in running dynos as the / filesystem.

Technical Impact

The Linux kernel expects that a given volume, whether it’s backed by a disk or a file, will go through the filesystem abstraction whenever the volume is mounted. Reads and writes that go through the filesystem are cached for future accesses, and the kernel enforces consistency guarantees like “creating a file is an atomic operation” through those APIs. Writing directly to the volume bypasses all of these mechanisms, completely, and (in true Unix fashion) the kernel is more than happy to let you do it.

During the incident, the most relevant consequence for Heroku apps involved the filesystem cache: by truncating the disk image, we’d accidentally ensured that reads from the image would return no data, while reads through the filesystem cache would return the data from the previously-present filesystem image. There’s very little predictability to which pages will be in the filesystem cache, so the most common effect on applications was that newly-loaded programs would partially load from the cache and partially load from the underlying disk image, mid-download. The resulting corrupted programs crashed, often with a segmentation fault, the first time they executed an instruction that attempted to read any of the missing data, or the first time they executed an instruction that had, itself, been damaged.

During the incident, our response lead put together a small example to verify the effects we’re seeing. If you have a virtual machine handy, you can reproduce the problem yourself, without all of our container infrastructure. (Unfortunately, a Docker container won’t cut it: you need something that can create new mount points.)

Create a disk image with a simple program on it. We used sleep.

dd if=/dev/zero of=demo.img bs=1024 count=10240
mkfs -F -t ext4 demo.img
sudo mkdir -p /mnt/demo
sudo mount -o loop demo.img /mnt/demo
sudo cp -a /bin/sleep /mnt/demo/sleep
sudo umount /mnt/demo

Make a copy of the image, which we’ll use later to simulate downloading the image:
```
cp -a demo.img backup.img
```
Mount the original image, as a read-only filesystem:
```
sudo mount -o loop,ro demo.img /mnt/demo
```
In one terminal, start running the test program in a loop:
```
while /mnt/demo/sleep 1; do
    :
done
```

In a second terminal, replace the disk image out from underneath the program by truncating and rewriting it from the backup copy:

while cat backup.img > demo.img; do
    # flush filesystem caches so that pages are re-read
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
done

Reliably, sleep will crash, with Segmentation fault (core dumped). This is exactly the error that affected customer applications.

This problem caught us completely by surprise. While we had taken into account that overwriting a mounted image would cause problems, none of us fully understood what those problems would be. While both our monitoring systems and our internal userbase alerted us to the problem quickly, neither was able to offer much insight into the root cause. Application crashes are part of the normal state of our platform, and while an increase in crashes is a warning sign we take seriously, it doesn’t correlate with any specific causes. We were also hampered by our belief that our deployment process for stack image upgrades was designed not to modify existing filesystem images.

The Fix

Once we identified the problem, we migrated all affected dynos to fresh hosts, with non-corrupted filesystems and with coherent filesystem caches. This work took the majority of the five hours during which the incident was open.

In response to this incident, we now mark filesystem images as read-only on the host filesystem once they’re installed. We’ve re-tested this, under the conditions that lead to the original incident, and we’re confident that this will prevent this and any overwriting-related problems in the future.

We care deeply about managing security, platform maintenance and other container orchestration tasks, so that your apps "just work" - and we're confident that these changes make our stack management even more robust.

Video Transcript

How We Found and Fixed a Filesystem Corruption Bug

Background

Root Causes

Technical Impact

The Fix