This post is an update on a previous post about how Heroku handles incident response.

As a service provider, when things go wrong, you try to get them fixed as quickly as possible. In addition to technical troubleshooting, there’s a lot of coordination and communication that needs to happen in resolving issues with systems like Heroku’s.

At Heroku we’ve codified our practices around these aspects into an incident response framework. Whether you’re just interested in how incident response works at Heroku, or looking to adopt and apply some of these practices for yourself, we hope you find this inside look helpful.

Incident Response and the Incident Commander Role

We describe Heroku’s...

Incidents are inevitable. Any platform, large or small will have them. While resiliency work will definitely be an important factor in reducing the number of incidents, hoping to remove all of them (and therefore reach 100% uptime) is not an achievable goal.

We should, however, learn as much as we can from incidents, so we can avoid repeating them.

In this post, we will look at one of those incidents, #2105, see how it happened (spoiler: I messed up), and what we’re doing to avoid it from happening again (spoiler: I’m not fired).

Your app is slow. It does not spark joy. This post will use memory allocation profiling tools to discover performance hotspots, even when they're coming from inside a library. We will use this technique with a real-world application to identify a piece of optimizable code in Active Record that ultimately leads to a patch with a substantial impact on page speed.

In addition to the talk, I've gone back and written a full technical recap of each section to revisit it any time you want without going through the video.

I make heavy use of theatrics here, including a Japanese voiceover artist, animoji, and some edited clips of Marie Kondo's Netflix TV show. This recording was done...

There are always challenges when it comes to debugging applications. Node.js' asynchronous workflows add an extra layer of complexity to this arduous process. Although there have been some updates made to the V8 engine in order to easily access asynchronous stack traces, most of the time, we just get errors on the main thread of our applications, which makes debugging a little bit difficult. As well, when our Node.js applications crash, we usually need to rely on some complicated CLI tooling to analyze the core dumps.

YAML files dominate configuration in the cloud native ecosystem. They’re used by Kuberentes, Helm, Tekton, and many other projects to define custom configuration and workflows. But YAML has its oddities, which is why the Cloud Native Buildpacks project chose TOML as its primary configuration format.

TOML is a minimal configuration file format that's easy to read because of its simple semantics. You can learn more about TOML from the official documentation, but a simple buildpack TOML file looks like this:

Browse the archives for engineering or all blogs Subscribe to the RSS feed for engineering or all blogs.