This post is an update on a previous post about how Heroku handles incident response.

As a service provider, when things go wrong, you try to get them fixed as quickly as possible. In addition to technical troubleshooting, there’s a lot of coordination and communication that needs to happen in resolving issues with systems like Heroku’s.

At Heroku we’ve codified our practices around these aspects into an incident response framework. Whether you’re just interested in how incident response works at Heroku, or looking to adopt and apply some of these practices for yourself, we hope you find this inside look helpful.

Incident Response and the Incident Commander Role

We describe Heroku’s...

Incidents are inevitable. Any platform, large or small will have them. While resiliency work will definitely be an important factor in reducing the number of incidents, hoping to remove all of them (and therefore reach 100% uptime) is not an achievable goal.

We should, however, learn as much as we can from incidents, so we can avoid repeating them.

In this post, we will look at one of those incidents, #2105, see how it happened (spoiler: I messed up), and what we’re doing to avoid it from happening again (spoiler: I’m not fired).

Your app is slow. It does not spark joy. This post will use memory allocation profiling tools to discover performance hotspots, even when they're coming from inside a library. We will use this technique with a real-world application to identify a piece of optimizable code in Active Record that ultimately leads to a patch with a substantial impact on page speed.

In addition to the talk, I've gone back and written a full technical recap of each section to revisit it any time you want without going through the video.

I make heavy use of theatrics here, including a Japanese voiceover artist, animoji, and some edited clips of Marie Kondo's Netflix TV show. This recording was done...

Moving shipping containers is heavy work. Moving a traditional industry into the digital age is a different kind of heavy job. Software development agency GNAR took on the challenge and built an ops management platform for RMS Intermodal, one of the largest rail yard operators in the U.S. Their IoT solution gave RMS a data-driven view of their operations for the first time, resulting in a whole new definition of “efficiency” for the company.

GNAR diagram

Serendipity manifests a new idea

It all started at a wedding. GNAR Founder Brandon Stewart found himself chatting with Adam Gray, the son of RMS Intermodal’s President, and the conversation turned to RFID tracking. Brandon had done some work on an...

There are always challenges when it comes to debugging applications. Node.js' asynchronous workflows add an extra layer of complexity to this arduous process. Although there have been some updates made to the V8 engine in order to easily access asynchronous stack traces, most of the time, we just get errors on the main thread of our applications, which makes debugging a little bit difficult. As well, when our Node.js applications crash, we usually need to rely on some complicated CLI tooling to analyze the core dumps.

Browse the blog archives or subscribe to the full-text feed.