All posts tagged with incident response

Incidents are inevitable. Any platform, large or small will have them. While resiliency work will definitely be an important factor in reducing the number of incidents, hoping to remove all of them (and therefore reach 100% uptime) is not an achievable goal.

We should, however, learn as much as we can from incidents, so we can avoid repeating them.

In this post, we will look at one of those incidents, #2105, see how it happened (spoiler: I messed up), and what we’re doing to avoid it from happening again (spoiler: I’m not fired).

Retrospectives are a valuable tool for software engineering teams. Heroku consistently uses retrospectives to review operational incidents, root cause problems, and generate remediation tasks to improve our systems. Increasingly we use retrospectives for another purpose: to improve teamwork and interactions on projects. Here we intentionally avoid technical discussions and focus on the emotional and human aspects of work, with the goal of creating positive insights into how to improve as a team.

Browse the blog archives or subscribe to the full-text feed.