Posted by Mark McGranaghan
As a service provider, when things go wrong you try to get them fixed as quickly as possible. In addition to technical troubleshooting, there’s a lot of coordination and communication that needs to happen in resolving issues with systems like Heroku’s.
At Heroku we’ve codified our practices around these aspects into an incident response framework. Whether you’re just interested in how incident response works at Heroku, or looking to adopt and apply some of these practices for yourself, we hope you find this inside look helpful.
We describe Heroku’s incident response framework below. It’s based on the Incident Command System used in natural disaster response and other emergency response fields. Our response framework and the Incident Commander role in particular help us successfully respond to a variety of incidents.
When an incident occurs, we follow these steps:
Move to a central chat room. Before starting work on the incident, move to a shared “Platform Incidents” HipChat room. This ensures everyone is on the same page about the initial response.
Designate IC. The Incident Commander (“IC”) is the leader of the response effort. The IC doesn’t fix issues directly or communicate personally with customers. Instead they’re responsible for the health of the incident response: ensuring that the right responders are involved, that everyone has the information they need, that all issues are covered, and that incident resolution is proceeding well overall.
By default the IC is the first person to notice the problem, but for significant incidents the role is usually transferred to a dedicated IC. Several people at Heroku are specifically trained to be ICs and can be paged into a situation with a HipChat bot.
Update public status site. Our customers want information about incidents as quickly as possible, even if it is preliminary. As soon as possible, the IC designates someone to take on the communications role (“comms”) with a first responsibility of updating the status site with our current understanding of the incident and how it’s affecting customers. The admin section of Heroku’s status site helps the comms operator to get this update out quickly.
The status update then appears on status.heroku.com and is sent to customers and internal communication channels via SMS, email, and HipChat bot.
Send out internal sitrep. Next the IC compiles and sends out the first situation report (“sitrep”) to the internal team describing the incident. It includes what we know about the problem, who is working on it and in what roles, and open issues. As the incident evolves, the sitrep acts as a concise description of the current state of the incident and our response to it. A good sitrep provides information to active incident responders, helps new responders get quickly up to date about the situation, and gives context to other observers like customer support staff.
The Heroku status site has a form for the sitrep, so that the IC can update it and the public-facing status details at the same time. When a sitrep is created or updated, it’s automatically distributed internally via email and HipChat bot. A versioned log of sitreps is also maintained for later review.
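To make the sitrep and its versioned log more concrete, here is a minimal sketch in Python. It is not Heroku's implementation: the class names, field names, and the append-only design are all assumptions chosen for illustration. The fields mirror what the sitrep is described as containing (what we know, who is working on it and in what roles, and open issues).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sitrep:
    """One version of a situation report. All field names are hypothetical."""
    summary: str                  # what we know about the problem
    responders: dict[str, str]    # person -> role, e.g. {"alice": "IC"}
    open_issues: tuple[str, ...]  # issues that are still unresolved
    version: int                  # 1 for the first sitrep, then 2, 3, ...
    created_at: datetime


class SitrepLog:
    """Append-only log: each update adds a new version and never edits an old one."""

    def __init__(self) -> None:
        self._versions: list[Sitrep] = []

    def update(self, summary, responders, open_issues) -> Sitrep:
        """Record a new sitrep version and return it."""
        sitrep = Sitrep(
            summary=summary,
            responders=dict(responders),
            open_issues=tuple(open_issues),
            version=len(self._versions) + 1,
            created_at=datetime.now(timezone.utc),
        )
        self._versions.append(sitrep)
        return sitrep

    def current(self) -> Sitrep:
        """Latest sitrep; raises IndexError if none has been sent yet."""
        return self._versions[-1]

    def history(self) -> list[Sitrep]:
        """All versions, oldest first, for later review."""
        return list(self._versions)
```

In this sketch a real system would also distribute each new version (for example by email or chat) inside `update`; that step is left out because the delivery mechanism is specific to each organization.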
Assess problem. The next step is to assess the problem in more detail. The goals here are to gain better information for the public status communication (e.g. what users are affected and how, what they can do to work around the problem) and more detail that will help engineers fix the problem (e.g. what internal components are affected, the underlying technical cause). The IC collects this information and reflects it in the sitrep so that everyone involved can see it.
Mitigate problem. Once the response team has some sense of the problem, it will try to mitigate customer-facing effects if possible. For example, we may put the Platform API in maintenance mode to reduce load on infrastructure systems, or boot additional instances in our fleet to temporarily compensate for capacity issues. A successful mitigation will reduce the impact of the incident on customer apps and actions, or at least prevent the customer-facing issues from getting worse.
Coordinate response. In coordinating the response, the IC focuses on bringing in the right people to solve the problem and making sure that they have the information they need. The IC can use a HipChat bot to page in additional teams as needed (the page will route to the on-call person for that team), or page individuals directly.
The IC may also create a shared Google Doc for the team to collect notes together in real time, or start a high-bandwidth video call to work through issues more quickly than is possible in text chat.
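The paging behavior described above, where a page to a team routes to that team's on-call person while an individual can also be paged directly, can be sketched roughly as follows. This is an illustrative assumption, not the actual bot: the `ON_CALL` schedule, the team and person names, and both function names are hypothetical.

```python
# Hypothetical on-call schedule: team name -> person currently on call.
ON_CALL = {"routing": "carol", "data": "dave"}


def resolve_page_target(target: str, on_call: dict[str, str]) -> str:
    """Map a team name to its on-call person; treat anything else as an individual."""
    return on_call.get(target, target)


def people_to_page(targets: list[str], on_call: dict[str, str]) -> list[str]:
    """Resolve every target and drop duplicates, keeping first-seen order."""
    people: list[str] = []
    for target in targets:
        person = resolve_page_target(target, on_call)
        if person not in people:
            people.append(person)
    return people


if __name__ == "__main__":
    # "routing" resolves to carol, so paging carol directly does not page her twice.
    print(people_to_page(["routing", "erin", "carol"], ON_CALL))
```

Deduplicating after resolution means a responder who is both on call for a team and paged by name receives a single page in this sketch.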
Manage ongoing response. As the response evolves, the IC acts as an information radiator to keep the team informed about what’s going on. The IC keeps track of who’s active on the response, which problems have been solved and which are still open, the resolution methods currently being attempted, and when we last communicated with customers, and reflects this back to the team regularly through the sitrep. Finally, the IC makes sure that nothing falls through the cracks: that no problems go unaddressed and that decisions are made in a timely manner.
Post-incident cleanup. Once the immediate incident has been resolved, the IC calls for the team to unwind any temporary changes made during the response. For example, alerts may have been silenced and need to be turned back on. The team double-checks that all monitors are green and that all incidents in PagerDuty have been resolved.
Post-incident follow-up. Finally, the IC will tee up post-incident follow-up. Depending on the severity of the incident, this could be a quick discussion in our normal weekly operational review or a dedicated internal post-mortem with an associated public post-mortem post. The post-mortem process often informs changes that we should make to our infrastructure, testing, and process; these are tracked over time within engineering as incident remediation items.
The incident response framework described above draws from decades of related work in natural disaster response, firefighting, aviation, and other fields that need to manage response to critical incidents. We try to learn from this body of work where possible to avoid inventing our incident response policy from first principles.
Two areas of previous work particularly influenced how we approach incident response:
Incident Command System. Our framework draws most directly from the Incident Command System used to manage natural disaster and other large-scale incident responses. This prior art informs our Incident Commander role and our explicit focus on facilitating incident response in addition to directly addressing the technical issues.
Crew Resource Management. The ideas of Crew Resource Management (“CRM”) originated in aviation but have since been successfully applied to other fields such as medicine and firefighting. We draw lessons on communication, leadership, and decision-making from CRM into our incident response thinking.
We believe that learning from fields outside of software engineering is a valuable practice, both for operations and other aspects of our business.
Heroku’s incident response framework helps us quickly resolve issues while keeping customers informed about what’s happening. We hope you’ve found these details about our incident response framework interesting and that they may even inspire changes in how you think about incident response at your own company.
At Heroku we’re continuing to learn from our own experiences and the work of others in related fields. Over time this will mean even better incident response for our platform and better experiences for our customers.