Stuff Goes Bad

The Heroku Routing team does a lot of work with Erlang, both in terms of development and maintenance, to make sure the platform scales smoothly as it continues to grow.

Over time we've picked up some hard-earned lessons about building systems that scale with some measure of reliability (or rather, we've definitely learned what doesn't work), and about the kind of operational work we can expect to have to do in anger.

This kind of knowledge usually remains embedded within the teams that develop it, and tends to die when individuals leave or change roles. When new members join the team, it gets passed on informally, through incident simulations, code reviews, and other similar practices, but never in a really persistent manner.

For the past year or so, bit by bit, I've tried to capture the broad lines of this knowledge and put it into a manual that we're proud to release today.

From the introduction:

This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.

This is our attempt at bridging the gap between what most tutorials, books, and training sessions cover and actually being able to operate, diagnose, and debug running systems once they've made it to production.

This manual adds to the Routing team's efforts to interact with the Erlang (and polyglot) community at large, sharing knowledge with teams from all over the place. It is available as a free PDF, under a Creative Commons license, at erlang-in-anger.com.

It comes just in time for the Chicago Erlang conference, dedicated to real-world applications in Erlang, where you'll be able to talk to a few members of Heroku's Routing team and a bunch of regulars from the Erlang community.

We hope this will prove useful to the community!

Also, Heroku is hiring! Check out our jobs page for opportunities to work on production systems at scale.

More from the author

Fred Hebert