Fun fact: the Heroku API consumes more endpoints than it serves. Our availability is heavily dependent on the availability of the services we interact with, which is the textbook definition of when to apply the circuit breaker pattern.
And so we did:
Circuit breakers really helped us keep the service stable despite third-party interruptions, as this graph of p95 HTTP queue latency shows.
Here I'll cover the benefits, challenges and lessons learned from introducing this pattern in a large-scale production app.
A brief reminder that everything fails
Our API composes over 20 services – some public (S3, Twilio), some internal (run a process, map a DNS record to an app) and some provided by third parties (provision New Relic for a new app).
The one thing they share in common is failure:
Amazon operates some of the most reliable services we consume, and yet if you make enough calls you'll eventually see a request hang for 28s. If there's one thing I've learned from operating a large-scale app for years, it's to expect incredibly high tail latencies, 500s, broken sockets and bad deploys on Fridays.
Circuit breaker basics
Fuses and circuit breakers were introduced to prevent house fires: when electrical wiring was first being built into houses, people would sometimes plug too much stuff into their circuits, warming up the wires – sometimes enough to start a fire.
Circuit breakers automatically detect this destructive pattern of usage, and interrupt the system before things get worse.
So the idea behind the software pattern is to wrap external service calls and start counting failures. At a certain threshold the circuit breaker trips, preventing any additional requests from going through. This is a great moment to fall back to cached data or a degraded state, if available.
Both sides can benefit from this: your app becomes more stable as it saves resources that would otherwise be spent calling an unresponsive service – and the receiving end will often be able to recover faster, as it doesn't have to handle incoming traffic during outages.
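To make the flow concrete, here's a minimal, single-process sketch of the pattern. The class and its parameters are illustrative only (it's not the CB2 gem described below), and it counts consecutive failures for simplicity: trip at a threshold, then let a trial request through after a cool-down.

# Minimal in-memory sketch of the pattern (illustrative only, not the CB2 gem
# used later in this post): count consecutive failures, trip at a threshold,
# and allow a trial request after a cool-down period.
class SimpleBreaker
  class OpenError < StandardError; end

  def initialize(threshold:, reenable_after:)
    @threshold      = threshold       # consecutive failures allowed before tripping
    @reenable_after = reenable_after  # seconds to keep the circuit open
    @failures       = 0
    @opened_at      = nil
  end

  def run
    raise OpenError if open?
    result = yield
    @failures = 0                     # a success resets the failure count
    result
  rescue OpenError
    raise                             # don't count short-circuited calls as failures
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold  # trip the breaker
    raise
  end

  private

  def open?
    return false unless @opened_at
    return true if Time.now - @opened_at < @reenable_after
    @opened_at = nil                  # cool-down elapsed: close and try again
    @failures  = 0
    false
  end
end

Callers wrap service calls in breaker.run { ... } and rescue the open-circuit error to serve a fallback, which is exactly the shape the CB2 examples below follow.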
Implementation
I maintain a Ruby gem called CB2 that implements circuit breakers backed by Redis:
require "cb2"
require "redis"

breaker = CB2::Breaker.new(
  service: "aws",       # identify each circuit breaker individually
  duration: 60,         # keep track of errors over a 1 min window
  threshold: 5,         # open the circuit breaker when the error rate reaches 5%
  reenable_after: 600,  # keep it open for 10 minutes after tripping
  redis: Redis.new)     # redis connection it should use to keep state
Once a circuit breaker is defined, you can use it to wrap service calls and handle open circuits:
begin
  breaker.run do
    some_api_request()
  end
rescue CB2::BreakerOpen
  alternate_response() # fall back to cached data, or raise a user-friendly exception
end
Constant vs percent-based breakers
I wrote this gem because every Ruby circuit breaker I could find tripped after a fixed error count.
While simpler, these breakers just don't scale well in production: make 10x more calls, and your circuits will open 10x more often. Avoid this by specifying the threshold as a percentage, which is in fact what Netflix uses.
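As a rough illustration of what a percentage-based breaker has to track, here's one way to keep an error rate over a rolling one-minute window in Redis. This is a sketch, not CB2's actual internals; the class, key names and error_rate helper are made up.

require "redis"

# Rough sketch of percentage-based tracking over a one-minute window.
# The key layout and this class are illustrative, not CB2's implementation.
class ErrorRateWindow
  def initialize(redis, service)
    @redis   = redis
    @service = service
  end

  def record(success)
    window = Time.now.to_i / 60                      # current one-minute bucket
    key    = "cb:#{@service}:#{window}"
    @redis.hincrby(key, "total", 1)
    @redis.hincrby(key, "errors", 1) unless success
    @redis.expire(key, 120)                          # let old buckets fall away
  end

  def error_rate
    window = Time.now.to_i / 60
    counts = @redis.hgetall("cb:#{@service}:#{window}")
    total  = counts.fetch("total", 0).to_i
    return 0.0 if total.zero?
    counts.fetch("errors", 0).to_i * 100.0 / total   # percentage, e.g. 5.0 means 5%
  end
end

With counts like these, the breaker trips when error_rate crosses the configured percentage, so the trip point stays put whether the service handles a hundred requests a minute or a hundred thousand.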
Deploying and monitoring your breakers
Enough talk about what circuit breakers are and how they're implemented.
Just like every other operational pattern, you should expect to spend only a fraction of your time writing code. Most of the effort goes into ensuring a smooth deploy and monitoring your production stack.
Roll-out strategy
Introduce logging-only circuit breakers at first. You'll want to review and tweak their parameters before affecting production traffic.
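One way to run this phase, sketched around the CB2 example above, is a wrapper that lets the breaker track failures as usual but only logs when it would have tripped, still letting the request through. The shadow_run helper and logger setup here are illustrative, not part of CB2.

require "logger"

LOGGER = Logger.new($stdout)

# Logging-only rollout: the breaker still records failures and changes state,
# but an open circuit is only logged and the request goes through regardless.
# shadow_run is an illustrative helper, not part of CB2.
def shadow_run(breaker, service_name)
  breaker.run { yield }
rescue CB2::BreakerOpen
  LOGGER.warn("breaker for #{service_name} would have blocked this request")
  yield  # still make the call while parameters are being tuned
end

shadow_run(breaker, "aws") { some_api_request() }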
Expect to run into surprises. In particular with internal APIs, where availability might not be as well defined or understood, you might see breakers changing state often. This is a great time to give other teams in your ecosystem visibility into these failures.
Monitoring
You obviously want to know when a breaker trips. For starters this could be as simple as sending an email to the team, although in the long term the best way to go is to feed circuit breaker state changes into your monitoring infrastructure.
At Heroku we use Librato extensively. Their graph annotations are a great way to store circuit breaker changes together with any system-wide changes you might have, like deployments:
But beware of race conditions when capturing circuit breaker state changes! CB2 and all of the libraries I've seen do not guarantee that only a single process will detect the state change. You'll want to use a lock or similar mechanism to avoid sending duplicated emails or annotations when circuit breakers trip under heavy traffic.
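Here's a minimal sketch of that guard using a Redis lock (SET with NX and an expiry); notify_breaker_open is a placeholder for whatever sends your email or annotation.

require "redis"

# Make sure only one process reports a given "breaker opened" event.
# notify_breaker_open stands in for your email or annotation code.
def report_open_once(redis, service)
  lock_key = "cb:#{service}:open-reported"
  # SET with nx: true succeeds only if the key doesn't exist yet; the expiry
  # (here matching the 10 minute reenable_after) releases the lock before the
  # breaker can trip again.
  acquired = redis.set(lock_key, "1", nx: true, ex: 600)
  notify_breaker_open(service) if acquired
end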
Timeout early, timeout often
Services operating slowly are often more damaging than services failing fast, so to get the most out of this pattern you'll want to specify timeouts for all service calls.
And be explicit about them! Most HTTP libraries have timeouts disabled or set really high. By the time you start seeing one-minute timeout errors in your logs you're probably already dependent on slow responses, which is a big part of the problem here: introducing and lowering timeouts in a large-scale app can be quite challenging.
You should also prefer timeouts set as close to the network library as possible, be it Postgres, Redis, AMQP or HTTP. At least in Ruby, the generic Timeout module is just unreliable and should be used only as a last resort.
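As an example, here's a sketch combining explicit Net::HTTP timeouts with the breaker from earlier, so both hard failures and slow responses count against the error rate. The 2 and 5 second values are illustrative and should be tuned to your own latency profile.

require "net/http"

# Explicit, library-level timeouts wrapped by the circuit breaker, so both
# hard failures and timeouts count as breaker errors.
def fetch_with_breaker(breaker, uri)
  breaker.run do
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl      = uri.scheme == "https"
    http.open_timeout = 2   # seconds to establish the connection
    http.read_timeout = 5   # seconds to wait for each read
    http.request(Net::HTTP::Get.new(uri))
  end
rescue CB2::BreakerOpen, Net::OpenTimeout, Net::ReadTimeout
  alternate_response()      # same fallback as in the earlier example
end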
Further reading
- Martin Fowler on CircuitBreaker
- Release It! by Michael Nygard has an excellent chapter on breakers and other stability patterns