How I Broke `git push heroku main`

engineering , Software Craftsman

Incidents are inevitable. Any platform, large or small, will have them. Resiliency work is an important factor in reducing the number of incidents, but removing all of them (and therefore reaching 100% uptime) is not an achievable goal.

We should, however, learn as much as we can from incidents, so we can avoid repeating them.

In this post, we will look at one of those incidents, #2105, and see how it happened (spoiler: I messed up) and what we’re doing to prevent it from happening again (spoiler: I’m not fired).


Dissecting Kubernetes Deployments

engineering , Software Craftsman

Kubernetes is a container orchestration system that originated at Google and is now maintained by the Cloud Native Computing Foundation. In this post, I am going to dissect some Kubernetes internals, especially Deployments and how gradual rollouts of new containers are handled.

What Is a Deployment?

This is how the Kubernetes documentation describes Deployments:

A Deployment controller provides declarative updates for Pods and ReplicaSets.

A Pod is a group of one or more containers that can be started inside a cluster. A Pod started manually is not very useful though, as it won't be restarted automatically if it crashes. A ReplicaSet ensures that a Pod...


Simulate Third-Party Downtime

engineering , Software Craftsman

I spend most of my time at Heroku working on our support tools and services; help.heroku.com is one such example. Heroku's help application depends on the Platform API to, amongst other things, authenticate users, authorize or deny access, and fetch user data.

So, what happens to tools and services like help.heroku.com during a platform incident? They must remain available to both agents and customers—regardless of the status of the Platform API. There is simply no substitute for communication during an outage.

To ensure this is the case, we use api-maintenance-sim, an app we recently open-sourced, to regularly simulate Platform API incidents.
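To make the idea concrete, here is a minimal sketch of the kind of behaviour such a simulation is meant to exercise: an application that keeps serving a degraded page when its upstream API fails. It does not show how api-maintenance-sim works. Every name in it (`ApiUnavailable`, `HealthyApi`, `DownApi`, `render_help_page`) is hypothetical and exists only for illustration.

```python
class ApiUnavailable(Exception):
    """Raised by a client when the upstream API cannot be reached."""


class HealthyApi:
    """Hypothetical client for an upstream API that is working normally."""

    def fetch_user(self, user_id):
        return {"id": user_id, "name": "Ada"}


class DownApi:
    """Hypothetical client that simulates an incident: every call fails."""

    def fetch_user(self, user_id):
        raise ApiUnavailable("simulated maintenance")


def render_help_page(api, user_id):
    """Render a page that stays available even when the upstream API is down."""
    try:
        user = api.fetch_user(user_id)
    except ApiUnavailable:
        # Degrade gracefully instead of returning an error page.
        return "Help Center (limited mode): sign-in is temporarily unavailable."
    return f"Help Center: welcome back, {user['name']}."


print(render_help_page(HealthyApi(), 1))
print(render_help_page(DownApi(), 1))
```

Swapping `HealthyApi` for `DownApi` in a test run is a small-scale version of simulating downtime: the failure path runs regularly instead of only during a real incident.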


Simulating downtime

During a Platform...


Time Out Quickly

engineering , Software Craftsman

Working with our support team, I often see customers having timeout problems. Typically, their applications will start throwing H12 errors.

The decision to time out requests quickly wasn't made to avoid having long-running requests on our router, nor to only have fast apps on our platform, but because standard web servers do not handle these types of requests particularly well.
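An application can apply the same principle on its own side by refusing to wait indefinitely on slow work. The sketch below is one possible approach and not something prescribed by the post. The `TIMEOUT_SECONDS` budget, `slow_operation`, and `call_with_timeout` are hypothetical names chosen for illustration.

```python
import concurrent.futures
import time

# Hypothetical budget for illustration; a real value depends on the application.
TIMEOUT_SECONDS = 0.2


def slow_operation(duration):
    """Stand-in for a slow dependency such as a database query or HTTP call."""
    time.sleep(duration)
    return "done"


def call_with_timeout(func, *args, timeout=TIMEOUT_SECONDS):
    """Run func in a worker thread and stop waiting after `timeout` seconds.

    Returns (ok, result). The worker thread is not killed on timeout; the
    caller simply stops waiting so it can answer quickly.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return True, future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block on the still-running worker.
        pool.shutdown(wait=False)


print(call_with_timeout(slow_operation, 0.01))
print(call_with_timeout(slow_operation, 0.5))
```

Because the abandoned work keeps running in the background, this pattern only bounds how long the caller waits. Timeouts set on the dependency itself (for example on the HTTP client or database driver) are still needed to free the underlying resources.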


Subscribe to the full-text RSS feed for Damien Mathieu.