Sometimes, innovation is born in the midst of a crisis. Unexpected challenges and a sense of urgency force companies to look for new ways of keeping the business going, even as the odds stack up against them. This was the case for one of our biggest customers Matalan, a major fashion and homeware retailer in the U.K. The company operates 230 brick and mortar stores across the country, and 30 international franchise stores within Europe and the Middle East. It also maintains a thriving online channel that runs on the SHIFT platform.


This summer, we announced the beta release of our new streaming data connectors between Heroku Postgres and Apache Kafka on Heroku. These connectors make Change Data Capture (CDC) possible on Heroku with minimal effort. Anyone with a Private or Shield Space, as well as a Postgres and an Apache Kafka add-on in that space, can use Streaming Data Connectors today at no additional charge.

Customers use connectors to build streaming data pipelines between Salesforce and external stores like a Snowflake data lake or an AWS Kinesis queue for integration with other data sources. They also refactor monoliths into microservices, implement an event-based architecture, archive data in lower-cost...


This post is an update on a previous post about how Heroku handles incident response.

As a service provider, when things go wrong, you try to get them fixed as quickly as possible. In addition to technical troubleshooting, there’s a lot of coordination and communication that needs to happen in resolving issues with systems like Heroku’s.

At Heroku we’ve codified our practices around these aspects into an incident response framework. Whether you’re just interested in how incident response works at Heroku, or looking to adopt and apply some of these practices for yourself, we hope you find this inside look helpful.

Incident Response and the Incident Commander Role

We describe Heroku’s...


Incidents are inevitable. Any platform, large or small will have them. While resiliency work will definitely be an important factor in reducing the number of incidents, hoping to remove all of them (and therefore reach 100% uptime) is not an achievable goal.

We should, however, learn as much as we can from incidents, so we can avoid repeating them.

In this post, we will look at one of those incidents, #2105, see how it happened (spoiler: I messed up), and what we’re doing to avoid it from happening again (spoiler: I’m not fired).


Your app is slow. It does not spark joy. This post will use memory allocation profiling tools to discover performance hotspots, even when they're coming from inside a library. We will use this technique with a real-world application to identify a piece of optimizable code in Active Record that ultimately leads to a patch with a substantial impact on page speed.

In addition to the talk, I've gone back and written a full technical recap of each section to revisit it any time you want without going through the video.

I make heavy use of theatrics here, including a Japanese voiceover artist, animoji, and some edited clips of Marie Kondo's Netflix TV show. This recording was done...


Browse the blog archives or subscribe to the full-text feed.