
Evolving the Backend Storage for Platform Metrics

One of our most important goals at Heroku is to be boring. Don’t get us wrong, we certainly hope that you’re excited about the Heroku developer experience — as heavy users of Heroku ourselves, we certainly are! But, even more so, we hope that you don’t have to spend all that much time thinking about Heroku. We want you to be able to spend your time thinking about the awesome, mission-critical things you’re building with Heroku, rather than worrying about the security, reliability, or performance of the underlying infrastructure they run on.

Keeping Heroku “boring” enough to be trusted with your mission-critical workloads takes a lot of not-at-all-boring work, however! In this post, we’d like to give you a peek behind the curtain at an infrastructure upgrade we completed last year, migrating to a new-and-improved storage backend for platform metrics. Proactively doing these kinds of behind-the-scenes uplifts is one of the most important ways that we keep Heroku boring: staying ahead of problems by continuously making things more secure, more reliable, and more efficient for our customers.

Metrics (as a Service)

A bit of context before we get into the details: our Application Metrics and Alerting, Language Runtime Metrics, and Autoscaling features are powered by an internal-to-Heroku service called “MetaaS,” short for “Metrics as a Service.” MetaaS collects many different “observations” from customer applications running on Heroku, like the amount of time it took to serve a particular HTTP request. Those raw observations are aggregated to calculate per-application, per-minute statistics like the median, max, and 99th-percentile response time. The resulting time series metrics are rendered on the Metrics tab of the Heroku dashboard, as well as used to drive alerting and autoscaling.
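To make the aggregation step a little more concrete, here is a minimal Python sketch of turning one app's raw observations for one minute into summary statistics. It is only an illustration of the idea, not MetaaS's actual code; the `Observation` shape, the metric name, and the nearest-rank percentile helper are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from statistics import median

@dataclass
class Observation:
    app_id: str    # which Heroku app produced the observation
    metric: str    # e.g. "router.service_time_ms" (illustrative name)
    minute: int    # the epoch minute this observation falls into
    value: float   # e.g. how long the request took, in milliseconds

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    index = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[index]

def aggregate_minute(observations: list[Observation]) -> dict:
    """Collapse one app's observations for one minute into per-minute stats."""
    values = [o.value for o in observations]
    return {
        "count": len(values),
        "median": median(values),
        "max": max(values),
        "p99": percentile(values, 99),
    }
```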

At the core of MetaaS lie two high-scale, multi-tenant data services. Incoming observations — a couple hundred thousand every second — are initially ingested into Apache Kafka. A collection of stream-processing jobs consume observations from Kafka as they arrive, calculate the various statistics we track for each customer application, publish the resulting time series data back to Kafka, and ultimately write it to a database (Apache Cassandra at the time) for longer-term retention and query. MetaaS’s time-series database stores many terabytes of data, with tens of thousands of new data points written every second and several thousand read queries per second at peak.

An architecture diagram of MetaaS matching the preceding description
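As a rough sketch of the shape of one of those stream-processing jobs, the loop below consumes raw observations from Kafka, buckets them per app and per minute, and publishes aggregated points back to Kafka for a downstream writer to persist. The topic names, message fields, and flush-on-every-message simplification are assumptions for the example, not details of the real jobs.

```python
# Sketch only: consume raw observations, aggregate, publish back to Kafka.
import json
from collections import defaultdict

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "raw-observations",                     # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

buckets: dict[tuple[str, int], list[float]] = defaultdict(list)

for message in consumer:
    obs = message.value                     # e.g. {"app_id": ..., "minute": ..., "value": ...}
    key = (obs["app_id"], obs["minute"])
    buckets[key].append(obs["value"])

    # A real job would flush a bucket once its minute has closed; publishing on
    # every message keeps this sketch short.
    values = buckets[key]
    producer.send("aggregated-metrics", {   # hypothetical output topic
        "app_id": obs["app_id"],
        "minute": obs["minute"],
        "count": len(values),
        "max": max(values),
    })
```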

A Storied Legacy

MetaaS is a “legacy” system, which is to say that it was originally designed a while ago and is still here getting the job done today. It’s been boring in all the ways that we like our technology to be boring; we haven’t needed to think about it all that much because it’s been reliable and scalable enough to meet our needs. Early last year, however, we started to see some potential “excitement” brewing on the horizon.

MetaaS runs on the same Apache Kafka on Heroku managed service that we offer to our customers. We’re admittedly a little biased, but we think the team that runs it does a great job, proactively taking care of maintenance and tuning for us to make sure things continue to be boring. The Cassandra clusters, on the other hand, were home-grown just for MetaaS. Over time, as is often the way with legacy systems, our operational experience with Cassandra began to wane. Routine maintenance became less and less routine. After a particularly rough experience with an upgrade in one of our test environments, it became clear that we were going to have a problem on our hands if we didn’t make some changes.

The general shape of Cassandra — a horizontally-scalable key/value database — remained a great fit for our needs. But we wanted to move to a managed service, operated and maintained by a team of experts in the same way our Kafka clusters are. After considering a number of options, we landed on AWS’s DynamoDB. Like Cassandra, DynamoDB traces its heritage (and its name) back to the system described in the seminal Amazon Dynamo paper. Other Heroku teams were already using DynamoDB for other use cases, and it had a solid track record for reliability, scalability, and performance.
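The post doesn't go into MetaaS's actual schema, but to show why a horizontally-scalable key/value store is such a natural fit for this workload, here is one plausible way a per-minute time-series table could be laid out in DynamoDB using boto3: a composite partition key identifying the series, and a sort key on the minute timestamp, so that a dashboard query becomes a simple range read. The table and attribute names are hypothetical.

```python
import boto3  # pip install boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="metaas-timeseries-example",  # hypothetical table name
    AttributeDefinitions=[
        {"AttributeName": "series_key", "AttributeType": "S"},  # e.g. "app-123#service_time"
        {"AttributeName": "minute_ts", "AttributeType": "N"},   # epoch minute
    ],
    KeySchema=[
        {"AttributeName": "series_key", "KeyType": "HASH"},     # partition key: which series
        {"AttributeName": "minute_ts", "KeyType": "RANGE"},     # sort key: when
    ],
    BillingMode="PAY_PER_REQUEST",
)
```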

A Careful Migration

Once the plan was made and the code was written, all that remained was the minor task of swapping out the backend storage of a high-scale, high-throughput distributed system without anyone noticing (just kidding, this was obviously going to be the hard part of the job 😅).

Thankfully, the architecture of MetaaS gave us a significant leg up here. We already had a set of stream-processing jobs for writing time-series data from Kafka to Cassandra. The first step of the migration was to stand up a parallel set of stream-processing jobs to write that same data to DynamoDB as well. This change had no observable impact on the rest of the system, and it allowed us to build confidence that DynamoDB was working and scaling as we expected.
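In spirit, the parallel writer looked something like the sketch below: a separate consumer group reading the same aggregated time series from Kafka and persisting each point to DynamoDB, while the existing Cassandra writer carried on untouched. The topic, table, and attribute names are the same illustrative ones used above, not the real ones.

```python
import json

import boto3
from kafka import KafkaConsumer

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
consumer = KafkaConsumer(
    "aggregated-metrics",                   # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="dynamodb-writer",             # separate consumer group from the Cassandra writer
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    point = message.value
    dynamodb.put_item(
        TableName="metaas-timeseries-example",
        Item={
            # single-metric simplification to match the earlier sketch
            "series_key": {"S": f"{point['app_id']}#service_time"},
            "minute_ts": {"N": str(point["minute"])},
            "stats": {"S": json.dumps({"count": point["count"], "max": point["max"]})},
        },
    )
```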

As we began to accumulate data in DynamoDB, we moved on to the next phase of the migration: science! We’re big fans of the open source scientist library from our friends over at GitHub, and we adapted a very similar approach for this migration. We began running a small percentage of read queries to MetaaS in “Science Mode”: continuing to read from Cassandra as usual, but also querying DynamoDB in the background and logging any queries that produced different results. We incrementally dialed the experiment up until 100% of production queries were being run through both codepaths. This change also had no observable impact, as MetaaS was still returning the data from Cassandra, but it allowed us to find and fix a couple of tricky edge cases that hadn’t come up in our more-traditional pre-production testing.
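The scientist library itself is Ruby, so the sketch below is just the pattern we borrowed, written as plain Python: always return the Cassandra result, run the DynamoDB read on a sample of queries, and log any disagreement. The function names and the sampling knob are placeholders for the example, not MetaaS's actual code.

```python
import logging
import random

logger = logging.getLogger("science")

def read_with_science(query, read_cassandra, read_dynamodb, sample_rate=0.01):
    """Serve from the old path while comparing against the new one in the background."""
    primary = read_cassandra(query)

    if random.random() < sample_rate:       # dial sample_rate up toward 1.0 over time
        try:
            candidate = read_dynamodb(query)
            if candidate != primary:
                logger.warning("result mismatch for %r: %r != %r", query, primary, candidate)
        except Exception:
            logger.exception("candidate read failed for %r", query)

    return primary                          # callers always see the Cassandra answer
```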

An updated architecture diagram showing parallel writes to both Cassandra and DynamoDB

A Smooth Landing

Once our science experiment showed that DynamoDB was consistently returning the same results as Cassandra, the rest of the migration was simply a matter of time. MetaaS stores data for a particular retention period, after which it ages out and is deleted (using the convenient TTL support that both Cassandra and DynamoDB implement). This meant that we didn’t need to orchestrate a lift-and-shift of data from Cassandra to DynamoDB. Once we were confident that the same data was being written to both places, we could simply wait for any older data in Cassandra to age out.
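For readers who haven't used it, DynamoDB's TTL support boils down to two things: point the table at an epoch-seconds attribute once, and stamp that attribute onto every item you write. The sketch below shows the general shape; the table name, attribute name, and 30-day retention period are assumptions for the example, not MetaaS's actual settings.

```python
import time

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
RETENTION_SECONDS = 30 * 24 * 60 * 60       # assumed 30-day retention, for illustration

# One-time table setting: DynamoDB deletes items shortly after "expires_at" passes.
dynamodb.update_time_to_live(
    TableName="metaas-timeseries-example",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# At write time, every data point carries its own expiry timestamp.
now = int(time.time())
dynamodb.put_item(
    TableName="metaas-timeseries-example",
    Item={
        "series_key": {"S": "app-123#service_time"},
        "minute_ts": {"N": str(now // 60)},
        "expires_at": {"N": str(now + RETENTION_SECONDS)},
    },
)
```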

Starting with our test environments, we began to incrementally cut a small percentage of queries over to only read from DynamoDB, moving carefully in case there were any reports of weird behavior that had somehow been missed by our science experiment. There were none, and 100% of queries to MetaaS have been served from DynamoDB since May of last year. We waited a few weeks just to be sure that we wouldn’t need to roll back, thanked our Cassandra clusters for their years of service, and put them to rest.
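As a quick aside on the mechanics of that kind of gradual cutover: a percentage-based flag handles it well. One common approach, sketched below under our own assumptions rather than as MetaaS's actual mechanism, is to hash a stable identifier such as the app ID into a bucket, so each app's reads stick to one backend as the rollout percentage grows.

```python
import hashlib

def use_dynamodb(app_id: str, rollout_percent: int) -> bool:
    """Deterministically assign an app to a 0-99 bucket and compare against the rollout dial."""
    bucket = int(hashlib.sha256(app_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

def read_timeseries(query, app_id, rollout_percent, read_cassandra, read_dynamodb):
    if use_dynamodb(app_id, rollout_percent):
        return read_dynamodb(query)         # new path, dialed up gradually to 100%
    return read_cassandra(query)            # old path, retired once the cutover is complete
```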

A screenshot of a Slack message saying "Goodnight, buddy. Sleep well" to mark the shutdown of the final Cassandra cluster

Conclusion

With a year of experience under our belt now, we’re feeling confident we made the right choice. DynamoDB has been boring, exactly as we hoped it would be. It’s been reliable at scale. We’ve spent a grand total of zero time thinking about how to patch the version of log4j it uses. And, for bonus points, it’s been both faster and cheaper than our self-hosted Cassandra clusters were. See if you can guess what time of day we finished the migration based on this graph of 99th-percentile query latency:

A graph showing 99th-percentile latencies varying between 20 and 80 milliseconds before 18:00, then dropping to a consistent 15 milliseconds after 18:00.

Our favorite part of this story? Unless you were closely watching page load times for the Heroku Dashboard’s Metrics tab at the time, you didn’t notice a thing. For a lot of the work we do here at Heroku, that’s the ultimate sign of success: no one even noticed. Things just got a little bit newer, faster, or more reliable under the covers.

For the moment, MetaaS is back to being a legacy system, doing its job with a minimum of fuss. If you’re interested in the next evolution of telemetry and observability for Heroku, check out the OpenTelemetry item on our public roadmap. It’s an area we’re actively working on, and we would love your input!

This post is a collaborative effort between Heroku and AWS, and it is published on both the Heroku Blog and the AWS Database Blog.

Originally published: May 09, 2024
