Router 2.0: The Road to Beta

Last month, Heroku announced the beta release of Router 2.0, the new Common Runtime router!

As part of our commitment to infrastructure modernization, Heroku is making upgrades to the Common Runtime routing layer. The beta release of Router 2.0 is an important step along this journey. We’re excited to give you an inside look at all we’ve been doing to get here.

In both the Common Runtime and Private Spaces, the Heroku router is responsible for serving requests to customers’ web dynos. In 2024, Router 2.0 will replace the existing Common Runtime router. We’re being transparent about this project so that you, our customers, are motivated to try out Router 2.0 now, while it’s in beta. As an early adopter, you can help us validate that things are working as they should, particularly for your apps and your use cases. You’ll also be first in line to try out the new features we’re planning to add, like HTTP/2.

Why a New Router?

Now, you may be asking, why build a new router instead of improving the existing one? Our primary motivator has been faster and safer delivery of new routing features for our customers. For a couple of reasons, this has been difficult to achieve with the Common Runtime’s legacy routing layer.

The current Common Runtime router is written in Erlang. It’s built around a custom HTTP server library that supports Heroku-specific features, such as H-codes, dyno sleeping, and router logs. For over 10 years, this router, dubbed “Hermes” internally, has served all requests to Heroku’s Common Runtime. At the time of Hermes’ launch, Erlang was an appropriate choice since the language places emphasis on concurrency, scalability, and fault tolerance. In addition, Erlang offers a powerful process introspection toolchain that has served our networking engineers well when debugging in-memory state issues. Our engineers embraced the language fully, also choosing to write the previous version of our logging system, Logplex, in Erlang.

However, as the years passed, development on the Hermes codebase proved difficult. The popularity of Erlang within Heroku began to taper off. The open-source and internal libraries that Hermes depends on stopped receiving the volume of contributions they once had. Productivity declined due to these factors, making significant router upgrades risky. After a few failed upgrade attempts, our team decided to pin the software versions of relevant Erlang components. This action wasn’t without trade-offs. Being pinned to an old version of Erlang became a blocker to delivering now-commonplace features like HTTP/2. Thus, we decided to put Hermes into maintenance mode and focus on its replacement.

Choosing a Language

Before kicking off design sessions, our team discussed what broader goals we had for the replacement. In establishing our priorities, the team came to a consensus around three main goals:

  • Write the router in a language everyone on our team knows well. With Erlang knowledge limited to just a couple of engineers on the team, we wanted to rewrite the router in a different language. That language had to be something our team already knew well.
  • Write the router in a language with a strong open-source community. A robust community unlocks the ability to quickly adopt new specs, write features, fix bugs, and respond to CVEs. It also expands the candidate pool when it comes time to hire new engineers.
  • Share as much code as possible between the Common Runtime and Private Spaces routers. Since the Common Runtime and Private Spaces routers share most of the same features, there’s no reason for the codebases to differ much. Additionally, it’s faster and easier to deliver a feature if we only have to write it once.

With these goals in mind, the language to choose for Router 2.0 was clear — Go.

Not only is the Private Spaces router already written in Go, but the language has become our standard choice for developing new components of Heroku’s runtime. This story isn’t at all unique. Many others in the DevOps and cloud hosting world today have chosen Go for its performance, built-in concurrency handling, automatic garbage collection, and more. Simply put, it’s a language designed for building large, dynamic distributed systems. Because of these factors, the Go community outside and within Heroku has flourished, with Go expertise in abundance across our runtime engineering teams.

Today, by writing Router 2.0 in Go, we’re creating a piece of software to which everyone on our team can contribute. Furthermore, by doubling down on the language of the Private Spaces router, we unify the source code and routing behavior of these two products. Historically, these codebases have been entirely distinct, meaning that any implementation our engineers introduce must be written twice. To combat this, we’ve extracted the common functionality of the two routers into an internal HTTP library. With a unified codebase, the delivery of features and fixes becomes faster and simpler, reducing the cognitive burden on our engineers who operate and maintain the routers.

Developing the router is only half the story, though. The other half is about introducing this service to the world as safely and seamlessly as possible.

Architecture

You may recall that back in 2021, Heroku announced the completion of an infrastructure upgrade to the Common Runtime that brought customers better performing dynos and lower request latencies. This upgrade involved an extensive migration from our old, “classic” cloud environment to our more performant and secure “sharded” environment. We wanted to complete this migration without disrupting any active traffic or asking customers to change their DNS setups. To do this, our engineers put an L4 reverse proxy in front of Hermes, straddling the classic and sharded environments. The idea was to slowly shift traffic over to the sharded environments, with the L4 proxy splitting connections to both the classic and the new “in-shard” Hermes instances.

Also as part of this migration, TLS termination on custom domains moved from Hermes to the L4 proxy.

This L4 proxy is the component that has formed the basis for Router 2.0. Over the past year, our networking team has been developing an L7 router to sit in-memory behind the L4 proxy. Today, the L4 proxy + Router 2.0 process runs alongside Hermes, communicating over the localhost network on our router instances. Putting these two processes side by side, instead of on separate hosts, means we limit the number of network hops between clients and backend dynos.

The Strangler Pattern

For apps still on the default routing path, client connections are established with the L4 proxy, which directs traffic through Hermes.

When an app has Router 2.0 enabled, the L4 proxy instead funnels traffic over an in-memory listener to Router 2.0, then out to the app’s web dynos. Hermes is cut out of the network path.

This sort of architecture has a particular name, the “Strangler pattern,” and it involves inserting a form of middleman between clients and the old system you want to replace. The middleman directs traffic, dividing it between the old system and a new system that is built out incrementally. The major advantage of such a setup is that “big bang” changes or “all-at-once” cut-overs are completely avoided. However, both the old and the new systems live on the same production hot path while the development of the new system is in progress. What has this meant for Router 2.0? Well, we had to lay a complete production-ready foundation early on.
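The traffic split at the heart of the Strangler pattern can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the real middleman is an L4 proxy handing connections to an in-memory listener, whereas this sketch dispatches at the HTTP handler level for brevity, and the `strangler` type, the `enabled` flag map, and the hostnames are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// strangler is a hypothetical middleman that sends each request to either the
// legacy system or its replacement, depending on a per-app flag.
type strangler struct {
	legacy  http.Handler    // stands in for Hermes
	next    http.Handler    // stands in for Router 2.0
	enabled map[string]bool // hostnames that have opted in to the new path
}

func (s *strangler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if s.enabled[r.Host] {
		s.next.ServeHTTP(w, r) // the legacy system is cut out of the path
		return
	}
	s.legacy.ServeHTTP(w, r) // default path: unchanged behavior
}

// named returns a handler that writes a fixed label, so the chosen path is visible.
func named(label string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, label)
	})
}

// routeFor reports which backend served a request for the given host.
func routeFor(s *strangler, host string) string {
	req := httptest.NewRequest(http.MethodGet, "http://"+host+"/", nil)
	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	s := &strangler{
		legacy:  named("hermes"),
		next:    named("router-2.0"),
		enabled: map[string]bool{"beta-app.example.com": true},
	}
	fmt.Println(routeFor(s, "beta-app.example.com"))  // opted in
	fmt.Println(routeFor(s, "other-app.example.com")) // still on the default path
}
```

Because the decision is made per app, flipping one entry in the flag map moves one app at a time, which is what lets the new system be built out incrementally.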

Living on the Hot Path

Heroku has always been an opinionated hosting and deployment platform that caters to general use cases. In our products, we optimize for stability while delivering innovation. Within the framing of Router 2.0, this commitment to stability meant we had to do a few things before releasing beta.

Automate Router Deployments

Up until recently, deploying Router 2.0 meant creating a new release and manually triggering router fleet cycles across all our production clouds. This process was not only tedious and time-consuming but also error-prone. We fixed this by building out an automation pipeline, outfitted with gates on availability metrics, performance metrics, and smoke tests. Any time a router release fails even one of these health checks, it doesn’t advance to the next stage of deployment.
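The gating behavior described above might look roughly like the following. This is a minimal sketch, not Heroku's pipeline: the stage names, the `gate` type, and the stub checks are all assumptions made for illustration.

```go
package main

import "fmt"

// gate is a hypothetical health check that a release must pass in every stage.
type gate struct {
	name  string
	check func(stage string) bool
}

// promote rolls a release through stages in order. After each stage it runs
// every gate; the first failure halts the rollout. It returns the stages that
// passed all gates and the name of the failing gate ("" if none failed).
func promote(stages []string, gates []gate) (passed []string, failed string) {
	for _, stage := range stages {
		for _, g := range gates {
			if !g.check(stage) {
				return passed, g.name
			}
		}
		passed = append(passed, stage)
	}
	return passed, ""
}

// always builds a stub check that passes in every stage.
func always() func(string) bool {
	return func(string) bool { return true }
}

// failsAt builds a stub check that fails only in the named stage.
func failsAt(bad string) func(string) bool {
	return func(stage string) bool { return stage != bad }
}

func main() {
	stages := []string{"staging", "canary", "production"} // hypothetical stage names
	gates := []gate{
		{"availability", always()},
		{"latency", always()},
		{"smoke-tests", failsAt("canary")},
	}
	passed, failed := promote(stages, gates)
	fmt.Println("passed:", passed, "halted by:", failed)
}
```

In this run the release clears the first stage, fails a smoke test in the second, and never reaches the third.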

Load Test Continuously

An important aspect of vetting the new sharded environments in 2021 was load testing the L4 proxy/Hermes combo. At the time, this was a significant manual undertaking. After manually running these tests, it became obvious that we would need a more practical load testing story while developing Router 2.0. In response, we built a load testing system to continuously push our staging routers to their limits and trigger scaling policies, so that we can also validate our autoscaling setup. This framework has been immensely valuable for Router 2.0 development, catching bugs and regressions before they ever hit production. The results of these load tests feed right back into our deployment pipeline, blocking any deploys that don’t live up to our internal service level objectives.
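One way to turn load test results into a deploy-blocking signal is to compare them against explicit objectives. The sketch below assumes two objectives, a p99 latency ceiling and a maximum error rate; the thresholds in `stagingSLO` and the `run` and `slo` types are invented for illustration and are not Heroku's actual service level objectives.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// slo holds hypothetical objectives a load test run must meet.
type slo struct {
	maxP99       time.Duration
	maxErrorRate float64
}

// stagingSLO uses made-up thresholds purely for illustration.
var stagingSLO = slo{maxP99: 100 * time.Millisecond, maxErrorRate: 0.01}

// run summarizes one load test: a latency per request, plus how many failed.
type run struct {
	latencies []time.Duration
	failed    int
}

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of ds.
func percentile(ds []time.Duration, p int) time.Duration {
	if len(ds) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), ds...) // leave the caller's slice alone
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer arithmetic
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsSLO reports whether a run satisfies both objectives. A run with no
// requests never passes, since it provides no evidence either way.
func meetsSLO(r run, s slo) bool {
	total := len(r.latencies)
	if total == 0 {
		return false
	}
	errorRate := float64(r.failed) / float64(total)
	return percentile(r.latencies, 99) <= s.maxP99 && errorRate <= s.maxErrorRate
}

// msRange returns latencies of 1ms, 2ms, ..., n ms as sample data.
func msRange(n int) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = time.Duration(i+1) * time.Millisecond
	}
	return out
}

func main() {
	fast := run{latencies: msRange(100)}
	slow := run{latencies: msRange(200)}
	fmt.Println("fast run deployable:", meetsSLO(fast, stagingSLO))
	fmt.Println("slow run deployable:", meetsSLO(slow, stagingSLO))
}
```

A pipeline could call a check like `meetsSLO` as one more gate, so that a regression caught under load stops the release before it reaches production.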

Introduce Network Error Logging

Traditionally, routing health has been measured through the use of “checkee” apps. These are web-server applications that we deploy across our production Common Runtime clouds and constantly probe from corresponding “checker” apps that run in Private Spaces. The checker-checkee duo allows us to mimic and measure our customers’ routing experience. In recent years, the gaps in this model have become more apparent. Namely, our checkees only represent the tiniest fraction of traffic pumping through the router at any given time. In addition, our checkers can’t possibly account for all the various client types and configurations that may be used to connect to the platform.

To address the gap, we introduced Network Error Logging (NEL) to both Hermes and Router 2.0. It’s an experimental W3C standard that enables the measurement of routing layer performance by collecting real-time data about network failures from web browsers. Google Chrome, Microsoft Edge, and certain mobile clients already support the spec. NEL ensures our engineers maintain a more holistic understanding of the routing experience actually felt by clients.
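For readers unfamiliar with NEL, a server opts in by sending two response headers: `Report-To`, which names a reporting endpoint group, and `NEL`, which points the policy at that group. The Go middleware below is a sketch of that mechanism only. The collector URL, group name, and `max_age` value are placeholders, and nothing here describes how Hermes or Router 2.0 actually configure these headers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reportGroup is an arbitrary name linking the NEL policy to its endpoints.
const reportGroup = "network-errors"

type endpoint struct {
	URL string `json:"url"`
}

// reportTo models the JSON value of the Report-To header.
type reportTo struct {
	Group     string     `json:"group"`
	MaxAge    int        `json:"max_age"`
	Endpoints []endpoint `json:"endpoints"`
}

// nelPolicy models the JSON value of the NEL header.
type nelPolicy struct {
	ReportTo string `json:"report_to"`
	MaxAge   int    `json:"max_age"`
}

// nelHeaderValues builds both header values for a collector URL and a policy
// lifetime in seconds.
func nelHeaderValues(collectorURL string, maxAge int) (nel, rt string) {
	n, err := json.Marshal(nelPolicy{ReportTo: reportGroup, MaxAge: maxAge})
	if err != nil {
		panic(err) // cannot happen for these plain structs
	}
	r, err := json.Marshal(reportTo{
		Group:     reportGroup,
		MaxAge:    maxAge,
		Endpoints: []endpoint{{URL: collectorURL}},
	})
	if err != nil {
		panic(err)
	}
	return string(n), string(r)
}

// withNEL wraps a handler so every response advertises the NEL policy.
func withNEL(next http.Handler, collectorURL string, maxAge int) http.Handler {
	nel, rt := nelHeaderValues(collectorURL, maxAge)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Report-To", rt)
		w.Header().Set("NEL", nel)
		next.ServeHTTP(w, r)
	})
}

func main() {
	app := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "https://app.example.com/", nil)
	withNEL(app, "https://collector.example.com/nel", 86400).ServeHTTP(rec, req)
	fmt.Println("NEL:", rec.Header().Get("NEL"))
	fmt.Println("Report-To:", rec.Header().Get("Report-To"))
}
```

Supporting browsers that receive these headers over HTTPS remember the policy for `max_age` seconds and later post reports about failed requests to the listed endpoint, which is how failures invisible to the server can still be observed.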

The Future

Completely retiring Hermes will take time. We’re only at the end of the beginning of that journey. As detailed in the Dev Center article, Router 2.0 isn’t complete yet because it doesn’t support the full list of features on our HTTP Routing page. We’re working on it. We’ll soon be adding HTTP/2 support, one of the most requested features, to both the Common Runtime and Private Spaces. However, in the Common Runtime, HTTP/2 will only be available when your app is using Router 2.0.

Our aim is to achieve feature parity with Hermes, plus a little more, over the next few months. Once we’re there, we’ll focus on a migration plan that involves flagging apps into Router 2.0 automatically. Much like in the migration from classic environments to sharded environments, we’ll break the process out into phases based on small batches of apps in similar dyno tiers. This approach gives us time to pause between phases and assess the performance of the new system.
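The phased approach can be pictured as grouping apps by dyno tier and then cutting each group into small batches. The sketch below is hypothetical: the `app` type, the tier ordering, and the batch size are assumptions chosen for illustration, not the actual migration plan.

```go
package main

import "fmt"

// app is a hypothetical record of an app and its dyno tier.
type app struct {
	name string
	tier string
}

// phases groups apps by tier, visits tiers in the given order, and splits each
// tier into batches of at most size apps. Apps whose tier is not listed in
// tierOrder are left out of the plan.
func phases(apps []app, tierOrder []string, size int) [][]app {
	if size <= 0 {
		return nil
	}
	byTier := make(map[string][]app)
	for _, a := range apps {
		byTier[a.tier] = append(byTier[a.tier], a)
	}
	var out [][]app
	for _, tier := range tierOrder {
		group := byTier[tier]
		for start := 0; start < len(group); start += size {
			end := start + size
			if end > len(group) {
				end = len(group)
			}
			out = append(out, group[start:end])
		}
	}
	return out
}

func main() {
	apps := []app{
		{"a", "eco"}, {"b", "eco"}, {"c", "eco"}, {"d", "standard"},
	}
	for i, batch := range phases(apps, []string{"eco", "standard"}, 2) {
		fmt.Printf("phase %d: %v\n", i+1, batch)
		// A real rollout would pause here and assess the new system
		// before starting the next phase.
	}
}
```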

Participating

We hope that you, our customers, can help us validate the new router well before it becomes the default. You can enable Router 2.0 for a Common Runtime app by running:

heroku labs:enable http-routing-2-dot-0 -a <app>

If you choose to enroll, you can submit feedback by commenting on the Heroku Public Roadmap item or creating a support ticket.

Conclusion

Delivering new features to a platform like Heroku is never as simple as flipping an on/off switch. When we deliver something to our customers, there’s always a mountain of behind-the-scenes effort put into it. Simply stated, we write a lot of software to ensure the software that you see works the way it should.

We’re proud of the work we’ve done so far on Router 2.0, and we’re excited for what’s coming next. If you enroll your applications in the beta, keep an eye on the Router 2.0 Dev Center page and the Heroku Changelog. We’ll be posting updates about new features as they become available.

Thanks for reading and happy coding!

Originally published: October 30, 2023
