At Heroku, we're always striving to provide the best operational experience with the services we offer. As we’ve recently launched Heroku Kafka, we were excited to help out with testing of the latest release of Apache Kafka, version 0.10, which landed earlier this week. While testing Kafka 0.10, we uncovered what seemed like a 33% throughput drop relative to the prior release. As others have noted, “it’s slow” is the hardest problem you’ll ever debug, and debugging this turned out to be very tricky indeed. We had to dig deep into Kafka’s configuration and operation to uncover what was going on.

Background

We've been benchmarking Heroku Kafka for some time, as we prepared for the...


For almost two years now, the Heroku Dashboard has provided a metrics page to display information about memory usage and CPU load for all of the dynos running an application. Additionally, we've been providing aggregate error metrics, as well as metrics from the Heroku router about incoming requests: average and P95 response time, counts by status, etc.

Almost all of this information is being slurped out of an application's log stream via the Log Runtime Metrics labs feature. For applications that don't have this flag enabled, which is most applications on the platform, the relevant logs are still generated, but bypass Logplex, and are instead sent directly to our metrics...


I spend most of my time at Heroku working on our support tools and services; help.heroku.com is one such example. Heroku's help application depends on the Platform API to, amongst other things, authenticate users, authorize or deny access, and fetch user data.

So, what happens to tools and services like help.heroku.com during a platform incident? They must remain available to both agents and customers—regardless of the status of the Platform API. There is simply no substitute for communication during an outage.

To ensure this is the case, we use api-maintenance-sim, an app we recently open-sourced, to regularly simulate Platform API incidents.

Simulating downtime

During a Platform...


The asset pipeline is the slowest part of deploying a Rails app. How slow? On average, it's over 20x slower than installing dependencies via bundle install. Why so slow? In this article, we're going to take a look at some of the reasons the asset pipeline is slow and how we were able to get a 12x performance improvement on some apps with Sprockets version 3.3+.

The Rails asset pipeline uses the Sprockets library to take your raw assets, such as JavaScript or Sass files, and pre-build minified, compressed assets that are ready to be served by a production web service. The process is inherently slow. For example, compiling a Sass file to CSS requires reading the file in, which...


Heroku has years of experience operating our world-class platform, and we have developed many internal tools to operate it along the way. However, with the introduction of Heroku Private Spaces, much of the infrastructure was built from the ground up, and we needed new tools to operate this new platform. At the center of this, we built a new operations console that gives us a bird's eye view of the entire system, lets us drill down into issues in a particular space, and covers everything in between.

The operations console is a single-page React application with a reverse proxy on the backend to securely access data from a variety of sources. The console itself started off from a mashup...

