At Heroku, we're always working towards improving operational stability with the services we offer. As we recently launched Apache Kafka on Heroku, we've been increasingly focused on hardening Apache Kafka, as well as our automation around it. This particular improvement in stability concerns Kafka's compacted topics, which we haven't talked about before. Compacted topics are a powerful and important feature of Kafka, and as of 0.9, provide the capabilities supporting a number of important features.

Meet the Bug

The bug we had been seeing is that an internal thread that's used by Kafka to implement compacted topics (which we'll explain more of shortly) can die in...


During the development of the recently released Heroku SSL feature, a lot of work was carried out to stabilize the system and improve its speed. In this post, I will explain how we managed to improve the speed of our TLS handshakes by 4-5x.

The initial reports of speed issues were sent our way by beta customers who were unhappy about the low level of performance. This was understandable since, after all, we were not greenfielding a solution for which nothing existed, but actively trying to provide an alternative to the SSL Endpoint add-on, which is provided by a dedicated team working on elastic load balancers at AWS. At the same time, another of the worries we had was to figure out how...


One of the interesting patterns that we’ve seen, as a result of managing one of the largest fleets of Postgres databases, is one or two tables growing at a rate that’s much larger and faster than the rest of the tables in the database. In terms of absolute numbers, a table that grows sufficiently large is on the order of hundreds of gigabytes to terabytes in size. Typically, the data in this table tracks events in an application or is analogous to an application log. Having a table of this size isn’t a problem in and of itself, but can lead to other issues; query performance can start to degrade and indexes can take much longer to update. Maintenance tasks, such as vacuum, can also become...


At Heroku, we're always striving to provide the best operational experience with the services we offer. As we’ve recently launched Heroku Kafka, we were excited to help out with testing of the latest release of Apache Kafka, version 0.10, which landed earlier this week. While testing Kafka 0.10, we uncovered what seemed like a 33% throughput drop relative to the prior release. As others have noted, “it’s slow” is the hardest problem you’ll ever debug, and debugging this turned out to be very tricky indeed. We had to dig deep into Kafka’s configuration and operation to uncover what was going on.

Background

We've been benchmarking Heroku Kafka for some time, as we prepared for the...


For almost two years now, the Heroku Dashboard has provided a metrics page to display information about memory usage and CPU load for all of the dynos running an application. Additionally, we've been providing aggregate error metrics, as well as metrics from the Heroku router about incoming requests: average and P95 response time, counts by status, etc.

Almost all of this information is being slurped out of an application's log stream via the Log Runtime Metrics labs feature. For applications that don't have this flag enabled, which is most applications on the platform, the relevant logs are still generated, but bypass Logplex, and are instead sent directly to our metrics...


Browse the archives for engineering or all blogs Subscribe to the RSS feed for engineering or all blogs.