The Heroku Connect team ran into problems with existing task scheduling libraries. Because of that, we wrote RedBeat, a Celery Beat scheduler that stores scheduled tasks and runtime metadata in Redis. We’ve also open sourced it so others can use it. Here is the story of why and how we created RedBeat.

Background

Heroku Connect, makes heavy use of Celery to synchronize data between Salesforce and Heroku Postgres. Over time, our usage has grown, and we came to rely more and more heavily on the Beat scheduler to trigger frequent periodic tasks. For a while, everything was running smoothly, but as we grew cracks started to appear. Beat, the default Celery scheduler, began to behave...


Back on August 11, 2016, Heroku experienced increased routing latency in the EU region of the common runtime. While the official follow-up report describes what happened and what we've done to avoid this in the future, we found the root cause to be puzzling enough to require a deep dive into Linux networking.

The following is a write-up by SRE member Lex Neva (what's SRE?) and routing engineer Fred Hebert (now Heroku alumni) of an interesting Linux networking "gotcha" they discovered while working on incident 930.

The Incident

Our monitoring systems paged us about a rise in latency levels across the board in the EU region of the Common Runtime. We quickly saw that the...


As part of our commitment to security and support, we periodically upgrade the stack image, so that we can install updated package versions, address security vulnerabilities, and add new packages to the stack. Recently we had an incident during which some applications running on the Cedar-14 stack image experienced higher than normal rates of segmentation faults and other “hard” crashes for about five hours. Our engineers tracked down the cause of the error to corrupted dyno filesystems caused by a failed stack upgrade. The sequence of events leading up to this failure, and the technical details of the failure, are unique, and worth exploring.

Background

Heroku runs application...


At Heroku, we're always working towards improving operational stability with the services we offer. As we recently launched Apache Kafka on Heroku, we've been increasingly focused on hardening Apache Kafka, as well as our automation around it. This particular improvement in stability concerns Kafka's compacted topics, which we haven't talked about before. Compacted topics are a powerful and important feature of Kafka, and as of 0.9, provide the capabilities supporting a number of important features.

Meet the Bug

The bug we had been seeing is that an internal thread that's used by Kafka to implement compacted topics (which we'll explain more of shortly) can die in...


During the development of the recently released Heroku SSL feature, a lot of work was carried out to stabilize the system and improve its speed. In this post, I will explain how we managed to improve the speed of our TLS handshakes by 4-5x.

The initial reports of speed issues were sent our way by beta customers who were unhappy about the low level of performance. This was understandable since, after all, we were not greenfielding a solution for which nothing existed, but actively trying to provide an alternative to the SSL Endpoint add-on, which is provided by a dedicated team working on elastic load balancers at AWS. At the same time, another of the worries we had was to figure out how...


Browse the archives for engineering or all blogs Subscribe to the RSS feed for engineering or all blogs.