Apache Kafka 0.10

At Heroku, we're always striving to provide the best operational experience with the services we offer. Having recently launched Heroku Kafka, we were excited to help test the latest release of Apache Kafka, version 0.10, which landed earlier this week. While testing Kafka 0.10, we uncovered what looked like a 33% throughput drop relative to the prior release. As others have noted, “it’s slow” is the hardest problem you’ll ever debug, and this one turned out to be very tricky indeed: we had to dig deep into Kafka’s configuration and operation to uncover what was going on.


We've been benchmarking Heroku Kafka for some time, as we prepared for the public launch. We started out benchmarking to help provide our users with guidance on the performance of each Heroku Kafka plan. We realized we could also give back to the Kafka community by testing new versions and sharing our findings. Our benchmark system orchestrates multiple producer and consumer dynos, allowing us to fully exercise a Kafka cluster and determine its limits across the various parameters of its use.


We started benchmarking the Kafka 0.10.0 release candidate shortly after it was made available. In the very first test we ran, we noted a 33% performance drop: version 0.9 on the same cluster configuration delivered 1.5 million messages per second in and out at peak, while version 0.10.0 managed just under 1 million messages per second. This was pretty alarming. If this regression affected all Kafka users, such a large reduction in throughput would be a major disincentive to upgrade.

We set out to determine the cause (or causes) of this decrease in throughput. We ran dozens of benchmark variations, testing a wide variety of hypotheses:

  • Does this only impact our largest plan? Or are all plans equally impacted?
  • Does this impact a single producer, or do we have to push the boundaries of the cluster to find it?
  • Does this impact plaintext and TLS connections equally?
  • Does this impact large and small messages equally?
  • And many other variations.

We investigated many of the possible angles suggested by our hypotheses, and turned to the community for fresh ideas to narrow down the cause. We asked the Kafka mailing list for help, reporting the issue and all the things we had tried. The community immediately dove into trying to reproduce the issue and also responded with helpful questions and pointers for things we could investigate.

During our intensive conversation with the community and our review of the discussions that led up to the 0.10 release candidate, we found a fascinating thread about another performance regression in 0.10. That issue didn't appear to line up with the problem we had found, but it gave us insight into Kafka that later helped us understand the root cause of our own. We found the other issue very counter-intuitive: increasing the performance of a Kafka broker can actually hurt the performance of the whole system. Kafka relies heavily on batching, and if the broker gets faster, the producers batch less often. Version 0.10 included several improvements to the broker's performance, which caused odd performance impacts that have since been fixed.
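The producer-side batching behavior described above is governed by two standard Kafka producer settings, `batch.size` and `linger.ms`. A minimal sketch (the values here are hypothetical, not tuning advice):

```python
# Illustrative Kafka producer settings that govern batching.
# Keys are standard Kafka producer config names; values are hypothetical.
producer_config = {
    "batch.size": 65536,  # max bytes collected into one batch per partition
    "linger.ms": 10,      # wait up to 10 ms for a batch to fill before sending
}

# A faster broker acknowledges requests sooner, so the producer tends to
# send batches before they fill. Raising linger.ms trades a little latency
# for fuller batches and better throughput.
for key, value in producer_config.items():
    print(f"{key} = {value}")
```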

To help us proceed in a more effective and deliberate manner, we started applying Brendan Gregg's USE method to a broker during benchmarks. The USE method helps structure performance investigations, and is very easy to apply, yet also very robust. Simply, it says:

  1. Make a list of all the resources used by the system (network, CPU, disks, memory, etc.)
  2. For each resource, look at the:
    1. Utilization: the average time the resource was busy servicing work
    2. Saturation: the amount of extra work queued for the resource
    3. Errors: the count of error events
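As a concrete illustration of the "Errors" check applied to the network, here is a minimal sketch that parses Linux `/proc/net/dev`-style counters for per-interface errors and drops. The sample data is inlined (the `eth0` numbers are made up) so the example is self-contained:

```python
# A minimal sketch of the USE method's "Errors" check for a NIC,
# parsing /proc/net/dev-style output. The eth0 counters are made up.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1847321412 1293801  0  512    0     0          0         0 2093412211 1401233  0  347    0     0       0          0
"""

def nic_errors(procnetdev: str) -> dict:
    """Return per-interface rx/tx error and drop counters."""
    stats = {}
    for line in procnetdev.splitlines()[2:]:  # skip the two header lines
        iface, counters = line.split(":", 1)
        fields = [int(x) for x in counters.split()]
        stats[iface.strip()] = {
            "rx_errs": fields[2], "rx_drop": fields[3],
            "tx_errs": fields[10], "tx_drop": fields[11],
        }
    return stats

print(nic_errors(SAMPLE))
```

A non-zero and growing `drop` counter under load is exactly the saturation signal we eventually found.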

We started going through these, one by one, and rapidly confirmed that Kafka is blazingly fast and easily capable of maxing out the performance of your hardware. Benchmarking it will quickly identify the bottlenecks in your setup.

What we soon found is that during our benchmarks of version 0.10 the network cards were saturated: they were dropping packets because of the volume of bytes they were asked to carry. When we benchmarked version 0.9, the network cards were pushed to just below their saturation point. What changed in version 0.10? Why did it lead to saturation of the networking hardware under the same conditions?


Kafka 0.10 brings with it a few new features. One of the biggest, added to support Kafka Streams and other time-based capabilities, is that each message now carries a timestamp recording when it was created. This timestamp adds 8 bytes per message, and that was our issue: our benchmarking setup was pushing the network cards so close to saturation that an extra 8 bytes per message pushed them over the edge. Our benchmarks run with replication factor 3, so an additional 8 bytes per message means an extra 288 megabits per second of traffic over the whole cluster:

$$ 288\ \text{Mbps} \;=\; 8\,\tfrac{\text{bytes}}{\text{message}} \;\times\; 1.5\ \text{million}\,\tfrac{\text{messages}}{\text{second}} \;\times\; 8\,\tfrac{\text{bits}}{\text{byte}} \;\times\; 3\ \text{\small (1 for producer, 2 for replication)} $$

This extra traffic is more than enough to oversaturate the network. Once the network cards are oversaturated, they start dropping packets and doing retransmits. This dramatically reduces network throughput, as we saw in our benchmarks.
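The back-of-the-envelope calculation above can be checked in a few lines of code:

```python
# Extra network traffic from 8 bytes of timestamp per message at our
# benchmark rate, replicated across the cluster.
extra_bytes_per_message = 8
messages_per_second = 1_500_000
bits_per_byte = 8
copies = 3  # 1 copy from the producer + 2 replica copies (replication factor 3)

extra_bits_per_second = (extra_bytes_per_message * messages_per_second
                         * bits_per_byte * copies)
print(extra_bits_per_second / 1e6, "Mbps")  # → 288.0 Mbps
```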

To further verify our hypothesis, we reproduced the problem under Kafka 0.9: when we increased the message size by 8 bytes, we saw the same performance impact.

Giving Back

Ultimately, there's not much to fix here in Kafka's internals. Any Kafka cluster that runs close to the limits of its networking hardware will likely see issues of this sort, no matter the version; Kafka 0.10 just made the issue more apparent in our analysis by increasing the baseline message size. The same thing would happen if you, as a user, added a small amount of overhead to each message while driving sufficient volume through your cluster. Production use cases tend to have a lot of variance in message size (usually far more than 8 bytes), so we expect most production uses of Kafka not to be impacted by the overhead in 0.10. The real trick is not to saturate your network in the first place, so it pays to model out an approximation of your data relative to your configuration.
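One way to model this is a rough capacity check: estimate the aggregate bits per second your traffic generates and compare it to your network capacity. A hypothetical sketch (the function name and the 10 Gbps figure are assumptions for illustration, and comparing an aggregate figure to a single card is only a coarse upper bound):

```python
# Rough capacity model: aggregate network load generated by a topic's
# traffic, versus an assumed NIC capacity. All names/figures hypothetical.
def cluster_mbps(message_bytes: int, messages_per_sec: float,
                 replication_factor: int) -> float:
    """Megabits per second crossing the cluster: one copy in from the
    producer plus (replication_factor - 1) replica copies."""
    copies = 1 + (replication_factor - 1)
    return message_bytes * messages_per_sec * 8 * copies / 1e6

nic_capacity_mbps = 10_000  # assume a 10 Gbps card
load = cluster_mbps(message_bytes=100, messages_per_sec=1_500_000,
                    replication_factor=3)
print(f"estimated load: {load:.0f} Mbps "
      f"({load / nic_capacity_mbps:.0%} of assumed capacity)")
```

Plugging in just the 8-byte timestamp overhead at our benchmark rate reproduces the 288 Mbps figure from the equation above.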

We contributed a documentation patch about the impact of the increased network traffic so that other Kafka users don't have to go through the same troubleshooting steps. For Heroku Kafka, we've been looking at a few networking improvements we can make to the underlying cluster configurations to help mitigate the impact of the additional message overhead. We've also been looking at improved monitoring and bandwidth recommendations to better understand the limits of possible configurations, and to be able to provide a graceful and stable operational experience with Heroku Kafka.

Kafka 0.10 is in beta on Heroku Kafka now. For those of you in the Heroku Kafka beta, you can provision a 0.10 cluster like so:

heroku addons:create heroku-kafka --version 0.10

We encourage you to check it out. Kafka Streams, added in version 0.10, makes many kinds of applications much easier to build. Kafka Streams works extremely well with Heroku, as it's just a (very powerful) library you embed in your application.

If you aren't on the beta, you can request access here: https://heroku.com/kafka

For production use, we recommend that you continue to use Kafka 0.9, the default version for Heroku Kafka. We are working closely with the community to further test and validate 0.10 as ready for production, and we take some time to do this in order to iron out any bugs in new releases for our customers. That only happens if people try the new version with their applications (for example, in a staging environment), so we welcome and encourage your use of it. We can’t wait to see what you build!

Originally published: May 27, 2016
