At Heroku, we're always working towards increased operational stability with the services we offer. As we recently launched the beta of Apache Kafka on Heroku, we've been running a number of clusters on behalf of our beta customers.
Over the course of the beta, we have thoroughly exercised Kafka across a wide range of cases, which is an important part of bringing a fast-moving open-source project to market as a managed service. This breadth of exposure led us to discover a memory leak in Kafka, have a bit of an adventure debugging it, and then contribute a patch to the Apache Kafka community to fix it.
For the most part, we've seen very few issues running Kafka in production. The system itself is very stable and performs well even under massive amounts of load. However, we identified a scenario where processes in a long-running cluster would no longer fully recover from a restart. The process would seemingly restart successfully but always crash several seconds later. This was troubling: a significant part of safely managing and operating a distributed system like Kafka is the ability to restart individual members of a cluster. We started looking into this as a high-priority problem.
Fortunately, an internal team had a staging cluster where we could reproduce the issue to investigate further without impacting a production app. We quickly figured out a remediation that we could apply in case we hit this issue on a production cluster before we addressed the bug, and then got to debugging the issue at hand.
I'm going to discuss the process of debugging this failure, including all the dead ends we encountered. Many debugging stories jump straight to "we found this really difficult bug", but leave out the gory details and the dead ends chased along the way. That paints a false picture of fixing gnarly bugs: in reality, nearly all of them involve days of dead ends and misunderstandings.
Given that we were able to reproduce the issue consistently, we began our investigation by simply starting the process and watching the server's resources with top and similar diagnostic tools. It was immediately clear this was a memory issue. The process would boot, allocate memory at a rate of around 400MB/s without freeing any, and then crash after exhausting all the memory on the server.
We confirmed this by looking at stdout logs of the process. They contained this wonderful snippet:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 131072 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
...
We now had two indications that there was a memory issue of some kind! This is progress: we could now focus on memory allocation and mostly ignore all other issues. Typically, JVM memory is allocated on the heap, and Java has a bunch of tooling for understanding what's on the heap. We broke open the logs emitted by the JVM's garbage collector and found a mystery inside. The GC logs were almost empty, indicating the program wasn't under heap pressure at all. Furthermore, what little information we did have indicated that this broker only ever had 100 megabytes of on-heap memory in use. This didn't make sense given our previous evidence from top and stdout.
The JVM mostly allocates memory in two ways:

- On-heap memory, which represents the majority of memory allocation in most JVM programs; the garbage collector manages its allocation and deallocation.
- Off-heap (native) memory, which some programs use in notable amounts; here, the application controls allocation and deallocation directly.

Kafka shouldn't typically be using a lot of off-heap memory, but our next theory was that it must be doing exactly that. Clearly, this is the only alternative, right? It can't be on-heap memory, or we'd see more information in the GC logs.
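The distinction can be seen directly in the standard library. As a small illustration (our own, not from Kafka), `ByteBuffer` supports both kinds of allocation:

```java
import java.nio.ByteBuffer;

public class HeapVsOffHeap {
    public static void main(String[] args) {
        // On-heap: backed by a byte[] that the garbage collector manages.
        ByteBuffer onHeap = ByteBuffer.allocate(1024 * 1024);

        // Off-heap ("direct"): native memory outside the GC-managed heap.
        // The JVM still tracks this, so it shows up in off-heap metrics.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1024 * 1024);

        System.out.println(onHeap.hasArray());  // true: heap-backed
        System.out.println(offHeap.isDirect()); // true: native allocation
    }
}
```

Note that both of these are still visible to the JVM's own accounting, which matters for the next step of the investigation.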
To test this theory, we wrote a small script to query the JMX metric java.lang:type=Memory, which reports how much on- and off-heap memory the JVM thinks it is using. We ran this script in a tight loop while starting the broker and saw, to our frustration, nothing useful:
...
heap_memory=100 MB offheap_memory=63.2 MB
heap_memory=101 MB offheap_memory=63.2 MB
heap_memory=103 MB offheap_memory=63.2 MB
...
crash
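For reference, java.lang:type=Memory is the MBean backed by the standard MemoryMXBean, so an in-process sketch of the same check (a simplified stand-in for the script we actually ran) looks like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryProbe {
    public static void main(String[] args) {
        // Same data as the java.lang:type=Memory MBean exposes over JMX.
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage offHeap = bean.getNonHeapMemoryUsage();
        System.out.printf("heap_memory=%d MB offheap_memory=%d MB%n",
                heap.getUsed() >> 20, offHeap.getUsed() >> 20);
    }
}
```

Crucially, these counters only cover memory the JVM itself knows about, which is why they stayed flat while the process ballooned.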
Neither on-heap nor off-heap memory was being used! But what else can even allocate memory? At this point, we reached further into our debugging toolkit.
Tracing is a very effective debugging tool, often employed in this kind of situation. Are the logs not telling you what you need to know? Time to dump out relevant events from a tracing tool and start looking through them for ideas.
In this particular case, we used sysdig, an especially powerful tool for debugging issues on a single server. Sysdig allows you to capture system calls, much like the more venerable strace. A syscall is the mechanism by which a userland process communicates with the kernel, and since most kinds of resource usage involve talking to the kernel, looking at syscalls is a very effective way to diagnose this kind of issue.
Sysdig is best used in a "capture, then analyze" mode, much like tcpdump. This lets you write all the syscalls emitted by a process to a file and then take your time analyzing them.
sudo sysdig 'proc.name=java' -w ~/sysdig.scap

This captures to a file all syscalls emitted by processes named java. We ran this command, then booted our broker and watched it crash.
Now that we have a file full of syscalls, what do we look at? The capture file in this case was 434MB, which you can't "just read through". Sysdig gives you a suite of analysis tools for looking at the events emitted by a process. Here, we're interested in memory allocation, so for the most part we're interested in the syscalls mmap and munmap.
The issue we're debugging is that somewhere, Kafka is allocating memory and never freeing it. Remember, this isn't on-heap memory or off-heap memory, so something is doing native memory allocation.
Firing up sysdig, we see that this program does indeed allocate a lot of memory using mmap syscalls. Analysis using bash scripting reveals that 9GB of memory was allocated via mmap during this run. That's more memory than was available on the server, which seemed to point in the right direction. When memory is allocated by mmap, the caller eventually has to call munmap to release it back to the operating system. Never releasing memory back to the operating system is the definition of a memory leak, and it will cause the process to crash once the leak has consumed all available memory.
A quick but complex sysdig query reveals this to be the case:
$ sudo sysdig -r ~/sysdig.scap 'evt.type=mmap' -p '%proc.vmsize %evt.dir %evt.type %evt.info' | grep 'length=' | wc -l
2551
$ sudo sysdig -r ~/sysdig.scap 'evt.type=munmap' -p '%proc.vmsize %evt.dir %evt.type %evt.info' | grep 'length=' | wc -l
518
A lot is going on in this query; sysdig is an incredibly powerful tool. This specific query lets us see the memory usage of the process at the moment each syscall was traced (that's what %proc.vmsize does in the format argument). While here we are just counting events, we also examined them for patterns, and that memory usage output was invaluable there.
At this point, we were stumped. The process was allocating a lot of memory but never freeing it, and the mmap calls didn't follow any particular pattern that we could determine. At this stage in debugging, it's often time to take a break and let your unconscious mind think up some possibilities. Recognize that you are human: getting rest helps with debugging just as much as staring at a screen.
A short while later, we had a hunch to follow: it's something to do with the cluster being configured to use GZIP to compress data.
To provide context, Kafka can use compression to dramatically reduce the volume of data transferred, both over the network and to disk. This particular cluster had GZIP compression enabled. Java exposes GZIP compression through the JVM classes GZIPInputStream and GZIPOutputStream, which are backed by a C library, zlib, that does its own memory allocation.
A quick Google search for "JVM GZIP Memory Leak" led us to this article, which describes almost exactly what we were seeing: a usage pattern where you open a GZIPOutputStream and never close it when finished, causing the JVM to run out of memory very easily. This explained why the JVM didn't report this memory as either on-heap or off-heap: it can't know what memory the C library is using.
We broke open the Kafka source code and found a point where it opens a GZIPInputStream to decompress some data but never closes it. Upon a restart, Kafka has to recover recently written messages, which involves decompressing them and checking them for validity. Most code in Kafka that does decompression reads the whole stream of compressed messages and then closes the decompressor. This validity check works differently: it opens the compressed messages, reads the first message, and checks its offset. It then short-circuits, without ever closing the native buffers! This leads to our memory leak upon restart. As it happens, this code path is never hit during normal operation; it only runs on restarts, which fits our evidence as well.
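The shape of that check, and of the fix, can be sketched like this (names and message layout are illustrative, not Kafka's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class OffsetCheck {
    // Illustrative only: read the first 8-byte "offset" from a compressed
    // batch and return early. Before the fix, the stream was never closed
    // on this short-circuit path, leaking zlib's native buffers; wrapping
    // it in try-with-resources closes it on every path.
    static long firstOffset(byte[] compressedBatch) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressedBatch)))) {
            return in.readLong(); // short-circuit: only the first offset is checked
        } // stream closed even though we returned early
    }

    public static void main(String[] args) throws IOException {
        // Build a fake compressed batch whose first field is the offset 42.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bos))) {
            out.writeLong(42L);
        }
        System.out.println(firstOffset(bos.toByteArray())); // prints 42
    }
}
```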
We confirmed this memory issue by reproducing the bug in a development version of Kafka, and then failing to reproduce it after applying a patch that closes the GZIPInputStream when short-circuiting.
This is often how debugging difficult bugs goes: hours and days of no progress staring at code, application logs, and trace logs, and then some hunch points you in the right direction and is trivial to confirm. Trying different tools and different ways of looking at the problem helps you get to that hunch.
Having found and diagnosed the issue, we filed an upstream ticket with the Kafka project and started working on a patch. After some back-and-forth review with the Kafka committers, we had a patch in trunk, which is included in the brand new release of Kafka 0.10.0.1. It's interesting to note how small the resulting patch was: this was a major bug that meant Kafka couldn't boot in certain situations, yet the bug and the resulting fix were both very simple. Often a tiny change in code is responsible for a huge change in behavior.
Kafka 0.10.0.1 is now the default version on Apache Kafka on Heroku. For those of you in the Apache Kafka on Heroku beta, you can provision a new cluster that doesn't have this issue. Likewise, we have tooling in place to upgrade existing versions.
For our beta customers, running the command

heroku kafka:upgrade heroku-kafka --version 0.10

where heroku-kafka is the name of your Heroku Kafka cluster, will perform an in-place upgrade to the latest version of 0.10 (0.10.0.1 at the time of writing).
If you aren’t in the beta, you can request access here: https://www.heroku.com/kafka.