Troubleshooting Down the Logplex Rabbit Hole

Adventures of a Heroku Routing Engineer

My name is Fred and I spend most of my time on Logplex. Since joining Heroku in March 2013, I've become the main developer on that product and handle most of the maintenance and support that goes with it. In this post, I'll explain what the Heroku routing team needed to do to make Logplex more stable, decrease our workload, and keep our mornings quiet and productive.

I'm a remote employee on the Heroku routing team, and I live on the East coast, which means I'm usually the first one in the routing room in the company's HipChat. Later, Tristan, who lives in Illinois, joins me. Then a few hours later, the rest of the team joins us from the West coast, where most of Heroku's employees are located. I usually have a few silent hours to work without interruptions, e-mails, questions, scheduled meetings, and other distractions, which is a relaxing and productive way to start the day.

Customer support tickets that get escalated to the team, or Nagios and PagerDuty alarms, which I try to intercept before they wake my on-call teammate, are the only interruptions to my otherwise productive mornings. Generally, the alarms are rare, and my mornings are becoming increasingly uneventful. Back in June and May, however, an array of alarm interruptions was exhausting our entire routing team.

Alarms were coming from all stacks, while customer support tickets also started rolling in. The problems compounded to make a bad situation worse. Alarms interrupt the team's work flow, which then slows down our ability to resolve the problems. Issues cropped up at all hours of the day and night, so the team, especially the engineer on call, was worn down and becoming less productive, and our time was consumed by interruptions and fires to extinguish.

Although we have other areas of focus, the routing team mainly works on Heroku's HTTP routing stack and the logging infrastructure, centered around Logplex. A sizeable number of alarms were related to individual Logplex nodes that kept crashing with increasing frequency, each time generating alerts in our chat system, via e-mail, and ultimately through the routing team's worst enemy, PagerDuty.

What Logplex Does

Before going further, I should explain what Logplex does, otherwise this entire post may be a bit confusing.

The simplest way to explain Logplex is that it takes log data coming from everywhere in Heroku (dynos, routers, and runtimes) and pushes the data around to other endpoints. More specifically, Logplex accepts syslog messages from either TCP or HTTP(s) streams on any node in the cluster. Each of these nodes will store the log messages in Redis buffers (1,500 lines) to be accessed by heroku logs, allow a distributed merging of all the incoming streams for a given application in order to be displayed live with heroku logs --tail, or will forward them over through drains.

Drains are buffers on each of the Logplex nodes. They accumulate messages received for each node, and then forward them to remote endpoints designated for an application by a user or an add-on provider. This can be done over TCP syslog, or a custom HTTP(s) syslog format when required.

The workflow, to make it simple, looks a bit like this:

Logplex Data Flow

There's also an HTTP API that lets tools and users manipulate the drains, tails, and sessions to be used in the system.

When dealing with logs, we get unpredictable loads, which may be bursty (for example, when an application crashes and generates 50,000 lines of stack traces). Thus, overload is a constant issue with Logplex.

Generally speaking, overload can be dealt with in three ways:

Scale up forever by adding machines.
Block the input to slow down the producers.
Shed load by dropping requests.

The first option is good when you are in over your head, because no matter what you do, the current system can never handle the load and the quality of service is bad.

The second option, on the other hand, is one we want to avoid within Logplex. We don't want an overloaded Logplex node to slow down Heroku applications that are trying to log data, or applications that are busy crashing.

This leaves us with the third option, shedding load. How to handle overload is an important decision to make when you are first designing a system, because it will impact the data flow throughout the process. A system that chooses to block must be synchronous in most of its operations so that the bottleneck located deep within the app can impact the accept rate of the API at the edge of the system. An app that chooses to shed load must be mostly asynchronous internally and its slowest components, where data tends to accumulate, must be able to drop the overload.

For Logplex, which is written in Erlang, we do this load shedding in each process that writes to the outside world: the tail buffers, and we have the Redis buffers locally and within each single drain. To work this way, the drain can be thought of as a combination of two state machines:

a state machine representing the connection or protocol negotiation state for the socket to the remote endpoint;
a state machine handling the buffering of log lines (both on input and output).

For the first type, the most basic machine is:

Connected-Disconnected FSM (simple)

This means that a drain can be disconnected or connected to a given endpoint and transition between the two.

The second type, for buffering, can be represented as:

Accumulating-Extracting FSM

Both machines are conceptually simple, and HTTP drains, for example, use one process for each of them. There is some message passing overhead, but the model is easy to reason about.

To conserve resources, TCP drains will execute both state machines within a single Erlang process. The "accumulating" and "extracting" buffer operations can be made implicit to every operation, but because the input stream to a process is never stopping and impossible to regulate directly, they should never be blocked by other operations without compromising the stability of the system. This means that operations such as sending a message must be made asynchronous, prompting for an additional state:

Connected-Disconnected FSM (complex)

In practice, both drain implementations are far more complex as a result of a plethora of small business rules and behaviors we want to enforce, such as batching, retrying failed messages, loss reporting, send time outs, exponential backoffs, and so on. The key points are that HTTP buffers are conceptually simpler (and thus safer), but cost more in terms of resources; and TCP buffers are more efficient, but trickier to reason about and consequently easier to mess up by accident.

So that's the sky-high view of what I work on when I dive into in Logplex.

Garfield Would Hate All Days if They Had PagerDuty Alarms

As I mentioned, alarms were rolling in, which was particularly frustrating to our team because the system had been designed so that its architecture could cope with individual node failures. The problem was that we received alerts that were useless because they were not actionable. When a Logplex node would go down, I'd get a page, acknowledge it, wait for it to clear up, then go to sleep until the next day, when I would start investigating.

We ended up turning off the alarms for node failures, but we still wanted to reduce the failure rates. Users' logs that were in transit over the node are lost when it crashes, and although losing logs (as opposed to losing customer information) isn't that big of a deal because you can always generate more, it still makes for a bad user experience.

To make a forced analogy, the system was on painkillers, but we still had to fix what was causing the pain. Thus, I needed to figure out why the nodes kept dying (almost randomly), and then fix the problem so it stopped happening so frequently.

Fixing Crashes Through Refactoring

First Attempt at Fixing the Problem

Investigating the Cause of the Error

I was able to wait for a day before investigating a crash because Erlang nodes that go down generate a crash dump, which is a file containing basic information (for example, a slogan that says why the node died, such as its inability to allocate memory) for the entire node. The dump also provides the state of all processes that were running at the time, minus a few details. Each dump can range from a few hundred MBs up to a few GBs in size, and rummaging through them for relevant details can be challenging.

At first the dumps are intimidating, and they remain so for a while. But as you get used to your nodes dying, you get a better understanding of crash dumps. Common patterns pop up, and often these are message queues of processes that are excessively large, the number of processes or open ports and sockets, the size of some processes in memory, and so on. Most failures are going to be caused by one of these conditions. Finding patterns in the crash dump will lead you to a specific part of the code, which will help you reproduce the bug later.

After finding the patterns, I wrote a script that quickly scans large crash dumps to find and report them as plain text. Using this script and manual digging in the Logplex crash dumps, I could see the most frequent pattern, which was a single process with a large mailbox causing the node to go out of memory. By looking at the faulty process's call stack and ancestors in the dump, I could see that the problem was almost always a TCP Syslog drain.

I had never seen an HTTP(s) drain at fault for these failures, but the failures could be because we have nearly an order of magnitude more TCP Syslog drains than HTTP(s) drains, or it could be due to our TCP Syslog drains implementation blocking.

Without knowing what to attack and experiment with, I couldn't do much debugging. The Splunk and Graphite data we have showed that the node's memory would progressively go up sometime between 10 and 40 minutes. This, combined with a large mailbox, told me that the problem was related to slow input, rather than fast input the drain couldn't keep up with. Given that data set, I decided to blame the TCP Syslog drain implementation for our problems.

Trying to Refactor Drains

Even though hunting down individual blocking elements in the TCP Syslog drains is possible, the routing team and I decided to try refactoring it to use two distinct finite state machines, like the HTTP(s) drains did. This would make the code infinitely simpler, and promised lower latency and memory at the price of higher drop rates; if messages stay shorter in the process's mailbox so they can be dropped, there are fewer held back and gobbling memory at any point in time.

After taking a few days implementing the changes, and running tests and benchmarks, I returned to the team with a solution that seemed to achieve the goals we'd outlined. We agreed to send the code to staging, where it matured and proved to be absolutely stable for a while. The problem with staging, however, is that it does not need to sustain the same load production does, and generating the kind of daily traffic Logplex must deal with locally or in staging would be a serious undertaking. So far, generating that kind of traffic has not been worth it for the routing team in terms of developer resources.

I crossed my fingers and added one production node to run the modified code over a few days so I could monitor it and assess that the code could sustain production loads correctly.

Hiccuping so Hard You Die

When we looked at our team's internal tools that displayed live statistics for all Logplex nodes, the new one performed terribly. The node still had acceptable throughput, but we found a stop-start cycle of roughly 15 seconds, for a few milliseconds of pause at a time. This stop-start cycle meant that the node was having a hard time scheduling all of its Erlang processes in time for them to do their necessary work. Under heavy load, this may be understandable, and eventually the node may recover, but the load was quite regular at this point, and my new node was doing considerably worse than all of its older siblings.

After a few hours, some of the freezes would last longer and longer, and eventually, the node would come crashing down.

Second Attempt at Fixing the Problem

Searching for the Cure (for Hiccups)

I wanted to fix things, so I dove into the application logs. There was a considerably long event sequence to comb through. A typical Logplex node will log anywhere between 100 and 350 messages a second internally. A crashing one can bring this up to multiple thousands, and the events that were most useful to diagnose the failure could have happened 15 to 30 minutes earlier.

After studying the logs for a while, I was able to attribute the initial failing calls to cowboy's acceptors, which is a library we use for our HTTP input stream. In short, the cowboy server had an architecture a bit like this:

socket manager

This architecture is a regular pattern for TCP servers in Erlang. The manager (or a supervisor) opens up a listen socket, and shares it with a group of acceptors. Then each acceptor can concurrently accept new connections and handle them, or pass them off to a third party. This usually allows for far better connection times with shorter delays than doing things sequentially with one single process.

The special thing about this cowboy implementation, however, was that every time an acceptor accepted a connection, it reported back to the manager, a central process, to track the connection and check for configuration changes. This manager was a bottleneck for the execution of the program. Loïc, Cowboy's maintainer, knew about this bottleneck, which also showed up in benchmarks. To fix the problem, albeit temporarily, he did a trick that few Erlang applications should ever do: he raised the process priority.

To understand the implications of the fix, you should be aware that Erlang's VM does preemptive scheduling of all processes, and does so fairly across all processes based on a) the work they have accomplished, b) how busy the node is, and c) the work they will have to accomplish (e.g., if their mailbox is full). Because of this balancing, some important system processes often need a larger share of the processing power. The Erlang VM therefore supports four process priorities:

low
normal (default)
high
max

Every process that has a "high" priority and has to run code right now will do so before any of the "normal" processes can run. All the "normal" processes will run before the "low" processes can. This can work reasonably well when you know the load characteristics of the system you will have, and that higher priority tasks are truly higher priority; however, in the case of cowboy, the priorities were merely a work-around for speed in artificial benchmarks.

The supposition for all our woes became rather evident to me:

My refactoring nearly doubled the number of processes on the node, creating more contention for schedulers.
Cowboy was hogging the scheduler time to accept connections and communicate with acceptors.
Other processes (those handling the requests) were starved for CPU time.

Before I blame cowboy for all the issues, keep in mind that Logplex was running an old version of the server (0.6.1), and that newer versions (0.8+) have gone through architectural changes that entirely removed that central bottleneck for most calls and dropped the priority back to normal.

I first tested the hypothesis by manually changing the scheduler priority of the cowboy process, without otherwise altering the server's design. The crashes became a bit less frequent, but the new nodes were still not reliable enough to lead to solid conclusions from tests.

Despite our team's best efforts, I had little data and no way to reproduce the hiccups anywhere other than in production. I still held onto the idea that cowboy's architecture might be a problem; it still had a central bottleneck, despite the scheduler priorities being right. Now it was time to upgrade to a newer Cowboy version.

Upgrading the Servers and Nodes

We settled on Cowboy 0.8.5, which had a new internal architecture. Unfortunately, this required upgrading the Erlang version from R15 to R16. Erlang versions tend to be backwards compatible, and deprecations come only with a few warnings, which I wanted to eliminate. The biggest issues were the interface changes in the cowboy library itself, which required extensive testing to ensure full backwards compatibility for seamless deployments.

R16 also contained what promised to be sweet scheduler optimizations that we hoped to get our hands on, and I optimized the work done with ETS tables to be more parallel and require less data copying, and also shipped a few important bug fixes to make sure it was worth upgrading. This ended up being a bit complex because some of the bugs were urgent and deployed on the production nodes running R15, then were ported forward to work on R16, which I then forked to make work with the new cowboy version, which had to be merged back with the other branch that ... and so on. At one point I was running and monitoring four different versions of Logplex at the same time in production to see which one would win. This wasn't the smartest option, but, given how difficult isolating variables is, it was the most practical option at the time.

One apparent result was that the regular R16 nodes were doing well, without the stop-start cycles. Eventually, I replaced all R15 production nodes with R16 ones, and kept four nodes running R16 with the split TCP Syslog drains.

Don't Stop Me Now

After monitoring the system for a few weeks, I only saw minor improvements from the split drains nodes. They seemed to lock less frequently, and far less violently. Then, every four or five days, one of the new nodes would go radio silent. They would appear to be running, but nothing would come in and out of the VM for 20-30 minutes at a time, then a thousand lines of logs, then nothing again. Connecting to the node through a remote shell or killing it in a way that generated a crash dump was impossible. Nodes never would recover fully from this, always flapping between "poorly usable" and "not usable at all."

I decided to give up on weeks' worth of work because there was no way to run the code at the time. Maybe, now that the routing team changed some internal mechanisms for the system and the VM still made more progress with its schedulers, running the code would be possible, but back then I was at a dead end. I didn't admit total defeat, however, because time proved that all the other nodes in the fleet (those updated to R16, but without changing the drain model) had stopped crashing almost entirely. I ported the rest of the optimizations that weren't there yet to that branch, and we made that the standard for the entire cluster. The boosts in reliability were such that I could archive the trello cards related to the tasks and could start working on other stuff.

Roughly three weeks later, Nagios started screaming in the routing team's internal chat room every five minutes, for days at a time. Some nodes in the cluster had their memory bubble up, and never gave it back to the OS. The nodes wouldn't crash as fast as before; instead, they'd grow close to the ulimit we'd set and hover there, taunting Nagios, and us by extension.

Clearly, I needed to do more work.

I Just Keep on Bleeding and I Won't Die

First Attempt at Fixing THAT Problem

Garbage Garbage Collection

At first I decided to wait it out and see when the node would crash, as it had to do at some point. I hoped that a crash dump would hold details about memory that could help. Maybe the Logplex nodes were resentful that I wanted to work on something else, but they wouldn't ever go over the memory limit. They'd wait and nag and die (or refuse to) for my attention.

Eventually I logged onto the misbehaving production node and ran the following expression within the Erlang shell:

[erlang:garbage_collect(Pid) || Pid <- processes()].

This effectively goes through all processes on the Erlang node, and forces a garbage collection on them. The alarms stopped.

Manually forcing the garbage collection turned out to work wonders. The question was: Why? Before answering that, though, I needed data. The node hadn't crashed, the garbage was gone, and before a node could display that annoying behavior took weeks. I had little choice but to wait.

Fortunately, the phenomenon happened more frequently over time. A few nodes eventually managed to go overboard and die for me. The analysis of the crash dumps revealed that there was no single process holding a criminal amount of memory, and neither was a single mailbox (largest: 26) looking like it would explode. Some processes had high levels of memory, but live profiling didn't give any single process as a culprit.

As the plot thickened, I kept manually calling for garbage collection when a Nagios memory alert cropped up, but something looked off. Here's a table of one of the nodes' memory consumption as reported by Erlang and by the OS before and after forced global garbage collection:

logplex.82559	erlang:memory(total)	beam process	node total
Pre GC	9.87 GB	12.9 GB	14.15 GB
Post GC	5.89 GB	11.8 GB	12.9 GB
Delta	-3.98 GB	-1.1 GB	-1.25 GB

And here's what a relatively fresh node looks like:

logplex.83017	erlang:memory(total)	beam process	node total
Memory	6.4 GB	6.7 GB	7.6 GB

That was fairly weird and highlighted two big things:

Garbage collection seems to have trouble doing its job for the entire node without being prompted.
There is memory not allocated directly by Erlang (the language) that seems to stay around and grow with time.

I decided to focus on the first point (why GC was having trouble) because it was immediately actionable and showed tangible results that could prevent errors and crashes. I had no idea what the culprit was, but at some point, one of the nodes started triggering errors. I sat at the terminal, waiting to see changes in metrics. While the OS would report the following:

graphite memory chart

the VM would internally report the following (the scales are different and confusing):

splunk memory chart

Here are the numbers for the memory statistics over time:

logplex@ip.internal)1> [{K,V / math:pow(1024,3)} || {K,V} <- erlang:memory()].
[{total,10.441928684711456},
 {processes,3.8577807657420635},
 {processes_used,3.8577076755464077},
 {system,6.584147918969393},
 {atom,3.5428348928689957e-4},
 {atom_used,3.483891487121582e-4},
 {binary,2.7855424359440804},
 {code,0.008501693606376648},
 {ets,3.745322160422802}]

%% waiting 20 minutes %%

logplex@ip.internal)1> [{K,V / math:pow(1024,3)} || {K,V} <- erlang:memory()].
[{total,11.024032421410084},
 {processes,3.953512007370591},
 {processes_used,3.9534691693261266},
 {system,7.070520414039493},
 {atom,3.5428348928689957e-4},
 {atom_used,3.483891487121582e-4},
 {binary,3.2071433141827583},
 {code,0.008501693606376648},
 {ets,3.8099956661462784}]

This shows heavy memory growth, most of it for binary memory, going from 2.79GB up to 3.21GB. Remembering Mahesh Paolini-Subramanya's blog post on binary memory leaks, I decided to try confirming that this was indeed the root cause.

Erlang's binaries are of two main types: ProcBins and Refc binaries. Binaries up to 64 bytes are allocated directly on the process's heap, and take the place they use in there. Binaries bigger than that get allocated in a global heap for binaries only, and each process holds a local reference in its local heap. These binaries are reference-counted, and the deallocation will occur only once all references are garbage-collected for all processes that held a binary.

In 99% of the cases, this mechanism works entirely fine. In some cases, however, the process will either:

do too little work to warrant allocations and garbage collection;
eventually grow a large stack or heap with various data structures, collect them, then get to work with a lot of refc binaries. Filling the heap again with binaries (even though a virtual heap is used to account for the refc binaries' real size) may take a lot of time, giving long delays between garbage collections.

In the case of Logplex, the latter case was the one occuring. I confirmed it by polling processes of a node with process_info(Pid, binary), which returns a list of all the binary references of a process in a list. The length of the list can be used to know which processes hold the most references, but that's not quite enough yet; these references may be valid. Building that list, calling for global garbage collection on the node, then building a new list and calculating the delta between both to know which processes held the most out-of-date references needed to be done next:

(logplex@ip.internal)3> MostLeaky = fun(N) ->
(logplex@ip.internal)3>     lists:sublist(
(logplex@ip.internal)3>      lists:usort(
(logplex@ip.internal)3>          fun({K1,V1},{K2,V2}) -> {V1,K1} =< {V2,K2} end,
(logplex@ip.internal)3>          [try
(logplex@ip.internal)3>               {_,Pre} = erlang:process_info(Pid, binary),
(logplex@ip.internal)3>               erlang:garbage_collect(Pid),
(logplex@ip.internal)3>               {_,Post} = erlang:process_info(Pid, binary),
(logplex@ip.internal)3>               {Pid, length(Post)-length(Pre)}
(logplex@ip.internal)3>           catch
(logplex@ip.internal)3>               _:_ -> {Pid, 0}
(logplex@ip.internal)3>           end || Pid <- processes()]),
(logplex@ip.internal)3>      N)
(logplex@ip.internal)3> end,
(logplex@ip.internal)3> MostLeaky(10).
[{<0.9356.0>,-158680},
 {<0.4782.0>,-113260},
 {<0.21671.0>,-31253},
 {<0.17166.0>,-23081},
 {<0.16754.0>,-20260},
 {<0.5595.0>,-19292},
 {<0.7607.0>,-18639},
 {<0.716.0>,-18184},
 {<0.19981.0>,-17747},
 {<0.21843.0>,-15908}]

I have since added that function to the recon library so that nobody is required to do these calls by hand.

The little data dump above showed that some processes held more than 100,000 stale references to refc binaries, and a lot of them held more than 10,000. This told me that some processes held a lot of binaries, and investigating individual processes revealed they were all drains or buffers of some kind. This was bad news because it meant that the way Logplex is built is more or less playing right into one of the few rare cases where the Erlang GC isn't delivering results on par with what is promised for the general case. The Logplex application, which looked like a perfect match for Erlang, got trapped into an implementation detail that made it a pathological case for the VM.

Picking Up the Trash

Generally, refc binaries memory leaks can be solved in a few different ways:

call garbage collection manually at given intervals (icky);
manually track binary sizes and force GC, which defeats the purpose of having garbage collection in the first place and may do a worse job than the VM's virtual binary heap;
stop using binaries (not desirable);
or add hibernation calls when appropriate (possibly the cleanest solution).

I decided to put a quick fix in place, which still lives in production to this day.

The simple module basically loops on itself and at given intervals polls for the reported memory, checks to see whether it goes past a threshold, and if so, garbage collects the node. If required, the module also allows manual calls.

The script worked as expected, and high memory warnings were quickly tamed, logged (with the deltas), and waiting to be inspected later.

The Attempt Fails, Somewhat

After a few weeks looking at the logs and seeing that no incident with the nodes could be related to memory issues, logs showed garbage collections happening as required, and everything looked great. Resorting to emergency measures as the only way to get the node to drop high amounts of memory isn't ideal, however, and we never know how fast a spike in usage will happen. I decided to add a bunch of hibernation calls in non-intrusive locations (inactive drains, or when disconnecting from a remote endpoint), which would allow us to garbage collect globally much less frequently, while keeping memory lower overall when people register mostly inactive drains.

Everything went fine, except that after five weeks, one node crashed despite the fixes being in place. The global garbage collection didn't even get triggered. Looking at logs from the OS and from the Logplex node internally revealed that the OS allocated 15GB of RAM to the Logplex node, while it internally reported using less than half of that. We had a serious memory leak, which was incredibly frustrating.

Second Attempt at Fixing THAT Problem

How Erlang's Memory Stuff Works

At this point, I was hitting the limits of what I knew about the Erlang virtual machine, and I suspected that either memory leaks were going on outside of the memory the VM reported, or the nodes were victim to memory fragmentation. (We're using NIFs for lzf decompression, but the VM could have been at fault, too.)

Not knowing what to do, I contacted Lukas Larsson. A few years back, when I spent my first two weeks at Erlang Solutions Ltd. in London, Lukas was also there for a few days and acted as my guide through the city and company. Since then, Lukas has moved on (internally) to consult for the OTP team at Ericsson, and I moved on to AdGear, and then to Heroku. We still connect occasionally at conferences and over IRC, and Lukas has always helped answer my tricky questions about the Erlang VM.

I asked Lukas how I could pinpoint what was wrong—memory leak or fragmentation—and I showed him some of my collected data. I'm sharing what I learned in the process because, in addition to be interesting, the information is not documented anywhere other than the source.

The amount returned by erlang:memory/0-1 is the amount of memory actively allocated, where Erlang terms are laid in memory; this amount does not represent the amount of memory that the OS has given to the virtual machine (and Linux doesn't actually reserve memory pages until they are used by the VM). To understand where memory goes, one must first understand the many allocators being used:

Allocators map

temp_alloc: does temporary allocations for short use cases (such as data living within a single C function call).
eheap_alloc: heap data, used for things such as the Erlang processes' heaps.
binary_alloc: the allocator used for reference counted binaries (what their 'global heap' is).
ets_alloc: ETS tables store their data in an isolated part of memory that isn't garbage collected, but allocated and deallocated as long as terms are being stored in tables.
driver_alloc: used to store driver data in particular, which doesn't keep drivers that generate Erlang terms from using other allocators. The driver data allocated here contains locks/mutexes, options, Erlang ports, etc.
sl_alloc: short-lived memory blocks will be stored there, and include items such as some of the VM's scheduling information or small buffers used for some data types' handling.
ll_alloc: long-lived allocations will be in there. Examples include Erlang code itself and the atom table, which stay there.
fix_alloc: allocator used for frequently used fixed-size blocks of memory. One example of data used there is the internal processes' C struct, used internally by the VM.
std_alloc: catch-all allocator for whatever didn't fit the previous categories. The process registry for named process is there.

The entire list of where given data types live can be found in the source.

By default, there will be one instance of each allocator per scheduler (and you should have one scheduler per core), plus one instance to be used by linked-in drivers using async threads. This ends up giving you a structure a bit like the drawing above, but split it in N parts at each leaf.

Each of these sub-allocators will request memory from mseg_alloc and sys_alloc depending on the use case, and in two possible ways. The first way is to act as a multiblock carrier (mbcs), which will fetch chunks of memory that will be used for many Erlang terms at once. For each mbc, the VM will set aside a given amount of memory (~8MB by default in our case, which can be configured by tweaking VM options), and each term allocated will be free to go look into the many multiblock carriers to find some decent space in which to reside.

Whenever the item to be allocated is greater than the single block carrier threshold (sbct), the allocator switches this allocation into a single block carrier (sbcs). A single block carrier will request memory directly from mseg_alloc for the first 'mmsbc' entries, and then switch over to sys_alloc and store the term there until it's deallocated.

So looking at something such as the binary allocator, we may end up with something similar to:

binary allocators example

Whenever a multiblock carrier (or the first 'mmsbc' single block carriers) can be reclaimed, mseg_alloc will try to keep it in memory for a while so that the next allocation spike that hits your VM can use pre-allocated memory rather than needing to ask the system for more each time.

When we call erlang:memory(total), what we get isn't the sum of all the memory set aside for all these carriers and whatever mseg_alloc has set aside for future calls, but what actually is being used for Erlang terms (the filled blocks in the drawings above). This information, at least, explained that variations between what the OS reports and what the VM internally reports are to be expected. Now we needed to know why our nodes had such a variation, and whether it really was from a leak.

Fortunately, the Erlang VM allows us to get all of the allocator information by calling:

[{{A, N}, Data} || A <- [temp_alloc, eheap_alloc, binary_alloc, ets_alloc,
                          driver_alloc, sl_alloc, ll_alloc, fix_alloc, std_alloc],
                   {instance, N, Data} <- erlang:system_info({allocator,Allocator})]

The call isn't pretty and the data is worse. In that entire data dump, you will retrieve the data for all allocators, for all kinds of blocks, sizes, and metrics of what to use. I will not dive into the details of each part; instead, refer to the functions I have put inside the recon library that will perform the diagnostics outlined in the next sections of this article.

To figure out whether the Logplex nodes were leaking memory, I had to check that all allocated blocks of memory summed up to something roughly equal to the memory reported by the OS. The function that performs this duty in recon is recon_alloc:memory(allocated). The function will also report what is being actively used (recon_alloc:memory(used)) and the ratio between them (recon_alloc:memory(usage)).

Fortunately for Logplex (and me), the memory allocated matched the memory reported by the OS. This meant that all the memory the program made use of came from Erlang's own term allocators, and that the leak came from C code directly was unlikely.

The next suspected culprit was memory fragmentation. To check out this idea, you can compare the amount of memory consumed by actively allocated blocks in every allocator to the amount of memory attributed to carriers, which can be done by calling recon_alloc:fragmentation(current) for the current values, and recon_alloc:fragmentation(max) for the peak usage.

By looking at the data dumps for these functions (or a similar one), Lukas figured out that binary allocators were our biggest problem. The carrier sizes were large, and their utilization was impressively low: from 3% in the worst case to 24% in the best case. In normal situations, you would expect utilization to be well above 50%. On the other hand, when he looked at the peak usage for these allocators, binary allocators were all above 90% usage.

Lukas drew a conclusion that turned out to match our memory graphs. Whenever the Logplex nodes have a huge spike in binary memory (which correlates with spikes in input, given that we deal with binary data for most of our operations), a bunch of carriers get allocated, giving something like this:

allocators full

Then, when memory gets deallocated, some remnants are kept in Logplex buffers here and there, leading to a much lower rate of utilization, looking similar to this:

allocators leak

The result is a bunch of nearly empty blocks that cannot be freed. The Erlang VM will never do defragmentation, and that memory keeps being hogged by binary data that may take a long time to go away; the data may be buffered for hours or even days, depending on the drain. The next time there is a usage spike, the nodes might need to allocate more into ETS tables or into the eheap_alloc allocator, and most of that memory is no longer free because of all the nearly empty binary blocks.

Fixing this problem is the hard part. You need to know the kind of load your system is under and the kind of memory allocation patterns you have. For example, I knew that 99% of our binaries will be smaller or equal to 10kb, because that's a hard cap we put on line length for log messages. You then need to know the different memory allocation strategies of the Erlang virtual machine:

Best fit (bf)
Address order best fit (aobf)
Address order first fit (aoff)
Address order first fit carrier best fit (aoffcbf)
Address order first fit carrier address order best fit (aoffcaobf)
Good fit (gf)
A fit (af)

alloc examples #1

For best fit (bf), the VM builds a balanced binary tree of all the free blocks' sizes, and will try to find the smallest one that will accommodate the piece of data and allocate it there. In the drawing above, having a piece of data that requires three blocks would likely end in area 3.

Address order best fit (aobf) will work similarly, but the tree instead is based on the addresses of the blocks. So the VM will look for the smallest block available that can accommodate the data, but if many of the same size exist, it will favor picking one that has a lower address. If I have a piece of data that requires three blocks, I'll still likely end up in area 3, but if I need two blocks, this strategy will favor the first mbcs in the diagram above with area 1 (instead of area 5). This could make the VM have a tendency to favor the same carriers for many allocations.

Address order first fit (aoff) will favor the address order for its search, and as soon as a block fits, aoff uses it. Where aobf and bf would both have picked area 3 to allocate four blocks, this one will get area 2 as a first priority given its address is lowest. In the diagram below, if we were to allocate four blocks, we'd favor block 1 to block 3 because its address is lower, whereas bf would have picked either 3 or 4, and aobf would have picked 3.

alloc examples #2

Address order first fit carrier best fit (aoffcbf) is a strategy that will first favor a carrier that can accommodate the size and then look for the best fit within that one. So if we were to allocate two blocks in the diagram above, bf and aobf would both favor block 5, aoff would pick block 1. aoffcbf would pick area 2, because the first mbcs can accommodate it fine, and area 2 fits it better than area 1.

Address order first fit carrier address order best fit (aoffcaobf) will be similar to aoffcbf, but if multiple areas within a carrier have the same size, it will favor the one with the smallest address between the two rather than leaving it unspecified.

Good fit (gf) is a different kind of allocator; it will try to work like best fit (bf), but will only search for a limited amount of time. If it doesn't find a perfect fit there and then, it will pick the best one encountered so far. The value is configurable through the mbsd VM argument.

A fit (af), finally, is an allocator behavior for temporary data that looks for a single existing memory block, and if the data can fit, af uses it. If the data can't fit, af allocates a new one.

Each of these strategies can be applied individually to every kind of allocator, so that the heap allocator and the binary allocator do not necessarily share the same strategy.

The Memory Is Coming from Inside the House

Lukas recommended we go with the address order best fit strategy (aobf), along with a reduction in the size of our average mbcs for the binary allocator. With this strategy, we used more CPU to pick where data would go, but the VM hopefully would favor existing free blocks in more cases, meaning that we would have much fewer near-empty mbcs sitting around after a usage spike.

I enabled these settings for a few nodes in production, and then I waited. The problem with these settings is that failures could take up to five weeks to show up in regular nodes when we have multiple dozens of them, and then slowly ramp up in frequency. Measuring the success of the experiments I put in production took an excessively long time, during which the nodes used more CPU to do their thing. After three or four weeks without a crash, I decided to push the experiment further. I pushed the change in production to all nodes, but for these options to kick in, each node needed to be restarted. Usually we use Erlang's hot code loading features to deploy software without terminating a single connection. Instead of restarting the nodes, I waited for them to crash, which took a few weeks. Until the crash, roughly 25% of the cluster was running with the new memory allocation options, and 75% ran with the old ones. The first few nodes passed the time when issues cropped up, and seemed stable.

All the new memory-related crashes were happening on the older nodes, and the now 30-35% new nodes never seemed to crash. I considered the experiment successful, while still knowing there was a non-negligible probability that age was the sole reason the older nodes, but not the new ones, kept dying. Eventually, new Erlang versions came out and I gave it a final push with a cluster roll over the course of a few hours. All production Logplex nodes are now running the newest stable Erlang version with the tweaked memory allocation settings.

Problem: Solved

After a few weeks (or months, depending on the age of the node verified), I found out that in practice the reduction isn't perfect; fragmentation is still occurring, but we see an improvement. Whereas most binary allocators that saw a significant amount of usage before the fixes would have usage rates between 3% and 25%, the new nodes seeing a significant amount of usage tend to have at least 35% to 40% usage, with some of them having well above 90% across the cluster. This more than doubles our efficiency in memory and usage high enough that we have yet to lose a VM without first being able to trigger a global garbage collection call. In fact, we haven't lost memory due to Out-Of-Memory errors that were directly attributable to drains and how Logplex is built. This doesn't mean the cluster components never fail anymore; I still see failures resulting from bad load-balancing between the nodes or connections and services going down that we incorrectly built around (and are now remediating).

I also saw cases in which the nodes would not fail, but have degraded quality of service. The most glaring example of this was some bad behavior from some nodes under heavy load, with high drop rates in Logplex messages to be delivered to users.

Nodes Are Blocking and See Poor Performance

We Solve This One Nice and Easy

The Structure of IO on a Node

During the investigation, our routing team learned that one of the processes on the Erlang node that tended to be at risk was the user process. Erlang's IO is based around the idea of "group leaders." A group leader is a process that will take charge of handling IO and forwarding it around. I already went through why this is useful for Erlang's shell to work, so I will focus on the state of affairs.

Every OTP application running on the VM has an application master, a process that acts as a secret top-level supervisor for each application. On top of the VM, there's a process named user (the at-risk process) that handles all of the IO and is in charge of the standard I/O ports. In between, a varying number of processes, depending on how you're interacting with the VM, may or may not be there. Moreover, every IO call is synchronous between the IO Client (the process that calls io:format/2) and the IO Server (the application master, or ultimately, user).

In a default VM, logging with regular IO calls results in the following structure:

basic io

or, if your commands come from the Erlang shell:

shell io

For our log volume, we made the reasonable optimization of telling io:format to communicate directly with the user process (by calling io:format(user, Fmt, Args)), which removed middlemen and allowed faster communication with less overhead. At peak times, if the VM (or the OS, given we're in the cloud) hiccuped, many connections could time out at once. And if we logged all of these events, we'd get a storm in which the individual processes would block by waiting for the confirmation of messages, which also created a memory bubble. This meant that the logging we wanted was the source of the problems we had. With more waiting, the buffers would accumulate messages in drains that couldn't be sent in time before filling up, resulting in dropped messages.

Buffering Up

First we replaced all of our logging calls with the lager library, which is a fantastic library that I highly recommend. The lager library uses a middleman process, but this middleman acts as a buffer on its own and allows all communication to be asynchronous, up until a certain point. Switching to lager worked well, except that on a few nodes, occassionally (and unpredictably), IO slowed to a crawl, and lager would have thousands of messages backlogged. When that happens, IO switched in synchronous mode to avoid going out of memory. This is entirely sane behavior, but for Logplex it meant that critical paths in our code (for example, the TCP Syslog drains, which should never block) would suddenly lock up, endangering the node in other ways because there was now a risk of overflowing the mailboxes of additional processes.

Going synchronous simply shifted the danger around, which was trouble. I saw two options: to log less, which would temporarily solve the problem at the cost of visibility; or to try buffering and batching log messages asynchronously, which could let us keep all the logs, but wasn't sure to work. Batching has long been a recommended solution for throughput issues. Raising latency a bit leads to better resource usage by regrouping similar operations together and getting more throughput. I decided to try the buffering option because implementing it is not excessively complex, and it promised a quick fix that would be a win in terms of speed and data logged. The idea was to take all the log messages, send them to a middleman buffer process pair that would accumulate them (and optionally drop them), merge the log strings into larger pages, and send these larger pages to the user process.

I would be replacing the earlier diagram describing the IO flow by:

batchio io

If the root cause of the problem was the message passing overhead, this solution could work. If the problem was the total bandwidth of logs Logplex was producing, little could be done to help.

Problem Solved

The solution worked. I promptly released the BatchIO library, which contains the buffer process pair and wraps it behind a glorified io:format/2 call.

By batching operations, there's an estimated reduction in the number of messages sent across the VM for IO of as much as 75% (according to a back-of-the-envelope calculation of mine) without decreasing the amount of data logged. In case of overflow, the library will drop messages instead of blocking.

So far, no messages have needed to be dropped, even though we moved to a fully asynchronous model. We've also kept lager active on the node for error logs, as opposed to the default Erlang handlers. BatchIO couldn't handle error logs; lager is better at handling error logs than the default error handlers on the node, and it will prevent overload in a better way for them.

Denouement

Metrics and Stuff

Through this months long debugging session, the routing team has gained much higher time between individual component failures. This means fewer interruptions for us, fewer silently lost logs for our users, and additional bandwidth to let Heroku focus on more pressing non-operational issues.

The improvements gleaned from this project went far further than the direct reliability of the system; additional system performance improvements came at the right time as demand increased, and the end results were satisfying. Over the course of the project, the number of log messages in transit through Logplex increased by nearly 50%, whereas the number of messages dropped was reduced by an order of magnitude during the same time period.

As a result of our efforts, I also have released a few libraries to the Erlang community:

Recon: Contains scripts and modules to help diagnose Erlang issues, including scripts to deal with crash dumps, memory allocation diagnostics, and other general debugging functions that can be safely used in production.
BatchIO: Our IO buffering library. Although this is not as good as a full logging library, I hope it can be used as a model to help with existing logging libraries, if they were to offer alternatives to going synchronous when overloaded.
POBox: A generalization of the Logplex buffers for HTTP(s) drains designed by Geoff Cant, the technical leader of the routing team. This library is used as the core of BatchIO, and we've started using it in other internal projects that require batching or load-shedding.

I still plan to work on debugging in Logplex. For example, I'm in the process of helping Logplex be more fault-tolerant to external services failure. And although overload can still take nodes down, at least it now takes more load to overload Logplex.

Conclusion

Every project that lives in production for a while and requires scaling up ends up having weird, complicated issues. They're issues that cannot be attributed to one root cause, are difficult to reproduce, and that nobody else in the community can encounter or solve for you. These issues crop up no matter what language, stack, framework, or hardware you have, or how correct you think your software is.

These kinds of production problems are the modern version of bridges collapsing after too many people cross at once, creating heavy mechanical resonance. Although most modern bridges are able to handle these issues safely, past failures (and even newer failures) led to safer structures. Unlike bridge design, software engineering is still a young field. Hopefully this article offers a glimpse of the daily work Heroku engineers perform, and sharing our experience will be helpful to the greater development community.

Video Transcript