Video Transcript


How We Sped up SNI TLS Handshakes by 5x

During the development of the recently released Heroku SSL feature, a lot of work was carried out to stabilize the system and improve its speed. In this post, I will explain how we managed to improve the speed of our TLS handshakes by 4-5x.

The initial reports of speed issues were sent our way by beta customers who were unhappy about the low level of performance. This was understandable since, after all, we were not greenfielding a solution for which nothing existed, but actively trying to provide an alternative to the SSL Endpoint add-on, which is provided by a dedicated team working on elastic load balancers at AWS. At the same time, another of the worries we had was to figure out how many more routing instances we would need to absorb the CPU load of a major part of our traffic no longer being simple HTTP, but HTTP + TLS.

Detecting the Problem

The simplest way to work at both of these things is always through benchmarking. So we set up some simple benchmarks that used TLS with no session resumption or caching, and on HTTP requests that were extremely small with no follow-up requests coming over the connection. The objective there was to specifically exercise handshakes:

  • TLS handshakes are more costly than just encrypting data over a well-established connection.
  • HTTP keep-alive requests reuse the connection and the first one is technically more expensive (since it incurs the cost of the handshake), so we disabled them to always have the most expensive thing happening.
  • Small amounts of data to encrypt mean that the weight of the handshake dominates the overall amount of time taken for each query. We wanted more handshakes and less of everything else.

Under such benchmarks, we found that our nodes became slow and unresponsive at a depressingly rapid rate, almost 7-8 times earlier than with the same test over HTTP. This is a far cry from reported overheads of 1%-2% in places like google, although we purposefully went with a very pessimistic pattern. A more realistic pattern would likely have had lower overhead.

This could have easily have been a call for panic. We had used the Erlang/OTP SSL libraries to serve our traffic. While there's some added safety to having a major part of your stack not written in C and not dependent on OpenSSL, which recently has experienced several notable vulnerabilities, we did run the risk of much worse performance. To be clear, the Erlang SSL library does use OpenSSL bindings for all of the cryptographic functionality but uses Erlang for protocol-level parts of the implementation, such as running state machines. The library has gone through independent audit and testing.

We have a team of people who know Erlang pretty well and are able to profile and debug it, so we decided to see if we could resolve the performance issue just tweaking standard configurations. During initial benchmarking, we found that bandwidth and packet counts were very low and memory usage was as expected. CPU usage was fairly low (~30%) but did tend to jump around quite a bit.

For us, this pointed to a standard kind of bottleneck issue where you see flapping as some work gets done in a batch, then the process waits, then does more work. However, from our internal tools, we could see very little that was not just SSL doing work.

Eventually, we used perf top to look at things, which is far less invasive than most tooling out there (if you're interested, you should take a look at Julian Squire's talk at the Erlang User Conference 2016 on this specific topic).

The thing that immediately jumped out at us was that a bunch of functions were taking far more time than we'd expect. Specifically, the following results would show up:

perf top output for the node

The copy_shallow and do_minor functions are related to garbage collection operations within the Erlang VM. They can be high in regular circumstances, but here they were much, much higher than expected. In fact, GC was taking more time than actual crypto work! The other thing that took more time than crypto was the db_next_hash function, which was a bit funny.

We looked some more, and as more samples came out, the pattern emerged:

perf top output for the node

CPU time would flap a lot between a lot of garbage collection operations and a lot of db_*_hash operations, whereas given the benchmark, we would have expected libcrypto.so to do the most work.

The db_*_hash operations are specifically related to something called ETS (Erlang Term Storage) tables. ETS tables are an efficient in-memory database included with the Erlang virtual machine. They sit in a part of the virtual machine where destructive updates are allowed and where garbage collection dares not approach. They're generally fast, and a pretty easy way for Erlang programmers to optimize some of their code when parts of it get too slow.

In this case, though, they appeared to be our problem. Specifically, the next and get operations are expected to be cheap, but select tends to be very expensive and a sign of full-table scans.

By logging onto the node during the benchmark, we could make use of Erlang's tracing facilities. The built-in tracing functions basically let an operator look at all messages, function calls and function returns, garbage collections, scheduling activities, data transiting in and out of ports, and so on, at the language level. This tracing is higher level than tracing provided by tools such as strace or dtrace.

We simply ran calls to recon_trace:calls({ets, select_delete, '_'}, 100) and saw that all of the calls came from a single process named ssl_manager.

Understanding and Fixing the Problem

The SSL manager in the Erlang library has been a potential bottleneck for a long period of time. Prior to this run, we had already disabled all kinds of cache implementations that turned out to be slower for our use cases. We had also identified this lone, central process as a point of contention by design -- we have a few of them and tend to know about them as specific load scenarios exercise them more than others.

The tracing above had also shown that the trace calls were made from a module called ssl_pkix_db, which is in turn called by ssl_manager.

These modules were used by the SSL implementation as a cache for intermediary certificates for each connection. The initial intent was to cache certificates read from disk by Erlang, say from /etc/ssl/.

For us, this would have been costly when the server is fetching the files from disk hundreds or thousands of times a second, decoding them from PEM to DER format, and subsequently to their internal Erlang format.

The cache was set in place such that as soon as one connection requested a given file, it would get decoded once, and then cached in memory through ETS tables with a reference count. Each new session would increment the count, and each terminated session would decrement the count. When the count reaches 0, the cache is dropped. Additional table scans would take place to provide upper time limits for caches. The cache was then used for other operations, such as scanning through CA certs during revocation checks.

The funny bit there is that the SSL library supports an entire other format to handle certificates: have the user decode the PEM data to the DER format themselves and then submit the DER data directly in memory through the connection configuration. For dynamic routers such as Heroku's, that's the method we've taken to avoid storing multiple certificates on disk unencrypted. We store them ourselves, using appropriate cryptographic means, and then decode them only once they are requested through SNI, at which point they are passed to the SSL library.

For this use case, the SSL library has certificates passed straight in memory and does not require file access. Still, the same caching code path was used. The certificates are cached but not reference-counted and also not shared across connections either. For heavy usage of DER-encoded certificates, the PEM cache, therefore, becomes a central bottleneck for the entire server, forcing the decoding of every certificate individually through that single critical process.

Patching Results

We decided to write a very simple patch whose purpose is just to bypass the cache wholesale, nothing more. Once the patch was written, we needed to test it. We ran it through our benchmark sequence and the capacity of each instance instantly tripled, with response times now much lower than before.

It soon came to the step of sending our first canary nodes into production to see how they'd react with real-world data rather than benchmark cases. It didn't take too long to see the improvements:

deploy results for a canary node

The latencies on the node for all SSL/TLS queries over SNI had instantly dropped by 400% to 500% -- from 100-120 milliseconds roundtrip down to a range of 20-25 milliseconds for the particular subset of apps we were looking at. That was a lot of overhead.

In fact, the canary node ran for a while and beat every other node we had on the platform:

deploy results for a canary node vs. all the others

It seemed a bit surprising that so much time was spent only through the bottleneck, especially since our benchmark case was more pessimistic than our normal production load. Then we looked at garbage collection metrics:

erlang GC diagram

The big dip in the middle is the node restarting to run the new software. What's interesting is that as soon as the new version was used, the garbage collection count went from about 3 million GCs per minute down to roughly 2.5 million GCs per minute. The reclaimed words are impacted more significantly: from around 7 billion words reclaimed down to 4 billion words reclaimed per minute.

Since Erlang garbage collections are per-process, non-blocking, and generational, it is expected to see a lot of small garbage collections and a few larger ones. The overall count includes both values. Each garbage collection tends to take a very short time.

What we know for a fact, though, is that we had this one bottleneck of a process and that a lot of time was saved. The supposition is that because a lot of requests (in fact, all of the SNI termination requests) had to touch this one Erlang process, it would have time to accumulate significant amounts of garbage for short periods of time. This garbage collection impacted the latency of this process and the scheduler it was on, but not the others; processes on other cores could keep running fine.

This yielded a pattern where a lot of data coming from the other 15 cores could pile up into the message queue of the SSL manager, up to the point that memory pressure forced it to GC more and more often. At regular intervals, a more major GC would take place, and stall an even larger number of requests.

By removing the one small bottleneck, the level of garbage generated was more equally shared by all callers and also tended to be much more short-lived. In fact, if the caller was a short-lived process, its memory would be reclaimed on termination without any garbage collection taking place.

The results were so good that we rapidly deployed them to the rest of the platform:

overall deploy latency graph

Those values represented the median (orange), the 95th percentile (blue), and the 99th percentile (green). You can see the progression of the deploy as the median rapidly shrinks, and the point where the deploy finishes when the 99th percentile responses became faster than our previous medians.

The patch has been written, tested, cleaned up, and sent upstream to the OTP team at Ericsson (see the pull request). It has recently been shipped along with Erlang OTP-19.1, and everyone can now use it in the releases starting then.

Originally published: December 22, 2016

Browse the archives for engineering or all blogs Subscribe to the RSS feed for engineering or all blogs.