Apache Kafka, Data Pipelines, and Functional Reactive Programming with Node.js

Heroku recently released a managed Apache Kafka offering.  As a Node.js developer, I wanted to demystify Kafka by sharing a simple yet practical use case with the many Node.js developers who are curious how this technology might be useful.  At Heroku we use Kafka internally for a number of use cases, including data pipelines.  I thought that would be a good place to start.

When it comes to actual examples, Java and Scala get all the love in the Kafka world.  Of course, these are powerful languages, but I wanted to explore Kafka from the perspective of Node.js.  While there are no technical limitations to using Node.js with Kafka, I was unable to find many examples of their use together in tutorials, open source code on GitHub, or blog posts.  Libraries implementing Kafka’s binary (and fairly simple) communication protocol exist in many languages, including Node.js.  So why isn’t Node.js found in more Kafka projects?

I wanted to know if Node.js could be used to harness the power of Kafka, and I found the answer to be a resounding yes.

Moreover, I found the pairing of Kafka and Node.js to be more powerful than expected.  Functional Reactive Programming is a common paradigm used in JavaScript due to the language’s first-class functions, event loop model, and robust event handling libraries.  With FRP being a great tool to manage event streams, the pairing of Kafka with Node.js gave me more granular and simple control over the event stream than I might have had with other languages.

Continue reading to learn more about how I used Kafka and Functional Reactive Programming with Node.js to create a fast, reliable, and scalable data processing pipeline over a stream of events.

The Project

I wanted a data source that was simple to explain to others and from which I could get a high rate of message throughput, so I chose to use data from the Twitter Stream API, as keyword-defined tweet streams fit these needs perfectly.

Fast, reliable, and scalable.  What do those mean in this context?

  • Fast.  I want to be able to see data soon after it is received -- i.e. no batch processing.
  • Reliable.  I do not want to lose any data.  The system needs to be designed for “at least once” message delivery, not “at most once”.
  • Scalable.  I want the system to scale from ten messages per second to hundreds or maybe thousands and back down again without my intervention.

So I started thinking through the pieces and drew a rough diagram of how the data would be processed.

Data Pipeline

Each of the nodes in the diagram represents a step the data goes through.  From a very high level, the steps go from message ingestion to sorting the messages by keyword to calculating aggregate metrics to being shown on a web dashboard.

I began implementing these steps in a single codebase and quickly saw the code growing complex and difficult to reason about.  Performing all of the processing steps in one unified transformation is challenging to debug and maintain.

Take a Step Back

I knew there had to be a cleaner way to implement this.  As a math nerd, I envisioned a way to solve this by composing simpler functions -- maybe something similar to the POSIX-compliant pipe operator that allows processes to be chained together.

JavaScript allows for various programming styles, and I had approached this initial solution with an imperative coding style.  An imperative style is generally what programmers first learn, and probably how most software is written (for good or bad).  With this style, you tell the computer how you want something done.

Contrast that with a declarative approach, in which you instead tell the computer what you want done.  More specifically, a functional style tells the computer what you want done through the composition of side-effect-free functions.

Here are simple examples of imperative and functional programming.  Both examples result in the same value.  Given a list of integers, remove the odd ones, multiply each integer by 10, and then sum all integers in the list.

Imperative

const numList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
let result = 0
for (let i = 0; i < numList.length; i++) {
  if (numList[i] % 2 === 0) {
    result += numList[i] * 10
  }
}

Functional

const numList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
const result = numList
               .filter(n => n % 2 === 0)
               .map(n => n * 10)
               .reduce((a, b) => a + b, 0)

Both complete execution with result equal to 300, but the functional approach is much easier to read and is more maintainable.

If that’s not readily apparent to you, here’s why: in the functional example, each function added to the “chain” performs a specific, self-contained operation.  In the imperative example, the operations are mashed together.  In the functional example, state is managed for me within each function, whereas I have to manage changing state (stored in the result variable) during execution in the imperative version.

These may seem like small inconveniences, but remember this is just a simple example.  In a larger and more complex codebase the minor inconveniences like these accumulate, increasing the cognitive burden on the developer.

The data processing pipeline steps were screaming to be implemented with this functional approach.

But What About The Reactive Part?

Functional reactive programming “is a programming paradigm for reactive programming (asynchronous dataflow programming) using the building blocks of functional programming (e.g. map, reduce, filter)” [frp].  In JavaScript, functional reactive programming is mostly used to react to GUI events and manipulate GUI data.  For example, the user clicks a button on a web page, and the reaction is an XHR request that returns a chunk of JSON.  The reaction to the successfully returned chunk of JSON is a transformation of that data into an HTML list, which is then inserted into the webpage’s DOM and shown to the user.  You can see patterns like this in the examples provided by a few of the popular JavaScript functional reactive programming libraries: Bacon.js, RxJS, and flyd.

Interestingly, the functional reactive pattern also fits very well in the data processing pipeline use case.  For each step, not only do I want to define a data transformation by chaining functions together (the functional part), but I also want to react to data as it comes in from Kafka (the reactive part).  I don’t have the luxury of a fixed length numList.  The code is operating on an unbounded stream of values arriving at seemingly random times.  A value might arrive every ten seconds, every second, or every millisecond.  Because of this I need to implement each data processing step without any assumptions about the rate at which messages will arrive or the number of messages that will arrive.

I decided to use the lodash utility library and the Bacon.js FRP library to help with this.  Bacon.js describes itself as “a small functional reactive programming lib for JavaScript. Turns your event spaghetti into clean and declarative feng shui bacon, by switching from imperative to functional...Stop working on individual events and work with event-streams instead [emphasis added]” [bac].

Kafka as the Stream Transport

The use of event streams makes Kafka an excellent fit here.  Kafka’s append-only, immutable log store serves nicely as the unifying element that connects the data processing steps.  It not only supports modeling the data as event streams but also has some very useful properties for managing those event streams.

  • Buffer: Kafka acts as a buffer, allowing each data processing step to consume messages from a topic at its own pace, decoupled from the rate at which messages are produced into the topic.
  • Message resilience: Kafka provides tools to allow a crashed or restarted client to pick up where it left off.  Moreover, Kafka handles the failure of one of its servers in a cluster without losing messages.
  • Message order: Within a Kafka partition, message order is guaranteed. So, for example, if a producer puts three different messages into a partition, a consumer later reading from that partition can assume that it will receive those three messages in the same order.
  • Message immutability: Kafka messages are immutable.  This encourages a cleaner architecture and makes reasoning about the overall system easier.  The developer doesn’t have to be concerned (or tempted!) with managing message state.
  • Multiple Node.js client libraries: I chose to use the no-kafka client library because, at the time, this was the only library I found that supported TLS (authenticated and encrypted) Kafka connections by specifying brokers rather than a ZooKeeper server.  However, keep an eye on all the other Kafka client libraries out there: node-kafka, kafka-node, and the beautifully named Kafkaesque.  With the increasing popularity of Kafka, there is sure to be much progress in JavaScript Kafka client libraries in the near future.
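The buffering, resume, and ordering properties above can be sketched in miniature.  Here a plain array stands in for a partition’s append-only log and a variable stands in for a committed offset — all names are illustrative stand-ins, not the no-kafka API:

```javascript
// Miniature model of a Kafka partition: an append-only log plus a
// consumer-tracked offset. Committing after processing gives
// "at least once" delivery semantics.
const partition = ['msg-0', 'msg-1', 'msg-2', 'msg-3'];  // append-only log
let committedOffset = 0;  // last position durably recorded by the consumer

function consume(fromOffset, handler) {
  for (let offset = fromOffset; offset < partition.length; offset++) {
    handler(partition[offset]);           // messages arrive in log order
    committedOffset = offset + 1;         // commit after successful processing
  }
}

const seen = [];
consume(committedOffset, m => seen.push(m));
// A restarted consumer resumes from committedOffset instead of the start.
consume(committedOffset, m => seen.push(m));  // nothing new: no duplicates
```

If the process crashed after handling a message but before committing, that message would be delivered again on restart — which is exactly the “at least once” (not “at most once”) behavior the pipeline is designed for.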

Putting it All Together: Functional (and Reactive) Programming + Node.js + Kafka

This is the final architecture that I implemented (you can have a look at more details of the architecture and code here).

Twitter Data Processing Pipeline Architecture

Data flows from left to right.  The hexagons each represent a Heroku app.  Each app produces messages into Kafka, consumes messages out of Kafka, or both.  The white rectangles are Kafka topics.

Starting from the left, the first app ingests data as efficiently as possible.  I perform as few operations on the data as possible here so that getting data into Kafka does not become a bottleneck in the overall pipeline.  The next app fans the tweets out to keyword- or term-specific topics.  In the example shown in the diagram above, there are three terms.  
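At its heart, the fan-out step is a routing decision made per tweet.  Here is a hedged sketch of that decision; the term list, the `tweets-<term>` topic-naming scheme, and the `routeTweet` helper are hypothetical, not the app’s actual code:

```javascript
// Route each tweet to every keyword-specific topic whose term appears
// in its text. The resulting list would then be produced into Kafka.
const terms = ['node', 'kafka', 'heroku'];  // illustrative term list

function routeTweet(tweet) {
  return terms
    .filter(term => tweet.text.toLowerCase().includes(term))
    .map(term => ({ topic: `tweets-${term}`, message: tweet }));
}

const routed = routeTweet({ text: 'Kafka and Node.js' });
// routed contains entries for 'tweets-node' and 'tweets-kafka'
```

Because the function is pure, a tweet matching several terms is simply produced into several topics, and each downstream consumer stays blissfully unaware of the others.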

The next two apps perform aggregate metric calculation and related-term counting.  Here’s an example of the functional-style code used to count the frequency of related terms.  It is called on every tweet.

function wordFreq(accumulator, string) {
  return _.replace(string, /[\.!\?"'#,\(\):;-]/g, '') //remove special characters
    .split(/\s/)
    .map(word => word.toLowerCase())
    .filter(word => ( !_.includes(stopWords, word) )) //dump words in stop list
    .filter(word => ( word.match(/.{2,}/) )) //dump single char words
    .filter(word => ( !word.match(/\d+/) )) //dump all numeric words
    .filter(word => ( !word.match(/http/) )) //dump words containing http
    .filter(word => ( !word.match(/@/) )) //dump words containing @
    .reduce((map, word) =>
      Object.assign(map, {
        [word]: (map[word]) ? map[word] + 1 : 1,
      }), accumulator
    )
}

A lot happens here, but it’s relatively easy to scan and understand.  Implementing this in an imperative style would require many more lines of code and be much harder to understand (i.e. maintain).  In fact, this is the most complicated data processing step.  The functional implementations for each of the other data processing apps are even shorter.
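To see how such a reducer accumulates state, here is a simplified, dependency-free variant folded over a couple of example strings.  The lodash calls are replaced with built-ins and the tiny stopWords list is an illustrative assumption:

```javascript
// Simplified stand-in for the wordFreq reducer: strip punctuation,
// tokenize, drop stop words and single-char words, then fold the
// tokens into a running { word: count } map.
const stopWords = ['the', 'and', 'a'];  // illustrative stop list

function wordFreq(accumulator, text) {
  return text.replace(/[.!?"'#,():;-]/g, '')         // remove special characters
    .split(/\s+/)
    .map(word => word.toLowerCase())
    .filter(word => word.length >= 2)                 // dump single-char words
    .filter(word => !stopWords.includes(word))        // dump stop words
    .reduce((map, word) =>
      Object.assign(map, { [word]: (map[word] || 0) + 1 }), accumulator);
}

const counts = ['kafka and node', 'node streams'].reduce(wordFreq, {});
// counts is { kafka: 1, node: 2, streams: 1 }
```

Because the accumulator is threaded through every call, the same function works unchanged whether it is folded over a fixed array, as here, or scanned over an unbounded stream of tweets.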

Finally, a web application serves the data to web browsers with some beautiful visualizations.

Kafka Twitter Dashboard

Summary

Hopefully, this post has given you not only some tools but also a basic understanding of how to implement a data processing pipeline with Node.js and Kafka.  We at Heroku are really excited about providing the tools to make evented architectures easier to build, easier to manage, and more stable.

If you are interested in deploying production Apache Kafka services and apps at your company, check out our Apache Kafka on Heroku Dev Center Article to get started.
