Data streams are all around us, and companies need to process that data instantly, or as close to real time as possible. That’s where stream processing technologies can help. Stream processing lets you sense, query, analyze, and process data as it flows. But how do you go about it?
For one of our projects, we needed a stream processing framework. In addition to basic streaming queries, we also needed complex event processing. Let’s go over the options we considered and the choice we made.
Let’s start by summarizing our needs:
An expressive processing framework: the application we wanted to build had to compute windowed aggregates over streaming data, but it also had to react in a specific way to particular sequences of events, or to the absence of an event. So we needed a framework that gave us control over how events were partitioned for processing, different types of windowing and, if possible, some tools to facilitate pattern matching.
An exactly-once guarantee for message processing. Duplicate or missed events would not be tolerated, and internal state (counters, state machines…) needed to be consistent with this.
High availability: our system had to be resilient against software or hardware faults, with redundancy and automatic recovery from crashes or hiccups. It also had to be easy to monitor, so we could intervene quickly if something unexpected happened.
Good support for offline unit and integration testing. Stream processing code is hard to test, and often, a lot of infrastructure is needed to set up a reliable test.
Queryable state would allow us to show important data in real time to the end user (e.g. the result of aggregations/counts, number of errors…).
Our application was Kubernetes native, so we were looking for something that would fit in.
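To make the first requirement more concrete, here is a minimal plain-Python sketch of the two kinds of logic we had in mind: a tumbling-window count per key, and detection of a missing follow-up event. It is not tied to any framework; the function names, the event tuples and the "start"/"end" event kinds are assumptions made up for this illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per (key, window start).

    `events` is an iterable of (timestamp, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = timestamp - (timestamp % window_size)
        counts[(key, window_start)] += 1
    return dict(counts)


def missing_followups(events, timeout):
    """Return keys whose "start" was not followed by an "end" within `timeout`.

    `events` is a list of (timestamp, key, kind) sorted by timestamp.
    Time only advances when a new event arrives, so keys that are still
    open when the input ends are not reported.
    """
    open_since = {}
    timed_out = []
    for timestamp, key, kind in events:
        # Expire every open key whose deadline passed before this event.
        for open_key, started in list(open_since.items()):
            if timestamp - started > timeout:
                timed_out.append(open_key)
                del open_since[open_key]
        if kind == "start":
            open_since[key] = timestamp
        elif kind == "end":
            open_since.pop(key, None)
    return timed_out
```

A real framework adds the hard parts on top of this: distributing the state over workers, keeping it consistent across crashes, and advancing time when no events arrive.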
In order to make the right choice, we looked at a couple of possible solutions.
(TLDR: we chose Apache Flink. Read on if you want to know why. There may even be a comparison table at the end.)
We started by looking at Faust. It seemed to be a very good candidate for us: it conveniently uses Python as its language. In addition, it has the same design as the Kafka Streams framework (which we will discuss later). On paper, it looked fantastic:
Faust, being strictly Kafka to Kafka, could give us exactly once semantics in a very painless way (no extra setup needed, all storage is done in Kafka).
The expressivity of Faust seems to be limited only by the expressivity of the Python language. If Faust was missing a certain way of grouping data together, we could always write our own stream processor that could do it.
Deployment is a breeze: just spin up as many workers as you want, no “master” needed. It uses Kafka’s features for coordination between workers. The workers need some persistent storage for efficient operation, but when it is lost, it can be rebuilt automatically.
Queryable state is supported and easy to implement.
Faust has support for creating offline unit tests (so you could test and debug a stream processor without having a Kafka broker at your disposal).
For interaction with anything else (database, REST…): use Kafka Connect or roll your own Python code. Low tech, but it seemed good enough.
Since Faust looked so promising, we started working on a Faust implementation immediately. Soon enough, a lot of problems arose:
Reliability: very quickly, we ran into situations where Faust crashed and refused to recover, or where Faust didn’t crash but suddenly stopped processing records. When looking for solutions, we stumbled upon existing open bug reports and a fork that fixed some (but not all) issues. When Faust stopped processing records, the REST API still listed “running” as the state for the job, preventing us from implementing correct health checks.
Exactly once semantics was not always working the way we expected it to: we used noack to prevent automatic acknowledgement of messages when processing failed for some reason. But Faust does not always honor this.
It is not clear what the future will bring for Faust: the original version from Robinhood is all but dead, while the fork seems to be in better shape and might become a viable option.
The unit testing was problematic. For instance, in Faust, it is your own responsibility to make sure you partition your streams and your state correctly and consistently. The mock objects in the unit tests treat everything as one big partition, leaving a lot of bugs undetected. In the end, we stopped relying on unit tests, running in-cluster tests instead.
Writing stream processors in an imperative way turned out to be very verbose. We didn’t really expect this upfront, especially with a language like Python that is normally known for concise syntax.
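The partitioning problem mentioned in the list above is easy to reproduce without Faust. The sketch below is a plain-Python simulation, not Faust code: `run_workers`, the event shape and the modulo partitioner are assumptions for illustration (a real broker hashes the key bytes instead). The stream is partitioned by user id, but the table is keyed by country, which is exactly the kind of inconsistency a single-partition mock cannot reveal.

```python
from collections import Counter


def partition_for(user_id, num_partitions):
    # Stand-in for the broker's hash-based partitioner (assumption:
    # integer keys and a simple modulo keep the example predictable).
    return user_id % num_partitions


def run_workers(events, num_partitions):
    """Simulate one worker-local table per partition.

    `events` is a list of (user_id, country). The stream is partitioned
    by user_id while the table is keyed by country: the bug under test.
    """
    tables = [Counter() for _ in range(num_partitions)]
    for user_id, country in events:
        worker = partition_for(user_id, num_partitions)
        tables[worker][country] += 1
    return tables


events = [(1, "BE"), (2, "BE"), (3, "BE"), (4, "NL")]

# One partition, as in the mocked unit test: the count looks correct.
single = run_workers(events, 1)

# Two partitions, as in the cluster: no worker holds the full count for "BE".
split = run_workers(events, 2)
```

With one partition the only table reports three events for "BE"; with two partitions the same events end up as partial counts on different workers, so a test that only ever runs the first case passes while the deployed job is wrong.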
Next, we looked at Kafka Streams. A nice thing is that it is made to coexist with other software: you can basically just use it as a library and combine it with anything you want.
However, we also found some caveats:
Queryable state is supported so-so: it works, but you have to do a lot of the implementation yourself (e.g. creating an API for querying the state).
The code is more verbose than the Python code we had in Faust. We did some tests with the Scala API. The Scala wrapper improves things, but it is rather thin, so you will end up using the Java API a lot, especially for stateful stream processing.
Crash recovery is done by replaying messages from a “log” Kafka topic. That is nice because, if you have a Kafka backup, you automatically have a Kafka Streams backup too. However, it can be a problem when you have to replay a lot of messages to recover. Note that this weakness is shared by Kafka Streams, ksqlDB and Faust (but not by Flink).
Our advice: use Kafka Streams (and ksqlDB, see below) if you are already invested in the Confluent ecosystem. Kafka Streams blends in nicely, and for anything non-Kafka, you will probably use Kafka Connect anyway. However:
Also consider ksqlDB as an alternative. If you can describe your problem in SQL, you will save a lot of time.
Make use of Kafka’s log compaction to avoid long downtimes when recovering from a crash.
Keep your Kafka Streams application side-effect free (= purely functional) and use Kafka Connect to talk with the outside world.
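To see why log compaction helps with recovery time, consider this simplified model of a changelog topic. It is only a sketch: `restore_state` and `compact` are hypothetical helpers, and real compaction runs in the background on the broker and handles deletions, which we ignore here.

```python
def restore_state(changelog):
    """Rebuild a key-value table by replaying (key, value) records in order."""
    state = {}
    for key, value in changelog:
        state[key] = value
    return state


def compact(changelog):
    """Keep only the latest record per key, preserving the survivors' order."""
    last_index = {key: i for i, (key, _) in enumerate(changelog)}
    return [
        record
        for i, record in enumerate(changelog)
        if last_index[record[0]] == i
    ]


changelog = [("a", 1), ("b", 1), ("a", 2), ("a", 3), ("b", 2)]
compacted = compact(changelog)

# Both logs restore to the same table, but the compacted one is shorter,
# so there are fewer records to replay after a crash.
same_state = restore_state(changelog) == restore_state(compacted)
```

The restored table is identical in both cases; what changes is the number of records that have to be read before the processor can resume.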
ksqlDB is built on top of Kafka Streams, offering a SQL interface for defining stream processors instead of coding in Java/Scala. It supports two modes:
An interactive mode where you have a SQL prompt and you can just start processors by typing SQL.
A “fixed” mode where ksqlDB runs a fixed number of queries.
Interestingly enough, the queryable state that ksqlDB supports is not available when you run in fixed mode, and we think the interactive mode is better suited to exploration than to production use.
Also interestingly, ksqlDB doesn’t inherit the composability from Kafka Streams: you cannot mix SQL for simpler cases with Scala/Java stream processors for more complex cases (actually you can, but you need to run both as a separate app).
What we found very nice was that ksqlDB seemed to be very reliable: it stores not only the state of the queries in Kafka, but also the running queries themselves. So you can fire up a stream processor, then completely kill ksqlDB and let it restart, and the query will just resume where it stopped.
Our advice: see our advice for Kafka Streams. If your workload is simple enough to be expressed fully as SQL queries, then just ksqlDB + Kafka (Connect) makes for a simple and reliable architecture.
Apache Flink is a big name in the streaming world. Let’s find out why.
Exactly-once semantics are well supported, but Apache Flink achieves them differently from the options above. It doesn’t rely on strict Kafka-to-Kafka processing. This has some substantial advantages: you can create a Kafka-to-JDBC pipeline and still have the same guarantees. It uses two-phase commits and checkpoints to provide exactly-once processing for every source/sink that supports it. One caveat is that the exactly-once mode introduces some latency (depending on your checkpoint interval).
Monitoring support is top notch: you get not only a REST API but also a web interface that gives you a lot of details. We cannot stress enough how nice and useful the web interface is when debugging: it ranges from a nice overview of everything that went wrong, through a real-time view of the flow of messages, to a stack trace of all threads, and even flame graphs!
Unit testing is very complete (it spins up a mini-cluster to run tests). However, we had some trouble getting started: how to debug a modern (Scala) Flink application seems to be documented in a mix of (often outdated) docs and Stack Overflow posts. In several cases, we had to look at Flink’s own tests to see how to properly test things. Now that everything is set up, though, it works very nicely.
Flink is the most full-featured option we reviewed and you are programming on a high level: the SQL used by Flink is much more powerful than the SQL used by ksqlDB, and as a Java/Scala/Python programmer, you get a lot of tools to work with. In addition to this, there is no limit on mixing the different tools in a single application: you can combine SQL, Java/Scala code and CEP patterns in the same stream processor.
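The combination of checkpoints and transactional sinks described in the first point above can be illustrated with a toy simulation. This is not Flink code: `TransactionalSink` and `run_pipeline` are hypothetical names, the "processing" is just doubling a number, and the snapshot and commit happen back to back, whereas a real system waits until the checkpoint is durably stored before committing.

```python
class TransactionalSink:
    """Toy sink: writes only become visible once the transaction commits."""

    def __init__(self):
        self.committed = []  # visible to downstream readers
        self.open_txn = []   # written but not yet visible

    def write(self, record):
        self.open_txn.append(record)

    def commit(self):
        self.committed.extend(self.open_txn)
        self.open_txn = []

    def abort(self):
        self.open_txn = []


def run_pipeline(source, sink, checkpoint_every, crash_at=None):
    """Process `source` into `sink`, checkpointing every N records.

    If `crash_at` is given, simulate a single crash just before the
    record at that offset is processed.
    """
    checkpointed_offset = 0
    offset = 0
    crashed = False
    while offset < len(source):
        if crash_at is not None and offset == crash_at and not crashed:
            crashed = True
            sink.abort()                  # uncommitted output is discarded
            offset = checkpointed_offset  # source rewinds to the checkpoint
            continue
        sink.write(source[offset] * 2)    # the "processing" step
        offset += 1
        if offset % checkpoint_every == 0:
            checkpointed_offset = offset  # phase 1: snapshot the source offset
            sink.commit()                 # phase 2: make the output visible
    sink.commit()  # final checkpoint at the end of the bounded input
    return sink.committed


source = [1, 2, 3, 4, 5]
clean = run_pipeline(source, TransactionalSink(), checkpoint_every=2)
recovered = run_pipeline(source, TransactionalSink(), checkpoint_every=2, crash_at=3)
```

Both runs end with the same committed output, without duplicates or gaps, because everything written after the last checkpoint is thrown away and recomputed. The sketch also shows where the latency caveat comes from: output only becomes visible at a commit, so a longer checkpoint interval means a longer wait.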
As with every option, we also found some caveats:
Queryable state is supported. Well, kind of. It has been in beta since Flink 1.2 and it looks like it will be deprecated soon. In addition to this, there is only a Java client for the queryable state. In the end, we decided to persist all relevant state in a classic RDBMS instead of relying on this (seemingly experimental) feature.
Flink (we use Spotify’s Flink operator) is quite a beast, and starting up a distributed Flink cluster with an application in it takes a while. All alternatives start up much faster than Flink. This can be quite annoying during development, and, in production, there is no support for rolling upgrades. This means that every time you upgrade or relaunch your job, you should expect a downtime of at least a couple of minutes. In addition to this, finding the optimal settings to run Flink on Kubernetes took us quite a while.
Over the years, Flink deprecated old APIs and introduced new ones. The documentation for the old APIs seems to stick around and still turns up when you search for documentation. So be careful, or you will end up working with deprecated or even removed code.
Flink supports Java, Scala and Python. We opted for Scala, and we are not yet 100% certain that this was the right choice. For some parts (e.g. CEP), we had to resort to the Java API because the Scala API didn’t cover them.
Making a job that is upgradeable without losing state is not trivial. Make sure you test the upgradability of your code in addition to the correctness of your code.
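As an illustration of the workaround for queryable state mentioned above (persisting the relevant state in a classic RDBMS), here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table name, columns and helper functions are assumptions for this example, not our production schema. The write is keyed on (key, window start), so replaying the same result after a restart overwrites the row instead of duplicating it.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one row per (key, window)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE window_counts ("
        " key TEXT, window_start INTEGER, count INTEGER,"
        " PRIMARY KEY (key, window_start))"
    )
    return conn


def upsert_count(conn, key, window_start, count):
    # Idempotent write: the same result written twice leaves a single row.
    conn.execute(
        "INSERT OR REPLACE INTO window_counts VALUES (?, ?, ?)",
        (key, window_start, count),
    )
    conn.commit()


def query_count(conn, key, window_start):
    """Return the stored count, or None if the window is unknown."""
    row = conn.execute(
        "SELECT count FROM window_counts WHERE key = ? AND window_start = ?",
        (key, window_start),
    ).fetchone()
    return row[0] if row else None
```

Any service can then answer end-user queries with a plain SELECT, independently of whether the streaming job is running, restarting or being upgraded.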
Our advice: if your project is big enough to make up for the upfront costs of getting to know Flink (count a couple of weeks at least), go for Flink.
Everybody likes comparison tables, so let’s try to make one. But looking only at a comparison table would be misleading. That’s why we didn’t stop there.
Exactly once guarantee
– (see text)
– (see text)
Testability (unit/integration testing)
– (see text)
Ease of use/development
There are many ways to integrate Flink with Kubernetes, and all of them are non-trivial. Hopefully an effort to make an official one will come to fruition soon.
Flink can run in HA mode, using ZooKeeper or Kubernetes. For ksqlDB and Kafka Streams, just spin up multiple pods and let Kafka consumer groups do the work for you.
Flink code is no less readable than Kafka Streams code. However, setting up a Flink cluster took longer than setting up a Kafka Streams cluster.