Stream Processing With Apache Flink
Chris Shumaker's notes on the book by Fabian Hueske & Vasiliki Kalavri
Updated October 14th, 2024
The Apache Flink project is well known for its low-latency stream processing capabilities. But what do we know beyond the documentation? Plenty of documentation is dedicated to deploying and operating clusters and building applications, but today services like AWS Kinesis Data Analytics hide significant complexity, making it easy to hit the limits of the framework before you understand it. I am reading "Stream Processing with Apache Flink" to get a better idea of what it means to really own a Flink application in production.
Introduction
The opening chapters provide great context for why one might use Flink. While they do not explicitly enumerate and compare alternate streaming frameworks, the authors do a great job highlighting some key concerns of stream processing, namely dealing with state, scale, and performance. To the uninitiated, these might seem unnecessary, but anyone who has built a data pipeline with just AWS Lambda and owned it for a few months or years knows that these concerns should not be afterthoughts.
Use Cases
The authors identify three main use cases: event-driven applications, data pipelines, and streaming analytics. Regarding event-driven applications, the authors compare using Flink to an alternate architecture using microservices. In a microservices architecture, you may have several representations of state that can become inconsistent, or multiple methods of inter-process communication (messaging, APIs, gRPC vs. REST, etc.) that can get in the way of correctness and expressiveness. There are also enormous performance gains in having a framework that safely manages and protects your application state while keeping massive amounts of event data in memory, rather than transferring it over multiple web APIs and database connections on every record.
Alternatives
I want to talk about alternatives more explicitly. The book calls out the Lambda architecture (the batch/speed-layer pattern, not AWS Lambda), batch processing, and prior generations of stream processing as alternatives to using Flink. Documentation on Apache Flink has similar callouts. However, I think many data engineers today will find that these comparisons are not very helpful.
Personally, I have had great success pairing AWS Kinesis with AWS Lambda functions for near-realtime event processing, and equivalent pairings exist on other platforms. So let me identify a few major differences between that architecture and stream processing with Flink that will hopefully go beyond the obvious.
What Flink Provides Compared to AWS Kinesis + Lambda
Windowing
Using multiple streams (which necessitates windowing)
SQL queries on stream tables
Different scaling mechanisms
Automatic optimization - Flink can dedicate all available compute to runnable operators while I/O-bound tasks wait. With Lambda this is DIY: most people either let invocations block or introduce multiple Lambdas with additional streams, cost, and complexity.
Packaging and deployment
Open source flexibility
Options for message ordering guarantees (Lambda is DIY)
Choice of event-time versus processing-time semantics (Lambda is just processing time)
Flink's community, practices, books, and resources center on stream processing as a discipline, while Kinesis/Lambda knowledge tends to center on each service separately
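To make the event-time point concrete, here is a minimal Python sketch (not Flink's API; the event data and window size are invented for illustration). It assigns records to tumbling windows by when they happened, so a record that arrives out of order still lands in the right window. A processing-time system like a plain Lambda consumer would instead bucket it by arrival time.

```python
from collections import defaultdict

# Hypothetical events: (event_time_seconds, value). The record "b" arrives
# late and out of order -- common when producers buffer or retry.
events = [(0, "a"), (12, "c"), (5, "b"), (21, "d")]

WINDOW = 10  # 10-second tumbling windows

def event_time_windows(events):
    """Assign each record to a window by when it HAPPENED, not when it arrived."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts // WINDOW * WINDOW
        windows[window_start].append(value)
    return dict(windows)

# Event-time windowing puts the late record "b" back in window [0, 10),
# where it belongs. A processing-time consumer would have placed it in
# whatever window was open when it arrived.
print(event_time_windows(events))  # {0: ['a', 'b'], 10: ['c'], 20: ['d']}
```

Flink adds watermarks on top of this idea to decide when a window is complete despite late data; that machinery is exactly what is DIY in a Lambda consumer.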
What about checkpointing?
As a colleague of mine pointed out, Kinesis manages read positions for its consumers. This covers many of the failure and redeployment cases within the Lambda consumers.
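The read-position idea can be sketched in a few lines of Python. This is a toy, not the Kinesis or Flink implementation: a consumer durably commits its position after each batch, so a restart resumes where the last commit left off.

```python
class OffsetTrackingConsumer:
    """Toy consumer that commits its read position after each batch,
    loosely mimicking Kinesis shard iterators (or the source offsets
    that Flink records in its checkpoints)."""

    def __init__(self, records):
        self.records = records
        self.committed = 0  # last durably stored position

    def poll(self, batch_size):
        """Read the next batch starting at the committed position."""
        return self.records[self.committed:self.committed + batch_size]

    def commit(self, n):
        """Durably advance the position past n processed records."""
        self.committed += n

consumer = OffsetTrackingConsumer(["e1", "e2", "e3", "e4"])
batch = consumer.poll(2)        # read e1, e2
consumer.commit(len(batch))     # position durably advanced to 2
# If the worker crashes here and restarts, it resumes at the committed
# offset -- the failure/update case my colleague described.
assert consumer.poll(2) == ["e3", "e4"]
```

What this does not give you is Flink's coordinated snapshotting of operator state alongside offsets, which is what makes exactly-once results possible for stateful pipelines.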
Fundamentals
Chapter 2 describes "dataflow programming" and how to represent an application as a directed acyclic graph. A dataflow DAG can be logical or physical. Examples of these graphs are available in the Flink documentation.
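A logical dataflow DAG is just operators as nodes and data flow as edges. A minimal sketch, with an invented four-operator pipeline: the acyclicity is what guarantees a valid execution order exists, which Kahn's algorithm finds.

```python
# A toy logical dataflow DAG: operators as nodes, edges as data flow.
dag = {
    "source": ["filter"],
    "filter": ["count"],
    "count":  ["sink"],
    "sink":   [],
}

def topological_order(dag):
    """Kahn's algorithm: a valid execution order exists iff the graph is acyclic."""
    indegree = {node: 0 for node in dag}
    for edges in dag.values():
        for target in edges:
            indegree[target] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for target in dag[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    return order

assert topological_order(dag) == ["source", "filter", "count", "sink"]
```

The physical DAG is this same graph after Flink expands each operator into parallel task instances and decides how data moves between them.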
Parallelism
An interesting distinction is made between task-parallelism and data-parallelism. Flink allows resources to be distributed across partitions of the stream (data) and across various tasks (task). To achieve task parallelism with Kinesis and Lambda, one needs to invoke one Lambda from another or introduce an intermediate stream. This can fragment the logic of your application, make refactoring harder, and introduce more overhead to manage additional infrastructure. A Lambda/Kinesis architecture can, however, achieve some data parallelism if the stream is sharded or by configuring the Parallelization Factor introduced to Lambda in 2019. This is still limited to 2,000 concurrent Lambda invocations (200 shard limit * 10 parallel invocation limit). The limit for Flink is much higher and does not require explicit configuration.
Data Exchange
It is important to understand how data can propagate through a distributed system. This doesn't just apply to stream processing; it's true in batch systems like Spark too. Hueske & Kalavri identify forward, broadcast, key-based, and random strategies. We often don't have to think about these, as they are managed by our engine. However, sometimes we can improve performance by controlling them manually. For example, very small tables can be "broadcast" in Spark to all executors so that no shuffles are required. Paying to replicate a small amount of data up front in order to avoid a large exchange across nodes is a worthwhile trade.
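The four exchange strategies reduce to four routing functions: given a record and the number of downstream tasks, which task indices receive it? A minimal sketch (illustrative only; Flink's actual partitioners live in its runtime):

```python
import random

def forward(record, task_index, num_tasks):
    """Forward: record stays on the corresponding local task; no shuffle."""
    return [task_index]

def broadcast(record, task_index, num_tasks):
    """Broadcast: every downstream task receives a copy of the record."""
    return list(range(num_tasks))

def key_based(record, task_index, num_tasks, key_fn=lambda r: r):
    """Key-based: hash the key so records with equal keys always land
    on the same task -- this is what makes keyed state possible."""
    return [hash(key_fn(record)) % num_tasks]

def random_exchange(record, task_index, num_tasks):
    """Random: route to a uniformly chosen task for load balancing."""
    return [random.randrange(num_tasks)]

# Equal keys always route to the same task; broadcast fans out to all.
assert key_based(7, 0, 4) == key_based(7, 3, 4)
assert broadcast("config", 0, 4) == [0, 1, 2, 3]
```

The Spark broadcast-join example from above is the `broadcast` function applied to a small table: you pay num_tasks copies up front and never shuffle the big side.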