Technical · February 8, 2026 · 12 min read

Real-Time Data Processing Architecture: Operational Intelligence with Stream Processing

Designing millisecond-level operational data processing infrastructure with Apache Kafka, Apache Flink, and Lambda/Kappa architectures.

ASTO TECH Engineering Team

# Real-Time Data Processing Architecture

What Is Real-Time Data Processing and Why Does It Matter?

Real-time data processing is the practice of ingesting, analyzing, and acting on data at the moment it is generated — or within a latency window measured in milliseconds to a few seconds. Unlike traditional batch processing, where data accumulates over a period and is processed periodically, real-time systems operate on continuous, unbounded data streams.

The business case is straightforward: the value of a decision decays with time. Fraud detection that fires after a transaction settles is operationally worthless. A product recommendation served three days after a browsing session is irrelevant. Kleppmann (2017) frames this as a fundamental shift: modern data systems must treat data not as static tables to be queried, but as continuously flowing streams to be processed in motion.

The proliferation of IoT devices, mobile applications, clickstream sources, and financial market feeds has pushed data production rates far beyond what nightly batch jobs can practically handle. According to Marz & Warren (2015), a scalable real-time architecture must satisfy three properties: robustness (fault tolerance), low latency (sub-second responses), and scalability (horizontal scale-out without re-architecture).

Key application domains include financial fraud detection, e-commerce personalization engines, telecommunications network monitoring, predictive maintenance in manufacturing, and patient monitoring in healthcare settings.

What Is Apache Kafka and How Does It Work?

Apache Kafka is a distributed event streaming platform originally built at LinkedIn and open-sourced in 2011. Kreps et al. (2011) describe its defining characteristic: unlike traditional message queues that delete messages upon consumption, Kafka operates as a distributed commit log — messages are retained on disk for a configurable retention period regardless of whether they have been consumed.

The core architectural components are:

Broker: Each server in a Kafka cluster is a broker. Brokers receive messages from producers, store them on disk, and serve them to consumers. A cluster of three or more brokers provides fault tolerance through replication.

Topic: A named, logical channel for messages. Topics are divided into partitions for horizontal scalability. Each partition is an ordered, immutable sequence of records.

Partition: The unit of parallelism and ordering in Kafka. Records within a partition maintain strict ordering; across partitions, no ordering guarantee exists. Partitions are distributed across brokers, enabling both load balancing and fault tolerance via partition leadership and ISR (In-Sync Replicas).

Producer: A client application that publishes records to topics. The partitioner function determines which partition receives a given record — typically a hash of the record key, enabling related events (e.g., all events for a given user ID) to land in the same partition.
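The key-hash routing can be sketched in a few lines. Note the hedges: Kafka's Java client uses murmur2 hashing, while this dependency-free sketch substitutes `md5`, and `partition_for` is an illustrative name, not a real client API.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Illustrative key-based partitioner. Kafka's Java client uses
    murmur2; md5 here is a stand-in to keep the sketch self-contained."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events carrying the same key (e.g. a user ID) land in the same
# partition, so their relative order is preserved.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
```

The design consequence is worth noting: changing the partition count changes the key-to-partition mapping, which is why partition counts are usually planned up front.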

Consumer Group: A set of consumers that collectively read a topic. Each partition is assigned to exactly one consumer within the group, providing parallelism while preserving per-partition order. Different consumer groups each maintain independent offsets, allowing multiple downstream systems to read the same topic independently.
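The one-partition-per-consumer rule can be simulated with a round-robin-style assignment (a simplified sketch; Kafka's actual assignors — range, round-robin, sticky — are configurable and more involved):

```python
def assign_partitions(consumers: list[str], partitions: list[int]) -> dict[str, list[int]]:
    """Sketch of a round-robin assignment: each partition goes to exactly
    one consumer in the group, so per-partition order is preserved."""
    members = sorted(consumers)
    assignment: dict[str, list[int]] = {c: [] for c in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment

# Two consumers share four partitions; no partition has two readers.
assert assign_partitions(["c1", "c2"], [0, 1, 2, 3]) == {"c1": [0, 2], "c2": [1, 3]}
```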

KRaft Mode: Kafka 3.x introduced KRaft (Kafka Raft Metadata Mode), eliminating the ZooKeeper dependency and simplifying cluster operations. As of Kafka 4.0, ZooKeeper mode is removed entirely.

Kafka's throughput is achieved through sequential disk I/O (far faster than random access), zero-copy data transfer via the sendfile system call, and batched message compression using lz4, snappy, or zstd.
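Batching and compression are controlled through standard producer configuration. The property names below are real Kafka producer configs; the values are illustrative assumptions to tune per workload, not recommendations:

```properties
# Illustrative producer settings — tune values for your workload
compression.type=zstd   # batch-level compression: lz4, snappy, or zstd
linger.ms=10            # wait up to 10 ms to accumulate larger batches
batch.size=65536        # maximum batch size in bytes per partition
acks=all                # require all in-sync replicas to acknowledge
```

Raising `linger.ms` and `batch.size` trades a little latency for better compression ratios and fewer requests per broker.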

What Is the Difference Between Stream Processing and Batch Processing?

Batch processing accumulates data over a time window and processes it as a finite dataset. The MapReduce paradigm exemplifies this: data rests in HDFS; a job runs nightly; results appear in morning reports. Batch systems excel at throughput, simplicity of failure recovery (re-run the job), and cost efficiency. Their fundamental limitation is latency — results are hours behind reality.

Stream processing treats data as an unbounded sequence of events and processes each event (or small micro-batch) as it arrives. Latency drops to milliseconds or seconds. Akidau et al. (2015) identify three fundamental challenges in stream processing: handling out-of-order events caused by network delays, reconciling event time (when the event occurred) with processing time (when the system received it), and guaranteeing exactly-once semantics despite failures.
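The event-time versus processing-time distinction can be made concrete with a tumbling-window sketch (plain Python as an illustration, not a framework API):

```python
from collections import defaultdict

def tumbling_window(event_time_ms: int, size_ms: int) -> int:
    """Start of the tumbling window that contains this event time."""
    return event_time_ms - (event_time_ms % size_ms)

# (event_time, value) pairs in arrival order — "c" arrives late
arrivals = [(1_000, "a"), (5_500, "b"), (2_400, "c")]

by_event_time = defaultdict(list)
for ts, v in arrivals:
    by_event_time[tumbling_window(ts, 5_000)].append(v)

# Windowing on event time groups "a" and "c" together even though "c"
# arrived after "b"; windowing on arrival time would split them.
assert dict(by_event_time) == {0: ["a", "c"], 5_000: ["b"]}
```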

Carbone et al. (2015) present Apache Flink as a unified engine: batch is modeled as a finite stream, so a single API and runtime handle both workloads without code duplication.

| Property | Batch | Stream |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Data model | Finite, bounded | Infinite, unbounded |
| Failure recovery | Re-run job | Checkpoint + replay |
| Cost | Lower | Medium to high |
| Use cases | Reporting, ETL | Alerting, recommendations, monitoring |

What Are Lambda and Kappa Architectures?

Lambda Architecture, formalized by Marz & Warren (2015), addresses the tension between batch accuracy and stream latency by running both in parallel:

  1. Batch Layer: Stores all raw data immutably (typically in HDFS or object storage) and recomputes complete batch views periodically. High accuracy, high latency.
  2. Speed Layer: Processes only recent data, producing low-latency incremental views that compensate for the batch layer's lag. Apache Storm or Spark Streaming are typical choices.
  3. Serving Layer: Merges batch and speed views to answer queries. Apache Druid and Cassandra are common here.

The critical weakness of Lambda is code duplication: the same business logic must be implemented twice, in two different systems, with two different APIs. Operational overhead is substantial.

Kappa Architecture, proposed by Jay Kreps in 2014, eliminates the batch layer entirely. The premise: everything is a stream. Batch reprocessing is simply replaying a Kafka topic from the beginning through a stream processing job. A single codebase serves both real-time and historical processing. Kleppmann (2017) notes that Kappa's viability depends heavily on Kafka's retention capacity — storing years of raw events is feasible with tiered storage (Kafka 2.8+) but requires cost planning.
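The "everything is a stream" premise can be sketched as a single fold that serves both reprocessing and live traffic (a minimal illustration; `running_totals` is a hypothetical job, not a Kafka Streams or Flink API):

```python
def running_totals(events):
    """One stream job: fold an unbounded event stream into per-key totals.
    In Kappa, the same code handles history and live data."""
    totals: dict[str, int] = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
        yield dict(totals)

history = [("eu", 10), ("us", 5), ("eu", 7)]  # replayed from offset 0
live = [("us", 3)]                            # new events as they arrive

# Reprocessing = replaying the topic through the identical job, then
# continuing with the live tail; no separate batch codebase needed.
views = list(running_totals(history + live))
assert views[-1] == {"eu": 17, "us": 8}
```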

Choosing between the two depends on organizational maturity, existing infrastructure, and latency requirements. Kappa is operationally simpler; Lambda offers an easier migration path for teams with existing batch infrastructure.

How Is Real-Time Processing Applied in Operational Intelligence?

Operational Intelligence (OI) uses real-time analytics to monitor, understand, and improve running business processes. Unlike traditional BI, which produces historical reports, OI enables immediate intervention — detecting and responding to anomalies while they are still occurring.

A production-grade OI architecture includes:

Ingestion Layer: Source systems (ERP, CRM, IoT sensors, web logs) emit events via Kafka Producer APIs. Schema management through Confluent Schema Registry with Apache Avro ensures producer-consumer compatibility and prevents schema drift.

Stream Processing Layer: Flink or Kafka Streams jobs implement business logic: filtering irrelevant events, computing windowed aggregations (tumbling, sliding, session windows), joining streams (e.g., enriching transaction events with customer profiles), and running anomaly detection using statistical thresholds or embedded ML models.
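The enrichment join mentioned above can be sketched as a stream-table join: transactions flow past a lookup table of customer profiles. Field names like `customer_id` and `tier` are hypothetical, and a real job would read both sides from Kafka topics rather than in-memory structures:

```python
profiles = {"c1": {"tier": "gold"}, "c2": {"tier": "basic"}}  # table side

def enrich(transactions, profiles):
    """Stream-table join sketch: attach the customer profile to each
    transaction as it flows past; unknown customers pass through unenriched."""
    for txn in transactions:
        yield {**txn, **profiles.get(txn["customer_id"], {})}

txns = [{"customer_id": "c1", "amount": 250}]
enriched = list(enrich(txns, profiles))
assert enriched[0]["tier"] == "gold"
```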

Serving Layer: Processed results are written to low-latency stores — Apache Druid or InfluxDB for time-series queries, Redis or Apache Cassandra for key-value lookups.

Visualization Layer: Real-time dashboards via Grafana or custom UI components use WebSocket or Server-Sent Events (SSE) to push updates without polling.

The watermark mechanism described by Akidau et al. (2015) is essential here: it allows the system to make progress on windowed computations even when some events arrive late, by defining a threshold beyond which late arrivals are either incorporated via allowed lateness or routed to a side output for separate handling.
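A minimal sketch of that mechanism, assuming a single window and a fixed out-of-orderness bound (real engines such as Flink generalize this across many concurrent windows):

```python
def run_with_watermark(events, window_end_ms, max_out_of_order_ms):
    """Watermark = max event time seen minus an out-of-orderness bound.
    The window fires once the watermark passes its end; events for the
    window that arrive after that go to a side output."""
    on_time, side_output = [], []
    watermark = float("-inf")
    fired = False
    for ts, value in events:
        watermark = max(watermark, ts - max_out_of_order_ms)
        if ts <= window_end_ms:
            (side_output if fired else on_time).append(value)
        # events beyond window_end_ms belong to later windows (ignored here)
        if watermark >= window_end_ms:
            fired = True
    return on_time, side_output

events = [(900, "a"), (1_400, "b"), (700, "late")]  # window: [0, 1000)
on_time, late = run_with_watermark(events, 1_000, 200)
assert on_time == ["a"] and late == ["late"]
```

The event at 1,400 ms advances the watermark past the window end, firing the window; the straggler at 700 ms then lands in the side output instead of the closed window.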

References

  • Kreps, J., Narkhede, N., & Rao, J. (2011). *Kafka: a Distributed Messaging System for Log Processing*. NetDB Workshop at VLDB.
  • Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., & Tzoumas, K. (2015). *Apache Flink: Stream and Batch Processing in a Single Engine*. IEEE Data Engineering Bulletin, 38(4), 28–38.
  • Marz, N., & Warren, J. (2015). *Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*. Manning Publications.
  • Kleppmann, M. (2017). *Designing Data-Intensive Applications*. O'Reilly Media.
  • Akidau, T., et al. (2015). *The Dataflow Model*. Proceedings of the VLDB Endowment, 8(12), 1792–1803.

Frequently Asked Questions

What is the minimum infrastructure for real-time processing? For modest workloads, a single-node Kafka setup with Kafka Streams suffices. For production systems processing over 100,000 messages per second, a minimum three-broker Kafka cluster with separate stream processing nodes is recommended. Managed services like Confluent Cloud or AWS Kinesis Data Streams reduce operational overhead significantly.

When should I choose Kappa over Lambda Architecture? Choose Kappa when your team has no existing batch infrastructure to migrate and you want a single codebase. Choose Lambda when you need to integrate with an existing Hadoop-based data warehouse or when your stream processing framework lacks the maturity to handle full historical reprocessing reliably.

What does exactly-once semantics mean in practice? It guarantees that each event is processed and its effect committed to downstream systems exactly once, even if the processing node fails and recovers. Kafka achieves this through idempotent producers and transactional APIs. Apache Flink implements it through distributed checkpoints with two-phase commit to external sinks.
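The idempotent-producer half of that guarantee can be sketched with sequence-number deduplication (a simplified model of the idea, not Kafka's actual protocol, which tracks sequences per producer, partition, and epoch):

```python
def deliver_exactly_once(records, sink, seen):
    """Idempotence sketch: each record carries a (producer_id, sequence)
    pair; duplicates redelivered after a retry are detected and skipped,
    so the sink observes each record's effect exactly once."""
    for producer_id, seq, payload in records:
        if (producer_id, seq) in seen:
            continue  # duplicate from a producer retry
        seen.add((producer_id, seq))
        sink.append(payload)

sink, seen = [], set()
batch = [("p1", 0, "debit"), ("p1", 1, "credit"), ("p1", 1, "credit")]
deliver_exactly_once(batch, sink, seen)
assert sink == ["debit", "credit"]  # the retried record is applied once
```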

How do watermarks handle late-arriving events? A watermark is a timestamp assertion: "all events with time T or earlier have arrived." The stream processor uses watermarks to trigger window computations. Events arriving after the watermark can be handled via an allowed lateness window (included in updated results) or a side output (routed for separate processing). The watermark advance rate is tunable — aggressive advancement reduces latency but increases data loss risk for late events.