I’ve seen many discussions about stream processing versus batch processing, and about how the data ecosystem has evolved over the years with technologies like Hadoop, Storm, Spark, Kafka, Snowflake, Databricks, and others. However, debating which approach is better without context isn’t very productive. The key question to ask is: which matters more for your needs, data freshness or accuracy?
A fundamental assumption in distributed systems is that consumers (observers) cannot rely on events arriving in the order in which they occurred. So if a consumer prioritizes freshness and makes decisions based on individual events or small samples of events, accuracy cannot be ensured unless state is preserved somewhere, which is a challenge in itself. If the sample is representative of the entire dataset, decisions can still be directionally correct. Freshness is most valuable when action must be taken on single events, or for real-time monitoring.
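To make the ordering problem concrete, here is a minimal sketch in Python. It assumes each event carries a producer-assigned sequence number; the `Event` type, its fields, and both function names are hypothetical and chosen only for illustration. The first function trusts whichever event arrived last, while the second compares each event against preserved state and ignores stale ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str    # entity the event describes (hypothetical field)
    seq: int    # producer-assigned sequence number (an assumption of this sketch)
    value: int


def latest_by_arrival(events):
    """Naive view: whichever event arrived last wins, regardless of its sequence."""
    latest = {}
    for event in events:
        latest[event.key] = event.value
    return latest


def latest_by_sequence(events):
    """Stateful view: keep the highest-sequence event per key and skip stale ones."""
    state = {}
    for event in events:
        seen = state.get(event.key)
        if seen is None or event.seq > seen.seq:
            state[event.key] = event
    return {key: event.value for key, event in state.items()}


# Events produced in order 1, 2, 3 but delivered as 1, 3, 2.
arrived = [
    Event("cart-1", 1, 10),
    Event("cart-1", 3, 30),
    Event("cart-1", 2, 20),
]

naive = latest_by_arrival(arrived)
stateful = latest_by_sequence(arrived)
```

With this delivery order the naive view ends on the stale value, while the stateful view ends on the value with the highest sequence number. The price is that the per-key state has to live somewhere and survive restarts.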
In many cases, data accuracy is more critical than real-time freshness, especially when decisions rest on metrics such as the number of unique visitors per day, conversions per hour, or A/B test results. These questions pertain to bounded data: answering them correctly requires assuming that most or all of the relevant data has been received (completeness).
In my experience, for most analytical questions and data processing, accuracy and completeness matter more than freshness. Batch processing scales readily with growing data volumes, and its freshness can be improved to within several minutes without changing your system architecture. This is one reason why Spark (micro-batch processing) has prevailed over Storm (stream processing): it’s simply easier to design, operate, and troubleshoot solutions when data has clear boundaries.
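A small sketch of why shrinking the batch interval does not have to change the design: the same aggregation runs over hourly or five-minute batches, and only a parameter differs. Timestamps are assumed to be integer epoch seconds, and `count_per_batch` is a hypothetical helper, not an API from Spark or any other framework.

```python
from collections import Counter


def count_per_batch(event_times, batch_seconds):
    """Count events per fixed-size batch, keyed by the batch start time.

    event_times: integer epoch seconds (an assumption of this sketch).
    batch_seconds: batch length; smaller values give fresher results.
    """
    counts = Counter(t - t % batch_seconds for t in event_times)
    return dict(sorted(counts.items()))


event_times = [0, 30, 299, 300, 3599, 3600]

hourly = count_per_batch(event_times, batch_seconds=3600)
five_minute = count_per_batch(event_times, batch_seconds=300)
```

Each batch still has a clear start and end, which is what keeps the job easy to rerun and to debug.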
It’s also worth mentioning that Kafka, as an event streaming platform, has been instrumental in decoupling data generation considerations from data consumption and processing requirements.
