Data Duplication Dilemma: The Hidden Tradeoffs To Manage And Overcome


Mathias Verraes pinpointed two of the biggest challenges in distributed systems: guaranteed message ordering and exactly-once delivery.

When a producer sends messages to another service or queue, arrival order can’t be guaranteed: race conditions and environmental factors (network issues, load, I/O) affect both the producer and the consumer (the observer, as defined by Leslie Lamport).

Moreover, it’s difficult for a producer to guarantee that a message is sent only once. Why? Consider a scenario where Service A and Service B need to coordinate state asynchronously (consensus). Service A sends a “State1” message to Service B. Service B receives the “State1” message, stores it, and sends back a “State1 Committed” message to Service A. Simple, right? But what if the “State1 Committed” message never arrives at Service A? Should Service A resend the “State1” message? That could lead Service B to accept the same message multiple times (more-than-once delivery). Or should Service A wait indefinitely for a response, risking that Service B crashed and never stored the “State1” message at all (less-than-once delivery)? Variations of this challenge appear in the Byzantine Generals Problem and in N-phase commit protocols.
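To make the dilemma concrete, here is a minimal sketch of the retry problem. The names and the in-process "services" are made up for illustration; there is no real network here, only the logic: when the acknowledgement is lost, the producer's only safe move is to resend, and the consumer ends up storing the same message twice.

```python
import uuid

stored_messages = []          # Service B's state
ACK_IS_LOST = True            # simulate the "State1 Committed" ack going missing

def service_b_receive(message):
    stored_messages.append(message)           # B stores the message...
    return None if ACK_IS_LOST else "ack"     # ...but the ack may never arrive

def service_a_send_with_retry(payload, retries=2):
    message = {"id": str(uuid.uuid4()), "payload": payload}
    for _ in range(retries):
        ack = service_b_receive(message)      # send "State1"
        if ack == "ack":
            return                            # happy path: exactly one delivery
        # No ack: A cannot tell whether B crashed or only the ack was lost,
        # so resending means B may store the same message twice.

service_a_send_with_retry("State1")
print(len(stored_messages))   # 2 -> more-than-once delivery
```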

Guaranteeing exactly-once delivery is hard and often requires a shared store or ledger that both the producer and consumer can access. Another approach is to use queues with proper partitioning, so that all messages with the same ID are directed to the same partition and a designated process can read each partition sequentially and recognize which messages have already arrived. But this raises other questions: What is the time window during which duplicate messages might arrive? Could it be days? And what are the costs and resources involved in operating this solution?
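As a rough illustration of the partitioning idea, hashing the message ID to pick a partition guarantees that the original send and any retry land in the same place, so a single sequential reader per partition can spot the duplicate. The partition count and helper below are assumptions for the sketch, not any specific queueing product's API.

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(message_id: str) -> int:
    # Hash the message ID so every copy of the same message maps to one partition.
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Both the original send and the retry route to the same partition.
assert partition_for("order-42") == partition_for("order-42")
print(partition_for("order-42"), partition_for("order-43"))
```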

Consider this: there’s a tension between data freshness and deduplication (data quality). It’s much easier to deduplicate data when the dataset is processed infrequently, say once a day. But the higher the requirement for data freshness, the harder deduplication becomes. Ask yourself: do I really need this data to be that fresh? And what is the historical window for this operation: will I deduplicate the last few days, or the last few weeks?
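A small illustration of why low freshness makes deduplication cheap: with a whole day's batch in hand (the records below are invented), dropping duplicates by message ID is a single pass over the data.

```python
daily_batch = [
    {"id": "m1", "value": 10},
    {"id": "m2", "value": 20},
    {"id": "m1", "value": 10},   # duplicate produced by a retry
]

def dedupe_batch(records):
    # Keep the first occurrence of each message ID.
    seen, unique = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique

print(dedupe_batch(daily_batch))   # the duplicate "m1" is dropped
```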

Keep in mind that some applications may tolerate duplicates more than others. For purposes like analytics, data-driven decisions, or AI, a reasonable rate of duplication might be acceptable. If deduplication is necessary and there isn’t a high requirement for data freshness, consider making it the consumer’s responsibility, not the producer’s. This is why more-than-once delivery is my default policy for data producers.

For services where high accuracy and freshness are required, use a store that logs the received message IDs (choose a reasonable time window). Every new message observed by the consumer should be checked against this store. Ensure that the producer follows a more-than-once delivery policy and that the message ID or key does not change if the producer retries sending a message.
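Here is a minimal sketch of that consumer-side check. An in-memory dictionary stands in for the seen-ID store purely to keep the example self-contained; in practice this would be an external, shared store (a database table or a key-value store with expiry), and the window length below is an arbitrary assumption you would tune to how long your producers may keep retrying.

```python
import time

DEDUP_WINDOW_SECONDS = 7 * 24 * 3600   # assumed window; tune to your producers

seen_ids: dict[str, float] = {}        # message ID -> time it was first seen

def process_once(message_id: str, handler) -> bool:
    now = time.time()
    # Drop entries that have aged out of the deduplication window.
    for mid, seen_at in list(seen_ids.items()):
        if now - seen_at > DEDUP_WINDOW_SECONDS:
            del seen_ids[mid]
    if message_id in seen_ids:
        return False                   # duplicate: skip, but still ack upstream
    seen_ids[message_id] = now
    handler()                          # first time: run the business logic
    return True

process_once("evt-1", lambda: print("handled evt-1"))
process_once("evt-1", lambda: print("handled evt-1"))   # ignored as a duplicate
```

The same message ID must be reused on every retry by the producer, otherwise the store cannot recognize the resend as a duplicate.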

If you want more content like this, press ‘Like’!
