Keep It Atomic: The Secret to a Robust Data Platform


A core principle when building a robust data platform is atomicity, or atomic actions. This principle requires that a data producer generate datasets in an atomic, isolated manner, ensuring consistency from the consumer’s perspective.

What this means in practice:

  • If a producer completes its task successfully, the dataset is complete and safe to consume.
  • The data remains isolated until the work is fully completed, preventing consumers from being exposed to partial or bad data.
  • If a producer terminates unexpectedly, consumers are not exposed to any of its data.
  • A new instance of the producer can retry the operation without interfering with any partial data left by the previous run.
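For a single file on one filesystem, these properties can be sketched with the classic write-to-temp-then-rename pattern. This is a minimal illustration, not the versioning approach described later; the function name and JSON-lines format are my own choices for the example.

```python
import json
import os
import tempfile

def write_dataset_atomically(records, final_path):
    """Write records to a temp file, then atomically rename into place.

    Consumers only ever see `final_path` either fully written or absent,
    never partial. A producer that crashes mid-write leaves only a stray
    temp file, which a retrying instance can safely ignore or clean up.
    """
    directory = os.path.dirname(final_path) or "."
    # Write to a temp file in the SAME directory, so the final rename
    # stays within one filesystem (os.replace is atomic only then).
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are durable before publishing
        os.replace(tmp_path, final_path)  # the atomic "commit" step
    except BaseException:
        os.unlink(tmp_path)  # a failed run exposes no visible data
        raise
```

This pattern covers one file on one store; the rest of the post is about what to do when a dataset spans many files, tables, and technologies, where no single rename can act as the commit.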

There are various ways to achieve atomicity. Traditional databases leverage ACID principles (Atomicity, Consistency, Isolation, Durability) using locking, transactions, and ledgers. However, as we move towards distributed computing with various data stores like data lakes, document stores, KV stores, and metrics stores, achieving atomicity requires different methods.

Additionally, a single producer might create multiple datasets across different technologies, requiring all of them to maintain consistency. You can use locking or coordination mechanisms like Zookeeper, Curator, Eureka, etcd, or Consul for this purpose. However, challenges often arise related to lock management or handling multiple runs. Other coordination tools, such as Airflow, can’t fully ensure consistency across data stores or guarantee that consumers won’t be exposed to partial or dirty data.

I believe each dataset should be self-contained, regardless of its format, how many files it spans, or the tools and environments used to process it. The solution must be robust to producer failures and retries.

A different approach that has worked well for me combines versioning with data discovery services:

  • Each record carries a version assigned by the producer instance. All records produced by that instance share the same version. Different producer instances get distinct versions, potentially resulting in different files, tables, etc.
  • These records and their versions can be stored across various locations and databases.
  • Consumers can’t access this data until the producer registers the version in a data discovery service. All consumers must rely on this service to locate data, ensuring they only access approved dataset versions.
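The protocol above can be sketched in a few lines. The `DataDiscoveryService` class here is a toy in-memory stand-in, and the `register`/`latest` API is my own invention for illustration; a real deployment would use a durable, shared service.

```python
import uuid

class DataDiscoveryService:
    """Toy in-memory stand-in for a data discovery service.

    Tracks, per dataset, the latest version a producer has published.
    Consumers must resolve versions through this service.
    """
    def __init__(self):
        self._approved = {}  # dataset name -> latest approved version

    def register(self, dataset, version):
        self._approved[dataset] = version

    def latest(self, dataset):
        return self._approved.get(dataset)

def produce(storage, discovery, dataset, records):
    """Write records tagged with a unique version, then publish it."""
    version = uuid.uuid4().hex  # unique per producer instance
    # Versioned writes may span many files, tables, or stores; here a
    # dict keyed by (dataset, version) stands in for all of them.
    storage[(dataset, version)] = [dict(r, _version=version) for r in records]
    # Only after every write has succeeded is the version registered,
    # making it visible to consumers. This is the commit point.
    discovery.register(dataset, version)
    return version

def consume(storage, discovery, dataset):
    """Read only the version the discovery service has approved."""
    version = discovery.latest(dataset)
    if version is None:
        return []  # nothing published yet -- no partial data leaks
    return storage[(dataset, version)]
```

Note where the commit point sits: a producer that dies before `register` leaves versioned data in storage, but since no consumer ever looks up that version, the data is invisible, and a retry simply writes under a fresh version.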

Done! With this approach, you’ve achieved the necessary isolation and consistency, creating a data protocol adaptable to any data processing framework and data store you may use in the future.

If you want more content like this, press ‘Like’!
