After three decades of building scalable data systems, I’ve identified the essential fields that every data record should include. I firmly believe in self-contained data—a dataset that doesn’t rely on file format, transport technology, or the environment to guarantee its quality and manageability. To put it simply: if all my files are merged into one, or pushed into a stream like Kafka, will I lose operational context because I depend on file names, creation dates, or file existence for processing?
To address this, I’ve pinpointed a few key fields that every record should have, making it easier to manage the operational aspects of data processing.
1. Version ID
Every record should include a version ID defined by the data producer. This version remains the same for all records generated in the same run, while different runs will have different version IDs. Even in distributed processing, all output records should carry the same version ID.
In streaming mode, the version ID may stay the same for all records generated by a single instance of the producer, until it restarts or based on other operational logic.
Typically, a monotonically increasing version ID, such as a Unix timestamp with millisecond precision, works well.
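To make this concrete, here is a minimal sketch of stamping every record in a run with a single version ID. The record shape and the field name version_id are illustrative assumptions, not a prescribed schema:

```python
import time

def new_version_id() -> int:
    """Monotonically increasing version ID: a Unix timestamp in milliseconds."""
    return int(time.time() * 1000)

def stamp_version(records, version_id):
    """Attach the same version_id to every record produced in this run."""
    for record in records:
        record["version_id"] = version_id
        yield record

# One version ID is generated at the start of the run and reused for
# every record the run emits, even across distributed workers.
version_id = new_version_id()
output = list(stamp_version([{"user": "a"}, {"user": "b"}], version_id))
```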
2. Run ID
The run ID can be the same as the version ID, but it becomes especially useful when supporting recovery from failures or handling incremental runs. In these cases, the dataset may have the same version ID, but the run ID will differ to track reprocessing efforts. Ideally, the run ID should also increase monotonically.
Both the version ID and run ID should be logged in your centralized logging service, allowing you to trace the data back to the exact process that generated it.
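As an illustration, a retry of the same logical run might keep its version ID but mint a fresh run ID, and both get logged so the output can be traced back to the producing process. The field names and logging setup below are assumptions for the sketch:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("producer")

def new_run_id() -> int:
    """Monotonically increasing run ID, also a millisecond timestamp."""
    return int(time.time() * 1000)

def run_job(version_id: int, records):
    """One (re)run: the version_id is shared across retries, the run_id is per attempt."""
    run_id = new_run_id()
    log.info("starting run: version_id=%s run_id=%s", version_id, run_id)
    for record in records:
        record["version_id"] = version_id
        record["run_id"] = run_id
        yield record
```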
3. Event Timestamp (Event_ts)
This timestamp reflects when the event occurred, whether triggered by a user, system entity, or API. For example, it might represent when a user clicked a button, a transaction was made, or an app emitted an event. This is often the most relevant timestamp for data analysis as it captures business interactions. The timestamp should be stored in UTC in a timezone-aware format.
4. Observer Timestamp (Observer_ts)
The observer timestamp captures when the service or producer observed (processed) the event or record. If the data is processed in batches, you can apply the same timestamp to all records in a batch. Like the event timestamp, it should be stored in UTC in a timezone-aware format.
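Putting the four fields together, a single record might look like the sketch below. datetime.now(timezone.utc) yields a timezone-aware UTC timestamp for the observer side, while event_ts is simply whatever the upstream event carried; the field names are illustrative:

```python
from datetime import datetime, timezone

def build_record(payload: dict, event_ts: datetime,
                 version_id: int, run_id: int) -> dict:
    """Wrap a business payload with the four operational fields.

    event_ts comes from the event itself (user click, transaction, ...);
    observer_ts is when this producer observed the record, in UTC.
    """
    return {
        **payload,
        "version_id": version_id,
        "run_id": run_id,
        "event_ts": event_ts.isoformat(),                       # UTC, tz-aware
        "observer_ts": datetime.now(timezone.utc).isoformat(),  # UTC, tz-aware
    }
```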
Why These Fields Matter
If you encounter two almost identical records with differences only in these four fields, you can quickly diagnose what happened and take appropriate action:
- Different version IDs: Choose the record with the higher version ID, as this indicates a rerun or recovery from failure.
- Same version ID but different run IDs: The higher run ID suggests a problem with the earlier run, and this record should be selected. If you notice excessive duplication, there may be issues with workers emitting the same records or problems with checkpointing during recovery.
By reviewing your centralized logging service, you can trace back these version and run IDs to pinpoint any failed jobs.
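Here is a minimal sketch of how a consumer could apply the two rules above, assuming each record is a dictionary carrying a business key plus the version_id and run_id fields: for each key, keep the record with the highest (version_id, run_id) pair.

```python
def resolve_duplicates(records, key_field: str):
    """Per business key, keep the record with the highest (version_id, run_id)."""
    chosen = {}
    for record in records:
        key = record[key_field]
        candidate = (record["version_id"], record["run_id"])
        current = chosen.get(key)
        if current is None or candidate > (current["version_id"], current["run_id"]):
            chosen[key] = record
    return list(chosen.values())
```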
As I mentioned in my previous post about atomicity, consumers should avoid accessing incomplete or “dirty” data produced by erroneous runs. Using approved version IDs to determine consumable data reinforces this principle.
The relationship between Observer_ts and Event_ts provides valuable insights. For instance, data often arrives out of order, and your producer might encounter records with an Event_ts that occurred days ago. By comparing Observer_ts and Event_ts, you can identify these trends and optimize your processing and storage resources over time. Additionally, if you notice many old Event_ts values paired with new Observer_ts values, this indicates reprocessing of old data, which is yet another valuable operational insight.
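As a rough illustration, the lag between Observer_ts and Event_ts can be computed per record and then summarized: a long tail of multi-day lags points at late-arriving data, while a burst of very old event timestamps paired with fresh observer timestamps usually means reprocessing. The ISO-8601 string format here is an assumption carried over from the earlier sketches:

```python
from datetime import datetime

def lag_seconds(record: dict) -> float:
    """Seconds between when the event happened and when the producer observed it."""
    event_ts = datetime.fromisoformat(record["event_ts"])
    observer_ts = datetime.fromisoformat(record["observer_ts"])
    return (observer_ts - event_ts).total_seconds()

def flag_late_records(records, threshold_days: float = 1.0):
    """Return records whose event_ts is more than threshold_days older than observer_ts."""
    limit = threshold_days * 86400
    return [r for r in records if lag_seconds(r) > limit]
```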
If you want more content like this, press ‘Like’!
