Data Lake-First Strategy: Practical Guidelines

In recent years, I’ve been a strong advocate for a data lake-first approach. For me, this means treating the data lake as a long-lived, organized data platform that enables the storage, processing, and analysis of large volumes of data while managing the lifecycle of information from unstructured to structured forms. This platform comes with robust tooling for cataloging, storing, processing, and retrieving information.

A data lake is essential for organizations as it centralizes diverse data, providing a single source of truth for analytics and operations. It enables easy access to raw and processed data, empowering data-driven decisions, fostering innovation, and supporting scalability.

Given its relatively low cost and rich ecosystem, I encourage my organization to store almost everything there: logs, business events, imagery data, binary data, artifacts, and database dumps. Importantly, I view the data lake not solely as a hub for analytics but also as a critical component for operational needs, including business continuity, observability, and root cause analysis. To achieve these goals, the data must be well-organized into areas such as bronze, silver, and gold — or as I prefer to call them: operative, explorative, and trusted areas. These zones should also support versioning for improved traceability and isolation.

While there is extensive information available on the data lake-first approach, I want to highlight several key principles that guide my team in designing systems aligned with this methodology:

1. Track Everything

Store both application and business events. Use a centralized event streaming platform, such as Kafka, to funnel these events into an S3-based data lake. This ensures a comprehensive and consistent view of activity.
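As a rough sketch of what "funneling events" can mean in practice, the helper below builds a time-partitioned object key for each event before it lands in S3. The `event_s3_key` function, the field names, and the `events/` prefix are illustrative assumptions, not a fixed convention:

```python
import uuid
from datetime import datetime, timezone

def event_s3_key(event: dict, prefix: str = "events") -> str:
    """Build a date/hour-partitioned object key for an event.

    Partitioning by occurrence time keeps the lake queryable by
    engines like Athena or Spark without full scans.
    """
    ts = datetime.fromisoformat(event["occurred_at"]).astimezone(timezone.utc)
    return (
        f"{prefix}/source={event['source']}"
        f"/date={ts:%Y-%m-%d}/hour={ts:%H}"
        f"/{event['event_id']}.json"
    )

event = {
    "event_id": str(uuid.uuid4()),  # hypothetical envelope fields
    "source": "checkout-service",
    "type": "order.created",
    "occurred_at": "2024-05-01T13:45:00+00:00",
    "payload": {"order_id": 42},
}
key = event_s3_key(event)
# e.g. events/source=checkout-service/date=2024-05-01/hour=13/<uuid>.json
```

A Kafka sink connector would typically apply a partitioning scheme like this on the consumer side, so both application and business events land in a layout that downstream query engines understand.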

2. Database Snapshots and Change Data Capture (CDC)

Applicative databases that serve services or applications should have their data periodically dumped into the data lake via snapshots or streamed continuously using CDC mechanisms. A hybrid of these approaches often provides the most comprehensive coverage.
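The hybrid approach boils down to one merge rule: start from a point-in-time snapshot and replay subsequent change events on top of it. A minimal sketch, assuming CDC events with Debezium-style opcodes (`c` create, `u` update, `d` delete); the `apply_cdc` helper and the event shape are illustrative:

```python
from typing import Iterable

def apply_cdc(snapshot: dict, changes: Iterable[dict]) -> dict:
    """Replay a stream of CDC events on top of a point-in-time snapshot."""
    state = dict(snapshot)
    for change in changes:
        key = change["key"]
        if change["op"] == "d":
            state.pop(key, None)          # delete removes the row
        else:                             # 'c' and 'u' both upsert
            state[key] = change["after"]  # keep the latest row image
    return state

snapshot = {1: {"status": "pending"}, 2: {"status": "paid"}}
changes = [
    {"op": "u", "key": 1, "after": {"status": "paid"}},
    {"op": "d", "key": 2, "after": None},
    {"op": "c", "key": 3, "after": {"status": "pending"}},
]
current = apply_cdc(snapshot, changes)
# {1: {'status': 'paid'}, 3: {'status': 'pending'}}
```

Snapshots bound how far back a rebuild has to replay, while the CDC stream keeps the lake fresh between snapshots.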

3. Operative and Materialized Tables

Distinguish between operative datasets that support CRUD operations (with an emphasis on Create, Update, Delete) and materialized datasets optimized for read operations. Operative tables may require ACID compliance and serve live applications, while materialized datasets in the data lake act as read-only representations. A single database can house both types but should maintain clear separation and purpose.
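One way to picture the separation: the operative store accepts CRUD traffic, while consumers receive only an immutable, read-only view of it. A minimal sketch using Python's `MappingProxyType` as a stand-in for a materialized read-only dataset; the `OperativeOrders` class is hypothetical:

```python
from types import MappingProxyType

class OperativeOrders:
    """Mutable, CRUD-oriented store backing a live application."""

    def __init__(self) -> None:
        self._rows: dict[int, dict] = {}

    def upsert(self, order_id: int, row: dict) -> None:
        self._rows[order_id] = row

    def delete(self, order_id: int) -> None:
        self._rows.pop(order_id, None)

    def materialize(self):
        """Publish a read-only snapshot for analytics consumers."""
        return MappingProxyType(dict(self._rows))

orders = OperativeOrders()
orders.upsert(1, {"total": 99.0})
view = orders.materialize()
# view[1] reads fine; any attempt to assign into view raises TypeError
```

The point is the contract, not the mechanism: writers own the operative table, and everything downstream consumes a representation that cannot drift through accidental mutation.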

4. Ephemeral Databases

Databases and tables outside of operative use cases should be ephemeral, primarily serving performance-related needs. This minimizes the overhead of long-term storage and focuses resources on critical data.
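Ephemerality is easy to enforce when every non-operative table carries a creation time and a TTL. A small sketch of the sweep a scheduled cleanup job might run; the `expired_tables` function and table names are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

def expired_tables(tables: dict[str, datetime],
                   ttl: timedelta,
                   now: datetime) -> list[str]:
    """Return names of ephemeral tables whose TTL has elapsed.

    A scheduled job can safely drop these: durable history lives
    in the data lake, so nothing irreplaceable is lost.
    """
    return [name for name, created in tables.items() if now - created > ttl]

now = datetime(2024, 5, 2, tzinfo=timezone.utc)
tables = {
    "tmp_join_cache": datetime(2024, 4, 1, tzinfo=timezone.utc),
    "tmp_daily_agg": datetime(2024, 5, 1, tzinfo=timezone.utc),
}
to_drop = expired_tables(tables, timedelta(days=7), now)
# ['tmp_join_cache']
```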

5. Publish-Subscribe Interface

The data lake should adhere to a publish-subscribe model. This requires datasets to be registered in a catalog with clearly defined schemas. Metadata should include details such as processing timestamps, high-water marks for data freshness, and the versions available for consumption.
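A catalog entry can be as simple as a record that producers publish and consumers look up. A minimal sketch with an in-memory registry; the `DatasetEntry` fields mirror the metadata listed above, and all names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Catalog record a consumer can subscribe to."""
    name: str
    schema: dict[str, str]    # column name -> type
    high_water_mark: str      # latest fully processed partition
    processed_at: str         # ISO timestamp of the last run
    versions: list[str] = field(default_factory=list)

catalog: dict[str, DatasetEntry] = {}

def publish(entry: DatasetEntry) -> None:
    catalog[entry.name] = entry

def subscribe(name: str) -> DatasetEntry:
    return catalog[name]  # raises KeyError if the dataset isn't registered

publish(DatasetEntry(
    name="orders_gold",
    schema={"order_id": "bigint", "total": "decimal(10,2)"},
    high_water_mark="date=2024-05-01",
    processed_at="2024-05-02T01:00:00Z",
    versions=["v1", "v2"],
))
subscribe("orders_gold").high_water_mark  # 'date=2024-05-01'
```

In production this role is played by a real catalog (Glue, Hive Metastore, or similar); the sketch just shows the contract: nothing is consumable until it is registered with a schema and freshness metadata.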

6. Atomic and Reproducible Writes

Data writes into the lake must be atomic and reproducible. This ensures data integrity and consistency, enabling seamless recovery and downstream consumption.
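A common pattern for both properties is stage-then-rename: write to a temporary file, then atomically swap it into a deterministic, partition-keyed path so reruns overwrite the same object. A local-filesystem sketch (object stores achieve the same effect with multipart commits or table formats like Iceberg/Delta); the `atomic_write` helper is illustrative:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write(root: Path, partition: str, rows: list[dict]) -> Path:
    """Write a partition atomically and reproducibly.

    os.replace is atomic on POSIX, so readers see either the old file
    or the complete new one, never a partial write. The deterministic,
    partition-keyed target path makes reruns idempotent.
    """
    target = root / f"{partition}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=target.parent)  # stage in same dir
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp, target)  # atomic swap into place
    return target

root = Path(tempfile.mkdtemp())
path = atomic_write(root, "date=2024-05-01", [{"order_id": 1}])
# reprocessing the same partition lands on the same path with the same content
assert atomic_write(root, "date=2024-05-01", [{"order_id": 1}]) == path
```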
