On-Prem Lakehouse with Apache Iceberg
Oct 9, 2024

Project Type: Enterprise Data Engineering / On-Prem Lakehouse
Project Year: May 2024 – Feb 2025
Client Region: EMEA
Industry: Retail
Context & Objectives
A large retail enterprise operates multiple stores across the country, alongside an e-commerce platform and warehouse inventory systems. Their data ecosystem was fragmented: sales data lived in CSV exports from POS systems, clickstream data streamed continuously as JSON logs from their website, and inventory data resided in relational databases.
This fragmentation caused serious challenges for analytics and reporting teams:
Reconciling data across stores and systems was error-prone due to inconsistent naming conventions and duplicate entries.
Schema evolution issues meant that adding new columns often broke downstream pipelines.
Combining batch and streaming data for analytics was difficult.
Project Goal:
To design and implement a fully on-prem Lakehouse leveraging Apache Iceberg, providing a single source of truth across batch and streaming data, with:
ACID guarantees for reliable analytics
Support for schema evolution and time travel
Real-time and batch ingestion pipelines
Governed, auditable, and queryable data
Challenges
The client faced a combination of technical and operational hurdles:
Inconsistent & Fragmented Data:
Data across POS systems, e-commerce clickstreams, and warehouse databases used different formats and naming conventions, causing duplicates and errors when reconciling data.
Rigid Pipelines & Schema Limitations:
Adding new columns or evolving schemas often broke existing ETL pipelines, making it hard to adapt to changing business needs.
Difficulty Integrating Batch and Streaming Data:
Historical sales data (batch) and real-time website events (streaming) could not be easily combined, resulting in slow access to clean, queryable data for analytics.
Constraints & Non-Functional Requirements
The solution had to satisfy enterprise-grade requirements:
1. Data Reliability
The platform needed to ensure ACID guarantees (Atomicity, Consistency, Isolation, and Durability) for all transactions. This prevents duplicates, ensures consistent analytics, and allows analysts to trust the data.
2. Performance & Scalability
The Lakehouse had to support fast queries across billions of records while scaling with the business. As the number of stores, users, and clickstream events grows, the system must maintain low-latency access for analysts and dashboards.
3. Security & Governance
The solution needed strong access controls, metadata versioning, and audit trails. Row-level and table-level permissions ensure that only authorized users can view sensitive data. Metadata and versioning allow tracking of schema changes and historical snapshots, while audit trails provide compliance and traceability.
Solution & Implementation
To address the client’s challenges, we implemented a multi-layered On-Prem Lakehouse with Apache Iceberg, organized using a Medallion Architecture (Bronze → Silver → Gold).
Phase 1: Data Ingestion
Data sources include POS systems (daily CSV exports), clickstream logs (JSON) capturing real-time user interactions, and inventory databases (MySQL/PostgreSQL) holding warehouse stock levels.
Ingestion Pipelines:
Batch Ingestion: Python scripts or Apache NiFi pipelines collect daily CSV files from stores and load them into the Bronze Layer on HDFS. This ensures all historical sales data is captured reliably.
Streaming Ingestion: Apache Kafka captures website events in real time, streaming each click, search, or transaction directly into the Lakehouse. Millisecond-level delivery latency makes fresh data available almost immediately.
Change Data Capture (CDC): Debezium connectors monitor the inventory databases for inserts, updates, and deletes and push these changes into Iceberg tables in near real time.
Outcome:
All raw data is ingested into the Bronze Layer on HDFS, forming a centralized, reliable foundation for downstream processing and transformation.
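To make the batch path above concrete, the following is a minimal PySpark sketch of a daily POS load into the Bronze layer. The catalog name (`lakehouse`), HDFS paths, and table names are illustrative assumptions, not the client's actual configuration.

```python
# Minimal sketch: daily POS CSV drop -> Bronze Iceberg table.
# Catalog name, paths, and columns are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("pos-bronze-ingestion")
    # Iceberg catalog backed by the Hive Metastore (assumed name: lakehouse)
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hive")
    .getOrCreate()
)

# Read the day's CSV exports from an assumed HDFS landing zone.
pos_raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///landing/pos/2024-10-09/*.csv")
)

# Tag each record with its source file and load time for traceability, then
# append to the Bronze Iceberg table (assumed to have been created up front).
(
    pos_raw
    .withColumn("_source_file", F.input_file_name())
    .withColumn("_ingested_at", F.current_timestamp())
    .writeTo("lakehouse.bronze.pos_sales")
    .append()
)
```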
Phase 2: Lakehouse Storage Layer
Storage Components:
HDFS (Hadoop Distributed File System): Provides distributed, fault-tolerant storage for all raw data files, ensuring high availability and durability across the on-premise cluster.
Parquet Format: Converts raw CSV/JSON files into a columnar storage format, optimizing query performance by reading only relevant columns instead of entire datasets.
Apache Iceberg Tables: Adds a structured table layer with rich metadata on top of Parquet files, enabling advanced features:
ACID Transactions: Ensures consistent and reliable data, preventing duplicates and maintaining integrity across batch and streaming updates.
Schema Evolution: Allows adding, renaming, or removing columns without rewriting historical data, supporting changing business requirements.
Time Travel: Enables queries on historical snapshots, providing insights into past states of data for auditing or analysis.
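A short Spark SQL sketch of these features, run against the same assumed lakehouse catalog (database, table, and column names are hypothetical, and the time-travel syntax assumes a recent Spark/Iceberg version):

```python
# Create a partitioned Iceberg table; Parquet is the default file format,
# made explicit here for clarity. Names are illustrative only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.silver.sales (
        sale_id    BIGINT,
        store_id   INT,
        product_id INT,
        amount     DECIMAL(10, 2),
        sale_ts    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (store_id, days(sale_ts))
    TBLPROPERTIES ('write.format.default' = 'parquet')
""")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lakehouse.silver.sales ADD COLUMN channel STRING")

# Time travel: query the table as it looked at an earlier point in time.
spark.sql("""
    SELECT * FROM lakehouse.silver.sales TIMESTAMP AS OF '2024-10-01 00:00:00'
""").show()
```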
Phase 3: Data Processing & Modeling
Once raw data from POS systems, clickstream logs, and inventory databases lands in the Bronze Layer, it is still scattered, duplicated, and inconsistent. To transform this data into analytics-ready tables, we leveraged Apache Iceberg as the foundation for all ETL/ELT processes, providing ACID guarantees, schema evolution, and time-travel capabilities.
Batch Processing with Apache Spark + Iceberg Connector:
Apache Spark reads raw Iceberg tables in the Bronze Layer and performs large-scale batch transformations. Key operations include:
Data Cleaning: Filling missing values and standardizing formats across diverse sources.
Deduplication: Removing repeated sales transactions to ensure accurate analytics.
Joins and Enrichment: Merging POS, inventory, and clickstream datasets to create a comprehensive view of the business.
Transformed data is written to Silver tables holding cleaned and enriched data, then further aggregated into business-ready Gold tables. Using Iceberg ensures transactional consistency and preserves historical schema versions, so downstream analytics remain reliable even as data evolves.
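A condensed PySpark sketch of this Bronze → Silver step is shown below; table and column names are assumptions, and the MERGE INTO upsert assumes Iceberg's Spark SQL extensions are enabled.

```python
# Sketch: clean, deduplicate, enrich, and upsert Bronze sales into Silver.
from pyspark.sql import functions as F

bronze_sales = spark.table("lakehouse.bronze.pos_sales")
inventory    = spark.table("lakehouse.bronze.inventory")

silver_sales = (
    bronze_sales
    # Data cleaning: standardize store codes and drop rows missing key fields.
    .withColumn("store_code", F.upper(F.trim(F.col("store_code"))))
    .dropna(subset=["sale_id", "store_code", "product_id"])
    # Deduplication: keep a single row per sale.
    .dropDuplicates(["sale_id"])
    # Enrichment: attach the current warehouse stock level for each product.
    .join(inventory.select("product_id", "stock_level"), "product_id", "left")
)

# Upsert into the Silver table so reruns stay idempotent (transactional MERGE).
silver_sales.createOrReplaceTempView("staged_sales")
spark.sql("""
    MERGE INTO lakehouse.silver.sales_enriched t
    USING staged_sales s
    ON t.sale_id = s.sale_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```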
Streaming Processing with Apache Flink + Iceberg Sink:
Real-time clickstream events are ingested via Flink and written directly into Iceberg tables. Streaming transformations include:
Filtering invalid events
Aggregating click events
Incremental updates to Gold tables for near-real-time dashboards
Iceberg’s table format ensures that streaming updates integrate seamlessly with batch data, maintaining ACID compliance and efficient metadata handling. This allows historical and live data to coexist in a single, queryable Lakehouse.
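A minimal PyFlink sketch of this streaming path is below; it assumes the Flink Kafka and Iceberg connector jars are on the classpath, and the topic, catalog, and field names are illustrative.

```python
# Sketch: filter raw clickstream events from Kafka and append them
# continuously to an Iceberg table via Flink SQL.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka source table over the raw JSON clickstream topic (assumed names).
t_env.execute_sql("""
    CREATE TABLE clickstream_src (
        user_id    STRING,
        event_type STRING,
        url        STRING,
        event_ts   TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clickstream',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Iceberg catalog backed by the Hive Metastore (assumed URI and warehouse path).
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hive',
        'uri' = 'thrift://hive-metastore:9083',
        'warehouse' = 'hdfs:///warehouse/iceberg'
    )
""")

# Drop invalid events and submit a continuous INSERT job into the Bronze table.
t_env.execute_sql("""
    INSERT INTO lakehouse.bronze.clickstream_events
    SELECT user_id, event_type, url, event_ts
    FROM clickstream_src
    WHERE user_id IS NOT NULL AND event_type <> 'bot'
""")
```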
Outcome:
By combining batch and streaming ETL on Iceberg tables, the Lakehouse delivers a unified, high-quality dataset. Historical and real-time data are available simultaneously in Silver and Gold tables, providing analysts and business teams with trusted, consistent, and near-real-time insights for dashboards, reporting, and operational decision-making.
Phase 4: Serving & Analytics
Once the transformed and enriched data resided in Silver and Gold Iceberg tables, we enabled the client to consume it through interactive queries and dashboards. Using Apache Iceberg as the backbone, we ensured that all tables were structured, queryable, and consistent, supporting both batch and streaming data for reliable analytics.
Query Engines:
We configured distributed SQL engines like Trino and Presto to connect directly to Iceberg tables, allowing the client’s analysts to perform ad-hoc and complex queries efficiently. Iceberg’s metadata and snapshot management ensured accurate, performant queries, even as real-time clickstream events updated the underlying tables.
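As an illustration, an ad-hoc query through the Trino Python client against a Gold table might look like the sketch below; the coordinator host, catalog, schema, and table names are assumptions.

```python
# Sketch: ad-hoc daily-revenue query against the Gold layer via Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.internal",   # hypothetical on-prem Trino coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",       # Iceberg connector configured in Trino
    schema="gold",
)

cur = conn.cursor()
cur.execute("""
    SELECT store_id,
           date_trunc('day', sale_ts) AS sale_day,
           sum(amount)                AS daily_revenue
    FROM daily_sales
    GROUP BY 1, 2
    ORDER BY sale_day DESC, daily_revenue DESC
    LIMIT 20
""")

for store_id, sale_day, revenue in cur.fetchall():
    print(store_id, sale_day, revenue)
```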
BI & Dashboards:
We connected BI tools such as Superset, Power BI, and Tableau to Trino/Presto and delivered dashboards tailored to the client’s operational needs:
Daily Sales Dashboards: Visualized sales per store, region, and product category.
Product Performance Rankings: Identified top-selling and underperforming items across stores and online channels.
Real-Time Website Metrics: Monitored user behavior, session activity, and clickstream patterns for near-real-time insights.
Outcome:
With our Lakehouse solution, the client could run complex queries without worrying about raw file locations or inconsistent formats. We provided a single source of truth, enabling fast, reliable, and interactive analytics that supported both operational decision-making and strategic planning.
Phase 5: Governance & Metadata Management
To ensure that the client’s Lakehouse was secure, auditable, and enterprise-ready, we implemented a robust governance and metadata management layer leveraging Apache Iceberg’s capabilities alongside industry-standard tools.
Catalog & Versioning:
We used Apache Hive Metastore and Apache Nessie to create a centralized catalog for all Iceberg tables. This enabled the client to track table schemas, partitions, and historical snapshots, supporting schema evolution and time-travel queries. As a result, the client could reliably access both current and historical data without inconsistencies.
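The sketch below shows how a Spark session can be pointed at a Nessie-backed Iceberg catalog so that every table change is committed to a versioned branch; the endpoint, branch, and warehouse path are assumptions.

```python
# Sketch: Spark session with a Nessie-backed Iceberg catalog (assumed URI/branch).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nessie-catalog-example")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie.internal:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")  # branch to read from / write to
    .config("spark.sql.catalog.nessie.warehouse", "hdfs:///warehouse/iceberg")
    .getOrCreate()
)

# Tables are addressed as nessie.<database>.<table>; every write becomes a
# commit on the 'main' branch, giving an auditable history of changes.
spark.sql("SHOW TABLES IN nessie.silver").show()
```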
Security & Lineage:
Using Apache Ranger and Apache Atlas, we enforced row-level and table-level access controls, ensuring that sensitive data was only accessible to authorized users. Atlas provided complete data lineage, allowing the client to trace any anomaly or error from the raw source all the way to Gold tables and dashboards.
Implementation Highlights
ACID Compliance: We ensured that all Iceberg tables remained deduplicated and transactionally consistent, providing the client with reliable and accurate analytics across batch and streaming data.
Schema Evolution & Time Travel: Our solution allowed the client to add new columns or modify schemas without breaking existing pipelines, while time-travel queries enabled them to analyze historical snapshots of data whenever needed.
Hybrid Batch & Streaming Support: By combining Spark for batch processing and Flink for real-time streams, we unified historical and live data, giving analysts a single, consistent view of the business.
Performance & Cost Optimization: Using Parquet columnar format, partitioning by store and date, and selective caching, we optimized both query performance and resource usage on the client’s on-premise clusters.
Security & Governance: The Lakehouse was fully on-prem, with audit logs, row-level access controls (RLS), and a central catalog to ensure data security, compliance, and traceability across all layers.
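Partition layout and query latency were kept healthy with Iceberg's built-in Spark maintenance procedures; a minimal sketch is below (catalog and table names are illustrative, and the CALL syntax assumes Iceberg's Spark SQL extensions are enabled). Snapshot expiration also bounds how far back time travel can reach, so the retention window is a deliberate trade-off.

```python
# Sketch: routine Iceberg table maintenance to sustain query performance.

# Compact small data files produced by frequent streaming commits.
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'silver.sales_enriched'
    )
""")

# Expire snapshots older than the agreed retention window to bound metadata
# and storage growth (this also limits how far back time travel can reach).
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'silver.sales_enriched',
        older_than => TIMESTAMP '2024-09-01 00:00:00'
    )
""")
```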
Deliverables
The project delivered a complete on-prem Lakehouse with Bronze, Silver, and Gold layers built on Apache Iceberg tables, providing structured and reliable data for analytics.
Automated batch and streaming ingestion pipelines were implemented to handle POS, clickstream, and inventory data, while ETL/ELT transformations using Apache Spark and Flink ensured data cleaning, deduplication, and enrichment.
For analytics, distributed query engines such as Trino and Presto were configured to provide interactive and ad-hoc access directly on Iceberg tables.
Business intelligence dashboards in Superset, Power BI, and Tableau gave the client actionable insights, including daily sales, product performance, and real-time website metrics.
A robust governance and catalog setup with Hive/Nessie, Ranger, and Atlas ensured centralized metadata management, access controls, and full data lineage.
Outcomes & Business Impact
The Lakehouse provided the client with a single source of truth, unifying sales, clickstream, and inventory data under a fully governed and reliable platform.
Analysts were able to generate reports in minutes instead of hours or days, significantly accelerating time-to-insight. With ACID guarantees and schema evolution, duplicate and inconsistent data were eliminated, ensuring analytics were accurate and trustworthy.
Real-time clickstream data offered near-instant visibility for marketing and product teams, enabling faster operational decisions and more targeted strategies.
The solution was designed to be scalable and secure, supporting the addition of new stores, users, and streaming data without requiring a platform re-architecture. By reducing manual data reconciliation, the Lakehouse also improved operational efficiency, allowing data teams to focus on deriving actionable insights rather than managing data inconsistencies.
Tech Stack
Storage Layer:
HDFS, Parquet, Apache Iceberg, MinIO
Batch Processing:
Apache Spark, Python, Iceberg Connector
Streaming Processing:
Apache Flink, Kafka
Ingestion:
Python, Apache NiFi, Sqoop, Debezium, Kafka
Query & Serving:
Trino, Presto, Superset, Power BI, Tableau
Governance & Metadata Management:
Hive Metastore, Apache Nessie, Apache Ranger, Apache Atlas
Security:
Row-Level Security (RLS) / Attribute-Based Access Control (ABAC), internal LDAP/SSO
DevOps & Monitoring:
Git-based code management, automated ETL jobs, monitoring with logs and metrics
Target Architecture
Data Sources:
POS systems providing CSV exports
Clickstream logs from the e-commerce website in JSON format
Inventory databases in relational systems (MySQL/Postgres)
Ingestion & Orchestration:
Real-time streaming via Kafka for clickstream events
Batch ingestion using Apache NiFi, Sqoop, and Python scripts for CSV and database extracts
Lakehouse Storage:
Raw data stored in HDFS
Structured into Apache Iceberg tables following a Bronze → Silver → Gold medallion architecture
Supports ACID transactions, schema evolution, and time-travel queries
Processing Layer:
Apache Spark for batch transformations
Apache Flink for real-time streaming events
Unified dataset combining historical and live data
Serving & Analytics:
Distributed SQL queries using Trino and Presto
Dashboards built in Superset, Power BI, and Tableau
Governance & Security:
Centralized catalog with Hive Metastore and Apache Nessie
Ranger and Atlas for metadata management, access control, and data lineage
RLS/ABAC policies ensure secure, role-based access
Conclusion: Enabling a Reliable On-Prem Data Platform
Implementing an on-prem Lakehouse with Apache Iceberg enabled the client to transform a fragmented and inconsistent data landscape into a structured, queryable, and auditable platform. Analysts gained fast access to clean and reliable data, while business teams could leverage near-real-time insights for operational and strategic decisions. The solution provides a scalable, governed environment capable of supporting future growth in stores, users, and streaming data, all within the client’s on-prem infrastructure, ensuring security, compliance, and full control over their data.