On-Prem Lakehouse with Apache Iceberg & Spark (Batch Processing Focus)
Oct 15, 2024

Project Type: Enterprise Data Engineering (Batch-Only Implementation)
Project Duration: May 2024 – Oct 2024
Context and objectives
The client is a mid-sized healthcare provider network operating multiple hospitals, clinics, and diagnostic centers across several regions. Patient data is captured through Electronic Health Records (EHR) systems, lab test results are stored in specialized laboratory databases, and pharmacy and supplier information comes from third-party vendors. Over time, these systems developed in isolation, with each department maintaining its own scripts, spreadsheets, and reporting workflows.
This fragmented data environment created several challenges:
Inconsistent data: patient records and clinical metrics diverged across systems, producing duplicate entries and manual reconciliation work.
Slow reporting: daily operational and clinical reports took 1–2 days to refresh.
Conflicting KPIs: key performance indicators such as patient wait times, treatment outcomes, and inventory levels were calculated differently across departments, reducing trust in reports.
Degrading performance: rapidly increasing patient visit and lab transaction volumes slowed queries and caused system timeouts.
Fragile pipelines: ETL jobs were brittle and often broke when underlying schemas changed.
Constrained infrastructure: cloud adoption was restricted by strict compliance, privacy, and budget constraints.
Overall, the data ecosystem was slow, inconsistent, and difficult to manage, making it challenging to deliver timely, reliable insights for clinical decision-making, operational efficiency, and regulatory reporting.
Project goals:
The healthcare provider aimed to implement an on-premises Lakehouse platform, a unified system that combines the scalability and flexibility of a data lake with the reliability and performance of a data warehouse. By leveraging open-source technologies such as Apache Spark for large-scale data processing, Apache Iceberg for table management and versioning, Airflow for workflow orchestration, PostgreSQL for metadata and reference data, and MinIO for secure object storage, the project set out to achieve the following goals:
Centralize and unify fragmented healthcare data: Consolidate patient records, lab results, pharmacy transactions, and supplier data from multiple disparate systems into a single, queryable Lakehouse.
Automate ETL and ELT pipelines: Minimize manual intervention, ensure data consistency, and provide timely updates for operational, clinical, and regulatory reporting.
Maintain a secure, on-premises infrastructure: Ensure compliance with healthcare regulations (HIPAA, GDPR), while allowing the platform to scale as patient and transaction volumes grow.
Enable analytics-ready, high-performance data access: Provide clinicians, analysts, and administrators with fast, reliable insights for clinical decision-making, operational efficiency, inventory management, and strategic planning.
By achieving these goals, the project aimed to transform a fragmented, slow, and error-prone data ecosystem into a robust, unified, and governed Lakehouse, empowering the healthcare organization to make faster, more accurate, and actionable decisions.
Solution overview
We implemented a batch-oriented on-premises Lakehouse to unify and streamline the healthcare provider’s fragmented data systems. The Lakehouse combines the scalability and flexibility of a data lake with the structure and reliability of a data warehouse, optimized for batch processing.
Data Processing:
We leveraged Apache Spark to execute ETL pipelines that ingest raw data from EHR systems, laboratory databases, pharmacy records, and supplier files. The pipelines perform data cleaning, deduplication, validation, and aggregation, then write the results into Apache Iceberg tables. Iceberg ensures ACID compliance, supports schema evolution, and enables time-travel queries, allowing analysts to track historical patient and operational data reliably.
Storage & Orchestration:
MinIO serves as the scalable, fault-tolerant on-premises object storage layer for raw and processed data. Apache Airflow orchestrates all ETL jobs, managing dependencies across the Bronze → Silver → Gold medallion architecture.
Metadata & Analytics:
Table metadata and schema details for Iceberg are tracked via PostgreSQL, ensuring governance and consistency. Analysts and clinical teams access the Lakehouse through Trino, enabling high-performance ad-hoc queries and powering dashboards in Power BI for operational monitoring, clinical insights, and regulatory reporting.
Tech stack
Data Sources:
Electronic Health Record (EHR) systems (CSV/JSON exports), Laboratory Information Systems (LIS) databases, and pharmacy/supplier master data files (Excel/CSV) from third-party vendors.
Ingestion & Orchestration:
Apache NiFi pipelines for landing raw source data; Apache Spark batch ETL jobs for data transformations, scheduled and managed via Apache Airflow.
Storage / Lakehouse:
MinIO for scalable, fault-tolerant on-premises object storage.
Apache Iceberg tables for structured, versioned storage with ACID compliance, schema evolution, and time-travel capabilities.
PostgreSQL used as the Iceberg catalog to manage table metadata, schema versions, and access controls.
Transformation & Modeling:
ETL transformations built using Python + PySpark, following the Bronze → Silver → Gold medallion architecture.
Query & Serving:
Trino for distributed SQL queries across Iceberg tables.
Dashboards and ad-hoc analytics delivered via Power BI for clinical, operational, and administrative reporting.
Governance & Security:
Role-based access controls enforced at the Iceberg table level.
Catalog-level permissions managed via PostgreSQL.
Observability & Quality:
Airflow provides detailed logs, retries, and alerting.
Reconciliation tables and data validation checks implemented at each medallion layer to ensure data integrity and consistency.
DevOps & CI/CD:
ETL scripts version-controlled in Git.
Automated Airflow DAG deployments for reproducible, maintainable PySpark jobs.
Cost & Performance Optimization:
Columnar Parquet storage for efficient reads and writes.
Partition pruning and optimized batch processing for large healthcare datasets.
Implementation details
Data Ingestion
The first step in building the Lakehouse was ingesting data from all the disparate healthcare systems into a central location in its raw form. For this, we used Apache NiFi, a powerful data ingestion and workflow automation tool, along with MinIO for on-premises object storage. NiFi pipelines were designed to extract data from multiple sources and ensure it landed in a structured folder format on MinIO for easy processing downstream.
The ingestion process handled different types of healthcare data:
EHR and patient visit data: Daily exports from hospital EHR systems, typically in CSV or JSON format, were pulled automatically from secure FTP servers.
Laboratory data: Lab results and diagnostic test information were fetched from the Laboratory Information Systems (LIS) databases using MySQL queries.
Pharmacy and supplier master data: Supplier and medication reference files, often provided in Excel or CSV format, were uploaded manually into the system.
Once extracted, NiFi stored the raw files in MinIO, maintaining a clear, time-based folder structure to support organization and easy access for downstream processing.
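A small helper illustrates the kind of time-based key layout the NiFi flows wrote to; the bucket prefix and source-system names here are illustrative placeholders, not the client's actual layout:

```python
from datetime import date

def bronze_object_key(source: str, file_name: str, load_date: date) -> str:
    """Build a date-partitioned MinIO object key for a raw landing file.

    The "bronze" prefix and source names are assumptions for illustration.
    """
    return (
        f"bronze/{source}/"
        f"year={load_date.year}/month={load_date.month:02d}/day={load_date.day:02d}/"
        f"{file_name}"
    )

# Example: where a daily EHR export would land
key = bronze_object_key("ehr", "patient_visits.csv", date(2024, 10, 15))
# e.g. "bronze/ehr/year=2024/month=10/day=15/patient_visits.csv"
```

A layout like this lets downstream Spark jobs read exactly one day's files and makes reprocessing a given date trivial.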
Spark Configuration for Iceberg
Once the raw data landed in MinIO, the next step was to set up Apache Spark to read, process, and write data into Iceberg tables. Iceberg provides a modern table format that supports ACID transactions, schema evolution, and time-travel queries, which is crucial for maintaining reliable healthcare data.
We configured Spark to interact with Iceberg tables stored on MinIO and tracked metadata through PostgreSQL, which acts as the Iceberg catalog. This setup allowed Spark to create, query, and update tables directly in the Lakehouse while maintaining full transactional integrity.
Key aspects of the configuration:
Spark Session Setup: Spark was initialized with custom configurations to connect to the Iceberg catalog in PostgreSQL, and to read/write data from MinIO.
Catalog Definition: The Iceberg catalog was defined as a JDBC-backed catalog, storing table metadata in PostgreSQL. This ensures centralized schema management, allowing tables to evolve without breaking downstream pipelines.
Warehouse Location: A designated MinIO bucket was set as the data warehouse location, where all Iceberg tables reside.
MinIO Connection: Spark was configured with MinIO endpoint, access keys, and path-style access, enabling seamless interaction with on-prem object storage.
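Put together, the configuration looked roughly like the sketch below. The catalog name ("lake"), warehouse bucket, and credential placeholders are assumptions for illustration, not the production values; the property keys themselves are standard Iceberg/Spark settings for a JDBC-backed catalog over S3-compatible storage:

```python
def iceberg_spark_conf(
    pg_jdbc_url: str,
    minio_endpoint: str,
    warehouse: str = "s3a://lakehouse/warehouse",  # illustrative bucket
) -> dict:
    """Spark settings for a JDBC-backed Iceberg catalog with MinIO storage."""
    return {
        # Enable Iceberg SQL extensions
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Catalog definition: metadata tracked in PostgreSQL via JDBC
        "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.lake.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
        "spark.sql.catalog.lake.uri": pg_jdbc_url,
        "spark.sql.catalog.lake.warehouse": warehouse,
        # MinIO: S3-compatible endpoint with path-style access
        "spark.hadoop.fs.s3a.endpoint": minio_endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": "<ACCESS_KEY>",   # placeholder
        "spark.hadoop.fs.s3a.secret.key": "<SECRET_KEY>",   # placeholder
    }

# Applying the settings (requires pyspark plus the Iceberg runtime jar):
# builder = SparkSession.builder.appName("lakehouse")
# for k, v in iceberg_spark_conf(jdbc_url, "http://minio:9000").items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```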
Bronze → Silver Transformation
After configuring Spark to interact seamlessly with Iceberg tables on MinIO, the next critical step was to transform the raw, fragmented healthcare data from the Bronze layer into a clean, standardized Silver layer. This transformation ensures that patient records, lab results, pharmacy transactions, and supplier information are consistent, validated, and structured, making them ready for reliable analytics and downstream reporting.
The process involves several key operations:
Standardization: Column names and data types are normalized across different sources.
Data Cleaning: Duplicate records are removed (e.g., repeated patient visits or lab tests), missing or invalid values are handled, and data anomalies are flagged for review.
Derived Metrics: New calculated columns are created where necessary.
This step ensures all healthcare data conforms to a consistent schema, removes duplicates and inconsistencies, and establishes a reliable foundation for Gold layer metrics such as daily patient admissions, lab turnaround times, pharmacy inventory usage, and supplier performance. It also enables analysts and clinical teams to trust the data, facilitating accurate reporting, compliance, and data-driven decision-making.
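In production these steps run as PySpark DataFrame operations; the core standardize/validate/deduplicate logic can be sketched in plain Python for readability (field names such as patient_id and visit_date are illustrative, not the actual schema):

```python
def to_silver(records):
    """Standardize, validate, and deduplicate raw visit records.

    Simplified sketch of the Bronze → Silver logic; the real Spark job
    applies the same rules as DataFrame transformations.
    """
    seen, silver, anomalies = set(), [], []
    for r in records:
        # Standardization: normalize column names across sources
        row = {k.strip().lower(): v for k, v in r.items()}
        # Validation: flag rows missing required identifiers for review
        if not row.get("patient_id") or not row.get("visit_date"):
            anomalies.append(row)
            continue
        # Deduplication: keep one row per (patient, visit date)
        key = (row["patient_id"], row["visit_date"])
        if key in seen:
            continue
        seen.add(key)
        silver.append(row)
    return silver, anomalies
```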
Silver → Gold Aggregation
With clean and standardized healthcare data available in the Silver layer, the next step was to aggregate and transform it into analytical datasets in the Gold layer. This step focuses on producing business-ready metrics that clinicians, administrators, and analysts can use directly for reporting and decision-making.
The main objectives of this step were:
Generate key performance metrics, such as daily patient admissions, lab test turnaround times, pharmacy usage, and supplier performance.
Aggregate Silver-layer data by relevant dimensions, e.g., date, department, or facility, to make it suitable for dashboards and BI tools.
Implement healthcare-specific business logic, such as filtering only completed lab tests or excluding canceled appointments, to ensure accurate reporting.
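The aggregation pattern, again sketched in plain Python for readability (the production job is a Spark groupBy; the metric and field names are illustrative):

```python
from collections import Counter

def daily_admissions(silver_visits):
    """Aggregate Silver-layer visit rows into a Gold-layer daily metric.

    Counts admissions per (visit_date, dept), applying the business rule
    of excluding canceled appointments before aggregation.
    """
    counts = Counter(
        (v["visit_date"], v["dept"])
        for v in silver_visits
        if v.get("status") != "canceled"  # healthcare-specific filter
    )
    return [
        {"visit_date": d, "dept": dept, "admissions": n}
        for (d, dept), n in sorted(counts.items())
    ]
```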
Airflow orchestration
After building the ETL pipelines for Bronze → Silver → Gold transformations, the next step was to automate and schedule these processes using Apache Airflow. Airflow allows us to define Directed Acyclic Graphs (DAGs), which represent the workflow of ETL tasks, manage dependencies, handle retries, and provide observability for the entire pipeline.
In our healthcare Lakehouse, the Airflow DAG orchestrates the daily batch ETL flow for patient, lab, pharmacy, and supplier data:
In the Airflow DAG, the Bronze → Silver task runs Spark jobs to clean, standardize, and validate raw healthcare data, ensuring that patient records, lab results, pharmacy transactions, and supplier information are consistent and deduplicated before further processing.
Once this task completes, the Silver → Gold task executes Spark jobs to aggregate the Silver-layer data into actionable metrics, including daily patient admissions, lab turnaround times, pharmacy inventory usage, and supplier performance.
Airflow manages the task dependencies, guaranteeing that Silver transformations only begin after Bronze ingestion is complete and that Gold-level aggregations start only once the Silver tables are ready, providing a reliable and automated workflow for the entire ETL process.
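A minimal version of such a DAG is sketched below as orchestration config; the DAG id, task ids, and spark-submit paths are placeholders, not the client's actual files (Airflow 2.4+ syntax):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="healthcare_lakehouse_daily",   # illustrative name
    start_date=datetime(2024, 5, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    bronze_to_silver = BashOperator(
        task_id="bronze_to_silver",
        # {{ ds }} passes the execution date to the Spark job
        bash_command="spark-submit /opt/etl/bronze_to_silver.py {{ ds }}",
        retries=2,
    )
    silver_to_gold = BashOperator(
        task_id="silver_to_gold",
        bash_command="spark-submit /opt/etl/silver_to_gold.py {{ ds }}",
        retries=2,
    )
    # Gold aggregation starts only after Silver tables are ready
    bronze_to_silver >> silver_to_gold
```

The `>>` dependency operator encodes the ordering guarantee described above, and the retry settings give the pipeline the automated recovery behavior Airflow provides out of the box.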
Query and analytics layer
Once the Gold-layer metrics were available in Iceberg tables, the final step was to enable analytics and reporting. We used Trino as the query engine, configured with the Iceberg connector, to provide distributed SQL query capabilities directly on the Lakehouse. Trino allows analysts, clinical teams, and administrators to run ad-hoc queries or feed BI dashboards without needing to export or move large volumes of data.
In this Lakehouse, we configured Trino to query Iceberg tables directly, enabling analysts and BI tools such as Power BI to access data without requiring intermediate exports. By leveraging Trino's distributed query engine, we ensured fast response times even when handling large datasets spanning multiple hospitals, laboratories, and pharmacies.
This setup provides timely, same-day access to actionable insights, supporting clinical decision-making, operational monitoring, and regulatory reporting.
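As an illustration, the snippet below shows a Gold-layer query of the kind analysts might run, plus a helper using the trino Python client; the catalog/schema/table names and host are hypothetical, not the deployment's actual identifiers:

```python
# Example Gold-layer query via Trino's Iceberg connector.
# "iceberg.gold.daily_admissions" is an illustrative table name.
DAILY_ADMISSIONS_SQL = """
SELECT visit_date, dept, SUM(admissions) AS admissions
FROM iceberg.gold.daily_admissions
WHERE visit_date >= DATE '2024-10-01'
GROUP BY visit_date, dept
ORDER BY visit_date
"""

def run_daily_admissions(host: str = "trino.internal", port: int = 8080):
    """Execute the query against a live cluster (requires `pip install trino`)."""
    import trino  # imported lazily; needs a reachable Trino coordinator
    conn = trino.dbapi.connect(
        host=host, port=port, user="analyst",
        catalog="iceberg", schema="gold",
    )
    cur = conn.cursor()
    cur.execute(DAILY_ADMISSIONS_SQL)
    return cur.fetchall()
```

Power BI connects to the same coordinator through its Trino/ODBC connectivity, so dashboards and ad-hoc queries share one serving layer.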
Key features
The healthcare Lakehouse we implemented combines scalability, reliability, and governance to provide a modern analytics platform for clinical, operational, and administrative teams. By leveraging open-source technologies like Apache Spark, Iceberg, MinIO, Airflow, and Trino, we delivered a solution that supports large-scale healthcare data processing while maintaining compliance, traceability, and performance. The platform includes several standout features that make it a robust, enterprise-ready system:
Key Highlights:
On-Prem Object Store: MinIO provides S3-compatible object storage inside the healthcare data center, delivering scalable, fault-tolerant storage while ensuring data privacy and regulatory compliance.
ACID Transactions: Iceberg guarantees consistent writes and snapshot isolation, ensuring patient, lab, pharmacy, and supplier data are always accurate and reliable.
Schema Evolution: The Lakehouse supports adding, renaming, or deleting columns in tables without the need for full ETL rebuilds.
Automated Scheduling: Airflow DAGs manage the full ETL pipeline from ingestion to transformation to aggregation, ensuring reliable, timely updates of healthcare metrics.
BI-Ready Data: Trino enables low-latency SQL queries on Gold-layer tables, allowing dashboards in Power BI to serve up-to-date insights without additional data exports.
Business Impacts
The implementation of the healthcare Lakehouse delivered significant improvements in data availability, reliability, and operational efficiency. Previously, healthcare data from patient records, labs, pharmacy, and supplier systems was only available the next day, queries could take 60–90 seconds, schema changes often required extensive rework, and manual interventions were frequent, leading to low trust in analytics.
After deploying the Lakehouse, data is updated same-day, queries now run in under 10 seconds, schema changes are handled automatically via Iceberg, and ETL pipelines are fully automated with Airflow, resulting in highly trusted, reproducible data for all stakeholders.
Unified Reporting: Clinical, operational, and administrative teams now have access to consistent and reliable metrics across departments.
Operational Efficiency: ETL maintenance time decreased by approximately 70%, freeing teams to focus on analytics rather than data wrangling.
Reusable Analytics Assets: Iceberg tables created for the Bronze, Silver, and Gold layers can be reused for new reports, dashboards, or advanced analytics use cases, enabling scalability and future growth.
Conclusion
By implementing an on-premises batch Lakehouse leveraging Apache Iceberg, Spark, MinIO, and Airflow, the healthcare enterprise successfully modernized its data platform while remaining fully compliant with regulatory and budget constraints. Patient, lab, pharmacy, and supplier data is now centrally stored, versioned, and queryable, providing a single source of truth across the organization. ETL processes are automated and reproducible, ensuring data is consistently cleaned, validated, and modeled, while Trino enables analysts to access business-ready metrics instantly without additional data exports.
This architecture not only enhances data trust, query performance, and reporting reliability, but also establishes a scalable foundation for future enhancements. The platform is ready to support advanced data governance and additional analytics workflows, evolving toward a full enterprise-grade Lakehouse capable of supporting both operational and clinical decision-making.