Databricks Ingestion & Integration (Azure Retail)
Feb 14, 2025

Project Type: B2B Resource
Project Timeline: December 2024 – February 2025
Client Region: APAC
Industry: Retail
Context & Objectives
The client, Australia’s second-largest retail and fashion brand, was facing a growing challenge: their analytics ecosystem had become fragmented and difficult to manage. Data was flowing in from a wide variety of sources: on-premises POS systems, ERP databases, eCommerce platforms, inventory catalogs, and customer support systems. Each source had its own structure, data quality issues, and refresh cadence. As a result, teams often struggled with inconsistent reporting, delayed insights, and duplicated efforts. Partner reporting was particularly cumbersome, relying heavily on manual extracts and spreadsheets, which increased the risk of errors and slowed decision-making.
Our mission was to transform this fragmented landscape into a modern, reliable, and governed analytics platform on Azure. We aimed to build an ingestion framework that could handle both batch and streaming sources efficiently, while ensuring data quality and governance through Databricks and Unity Catalog. The solution needed to follow repeatable patterns so that new data sources could be integrated quickly, without creating additional complexity.
Project Goals
Stand up ADF-orchestrated ingestion (batch + streams) into Databricks and Unity Catalog
The goal was to build automated pipelines capable of ingesting data from both batch and streaming sources. Event Hub and Kafka streams were to be integrated for real-time feeds, while ADF scheduling and retries would ensure reliable ingestion. All data needed to land in Databricks with governance enforced through Unity Catalog.
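As a rough sketch of the streaming leg of this goal, the pattern below reads a Kafka-compatible feed (Event Hubs exposes a Kafka endpoint) into a Bronze Delta table via Structured Streaming. All topic, catalog, and checkpoint names here are illustrative assumptions, not the client's actual configuration.

```python
# Sketch: land a Kafka/Event Hubs stream into a Bronze Delta table.
# Topic, catalog, and checkpoint names are hypothetical.

def bronze_table_name(catalog: str, schema: str, entity: str) -> str:
    """Standard three-level Unity Catalog name for a Bronze landing table."""
    return f"{catalog}.{schema}.bronze_{entity}"

def start_bronze_stream(spark, topic: str, bootstrap: str,
                        table: str, checkpoint: str):
    """Append raw Kafka/Event Hubs payloads to a Bronze Delta table.
    Payloads stay unparsed in Bronze; parsing happens in Silver."""
    from pyspark.sql import functions as F  # deferred: needs a Spark runtime
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", bootstrap)
           .option("subscribe", topic)
           .option("startingOffsets", "earliest")
           .load())
    bronze = raw.select(
        F.col("key").cast("string").alias("event_key"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("ingested_at"),
        F.lit(topic).alias("source_topic"))
    return (bronze.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint)
            .toTable(table))

# Example (on Databricks):
#   start_bronze_stream(spark, "pos-events", "eh-namespace:9093",
#                       bronze_table_name("retail", "bronze", "pos_events"),
#                       "/chk/pos_events")
```

Keeping the payload raw at Bronze means a bad upstream schema change never breaks ingestion; only the Silver parse job needs to adapt.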
Establish medallion patterns and reusable templates
We aimed to implement the medallion architecture (Bronze, Silver, Gold) to improve consistency and reduce repetitive engineering work. Reusable templates for common retail entities such as products, stores, inventory, orders, and customers were targeted, enabling faster onboarding of new sources with standardized transformations and logging.
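One way such templates could look, as a hypothetical sketch: a small registry that standardizes table names, merge keys, and audit columns per retail entity, so a generic job can run the same Bronze-to-Silver merge for every source. The field and catalog names are assumptions.

```python
# Hypothetical entity-template registry for the medallion pipelines.
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityTemplate:
    entity: str                  # e.g. "orders", "products"
    merge_keys: tuple            # business keys used for Silver upserts
    audit_cols: tuple = ("_ingested_at", "_source_system")

    def table(self, layer: str, catalog: str = "retail") -> str:
        """Three-level Unity Catalog name, one schema per medallion layer."""
        return f"{catalog}.{layer}.{layer}_{self.entity}"

TEMPLATES = {
    t.entity: t for t in (
        EntityTemplate("products", ("product_id",)),
        EntityTemplate("stores", ("store_id",)),
        EntityTemplate("inventory", ("store_id", "product_id")),
        EntityTemplate("orders", ("order_id",)),
        EntityTemplate("customers", ("customer_id",)),
    )
}

# Onboarding a new source is then just another registry entry; a generic
# job reads the template and applies the same transformations and logging.
print(TEMPLATES["orders"].table("silver"))   # retail.silver.silver_orders
```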
Deliver partner/tenant reporting in Power BI/Looker with RLS
A key goal was to provide secure, role-based access for internal teams and external partners using RLS policies. Pre-aggregated metrics and embedded filters were intended to enable fast, reliable reporting without manual data extracts.
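At the lakehouse layer, RLS of this kind can be expressed as a Unity Catalog row filter keyed on a partner-to-principal mapping table. The statements below are an illustrative sketch; the table, group, and column names are assumptions, not the client's.

```python
# Illustrative Unity Catalog row-level security for a Gold mart.
# Mapping table, group name, and column names are hypothetical.

def rls_statements(gold_table: str, mapping_table: str) -> list:
    """Build the SQL to attach a partner row filter to a Gold table."""
    create_fn = f"""
    CREATE OR REPLACE FUNCTION partner_filter(p_id STRING)
    RETURN is_account_group_member('analysts_all')
        OR EXISTS (SELECT 1 FROM {mapping_table} m
                   WHERE m.partner_id = p_id
                     AND m.principal = current_user())
    """
    attach = (f"ALTER TABLE {gold_table} "
              f"SET ROW FILTER partner_filter ON (partner_id)")
    return [create_fn, attach]

# On Databricks:
#   for stmt in rls_statements("retail.gold.sales_by_partner",
#                              "retail.gold.partner_access"):
#       spark.sql(stmt)
```

Internal analysts pass via the group check; partners only see rows their user is mapped to, so the same Gold table safely backs both internal dashboards and partner reports.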
Challenges
Modernizing analytics for a multi-source, multi-tenant retail environment came with several complexities. We had to handle diverse data sources, ensure governance and security, meet near-real-time freshness requirements, and manage costs, all while maintaining reliability at scale.
Source Diversity & Quality
We faced data from POS, ERP, inventory, eCommerce, web apps, and support systems, spread across both on-premises and cloud platforms. Data quality varied widely, so we needed to standardize, cleanse, and validate all incoming feeds to ensure trustworthy analytics downstream.
Freshness & Reliability
Operational and transactional feeds required near-real-time ingestion, while other sources were updated daily or sub-daily. Designing pipelines that met these SLAs, handled retries, incorporated dead-letter queues, and allowed deterministic backfills was critical to avoid downtime or data loss.
Governance & Security
Sensitive customer data demanded strict protection. We had to enforce Unity Catalog policies, Azure AD SSO, and data masking while ensuring partner reporting followed robust RLS rules. This ensured that each stakeholder could access only the data they were authorized to see.
Cost & Operational Efficiency
Scaling ingestion and transformation workloads without overspending required careful cluster sizing, concurrency controls, and FinOps monitoring. Balancing performance with cost-efficiency was essential to maintain a sustainable analytics environment.
Overall, the combination of diverse sources, strict governance, freshness requirements, and cost constraints made this project complex. We needed a solution that was repeatable, reliable, secure, and efficient to meet the client’s analytics modernization goals.
Solution Overview
To address the complexities of ingesting and governing multi-source retail data, we implemented a repeatable, secure, and monitored ingestion framework using Azure and Databricks. Our solution focused on reliability, data quality, governance, and cost efficiency, ensuring that both batch and streaming data could be ingested, transformed, and served consistently.
Ingestion Framework
We built robust pipelines using Azure Data Factory for batch ingestion and Event Hub/Kafka for streaming sources. Each pipeline incorporated retries, dead-letter queues, and audit tables to ensure reliability and traceability. This allowed us to handle both near-real-time operational feeds and sub-daily/daily batch updates without data loss or downtime.
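The dead-letter idea reduces to a simple routing decision per record: validate, and divert failures with a reason attached instead of failing the whole batch. The toy model below shows that decision in isolation; the field names are illustrative assumptions.

```python
# Toy model of the dead-letter routing used in the pipelines.
# Required fields are hypothetical examples.

REQUIRED = ("order_id", "store_id", "amount")

def route(records):
    """Split a micro-batch into (good, dead_lettered) record lists."""
    good, dlq = [], []
    for rec in records:
        missing = [f for f in REQUIRED if rec.get(f) is None]
        if missing:
            # Tag the record with why it was rejected, for later replay.
            dlq.append({**rec, "_dlq_reason": "missing: " + ",".join(missing)})
        else:
            good.append(rec)
    return good, dlq

batch = [{"order_id": 1, "store_id": "S1", "amount": 9.5},
         {"order_id": 2, "store_id": None, "amount": 3.0}]
good, dlq = route(batch)
# good -> 1 valid record; dlq -> 1 record tagged with "_dlq_reason"
```

In the real pipelines the `dlq` side lands in its own Delta table with the audit metadata, so rejected records can be inspected, fixed upstream, and deterministically replayed.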
Medallion Architecture & Templates
We implemented a Bronze-Silver-Gold medallion architecture to organize data by quality and conformance. Templates were created for common retail entities such as products, stores, orders, and inventory. These templates enabled rapid onboarding of new sources while ensuring consistent transformations, logging, and schema evolution.
Transformation & Governance
Delta Live Tables and Databricks Jobs handled silver-to-gold transformations, ensuring clean, conformed data in the marts. Unity Catalog enforced consistent permissions, policy tags, and data lineage tracking. Sensitive fields were masked, and RLS policies were applied at the gold layer to securely support partner and tenant reporting.
Serving & Observability
Gold-level retail marts were exposed in Power BI and Looker with pre-aggregated metrics and RLS-enabled views. Monitoring dashboards tracked ingestion health, freshness, and reconciliation across layers. Data contracts verified completeness and consistency of critical feeds, giving stakeholders confidence in the data.
Operational Safety & Cost Management
Cluster policies, job concurrency limits, and FinOps telemetry helped control costs while maintaining performance. Alerts and retries minimized operational disruptions, and lineage dashboards ensured changes could be traced to downstream reports and dashboards, improving transparency and reliability.
Constraints & Non-Functional Requirements
In addition to building a reliable ingestion framework, we had to meet several critical non-functional requirements. Security was a top priority, so we enforced Azure AD SSO for authentication, applied Unity Catalog governance for consistent permissions, and masked sensitive fields to protect customer data. Freshness requirements varied by source: operational feeds needed near-real-time ingestion, while other sources could be processed daily or sub-daily, which demanded careful orchestration. Reliability was ensured through dead-letter queues, audit tables, and backfills, enabling scalable ingestion across multiple domains without data loss.
Finally, cost efficiency was a key consideration: cluster policies, job concurrency limits, and FinOps telemetry helped optimize resource usage while keeping operational costs under control.
Data Model & Semantics
To organize the retail data effectively, we modeled core retail entities such as products, stores, inventory, orders, customers, and sessions. These entities formed the foundation of our ingestion and analytics workflows.
We implemented a medallion architecture to structure the data across layers: Bronze stored raw or streaming data exactly as ingested, Silver contained conformed and cleaned datasets ready for analytics, and Gold consisted of curated retail marts, such as sales performance, inventory health, and stockout reports. This layered approach ensured data quality, traceability, and consistency.
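As an example of the Gold layer's logic, a stockout report reduces each inventory row to a health classification. The threshold rule below is a hedged sketch with assumed column names, not the client's exact business rule.

```python
# Sketch of the classification rule behind a Gold stockout mart.
# Thresholds and column names are illustrative assumptions.

def stockout_status(on_hand: int, safety_stock: int) -> str:
    """Classify inventory health for the stockout report."""
    if on_hand <= 0:
        return "stockout"
    if on_hand < safety_stock:
        return "at_risk"
    return "healthy"

# In the Gold job this becomes a column expression over the Silver
# inventory table, e.g. (PySpark, run on Databricks):
#   from pyspark.sql import functions as F
#   gold = silver.withColumn("status",
#       F.when(F.col("on_hand") <= 0, "stockout")
#        .when(F.col("on_hand") < F.col("safety_stock"), "at_risk")
#        .otherwise("healthy"))
```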
To support reporting and partner access, we applied pre-aggregations and embedded filters. Frequent queries were optimized for performance, while role-based filters (RLS) ensured that partners and tenants could only access the data they were authorized to see, maintaining both usability and governance.
Ops, Security, Quality & Performance
To ensure smooth and reliable operations, we implemented robust operational practices. Azure Data Factory pipelines included retries and alerts, while audit-driven backfills and schema evolution patterns helped maintain continuity and correctness during ingestion. This approach minimized downtime and ensured that any transient failures were automatically handled.
From a security perspective, we enforced strict access controls using Unity Catalog grants and Azure AD SSO. Sensitive fields were masked, and partner-specific RLS policies were applied to Gold-level marts, ensuring that each user could only access authorized data.
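To make the masking concrete, here is an illustrative helper mirroring what a column mask might show non-privileged readers; the exact masking format is an assumption.

```python
# Illustrative PII masking rule, as a non-privileged reader might see it.
# The keep-first-character format is an assumption.

def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"
    return f"{local[:1]}***@{domain}"

print(mask_email("jane.doe@example.com"))  # j***@example.com

# On Databricks the equivalent rule can live in a SQL function attached as
# a Unity Catalog column mask:
#   ALTER TABLE customers ALTER COLUMN email SET MASK mask_email_fn
```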
Data quality was a key focus. We implemented reconciliation checks between layers, freshness monitoring, and formal data contracts for critical feeds, guaranteeing that data was accurate, complete, and trustworthy for both internal analytics and partner reporting.
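A minimal version of such a reconciliation check compares row counts between layers and flags drift beyond a tolerance; Silver is allowed to drop some rows (deduplication, dead-lettered records), but only within a bounded percentage. The tolerance value below is an assumption.

```python
# Minimal layer-to-layer reconciliation check for one feed.
# The 0.5% default tolerance is an illustrative assumption.

def reconcile(bronze_count: int, silver_count: int,
              max_loss_pct: float = 0.5) -> dict:
    """Flag a feed when Silver loses more than max_loss_pct of Bronze rows."""
    lost = bronze_count - silver_count
    loss_pct = 100.0 * lost / bronze_count if bronze_count else 0.0
    return {"lost_rows": lost,
            "loss_pct": round(loss_pct, 2),
            "ok": 0 <= loss_pct <= max_loss_pct}

print(reconcile(10_000, 9_990))  # {'lost_rows': 10, 'loss_pct': 0.1, 'ok': True}
print(reconcile(10_000, 9_000))  # 10% loss -> flagged, ok=False
```

In practice the two counts come from the audit tables per load window, and an `ok=False` result raises an alert before the Gold marts refresh.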
Finally, performance and cost efficiency were optimized through pre-aggregations for frequent queries, bounded defaults to limit resource consumption, and cluster policy enforcement. Lineage dashboards allowed us to track downstream impacts of changes, making maintenance predictable and enabling cost-effective scaling.
Tech Stack
Data Sources:
POS/ERP systems, inventory/catalog databases, eCommerce/web/app platforms, and support/customer service systems.
Ingestion:
Azure Data Factory (ADF) pipelines for batch processing
Event Hub/Kafka streams for real-time feeds
Dead-letter queues for reliable error handling.
Storage / Lakehouse:
Azure Data Lake Storage (ADLS) for raw and Bronze layers
Databricks for Delta Silver and Gold layers, all managed under Unity Catalog for governance.
Orchestration:
ADF pipelines for ingestion orchestration
Databricks Jobs for model execution and transformations.
Transformation / Modeling:
Delta Live Tables (DLT) for Bronze streaming
Databricks Jobs for Silver-to-Gold transformations
Reusable medallion patterns/templates for common retail entities.
Serving / Consumption:
Power BI and Looker for analytics and reporting
Governance / Security:
Azure AD SSO for authentication
Unity Catalog policies and tags for access control
Data masking for sensitive fields.
Observability / Quality:
ADF monitoring dashboards, audit tables, reconciliation/freshness checks
Lineage tracking to ensure data reliability and transparency.
DevOps / CI-CD:
Repo-backed notebooks and jobs, environment promotion processes, and a template library for consistent deployments.
FinOps / Cost Management:
Cluster policies, job concurrency controls, and cost telemetry to optimize resource usage and reduce operational expenses.
Outcomes & Business Impact
Faster Onboarding:
By implementing reusable templates and standardized medallion patterns, new data sources could be integrated quickly and efficiently. This significantly reduced delivery timelines and accelerated analytics readiness for business teams.
Operational Stability:
Robust pipelines with dead-letter queues, automated retries, and continuous monitoring ensured reliable data ingestion. As a result, incidents were minimized, and operational downtime was drastically reduced, giving teams confidence in the data.
Partner Trust:
Secure, role-based access to Gold-level data marts through RLS-enabled views eliminated the need for manual extracts. Partners could now access accurate and authorized data directly, increasing trust and reducing operational overhead.
Scalability:
The medallion architecture and template library provided a repeatable framework for expanding ingestion and transformation across multiple domains. This allowed the organization to scale analytics without compromising on data quality, governance, or operational efficiency.
Deliverables
ADF Pipelines & Templates: Batch and streaming ingestion pipelines with built-in retries and dead-letter queues, enabling reliable and repeatable data ingestion.
Databricks Medallion Patterns: Bronze, Silver, and Gold transformations implemented with Unity Catalog governance, ensuring consistent data quality and lineage across all layers.
Gold Retail Marts: Partner-ready views with pre-aggregated metrics and role-based access (RLS), enabling secure and fast reporting without manual extracts.
Monitoring & Alerts Dashboards: Dashboards to track data freshness, ingestion health, and lineage, providing transparency and operational oversight.
FinOps Telemetry: Reports and guardrails to optimize cluster usage, control costs, and monitor resource efficiency during ingestion and transformation workflows.
Conclusion: Enabling Modern, Governed Retail Analytics
The Xponent Ingestion & Integration project modernized the client’s retail analytics on Azure by creating a repeatable, governed, and scalable data ingestion framework. Using medallion architecture and reusable templates, we ensured consistent, high-quality data from diverse sources, while secure RLS-enabled reporting gave partners and internal teams trustworthy access.
Operational reliability was achieved with retries, dead-letter queues, audit tables, and monitoring dashboards, while FinOps practices kept resource usage efficient.
Overall, the project delivered a future-ready analytics platform: one that supports scalable growth, secure data sharing, reliable governance, and audit-ready operations. Teams can now confidently use data for business insights, partners can access governed views, and the organization has a strong foundation for ongoing expansion across domains and sources.