Driving Data Excellence: A Real-Time Lakehouse with DigileEdge

1. Context: Why the Legacy System Failed

The legacy platform failed not simply because it was “old,” but because of structural constraints in its architecture:

1.1 Metadata Explosion

~400,000 tables in a single schema with no lifecycle management, causing query planner degradation and catalog lookup latency.

1.2 Batch-Oriented Architecture

Monthly/yearly scheduled ingestion with no incremental strategy, causing long freshness gaps and failure propagation across pipelines.

1.3 Tight Coupling Between Layers

Ingestion, transformation, and reporting were tightly bound; changes required full pipeline rewrites with no reusability.

1.4 Lack of Observability

No lineage tracking, limited monitoring, and manual debugging via logs made incident response slow and painful.

2. Architecture Design: Detailed Breakdown

2.1 Design Principles

The new system was built around five core principles:

  • Decoupling: storage (Iceberg) is separated from compute (Trino), so each scales independently.
  • Immutability: append-only, versioned tables prevent data corruption and enable time travel.
  • Streaming-first: CDC-based ingestion replaces batch jobs for near real-time data freshness.
  • Idempotency: deterministic pipeline execution ensures safe retries without duplicate data.
  • Observability: metadata and lineage tracking built in from day one via DataHub.

2.2 Data Layering Strategy

We implemented a multi-layered data model:

Layer 1: Raw (Bronze)

  • Source-aligned schema
  • Minimal transformation
  • Append-only ingestion via CDC

Layer 2: Refined (Silver)

  • Data cleansing
  • Schema normalization
  • Deduplication logic

Layer 3: Curated (Gold)

  • Business-ready datasets
  • Aggregations and joins
  • Optimized for query performance
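As an illustration of how data moves between layers, the sketch below promotes deduplicated rows from Bronze to Silver through Trino. The host, schema, and table names (bronze.orders_raw, silver.orders) are assumptions; in the actual platform this logic is generated by the config-driven pipeline framework described in section 4.1.

```python
# Sketch only: Bronze -> Silver promotion with deduplication, run through Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.internal", port=8080, user="pipeline",
    catalog="iceberg", schema="silver",
)
cur = conn.cursor()

# Keep only the newest change event per business key, then cast to the Silver schema.
cur.execute("""
    INSERT INTO silver.orders
    SELECT order_id, CAST(amount AS DECIMAL(18, 2)) AS amount, region, event_ts
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY order_id ORDER BY event_ts DESC) AS rn
        FROM bronze.orders_raw
        WHERE ingest_date = DATE '2024-01-15'
    ) deduped
    WHERE rn = 1
""")
cur.fetchall()  # drain the result set so the statement runs to completion
```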

2.3 Storage Design with Apache Iceberg

Why Iceberg?

  • Schema evolution without rewriting data
  • Partition evolution (critical for long-term scalability)
  • ACID transactions on data lake
  • Time travel for auditability

Table Design Decisions

  • Partitioning Strategy
    • Based on query access patterns (e.g., date, region)
    • Avoided over-partitioning (common anti-pattern)
  • File Size Optimization
    • Target file size: 512MB–1GB
  • Avoided the small-file problem using compaction jobs
  • Compaction Strategy
    • Scheduled compaction via Airflow
    • Merge small files → optimize read performance
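To make these table design decisions concrete, the sketch below creates a date- and region-partitioned Iceberg table and triggers a compaction pass through Trino's Iceberg connector. The table name and the 512 MB threshold are illustrative; in production the compaction statement is issued from a scheduled Airflow task.

```python
# Sketch only: partition layout plus scheduled compaction via Trino's Iceberg connector.
import trino

cur = trino.dbapi.connect(host="trino.internal", port=8080,
                          user="platform", catalog="iceberg").cursor()

# Partition by day and region to match the dominant BI filter columns.
cur.execute("""
    CREATE TABLE IF NOT EXISTS gold.sales_daily (
        sale_id   BIGINT,
        region    VARCHAR,
        amount    DECIMAL(18, 2),
        event_ts  TIMESTAMP(6)
    )
    WITH (
        partitioning = ARRAY['day(event_ts)', 'region'],
        format = 'PARQUET'
    )
""")
cur.fetchall()

# Compaction: rewrite files below the target size into larger ones to avoid
# the small-file problem on read.
cur.execute("ALTER TABLE gold.sales_daily EXECUTE optimize(file_size_threshold => '512MB')")
cur.fetchall()
```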

2.4 Query Layer (Trino)

Cluster Design

  • Coordinator + multiple worker nodes
  • Auto-scaling enabled via Kubernetes

Query Optimization Techniques

  • Predicate pushdown
  • Partition pruning
  • Broadcast joins for small tables
  • Cost-based optimizer tuning
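Most of these techniques apply automatically once table statistics are available, but the join strategy can also be steered per session. The sketch below uses hypothetical table names, forces a broadcast join for a small dimension table, and inspects the plan with EXPLAIN to confirm that partition pruning and the broadcast exchange kick in.

```python
# Sketch only: forcing a broadcast join and inspecting the resulting plan.
import trino

conn = trino.dbapi.connect(
    host="trino.internal", port=8080, user="analyst",
    catalog="iceberg", schema="gold",
    # Ship the small dimension table to every worker instead of repartitioning both sides.
    session_properties={"join_distribution_type": "BROADCAST"},
)
cur = conn.cursor()

cur.execute("""
    EXPLAIN
    SELECT f.region, sum(f.amount) AS revenue
    FROM sales_daily f
    JOIN dim_region d ON f.region = d.region_code
    WHERE f.event_ts >= DATE '2024-01-01'
    GROUP BY f.region
""")
print(cur.fetchall()[0][0])  # plan text shows the join distribution and pruned partitions
```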

Concurrency Handling

  • Resource groups configured
  • Query prioritization (BI vs ad-hoc)
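Concurrency is managed with Trino resource groups. The snippet below sketches one possible resource-groups file, written from Python for consistency with the other examples; the group names, limits, scheduling weights, and the BI source pattern are assumptions rather than the production values.

```python
# Sketch only: an illustrative etc/resource-groups.json splitting BI and ad-hoc workloads.
import json

resource_groups = {
    "rootGroups": [{
        "name": "global",
        "softMemoryLimit": "80%",
        "hardConcurrencyLimit": 100,
        "maxQueued": 1000,
        "subGroups": [
            # BI dashboards get the larger share and a higher scheduling weight.
            {"name": "bi", "softMemoryLimit": "50%", "hardConcurrencyLimit": 40,
             "maxQueued": 200, "schedulingWeight": 3},
            # Ad-hoc exploration is capped so it cannot starve the dashboards.
            {"name": "adhoc", "softMemoryLimit": "30%", "hardConcurrencyLimit": 10,
             "maxQueued": 100, "schedulingWeight": 1},
        ],
    }],
    "selectors": [
        {"source": "looker.*", "group": "global.bi"},  # route by client source tag
        {"group": "global.adhoc"},                      # everything else
    ],
}

with open("resource-groups.json", "w") as f:
    json.dump(resource_groups, f, indent=2)
```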

2.5 Ingestion Layer (CDC + NiFi)

CDC via Striim

  • Log-based change capture
  • Handles:
    • Inserts
    • Updates
    • Deletes

CDC Design Challenges

  • Out-of-order events
  • Duplicate events
  • Late-arriving data

Solutions

  • Event timestamp ordering
  • Deduplication keys
  • Watermarking strategies
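A stripped-down illustration of that logic, with hypothetical field names: events are keyed by primary key, duplicates are broken on a deduplication key, and anything arriving later than the watermark is excluded from the window (the real pipeline routes such records to a correction path).

```python
# Sketch only: latest-wins merge of out-of-order, possibly duplicated CDC events.
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=15)  # how late an event may arrive and still join its window

def merge_cdc_events(events, window_end):
    """Return one winning change per primary key for the window ending at window_end."""
    latest = {}
    for ev in events:
        ts = datetime.fromisoformat(ev["event_ts"])
        if ts < window_end - WATERMARK:
            continue  # beyond the watermark: handled by a separate correction path
        key = (ev["table"], ev["pk"])
        current = latest.get(key)
        # The dedup key (op_id) breaks ties between retransmitted copies of the same change.
        if current is None or (ts, ev["op_id"]) > (current[0], current[1]["op_id"]):
            latest[key] = (ts, ev)
    return [winner for _, winner in latest.values()]
```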

NiFi Pipelines

  • Used for:
    • Batch ingestion
    • External data sources
  • Back-pressure handling configured

2.6 Orchestration (Airflow)

DAG Design Principles

  • Modular DAGs (per domain)
  • Idempotent tasks
  • Retry policies with exponential backoff

Dependency Management

  • Task-level dependencies
  • Dataset-based triggering (where applicable)
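A condensed Airflow sketch of these principles; the DAG, dataset URI, and task names are illustrative. It shows exponential-backoff retries, a task that is safe to re-run because its write is scoped to the run's logical date, and a dataset outlet that lets a downstream DAG trigger on data availability instead of a fixed schedule.

```python
# Sketch only: modular, idempotent DAG with retries and dataset-based triggering.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

silver_orders = Dataset("iceberg://silver/orders")  # illustrative dataset URI

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,  # 2, 4, 8 minutes between attempts
}

def promote_orders(**context):
    # Idempotent by design: the target slice is derived from the run's logical
    # date, so a retry rebuilds the same slice instead of appending twice.
    print(f"promoting bronze -> silver for {context['ds']}")

with DAG(
    dag_id="orders_silver",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    default_args=default_args,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="promote_orders",
        python_callable=promote_orders,
        outlets=[silver_orders],  # marks the dataset as updated when the task succeeds
    )

# A consumer DAG defined elsewhere can then be scheduled on data availability:
#   DAG(dag_id="orders_gold", schedule=[silver_orders], ...)
```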

3. Infrastructure Provisioning

3.1 Kubernetes Architecture

Cluster Setup

  • Multi-node cluster
  • Node pools:
    • Compute-heavy (Trino workers)
    • Storage-heavy (data nodes)
    • General-purpose (Airflow, NiFi)

Key Configurations

  • Horizontal Pod Autoscaler (HPA)
  • Pod disruption budgets
  • Node affinity rules
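As one concrete example, autoscaling for the Trino worker pool can be expressed with a HorizontalPodAutoscaler like the sketch below (deployment name, replica bounds, and the 70% CPU target are assumptions), rendered from Python to keep the examples in one language.

```python
# Sketch only: an autoscaling/v2 HPA manifest for the Trino worker deployment.
import yaml

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "trino-worker", "namespace": "processing"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "trino-worker"},
        "minReplicas": 3,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                # Add workers when average CPU exceeds 70% of the requested amount.
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
    },
}

print(yaml.safe_dump(hpa, sort_keys=False))  # pipe into kubectl apply -f -
```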

3.2 Deployment Strategy

Helm-Based Deployments

  • Parameterized configurations
  • Environment-specific values

Namespace Strategy

  • Separate namespaces for:
    • ingestion
    • processing
    • orchestration
    • governance

3.3 Storage Infrastructure

  • Object storage (S3-compatible)
  • Persistent volumes for stateful components

3.4 Networking

  • Internal service mesh (optional)
  • Secure communication via TLS
  • Role-based access control (RBAC)

4. Development Approach

4.1 Pipeline Framework (DigileEdge)

  • Config-driven pipelines (YAML/JSON-based)
  • Reusable templates:
    • CDC ingestion
    • Batch ingestion
    • Transformation jobs
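The framework itself is internal, so the sketch below only illustrates the config-driven idea: a YAML pipeline definition selects one of the reusable templates, and the engine dispatches on it. The keys, template names, and source/target identifiers are hypothetical.

```python
# Sketch only: dispatching a YAML pipeline definition to a reusable template.
import yaml

PIPELINE_YAML = """
pipeline: orders_cdc
template: cdc_ingestion        # one of: cdc_ingestion, batch_ingestion, transformation
source:
  connection: oracle_prod
  table: SALES.ORDERS
target:
  table: iceberg.bronze.orders_raw
options:
  dedup_keys: [order_id]
  watermark_minutes: 15
"""

def run_cdc_ingestion(cfg):
    print(f"CDC ingest {cfg['source']['table']} -> {cfg['target']['table']}")

def run_batch_ingestion(cfg):
    print(f"batch load {cfg['source']['table']} -> {cfg['target']['table']}")

TEMPLATES = {
    "cdc_ingestion": run_cdc_ingestion,
    "batch_ingestion": run_batch_ingestion,
}

config = yaml.safe_load(PIPELINE_YAML)
TEMPLATES[config["template"]](config)  # same engine; behaviour driven entirely by config
```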

4.2 Idempotent Processing

  • Each pipeline designed to be re-runnable
  • No duplicate data on retries
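One common way to achieve this (shown below as an illustration, not the framework's exact mechanism) is to scope every write to the run's partition and rebuild that slice on each attempt, so a retry converges to the same result as a clean first run.

```python
# Sketch only: an idempotent "rebuild this run's slice" write pattern through Trino.
import trino

def write_partition(run_date: str) -> None:
    """run_date is supplied by the orchestrator (e.g. the Airflow logical date)."""
    cur = trino.dbapi.connect(host="trino.internal", port=8080,
                              user="pipeline", catalog="iceberg").cursor()
    # Delete-then-insert within the partition: rerunning for the same run_date
    # rebuilds the same slice instead of appending duplicate rows.
    cur.execute(f"DELETE FROM silver.orders WHERE ingest_date = DATE '{run_date}'")
    cur.fetchall()
    cur.execute(f"""
        INSERT INTO silver.orders
        SELECT * FROM bronze.orders_raw WHERE ingest_date = DATE '{run_date}'
    """)
    cur.fetchall()
```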

4.3 Schema Evolution Handling

  • Iceberg schema evolution
  • Backward compatibility maintained
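Because Iceberg tracks columns by ID rather than by position, additive changes are metadata-only operations; a hypothetical example issued through Trino:

```python
# Sketch only: additive schema evolution on an Iceberg table (no data rewrite).
import trino

cur = trino.dbapi.connect(host="trino.internal", port=8080,
                          user="platform", catalog="iceberg").cursor()
# A new nullable column: existing files stay valid, readers see NULL until backfilled,
# and downstream consumers of the old columns are unaffected.
cur.execute("ALTER TABLE silver.orders ADD COLUMN discount_pct DECIMAL(5, 2)")
cur.fetchall()
```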

4.4 Version Control

  • Git-based
  • Code + config versioned together

4.5 Data Quality Framework

  • Validation rules:
    • Null checks
    • Range checks
    • Referential integrity
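A minimal shape for such rules, with assumed rule names, thresholds, and target table: each rule is a SQL predicate that matches violating rows, evaluated against the Silver table after every load and surfaced to alerting.

```python
# Sketch only: SQL-predicate data quality rules evaluated after each load.
import trino

RULES = {
    # rule name -> predicate matching *violating* rows
    "order_id_not_null": "order_id IS NULL",
    "amount_in_range":   "amount < 0 OR amount > 1000000",
    "region_is_known":   "region NOT IN (SELECT region_code FROM gold.dim_region)",
}

def run_checks(table: str = "silver.orders") -> dict:
    cur = trino.dbapi.connect(host="trino.internal", port=8080,
                              user="dq", catalog="iceberg").cursor()
    failures = {}
    for name, predicate in RULES.items():
        cur.execute(f"SELECT count(*) FROM {table} WHERE {predicate}")
        violations = cur.fetchall()[0][0]
        if violations:
            failures[name] = violations  # raised as alerts and data quality dashboard entries
    return failures
```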

5. Implementation Strategy

5.1 Migration Approach

  • Incremental migration (table-by-table)
  • Parallel run:
    • Legacy vs new system

5.2 Cutover Strategy

  • Shadow mode testing
  • Gradual switchover

5.3 Rollback Strategy

  • Versioned datasets
  • Rollback to previous snapshot
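Rollback leans on Iceberg's snapshot history. The sketch below shows what that looks like through Trino's Iceberg connector; the schema, table, and snapshot ID are placeholders.

```python
# Sketch only: reverting an Iceberg table to a known-good snapshot via Trino.
import trino

cur = trino.dbapi.connect(host="trino.internal", port=8080,
                          user="platform", catalog="iceberg").cursor()

# Inspect recent snapshots to find the last known-good state.
cur.execute('SELECT snapshot_id, committed_at FROM silver."orders$snapshots" ORDER BY committed_at DESC')
for snapshot_id, committed_at in cur.fetchall()[:5]:
    print(snapshot_id, committed_at)

# Roll the table back; subsequent queries (and the BI layer) see the earlier state.
cur.execute("CALL iceberg.system.rollback_to_snapshot('silver', 'orders', 8954597067493422955)")
cur.fetchall()
```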

6. Testing Strategy

6.1 Data Validation

  • Row-level reconciliation
  • Hash-based validation
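The general idea, sketched with hypothetical connections and one illustrative table: compare a row count plus an order-independent checksum computed the same way on both systems.

```python
# Sketch only: count-plus-checksum reconciliation between legacy and lakehouse copies.
import hashlib

def table_fingerprint(cursor, query):
    """Row count and an order-independent digest of the rows returned by query."""
    cursor.execute(query)
    count, digest = 0, 0
    for row in cursor.fetchall():
        count += 1
        # XOR of per-row hashes is insensitive to row ordering across the two systems.
        digest ^= int.from_bytes(
            hashlib.sha256("|".join(map(str, row)).encode()).digest()[:8], "big")
    return count, digest

# legacy_cur and lakehouse_cur are DB-API cursors for the two systems (assumed to exist).
QUERY = "SELECT order_id, amount, region FROM orders WHERE order_date = DATE '2024-01-15'"
# assert table_fingerprint(legacy_cur, QUERY) == table_fingerprint(lakehouse_cur, QUERY)
```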

6.2 Performance Testing

  • Benchmark queries
  • Load testing with synthetic + real data

6.3 Chaos Testing

  • Node failures
  • Network interruptions

6.4 Pipeline Testing

  • Unit tests for transformations
  • Integration tests across DAGs

6.5 SLA Validation

  • End-to-end pipeline timing
  • Alert thresholds

7. Observability & Operations

Monitoring Stack

  • Metrics:
    • CPU, memory
    • Query latency
  • Logs:
    • Centralized logging
  • Alerts:
    • SLA breaches
    • pipeline failures

Data Observability

  • Lineage tracking (DataHub)
  • Data freshness metrics
  • Data quality dashboards

8. Final Outcome

  • Real-time ingestion: CDC-powered pipeline delivering near-instant data freshness across all domains.
  • Scalable query layer: Trino on Kubernetes auto-scales to handle concurrent BI and ad-hoc workloads.
  • Resilient infrastructure: Kubernetes-native deployment with chaos-tested recovery and versioned rollback.
  • Fully automated workflows: Airflow DAGs orchestrate end-to-end pipelines with observability built in.

Closing Thought

This wasn’t just modernization—it was a shift from:

data as storage → data as a real-time platform
