Driving Data Excellence: A Real-Time Lakehouse with DigileEdge

1. Context: Why the Legacy System Failed

The legacy platform failed not simply because it was “old,” but because of structural constraints in its architecture:

1.1 Metadata Explosion

~400,000 tables in a single schema with no lifecycle management, causing query planner degradation and catalog lookup latency.

1.2 Batch-Oriented Architecture

Monthly/yearly scheduled ingestion with no incremental strategy, causing long freshness gaps and failure propagation across pipelines.

1.3 Tight Coupling Between Layers

Ingestion, transformation, and reporting were tightly bound; changes required full pipeline rewrites with no reusability.

1.4 Lack of Observability

No lineage tracking, limited monitoring, and manual debugging via logs made incident response slow and painful.

2. Architecture Design: Detailed Breakdown

2.1 Design Principles

The new system was built around five core principles:

  • Decoupling: storage (Iceberg) is separated from compute (Trino), so each scales independently.
  • Immutability: append-only, versioned tables prevent data corruption and enable time travel.
  • Streaming-first: CDC-based ingestion replaces batch jobs for near real-time data freshness.
  • Idempotency: deterministic pipeline execution ensures safe retries without duplicate data.
  • Observability: metadata and lineage tracking built in from day one via DataHub.

2.2 Data Layering Strategy

We implemented a multi-layered data model:

Layer 1: Raw (Bronze)

  • Source-aligned schema
  • Minimal transformation
  • Append-only ingestion via CDC

Layer 2: Refined (Silver)

  • Data cleansing
  • Schema normalization
  • Deduplication logic

Layer 3: Curated (Gold)

  • Business-ready datasets
  • Aggregations and joins
  • Optimized for query performance
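As an illustration of how data moves between layers, the sketch below promotes deduplicated rows from Bronze to Silver through Trino. The host, schema, and table names (bronze.orders_raw, silver.orders) are assumptions; in the actual platform this logic is generated by the config-driven pipeline framework described in section 4.1.

```python
# Sketch only: Bronze -> Silver promotion with deduplication, run through Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.internal", port=8080, user="pipeline",
    catalog="iceberg", schema="silver",
)
cur = conn.cursor()

# Keep only the newest change event per business key, then cast to the Silver schema.
cur.execute("""
    INSERT INTO silver.orders
    SELECT order_id, CAST(amount AS DECIMAL(18, 2)) AS amount, region, event_ts
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY order_id ORDER BY event_ts DESC) AS rn
        FROM bronze.orders_raw
        WHERE ingest_date = DATE '2024-01-15'
    ) deduped
    WHERE rn = 1
""")
cur.fetchall()  # drain the result set so the statement runs to completion
```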

2.3 Storage Design with Apache Iceberg

Why Iceberg?

  • Schema evolution without rewriting data
  • Partition evolution (critical for long-term scalability)
  • ACID transactions on data lake
  • Time travel for auditability

Table Design Decisions

  • Partitioning Strategy
    • Based on query access patterns (e.g., date, region)
    • Avoided over-partitioning (common anti-pattern)
  • File Size Optimization
    • Target file size: 512MB–1GB
  • Avoided the small-file problem using compaction jobs
  • Compaction Strategy
    • Scheduled compaction via Airflow
    • Merge small files → optimize read performance
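To make these table design decisions concrete, the sketch below creates a date- and region-partitioned Iceberg table and triggers a compaction pass through Trino's Iceberg connector. The table name and the 512 MB threshold are illustrative; in production the compaction statement is issued from a scheduled Airflow task.

```python
# Sketch only: partition layout plus scheduled compaction via Trino's Iceberg connector.
import trino

cur = trino.dbapi.connect(host="trino.internal", port=8080,
                          user="platform", catalog="iceberg").cursor()

# Partition by day and region to match the dominant BI filter columns.
cur.execute("""
    CREATE TABLE IF NOT EXISTS gold.sales_daily (
        sale_id   BIGINT,
        region    VARCHAR,
        amount    DECIMAL(18, 2),
        event_ts  TIMESTAMP(6)
    )
    WITH (
        partitioning = ARRAY['day(event_ts)', 'region'],
        format = 'PARQUET'
    )
""")
cur.fetchall()

# Compaction: rewrite files below the target size into larger ones to avoid
# the small-file problem on read.
cur.execute("ALTER TABLE gold.sales_daily EXECUTE optimize(file_size_threshold => '512MB')")
cur.fetchall()
```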

2.4 Query Layer (Trino)

Cluster Design

  • Coordinator + multiple worker nodes
  • Auto-scaling enabled via Kubernetes

Query Optimization Techniques

  • Predicate pushdown
  • Partition pruning
  • Broadcast joins for small tables
  • Cost-based optimizer tuning
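Most of these techniques apply automatically once table statistics are available, but the join strategy can also be steered per session. The sketch below uses hypothetical table names, forces a broadcast join for a small dimension table, and inspects the plan with EXPLAIN to confirm that partition pruning and the broadcast exchange kick in.

```python
# Sketch only: forcing a broadcast join and inspecting the resulting plan.
import trino

conn = trino.dbapi.connect(
    host="trino.internal", port=8080, user="analyst",
    catalog="iceberg", schema="gold",
    # Ship the small dimension table to every worker instead of repartitioning both sides.
    session_properties={"join_distribution_type": "BROADCAST"},
)
cur = conn.cursor()

cur.execute("""
    EXPLAIN
    SELECT f.region, sum(f.amount) AS revenue
    FROM sales_daily f
    JOIN dim_region d ON f.region = d.region_code
    WHERE f.event_ts >= DATE '2024-01-01'
    GROUP BY f.region
""")
print(cur.fetchall()[0][0])  # plan text shows the join distribution and pruned partitions
```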

Concurrency Handling

  • Resource groups configured
  • Query prioritization (BI vs ad-hoc)
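Concurrency is managed with Trino resource groups. The snippet below sketches one possible resource-groups file, written from Python for consistency with the other examples; the group names, limits, scheduling weights, and the BI source pattern are assumptions rather than the production values.

```python
# Sketch only: an illustrative etc/resource-groups.json splitting BI and ad-hoc workloads.
import json

resource_groups = {
    "rootGroups": [{
        "name": "global",
        "softMemoryLimit": "80%",
        "hardConcurrencyLimit": 100,
        "maxQueued": 1000,
        "subGroups": [
            # BI dashboards get the larger share and a higher scheduling weight.
            {"name": "bi", "softMemoryLimit": "50%", "hardConcurrencyLimit": 40,
             "maxQueued": 200, "schedulingWeight": 3},
            # Ad-hoc exploration is capped so it cannot starve the dashboards.
            {"name": "adhoc", "softMemoryLimit": "30%", "hardConcurrencyLimit": 10,
             "maxQueued": 100, "schedulingWeight": 1},
        ],
    }],
    "selectors": [
        {"source": "looker.*", "group": "global.bi"},  # route by client source tag
        {"group": "global.adhoc"},                      # everything else
    ],
}

with open("resource-groups.json", "w") as f:
    json.dump(resource_groups, f, indent=2)
```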

2.5 Ingestion Layer (CDC + NiFi)

CDC via Striim

  • Log-based change capture
  • Handles:
    • Inserts
    • Updates
    • Deletes

CDC Design Challenges

  • Out-of-order events
  • Duplicate events
  • Late-arriving data

Solutions

  • Event timestamp ordering
  • Deduplication keys
  • Watermarking strategies
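A stripped-down illustration of that logic, with hypothetical field names: events are keyed by primary key, duplicates are broken on a deduplication key, and anything arriving later than the watermark is excluded from the window (the real pipeline routes such records to a correction path).

```python
# Sketch only: latest-wins merge of out-of-order, possibly duplicated CDC events.
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=15)  # how late an event may arrive and still join its window

def merge_cdc_events(events, window_end):
    """Return one winning change per primary key for the window ending at window_end."""
    latest = {}
    for ev in events:
        ts = datetime.fromisoformat(ev["event_ts"])
        if ts < window_end - WATERMARK:
            continue  # beyond the watermark: handled by a separate correction path
        key = (ev["table"], ev["pk"])
        current = latest.get(key)
        # The dedup key (op_id) breaks ties between retransmitted copies of the same change.
        if current is None or (ts, ev["op_id"]) > (current[0], current[1]["op_id"]):
            latest[key] = (ts, ev)
    return [winner for _, winner in latest.values()]
```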

NiFi Pipelines

  • Used for:
    • Batch ingestion
    • External data sources
  • Back-pressure handling configured

2.6 Orchestration (Airflow)

DAG Design Principles

  • Modular DAGs (per domain)
  • Idempotent tasks
  • Retry policies with exponential backoff

Dependency Management

  • Task-level dependencies
  • Dataset-based triggering (where applicable)
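A condensed Airflow sketch of these principles; the DAG, dataset URI, and task names are illustrative. It shows exponential-backoff retries, a task that is safe to re-run because its write is scoped to the run's logical date, and a dataset outlet that lets a downstream DAG trigger on data availability instead of a fixed schedule.

```python
# Sketch only: modular, idempotent DAG with retries and dataset-based triggering.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

silver_orders = Dataset("iceberg://silver/orders")  # illustrative dataset URI

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,  # 2, 4, 8 minutes between attempts
}

def promote_orders(**context):
    # Idempotent by design: the target slice is derived from the run's logical
    # date, so a retry rebuilds the same slice instead of appending twice.
    print(f"promoting bronze -> silver for {context['ds']}")

with DAG(
    dag_id="orders_silver",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    default_args=default_args,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="promote_orders",
        python_callable=promote_orders,
        outlets=[silver_orders],  # marks the dataset as updated when the task succeeds
    )

# A consumer DAG defined elsewhere can then be scheduled on data availability:
#   DAG(dag_id="orders_gold", schedule=[silver_orders], ...)
```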

3. Infrastructure Provisioning

3.1 Kubernetes Architecture

Cluster Setup

  • Multi-node cluster
  • Node pools:
    • Compute-heavy (Trino workers)
    • Storage-heavy (data nodes)
    • General-purpose (Airflow, NiFi)

Key Configurations

  • Horizontal Pod Autoscaler (HPA)
  • Pod disruption budgets
  • Node affinity rules
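As one concrete example, autoscaling for the Trino worker pool can be expressed with a HorizontalPodAutoscaler like the sketch below (deployment name, replica bounds, and the 70% CPU target are assumptions), rendered from Python to keep the examples in one language.

```python
# Sketch only: an autoscaling/v2 HPA manifest for the Trino worker deployment.
import yaml

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "trino-worker", "namespace": "processing"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "trino-worker"},
        "minReplicas": 3,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                # Add workers when average CPU exceeds 70% of the requested amount.
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
    },
}

print(yaml.safe_dump(hpa, sort_keys=False))  # pipe into kubectl apply -f -
```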

3.2 Deployment Strategy

Helm-Based Deployments

  • Parameterized configurations
  • Environment-specific values

Namespace Strategy

  • Separate namespaces for:
    • ingestion
    • processing
    • orchestration
    • governance

3.3 Storage Infrastructure

  • Object storage (S3-compatible)
  • Persistent volumes for stateful components

3.4 Networking

  • Internal service mesh (optional)
  • Secure communication via TLS
  • Role-based access control (RBAC)

4. Development Approach

4.1 Pipeline Framework (DigileEdge)

  • Config-driven pipelines (YAML/JSON-based)
  • Reusable templates:
    • CDC ingestion
    • Batch ingestion
    • Transformation jobs
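The framework itself is internal, so the sketch below only illustrates the config-driven idea: a YAML pipeline definition selects one of the reusable templates, and the engine dispatches on it. The keys, template names, and source/target identifiers are hypothetical.

```python
# Sketch only: dispatching a YAML pipeline definition to a reusable template.
import yaml

PIPELINE_YAML = """
pipeline: orders_cdc
template: cdc_ingestion        # one of: cdc_ingestion, batch_ingestion, transformation
source:
  connection: oracle_prod
  table: SALES.ORDERS
target:
  table: iceberg.bronze.orders_raw
options:
  dedup_keys: [order_id]
  watermark_minutes: 15
"""

def run_cdc_ingestion(cfg):
    print(f"CDC ingest {cfg['source']['table']} -> {cfg['target']['table']}")

def run_batch_ingestion(cfg):
    print(f"batch load {cfg['source']['table']} -> {cfg['target']['table']}")

TEMPLATES = {
    "cdc_ingestion": run_cdc_ingestion,
    "batch_ingestion": run_batch_ingestion,
}

config = yaml.safe_load(PIPELINE_YAML)
TEMPLATES[config["template"]](config)  # same engine; behaviour driven entirely by config
```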

4.2 Idempotent Processing

  • Each pipeline designed to be re-runnable
  • No duplicate data on retries
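One common way to achieve this (shown below as an illustration, not the framework's exact mechanism) is to scope every write to the run's partition and rebuild that slice on each attempt, so a retry converges to the same result as a clean first run.

```python
# Sketch only: an idempotent "rebuild this run's slice" write pattern through Trino.
import trino

def write_partition(run_date: str) -> None:
    """run_date is supplied by the orchestrator (e.g. the Airflow logical date)."""
    cur = trino.dbapi.connect(host="trino.internal", port=8080,
                              user="pipeline", catalog="iceberg").cursor()
    # Delete-then-insert within the partition: rerunning for the same run_date
    # rebuilds the same slice instead of appending duplicate rows.
    cur.execute(f"DELETE FROM silver.orders WHERE ingest_date = DATE '{run_date}'")
    cur.fetchall()
    cur.execute(f"""
        INSERT INTO silver.orders
        SELECT * FROM bronze.orders_raw WHERE ingest_date = DATE '{run_date}'
    """)
    cur.fetchall()
```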

4.3 Schema Evolution Handling

  • Iceberg schema evolution
  • Backward compatibility maintained
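Because Iceberg tracks columns by ID rather than by position, additive changes are metadata-only operations; a hypothetical example issued through Trino:

```python
# Sketch only: additive schema evolution on an Iceberg table (no data rewrite).
import trino

cur = trino.dbapi.connect(host="trino.internal", port=8080,
                          user="platform", catalog="iceberg").cursor()
# A new nullable column: existing files stay valid, readers see NULL until backfilled,
# and downstream consumers of the old columns are unaffected.
cur.execute("ALTER TABLE silver.orders ADD COLUMN discount_pct DECIMAL(5, 2)")
cur.fetchall()
```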

4.4 Version Control

  • Git-based
  • Code + config versioned together

4.5 Data Quality Framework

  • Validation rules:
    • Null checks
    • Range checks
    • Referential integrity
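A minimal shape for such rules, with assumed rule names, thresholds, and target table: each rule is a SQL predicate that matches violating rows, evaluated against the Silver table after every load and surfaced to alerting.

```python
# Sketch only: SQL-predicate data quality rules evaluated after each load.
import trino

RULES = {
    # rule name -> predicate matching *violating* rows
    "order_id_not_null": "order_id IS NULL",
    "amount_in_range":   "amount < 0 OR amount > 1000000",
    "region_is_known":   "region NOT IN (SELECT region_code FROM gold.dim_region)",
}

def run_checks(table: str = "silver.orders") -> dict:
    cur = trino.dbapi.connect(host="trino.internal", port=8080,
                              user="dq", catalog="iceberg").cursor()
    failures = {}
    for name, predicate in RULES.items():
        cur.execute(f"SELECT count(*) FROM {table} WHERE {predicate}")
        violations = cur.fetchall()[0][0]
        if violations:
            failures[name] = violations  # raised as alerts and data quality dashboard entries
    return failures
```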

5. Implementation Strategy

5.1 Migration Approach

  • Incremental migration (table-by-table)
  • Parallel run:
    • Legacy vs new system

5.2 Cutover Strategy

  • Shadow mode testing
  • Gradual switchover

5.3 Rollback Strategy

  • Versioned datasets
  • Rollback to previous snapshot
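Rollback leans on Iceberg's snapshot history. The sketch below shows what that looks like through Trino's Iceberg connector; the schema, table, and snapshot ID are placeholders.

```python
# Sketch only: reverting an Iceberg table to a known-good snapshot via Trino.
import trino

cur = trino.dbapi.connect(host="trino.internal", port=8080,
                          user="platform", catalog="iceberg").cursor()

# Inspect recent snapshots to find the last known-good state.
cur.execute('SELECT snapshot_id, committed_at FROM silver."orders$snapshots" ORDER BY committed_at DESC')
for snapshot_id, committed_at in cur.fetchall()[:5]:
    print(snapshot_id, committed_at)

# Roll the table back; subsequent queries (and the BI layer) see the earlier state.
cur.execute("CALL iceberg.system.rollback_to_snapshot('silver', 'orders', 8954597067493422955)")
cur.fetchall()
```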

6. Testing Strategy

6.1 Data Validation

  • Row-level reconciliation
  • Hash-based validation
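The general idea, sketched with hypothetical connections and one illustrative table: compare a row count plus an order-independent checksum computed the same way on both systems.

```python
# Sketch only: count-plus-checksum reconciliation between legacy and lakehouse copies.
import hashlib

def table_fingerprint(cursor, query):
    """Row count and an order-independent digest of the rows returned by query."""
    cursor.execute(query)
    count, digest = 0, 0
    for row in cursor.fetchall():
        count += 1
        # XOR of per-row hashes is insensitive to row ordering across the two systems.
        digest ^= int.from_bytes(
            hashlib.sha256("|".join(map(str, row)).encode()).digest()[:8], "big")
    return count, digest

# legacy_cur and lakehouse_cur are DB-API cursors for the two systems (assumed to exist).
QUERY = "SELECT order_id, amount, region FROM orders WHERE order_date = DATE '2024-01-15'"
# assert table_fingerprint(legacy_cur, QUERY) == table_fingerprint(lakehouse_cur, QUERY)
```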

6.2 Performance Testing

  • Benchmark queries
  • Load testing with synthetic + real data

6.3 Chaos Testing

  • Node failures
  • Network interruptions

6.4 Pipeline Testing

  • Unit tests for transformations
  • Integration tests across DAGs

6.5 SLA Validation

  • End-to-end pipeline timing
  • Alert thresholds

7. Observability & Operations

Monitoring Stack

  • Metrics:
    • CPU, memory
    • Query latency
  • Logs:
    • Centralized logging
  • Alerts:
    • SLA breaches
    • pipeline failures

Data Observability

  • Lineage tracking (DataHub)
  • Data freshness metrics
  • Data quality dashboards

8. Final Outcome

  • Real-time ingestion: CDC-powered pipeline delivering near-instant data freshness across all domains.
  • Scalable query layer: Trino on Kubernetes auto-scales to handle concurrent BI and ad-hoc workloads.
  • Resilient infrastructure: Kubernetes-native deployment with chaos-tested recovery and versioned rollback.
  • Fully automated workflows: Airflow DAGs orchestrate end-to-end pipelines with observability built in.

Closing Thought

This wasn’t just modernization—it was a shift from:

data as storage → data as a real-time platform
