
The legacy platform failed not simply because it was “old,” but because of structural constraints:
~400,000 tables in a single schema with no lifecycle management, causing query planner degradation and catalog lookup latency.
Monthly/yearly scheduled ingestion with no incremental strategy, causing long freshness gaps and failure propagation across pipelines.
Ingestion, transformation, and reporting were tightly coupled: any change required a full pipeline rewrite, with no reusable components.
No lineage tracking, limited monitoring, and manual log-based debugging made incident response slow and painful.
The new system was built around five core principles:
Decoupling: Storage (Iceberg) separated from Compute (Trino) — each scales independently.
Immutability: Append-only, versioned tables prevent data corruption and enable time travel.
Streaming-First: CDC-based ingestion replaces batch jobs for near real-time data freshness.
Idempotency: Deterministic pipeline execution ensures safe retries without duplicate data.
Observability: Metadata and lineage tracking built in from day one via DataHub.
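The idempotency principle above can be sketched as a deterministic merge keyed on a primary key: replaying the same batch produces the same table state, so retries never create duplicates. This is a minimal Python sketch under assumed names (`idempotent_merge`, the `id` key, and the sample rows are illustrative, not the production pipeline):

```python
def idempotent_merge(table, batch, key="id"):
    """Merge a batch of records into a table keyed by `key`.

    Because each record overwrites the row with the same key,
    replaying the batch is a no-op: retries are safe.
    """
    merged = {row[key]: row for row in table}
    for record in batch:
        merged[record[key]] = record  # upsert: last write wins
    return sorted(merged.values(), key=lambda r: r[key])

table = [{"id": 1, "qty": 5}]
batch = [{"id": 1, "qty": 7}, {"id": 2, "qty": 3}]
once = idempotent_merge(table, batch)
twice = idempotent_merge(once, batch)  # simulated retry
assert once == twice  # same state either way
```

The same property is what makes append-only, versioned Iceberg tables pair well with retried pipeline runs: a re-executed step converges to the same snapshot instead of duplicating rows.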
We implemented a multi-layered data model:
Layer 1: Raw (Bronze)
Layer 2: Refined (Silver)
Layer 3: Curated (Gold)
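The three layers can be illustrated with a toy flow: Bronze lands raw events unmodified, Silver enforces types and drops malformed rows, Gold produces a business-level aggregate. This is a hypothetical Python sketch; the field names and functions (`to_silver`, `to_gold`) are invented for illustration and are not the actual transformation jobs:

```python
bronze = [  # raw CDC events, landed as-is
    {"order_id": "1", "amount": "10.50", "status": "PAID"},
    {"order_id": "2", "amount": "bad",   "status": "PAID"},  # malformed
    {"order_id": "3", "amount": "4.25",  "status": "REFUND"},
]

def to_silver(rows):
    """Refined layer: enforce types, drop rows that fail validation."""
    out = []
    for r in rows:
        try:
            out.append({"order_id": int(r["order_id"]),
                        "amount": float(r["amount"]),
                        "status": r["status"]})
        except ValueError:
            continue  # drop (or quarantine) malformed records
    return out

def to_gold(rows):
    """Curated layer: revenue totals by status, ready for reporting."""
    totals = {}
    for r in rows:
        totals[r["status"]] = totals.get(r["status"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)   # 2 valid rows survive
gold = to_gold(silver)       # {"PAID": 10.5, "REFUND": 4.25}
```

Keeping the raw layer untouched is what allows Silver and Gold to be rebuilt from scratch when validation or business logic changes.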
Why Iceberg?
Table Design Decisions
Cluster Design
Query Optimization Techniques
Concurrency Handling
CDC via Striim
CDC Design Challenges
Solutions
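Independent of the Striim-specific configuration (not shown in this outline), the core of any CDC apply step is the same: replay ordered change events (insert, update, delete) against the target table in commit order. A generic Python sketch, with invented event shapes, not the actual Striim pipeline:

```python
def apply_cdc(table, events, key="id"):
    """Apply ordered CDC events to an in-memory table keyed by `key`.

    Applying events in source commit order keeps the replica
    consistent with the upstream database.
    """
    state = {row[key]: row for row in table}
    for ev in events:
        if ev["op"] in ("insert", "update"):
            state[ev["row"][key]] = ev["row"]   # upsert
        elif ev["op"] == "delete":
            state.pop(ev["row"][key], None)     # tolerate re-delivery
    return sorted(state.values(), key=lambda r: r[key])

events = [
    {"op": "insert", "row": {"id": 1, "name": "a"}},
    {"op": "update", "row": {"id": 1, "name": "b"}},
    {"op": "delete", "row": {"id": 1}},
]
assert apply_cdc([], events) == []  # net effect of the sequence
```

Note that the delete branch ignores already-missing keys, so re-delivered events (common in at-least-once CDC transports) do not fail the apply step.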
NiFi Pipelines
DAG Design Principles
Dependency Management
Cluster Setup
Key Configurations
Helm-Based Deployments
Namespace Strategy
This wasn’t just modernization—it was a shift from:
data as storage → data as a real-time platform