Pinterest’s CDC-Powered Ingestion Slashes Database Latency from 24 Hours to 15 Minutes
Pinterest launched a next-generation CDC-based database ingestion framework using Kafka, Flink, Spark, and Iceberg. The system reduces data availability latency from 24+ hours to 15 minutes, processes only changed records, supports incremental updates and deletions, and scales to petabyte-level data across thousands of pipelines, optimizing cost and efficiency.
By Leela Kumili
Feb 26, 2026
Pinterest has launched a next-generation database ingestion framework to address the limitations of its legacy batch-based systems and improve real-time data availability. The previous infrastructure relied on multiple, independently maintained pipelines and full-table batch jobs, resulting in high latency, operational complexity, and inefficient resource utilization. Critical use cases, including analytics, machine learning, and product features, required faster, more reliable access to data.
The legacy system faced several key challenges. Data latency often exceeded 24 hours, delaying analytics and ML workflows. Daily changes for many tables were below 5%, yet full-table batch processes reprocessed unchanged records, wasting compute and storage resources. Row-level deletions were not natively supported, and operational fragmentation across pipelines caused inconsistent data quality and high maintenance overhead.
As a Pinterest engineer emphasized:
A unified DB ingestion framework built on Change Data Capture (Debezium/TiCDC), Kafka, Flink, Spark, and Iceberg provides access to online database changes in minutes (not hours or days) while processing only changed records, resulting in significant infrastructure cost savings.
The framework is generic, supporting MySQL, TiDB, and KVStore; it is configuration-driven for easy onboarding and integrates monitoring with at-least-once delivery guarantees.
Next-gen database ingestion architecture overview (Source: Pinterest Blog Post)
The architecture separates CDC tables from base tables. CDC tables act as append-only ledgers, recording each change event with typical latency under five minutes. Base tables maintain a full historical snapshot, updated via Spark Merge Into operations every 15 minutes to an hour. Iceberg's Merge Into operation provides two update strategies: Copy-on-Write (COW) and Merge-on-Read (MOR). Copy-on-Write rewrites entire data files during updates, increasing storage and compute overhead. Merge-on-Read writes changes to separate files and applies them at read time, reducing write amplification. After evaluating both strategies, Pinterest standardized on Merge-on-Read because Copy-on-Write introduced significantly higher storage costs that outweighed its benefits for most workloads. The selected approach enables incremental updates while keeping infrastructure costs manageable at the petabyte scale.
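The trade-off between the two strategies can be illustrated with a toy sketch (this is not Iceberg's API; the file structures and function names here are hypothetical, chosen only to mirror the two write paths):

```python
# Contrast Copy-on-Write and Merge-on-Read on a toy "data file" of rows.

def copy_on_write(data_file, updates):
    # COW: rewrite the entire data file, applying updates inline.
    # Write cost is proportional to the whole file, even for tiny updates.
    return [updates.get(row["id"], row) for row in data_file]

def merge_on_read_write(delete_file, update_file, updates):
    # MOR write path: leave the base file untouched; record deletes and
    # new row versions in small side files. Write cost ~ size of updates.
    for row_id, new_row in updates.items():
        delete_file.add(row_id)       # mark the old version as deleted
        update_file.append(new_row)   # append the new version

def merge_on_read_scan(data_file, delete_file, update_file):
    # MOR read path: apply the side files while scanning.
    live = [row for row in data_file if row["id"] not in delete_file]
    return live + update_file

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
updates = {2: {"id": 2, "v": "b2"}}

cow_result = copy_on_write(base, updates)

deletes, appended = set(), []
merge_on_read_write(deletes, appended, updates)
mor_result = merge_on_read_scan(base, deletes, appended)

# Both strategies yield the same logical table; MOR wrote only one row.
assert sorted(cow_result, key=lambda r: r["id"]) == \
       sorted(mor_result, key=lambda r: r["id"])
```

With tables where daily change rates are under 5%, the MOR write path touches only the changed rows, which is why it wins on storage and write amplification despite the extra merge work at read time.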
Spark jobs first deduplicate the latest changes from CDC tables and then apply updates or deletions to base tables. Historical data is loaded initially through a bootstrap pipeline, and ongoing maintenance jobs handle compaction and snapshot expiration.
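The deduplicate-then-merge step can be sketched in plain Python (the event fields `pk`, `ts`, and `op` are assumptions for illustration; the real pipeline performs this with Spark's MERGE INTO on Iceberg tables):

```python
# Given a batch of CDC events (an append-only change log), keep only the
# latest event per primary key, then apply upserts/deletes to the base table.

def dedupe_latest(cdc_events):
    # Keep the most recent event per key; events may arrive more than once
    # under at-least-once delivery, so the later timestamp wins.
    latest = {}
    for e in sorted(cdc_events, key=lambda e: e["ts"]):
        latest[e["pk"]] = e
    return latest.values()

def merge_into(base, cdc_events):
    # Apply the deduplicated changes: a delete removes the row,
    # while insert/update replaces it (an upsert).
    for e in dedupe_latest(cdc_events):
        if e["op"] == "delete":
            base.pop(e["pk"], None)
        else:
            base[e["pk"]] = e["row"]
    return base

base = {1: {"name": "pin-a"}, 2: {"name": "pin-b"}}
events = [
    {"pk": 2, "ts": 10, "op": "update", "row": {"name": "pin-b-old"}},
    {"pk": 2, "ts": 20, "op": "update", "row": {"name": "pin-b-new"}},
    {"pk": 1, "ts": 15, "op": "delete", "row": None},
    {"pk": 3, "ts": 12, "op": "insert", "row": {"name": "pin-c"}},
]
merged = merge_into(base, events)
# Row 1 is deleted, row 2 keeps only its latest version, row 3 is inserted.
```

Deduplicating before the merge matters: applying the row-2 events in arrival order without deduplication could leave a stale version in the base table, and it also keeps the expensive merge from processing the same key twice.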
Optimizations include partitioning base tables by a hash of the primary key using Iceberg bucketing, allowing Spark to parallelize upserts and reduce data scanned per operation. The framework also addresses the small files problem by instructing Spark to distribute writes by partition, reducing overhead caused by multiple small files per task.
Measured outcomes include reducing data availability latency from more than 24 hours to as low as 15 minutes, processing only the roughly 5% of records that change daily, and lowering infrastructure costs by avoiding unnecessary full-table operations. The system handles petabyte-scale data across thousands of pipelines while supporting incremental updates and deletions.
Pinterest’s CDC-based ingestion framework delivers near-real-time access to database changes, with Iceberg tables on AWS S3 and Flink and Spark handling the streaming and batch workloads, respectively. Future improvements will focus on automated schema evolution, safely propagating upstream changes downstream to enhance the reliability and maintainability of large-scale pipelines.
About the Author
Leela Kumili