Pinterest has dramatically accelerated its data processing, cutting data availability latency from more than 24 hours to as little as 15 minutes. The company achieved this through a next-generation database ingestion framework designed to overcome the limitations of its previous, batch-oriented systems. The upgrade is critical for powering real-time analytics and machine learning applications and for improving the overall user experience on the platform, particularly as Pinterest leans further into AI-driven recommendations and features. At the core of the transformation is a shift to Change Data Capture (CDC) technology.
Previously, Pinterest relied on multiple independently managed data pipelines and full-table batch jobs. This approach was inefficient, causing substantial delays in data availability and adding operational complexity. According to Pinterest engineers, the legacy system struggled to keep pace with the demands of modern data-intensive applications: around 95% of the data it processed had not changed, yet the system reprocessed it anyway, wasting compute resources and storage capacity. The lack of native support for row-level deletions compounded these issues, creating inconsistencies and increasing maintenance overhead. Addressing these challenges was essential to unlocking the full potential of Pinterest's data assets.
Building a Real-Time Data Pipeline with CDC
The new framework centers on Change Data Capture, utilizing tools like Debezium and TiCDC to identify and track changes made to online databases. This data is then streamed through Kafka, processed by Flink and Spark, and ultimately stored in Iceberg tables on Amazon S3. This architecture allows Pinterest to access database changes in minutes rather than hours or days, and to process only the records that have actually been modified. The result is a substantial reduction in infrastructure costs and a significant boost in data availability. The system currently supports MySQL, TiDB, and KVStore, with a configuration-driven approach simplifying the onboarding of new data sources.
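To make the change-stream idea concrete, the sketch below applies Debezium-style change events (an `op` code of `c`/`u`/`d` with `before`/`after` row images, following Debezium's change-event envelope) to an in-memory table keyed by primary key. The table and event payloads are illustrative assumptions, not Pinterest's actual schema:

```python
# Minimal sketch of applying Debezium-style CDC events to a keyed table.
# The dict stands in for a base table; real pipelines stream these events
# through Kafka and apply them with Flink or Spark.

def apply_cdc_event(table: dict, event: dict) -> None:
    """Apply one change event to a table keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):            # create or update: upsert the new row image
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                 # delete: row-level deletion removes the key
        table.pop(event["before"]["id"], None)

table = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "board"}},
    {"op": "u", "before": {"id": 1, "name": "board"},
     "after": {"id": 1, "name": "pin board"}},
    {"op": "d", "before": {"id": 1, "name": "pin board"}, "after": None},
]
for e in events:
    apply_cdc_event(table, e)

print(table)  # → {} (the created row was later deleted)
```

Note how the delete event removes the row outright, the row-level deletion capability the old batch system lacked.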
Optimizing for Cost and Performance with Iceberg
A key component of Pinterest’s new data ingestion framework is the use of Apache Iceberg, an open table format for huge analytic datasets. The architecture separates CDC tables, which act as append-only ledgers recording each change, from base tables that maintain a full historical snapshot. Updates to the base tables are performed using Spark’s Merge Into operation, which offers two strategies: Copy on Write (COW) and Merge on Read (MOR). Pinterest opted for Merge on Read after evaluating both approaches. While Copy on Write rewrites entire data files during updates, Merge on Read applies changes to separate files and merges them at read time, reducing write amplification and storage costs. This decision proved crucial for managing petabyte-scale data efficiently.
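The trade-off can be illustrated with a toy simulation (not Iceberg's actual implementation): under Merge on Read, updates and deletes land in small delta files, and the reader applies them over the immutable base file at query time instead of rewriting the whole file as Copy on Write would. All file names and structures here are assumptions for illustration:

```python
# Toy Merge-on-Read illustration: deltas are merged over the base
# snapshot only when reading, so writes avoid rewriting large files.

base_file = {1: "a", 2: "b", 3: "c"}      # immutable data file
delta_files = [                            # small change files, one per commit
    {"updates": {2: "b2"}, "deletes": set()},
    {"updates": {}, "deletes": {3}},
]

def read_merged(base: dict, deltas: list) -> dict:
    """Merge on Read: overlay deltas on the base snapshot at query time."""
    merged = dict(base)
    for d in deltas:
        merged.update(d["updates"])        # apply updated row images
        for key in d["deletes"]:           # apply row-level deletes
            merged.pop(key, None)
    return merged

print(read_merged(base_file, delta_files))  # → {1: 'a', 2: 'b2'}
```

Copy on Write would instead produce a new three-row base file on every commit; with mostly-unchanged petabyte-scale tables, that write amplification is what made MOR the cheaper choice.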
Further optimizations include partitioning base tables by a hash of the primary key using Iceberg bucketing, which enables parallel processing of updates and reduces the amount of data scanned per operation. The framework also addresses the small-files problem, a common challenge in distributed storage systems, by instructing Spark to distribute writes by partition, minimizing file-handling overhead. These optimizations contribute to the overall efficiency and scalability of the system.
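The bucketing idea can be sketched as follows: each incoming row is routed to a bucket by hashing its primary key, analogous to Iceberg's `bucket(N, pk)` partition transform, so an update batch only has to merge into the buckets its keys actually hash into. The hash function and bucket count are illustrative assumptions (Iceberg itself uses a Murmur3-based hash):

```python
# Sketch of hash-bucketing rows by primary key. Buckets can then be
# merged in parallel, and buckets untouched by a batch are skipped.
import zlib

NUM_BUCKETS = 4

def bucket_of(pk: int) -> int:
    # crc32 stands in for Iceberg's Murmur3-based bucket transform.
    return zlib.crc32(str(pk).encode()) % NUM_BUCKETS

incoming = [{"id": i, "v": i * 10} for i in range(8)]
by_bucket: dict[int, list] = {}
for row in incoming:
    by_bucket.setdefault(bucket_of(row["id"]), []).append(row)

# Only the buckets that appear in `by_bucket` need a merge pass;
# each one can be processed by a separate task.
for b in sorted(by_bucket):
    print(b, [r["id"] for r in by_bucket[b]])
```

Because the same key always hashes to the same bucket, a merge only scans one bucket's files per key rather than the whole table.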
Measurable Results and Future Development
The impact of the new framework has been significant. Pinterest reports reducing data availability latency from more than 24 hours to as low as 15 minutes. By processing only the 5% of records that change daily, the company has also achieved substantial infrastructure cost savings. The system is now capable of handling petabyte-scale data across thousands of pipelines while supporting both incremental updates and deletions. This improved data access is directly benefiting Pinterest’s analytics, machine learning models, and product features.
Looking ahead, Pinterest plans to focus on automating schema evolution, ensuring that changes to database schemas are safely propagated downstream. This will further enhance the reliability and maintainability of the large-scale data pipelines. The company’s investment in CDC and Iceberg demonstrates a commitment to building a robust and scalable data infrastructure capable of supporting its continued growth and innovation. Pinterest’s approach to data ingestion, as detailed in their engineering blog, provides a valuable case study for other organizations grappling with similar data challenges.
Pinterest’s success with this new framework underscores the growing importance of real-time data processing. As companies increasingly rely on data-driven insights, the ability to ingest and analyze data quickly and efficiently becomes a critical competitive advantage, and Pinterest’s experience offers a concrete example of what that shift requires at scale.
