Skip to content
brickster.ai
All videos
newsDatabricks·July 19, 2022

Self-Serve, Automated and Robust CDC pipeline using AWS DMS, DynamoDB Streams and Databricks Delta

Description

Many companies are trying to solve the challenges of ingesting transactional data in Data lake and dealing with late-arriving updates and deletes. To address this at Swiggy, we have built CDC(Change Data Capture) system, an incremental processing framework to power all business-critical data pipelines at low latency and high efficiency. It offers: Freshness: It operates in near real-time with configurable latency requirements. Performance: Optimized read and write performance with tuned compaction parameters and partitions and delta table optimization. Consistency: It supports reconciliation based on transaction types. Basically applying insert, update, and delete on existing data. To implement this system, AWS DMS helped us with initial bootstrapping and CDC replication for Mysql sources. AWS Lambda and DynamoDB streams helped us to solve the bootstrapping and CDC replication for DynamoDB source. After setting up the bootstrap and cdc replication process we have used Databricks delta merge to reconcile the data based on the transaction types. To support the merge we have implemented supporting features - * Deduplicating multiple mutations of the same record using log offset

Description from YouTube. Full content on the video page.

More from Databricks