Delta Lake
Recent items mentioning Delta Lake across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Databricks has significantly expanded interoperability for Unity Catalog Open APIs, now allowing external engines like Apache Spark, Flink, and DuckDB to create, read, and write to UC managed Delta tables, leveraging Delta Lake's new catalog commits for safe concurrent writes [4]. Delta Lake is also a foundational component for advanced AI applications, as seen with Claroty's AI-powered CPS Library built on Databricks Custom Agents and Delta Lake to automate entity resolution for industrial and healthcare assets [5]. Community discussions highlight practical aspects of Delta Lake, including handling MERGE with schema evolution and optimizing MERGE performance [6][8].
Generated daily from the 10 most recent items mentioning Delta Lake. Click any [N] to jump to the source.
This release fixes several regressions, including issues with MERGE operations, schema overwrites with predicates, and partition column changes. It also enables passing non-string datatypes in custom commit metadata and updates the minimum PyArrow version to 21.0.0 for preliminary variant type support.
Community: How I Mastered System Design Interviews
This video teaches a six-step framework for mastering data engineering system design interviews, covering requirements gathering, pipeline design, data modeling, storage and file formats, data quality and observability, and pipeline resilience. It demonstrates how to apply this framework with practical examples and back-of-the-envelope calculations to justify design choices.
Enzyme whitepaper - incremental view maintenance
I know this is pretty nerdy stuff, but I found this paper super interesting: [Enzyme: Incremental View Maintenance for Data Engineering](https://arxiv.org/html/2603.27775v2)

IVM (incremental view maintenance) is not a brand-new problem. It has been studied in databases for decades. The basic idea: when source data changes, Enzyme tries to avoid rerunning the entire query from scratch. Instead, it figures out which parts of the existing result are affected and updates just those parts, while still producing the same result you'd get from a full recompute.

What I found especially interesting is how much "under the hood" machinery is needed to make this work reliably:

* tracking source table changes with Delta Lake features like Change Data Feed, row tracking, deletion vectors, and time travel
* decomposing queries into logical operators like filters, joins, aggregations, and windows
* generating delta plans for each operator
* deciding whether incremental refresh is actually cheaper than full recompute
* handling non-deterministic functions, Python UDFs, query fingerprints, and pipeline dependencies
* falling back safely when incrementalization is not worth it or not safe

Also, I wasn't aware that a materialized view is not just "the final table." Under the hood it consists of two parts:

1. backing table, which stores the user's data plus internal metadata columns
2. top-level view, which describes how the result should be computed

The split architecture of MVs at Databricks, comprising a backing Delta table and a top-level view, provides flexibility for incremental computation. For example, if the top-level view contains AVG(x), Enzyme can internally store SUM(x) and COUNT(\*), because those are easier to update incrementally when new rows arrive.

Databricks says Enzyme was validated across thousands of production pipelines and produced cumulative daily compute savings of billions of CPU seconds. On the TPC-DI benchmark, it incrementalized 100% of the workloads and beat full recomputation in 6 out of 8 datasets.
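The AVG(x) example from the post can be sketched in a few lines of plain Python. This is only an illustration of why SUM and COUNT are easier to maintain than AVG; the `AvgState` class and its method names are hypothetical and are not taken from the Enzyme paper or any Databricks API.

```python
from dataclasses import dataclass


@dataclass
class AvgState:
    """Hypothetical backing-table row: keeps SUM(x) and COUNT(*) instead of AVG(x)."""
    total: float = 0.0
    count: int = 0

    def apply_inserts(self, values):
        """Fold newly inserted rows into the stored aggregates."""
        for v in values:
            self.total += v
            self.count += 1

    def apply_deletes(self, values):
        """Take deleted rows back out of the stored aggregates."""
        for v in values:
            self.total -= v
            self.count -= 1

    def avg(self):
        """What the top-level view would expose: AVG(x) = SUM(x) / COUNT(*)."""
        return self.total / self.count if self.count else None


state = AvgState()
state.apply_inserts([10.0, 20.0, 30.0])  # initial load
state.apply_inserts([40.0])              # new rows arrive
state.apply_deletes([10.0])              # a row is deleted

# A full recompute over the surviving rows gives the same answer.
full_recompute = sum([20.0, 30.0, 40.0]) / 3
assert state.avg() == full_recompute
```

Each change touches only two stored numbers, whereas a stored average alone could not be updated without knowing how many rows it was computed from.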
This release fixes a bug preventing partition column changes when overwriting tables and addresses a memory regression in Python MERGE operations. It also adds support for passing non-string datatypes in custom commit metadata and introduces nanosecond timestamp support.
PipelineIQ: Forward‑Looking Sales Intelligence That Drives Action
Your CRM data is a mess. Everyone knows it. Most AI tools pretend it isn't. Databricks took a different approach with PipelineIQ: instead of building yet another forecasting model that assumes clean data (spoiler: it never is), they built a prescriptive action engine that works with the chaos.

The result? Every deal in the pipeline gets one of three verdicts:

* 🚶 Walk - disengage, this isn't worth your time
* 🔄 Pivot - viable deal, wrong approach
* 🚀 Accelerate - conditions are right, lean in now

No vague "insights." No dashboards that require a PhD to interpret. Just: here's what to do today.

Built on Databricks' own stack (Foundation Model APIs, Delta Lake, Unity Catalog) and used internally by their own sales org, this is a rare "we built it for ourselves first" story.

Read the full blog here: [https://www.databricks.com/blog/pipelineiq-forward-looking-sales-intelligence-drives-action](https://www.databricks.com/blog/pipelineiq-forward-looking-sales-intelligence-drives-action)
Expanded interoperability with Unity Catalog Open APIs
Unity Catalog Open APIs now offer expanded interoperability, with external access to UC managed Delta tables in Beta and credential vending generally available with M2M OAuth support. External engines like Apache Spark, Flink, and DuckDB can now create, read, and write to UC managed Delta tables, leveraging Delta Lake's new catalog commits feature for safe concurrent writes and auditability.
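To make "safe concurrent writes" concrete, here is a toy sketch of the general idea of letting one authority arbitrate commits: a writer may only publish version N+1 if the latest version it read is still N. The `ToyCatalog` class and `write_with_retry` helper are hypothetical; this is not the Unity Catalog API or the actual catalog commits protocol, only the optimistic-concurrency pattern such a feature might build on.

```python
import threading


class ToyCatalog:
    """Hypothetical stand-in for a catalog that arbitrates table commits."""

    def __init__(self):
        self._lock = threading.Lock()
        self._commits = []  # position in this list == table version

    def latest_version(self):
        with self._lock:
            return len(self._commits) - 1  # -1 means no commits yet

    def try_commit(self, expected_version, payload):
        """Accept the commit only if nobody committed since the writer last read."""
        with self._lock:
            if len(self._commits) - 1 != expected_version:
                return False  # conflict: the writer must re-read and retry
            self._commits.append(payload)
            return True


def write_with_retry(catalog, payload, max_attempts=5):
    """Optimistic loop: read the latest version, attempt the commit, retry on conflict."""
    for _ in range(max_attempts):
        seen = catalog.latest_version()
        if catalog.try_commit(seen, payload):
            return seen + 1
    raise RuntimeError("too many conflicting commits")


catalog = ToyCatalog()
v0 = write_with_retry(catalog, "engine-a: append file-1")
stale_ok = catalog.try_commit(-1, "engine-b: stale write")  # still assumes an empty table
v1 = write_with_retry(catalog, "engine-b: append file-2")
```

Because every accepted commit passes through one ordered list, the same list also serves as a simple audit trail of who wrote what and in which order.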
The Rosetta stone of CPS: Claroty’s AI-powered library
Claroty's AI-powered CPS Library, built on Databricks Custom Agents and Delta Lake, automates entity resolution for 17M+ industrial and healthcare assets, solving the asset identity crisis where 88% of CPS devices lack exact product codes. This multi-agent AI system improves vulnerability attribution accuracy by over 25% and provides new security recommendations for over 56% of analyzed devices.
How to handle MERGE with Schema Evolution in Delta Lake
Delta Lake Under the Hood: What Every Data Engineer Should Know
Mastering Delta Lake MERGE Performance: Why It Slows Down and How to Fix It
Best resources to learn Databricks?
Hi everyone, I have around 5 years of experience as a SQL Developer and in Data Engineering. I am planning to learn Databricks seriously and also prepare for the Databricks exam. I have good experience with SQL and data concepts, but I want to build strong practical knowledge in Databricks, Spark, Delta Lake, Lakehouse concepts, and real-time data engineering use cases.

Can you please suggest the best resources to learn Databricks from beginner to advanced level? I am mainly looking for:

* Hands-on learning resources
* Practice projects
* Practice exams or sample questions
* YouTube courses, books, blogs, or official materials

Also, which Databricks should I start with as someone coming from SQL and Data Engineering background?

Thanks in advance for your suggestions. Would really appreciate any practical learning path or resources that helped you.
What Developers Need to Know About Delta Lake 4.2
The learning order that actually works for Databricks. I wasted 3 months before figuring this out.
I want to share something that I wish someone told me when I started learning Databricks because it would have saved me months of confusion.

When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor but the moment I opened a blank notebook I had no idea where to start.

Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything.

Here is the order that worked for me and I genuinely believe it works for most people.

Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database this stage takes a few days not weeks.

Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run SELECT VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table so you get versioning, schema enforcement, and ACID transactions without configuring anything.

Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing.

Then build pipelines. Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps.

Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation but once you have built a pipeline you understand exactly why it matters.

The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there everything fell into place faster.

One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version that tutorial was written for Community Edition which no longer exists. Just open a notebook and start writing code.

Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.
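For readers who find lazy evaluation abstract, the idea can be mimicked without Spark. The sketch below is a toy, not PySpark: `LazyFrame` is a hypothetical class whose `filter` and `select` only record steps, while `collect` plays the role of an action that finally runs them.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, not executed."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (step name, function) still to be run

    def filter(self, predicate):
        # Transformation: returns a new frame with one more planned step.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        # Transformation: also only extends the plan.
        def project(row):
            return {c: row[c] for c in columns}
        return LazyFrame(self._rows, self._plan + (("select", project),))

    def collect(self):
        # Action: only now does the recorded plan run over the data.
        rows = list(self._rows)
        for name, fn in self._plan:
            if name == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


calls = []


def is_large(row):
    calls.append(row["id"])  # records when the predicate actually runs
    return row["amount"] > 100


orders = LazyFrame([
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 150},
    {"id": 3, "amount": 300},
])

pending = orders.filter(is_large).select("id")  # like WHERE amount > 100 plus SELECT id
ran_before_action = list(calls)                 # still empty: nothing has executed yet
result = pending.collect()                      # the action triggers the work
```

The `filter` step here corresponds to the WHERE clause mentioned in the post, which is why learning the SQL form first tends to make the DataFrame form easier to read.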
Here are 5 topics that showed up much more than I expected in my DEA exam
I took the Databricks Data Engineer Associate exam recently and wanted to share what actually came up because it was quite different from what I spent most of my time studying. I went in thinking Delta Lake theory and platform architecture would be the big topics. They weren't. The exam is way more practical than I expected.

**The first thing** that caught me off guard was how heavily they test Auto Loader. Not just the basics but real scenarios. One question described a pipeline receiving 50,000 new files per day and asked which ingestion method to use and why. You need to understand when Auto Loader makes sense versus COPY INTO, how schema evolution works with mergeSchema, and the difference between directory listing and file notification mode. I probably got six or seven questions just on this one topic.

**The second thing** was lazy evaluation. I knew the concept but I wasn't prepared for how they test it. They give you a block of code with four or five DataFrame transformations and ask what happens when you run the cell. The answer is nothing happens because there is no action at the end. But the way they frame the questions makes you second guess yourself if you only memorized the definition without really understanding it.

**Third** was Lakeflow expectations. The old name was Delta Live Tables but they use Lakeflow in the exam now. You need to know the three expectation types and when to use each one. They gave me a scenario where the pipeline should log bad records but never drop them and I had to pick the right expectation decorator. Also know the difference between streaming tables and materialized views because that came up more than once.

**Fourth** was Unity Catalog permissions. Not just the three level naming pattern but actual grant scenarios. Something like a data analyst needs to read tables in the sales schema but should not be able to create new tables and you have to pick the correct grant statement. I got at least three or four questions like this.

**Fifth** was MERGE INTO. They really love this command. Upsert scenarios, deduplication, slowly changing dimensions. If you cannot write a MERGE statement from memory with the WHEN MATCHED and WHEN NOT MATCHED clauses you should spend an hour practicing just that before you sit for the exam.

I was also surprised by what was not heavily tested. Cluster configuration was maybe one question. The architecture diagrams with control plane and data plane were one or two questions at most. Delta Sharing was one question. Spark internals like shuffle details were barely mentioned.

The biggest thing I wish I had done differently is spend less time reading documentation and more time actually running code. When you have actually executed a MERGE INTO on a real table and seen the results, the exam question feels like something you have done before instead of something you read about once. I used Databricks Free Edition for all my practice and it was more than enough.

Hope this helps someone who is preparing right now. Feel free to ask anything about the exam in the comments and I will try to answer.
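As a memory aid for the WHEN MATCHED and WHEN NOT MATCHED clauses, here is a minimal sketch. The SQL string shows the basic upsert shape of a Delta Lake MERGE; the table names `customers` and `customer_updates` are hypothetical. The `merge_upsert` function is a plain-Python emulation of those two clauses on lists of dicts, so the behaviour can be checked without a cluster; it is not how Delta Lake executes a MERGE.

```python
# Basic upsert shape of a Delta Lake MERGE (table names are hypothetical).
MERGE_SQL = """
MERGE INTO customers AS t
USING customer_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""


def merge_upsert(target, source, key):
    """Emulate the two clauses above on lists of dicts.

    Assumes each key appears at most once in `source`. A real MERGE fails when
    several source rows try to update the same target row, so deduplicate first.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)  # WHEN MATCHED THEN UPDATE SET *
        else:
            merged[row[key]] = dict(row)  # WHEN NOT MATCHED THEN INSERT *
    return list(merged.values())


customers = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
updates = [{"id": 2, "city": "Milan"}, {"id": 3, "city": "Lima"}]
result = merge_upsert(customers, updates, "id")
```

Row 1 is untouched, row 2 is updated, and row 3 is inserted, which is the upsert pattern the exam scenarios revolve around.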
This release improves the new Datafusion TableProvider and log parsing performance, alongside numerous bug fixes. Key fixes address issues with DeltaScan schema handling, streamed merge file pruning, and incorrect row counts for DELETE operations.
UnityCatalog 0.4.1
The Unity Catalog Spark connector now supports atomic REPLACE TABLE AS SELECT and Dynamic Partition Overwrite for managed Delta tables, and a new credential-scoped file system to prevent out-of-memory errors. This release also adds support for the VARIANT data type in the UC client and fixes a critical security vulnerability (CVE-2026-27478) requiring new server configuration for existing deployments with authorization enabled.
Delta Lake 4.2.0
This release enhances Unity Catalog managed tables with support for REPLACE TABLE, RTAS, Dynamic Partition Overwrite, and improved streaming read options like `startingTimestamp` and `skipChangeCommits`. It also introduces GA support for Variant columns, Geospatial types with data skipping, and collated strings, alongside fixes for Variant stats and decimal predicates.
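A short sketch of how the two streaming options named above might be passed from PySpark. The helper names, the table name, and the timestamp value are hypothetical; `spark` is assumed to be an existing SparkSession already configured for Delta Lake, and the exact behaviour on UC managed tables should be checked against the release notes.

```python
def delta_stream_options(starting_timestamp=None, skip_change_commits=False):
    """Collect the reader options named in the release notes into a plain dict."""
    options = {}
    if starting_timestamp is not None:
        # Start reading from changes committed at or after this timestamp.
        options["startingTimestamp"] = starting_timestamp
    if skip_change_commits:
        # Ignore commits that update or delete existing rows.
        options["skipChangeCommits"] = "true"
    return options


def read_delta_stream(spark, table_name, options):
    """Open a streaming read; `spark` is an existing SparkSession (assumed)."""
    return spark.readStream.format("delta").options(**options).table(table_name)


# Hypothetical values, for illustration only.
opts = delta_stream_options("2025-01-01T00:00:00.000Z", skip_change_commits=True)
```

Calling `read_delta_stream(spark, "main.sales.orders", opts)` would then return a streaming DataFrame, where the three-level table name is again only an example.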
News: Stop Guessing Table Health — Let These Dashboards Tell You
Databricks offers two dashboards for monitoring table health and access: the Table Access Advisor and the Table Health Advisor. These dashboards provide insights into table ownership, read/write patterns, staleness, optimization status, and underlying file structures, helping users identify ghost tables and ensure best practices.
Tutorials: How to Sync Lakebase Tables to Delta with Lakehouse Sync
Databricks demonstrates how to sync Lakebase PostgreSQL tables to Delta tables within a Databricks Lakehouse using the Lakehouse Sync feature. This process enables analytical workloads on data originating from Lakebase applications by leveraging Delta and Spark.
Delta Lake 4.1.0
Delta Lake 4.1.0 enhances Unity Catalog integration with improved support for catalog-managed tables, including atomic CTAS and conflict-free feature enablement for Deletion Vectors and Column Mapping. It also introduces a new Spark V2 connector based on Delta Kernel API for streaming reads and server-side planning capabilities.
This release fixes a bug related to a backported change from an earlier pull request. It addresses a specific issue within the Rust core of the Delta Lake connector for Python.
This release introduces a session-first DataFusion integration and exposes Delta Lake Vacuum metadata as Arrow streams. It also fixes issues with schema merge appends for generated columns and improves parquet predicate pushdown.
Delta Lake 4.0.1
The "managed table" feature is renamed to `catalogManaged` (breaking change for `catalogOwned-preview` and `ucTableId`), and Unity Catalog now supports OAuth authentication for catalogs. This release also fixes a `NoSuchMethodError` when running `REORG TABLE … APPLY (PURGE)` with Spark 4.0.1 and enables creating UC-managed Delta tables where properties are sent to the UC server.
python-v1.3.1: read support deletion vectors, column mapping
This release adds read support for Delta Lake tables utilizing deletion vectors and column mapping. It also includes performance improvements for table scans and predicate pushdown, alongside better error messages for Unity Catalog and LakeFS.
This release introduces several API changes and integrates `delta_kernel` for improved stats parsing performance. It also fixes issues with schema evolution during merge operations and null handling in scalar extraction.
Community: Apache Spark Was Hard Until I Learned These 30 Concepts!
The video explains 30 key Apache Spark concepts, starting with a comparison to MapReduce to highlight Spark's in-memory processing and DAG-based execution model. It then details Spark's cluster architecture, job execution flow (driver, executors, tasks), and memory management within executor containers.
Tutorials: Delta Lake Masterclass | Azure Databricks | PySpark | From Zero-To-Expert
This video provides a comprehensive masterclass on Delta Lake using Azure Databricks and PySpark, covering its core concepts, internal workings, and practical applications. It demonstrates how Delta Lake solves data lake problems like lack of ACID support, DML operations, and schema enforcement, and teaches features like time travel, concurrency control, and optimization techniques.
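Time travel follows from how a Delta table records changes: each commit in the transaction log adds or removes data files, and the table at a given version is whatever remains after replaying the log up to that commit. The sketch below models only that replay; the file names and the three-commit history are hypothetical, and a real log carries much more (metadata, protocol, statistics, checkpoints).

```python
def snapshot(log, version):
    """Replay commits 0..version and return the set of data files that are live."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} does not exist")
    live = set()
    for commit in log[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
    return live


# Hypothetical history: an insert, an append, then a rewrite of the first file.
log = [
    [("add", "part-000.parquet")],
    [("add", "part-001.parquet")],
    [("remove", "part-000.parquet"), ("add", "part-002.parquet")],
]

current = snapshot(log, 2)   # the table as it is now
as_of_v1 = snapshot(log, 1)  # the table as it was at version 1
```

Reading an older version works only while the removed files still exist on storage, which is why VACUUM retention limits how far back time travel can go.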
Events: [Demo] Lakebase: Real-time Operational & Analytical Data on One Platform
Lakebase allows users to create synced tables in Unity Catalog, combining Delta Lake data with other sources for real-time operational and analytical use. These synced tables can be configured for one-off snapshots or continuous updates, enabling unified data access for applications and historical analysis.
UnityCatalog 0.3.0
Unity Catalog now supports Spark 4.0 and Delta Lake 4.0, enhancing compatibility with the latest Databricks runtime components. New API surfaces for credentials and external locations provide more flexible handling of external storage services.
News: Crypto at Scale: Building a High-Performance Platform for Real-Time Blockchain Data
News: Scaling Identity Graph Ingestion to 1M Events/Sec with Spark Streaming & Delta Lake
News: Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning
Delta Lake 4.0.0
Delta Lake 4.0.0 introduces preview support for catalog-managed tables and the Variant data type for semi-structured data, alongside Delta Connect for Spark Connect integration. Key improvements include instant dropping of table features without history truncation and enhanced performance through log compaction files and clustered table support.
Delta Lake 3.3.2
This release fixes an issue where stale checksum files were not cleaned up during Delta table maintenance. It also includes a fix for Delta Flink to correctly map Delta's BinaryType to Flink's data types.
Delta Lake 3.3.1
This release fixes an issue allowing user-specified schema on read if consistent with the table schema. It also includes a kernel fix for handling non-uniform value types in map[string, string] within Delta commit files.
Delta Lake 3.3.0
Delta Lake 3.3.0 introduces Identity Columns, faster VACUUM LITE, and the ability to enable Row Tracking on existing tables for row-level lineage. It also allows enabling UniForm Iceberg on existing tables without data rewrites and supports reading tables with Type Widening enabled in Delta Kernel.
Delta Lake 3.2.1
This release fixes several bugs in Delta Lake 3.2.0, including issues with MERGE operations, clustering, and restoring tables. It also enhances Delta Universal Format by allowing Iceberg enablement on existing tables via ALTER TABLE.
Events: Announcing Delta Lake 4.0 with Liquid Clustering. Presented by Shant Hovsepian at Data + AI Summit
Events: Announcing DuckDB Support for Delta Lake and a DuckDB Extension to Unity Catalog - Hannes Mühleisen
Events: Lakehouse Format Interoperability With UniForm. Shant Hovsepian presents at Data + AI Summit 2024
Delta Lake 4.0.0 Preview
Delta Lake 4.0.0rc1 introduces support for Spark Connect, Type Widening, the Variant data type, and Coordinated Commits for flexible multi-cloud writes. This preview also includes fixes for liquid clustering, improved CDF query filter pushdown, and performance enhancements for checkpoint finding.
Delta Lake 3.2.0
This release introduces Liquid clustering for incremental optimization and preview support for Type Widening to alter column types without data rewrites. It also adds preview support for Apache Hudi in Delta UniForm tables and improves VACUUM operations with inventory tables and writer protocol checks.
Tutorials: 124. Databricks | Pyspark| Delta Live Table: Datasets - Tables and Views