brickster.ai

Delta Lake

Recent items mentioning Delta Lake across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

60 recent items · 19 releases · 2 news · 30 videos · 9 community threads

What's happening in Delta Lake
AI synthesis · updated 8h ago

Databricks has significantly expanded interoperability for Unity Catalog Open APIs, now allowing external engines like Apache Spark, Flink, and DuckDB to create, read, and write to UC managed Delta tables, leveraging Delta Lake's new catalog commits for safe concurrent writes [4]. Delta Lake is also a foundational component for advanced AI applications, as seen with Claroty's AI-powered CPS Library, built on Databricks Custom Agents and Delta Lake to automate entity resolution for industrial and healthcare assets [5]. Community discussions highlight practical aspects of Delta Lake, including handling MERGE with schema evolution and optimizing MERGE performance [6][8].

Generated daily from the 10 most recent items mentioning Delta Lake.

Reddit · Discussion

Enzyme whitepaper - incremental view maintenance

I know this is pretty nerdy stuff, but I found this paper super interesting: [**Enzyme: Incremental View Maintenance for Data Engineering**](https://arxiv.org/html/2603.27775v2).

IVM (incremental view maintenance) is not a brand-new problem. It has been studied in databases for decades. The basic idea: when source data changes, Enzyme tries to avoid rerunning the entire query from scratch. Instead, it figures out which parts of the existing result are affected and updates just those parts, while still producing the same result you'd get from a full recompute.

What I found especially interesting is how much "under the hood" machinery is needed to make this work reliably:

* tracking source table changes with Delta Lake features like Change Data Feed, row tracking, deletion vectors, and time travel
* decomposing queries into logical operators like filters, joins, aggregations, and windows
* generating delta plans for each operator
* deciding whether incremental refresh is actually cheaper than full recompute
* handling non-deterministic functions, Python UDFs, query fingerprints, and pipeline dependencies
* falling back safely when incrementalization is not worth it or not safe

Also, I wasn't aware that a materialized view is not just "the final table." Under the hood it consists of two parts:

1. backing table, which stores the user's data plus internal metadata columns
2. top-level view, which describes how the result should be computed

The split architecture of MVs at Databricks, comprising a backing Delta table and a top-level view, provides flexibility for incremental computation. For example, if the top-level view contains AVG(x), Enzyme can internally store SUM(x) and COUNT(\*), because those are easier to update incrementally when new rows arrive.

Databricks says Enzyme was validated across thousands of production pipelines and produced cumulative daily compute savings of billions of CPU seconds. On the TPC-DI benchmark, it incrementalized 100% of the workloads and beat full recomputation in 6 out of 8 datasets.
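
The AVG(x) example above can be sketched in plain Python. This is only an illustration of the idea of storing SUM(x) and COUNT(\*) and deriving the average from them; the class and function names (`AvgState`, `apply_inserts`, `full_recompute_avg`) are hypothetical and are not taken from the Enzyme paper or any Databricks API. It handles inserts only.

```python
from dataclasses import dataclass


@dataclass
class AvgState:
    """Hypothetical backing-table row: keeps SUM(x) and COUNT(*), not AVG(x)."""
    total: float = 0.0
    count: int = 0

    def apply_inserts(self, new_values):
        # Incremental step: fold only the newly arrived rows into the state.
        for x in new_values:
            self.total += x
            self.count += 1

    def avg(self):
        # "Top-level view": derive AVG(x) from the stored SUM and COUNT.
        return self.total / self.count if self.count else None


def full_recompute_avg(all_values):
    # Reference result: what rerunning the query from scratch would return.
    return sum(all_values) / len(all_values) if all_values else None


base = [10.0, 20.0, 30.0]
state = AvgState()
state.apply_inserts(base)

new_rows = [40.0, 60.0]
state.apply_inserts(new_rows)  # touches 2 rows instead of rescanning all 5

incremental = state.avg()
recomputed = full_recompute_avg(base + new_rows)
```

Storing the average alone would not be enough, because a new row cannot be folded into an average without knowing how many rows produced it; SUM and COUNT can each be updated independently.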

276szymon_dybczakyesterday
Reddit · General

PipelineIQ: Forward‑Looking Sales Intelligence That Drives Action

Your CRM data is a mess. Everyone knows it. Most AI tools pretend it isn't.

Databricks took a different approach with PipelineIQ - instead of building yet another forecasting model that assumes clean data (spoiler: it never is), they built a prescriptive action engine that works with the chaos.

The result? Every deal in the pipeline gets one of three verdicts:

* 🚶 Walk - disengage, this isn't worth your time
* 🔄 Pivot - viable deal, wrong approach
* 🚀 Accelerate - conditions are right, lean in now

No vague "insights." No dashboards that require a PhD to interpret. Just: here's what to do today.

Built on Databricks' own stack (Foundation Model APIs, Delta Lake, Unity Catalog) and used internally by their own sales org - this is a rare "we built it for ourselves first" story.

Read the full blog here: [https://www.databricks.com/blog/pipelineiq-forward-looking-sales-intelligence-drives-action](https://www.databricks.com/blog/pipelineiq-forward-looking-sales-intelligence-drives-action)

00sai-nageshwaran3d ago
Databricks Community · Data Engineering

How to handle MERGE with Schema Evolution in Delta Lake

001w ago
Databricks Community · Technical Blog

Delta Lake Under the Hood: What Every Data Engineer Should Know

001w ago
Databricks Community · Technical Blog

Mastering Delta Lake MERGE Performance: Why It Slows Down and How to Fix It

001w ago
Reddit · Discussion

Best resources to learn Databricks?

Hi everyone, I have around 5 years of experience as a SQL Developer and in Data Engineering. I am planning to learn Databricks seriously and also prepare for the Databricks exam.

I have good experience with SQL and data concepts, but I want to build strong practical knowledge in Databricks, Spark, Delta Lake, Lakehouse concepts, and real-time data engineering use cases.

Can you please suggest the best resources to learn Databricks from beginner to advanced level? I am mainly looking for:

* Hands-on learning resources
* Practice projects
* Practice exams or sample questions
* YouTube courses, books, blogs, or official materials

Also, which Databricks should I start with as someone coming from a SQL and Data Engineering background?

Thanks in advance for your suggestions. I would really appreciate any practical learning path or resources that helped you.

1621Sony_ch2w ago
Reddit · General

What Developers Need to Know About Delta Lake 4.2

10Lenkz2w ago
Reddit · Discussion

The learning order that actually works for Databricks. I wasted 3 months before figuring this out.

I want to share something that I wish someone had told me when I started learning Databricks, because it would have saved me months of confusion.

When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor, but the moment I opened a blank notebook I had no idea where to start.

Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything.

Here is the order that worked for me, and I genuinely believe it works for most people.

Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database, this stage takes a few days, not weeks.

Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run a SELECT with VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table, so you get versioning, schema enforcement, and ACID transactions without configuring anything.

Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing.

Then build pipelines. Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps.

Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation, but once you have built a pipeline you understand exactly why it matters.

The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there, everything fell into place faster.

One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version, that tutorial was written for Community Edition, which no longer exists. Just open a notebook and start writing code.

Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.
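
Lazy evaluation, mentioned in the post above, can be illustrated without a Spark cluster. The following is a plain-Python analogy built on generators, not PySpark itself: the `read_rows` source and the sample rows are made up for illustration, and the comments only point at the DataFrame operations the post names (df.filter, WHERE). Building the pipeline does no work; consuming it does.

```python
executed = []  # records which source rows were actually read


def read_rows():
    # Hypothetical stand-in for a data source; yields rows one at a time.
    for row in [
        {"id": 1, "amount": 5},
        {"id": 2, "amount": 50},
        {"id": 3, "amount": 500},
    ]:
        executed.append(row["id"])
        yield row


# "Transformations": each line only describes work, nothing runs yet.
rows = read_rows()
filtered = (r for r in rows if r["amount"] > 10)  # roughly df.filter / WHERE
projected = (r["id"] for r in filtered)           # roughly a column selection

work_before_action = list(executed)  # snapshot: no row has been read so far

# "Action": consuming the pipeline is what finally triggers execution.
result = list(projected)
work_after_action = list(executed)
```

Here `work_before_action` is empty even though three pipeline stages have been defined, which mirrors running a notebook cell that contains only transformations and no action.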

8518InevitableClassic2612w ago
Reddit · Discussion

Here are 5 topics that showed up much more than I expected in my DEA exam

I took the Databricks Data Engineer Associate exam recently and wanted to share what actually came up, because it was quite different from what I spent most of my time studying. I went in thinking Delta Lake theory and platform architecture would be the big topics. They weren't. The exam is way more practical than I expected.

**The first thing** that caught me off guard was how heavily they test Auto Loader. Not just the basics but real scenarios. One question described a pipeline receiving 50,000 new files per day and asked which ingestion method to use and why. You need to understand when Auto Loader makes sense versus COPY INTO, how schema evolution works with mergeSchema, and the difference between directory listing and file notification mode. I probably got six or seven questions just on this one topic.

**The second thing** was lazy evaluation. I knew the concept but I wasn't prepared for how they test it. They give you a block of code with four or five DataFrame transformations and ask what happens when you run the cell. The answer is nothing happens because there is no action at the end. But the way they frame the questions makes you second guess yourself if you only memorized the definition without really understanding it.

**Third** was Lakeflow expectations. The old name was Delta Live Tables but they use Lakeflow in the exam now. You need to know the three expectation types and when to use each one. They gave me a scenario where the pipeline should log bad records but never drop them, and I had to pick the right expectation decorator. Also know the difference between streaming tables and materialized views, because that came up more than once.

**Fourth** was Unity Catalog permissions. Not just the three-level naming pattern but actual grant scenarios. Something like: a data analyst needs to read tables in the sales schema but should not be able to create new tables, and you have to pick the correct grant statement. I got at least three or four questions like this.

**Fifth** was MERGE INTO. They really love this command. Upsert scenarios, deduplication, slowly changing dimensions. If you cannot write a MERGE statement from memory with the WHEN MATCHED and WHEN NOT MATCHED clauses, you should spend an hour practicing just that before you sit for the exam.

What surprised me was what was not heavily tested. Cluster configuration was maybe one question. The architecture diagrams with control plane and data plane were one or two questions at most. Delta Sharing was one question. Spark internals like shuffle details were barely mentioned.

The biggest thing I wish I had done differently is spend less time reading documentation and more time actually running code. When you have actually executed a MERGE INTO on a real table and seen the results, the exam question feels like something you have done before instead of something you read about once. I used Databricks Free Edition for all my practice and it was more than enough.

Hope this helps someone who is preparing right now. Feel free to ask anything about the exam in the comments and I will try to answer.
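
For anyone practicing the MERGE INTO shape the post describes, here is a minimal sketch. The SQL string shows the basic upsert form with WHEN MATCHED and WHEN NOT MATCHED; the table names (`customers`, `updates`) and columns are hypothetical. The `merge_upsert` function is a plain-Python model of what that statement does to the rows, assuming at most one source row per key; it is not how Delta Lake implements MERGE.

```python
# Basic upsert shape in Delta Lake SQL (table and column names are made up).
MERGE_SQL = """
MERGE INTO customers AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""


def merge_upsert(target, source, key="id"):
    """Plain-Python model of the statement above, for intuition only."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)  # WHEN MATCHED THEN UPDATE SET *
        else:
            merged[row[key]] = dict(row)  # WHEN NOT MATCHED THEN INSERT *
    return sorted(merged.values(), key=lambda r: r[key])


customers = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lyon"}]
updates = [{"id": 2, "city": "Paris"}, {"id": 3, "city": "Rome"}]

result = merge_upsert(customers, updates)
```

In this toy run, id 2 is updated, id 3 is inserted, and id 1 is left alone. The model silently lets a later duplicate source row win, so it does not capture the deduplication scenarios the post mentions; those are worth practicing against a real table.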

318InevitableClassic2612w ago