brickster.ai

LakeFlow

Recent items mentioning LakeFlow across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

60 recent items · 3 news · 13 videos · 44 community threads

What's happening in LakeFlow
AI synthesis · updated 23h ago

Databricks has introduced Lakeflow Designer for visual, no-code data pipeline creation, though early feedback notes its generated code can be messy, with a workaround involving Genie to convert visual workflows to clean PySpark/SQL notebooks [2][3][10]. Lakeflow Connect is also expanding its integrations, with HubSpot now generally available and discussions around Salesforce and Jira connectors [6][7][8]. This aligns with a broader trend towards declarative pipelines and simplified data flow [5].

Generated daily from the 10 most recent items mentioning LakeFlow. Click any [N] to jump to the source.

RedditHelp

Databricks Streaming Table + ABAC policies causing ABAC_POLICIES_NOT_SUPPORTED

Guys... has anyone run into this before? We’re facing an issue in Databricks with a streaming pipeline writing data into a Streaming Table that has ABAC policies applied on some columns (column masking policies). When the pipeline tries to write/update the table, it fails with this error: `ABAC_POLICIES_NOT_SUPPORTED: ABAC policies are not supported on tables defined within a pipeline.` Basically, the service principal running the job cannot update the Streaming Table because the table has ABAC policies applied. What’s confusing is that the Databricks docs mention that Streaming Tables support ABAC. We also tried adding the service principal to the policy whitelist/exemptions, but it still fails. Has anyone seen this behavior before? Is this an actual limitation with Lakeflow/Streaming Tables + ABAC, or are we missing some configuration? Thanks!

43Quaiadayesterday
RedditTutorial

First Look at Lakeflow Designer

70Youssef_Mriniyesterday
RedditNews

DABs templates

We can configure a folder in Workspace for custom DABs templates. If we click Create Bundle in the git repo, we can use our ready DABs template. #databricks [https://databrickster.medium.com/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b](https://databrickster.medium.com/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b)

151hubert-dudek2d ago
RedditNews

How NAB’s journey to 100% Declarative Pipelines is helping data flow like electricity

* **1,800:** Spark Declarative Pipelines running on Databricks Lakeflow
* **86% to 99.6%:** Improvement in job success rate
* **80% less** transformation complexity

This is insane. If you have any questions about SDP let me know.

1616Youssef_Mrini2d ago
Databricks CommunityData Engineering

Salesforce Connector - Lakeflow Connect 400 Error

002d ago
RedditNews

Lakeflow Connect | HubSpot (GA)

Hi all, Lakeflow Connect's HubSpot connector is now GA! It provides a managed, secure, and native ingestion solution for the HubSpot Marketing Hub, ingesting marketing campaigns and email analytics into Databricks. Try it now:

1. [**Set up HubSpot as a data source**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/hubspot-source-setup)
2. [**Create a HubSpot Connection in Catalog Explorer**](https://docs.databricks.com/aws/en/connect/managed-ingestion#hubspot)
3. [**Create the ingestion pipeline via the UI, a Databricks notebook, or the Databricks CLI**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/hubspot-pipeline)

101Brickster_S4d ago
Databricks CommunityWarehousing & Analytics

Lakeflow Connect for Jira - Questions

004d ago
RedditNews

Genie Schedule Tasks

Need a report every week on a given topic? Now it is really easy: just ask Genie for it! #databricks [https://databrickster.medium.com/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b](https://databrickster.medium.com/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b)

144hubert-dudek4d ago
RedditNews

Disabling tasks

I remember pointing out that this option was missing, and now it is GA: disabling tasks in Lakeflow Jobs. #databricks [https://databrickster.medium.com/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b](https://databrickster.medium.com/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b)

61hubert-dudek6d ago
RedditNews

%uv pip

In a serverless environment 5 (soon, probably in other environments as well), we can also install packages using the UV package manager. Tests show that it is even a few times faster! #databricks [https://medium.com/@databrickster/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b](https://medium.com/@databrickster/databricks-news-lakeflow-designer-uv-package-manager-genie-tasks-disable-lakeflow-tasks-3e2cfb9ef86b)

327hubert-dudek1w ago
Databricks CommunityData Engineering

Lakeflow Connect Data ingestion from SQL Server and PostgreSQL to Databricks with CDC

001w ago
Databricks CommunityAnnouncements

Agentic Data Engineering with Genie Code and Lakeflow

001w ago
RedditGeneral

What’s new in the Lakeflow Pipelines Editor

(Databricks PM here) Excited to share that the **Lakeflow Pipelines Editor is now generally available**! This is the code editor for building Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables pipelines). We shipped a few new features inside it that we'd love your feedback on.

**Redesigned layout for AI first development**

We now land users directly into the code and offer flexibility on where to dock the pipeline graph. By default you will now find it in the bottom panel with the option to open it up in a dedicated tab. This makes it easier to view your code, pipeline graph / table metrics and Genie Code chat window side by side. *(Genie Code is GA.)*

https://preview.redd.it/hffbcydvtg0h1.png?width=2048&format=png&auto=webp&s=c071b1c1d012dd47f92438933a526ba2524409f1

https://preview.redd.it/kzcahydvtg0h1.png?width=2048&format=png&auto=webp&s=808600f720ff798a415fce412772dcc0ab6edc4f

**Run selected SQL code to preview data**

Previously, the only way to see what a query produced was to materialize the table and re-run the pipeline. You can now highlight a block of SQL in a pipeline source file and run just that selection to preview the result, with no materialization needed. Useful when you're working on a transformation and want to inspect the output before running the pipeline and materializing the data.

https://preview.redd.it/za6s0zdvtg0h1.png?width=1480&format=png&auto=webp&s=3c4efb658fa6ed5a288e23bb3c2a3b90c383b740

Let us know if you are interested in seeing the same feature supported for Python!

**Incrementalization insights**

When a materialized view can't refresh incrementally, it falls back to a full recompute which leads to longer duration and higher cost. The editor now flags the most common patterns that prevent incrementalization directly in your code.

https://preview.redd.it/i9ahdzdvtg0h1.png?width=2048&format=png&auto=webp&s=bf6c29eb6b3575182253d32e1b506ddd8138022f

You can also see their aggregation in the issues panel so you can see them across the whole pipeline at a glance.

https://preview.redd.it/ollfgzdvtg0h1.png?width=2048&format=png&auto=webp&s=19767d95d4449dc441f3ec41926b44c12bce9d69

We are still working on increasing our coverage so would love to hear your feedback as we improve this experience. To learn more, [you can check out our docs](https://docs.databricks.com/aws/en/ldp/multi-file-editor)! What other improvements would help your day-to-day pipeline work?

173brickster_1231w ago
RedditNews

Databricks is now supporting Microsoft Outlook in Lakeflow Connect (Beta)

Azure Databricks has introduced a managed Microsoft Outlook connector for Lakeflow Connect, currently available in Beta, enabling organizations to ingest Outlook email data directly into Azure Databricks. With this new connector, teams can now integrate Outlook-based communication data into analytics, governance, automation, and AI workflows more efficiently.

https://preview.redd.it/8r0xry5yz50h1.png?width=1402&format=png&auto=webp&s=78ae55a64c3df3d47b0edcb4ed6f246f421e7b08

Key capabilities currently supported:

* Incremental ingestion
* Unity Catalog governance
* UI & API-based pipeline authoring
* Databricks Workflows orchestration
* Declarative Automation Bundles
* Column selection/deselection

Supported authentication:

* OAuth M2M (Machine-to-Machine)

Current Beta limitations (not yet supported):

* SCD Type 2 support
* Automated schema evolution
* API-based row filtering
* Multiple tables per pipeline (currently limited to 1)

Since the connector is still in Beta, workspace admins must enable the feature from the Previews page before use. Nice to see Databricks continuing to expand Lakeflow Connect integrations across enterprise ecosystems. [Source Link](https://learn.microsoft.com/en-us/azure/databricks/ingestion/lakeflow-connect/outlook-overview)

112Few-Engineering-41351w ago
Databricks CommunityData Engineering

Lakeflow SDP partition error

001w ago
RedditGeneral

The next generation of Databricks Genie just launched. Here is what data engineers actually need to know.

I have been following Genie since it first launched with AI/BI last year. Back then, I honestly thought it was mostly for business users. A chatbot on top of your data that could answer basic questions in plain English. Useful, but not something I thought data engineers really needed to care much about. After seeing the new 2026 version, I completely changed my mind. Genie is no longer just a business chatbot. The biggest change is Genie Code, which is basically an AI agent designed for data professionals. It can generate pipelines, debug failures, create dashboards, monitor systems, and work directly with Lakeflow and Unity Catalog. That part caught my attention immediately because it moves beyond simple Q&A and starts touching actual engineering workflows. What surprised me most is how connected the whole system has become. It can pull context from dashboards, Genie Spaces, apps, metadata, documentation, and external systems like GitHub, Jira, and Confluence through MCP. Instead of only searching tables, it tries to understand relationships across the environment. That feels very different from the first version. The operational side is also interesting. Genie Code can monitor pipelines, investigate failures, help with DBR upgrades, and respond to issues before teams even notice them. The more I read about it, the more it felt less like a chatbot and more like an assistant sitting beside the engineering team. But honestly, the biggest takeaway for me is not the AI itself. It is what this means for data engineers. A lot of people immediately jump to “AI will replace data engineers,” but I think the opposite is happening. These systems are only as good as the data foundation underneath them. If metadata is incomplete, if tables are messy, if naming conventions are inconsistent, or if documentation is missing, the AI layer will give poor answers confidently. 
That means clean data modeling, governance, metadata, documentation, and data quality are becoming even more important than before. The engineers building those foundations become more valuable, not less. I think the role is slowly shifting away from spending hours writing repetitive boilerplate transformations and more toward building trustworthy, AI-ready data systems. One thing I keep noticing while learning Databricks through BricksNotes and the wider community is that the platform is moving very quickly toward AI-native data engineering. Features like Unity Catalog, Lakeflow, and now Genie all connect together. It feels like understanding metadata and governance is becoming just as important as understanding Spark itself. Also interesting that Genie now has a full mobile experience on iOS and Android. Business users can access dashboards, apps, and chat directly from their phones, which means the underlying data quality matters even more because people are going to depend on these systems everywhere, not only during work hours. Curious if anyone here is already using Genie or Genie Code in production. I would genuinely like to hear how the answer quality has been and whether your teams are changing how they approach metadata and documentation because of it.

5719InevitableClassic2611w ago
RedditNews

Ingestion without CDF

You don't need CDF for incremental ingestion, and Databricks has decided to master query-based capture. [https://databrickster.medium.com/watermark-based-incremental-ingestion-lakeflow-connect-query-based-capture-91836fbaa453](https://databrickster.medium.com/watermark-based-incremental-ingestion-lakeflow-connect-query-based-capture-91836fbaa453) [https://www.sunnydata.ai/blog/lakeflow-connect-query-based-capture-incremental-ingestion](https://www.sunnydata.ai/blog/lakeflow-connect-query-based-capture-incremental-ingestion)

220hubert-dudek1w ago
Databricks CommunityData Engineeringanswered

Does Lakeflow Connect guarantee no out-of-order records?

001w ago
Databricks CommunityMVP Articles

Disable Tasks in Databricks Lakeflow Jobs: A Powerful Feature for Flexible Workflow Orchestration

001w ago
RedditDiscussion

Lakeflow connect Jira Question

Hey all, the Jira connector has actually been working really well for us. There is only one major gap from the jira API which is the issues history object/endpoint. Right now we are not able to track the historical data from Jira itself. Are there plans to enable this object as part of the databricks connector?

44EnvironmentalAd20961w ago
RedditNews

Lakeflow Connect | Smartsheet (Beta)

Hi all, Lakeflow Connect's Smartsheet connector is now in beta! It provides a managed, secure, and native ingestion solution for Smartsheet sheets and reports into Databricks. Try it now:

1. **Enable the Smartsheet Beta:** Workspace admins can enable the Beta via Settings → Previews → "LakeFlow Connect for Smartsheet"
2. [**Set up Smartsheet as a data source**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/smartsheet-source-setup)
3. [**Create a Smartsheet Connection in Catalog Explorer**](https://docs.databricks.com/aws/en/connect/managed-ingestion#smartsheet)
4. [**Create the ingestion pipeline via the UI, a Databricks notebook or the Databricks CLI**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/smartsheet-pipeline)

31Brickster_S1w ago
RedditTutorial

pivot() workarounds in Lakeflow Spark Declarative Pipelines

Problem: In Lakeflow Spark Declarative Pipelines, the `pivot()` function is not supported. The `pivot` operation in Spark requires the eager loading of input data to compute the output schema. This capability is not supported in pipelines. Source: [https://docs.databricks.com/aws/en/ldp/limitations](https://docs.databricks.com/aws/en/ldp/limitations)

# How can this be mitigated?

**Workaround 1: Rewrite PIVOT Using CASE WHEN**

This is the most common workaround. You manually expand the pivot into conditional aggregations.

>Original Query:

    SELECT * FROM sales_data
    PIVOT (
      SUM(sales) FOR region IN ('North', 'South', 'East', 'West')
    )

>Rewritten without PIVOT:

    SELECT product,
      SUM(CASE WHEN region = 'North' THEN sales ELSE 0 END) AS North,
      SUM(CASE WHEN region = 'South' THEN sales ELSE 0 END) AS South,
      SUM(CASE WHEN region = 'East' THEN sales ELSE 0 END) AS East,
      SUM(CASE WHEN region = 'West' THEN sales ELSE 0 END) AS West
    FROM sales_data
    GROUP BY product

This works in Lakeflow Pipelines because the output schema is fully determined at parse time; no eager data loading is required.

**Workaround 2: Rewrite PIVOT Using aggregate FILTER**

Databricks SQL supports the `FILTER(WHERE ...)` clause on aggregates, which is a cleaner alternative to CASE WHEN:

>Original PIVOT query:

    SELECT year, region, q1, q2, q3, q4
    FROM sales
    PIVOT (
      SUM(sales) AS sales
      FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4)
    )

>Rewritten with FILTER:

    SELECT year, region,
      SUM(sales) FILTER(WHERE quarter = 1) AS q1,
      SUM(sales) FILTER(WHERE quarter = 2) AS q2,
      SUM(sales) FILTER(WHERE quarter = 3) AS q3,
      SUM(sales) FILTER(WHERE quarter = 4) AS q4
    FROM sales
    GROUP BY year, region

This syntax is often more readable than nested CASE WHEN, especially with multiple aggregations.

**Multi-Column PIVOT Rewrite**

>For pivoting on multiple columns simultaneously:

    SELECT * FROM sales
    PIVOT (
      SUM(sales) AS sales
      FOR (quarter, region) IN ((1, 'east') AS q1_east, (1, 'west') AS q1_west,
                                (2, 'east') AS q2_east, (2, 'west') AS q2_west)
    )

>Rewritten:

    SELECT year,
      SUM(sales) FILTER(WHERE quarter = 1 AND region = 'east') AS q1_east,
      SUM(sales) FILTER(WHERE quarter = 1 AND region = 'west') AS q1_west,
      SUM(sales) FILTER(WHERE quarter = 2 AND region = 'east') AS q2_east,
      SUM(sales) FILTER(WHERE quarter = 2 AND region = 'west') AS q2_west
    FROM sales
    GROUP BY year

**Multiple Aggregations**

You can also rewrite PIVOTs that use multiple aggregate functions.

>Original Query

    SELECT * FROM (SELECT year, quarter, sales FROM sales) AS s
    PIVOT (
      SUM(sales) AS total, AVG(sales) AS avg
      FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4)
    )

>Rewritten:

    SELECT year,
      SUM(sales) FILTER(WHERE quarter = 1) AS q1_total,
      AVG(sales) FILTER(WHERE quarter = 1) AS q1_avg,
      SUM(sales) FILTER(WHERE quarter = 2) AS q2_total,
      AVG(sales) FILTER(WHERE quarter = 2) AS q2_avg,
      SUM(sales) FILTER(WHERE quarter = 3) AS q3_total,
      AVG(sales) FILTER(WHERE quarter = 3) AS q3_avg,
      SUM(sales) FILTER(WHERE quarter = 4) AS q4_total,
      AVG(sales) FILTER(WHERE quarter = 4) AS q4_avg
    FROM sales
    GROUP BY year

**Summary**

Both approaches work fully within SDP pipelines with complete lineage tracking and produce the same results as PIVOT, with one caveat: the `ELSE 0` in Workaround 1 returns 0 for a group with no matching rows, where PIVOT and the FILTER rewrite return NULL. Drop the `ELSE 0` if you need NULLs there.
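The conditional-aggregation rewrite can be tried locally before it goes into a pipeline. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Databricks SQL; the table contents and the `pivot_sql` helper are made up for illustration. It omits `ELSE 0` so that empty cells come back as NULL (`None`), which is how PIVOT treats them.

```python
import sqlite3

# Tiny in-memory table standing in for sales_data (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (product TEXT, region TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        ("widget", "North", 10),
        ("widget", "North", 5),
        ("widget", "South", 7),
        ("gadget", "East", 3),
    ],
)

# The pivot values must be known up front; that is what makes the
# output schema static. This list is a trusted constant, not user input.
REGIONS = ["North", "South", "East", "West"]


def pivot_sql(regions):
    """Build the CASE WHEN rewrite for a fixed list of pivot values (hypothetical helper)."""
    cols = ",\n  ".join(
        f"SUM(CASE WHEN region = '{r}' THEN sales END) AS {r}" for r in regions
    )
    return (
        f"SELECT product,\n  {cols}\n"
        "FROM sales_data\nGROUP BY product\nORDER BY product"
    )


rows = conn.execute(pivot_sql(REGIONS)).fetchall()
for row in rows:
    print(row)
```

Generating the column list from a constant keeps the query readable when there are many pivot values, while the schema stays fixed at the time the SQL text is produced.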

113zr-brickster1w ago
RedditDiscussion

Databricks Data Engineer Associate Exam Updated for 2026

The Databricks Data Engineer Associate exam changed on May 4, 2026. The exam now has 7 domains instead of 5. Two new domains were added.

The first new domain is CI/CD. This includes:
• Databricks Repos
• Git integration
• Branching and commits
• Deploying Declarative Automation Bundles
• Using the Databricks CLI
• Moving code from dev to test to production

Databricks Asset Bundles is now called Declarative Automation Bundles, so learn the new name. If you have never used Git or the Databricks CLI inside Databricks, spend some time practicing in the Free Edition. Connect a Git repo, make commits, and deploy bundles. Hands-on practice will help a lot.

The second new domain is Troubleshooting, Monitoring, and Optimization. This includes:
• Reading the Spark UI
• Finding bottlenecks like data skew and excessive shuffling
• Understanding Liquid Clustering
• Predictive optimization
• Troubleshooting cluster and memory issues

Many courses do not teach Spark UI deeply, so try running queries yourself and checking the Spark UI. Compare good queries with inefficient ones to understand the difference.

Some existing domains also changed. Ingestion now includes Lakeflow Connect along with Auto Loader and COPY INTO. Governance now includes:
• Column-level masking
• Row-level security
• Attribute-based access control

You now need to understand security beyond basic GRANT permissions.

Lakeflow Jobs also tests three trigger types:
• Scheduled
• File arrival
• Table update

Know when to use each one.

Some product names also changed:
• Databricks Asset Bundles → Declarative Automation Bundles
• Delta Live Tables → Lakeflow Declarative Pipelines

The exam uses the new terminology, so update your study material if you are using older resources.

The exam format is still:
• 45 scored questions
• 90 minutes
• $200

There may also be extra unscored questions mixed into the exam.

For preparation, the original Academy courses still help for the old domains. But for the two new domains, hands-on practice is very important. Practice:
• Spark UI
• Git integration
• Databricks CLI
• Deployments using bundles

Also read the latest official exam guide PDF from the Databricks page. Good luck to everyone preparing for the exam.

468InevitableClassic2611w ago
RedditNews

Lakeflow Connect | Outlook (Beta)

Hi all, Lakeflow Connect's Outlook connector is now in beta! The Lakeflow Connect Outlook connector provides a managed, secure, and native ingestion solution for Microsoft Outlook email data, ingesting messages and attachments into Databricks. Try it now:

1. [**Set up Outlook as a data source**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/outlook-source-setup)
2. [**Create an Outlook Connection in Catalog Explorer**](https://docs.databricks.com/aws/en/connect/managed-ingestion#outlook)
3. [**Create the ingestion pipeline via the UI, a Databricks notebook, or the Databricks CLI**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/outlook-pipeline)

141Brickster_S1w ago
RedditNews

Lakeflow Connect | GitHub (Beta)

Hi all, Lakeflow Connect's GitHub connector is now in beta! The Lakeflow Connect GitHub connector provides a managed, secure, and native ingestion solution for GitHub organizational metadata and activity data, ingesting commits, pull requests, team members, and more into Delta tables. Try it now:

1. [**Set up GitHub as a data source**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/github-source-setup)
2. [**Create a GitHub Connection in Catalog Explorer**](https://docs.databricks.com/aws/en/connect/managed-ingestion#github)
3. [**Create the ingestion pipeline via the UI, a Databricks notebook, or the Databricks CLI**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/github-pipeline)

50Brickster_S1w ago
Databricks CommunityData Engineeringanswered

Is Lakeflow Connect SCD Type 2 output incompatible with Spark dec pipeline streaming tables?

001w ago
Databricks CommunityData Engineering

SCD2 table migration using LakeFlow

001w ago
Databricks CommunityMVP Articles

Watermark-Based Incremental Ingestion (Lakeflow Connect query-based capture)

002w ago
RedditNews

Watermark-Based Incremental Ingestion (Lakeflow Connect query-based capture)

What if we want to ingest data incrementally without CDF? Then we have new functionality from Databricks, "query-based capture", which is nothing less than watermark-based incremental ingestion. It seems to be another best-practice solution for incremental data loading. [https://databrickster.medium.com/watermark-based-incremental-ingestion-lakeflow-connect-query-based-capture-91836fbaa453](https://databrickster.medium.com/watermark-based-incremental-ingestion-lakeflow-connect-query-based-capture-91836fbaa453) [https://www.sunnydata.ai/blog/lakeflow-connect-query-based-capture-incremental-ingestion](https://www.sunnydata.ai/blog/lakeflow-connect-query-based-capture-incremental-ingestion)
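For readers new to the idea, here is a minimal sketch of the general watermark pattern in plain Python. It is not how Lakeflow Connect implements query-based capture; the `Row` type, the `updated_at` cursor column, and the `incremental_batch` helper are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    id: int
    updated_at: datetime  # assumed cursor column that grows on every change


def incremental_batch(
    source: List[Row], watermark: Optional[datetime]
) -> Tuple[List[Row], Optional[datetime]]:
    """Return rows changed after `watermark`, plus the watermark to store for the next run."""
    batch = [r for r in source if watermark is None or r.updated_at > watermark]
    # If nothing changed, keep the old watermark instead of resetting it.
    new_watermark = max((r.updated_at for r in batch), default=watermark)
    return batch, new_watermark


source = [Row(1, datetime(2026, 1, 1)), Row(2, datetime(2026, 1, 2))]

# First run: no watermark yet, so everything is read.
batch1, wm = incremental_batch(source, None)

# A new row arrives; the second run picks up only that row.
source.append(Row(3, datetime(2026, 1, 3)))
batch2, wm = incremental_batch(source, wm)

print([r.id for r in batch1], [r.id for r in batch2], wm)
```

A pattern like this only sees rows whose cursor column moves forward, so hard deletes in the source are invisible to it, and a strict `>` comparison can miss rows that share the exact watermark timestamp.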

21hubert-dudek2w ago
RedditTutorial

If your Lakeflow SDP pipeline broke with DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE, here's a recovery script

I ran into this recently and wanted to share. A Delta table I was streaming from got dropped and recreated by an upstream team. Same name, same schema, but the new table has a fresh internal ID. Spark Structured Streaming checkpoints bind to that ID, so the next pipeline run errors with:

`[DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE] The streaming query was reading from an unexpected Delta table...`

In open-source Spark you'd delete the checkpoint directory. Lakeflow SDP manages those paths internally, so that's not an option. The fix is the Pipelines API parameter `reset_checkpoint_selection` (added in `databricks-sdk` 0.100): pass a list of FQN flow names and start an update that clears only those checkpoints. Bronze/Silver/Gold targets stay untouched.

I packaged the recovery as a sub-template in my Databricks bundle template repo. One CLI call ships the script (with a `--dry-run` flag), a workspace notebook variant, and a README:

`databricks bundle init https://github.com/vmariiechko/databricks-bundle-template --template-dir assets/sdp-checkpoint-recovery`

It also includes a fallback for environments where you can't pip-upgrade the SDK (for me that was the Databricks serverless runtime, which bundles its own SDK).

Repo: https://github.com/vmariiechko/databricks-bundle-template/tree/main/assets/sdp-checkpoint-recovery

Two gotchas worth knowing:

- Flow names must be three-part Unity Catalog FQNs (`catalog.schema.table`), or you hit `IllegalArgumentException`.
- Resetting checkpoints triggers a pipeline update; the API has no "reset only" mode. If you want the pipeline stopped afterwards, cancel from the UI as soon as the call returns.

Happy to answer questions or hear how you have handled this situation. P.S. Feel free to submit issues or PRs.
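The first gotcha can be caught client-side before any API call is made. Below is a minimal sketch, not the script from the repo: `build_reset_request` is a hypothetical helper, the name pattern is a simplified assumption (real Unity Catalog identifiers can be more permissive, for example with backtick quoting), and only the `reset_checkpoint_selection` key comes from the post. It builds a payload and does not contact Databricks.

```python
import re
from typing import Dict, List

# Simplified assumption: three dot-separated plain identifiers.
_THREE_PART_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){2}$")


def build_reset_request(flow_names: List[str]) -> Dict[str, List[str]]:
    """Validate flow names and return the body for a start-update call (hypothetical helper)."""
    bad = [name for name in flow_names if not _THREE_PART_NAME.match(name)]
    if bad:
        # Fail locally with a readable message instead of a server-side exception.
        raise ValueError(f"expected catalog.schema.table, got: {bad}")
    return {"reset_checkpoint_selection": list(flow_names)}


payload = build_reset_request(["main.bronze.orders"])
print(payload)
```

Because starting the update is what actually clears the checkpoints, validating names first avoids kicking off a run that fails halfway through argument parsing.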

22Marik3482w ago
RedditGeneral

[Beta] Lakeflow Connect community connectors

Hey everyone, we just launched a beta for Lakeflow Connect community connectors. The goal is to make it simple to build a custom connector for almost any source you want to ingest from (currently, REST API-based sources work best). They're completely open-source and to make things easier, the repo includes AI dev tools to help speed up the research, auth setup, and testing of connectors for new sources. When you use a community connector, the source code is simply cloned from a GitHub repository into a workspace directory you specify, where it runs as an SDP pipeline. This means you get the same governance and observability guarantees as any other SDP pipeline, and you can deploy them directly through the Add Data UI. The repo already contains 15+ sample connectors (including GitHub, HubSpot, Mixpanel, and Zendesk). You can run these sample connectors as-is, fork them, or just use them as a reference template for your own. * **Repo:**[ https://github.com/databrickslabs/lakeflow-community-connectors](https://github.com/databrickslabs/lakeflow-community-connectors) * **Docs:**[ https://docs.databricks.com/aws/en/ingestion/community-connectors](https://docs.databricks.com/aws/en/ingestion/community-connectors)

122ingest_brickster_1982w ago
Databricks CommunityData Engineeringanswered

Lakeflow Declarative Pipeline queue

002w ago
RedditHelp

Any tips for DABs in CI/CD? Seems pretty useless so far.

We've used DAB-commands like Validate and Plan for a while - to print Github PR-comments on what the PR will change, delete and create. But we are struggling to catch breaking changes before they are committed to main. Some examples: After migrating a pipeline to serverless compute, our branch passed the plan in the PR stage, but failed in main due to `You must use the Advanced edition when using serverless compute. (400 INVALID_PARAMETER_VALUE)` which is something I would expect CI to catch. Another example is Lakeflow generating a new pipeline-ID, which means during deploy it will try to apply itself to an existing pipeline and fail on mismatching pipeline-ids. Again, would've loved to fail in CI instead of main. How are you solving this?
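One way to catch the first example earlier is a small lint step that runs in the PR stage on the resolved bundle configuration. The sketch below is an assumption-heavy illustration, not a DABs feature: it assumes the configuration has already been loaded into a Python dict, that pipelines sit under `resources.pipelines`, and that each pipeline carries `serverless` and `edition` fields. The function name is hypothetical.

```python
from typing import Any, Dict, List


def find_serverless_edition_issues(bundle: Dict[str, Any]) -> List[str]:
    """Return names of serverless pipelines whose explicit edition is not ADVANCED."""
    pipelines = bundle.get("resources", {}).get("pipelines", {})
    issues = []
    for name, spec in pipelines.items():
        edition = spec.get("edition")
        # Only flag an explicitly configured edition; an unset value is left
        # to whatever default the platform applies.
        if spec.get("serverless") and edition and str(edition).upper() != "ADVANCED":
            issues.append(name)
    return sorted(issues)


# Illustrative config; the pipeline names are made up.
bundle = {
    "resources": {
        "pipelines": {
            "orders": {"serverless": True, "edition": "CORE"},
            "events": {"serverless": True, "edition": "ADVANCED"},
            "legacy": {"serverless": False, "edition": "CORE"},
        }
    }
}

problems = find_serverless_edition_issues(bundle)
print(problems)
```

A check like this only covers rules you already know about, so it complements deploying the branch to a throwaway target rather than replacing it.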

57DeepFryEverything2w ago
RedditGeneral

Finally! Databricks lets you disable tasks without hacks

For years, there was no simple way to disable a single task in a Databricks workflow. Let that sink in 🙂 If you wanted to skip a task, you had to get creative:

- Add custom flags
- Wrap logic in if/else blocks
- Or build your own workaround just to not run something

It worked, but it came at a cost. All that extra logic cluttered the code, made pipelines harder to read, and turned what should be a simple toggle into a maintenance headache. Now, we finally have the ability to disable individual tasks in Lakeflow Jobs while keeping everything intact for later. Worth knowing:

* A disabled task in an Azure Databricks Lakeflow job is skipped at runtime without being removed from the job.
* Disabled tasks retain their configuration and run history, so you can re-enable them later without rebuilding anything.
* The feature is currently in private preview. To enable it, go to Previews and switch the toggle to ON.

https://preview.redd.it/4u2dpxbsr2zg1.png?width=811&format=png&auto=webp&s=dbec953e66528698487fd572b2abd3e1053724cd

1712szymon_dybczak2w ago
RedditDiscussion

The learning order that actually works for Databricks. I wasted 3 months before figuring this out.

I want to share something that I wish someone told me when I started learning Databricks because it would have saved me months of confusion. When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor but the moment I opened a blank notebook I had no idea where to start. Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything. Here is the order that worked for me and I genuinely believe it works for most people. Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database this stage takes a few days not weeks. Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run SELECT VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table so you get versioning, schema enforcement, and ACID transactions without configuring anything. Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing. Then build pipelines. 
Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow Spark Declarative Pipelines (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps.

Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation but once you have built a pipeline you understand exactly why it matters.

The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there everything fell into place faster.

One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version, that tutorial was written for Community Edition, which no longer exists. Just open a notebook and start writing code.

Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.

8518InevitableClassic2612w ago
RedditHelp

Lakeflow d365 full refresh

Hi folks. I need a solution to this problem with a full refresh/initial data load. We have a Synapse Link that creates timestamp folders. I need to do a full refresh, but the task is trawling through tens of thousands of folders. Running one table at a time helps; is there a better solution?

31Snoo_747702w ago
RedditNews

Community connectors

Community connectors Databricks is built on open-source. Now, let's change how we ingest data so anyone can build connectors. Community connectors are here! For me, it is one of the most important news stories of the year, as soon as we can have 1000s of connectors, and I count on contributions from all SaaS platforms! repo [https://github.com/databrickslabs/lakeflow-community-connectors/tree/master](https://github.com/databrickslabs/lakeflow-community-connectors/tree/master) more info [https://docs.databricks.com/aws/en/ingestion/community-connectors](https://docs.databricks.com/aws/en/ingestion/community-connectors)

222hubert-dudek2w ago
RedditGeneral

[Passed] Databricks DEA Exam today

https://preview.redd.it/z6mcmrgvmjyg1.png?width=474&format=png&auto=webp&s=28e010f62635d49af3a815998011125d8f2cfa0f

Just walked out of the exam and I’m glad to say I passed. I was sweating a bit because the exam content changes on the 4th, so I really didn't want to fail and have to deal with a new syllabus.

I've had Databricks at work since late 2023. I’ve been using it because, well, it’s there, but I was mostly just "vibe coding": picking up some Python and Spark here and there without any real depth. I ran jobs using whatever cluster settings the company gave me without actually knowing what they meant.

If you’ve never touched Databricks, this exam is going to be a pain. Even if you’re good at coding, the internal components and the way everything fits together are hard to grasp just by reading. You really need to get your hands dirty in the workspace to get a "feel" for it.

**Study Routine**

I started with the Databricks Academy stuff, but since I’m juggling work and a toddler, I could only study on weekends. This was a disaster because by the next Saturday, I’d already forgotten what I learned the week before. One month before the exam, I ditched the theory and just hammered mock exams.

* Udemy is your friend: I bought practice exams from Derar and Santosh.
* I snagged them at a discounted price. Just wait for the sale if you are not in a hurry.

Personally, Santosh’s exams felt closer to the real thing. I saw maybe 5-6 questions that were almost word-for-word. Derar is also solid; honestly, just solve as many problems as possible.

Since my study time was limited, I focused on reviewing the questions I got wrong. I realized pretty early that Productionizing Data Pipelines was my weak spot. I didn't try to become an expert in it. I just aimed for a 60% "pass" in that section and doubled down on the areas I was actually good at. Don't completely ignore your weak areas though. If you bomb one section too hard, a couple of silly mistakes in other sections will kill your score.

**What's on the exam**

The questions are mostly scenario-based. You have to read the prompts carefully. Some things I remember:

* Auto Loader: This came up a lot.
* DLT (now called Lakeflow Spark Declarative Pipelines): You should understand what it actually does.
* Unity Catalog: Permissions (granting minimum access) and the actual SQL code for it.
* Delta Sharing: Knowing the difference between sharing with Databricks vs. non-Databricks users.
* Egress Costs: How to avoid them in cross-cloud sharing (Cloudflare R2 was the answer for one).
* SQL Warehouses: Classic vs. Pro vs. Serverless. Know when to use which.
* DABs (Databricks Asset Bundles): I got at least 3 questions on this. Don't skip it.
* Medallion Architecture: It’s not just "what is Bronze/Silver/Gold." They’ll give you a scenario and ask which layer the data should go to next.

Also, those "select two" questions are the absolute worst, super confusing.

I know the syllabus is changing on the 4th, so I’m not sure how much of this will still apply. But honestly, if you have some background and get familiar with the core concepts, it’s a very doable exam. I’ve learned a lot through this process. Good luck to everyone preparing!

64Significant_Pace3612w ago
RedditDiscussion

Here are 5 topics that showed up much more than I expected in my DEA exam

I took the Databricks Data Engineer Associate exam recently and wanted to share what actually came up because it was quite different from what I spent most of my time studying. I went in thinking Delta Lake theory and platform architecture would be the big topics. They weren't. The exam is way more practical than I expected.

**The first thing** that caught me off guard was how heavily they test Auto Loader. Not just the basics but real scenarios. One question described a pipeline receiving 50,000 new files per day and asked which ingestion method to use and why. You need to understand when Auto Loader makes sense versus COPY INTO, how schema evolution works with mergeSchema, and the difference between directory listing and file notification mode. I probably got six or seven questions just on this one topic.

**The second thing** was lazy evaluation. I knew the concept but I wasn't prepared for how they test it. They give you a block of code with four or five DataFrame transformations and ask what happens when you run the cell. The answer is nothing happens because there is no action at the end. But the way they frame the questions makes you second-guess yourself if you only memorized the definition without really understanding it.

**Third** was Lakeflow expectations. The old name was Delta Live Tables but they use Lakeflow in the exam now. You need to know the three expectation types and when to use each one. They gave me a scenario where the pipeline should log bad records but never drop them and I had to pick the right expectation decorator. Also know the difference between streaming tables and materialized views because that came up more than once.

**Fourth** was Unity Catalog permissions. Not just the three-level naming pattern but actual grant scenarios. Something like: a data analyst needs to read tables in the sales schema but should not be able to create new tables, and you have to pick the correct grant statement. I got at least three or four questions like this.

**Fifth** was MERGE INTO. They really love this command. Upsert scenarios, deduplication, slowly changing dimensions. If you cannot write a MERGE statement from memory with the WHEN MATCHED and WHEN NOT MATCHED clauses, you should spend an hour practicing just that before you sit for the exam.

What also surprised me was what was not heavily tested. Cluster configuration was maybe one question. The architecture diagrams with control plane and data plane were one or two questions at most. Delta Sharing was one question. Spark internals like shuffle details were barely mentioned.

The biggest thing I wish I had done differently is spend less time reading documentation and more time actually running code. When you have actually executed a MERGE INTO on a real table and seen the results, the exam question feels like something you have done before instead of something you read about once. I used Databricks Free Edition for all my practice and it was more than enough.

Hope this helps someone who is preparing right now. Feel free to ask anything about the exam in the comments and I will try to answer.
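The lazy evaluation point above can be illustrated with a small plain-Python analogy. This is only a sketch, not Spark: `LazyFrame`, `is_large`, and the sample rows are hypothetical names invented for this example. It shows that chaining transformations only records a plan, and that the work happens when an action (`collect` here) is called.

```python
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: transformations are recorded,
    and nothing is computed until an action is called."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: returns a new plan, does not touch the rows.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, func):
        # Transformation: returns a new plan, does not touch the rows.
        return LazyFrame(self._rows, self._steps + (("map", func),))

    def collect(self):
        # Action: only now are the recorded steps executed, in order.
        rows = list(self._rows)
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


calls = []  # records every row the predicate actually inspects


def is_large(row):
    calls.append(row["id"])
    return row["amount"] > 100


orders = LazyFrame([
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 150},
    {"id": 3, "amount": 300},
])

# Two chained transformations, no action yet: the predicate has not run.
plan = orders.filter(is_large).select(lambda r: r["id"])
calls_before_action = len(calls)

# The action triggers execution of the whole recorded plan.
result = plan.collect()
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

In the exam scenario described above, a cell that stops at the `plan = ...` line corresponds to the "nothing happens" answer.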

318InevitableClassic2612w ago
RedditTutorial

No-Code Pipelines: Databricks Lakeflow Designer Demo (4-min demo)

In this demo for Lakeflow Designer you will see me:

- Pulling sales data from Unity Catalog AND from local Excel workbooks
- Building a one big table report and more specialized reports
- Exporting reports to Excel
- Storing outputs back into Unity Catalog tables for use by analysts, BI developers, and even business users using the Databricks' Excel connector
- Setting up a recurring pipeline so that the data is kept fresh automatically
- All of the above without writing any code

I hope you enjoy the video, but more importantly, that you try it out yourself AND give feedback to the folks at Databricks on how to make the product even better. There are still many out-of-the-box pieces not yet available, and I know that the amount of feedback Databricks gets from customers will affect direction and priorities!

31JosueBogran3w ago
Databricks CommunityData Engineering

Lakeflow Connect: Data Ingestion from SQL Server to Databricks

003w ago
RedditDiscussion

Heading into the May 2026 Databricks Data Engineer Associate Exam? Read this first.

So if you've been scrolling through older study guides for the Databricks Data Engineer Associate exam, be careful. The syllabus got a pretty big update this month, and the focus has shifted toward the platform's newer declarative features. I spent some time going through the new guidelines. Here's what I found.

Lakeflow is the new standard. The exam has moved away from manual ETL logic. You need to understand Lakeflow Spark Declarative Pipelines (formerly DLT) and how Streaming Tables and Materialized Views actually differ. If your notes still say "DLT" everywhere, time to update them.

DABs are no longer a side topic. Databricks Asset Bundles (basically infrastructure-as-code for workflows) are now a core part of the exam. They want to see that you can deploy through DABs, not just click around the UI.

Unity Catalog is the default assumption. No more legacy Hive Metastore questions. The exam lives in a UC-enabled world now. Three-tier namespace (catalog.schema.table), Volumes for unstructured data, column-level lineage: that's where your time should go.

Serverless Compute is showing up more. When do you pick Serverless SQL Warehouses or Serverless Jobs over classic clusters? That tradeoff (less config overhead vs. less control) is fair game now.

The weightings that surprised me:

→ 31% on Processing (Lakeflow, Spark, Streaming Tables)
→ 18% on Productionizing (DABs, Workflows, deployment)

That's almost half the exam right there.

Honestly, if you just understand why Databricks is pushing toward declarative tools, letting the platform handle the boring parts so you can focus on the actual logic, a lot of the questions start to make sense.

For practice material, BricksNotes has an updated practice test that follows the May 2026 format: 45 questions, 90 minutes, same weightings.

→ [bricksnotes.com/blog/databricks-data-engineer-associate-new-exam-guide-may-2026](http://bricksnotes.com/blog/databricks-data-engineer-associate-new-exam-guide-may-2026)

Good luck to everyone testing this month! Drop questions below if you're stuck on any of the new topics. Happy to help where I can.

104InevitableClassic2613w ago
Databricks CommunityGet Started Discussions

Lakeflow jobs file trigger thru overwritten files

003w ago
RedditGeneral

System Tables are... overcomplicated? + some helpers

Had a chance to play with System Tables a bit more in the last 2 weeks. Every meaningful query takes 70-100+ lines of SQL due to horrendous design decisions, and even Genie Code makes mistakes all the time when writing these.

**SCD2 tables lack the basic timestamps**

The advertised SCD2-like tables (jobs/tasks/clusters/warehouses) lack basic timestamping functionality, like `__start_time` / `__end_time`. To use them properly, one has to use a window function every single time. It's even more surprising considering how Databricks promotes the autoCDC functionality, which adds these by default.

**No SCD1-like VIEWs**

One has to remember to use `ROW_NUMBER()` on SCD2-like tables, or suffer from duplicates post-JOIN.

**job runs / job task runs slicing**

These are inconsistent with `system.query.history`. The former emit hourly slices while the latter updates the already emitted row in place. Every time a job run time is needed, one has to use a GROUP BY. Additionally, the `compute_ids` column on job runs doesn't contain all computes attached to its tasks. It’s a documented flaw, but still.

Is there any good source for SQL queries against System Tables? The jobs system tables documentation seems to be the only place that lists anything more complicated, but it's still lacking basics like AVG CPU usage per job run, together with the cluster/worker configuration at runtime (i.e., how many workers the autoscaling scaled to, etc.): [https://docs.databricks.com/aws/en/admin/system-tables/jobs](https://docs.databricks.com/aws/en/admin/system-tables/jobs)

Maybe some of you will find these helpful. I wish we were able to create views inside the `system` catalog.
First part - 4 SCD1 tables for jobs/tasks/clusters/warehouses, then 2 SCD1 tables for job runs / task runs with additional `run_start`, `run_end`, `run_last_seen`, `run_duration_seconds`, and `retries` columns

```sql
CREATE OR REPLACE VIEW shared_prod.system.jobs_scd1 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) rn
  FROM system.lakeflow.jobs
)
WHERE rn = 1;

CREATE OR REPLACE VIEW shared_prod.system.tasks_scd1 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id, task_key ORDER BY change_time DESC) rn
  FROM system.lakeflow.job_tasks
)
WHERE rn = 1;

CREATE OR REPLACE VIEW shared_prod.system.warehouses_scd1 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, warehouse_id ORDER BY change_time DESC) rn
  FROM system.compute.warehouses
)
WHERE rn = 1;

CREATE OR REPLACE VIEW shared_prod.system.clusters_scd1 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) rn
  FROM system.compute.clusters
)
WHERE rn = 1;
```

and same for job runs / job task runs

```sql
CREATE OR REPLACE VIEW shared_prod.system.job_run_timeline_scd1 AS
WITH base AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id, run_id ORDER BY period_end_time DESC) rn
  FROM system.lakeflow.job_run_timeline
  QUALIFY rn = 1
),
agg AS (
  SELECT
    workspace_id,
    job_id,
    run_id,
    MIN(period_start_time) AS run_start,
    MAX(period_end_time) AS run_last_seen,
    SUM(CASE WHEN result_state IS NOT NULL THEN 1 ELSE 0 END)
      - CASE WHEN MAX_BY(result_state, period_end_time) IS NOT NULL THEN 1 ELSE 0 END AS retries,
    CASE WHEN MAX_BY(result_state, period_end_time) IS NOT NULL THEN MAX(period_end_time) ELSE NULL END AS run_end
  FROM system.lakeflow.job_run_timeline
  GROUP BY ALL
)
SELECT
  b.* EXCEPT(rn, period_start_time, period_end_time),
  a.run_start,
  a.run_end,
  a.run_last_seen,
  a.retries,
  TIMESTAMPDIFF(SECOND, a.run_start, a.run_end) AS run_duration,
  (a.run_end IS NULL) AS is_running
FROM […truncated]
```
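To see why the `ROW_NUMBER()` step matters before joining, here is a minimal local sketch using Python's built-in `sqlite3` rather than Databricks. The `jobs` and `runs` tables and their rows are made up for illustration and only loosely mimic the shape of the system tables. It assumes the bundled SQLite supports window functions (3.25 or newer), and it lists columns explicitly because SQLite has no `SELECT * EXCEPT(...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical SCD2-like table: one row per change, no validity interval.
conn.execute(
    "CREATE TABLE jobs (workspace_id INTEGER, job_id INTEGER, name TEXT, change_time TEXT)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [
        (1, 10, "nightly_load", "2025-01-01T00:00:00"),
        (1, 10, "nightly_load_v2", "2025-02-01T00:00:00"),  # job 10 was renamed
        (1, 20, "hourly_sync", "2025-01-15T00:00:00"),
    ],
)

# Hypothetical fact table: one run per job.
conn.execute("CREATE TABLE runs (workspace_id INTEGER, job_id INTEGER, run_id INTEGER)")
conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", [(1, 10, 100), (1, 20, 200)])

# Same pattern as the views above: keep only the latest row per key.
latest_jobs_sql = """
SELECT workspace_id, job_id, name, change_time
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY workspace_id, job_id ORDER BY change_time DESC
    ) AS rn
    FROM jobs
)
WHERE rn = 1
"""

latest_rows = conn.execute(latest_jobs_sql + " ORDER BY job_id").fetchall()

# Joining the raw SCD2-like table duplicates the run of the renamed job.
naive_join_count = conn.execute(
    "SELECT COUNT(*) FROM runs r "
    "JOIN jobs j ON r.workspace_id = j.workspace_id AND r.job_id = j.job_id"
).fetchone()[0]

# Joining the deduplicated rows keeps one row per run.
dedup_join_count = conn.execute(
    "SELECT COUNT(*) FROM runs r "
    f"JOIN ({latest_jobs_sql}) j ON r.workspace_id = j.workspace_id AND r.job_id = j.job_id"
).fetchone()[0]

print(latest_rows)
print(naive_join_count, dedup_join_count)
```

With this sample data, the naive join returns more rows than there are runs, which is the duplicate problem the SCD1-style views are meant to avoid.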

73Own-Trade-22433w ago