Databricks SQL
Recent items mentioning Databricks SQL across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Databricks has launched a new learning pathway for SQL practitioners on Databricks Academy, covering ETL, data modeling, semantic layers, and conversational agents [1]. For demanding workloads, 5XL SQL Warehouses are now available [3], and users are seeing performance boosts by switching to Serverless SQL Warehouses for tools like Power BI [7]. Additionally, Materialized Views and Streaming Tables are in beta for Serverless Notebooks [2].
Generated daily from the 10 most recent items mentioning Databricks SQL. Click any [N] to jump to the source.
Announcing the Databricks analytics engineer learning pathway
A new learning pathway for Databricks SQL practitioners is now available on Databricks Academy, covering skills to use the full SQL ETL toolkit for data modeling, pipelines, semantic layers, and conversational agents. Courses are offered in self-paced and instructor-led formats, and are included with any active Databricks Learning Subscription.
Beta alert: Materialized Views and Streaming Tables in Serverless Notebooks
Hi folks, wanted to share a new feature that's in [beta](https://docs.databricks.com/aws/en/ldp/dbsql/compute#serverless-general-compute): creating and refreshing materialized views and streaming tables from serverless compute! Users can create MVs natively in SQL or using `spark.sql("CREATE MATERIALIZED VIEW test_mv AS SELECT * from samples.wanderbricks.booking_updates")` in their notebooks and jobs attached to serverless compute. Workspace admins can enable the beta feature, "MV and ST in Serverless Notebooks and Jobs", in their preview settings. It’s currently available in [select regions](https://docs.databricks.com/aws/en/resources/feature-region-support#serverless-aws). Would love to hear y'all's feedback!
Introducing 5XL SQL Warehouses: A Practical Guide to Meeting SLAs for Your Most Demanding Workloads
Modular structure for Databricks Apps (Streamlit)
Hey, I wanted to share something that's been bugging me for a while and get your take. The official Databricks Streamlit tutorial puts everything into a single **app.py**. Fine for a demo. But the moment a real internal app grows past ~500–600 lines, it stops being fun:

* Two people on the team touch the same file → merge conflicts every PR.
* Hard to write unit tests when UI, data access, and business logic live in one module.
* Git diffs become unreadable, and code review suffers.
* When I point Cursor/Claude at the repo, it has to re-read the whole monolith on every prompt. Context window and cost both balloon.

So I refactored our internal template into something more boring and modular:

    app.py          # entry point only, routing
    pages/
    ├── home.py
    ├── analytics.py
    └── settings.py
    components/     # reusable UI bits
    services/       # SQL warehouse / UC / SDK calls
    assets/
    ├── styles.css
    └── logo.png
    tests/

*This is my own repo, not a product. Sharing because the single-file pattern bit us hard, and I figured others might find it useful:* [https://github.com/protmaks/databricks_apps_streamlit_mod_template](https://github.com/protmaks/databricks_apps_streamlit_mod_template)
Serverless SQL Warehouses Strategy
Hi, we're a big industrial company and have some pretty diverse use cases in terms of data volume, speed requirements etc. Many of them are quite sporadic (serving data to Power BI dashboards which are queried a few times per day, but need to be performant then). We are currently thinking about how to provision Serverless SQL Warehouses to our users. How do you do this in your companies:

- Do you have one (or a few) larger warehouses that serve all different use cases? Or
- Do you create / have users create their own warehouses per use case?
- Or do you use a/multiple shared classic warehouses running 24/7?

Cost allocation wise the latter one is easier to track, but from a compute cost point of view I imagine the former one is probably more efficient?
Why You Cannot Choose the SQL Warehouse in Databricks Chat & Assistant Features?
How Switching from JDBC/ODBC Clusters to Serverless SQL Warehouses Boosted Our Power BI Performance
pivot() workarounds in Lakeflow Spark Declarative Pipelines
Problem: In Lakeflow Spark Declarative Pipelines, the `pivot()` function is not supported. The `pivot` operation in Spark requires the eager loading of input data to compute the output schema. This capability is not supported in pipelines.

Source: [https://docs.databricks.com/aws/en/ldp/limitations](https://docs.databricks.com/aws/en/ldp/limitations)

# How can this be mitigated?

**Workaround 1: Rewrite PIVOT Using CASE WHEN**

This is the most common workaround. You manually expand the pivot into conditional aggregations.

> Original query:

    SELECT *
    FROM sales_data
    PIVOT (
      SUM(sales)
      FOR region IN ('North', 'South', 'East', 'West')
    )

> Rewritten without PIVOT:

    SELECT
      product,
      SUM(CASE WHEN region = 'North' THEN sales ELSE 0 END) AS North,
      SUM(CASE WHEN region = 'South' THEN sales ELSE 0 END) AS South,
      SUM(CASE WHEN region = 'East' THEN sales ELSE 0 END) AS East,
      SUM(CASE WHEN region = 'West' THEN sales ELSE 0 END) AS West
    FROM sales_data
    GROUP BY product

This works in Lakeflow pipelines because the output schema is fully determined at parse time, so no eager data loading is required.

**Workaround 2: Rewrite PIVOT Using aggregate FILTER**

Databricks SQL supports the `FILTER(WHERE ...)` clause on aggregates, which is a cleaner alternative to CASE WHEN:

> Original PIVOT query:

    SELECT year, region, q1, q2, q3, q4
    FROM sales
    PIVOT (
      SUM(sales) AS sales
      FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4)
    )

> Rewritten with FILTER:

    SELECT
      year,
      region,
      SUM(sales) FILTER(WHERE quarter = 1) AS q1,
      SUM(sales) FILTER(WHERE quarter = 2) AS q2,
      SUM(sales) FILTER(WHERE quarter = 3) AS q3,
      SUM(sales) FILTER(WHERE quarter = 4) AS q4
    FROM sales
    GROUP BY year, region

This syntax is often more readable than repeated CASE WHEN expressions, especially with multiple aggregations.

**Multi-Column PIVOT Rewrite**

> For pivoting on multiple columns simultaneously:

    SELECT *
    FROM sales
    PIVOT (
      SUM(sales) AS sales
      FOR (quarter, region) IN (
        (1, 'east') AS q1_east, (1, 'west') AS q1_west,
        (2, 'east') AS q2_east, (2, 'west') AS q2_west
      )
    )

> Rewritten:

    SELECT
      year,
      SUM(sales) FILTER(WHERE quarter = 1 AND region = 'east') AS q1_east,
      SUM(sales) FILTER(WHERE quarter = 1 AND region = 'west') AS q1_west,
      SUM(sales) FILTER(WHERE quarter = 2 AND region = 'east') AS q2_east,
      SUM(sales) FILTER(WHERE quarter = 2 AND region = 'west') AS q2_west
    FROM sales
    GROUP BY year

**Multiple Aggregations**

You can also rewrite PIVOTs that use multiple aggregate functions.

> Original query:

    SELECT *
    FROM (SELECT year, quarter, sales FROM sales) AS s
    PIVOT (
      SUM(sales) AS total, AVG(sales) AS avg
      FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4)
    )

> Rewritten:

    SELECT
      year,
      SUM(sales) FILTER(WHERE quarter = 1) AS q1_total,
      AVG(sales) FILTER(WHERE quarter = 1) AS q1_avg,
      SUM(sales) FILTER(WHERE quarter = 2) AS q2_total,
      AVG(sales) FILTER(WHERE quarter = 2) AS q2_avg,
      SUM(sales) FILTER(WHERE quarter = 3) AS q3_total,
      AVG(sales) FILTER(WHERE quarter = 3) AS q3_avg,
      SUM(sales) FILTER(WHERE quarter = 4) AS q4_total,
      AVG(sales) FILTER(WHERE quarter = 4) AS q4_avg
    FROM sales
    GROUP BY year

**Summary**

Both approaches work fully within SDP pipelines with complete lineage tracking. They return the same results as PIVOT, with one caveat: because of `ELSE 0`, the CASE WHEN version returns 0 where PIVOT and FILTER return NULL for a group with no matching rows.
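The FILTER rewrites above follow a mechanical pattern, so the repetitive column list can be generated rather than typed by hand. The following is a minimal sketch in plain Python, not part of the original post: the helper names `sql_literal` and `pivot_as_filter_sql` are hypothetical, and the function only builds a SQL string (it does not connect to Databricks or validate identifiers).

```python
from typing import Sequence, Tuple, Union

Literal = Union[int, float, str]


def sql_literal(value: Literal) -> str:
    """Render a Python value as a SQL literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    escaped = str(value).replace("'", "''")  # double single quotes inside strings
    return f"'{escaped}'"


def pivot_as_filter_sql(
    table: str,
    group_by: Sequence[str],
    pivot_col: str,
    agg: str,
    value_col: str,
    values: Sequence[Tuple[Literal, str]],
) -> str:
    """Build a PIVOT-free query with one aggregate FILTER column per pivot value.

    `values` is a sequence of (pivot value, output column alias) pairs.
    Identifiers are inserted as-is, so only pass trusted names.
    """
    select_items = list(group_by)
    for literal, alias in values:
        select_items.append(
            f"{agg}({value_col}) FILTER(WHERE {pivot_col} = {sql_literal(literal)}) AS {alias}"
        )
    columns = ",\n  ".join(select_items)
    return f"SELECT\n  {columns}\nFROM {table}\nGROUP BY {', '.join(group_by)}"


# Reproduces the shape of the "Rewritten with FILTER" query from Workaround 2.
query = pivot_as_filter_sql(
    table="sales",
    group_by=["year", "region"],
    pivot_col="quarter",
    agg="SUM",
    value_col="sales",
    values=[(1, "q1"), (2, "q2"), (3, "q3"), (4, "q4")],
)
print(query)
```

Because the pivot values are passed in explicitly, the output schema is still fixed when the query text is written, which is the property the workaround relies on.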
Live Cost Estimator
I'm building a **live cost estimator** that doesn't have to wait for the system tables or billing data to update. It gives me immediate cost feedback every second, and I'm sharing the development journey on YouTube. I already have live cost estimates for **all-purpose clusters, SQL warehouses and interactive serverless compute.** I would love some feedback and suggestions, and if you want to try it out or contribute, let me know!
MLflow 3.12.0 introduces multimodal tracing, allowing storage and rich rendering of PDFs, audio, and images as artifact attachments in tracing spans. It also adds AI Gateway guardrails to prevent unsafe model inputs/outputs and extends coding agent tracing support to Codex, Gemini, and Qwen.
Why does the same Databricks SQL query take different time to run?
Pricing for Genie Code: Cluster usage vs. LLM tokens?
Hi everyone, I’m looking into implementing **Databricks Genie Code Agent** in our workspace and I have a question regarding the billing model. My company currently keeps a cluster (SQL Warehouse) running throughout the day. When using Genie Code to ask questions or generate logic, how exactly is the cost calculated?

* **Is it just the compute cost?** Since our cluster is already active, does Genie simply "consume" those existing resources to run the generated queries?
* **Are there extra LLM costs?** Does Databricks charge a separate fee for the LLM tokens (input/output) used to process natural language, or is the model usage included in the platform fee?

Basically, I want to know if using Genie heavily will result in a surprise bill for "AI Tokens" or if it stays within the standard DBU consumption of our active warehouses. Thanks in advance!
Marimo on Databricks
My workflow for a long time involved switching back and forth between VS Code and the browser/Databricks UI. I like to write my "production code" in normal Python, but notebooks are great for exploration, spikes, visualization, triage etc. I could write a small dissertation, but for various reasons I don't really like Jupyter, and Databricks notebooks have their own problems with commented magic commands etc.

This led me to check out [marimo](https://marimo.io/), and wow, these are so cool. Code that runs in normal Python, merges cleanly, has visualizations and widgets, the app runs locally and doesn't glitch out, and even the VS Code extension works nicely.

The problem was, the Databricks support wasn't great. It just felt a bit dated. It required a warehouse for SQL, didn't seem to really support serverless, and there were just so many opportunities to plug Databricks into marimo.

This led me to create [marimo-databricks-connect](https://github.com/brookpatten/marimo-databricks-connect) ([pypi](https://pypi.org/project/marimo-databricks-connect/)). I tried to plug "all the things" Databricks into the place where they go in marimo. I'm pretty happy with the result.

- Connect to Databricks using databricks-connect & Spark (not a SQL warehouse)
- Authenticate/configure Spark using the default databricks-connect process (env vars, .databrickscfg etc), no additional auth config
- Execution of both Python & SQL cells
- Autocomplete of catalog/schema/table/column names
- Browsing of catalogs/schemas/tables/columns in the marimo data sources view
- Browsing of external locations, volumes, DBFS, and workspace in the marimo storage browser
- Notebook widgets to monitor and control specific instances of Databricks capabilities (clusters, workflows, vector search, apps etc)
- Widgets to browse & explore Databricks capabilities (compute, workflows, Unity Catalog)
- Works in local marimo (`marimo edit notebook.py`) and in the VS Code extension
- Deploy as a Databricks app to provide an alternative web-based marimo UI

I'm working on adding serving endpoints as AI providers to the notebooks too.

In particular, what I like to use this for is creating "command center" notebooks for given processes that can include some normal PySpark/SQL code to query/triage, widgets to monitor/control various Databricks resources, visualizations to monitor DQ etc.

I just wanted to share and see what the community thinks. Would you use it? Contributions are welcome. Throwaway account because I'm doxing myself via the GitHub repo.
[Passed] Databricks DEA Exam today
Just walked out of the exam and I’m glad to say I passed. I was sweating a bit because the exam content changes on the 4th, so I really didn't want to fail and have to deal with a new syllabus.

I've had Databricks at work since late 2023. I’ve been using it because, well, it’s there, but I was mostly just "vibe coding", picking up some Python and Spark here and there without any real depth. I ran jobs using whatever cluster settings the company gave me without actually knowing what they meant.

If you’ve never touched Databricks, this exam is going to be a pain. Even if you’re good at coding, the internal components and the way everything fits together are hard to grasp just by reading. You really need to get your hands dirty in the workspace to get a "feel" for it.

**Study Routine**

I started with the Databricks Academy stuff, but since I’m juggling work and a toddler, I could only study on weekends. This was a disaster because by the next Saturday, I’d already forgotten what I learned the week before. One month before the exam, I ditched the theory and just hammered mock exams.

* Udemy is your friend: I bought practice exams from Derar and Santosh.
* I snagged them at a discounted price. Just wait for the sale if you are not in a hurry.

Personally, Santosh’s exams felt closer to the real thing. I saw maybe 5-6 questions that were almost word-for-word. Derar is also solid; honestly, just solve as many problems as possible.

Since my study time was limited, I focused on reviewing the questions I got wrong. I realized pretty early that Productionizing Data Pipelines was my weak spot. I didn't try to become an expert in it. I just aimed for a 60% "pass" in that section and doubled down on the areas I was actually good at. Don't completely ignore your weak areas though. If you bomb one section too hard, a couple of silly mistakes in other sections will kill your score.

**What's on the exam**

The questions are mostly scenario-based. You have to read the prompts carefully. Some things I remember:

* Autoloader: This came up a lot.
* DLT (now called Lakeflow Spark Declarative Pipelines): You should understand what it actually does.
* Unity Catalog: Permissions (granting minimum access) and the actual SQL code for it.
* Delta Sharing: Knowing the difference between sharing with Databricks vs. non-Databricks users.
* Egress Costs: How to avoid them in cross-cloud sharing (Cloudflare R2 was the answer for one).
* SQL Warehouses: Classic vs. Pro vs. Serverless. Know when to use which.
* DABs (Databricks Asset Bundles): I got at least 3 questions on this. Don't skip it.
* Medallion Architecture: It’s not just "what is Bronze/Silver/Gold." They’ll give you a scenario and ask which layer the data should go to next.

Also, those "select two" questions are the absolute worst, super confusing.

I know the syllabus is changing on the 4th, so I’m not sure how much of this will still apply. But honestly, if you have some background and get familiar with the core concepts, it’s a very doable exam. I’ve learned a lot through this process. Good luck to everyone preparing!
New Databricks Apps: What About Cost at Scale?
I’ve been looking into the new Databricks Apps compute model, and I have one concern. From what I understand, each Databricks App now runs with its own dedicated app compute, rather than simply relying on a shared SQL Warehouse as the main execution layer. I’m wondering what this means at scale. If an organization has dozens or even hundreds of small internal apps, could this become significantly more expensive if each app requires its own compute, instead of all of them sharing a single serverless SQL warehouse that can scale to zero, as before? I’d be interested to hear how others are approaching this: Are you consolidating multiple use cases into fewer apps, stopping unused apps, or using another pattern to control costs?
Tried the Lovable + Databricks connector on a hackathon project
I originally thought the Lovable/Databricks connector was kind of a gimmick. Then I had a hackathon project where all the heavy lifting was in Databricks (data processing, enrichment, a bit of ML), but the result had to be shown as a simple app for non-technical users. Tried Lovable mostly out of curiosity, and honestly, it worked better than I expected for an MVP.

A couple of practical notes in case anyone else tests it:

* service principal needs access not just to the data, but also to the SQL warehouse / compute
* I got it working fine on Databricks Free Edition
* if you don’t cache responses, repeated queries can get expensive fast because you’re paying for warehouse runtime

I still wouldn’t treat this as my default production setup, but for demos / internal prototypes / idea validation, it was surprisingly useful.

I wrote a short article with examples: [https://medium.com/@protmaks/databricks-lovable-a-practical-case-study-and-what-it-costs-to-build-an-app-085f61b07126](https://medium.com/@protmaks/databricks-lovable-a-practical-case-study-and-what-it-costs-to-build-an-app-085f61b07126)
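The caching note in the post can be illustrated with a small time-to-live cache placed in front of whatever function actually sends SQL to the warehouse. This is a sketch under stated assumptions rather than anything from the post or the connector: `TTLQueryCache` and `fake_run_query` are hypothetical names, and the fake query function stands in for a real warehouse call so the example runs anywhere.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLQueryCache:
    """Cache query results by SQL text for a fixed number of seconds."""

    def __init__(
        self,
        run_query: Callable[[str], Any],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, sql: str) -> Any:
        """Return a cached result if it is still fresh, else run the query."""
        now = self._clock()
        hit = self._entries.get(sql)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        result = self._run_query(sql)
        self._entries[sql] = (now, result)
        return result


# Stand-in for a real warehouse call; records every query that actually runs.
calls = []


def fake_run_query(sql: str):
    calls.append(sql)
    return [("row", len(calls))]


# A controllable clock makes expiry easy to demonstrate without sleeping.
fake_now = [0.0]
cache = TTLQueryCache(fake_run_query, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("SELECT 1")   # runs the query
second = cache.get("SELECT 1")  # served from the cache
fake_now[0] = 61.0
third = cache.get("SELECT 1")   # entry expired, runs the query again
```

The trade-off is staleness: a longer TTL means fewer warehouse queries but older data on screen, so the value would need tuning per app.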
Getting started with multi table transactions in Databricks SQL
Transactions let you coordinate operations across multiple SQL statements and tables. All changes succeed together or roll back together, ensuring data consistency across your operations and tables.
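The all-or-nothing behaviour described above can be seen in a small, generic example. This sketch uses Python's built-in `sqlite3` module as a stand-in, since the source does not show Databricks SQL transaction syntax; the `orders` and `inventory` tables and the `place_order` helper are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL CHECK (qty >= 0))"
)
conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
conn.commit()


def place_order(conn: sqlite3.Connection, order_id: int, sku: str, qty: int, amount: int) -> bool:
    """Insert an order and decrement stock as one unit of work."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
            conn.execute("UPDATE inventory SET qty = qty - ? WHERE sku = ?", (qty, sku))
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_order(conn, 1, "widget", 1, 100)      # stock goes from 1 to 0
failed = place_order(conn, 2, "widget", 1, 100)  # stock would go negative, CHECK fails

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
qty_left = conn.execute("SELECT qty FROM inventory WHERE sku = 'widget'").fetchone()[0]
```

In the second call the `INSERT` into `orders` succeeds on its own, but it is undone when the `UPDATE` on `inventory` fails, so neither table keeps a partial change.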
Heading into the May 2026 Databricks Data Engineer Associate Exam? Read this first.
So if you've been scrolling through older study guides for the Databricks Data Engineer Associate exam, be careful. The syllabus got a pretty big update this month, and the focus has shifted toward the platform's newer declarative features. I spent some time going through the new guidelines. Here's what I found.

Lakeflow is the new standard. The exam has moved away from manual ETL logic. You need to understand Lakeflow Spark Declarative Pipelines (formerly DLT) and how Streaming Tables and Materialized Views actually differ. If your notes still say "DLT" everywhere, time to update them.

DABs are no longer a side topic. Databricks Asset Bundles (basically infrastructure-as-code for workflows) is now a core part of the exam. They want to see that you can deploy through DABs, not just click around the UI.

Unity Catalog is the default assumption. No more legacy Hive Metastore questions. The exam lives in a UC-enabled world now. Three-tier namespace (catalog.schema.table), Volumes for unstructured data, column-level lineage: that's where your time should go.

Serverless Compute is showing up more. When do you pick Serverless SQL Warehouses or Serverless Jobs over classic clusters? That tradeoff (less config overhead vs. less control) is fair game now.

The weightings that surprised me:

→ 31% on Processing (Lakeflow, Spark, Streaming Tables)
→ 18% on Productionizing (DABs, Workflows, deployment)

That's almost half the exam right there.

Honestly, if you just understand why Databricks is pushing toward declarative tools, letting the platform handle the boring parts so you can focus on the actual logic, a lot of the questions start to make sense.

For practice material, BricksNotes has an updated practice test that follows the May 2026 format: 45 questions, 90 minutes, same weightings.

→ [bricksnotes.com/blog/databricks-data-engineer-associate-new-exam-guide-may-2026](http://bricksnotes.com/blog/databricks-data-engineer-associate-new-exam-guide-may-2026)

Good luck to everyone testing this month! Drop questions below if you're stuck on any of the new topics. Happy to help where I can.
Tutorials: Lakebase - OLTP Workloads on Databricks!
Lakebase is a fully managed, serverless PostgreSQL offering from Databricks that decouples compute and storage, enabling independent scaling, auto-scaling to zero, and deep integration with the Databricks Lakehouse. It supports reverse ETL to bring data from the Lakehouse into Lakebase for OLTP applications and forward ETL to sync transactional data back to the Lakehouse for analytics.
News: Stop Guessing Table Health - Let These Dashboards Tell You
Databricks offers two dashboards for monitoring table health and access: the Table Access Advisor and the Table Health Advisor. These dashboards provide insights into table ownership, read/write patterns, staleness, optimization status, and underlying file structures, helping users identify ghost tables and ensure best practices.
News: Databricks News: Free Tier, Multi-statement transactions, Declarative Automation Bundles, Genie Code
Databricks now offers a free tier for Lakeflow Connect, providing 100 DBUs per day per workspace, and has introduced multi-statement transactions in Unity Catalog that ensure atomicity with rollback capabilities. The platform also announced a Databricks One mobile app, a new AI runtime with pre-installed tools for GPU use cases, and enhanced Genie Code that understands project structure for automated development tasks. Additionally, Databricks Asset Bundles are now called Declarative Automation Bundles and use a faster direct engine, and a new 5X-Large SQL warehouse is available for processing terabytes of data.
Tutorials: Databricks End-To-End Project | Zero-To-Expert | Streaming, AI, Lakeflow, Unity Catalog, AI/BI
This video demonstrates building an end-to-end restaurant analytics platform on Databricks, covering streaming and batch data ingestion, AI-powered sentiment analysis, and dashboard creation. It teaches how to use Unity Catalog, Lakeflow Connect for CDC, Spark Declarative Pipelines for real-time data from Event Hub, and how to construct a medallion architecture with fact and dimension tables.
News: Databricks Breaking News: 2026 Week 6: 2 February 2026 to 8 February 2026
Databricks introduces agentic data quality monitoring with anomaly detection, LLM judge UI builder for MLflow, and new SQL warehouse features including a default option and activity details. The platform also enhances its assistant to connect with MCP servers, improves Google Sheets integration with pivot table functionality, and adds direct Git deployment and tagging for Databricks apps.
SQL warehouses now support "5X-Large" cluster sizes and a higher maximum of 40 clusters. This release also fixes permanent drift for external model credentials in databricks_model_serving and improves dashboard file content change detection.
News: Claude Code: 5 Essentials for Data Engineering
The video introduces five essential concepts for using Claude Code in data engineering: the CLAUDE.md file for core project information, skills for packaging expertise, commands for predefined prompts, sub-agents for focused tasks, and Model Context Protocol (MCP) for standardized tool interaction. These components help manage context and memory for effective AI-enhanced development.
News: Databricks: What’s new in September 2025? #databricks
Databricks now supports geospatial data types (geography and geometry) with new functions for visualization and spatial operations, and introduces serverless GPU clusters for distributed GPU code execution. The platform also offers enhanced notebook features like side-by-side editing and a notebook-specific search, along with new options for managing serverless environments, SQL warehouses, and access requests in Unity Catalog.
Events: Introducing Lakebridge: Free, Open Data Migration to Databricks SQL
Lakebridge is a free, open, AI-powered tool for migrating data warehouses to Databricks SQL. It works by analyzing the existing environment, converting code using an LLM, migrating data, and then reconciling to validate the migration.
News: Crypto at Scale: Building a High-Performance Platform for Real-Time Blockchain Data
Tutorials: Healthcare Interoperability: End-to-End Streaming FHIR Pipelines With Databricks & Redox
Tutorials: Databricks Metrics - create a semantic layer and improve data engineering
UCX now requires matching account groups to be created before assessment and clarifies Service Principal setup for installation. It also fixes table migration when a default catalog is set and pauses the migration progress workflow schedule by default.
Tutorials: 42 Streaming Tables and Materialized Views in DBSQL | Background Working | Schedule data Refresh
Tutorials: How to read files with Databricks SQL # 5/6 of file handling series
News: Learnings From the Field: Migration From Oracle DW and IBM DataStage to Databricks on AWS
News: Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM
News: Real-Time Reporting and Analytics for Construction Data Powered by Delta Lake and DBSQL
News: Databricks and Delta Lake: Lessons Learned from Building Akamai's Web Security Analytics Product
Tutorials: Databricks SQL Serverless Under the Hood: How We Use ML to Get the Best Price/Performance