Unity Catalog
Recent items mentioning Unity Catalog across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Unity Catalog recently announced the General Availability (GA) of new Attribute-Based Access Control (ABAC) and data classification tools, including row filtering and column masking [1][2]. Interoperability has also expanded significantly, with new Open APIs allowing external engines like Apache Spark, Flink, and DuckDB to create, read, and write to UC managed Delta tables [4][10], and enabling processing of unstructured data in volumes [9]. Additionally, Unity Catalog continues to power governed data access for various use cases, from real-time intelligence for regulated data in financial services [7] to direct integration with Google Sheets for business users [8].
Generated daily from the 10 most recent items mentioning Unity Catalog. Click any [N] to jump to the source.
Unity Catalog: New ABAC & Data Classification Tools Now GA
Govern Once, Protect Everywhere: ABAC Row Filtering and Column Masking Is GA in Unity Catalog
The question your commercial data should already be able to answer
Databricks and Veeva now embed Genie AI agents and AI/BI dashboards directly into Veeva Vault CRM, enabling life sciences commercial teams to get real-time answers to their questions without leaving their workflow. This unified Databricks lakehouse with Unity Catalog delivers governed commercial data to every persona, from sales reps to MSLs, in the format and depth their role requires.
Expanded interoperability with Unity Catalog Open APIs
Just saw Databricks’ post about the expanded interoperability in Unity Catalog with the new open APIs. Looks like a pretty interesting step toward making Unity Catalog even more open with other tools and platforms. Has anyone had a chance to try out the new API yet? Curious to hear how it’s working in practice.
Unity Catalog OSS + S3 Issue: "S3 bucket configuration not found"
Hey team, I’m currently facing the following issue while working with Unity Catalog OSS + S3: `org.apache.spark.sql.execution.QueryExecutionException: io.unitycatalog.client.ApiException: generateTemporaryPathCredentials call failed with: 400 - {"error_code":"FAILED_PRECONDITION","message":"S3 bucket configuration not found."}` Has anyone here faced this before? If yes, could you share what configuration was missing or how you resolved it?
PipelineIQ: Forward‑Looking Sales Intelligence That Drives Action
Your CRM data is a mess. Everyone knows it. Most AI tools pretend it isn't. Databricks took a different approach with PipelineIQ: instead of building yet another forecasting model that assumes clean data (spoiler: it never is), they built a prescriptive action engine that works with the chaos. The result? Every deal in the pipeline gets one of three verdicts:

* 🚶 Walk - disengage, this isn't worth your time
* 🔄 Pivot - viable deal, wrong approach
* 🚀 Accelerate - conditions are right, lean in now

No vague "insights." No dashboards that require a PhD to interpret. Just: here's what to do today. Built on Databricks' own stack (Foundation Model APIs, Delta Lake, Unity Catalog) and used internally by their own sales org, this is a rare "we built it for ourselves first" story. Read the full blog here: [https://www.databricks.com/blog/pipelineiq-forward-looking-sales-intelligence-drives-action](https://www.databricks.com/blog/pipelineiq-forward-looking-sales-intelligence-drives-action)
News: AIA Group x Databricks: Turning Regulated Data into Real-Time Intelligence
AIA Group leverages Databricks to manage regulated data across 18 markets, addressing challenges like data residency and varying tech maturity with features like Unity Catalog for governance. The platform enables real-time intelligence for investment decisions, fraud detection, and personalized agent coaching, with future plans for conversational analytics and autonomous AI.
Tutorials: Connect Google Sheets to Databricks
The Databricks Google Sheets add-in allows users to explore, import, and refresh governed data from the Databricks Lakehouse directly within Google Sheets. It demonstrates how to browse Unity Catalog, select tables or metric views, apply filters, schedule data refreshes, and use direct SQL queries with parameters.
Processing Unstructured Data in Volumes with Unity Catalog Open APIs
Expanded interoperability with Unity Catalog Open APIs. External engines like Apache Spark, Flink, and DuckDB can now create, read, and write to UC managed Delta tables.
News: No More Table Locks for Multi Statement Transactions #databricks #dataengineering #sql
Databricks now supports multi-table transactions, allowing changes to multiple tables within a single atomic transaction that rolls back all changes if any part fails. This feature, managed by Unity Catalog, prevents table locking during updates and supports up to 100 tables per transaction using a simple "BEGIN ATOMIC...END" syntax.
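The all-or-nothing behaviour described above can be illustrated with a small sketch. This is not Databricks syntax: it uses SQLite from the Python standard library as a stand-in, and the `inventory` and `orders` tables and the `place_order` helper are hypothetical names chosen for the example. It only shows the general idea that a failure in one statement undoes changes already made to another table in the same transaction.

```python
import sqlite3


def place_order(conn, order_id, sku, qty):
    """Update two tables atomically; return True on commit, False on rollback."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE inventory SET on_hand = on_hand - ? WHERE sku = ?",
                (qty, sku),
            )
            conn.execute(
                "INSERT INTO orders (order_id, sku, qty) VALUES (?, ?, ?)",
                (order_id, sku, qty),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, on_hand INTEGER)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
conn.commit()

first = place_order(conn, 1, "widget", 3)   # both statements succeed
second = place_order(conn, 1, "widget", 1)  # duplicate order_id: the INSERT fails

# The inventory UPDATE from the failed call was rolled back with the INSERT.
on_hand = conn.execute(
    "SELECT on_hand FROM inventory WHERE sku = 'widget'"
).fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In the second call the inventory update runs first and succeeds, but it is discarded once the insert into `orders` fails, which is the property the summary attributes to multi-table transactions.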
Backstage with Lakebase, part 2
Lakebase enables running production OLTP applications like Backstage on a serverless Postgres surface within Databricks, offering 1-second database branching and sub-4-second point-in-time recovery for schema migrations. Unity Catalog unifies governance for operational databases, providing single SQL query auditing, automatic row-level security propagation to branches, and zero-ETL cost attribution for FinOps.
Expanded interoperability with Unity Catalog Open APIs
Unity Catalog Open APIs now offer expanded interoperability, with external access to UC managed Delta tables in Beta and credential vending generally available with M2M OAuth support. External engines like Apache Spark, Flink, and DuckDB can now create, read, and write to UC managed Delta tables, leveraging Delta Lake's new catalog commits feature for safe concurrent writes and auditability.
Tutorials: Making AI Feel Personal: User-Delegated Actions in MCP Agent Systems
The video demonstrates how to build an AI agent in Databricks that provides personalized responses by integrating user-delegated actions through Model Context Protocol (MCP) servers. It walks through setting up Unity Catalog functions, external MCP tools like web search, and custom MCP servers to access internal APIs, all while maintaining user context for relevant information retrieval.
Tutorials: How to Build an AI Security Governance Hub with Agent Bricks
Databricks Agent Bricks enables building an AI Security Governance Hub by transforming static security playbooks into adaptive multi-agent systems. The video demonstrates combining a knowledge assistant for unstructured documents and a Genie space for structured data into a supervisor agent, then details how to tune and monitor these agents for improved performance and data privacy.
What is the recommended approach to enforce row-level security in Unity Catalog for external BI tool
General Availability of Attribute-Based Access Control (ABAC), Governed Tags, and Data Classification in Unity Catalog.
ABAC policies, governed tags, and automated data classification are now generally available in Unity Catalog. Governance teams define access rules and they apply automatically across the entire data estate. Sensitive data is discovered, tagged, and protected as it's created, with no manual configuration per table. Together, they deliver consistent, scalable protection with less operational overhead and stronger compliance posture as data grows. [https://databricks.com/blog/abac-row-filtering-and-column-masking-policies-governed-tags-and-data-classification-are-now](https://t.co/roCr8NFpif)
Schema Evolution and Schema Enforcement without Delta live Tables & Unity catalog
ABAC row filtering and column masking policies, governed tags, and data classification are now generally available in Unity Catalog
ABAC row filtering and column masking policies, governed tags, and data classification are now generally available in Unity Catalog. These capabilities unify data governance, eliminating manual security and ensuring consistent, real-time protection across your data estate.
Managing Unity Catalog Permissions for Databricks Apps via DABs
The Rise of Sports Intelligence: How the Lakehouse Turns Tracking Data into Competitive Advantage
Pro teams now leverage the Lakehouse to transform exploding tracking and biomechanical data into sports intelligence, driving real-time decisions on the court, in training, and in the front office. The Databricks Data Intelligence Platform acts as the governed "sports brain," unifying diverse data with Lakeflow, Unity Catalog, ML, and AI Search to power proactive injury management, coaching insights, and next-gen fan experiences.
unable to manage Unity Catalog Allow List / identify Metastore Admin
I created the Azure Databricks workspace and I’m also Azure Subscription Owner and Databricks Workspace Admin. Unity Catalog is enabled and the workspace is attached to the metastore: metastore_azure_centralindia

However, when trying to manage Allowed JARs / Init Scripts, I get: Requires permission MANAGE_ALLOWLIST on metastore

I tried: GRANT MANAGE_ALLOWLIST ON METASTORE TO `my-user`

but received: User is not a metastore admin for Metastore 'metastore_azure_centralindia'

I also checked: SHOW GRANTS ON METASTORE; and only see operational privileges for the workspace admin group like:

* CREATE CATALOG
* CREATE EXTERNAL LOCATION
* CREATE STORAGE CREDENTIAL

No OWNER / MANAGE / MANAGE_ALLOWLIST privileges are visible.

Questions:

* How can I identify the actual Databricks Account Admin or Metastore Admin?
* Is there a way to trace which user/group owns the metastore?
* In Azure Databricks + Unity Catalog setups, does the metastore owner sometimes become a hidden/system/bootstrap principal instead of the workspace creator?
* What is the recommended approach to recover or reassign metastore governance access in this situation?

Any guidance from people who faced this in Azure Databricks environments would help. Thanks!!!
Announcing Native Lakehouse Sync
Native Lakehouse Sync (Public Preview) now automatically replicates Lakebase Postgres data into Unity Catalog managed tables, eliminating pipelines and external compute. This enables live ML features, operational data as the Bronze layer with full SCD Type 2 history, and built-in audit capture, all with zero Postgres performance impact and no added cost.
Faster Queries and New Capabilities with the Open-Source Databricks JDBC Driver
The new open-source Databricks JDBC driver delivers up to 30% faster large result retrieval and adds support for multi-statement transactions, stored procedures, Arrow compatibility, and Unity Catalog metric views. This fully owned, open-source driver enables faster fixes, external contributions, and tighter platform integration.
News: Data + Semantic Context = AI Ready | How TK Elevator Built It on Databricks
TK Elevator built an AI-ready data platform on Databricks Lakehouse, centralizing fragmented elevator data at scale. This platform integrates semantic context and expert knowledge, using Unity Catalog for governance and a medallion architecture to prepare data for AI applications.
Lakehouse Sync
Lakehouse Sync replicates data from Lakebase/Postgres directly into Unity Catalog Delta tables. It uses CDC from PostgreSQL WAL, with wal2delta doing the work. #databricks
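To make the idea of turning a change stream into SCD Type 2 history (as mentioned in the Native Lakehouse Sync announcement) more concrete, here is a minimal pure-Python sketch. It is not how wal2delta or Lakehouse Sync is implemented; the `HistoryRow` class, the `apply_change` helper, the event tuples, and the integer sequence numbers are all hypothetical and only illustrate the general technique of closing the current version of a row and opening a new one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryRow:
    key: int
    value: Optional[str]
    valid_from: int                 # sequence number of the change that opened this version
    valid_to: Optional[int] = None  # None marks the current version


def apply_change(history: List[HistoryRow], op: str, key: int,
                 value: Optional[str], seq: int) -> None:
    """Apply one change event (insert, update, or delete) to the history."""
    if op not in ("insert", "update", "delete"):
        raise ValueError(f"unknown operation: {op}")
    # Close the open version of this key, if there is one.
    for row in history:
        if row.key == key and row.valid_to is None:
            row.valid_to = seq
    # Inserts and updates open a new current version; deletes do not.
    if op != "delete":
        history.append(HistoryRow(key, value, seq))


history: List[HistoryRow] = []
events = [
    ("insert", 1, "draft", 10),
    ("update", 1, "published", 20),
    ("delete", 1, None, 30),
]
for op, key, value, seq in events:
    apply_change(history, op, key, value, seq)

# Rows still open are the current state; closed rows are the audit trail.
current = [row for row in history if row.valid_to is None]
```

After the three events, the history keeps both past versions of key 1 with their validity ranges, and no row is current because the last event was a delete.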
What's new in AI/BI Dashboards - April 2026
* **Publish with service principal credentials**: Authors can publish dashboards using the data credentials of a service principal. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/share/share#publish-dashboard)
* **Service principal ownership**: Workspace admins can transfer dashboard ownership to a service principal in the UI. 📖 [Documentation](https://docs.databricks.com/aws/en/ai-bi/admin/#transfer-ownership)
* **Choropleth map admin levels**: Choropleth maps support US admin levels 3 (regions, multi-state groupings) and 4 (states). 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/manage/visualizations/maps)
* **SQL editor line numbers**: The SQL query editor displays line numbers to help with legibility and debugging.
* **PDF subscription page selection**: Dashboard authors can select which pages to include in PDF email subscriptions. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/share/schedule-subscribe)
* **Parameter values in widget titles and descriptions**: Dashboard authors can reference parameter values in widget titles and descriptions, so the text updates dynamically as viewers change parameter selections. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/manage/filters/parameters)
* **Table cross-filtering and drill-through**: Tables support cross-filtering and drill-through.
* **Counter prefix and suffix**: Numbers in counters support custom prefixes and suffixes. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/manage/visualizations/types#counter)
* **Schema browser default dataset type**: Adding a table to a dashboard from the schema browser creates a [local metric view](https://docs.databricks.com/aws/en/dashboards/manage/data-modeling/local-metric-views) by default instead of a SQL dataset.
* **Warehouse overload message**: Dashboards show a message explaining when rendering is delayed due to the warehouse being overloaded.
* **Tabular attachments in email subscriptions**: Dashboard email subscriptions include tabular attachments.
* **Fullscreen scroll position**: Exiting fullscreen mode on a published dashboard returns you to your previous scroll position instead of jumping to the top of the page.
* **Local metric views**: A new dataset type lets you create metric views directly in a dashboard using a low-code visual interface, without publishing to Unity Catalog first. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/manage/data-modeling/local-metric-views)
* **Edit hex color values inline**: Authors can click directly on a hex color value to edit it in place.
* **View SQL for visualization widgets**: Authors can view the SQL behind specific visualization widgets while in draft mode.
* **Waterfall chart totals**: Waterfall charts with categorical X-axis support a total bar.
* **Scatter plot shape field**: Scatter plots support a shape field to differentiate data points by category.
* **Clear applied filters individually**: Dashboard viewers can individually clear applied filters from the active selection bar.
* **Text box vertical alignment**: Text box widgets support vertical alignment (top, center, and bottom).
* **Choropleth map boundaries**: Choropleth maps support additional boundary types, including ZIP code and NUTS regions.
* **“Explain this change” chart types**: The “Explain this change” feature is available for pivot table cells, horizontal bar charts, pie charts, and heatmaps, in addition to time series charts. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/genie-spaces#explain-chart-changes)
Databricks is now supporting Microsoft Outlook in Lakeflow Connect (Beta)
Azure Databricks has introduced a managed Microsoft Outlook connector for Lakeflow Connect, currently available in Beta, enabling organizations to ingest Outlook email data directly into Azure Databricks. With this new connector, teams can now integrate Outlook-based communication data into analytics, governance, automation, and AI workflows more efficiently.

Key capabilities currently supported:

* Incremental ingestion
* Unity Catalog governance
* UI & API-based pipeline authoring
* Databricks Workflows orchestration
* Declarative Automation Bundles
* Column selection/deselection

Supported authentication:

* OAuth M2M (Machine-to-Machine)

Current Beta limitations:

* SCD Type 2 support
* Automated schema evolution
* API-based row filtering
* Multiple tables per pipeline (currently limited to 1)

Since the connector is still in Beta, workspace admins must enable the feature from the Previews page before use. Nice to see Databricks continuing to expand Lakeflow Connect integrations across enterprise ecosystems. [Source Link](https://learn.microsoft.com/en-us/azure/databricks/ingestion/lakeflow-connect/outlook-overview)
The next generation of Databricks Genie just launched. Here is what data engineers actually need to know.
I have been following Genie since it first launched with AI/BI last year. Back then, I honestly thought it was mostly for business users. A chatbot on top of your data that could answer basic questions in plain English. Useful, but not something I thought data engineers really needed to care much about. After seeing the new 2026 version, I completely changed my mind. Genie is no longer just a business chatbot. The biggest change is Genie Code, which is basically an AI agent designed for data professionals. It can generate pipelines, debug failures, create dashboards, monitor systems, and work directly with Lakeflow and Unity Catalog. That part caught my attention immediately because it moves beyond simple Q&A and starts touching actual engineering workflows. What surprised me most is how connected the whole system has become. It can pull context from dashboards, Genie Spaces, apps, metadata, documentation, and external systems like GitHub, Jira, and Confluence through MCP. Instead of only searching tables, it tries to understand relationships across the environment. That feels very different from the first version. The operational side is also interesting. Genie Code can monitor pipelines, investigate failures, help with DBR upgrades, and respond to issues before teams even notice them. The more I read about it, the more it felt less like a chatbot and more like an assistant sitting beside the engineering team. But honestly, the biggest takeaway for me is not the AI itself. It is what this means for data engineers. A lot of people immediately jump to “AI will replace data engineers,” but I think the opposite is happening. These systems are only as good as the data foundation underneath them. If metadata is incomplete, if tables are messy, if naming conventions are inconsistent, or if documentation is missing, the AI layer will give poor answers confidently. 
That means clean data modeling, governance, metadata, documentation, and data quality are becoming even more important than before. The engineers building those foundations become more valuable, not less. I think the role is slowly shifting away from spending hours writing repetitive boilerplate transformations and more toward building trustworthy, AI-ready data systems. One thing I keep noticing while learning Databricks through BricksNotes and the wider community is that the platform is moving very quickly toward AI-native data engineering. Features like Unity Catalog, Lakeflow, and now Genie all connect together. It feels like understanding metadata and governance is becoming just as important as understanding Spark itself. Also interesting that Genie now has a full mobile experience on iOS and Android. Business users can access dashboards, apps, and chat directly from their phones, which means the underlying data quality matters even more because people are going to depend on these systems everywhere, not only during work hours. Curious if anyone here is already using Genie or Genie Code in production. I would genuinely like to hear how the answer quality has been and whether your teams are changing how they approach metadata and documentation because of it.
Releases: Introducing Databricks Document Intelligence
Databricks Document Intelligence is a new solution for extracting, processing, and analyzing unstructured data from documents using large language models. It offers a unified platform for document processing, including data extraction, summarization, and question answering, with a focus on accuracy and scalability.
Performing DML Operations on Unity Catalog Managed Tables from External Engines
News: Databricks Genie, Unity AI Gateway, Project Glasswing, and Model Mania | AI Newsround - April 2026
Databricks Genie is now the business user home screen for Databricks, offering a unified chat interface, external knowledge store connections, and a mobile app. The Unity AI Gateway, integrated with Unity Catalog, provides comprehensive governance for agentic AI, including permissions, auditing, and policy controls for models and tools.
News: Databricks in 3 minutes. The unified data and AI platform, explained.
Databricks unifies diverse data sources into a single data lake, providing a governed platform for analytics and AI. It offers capabilities like fine-grained access control, natural language querying with AI, and company-wide intelligent agents.
Unity Catalog - How to read prod data in dev with appropriate read-only access?
Managed vs External Tables in Unity Catalog: The Decision That’s Silently Inflating Your Cloud Bill
How would you onboard legacy data stores that don't use OAuth into Databricks Unity Catalog?
Lakehouse Federation typically uses OAuth, hence the question above.
Does Databricks natively support tokenization? If so, how?
I'm focused on static data masking as opposed to dynamic data masking (where the latter is done through RLS and CLS). I'm also not talking about encryption such as AES-256, where compromising the key compromises the data. If tokenization is not supported natively by Databricks Unity Catalog, what are some partner companies that support it and integrate well with UC?
I thought Unity Catalog was overkill… until this changed my mind.
I used to think Unity Catalog was overkill. Too many layers. Too many configs. So I just memorized permissions and moved on. But Unity Catalog is not just about **permissions**, it's about **ownership and structure**.

* Catalog = domain
* Schema = grouping
* Tables = team-owned assets

Now governance actually feels predictable. Wrote a simple breakdown (no jargon): [https://medium.com/@wnccpdfvz/unity-catalog-demystified-how-databricks-solves-data-governance-once-and-for-all-11ed30fadf2b](https://medium.com/@wnccpdfvz/unity-catalog-demystified-how-databricks-solves-data-governance-once-and-for-all-11ed30fadf2b)
Native Excel support is now GA
Hey r/databricks! Native Excel ingestion on Databricks is now **Generally Available** across AWS, Azure, and GCP. With this release, you can ingest, parse, and query `.xls` / `.xlsx` / `.xlsm` files directly. Public docs: [https://docs.databricks.com/aws/en/query/formats/excel](https://docs.databricks.com/aws/en/query/formats/excel)

**📂 What is it?**

Native Excel support that lets you:

* Directly read `.xls`, `.xlsx`, and `.xlsm` files using Spark (`spark.read.excel(...)`) or SQL (`read_files`, `COPY INTO`).
* Upload Excel files through the "Create or modify table" UI and land them as Delta.
* Specify exact sheets and cell ranges (e.g., `"Sheet1!A2:D10"`) for complex layouts.
* Infer schema, headers, and data types automatically, or bring your own.
* Stream Excel files with Auto Loader using `cloudFiles.format = "excel"`.
* List sheets in a workbook programmatically before ingesting.

**🤷 Why?**

Until now, Databricks didn't have a native Excel reader. That meant writing custom Python with pandas / openpyxl to convert Excel → DataFrame → Delta, manually exporting sheets to CSV before you could ingest them, or giving up on workflows because the Databricks file-upload UI rejected `.xlsx`. GA makes Excel a first-class file format across Spark, SQL, Auto Loader, and the table-creation UI. It also opens the door to Excel ingestion via our managed file connectors ([SharePoint](https://docs.databricks.com/aws/en/ingestion/sharepoint), [Google Drive](https://docs.databricks.com/aws/en/ingestion/google-drive#google-drive-metadata-column), [SFTP](https://docs.databricks.com/aws/en/ingestion/sftp), and more coming soon).

**🧑‍💻 How do I try it?**

1️⃣ Requirements

* Databricks Runtime 18.1 or above.

2️⃣ Try it in the UI

* Click New → Add Data → Create or modify table.
* Upload an `.xls`, `.xlsx`, or `.xlsm` file.
* Pick the sheet. Adjust header rows or cell range if needed.
* Preview the inferred schema.
* Click Create table. It lands as a Delta table in Unity Catalog.
3️⃣ Try it in Spark (batch)

    # Read the first sheet of a workbook
    df = spark.read.excel("<path to excel file>")

    # Use a header row and a specific sheet + range
    df = (
        spark.read
        .option("headerRows", 1)
        .option("dataAddress", "Sheet1!A1:E10")
        .excel("<path to excel directory or file>")
    )

    df.write.mode("overwrite").saveAsTable("<catalog>.<schema>.my_table")

4️⃣ Try it in SQL with read_files

    CREATE TABLE my_sheet_table AS
    SELECT * FROM read_files(
      "<path to excel directory or file>",
      format => "excel",
      headerRows => 1,
      dataAddress => "Sheet1!A2:D10",
      schemaEvolutionMode => "none"
    );

5️⃣ Try it with COPY INTO

    COPY INTO excel_demo_table
    FROM "<path to excel directory or file>"
    FILEFORMAT = EXCEL;

6️⃣ Try it with Auto Loader (streaming)

    df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "excel")
        .option("cloudFiles.inferColumnTypes", True)
        .option("headerRows", 1)
        .option("cloudFiles.schemaLocation", "<schema location>")
        .load("<path to excel directory or file>")
    )

    (df.writeStream
        .format("delta")
        .option("checkpointLocation", "<checkpoint path>")
        .table("<catalog>.<schema>.excel_stream"))

7️⃣ List sheets in a workbook

    sheets = (
        spark.read
        .option("operation", "listSheets")
        .excel("<path to workbook>")
    )
    sheets.show()  # returns sheetIndex, sheetName

**🎛️ Supported options**

|Option|Description|
|:-|:-|
|`dataAddress`|Cell range in Excel syntax. Examples: `"MySheet!C5:H10"`, `"C5:H10"`, `"Sheet1"`. Defaults to all valid cells on the first sheet.|
|`headerRows`|Number of header rows inside `dataAddress` (0 or 1). Default: 0.|
|`operation`|`"readSheet"` (default) or `"listSh […truncated]
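The three `dataAddress` forms given as examples in the options table (sheet plus range, range only, sheet only) can be told apart with a few lines of plain Python. This is only a sketch for sanity-checking values before passing them to the reader; `parse_data_address` and `DataAddress` are hypothetical helpers, and the actual parsing rules of the Databricks Excel reader may be broader or stricter than this.

```python
import re
from typing import NamedTuple, Optional


class DataAddress(NamedTuple):
    sheet: Optional[str]       # None means "use the default sheet"
    cell_range: Optional[str]  # None means "no explicit range given"


# A1-style range such as "C5:H10" (assumed form, based on the examples above).
_RANGE = re.compile(r"^[A-Za-z]+[0-9]+:[A-Za-z]+[0-9]+$")


def parse_data_address(address: str) -> DataAddress:
    """Split a dataAddress string into its sheet and range parts."""
    if "!" in address:
        sheet, cell_range = address.rsplit("!", 1)
        if not sheet or not _RANGE.match(cell_range):
            raise ValueError(f"unrecognized dataAddress: {address!r}")
        return DataAddress(sheet, cell_range)
    if _RANGE.match(address):
        return DataAddress(None, address)
    return DataAddress(address, None)


full = parse_data_address("MySheet!C5:H10")
range_only = parse_data_address("C5:H10")
sheet_only = parse_data_address("Sheet1")
```

Here `full` carries both parts, `range_only` has no sheet, and `sheet_only` has no range, matching the three example strings from the post.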
If your Lakeflow SDP pipeline broke with DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE, here's a recovery script
I ran into this recently and wanted to share. A Delta table I was streaming from got dropped and recreated by an upstream team. Same name, same schema, but the new table has a fresh internal ID. Spark Structured Streaming checkpoints bind to that ID, so the next pipeline run fails with:

`[DIFFERENT_DELTA_TABLE_READ_BY_STREAMING_SOURCE] The streaming query was reading from an unexpected Delta table...`

In open-source Spark you'd delete the checkpoint directory. Lakeflow SDP manages those paths internally, so that's not an option. The fix is the Pipelines API parameter `reset_checkpoint_selection` (added in `databricks-sdk` 0.100): pass a list of FQN flow names and start an update that clears only those checkpoints. Bronze/Silver/Gold targets stay untouched.

I packaged the recovery as a sub-template in my Databricks bundle template repo. One CLI call ships the script (with a `--dry-run` flag), a workspace notebook variant, and a README:

`databricks bundle init https://github.com/vmariiechko/databricks-bundle-template --template-dir assets/sdp-checkpoint-recovery`

It also includes a fallback for environments where you can't pip-upgrade the SDK (for me that was the case with the Databricks serverless runtime, which bundles its own SDK).

Repo: https://github.com/vmariiechko/databricks-bundle-template/tree/main/assets/sdp-checkpoint-recovery

Two gotchas worth knowing:

- Flow names must be three-part Unity Catalog FQNs (`catalog.schema.table`), or you hit `IllegalArgumentException`.
- Resetting checkpoints triggers a pipeline update; the API has no "reset only" mode. If you want the pipeline stopped afterwards, cancel from the UI as soon as the call returns.

Happy to answer questions or hear how you have handled this situation. P.S. Feel free to submit issues or PRs.
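The first gotcha can be caught before any API call is made. The sketch below is not taken from the linked repo: `validate_flow_names` and `build_reset_request` are hypothetical helpers, and the dictionary is only an assumed shape for collecting the arguments. The one name that comes from the post is the `reset_checkpoint_selection` parameter; how it is passed to the SDK should be checked against the `databricks-sdk` documentation.

```python
from typing import Dict, List


def validate_flow_names(names: List[str]) -> List[str]:
    """Return the names unchanged if each is a three-part catalog.schema.table name."""
    bad = []
    for name in names:
        parts = name.split(".")
        # Simplification: quoted identifiers that contain dots are not handled.
        if len(parts) != 3 or not all(part.strip() for part in parts):
            bad.append(name)
    if bad:
        raise ValueError(f"flow names must be catalog.schema.table, got: {bad}")
    return list(names)


def build_reset_request(pipeline_id: str, flow_names: List[str]) -> Dict[str, object]:
    """Collect the arguments for an update that resets only the selected checkpoints."""
    return {
        "pipeline_id": pipeline_id,
        "reset_checkpoint_selection": validate_flow_names(flow_names),
    }


# Hypothetical pipeline ID and flow name, for illustration only.
request = build_reset_request("my-pipeline-id", ["main.bronze.events"])
```

Failing locally with a `ValueError` is cheaper than discovering the `IllegalArgumentException` after the call has been sent, especially since the reset also starts a pipeline update.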
Databricks' Zerobus Event Data Ingestion Deep-Dive Demo (w/ Databricks' Staff Developer)
So very excited to share this demo + presentation with the one and only Scott Haines, Staff Developer Advocate @ Databricks. The topic? Zerobus, which is a great option for easily ingesting event data at scale into Unity Catalog. We do a demo and overview of the technology, talk about how it is similar to and different from Kafka, when to use Real-Time Mode vs Zerobus, and much more! Hope you enjoy this very technical overview!
LLM access to a few Delta tables inside Unity Catalog
I want an LLM to access a few tables (not all tables), either through an API endpoint or MCP. Which is the cleanest way, and a secure one as well? Do I create service principals, or use Genie MCP (adding only specific tables to the Genie space)?
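For the service principal route, one possible shape is to give the principal only what it needs to read the chosen tables: `USE CATALOG`, `USE SCHEMA`, and `SELECT` on each table. The sketch below just generates those Unity Catalog `GRANT` statements as strings; `grants_for_tables`, the principal name, and the table names are hypothetical, and whether this or a Genie space is the better fit depends on the setup.

```python
from typing import List


def grants_for_tables(principal: str, tables: List[str]) -> List[str]:
    """Build the GRANT statements that give read access to the listed tables only."""
    statements: List[str] = []
    for fqn in tables:
        parts = fqn.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected catalog.schema.table, got: {fqn}")
        catalog, schema, table = parts
        candidates = [
            f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
            f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
            f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
        ]
        # Skip statements already emitted for an earlier table in the same schema.
        for statement in candidates:
            if statement not in statements:
                statements.append(statement)
    return statements


# Hypothetical principal and tables, for illustration only.
statements = grants_for_tables(
    "llm-agent-sp", ["main.sales.orders", "main.sales.customers"]
)
```

Anything not listed stays invisible to the principal, so the set of tables the LLM can reach is defined by the grants and not by the prompt or the tool configuration.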
The learning order that actually works for Databricks. I wasted 3 months before figuring this out.
I want to share something that I wish someone told me when I started learning Databricks because it would have saved me months of confusion. When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor but the moment I opened a blank notebook I had no idea where to start. Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything. Here is the order that worked for me and I genuinely believe it works for most people. Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database this stage takes a few days not weeks. Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run SELECT VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table so you get versioning, schema enforcement, and ACID transactions without configuring anything. Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing. Then build pipelines. 
Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps. Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation but once you have built a pipeline you understand exactly why it matters. The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there everything fell into place faster. One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version that tutorial was written for Community Edition which no longer exists. Just open a notebook and start writing code. Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.
Using Genie Code to build a Genie space
First time building a Databricks Genie space using Genie Code. Surprisingly, you can get 80% of what you'd need with one prompt, with the other 20% being further tailoring through follow-up prompts. The key to making it happen? Spending time upfront on governance inside Unity Catalog, especially leveraging its documentation capabilities.

👉 Quick walkthrough of what I did here:

- Started off from the home screen on my Databricks workspace.
- Wrote a single prompt into Genie Code to create a Genie space, pointing at the schema containing a handful of dimensions & two fact tables.
- The tables and respective fields already had "Comments" in Unity Catalog to document what they represent.
- Genie Code handled the Genie space creation and table relationships, created reusable measures, and created a handful of starter questions that would be appropriate for business users.
- I picked one of the suggested questions, which leveraged "Agent Mode", a mode for complex questions.
- I asked a follow-up question to have it give me some actionable recommendations.

👉 General recommendations:

- Proper governance is more important than ever. Spend time making the most out of Unity Catalog first to make the most out of the platform!
- Always review the configurations, logic, and code generated by coding agents, especially when money is involved!
- Become familiar with the different capabilities Databricks offers, and then use Genie Code to help you get started using the ones that make business sense to you, fast.

Hope you enjoyed this post!
Marimo on Databricks
My workflow for a long time involved switching back and forth between vscode and the browser/databricks ui. I like to write my "production code" in normal python, but notebooks are great for exploration, spikes, visualization, triage etc. I could write a small dissertation, but for various reasons I don't really like jupyter, and databricks notebooks have their own problems with commented magic commands etc.

This led me to check out [marimo](https://marimo.io/), and wow, these are so cool. Code that runs in normal python, merges cleanly, has visualizations and widgets, the app runs locally and doesn't glitch out, and even the vscode extension works nicely.

The problem was, the databricks support wasn't great. It just felt a bit dated. It required a warehouse for sql, didn't seem to really support serverless, and there were just so many opportunities to plug databricks into marimo.

This led me to create [marimo-databricks-connect](https://github.com/brookpatten/marimo-databricks-connect) ([pypi](https://pypi.org/project/marimo-databricks-connect/)). I tried to plug "all the things" databricks into the places where they go in marimo. I'm pretty happy with the result.

- Connect to databricks using databricks-connect & spark (not sql warehouse)
- Authenticate/configure spark using the default databricks-connect process (env vars, .databrickscfg etc), no additional auth config
- Execution of both python & sql cells
- Autocomplete of catalog/schema/table/column names
- Browsing of catalogs/schemas/tables/columns in the marimo data sources view
- Browsing of external locations, volumes, dbfs, workspace in the marimo storage browser
- Notebook widgets to monitor and control specific instances of databricks capabilities (clusters, workflows, vector search, apps etc)
- Widgets to browse & explore databricks capabilities (compute, workflows, unity catalog)
- Works in local marimo (`marimo edit notebook.py`) and in the vscode extension
- Deploy as a databricks app to provide an alternative web based marimo UI

I'm working on adding serving endpoints as AI providers to the notebooks too.

In particular what I like to use this for is creating "command center" notebooks for given processes that can include some normal pyspark/sql code to query/triage, widgets to monitor/control various databricks resources, visualizations to monitor dq etc.

I just wanted to share and see what the community thinks. Would you use it? Contributions are welcome.

throwaway account because i'm doxing myself via gh repo.
[Passed] Databricks DEA Exam today
https://preview.redd.it/z6mcmrgvmjyg1.png?width=474&format=png&auto=webp&s=28e010f62635d49af3a815998011125d8f2cfa0f

Just walked out of the exam and I’m glad to say I passed. I was sweating a bit because the exam content changes on the 4th, so I really didn't want to fail and have to deal with a new syllabus.

I've had Databricks at work since late 2023. I’ve been using it because, well, it’s there, but I was mostly just "vibe coding", picking up some Python and Spark here and there without any real depth. I ran jobs using whatever cluster settings the company gave me without actually knowing what they meant.

If you’ve never touched Databricks, this exam is going to be a pain. Even if you’re good at coding, the internal components and the way everything fits together are hard to grasp just by reading. You really need to get your hands dirty in the workspace to get a "feel" for it.

**Study Routine**

I started with the Databricks Academy stuff, but since I’m juggling work and a toddler, I could only study on weekends. This was a disaster because by the next Saturday, I’d already forgotten what I learned the week before. One month before the exam, I ditched the theory and just hammered Mock Exams.

* Udemy is your friend: I bought practice exams from Derar and Santosh.
* I snagged them at a discounted price. Just wait for the sale if you are not in a hurry.

Personally, Santosh’s exams felt closer to the real thing. I saw maybe 5-6 questions that were almost word-for-word. Derar is also solid; honestly, just solve as many problems as possible.

Since my study time was limited, I focused on reviewing the questions I got wrong. I realized pretty early that Productionizing Data Pipelines was my weak spot. I didn't try to become an expert in it. I just aimed for a 60% "pass" in that section and doubled down on the areas I was actually good at. Don't completely ignore your weak areas though. If you bomb one section too hard, a couple of silly mistakes in other sections will kill your score.

**What's on the exam**

The questions are mostly scenario-based. You have to read the prompts carefully. Some things I remember:

* Auto Loader: This came up a lot.
* DLT (now called Lakeflow Spark Declarative Pipelines): You should understand what it actually does.
* Unity Catalog: Permissions (granting minimum access) and the actual SQL code for it.
* Delta Sharing: Knowing the difference between sharing with Databricks vs. non-Databricks users.
* Egress Costs: How to avoid them in cross-cloud sharing (Cloudflare R2 was the answer for one).
* SQL Warehouses: Classic vs. Pro vs. Serverless. Know when to use which.
* DABs (Databricks Asset Bundles): I got at least 3 questions on this. Don't skip it.
* Medallion Architecture: It’s not just "what is Bronze/Silver/Gold." They’ll give you a scenario and ask which layer the data should go to next.

Also, those "select two" questions are the absolute worst, super confusing.

I know the syllabus is changing on the 4th, so I’m not sure how much of this will still apply. But honestly, if you have some background and get familiar with the core concepts, it’s a very doable exam. I’ve learned a lot through this process. Good luck to everyone preparing!
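As a rough illustration of the "granting minimum access" item above, the sketch below builds the statements a read-only principal would typically need in Unity Catalog: `USE CATALOG`, `USE SCHEMA`, and `SELECT`, with nothing that allows creating objects. The catalog `main`, the schema `sales`, the group `data_analysts`, and the helper name are hypothetical; this is one plausible shape for such a question, not an actual exam item.

```python
def read_only_grants(catalog: str, schema: str, principal: str) -> list:
    """Return the GRANT statements for read-only access to a single schema.

    No CREATE TABLE or MODIFY privilege is granted, so the principal
    can query existing tables but cannot create or change them.
    """
    grantee = f"`{principal}`"  # backticks allow names with special characters
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {grantee}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {grantee}",
        f"GRANT SELECT ON SCHEMA {catalog}.{schema} TO {grantee}",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in read_only_grants("main", "sales", "data_analysts"):
        print(stmt + ";")
```

Granting `SELECT` at the schema level covers the tables inside it; granting on individual tables instead would be the tighter choice when only some of them should be readable.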
Here are 5 topics that showed up much more than I expected in my DEA exam
I took the Databricks Data Engineer Associate exam recently and wanted to share what actually came up because it was quite different from what I spent most of my time studying. I went in thinking Delta Lake theory and platform architecture would be the big topics. They weren't. The exam is way more practical than I expected.

**The first thing** that caught me off guard was how heavily they test Auto Loader. Not just the basics but real scenarios. One question described a pipeline receiving 50,000 new files per day and asked which ingestion method to use and why. You need to understand when Auto Loader makes sense versus COPY INTO, how schema evolution works with mergeSchema, and the difference between directory listing and file notification mode. I probably got six or seven questions just on this one topic.

**The second thing** was lazy evaluation. I knew the concept but I wasn't prepared for how they test it. They give you a block of code with four or five DataFrame transformations and ask what happens when you run the cell. The answer is nothing happens because there is no action at the end. But the way they frame the questions makes you second guess yourself if you only memorized the definition without really understanding it.

**Third** was Lakeflow expectations. The old name was Delta Live Tables but they use Lakeflow in the exam now. You need to know the three expectation types and when to use each one. They gave me a scenario where the pipeline should log bad records but never drop them and I had to pick the right expectation decorator. Also know the difference between streaming tables and materialized views because that came up more than once.

**Fourth** was Unity Catalog permissions. Not just the three level naming pattern but actual grant scenarios. Something like a data analyst needs to read tables in the sales schema but should not be able to create new tables and you have to pick the correct grant statement. I got at least three or four questions like this.

**Fifth** was MERGE INTO. They really love this command. Upsert scenarios, deduplication, slowly changing dimensions. If you cannot write a MERGE statement from memory with the WHEN MATCHED and WHEN NOT MATCHED clauses you should spend an hour practicing just that before you sit for the exam.

What surprised me was what was not heavily tested. Cluster configuration was maybe one question. The architecture diagrams with control plane and data plane were one or two questions at most. Delta Sharing was one question. Spark internals like shuffle details were barely mentioned.

The biggest thing I wish I had done differently is spend less time reading documentation and more time actually running code. When you have actually executed a MERGE INTO on a real table and seen the results, the exam question feels like something you have done before instead of something you read about once. I used Databricks Free Edition for all my practice and it was more than enough.

Hope this helps someone who is preparing right now. Feel free to ask anything about the exam in the comments and I will try to answer.
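For anyone practicing the MERGE pattern mentioned above, here is a minimal sketch. `build_merge_sql` writes out an upsert with both clauses, and `simulate_merge` is a toy in-memory model of the same semantics so the result is visible without a cluster. The table names, the `id` key, and both helper names are hypothetical. Unlike a real Delta MERGE, the simulation does not reject several source rows matching the same target row.

```python
def build_merge_sql(target: str, source: str, key: str, columns: list) -> str:
    """Build an upsert: update rows that match on `key`, insert the rest."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    all_cols = [key, *columns]
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


def simulate_merge(target_rows: list, source_rows: list, key: str) -> list:
    """Toy model of the upsert above, using dicts in place of tables."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        if row[key] in merged:
            merged[row[key]].update(row)   # WHEN MATCHED -> UPDATE
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED -> INSERT
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical table names and rows, for illustration only.
    print(build_merge_sql("main.sales.customers", "customer_updates", "id", ["name"]))
    print(simulate_merge(
        [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}],
        [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}],
        "id",
    ))
```

Running the generated statement against a small practice table and comparing the result with the simulated output is one way to get the hands-on feel the post recommends.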
Tutorials: Step-by-Step: Using the Databricks Excel Add-in to Analyze Governed Lakehouse Data
Unlocking SAP Business Context in Databricks with Semantic Metadata Delta Sharing
SAP Business Data Cloud now automatically syncs semantic metadata, including descriptions and key relationships, into Unity Catalog, making SAP data instantly AI-ready and more discoverable. SAP PersonalData governance tags are also automatically available in Unity Catalog, enabling fine-grained access controls with ABAC.
You can now adopt pre-existing Databricks Postgres branch and endpoint resources using the replace_existing argument. A change to databricks_external_location prevents Terraform drift related to the effective_file_event_queue field.
Databricks & BigQuery just announced bidirectional catalog federation - no more data duplication
**Hot off the press today**: Databricks and Google Cloud announced that customers can now access the same copy of data from either Unity Catalog or BigQuery without duplication. Learn more about the announcement here: https://www.databricks.com/blog/interoperability-between-unity-catalog-and-google-bigquery-catalog-federation
News: Talkdesk Powers AI-Driven CX with Databricks on AWS
Talkdesk uses Databricks on AWS as a unified data platform to power its AI-driven customer experience (CX) platform, which automates and accelerates customer interactions. Databricks centralizes data storage, provides consistent data modeling, and unifies data processing pipelines, enabling Talkdesk to manage both unstructured and structured data in Iceberg format and leverage generative AI capabilities.
Tutorials: How To Connect Power Apps to Databricks for Secure, Zero‑Copy Data Access
The video demonstrates how to connect Microsoft Power Apps to Azure Databricks for secure, zero-copy data access. It shows how to create a connection, load data into a Power App, and perform create, read, update, and delete operations directly on Databricks data, with auditing capabilities.
Stripe data now available on Databricks via Databricks Marketplace
Stripe data is now available on Databricks Marketplace, enabling you to activate a Stripe data pipeline with Delta Sharing in minutes and instantly power AI applications. Share Stripe payment and business data directly into Unity Catalog to create a single source of truth and query live payment data for models, agents, and Genie workspaces.
Interoperability Between Unity Catalog and Google BigQuery via Catalog Federation
Google Cloud now supports catalog federation to Unity Catalog, enabling BigQuery users to read tables in Unity Catalog without duplication. Unity Catalog also supports catalog federation to Google Cloud's Lakehouse, allowing it to read Iceberg tables written from BigQuery and other engines.
From months to minutes: Building real-time clinical data pipelines with natural language
Databricks and Redox now enable real-time clinical data pipelines from EHRs to Unity Catalog with natural language prompts, reducing integration time from months to minutes. This partnership allows AI outputs to be written back into the EHR in real time, transforming Databricks into an operational layer for point-of-care interventions.
No-Code Pipelines: Databricks Lakeflow Designer Demo (4-min demo)
In this demo for Lakeflow Designer you will see me:
- Pulling sales data from Unity Catalog AND from local Excel workbooks
- Building a one-big-table report and more specialized reports
- Exporting reports to Excel
- Storing outputs back into Unity Catalog tables for use by analysts, BI developers, and even business users using the Databricks Excel connector
- Setting up a recurring pipeline so that the data is kept fresh automatically
- All of the above without writing any code

I hope you enjoy the video, but more importantly, that you try it out yourself AND give feedback to the folks at Databricks on how to make the product even better. There are still many out-of-the-box pieces not yet available, and I know that the amount of feedback Databricks gets from customers will affect direction and priorities!
Show HN: Rocky – Rust SQL engine with branches, replay, column lineage
Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing.

The governance waveplan (column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies) landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.

The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG.

Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph: dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.

A few things I think are interesting:

- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.
- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.
- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.
- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.
- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.
- Schema-grounded AI. Generated SQL goes through the compiler, so AI suggestions type-check before they can land.

What Rocky isn't:

- Not a warehouse: it's the control plane on top.
- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.
- Not dbt Cloud: no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.

Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.

I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.

--- top comments ---

[Xiaoher-C] The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality. Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary. Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.

[ramon156] If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?
[mollerhoj] It's a bit confusing to claim that "The things your current stack can't give you because it doesn't own the DAG" and use Databricks as your example: Databricks inclu […truncated]