What the community is asking.
Recent threads from r/databricks and Stack Overflow's [databricks] tag — practical pain points, integration questions, and edge cases worth knowing about.
This week
70 questions

Clones don't mix with declarative pipelines?
I can't find an answer online anywhere, so let me ask here in case anyone has had a similar experience or has a workaround. The source database's managers imposed a rule that the ingestion process can only hit the database once per day, not once per environment per day. So I have to ingest into bronze once in production, and then copy that data into the UAT and dev environments within Databricks.

My solution was a job that runs `create or replace table uat.schema.table shallow clone prod.schema.table` for every table, once per day, after the ingestion into prod is finished. This way I have the data available everywhere, with proper isolation between the environments (outside of this job), without the need to physically copy the data.

The issue is that this data is used by Spark declarative pipelines. Although the ingestion is append-only, sometimes VACUUM and OPTIMIZE operations will delete physical files. When the cloning happens, the lower-environment table ends up with a single record in its history for the clone operation, which "merges" all operations that happened during the day - including writes, VACUUM, and OPTIMIZE. When the UAT/dev pipelines then try to run, they fail with DELTA_SOURCE_TABLE_IGNORE_CHANGES: "there have been non-append changes to the data".

It's easily reproducible: take a table and shallow clone it, then run a streaming pipeline on the clone (in my case it uses the Auto CDC function). Now write to the original append-only, then call OPTIMIZE on it in a way that deletes at least one file. Run the `create or replace ... shallow clone` again and try to run the pipeline - you get the error. Note that the pipeline defined against the base table runs fine; only the clone fails, even though it's the same data (a literal clone).

It seems the issue is that the clone operation doesn't carry history metadata about the multiple operations it merges into the single clone entry in the cloned table's history. So I can understand where the error comes from - the pipeline has no way to prove the clone's history is append-only. But this sounds kinda... wrong? It should be able to run. Is this the intended way for clone to work - it just doesn't work with streaming/declarative pipelines/Auto CDC? Or am I doing something stupid or an anti-pattern?
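For anyone who wants to reproduce this, here is a minimal sketch of the failure mode described above. Table and checkpoint names are placeholders, and a plain streaming read stands in for the declarative pipeline / Auto CDC flow that actually raises the error:

```python
# Hedged reproduction sketch (placeholder names, not the poster's code).
spark.sql("CREATE OR REPLACE TABLE uat.schema.t SHALLOW CLONE prod.schema.t")

# First run against the clone works.
(
    spark.readStream.table("uat.schema.t")
         .writeStream.format("delta")
         .option("checkpointLocation", "/Volumes/uat/schema/chk/t")
         .trigger(availableNow=True)
         .toTable("uat.schema.t_downstream")
         .awaitTermination()
)

# Later: append-only writes land in prod, then OPTIMIZE removes files,
# and the daily job re-clones. The clone's history collapses all of that
# into a single commit.
spark.sql("OPTIMIZE prod.schema.t")
spark.sql("CREATE OR REPLACE TABLE uat.schema.t SHALLOW CLONE prod.schema.t")

# Re-running the stream from the same checkpoint now fails with
# DELTA_SOURCE_TABLE_IGNORE_CHANGES ("non-append changes").
```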
From Databricks One to Genie UI: A Shift from Platform to Experience
ABAC Policies Not Working on Metric Views
Solution Accelerator Series | Digital Twins
DABs Python Mutators: Stop Copy-Pasting the Same Config Across 50 Jobs
# Situation

You've got 30, 50, maybe 100 jobs in your Declarative Automation Bundle. Every single one needs failure notifications. Every single one needs cost-center tags. Every single one needs the right cluster policy. And every time someone adds a new job, they forget at least one of those things.

You could write **one Python function** that enforces it automatically at deploy time. That's what DABs Python mutators do.

# What Are Mutators?

A mutator is a Python function that runs during `databricks bundle deploy`. It receives every job (or pipeline) in your bundle, whether defined in YAML or Python, and returns a modified copy. Think of it as middleware for your deployment config. Write a tag, permission, or compute standard once, and apply it automatically to every resource at deploy time. No drift.

Decorate a function with `@job_mutator`, `@pipeline_mutator`, `@schema_mutator`, or `@volume_mutator`. The function receives the resource + bundle context and returns a transformed copy. You register them in `databricks.yml`:

```yaml
python:
  mutators:
    - 'mutators:add_pipeline_mutators'
```

# Example

This example defines common pipeline standards for every pipeline in your bundle:

* Specifies common tags.
* Enforces serverless compute.
* Defines the default notifications group and when to trigger an alert.

```python
from dataclasses import replace

from databricks.bundles.core import Bundle, pipeline_mutator
# Pipeline and Notifications come from the bundles resource classes;
# the exact import path may vary by CLI version.
# _add_common_tags is a user-defined helper for merging standard tags.

@pipeline_mutator
def add_pipeline_mutators(bundle: Bundle, p: Pipeline) -> Pipeline:
    p = replace(p, tags=_add_common_tags(bundle, p.tags))
    p = replace(p, serverless=True)
    default = Notifications.from_dict(
        {
            "email_recipients": "${var.recipients}",
            "alerts": ["on-update-failure", "on-update-fatal-failure", "on-flow-failure"]
        }
    )
    p = replace(p, notifications=[default])
    return p
```

Other resources:

* The [bundle-examples](https://github.com/databricks/bundle-examples) repo has a working example at [knowledge_base/job_programmatic_generation](https://github.com/databricks/bundle-examples/tree/main/knowledge_base/job_programmatic_generation).
* The documentation page: [https://docs.databricks.com/aws/en/dev-tools/bundles/python/#modify-resources-defined-in-yaml-or-python](https://docs.databricks.com/aws/en/dev-tools/bundles/python/#modify-resources-defined-in-yaml-or-python)

# Use Cases

https://preview.redd.it/2v2ikiexd4yg1.png?width=632&format=png&auto=webp&s=fa39246e3830b857f1da43777aacfb1079a261a8

Job mutator examples:

* Enforce default email notifications, owners, tags.
* Standardize job clusters / serverless environments.
* Inject common job parameters or health/queue settings.

Pipeline mutator examples:

* Enforce pipeline cluster / environment settings.
* Apply consistent configuration, catalog/schema, or triggers across all pipelines.

Schema mutator examples:

* Apply standard permissions or tags to all schemas.
* Enforce naming conventions or lifecycle settings.

Volume mutator examples:

* Set default storage locations, ACLs, or lifecycle flags.
* Add org-wide tags or conventions to all volumes.
Most Databricks performance problems don’t start with code — they start with the wrong cluster setup.
I just published a practical guide on Databricks Clusters covering what actually matters in production:

• All-Purpose vs Job Clusters
• Cluster Pools for faster startup
• Cluster Policies for governance
• Photon for faster SQL + Delta performance
• Spot Instances for serious cost savings
• Autoscaling + Auto-termination best practices

A lot of teams spend weeks optimizing queries while ignoring the real issue: poor cluster architecture. Sometimes the biggest performance gain is just choosing the right cluster strategy. Wrote this to simplify the concepts and make them useful for real production workloads. Would love to know how your team handles cluster optimization.

Medium Blog - [https://medium.com/@wnccpdfvz/everything-you-need-to-know-about-databricks-clusters-production-ready-guide-c5e5ebe90757](https://medium.com/@wnccpdfvz/everything-you-need-to-know-about-databricks-clusters-production-ready-guide-c5e5ebe90757)
CUSTOMER STORY | easyJet: Creating better travel experience for all
Server Error: Invalid Request URL
Are there limits set for Agent Mode in Databricks Free Edition?
Governed Tags are GA
Are you currently leveraging governed tags? If yes, where and how? Curious to know.
What's new in AIBI Dashboards March 2026
* **External user embedding GA**: External user embedding for dashboards is generally available. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/share/embedding/external-embed)
* **Hide Databricks logo**: Option to hide the Databricks logo for embedded dashboards. 📖 [Documentation](https://docs.databricks.com/aws/en/dashboards/share/embedding/external-embed#hide-logo)
* **Mobile-responsive layouts**: Dashboards support mobile-responsive layouts for small screens.
* **Custom email subject lines**: Custom email subject lines for dashboard subscriptions.
* **Custom font selection**: Select custom fonts for your dashboards, applied to data point labels.
* **Accounting-style number formatting**: Parentheses formatting for negative numbers.
* **Individual filter removal**: Remove individual filters without resetting all filters.
* **Waterfall chart faceting**: Waterfall charts support faceting.
* **Pivot table drill-through**: Added drill-through functionality for pivot tables.
* **Right-click explanations**: Right-click explanations for time series visualizations.
* **Dashboard snapshot audit logs**: sendDashboardSnapshot events added to audit logs for Slack/Teams subscriptions.
* **Date filter keyboard input**: Fixed date filter keyboard input for relative date ranges.
* **Point map recentering**: Point maps auto-recenter when filters are applied.
* **Advanced cells in pivot tables**: Pivot tables support advanced cells including image, HTML, JSON, and link.
Databricks Academy lab subscription, what are the features?
I teach Databricks and other big data technologies, and I'm trying to understand the Databricks Academy lab offered as a 200 USD per year subscription. I use both the Free Edition and an enterprise edition for personal and professional work, and I'd like to use the Databricks Academy lab ($200 yearly subscription) for learning, writing books, research, and articles. Can anyone compare the Free Edition vs the Databricks Academy lab - what clusters are allowed? I need compute clusters, serverless, etc. I can't find details about the clusters offered inside the Databricks Academy lab.
Splitting string into respective columns
I am trying to read log files and split the data into multiple columns using Databricks and Python. So far I have been able to split the string into an array, but I am not able to put the elements into columns.

The content looks like this (type, date, time, comment):

```
INFO 2025-10-10 08:01:23 Starting Spark application
WARN 2025-10-10 08:02:01 Memory usage is high: 75%
ERROR 2025-10-10 08:03:05 Task failed for partition 3
```

My code:

```python
from pyspark.sql.functions import split

df_split = (
    df.select(split('content', r' ', 0).alias('row'))
)
df_split.select('row').display()
```

Output:

```
row
[INFO, 2025-10-10, 08:01:23, Starting, Spark, application]
[WARN, 2025-10-10, 08:02:01, Memory, usage, is, high:, 75%]
[ERROR, 2025-10-10, 08:03:05, Task, failed, for, partition, 3]
```

I need this displayed in the following format:

```
type   date        time      comments
INFO   2025-10-10  08:01:23  Starting Spark application
WARN   2025-10-10  08:02:01  Memory usage is high: 75%
ERROR  2025-10-10  08:03:05  Task failed for partition 3
```
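One way to get there (a hedged sketch, not an official answer): pass a limit of 4 to `split` so the free-text comment is kept in one piece, then pull each array element into its own column:

```python
from pyspark.sql.functions import split, col

# limit=4 means the array has at most 4 elements, so everything after the
# third space stays together as the comment instead of being split into words.
parts = split(col("content"), " ", 4)

df_cols = df.select(
    parts.getItem(0).alias("type"),
    parts.getItem(1).alias("date"),
    parts.getItem(2).alias("time"),
    parts.getItem(3).alias("comments"),
)
df_cols.display()
```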
Why does databricks_catalog import require provider-level workspace_id and ignore provider_config?
I am trying to import an existing databricks_catalog resource using terraform import (or terragrunt import), but I'm encountering a persistent error related to a missing workspace_id. When I run the import command, I get the following error:

```
Error: cannot read catalog: managing workspace-level resources requires a workspace_id, but none was found in the resource's provider_config block or the provider's workspace_id attribute
```

This happens when my provider configuration looks like this:

```hcl
# databricks_provider.tf
provider "databricks" {
  alias         = "accounts"
  host          = "https://accounts.cloud.databricks.com"
  client_id     = var.client_id
  client_secret = var.client_secret
  account_id    = var.databricks_account_id
}

provider "databricks" {
  alias         = "workspace"
  host          = var.workspace_host
  client_id     = var.client_id
  client_secret = var.client_secret
}

# catalog.tf
resource "databricks_catalog" "this" {
  for_each = var.catalogs

  name                           = each.value.name
  storage_root                   = each.value.storage_root
  comment                        = each.value.comment
  isolation_mode                 = each.value.isolation_mode
  enable_predictive_optimization = each.value.enable_predictive_optimization

  provider = databricks.accounts
  provider_config {
    workspace_id = data.databricks_mws_workspaces.all.ids[each.value.workspace_name]
  }
}
```

My understanding was that I could use the provider_config block within the resource to specify the workspace_id dynamically, especially when managing resources across multiple workspaces. However, the error message suggests that this block is being ignored. The import only succeeds if I explicitly add `provider = databricks.workspace` to the catalog instead of:

```hcl
provider = databricks.accounts
provider_config {
  workspace_id = data.databricks_mws_workspaces.all.ids[each.value.workspace_name]
}
```

My question: given this behavior, is the Databricks provider's inability to use provider_config for the databricks_catalog resource during an import a bug, or is this feature […truncated]
Can't append results of streaming group-by aggregations
Hi, I'm relatively new to Databricks. I have a medallion architecture with the following components:

- cor_project (catalog)
  - bronze (schema)
    - raw_swell_metrics (table)
    - data (volume)
      - landing
      - checkpoints
        - raw_swell_metrics
  - silver (schema)
    - swell_metrics (table)
    - quarantine_swell_metrics (table)
    - data (volume)
      - checkpoints
        - swell_metrics
        - quarantine_swell_metrics
  - gold (schema)
    - wave_daily_summary (table)
    - data (volume)
      - checkpoints
        - wave_daily_summary

The flow is as follows: add file(s) to bronze.data.landing -> manually execute a job -> read only new file(s) and add them to bronze.raw_swell_metrics -> read only new rows in bronze.raw_swell_metrics (transform and data quality) and add them to swell_metrics or quarantine_swell_metrics -> read only new rows in silver.swell_metrics (transform) and add them to gold.wave_daily_summary. The data is uploaded every month with a new file.

The data is flowing correctly from landing to silver.swell_metrics. It fails when I'm transforming it to gold. Code:

```python
df_silver_swell_metrics = (
    spark.readStream
    .format("delta")
    .table(f"cor_project.silver.swell_metrics")
)

df_silver_swell_metrics_transformed = (
    df_silver_swell_metrics
    .groupBy(
        F.date_trunc("day", "datetime").alias("day"),
        "coast_name"
    ).agg(
        F.max("wave_height_m").alias("max_wave_height_m"),
        F.expr("max_by(wave_period_s, wave_height_m)").alias("max_wave_period_s"),
        F.expr("max_by(wave_direction_deg, wave_height_m)").alias("max_wave_direction_deg"),
        F.expr("max_by(wind_speed_ms, wave_height_m)").alias("max_wave_wind_speed_ms"),
        F.expr("max_by(wind_direction_deg, wave_height_m)").alias("max_wave_wind_direction_deg"),
        F.min("wave_height_m").alias("min_wave_height_m"),
        F.expr("min_by(wave_period_s, wave_height_m)").alias("min_wave_period_s"),
        F.expr("min_by(wave_direction_deg, wave_height_m)").alias("min_wave_direction_deg"),
        F.expr("min_by(wind_speed_ms, wave_height_m)").alias("min_wave_wind_speed_ms"),
        F.expr("min_by(wind_direction_deg, wave_height_m)").alias("min_wave_wind_direction_deg"),
        F.avg("wave_height_m").alias("avg_wave_height_m")
    )
)

df_gold_wave_daily_summary = (
    df_silver_swell_metrics_transformed
    .select(
        F.col("day").alias("date"),
        F.col("coast_name"),
        F.col("max_wave_height_m").cast("float"),
        F.col("max_wave_period_s").cast("float"),
        F.col("max_wave_direction_deg").cast("float"),
        F.col("max_wave_wind_speed_ms").cast("float"),
        F.col("max_wave_wind_direction_deg").cast("float"),
        F.col("min_wave_height_m").cast("float"),
        F.col("min_wave_period_s").cast("float"),
        F.col("min_wave_direction_deg").cast("float"),
        F.col("min_wave_wind_speed_ms").cast("float"),
        F.col("min_wave_wind_direction_deg").cast("float"),
        F.col("avg_wave_height_m").cast("float")
    )
)

(
    df_gold_wave_daily_summary.writeStream
    .format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", f"/Volumes/cor_{ambiente}/gold/data/checkpoints/wave_daily_summary")
    .toTable(f"cor_project.gold.wave_daily_summary")
)
```

This generates the following error:

```
[STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION] Invalid streaming output mode: append. This output mode is not supported for streaming aggregations without watermark on streaming DataFrames/DataSets. SQLSTATE: 42KDE
```

I have tried including a watermark; it works, but it doesn't load the records for the last day. Any idea how to solve it? Thanks for any advice.
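One common workaround (a hedged sketch, not the poster's code): skip the streaming aggregation entirely and use `foreachBatch` to re-aggregate the affected days from the full silver table, then MERGE the result into gold. That way the latest day is rewritten on every run instead of waiting for a watermark to close. Table names follow the post; only two aggregate columns are shown, and a fresh checkpoint is assumed.

```python
from delta.tables import DeltaTable
import pyspark.sql.functions as F

def upsert_daily_summary(batch_df, batch_id):
    # Which (day, coast) combinations did this micro-batch touch?
    affected = (batch_df
                .select(F.date_trunc("day", "datetime").alias("date"), "coast_name")
                .distinct())

    # Re-aggregate those days from the full silver table so results stay correct
    # even when one day's rows arrive across several batches.
    full = (batch_df.sparkSession.table("cor_project.silver.swell_metrics")
            .withColumn("date", F.date_trunc("day", "datetime")))
    agg = (full.join(affected, ["date", "coast_name"])
               .groupBy("date", "coast_name")
               .agg(F.max("wave_height_m").alias("max_wave_height_m"),
                    F.avg("wave_height_m").alias("avg_wave_height_m")))
    # ...the remaining max_by / min_by columns follow the same pattern.

    gold = DeltaTable.forName(batch_df.sparkSession, "cor_project.gold.wave_daily_summary")
    (gold.alias("t")
         .merge(agg.alias("s"), "t.date = s.date AND t.coast_name = s.coast_name")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

# Use a new checkpoint location when switching write strategies.
(spark.readStream.table("cor_project.silver.swell_metrics")
      .writeStream
      .foreachBatch(upsert_daily_summary)
      .trigger(availableNow=True)
      .option("checkpointLocation", "/Volumes/cor_project/gold/data/checkpoints/wave_daily_summary_v2")
      .start())
```

The tradeoff is that gold is no longer a pure streaming aggregation, but `availableNow` plus MERGE keeps each run incremental and the current day always gets written.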
Solutions Architect Interview – What to Expect?
Hey everyone, I have an upcoming interview for a Solutions Architect role at Databricks and wanted to hear from anyone who’s gone through the process recently. Would really appreciate insights on:

* What rounds were involved
* The type of questions asked (technical, system design, customer scenarios, etc.)
* How deep they go on architecture vs hands-on knowledge

For context, I come from a consulting background. I’m also a Databricks champion and have all 7 Databricks certs. Any tips on how to best prepare or areas to focus on would be super helpful. Thanks in advance!
ProfilingError: SPARK_ERROR. Spark encountered an error while refreshing metrics.
Login Issue
VNet Data Gateway unable to connect to Azure Databricks Serverless SQL via Private Endpoint
"Databricks Certified Data Engineer Associate exam voucher
Databricks Genie app
The Databricks Genie app is available on the Play Store!
DataHacks 2026: University Alliance in Action at UCSD
AI/BI dashboard filter having no effect?
As seen in the screenshot, the Product filter is set to allow two values, but ALL values are still displayed. Why does this filter not work?

https://preview.redd.it/4rm0z9gxmyxg1.png?width=2092&format=png&auto=webp&s=02cc5ff480269c2e29fa7fd9d55214bb41df5121
I built a reusable DABs template for multi-environment bundle projects (open source)
I've been working with Databricks Asset Bundles (recently renamed to Declarative Automation Bundles, same DABs acronym) on my project for over a year now. At some point I realized the setup I'd landed on was general enough to be reusable, so I spent about three months of evenings and weekends turning it into a proper Databricks CLI template. It ended up being more comprehensive than what I run on my own project, honestly.

You run `databricks bundle init <repo-url>`, answer some prompts (cloud provider, compute type, CI/CD platform, environment setup), and it generates a complete bundle project with:

- Multi-environment targets (user/stage/prod, optional dev)
- Schema-per-user dev isolation (dbt-style approach: everyone shares the dev catalog, schemas prefixed with username)
- CI/CD pipelines for GitHub Actions, Azure DevOps, or GitLab
- Medallion architecture schemas as bundle resources
- Configurable compute (classic, serverless, or both)
- Optional RBAC with environment-aware groups

It uses the new direct deployment engine (requires CLI v0.296.0+), so no Terraform dependency. The generated project comes with docs, a quickstart guide, and sample pipelines to start from.

Repo: https://github.com/vmariiechko/databricks-bundle-template
Example output: https://github.com/vmariiechko/databricks-bundle-template-example

MIT licensed. Happy to hear feedback or answer questions about the design decisions. And if something doesn't fit your setup, issues and PRs are welcome.
No-Code Pipelines: Databricks Lakeflow Designer Demo (4-min demo)
In this demo for Lakeflow Designer you will see me:

- Pulling sales data from Unity Catalog AND from local Excel workbooks
- Building a one-big-table report and more specialized reports
- Exporting reports to Excel
- Storing outputs back into Unity Catalog tables for use by analysts, BI developers, and even business users via the Databricks Excel connector
- Setting up a recurring pipeline so that the data is kept fresh automatically
- All of the above without writing any code

I hope you enjoy the video, but more importantly, that you try it out yourself AND give feedback to the folks at Databricks on how to make the product even better. There are still many out-of-the-box pieces not yet available, and I know that the amount of feedback Databricks gets from customers will affect direction and priorities!
Is the serverless budget/usage feature officially broken for certain serverless job "types"?
Governance in Databricks Apps
I built an app using Streamlit and it's running on Databricks Apps. The app has modules that query catalogs, sending the user's token to use Databricks governance. However, my managers didn't like that the user could execute queries within Databricks. I could use the service principal in the catalog permissions instead of authorizing the user, but I would have to create an ACL system within the app, which could make it complex. Have you built something similar and could offer some ideas?
[FREE WEBINAR] Running Supply Chain Operations on Databricks: From Dashboards to Agents (BrickTalk)
Hey r/Databricks! We're hosting a free community sponsored BrickTalk this Thursday, May 7th, focusing on modern Supply Chain Management using Databricks. BrickTalks is a community event series where Databricks experts share real-world use cases, demos, and practical insights for building with data and AI, giving customers a direct line to the people behind the products.

Discover how to build a unified Control Tower that delivers real-time inventory visibility, AI-powered demand forecasting, and autonomous planning. We'll demo an end-to-end operational platform featuring Databricks AI/BI Genie Rooms and multi-agent Supervisor workflows. You'll see:

* Dashboards surfacing key insights.
* Genie answering natural language queries grounded in live data.
* Agentic systems autonomously processing inbound requests to generate fulfillment plans.

This is a great chance to see real-world use cases and get practical insights directly from Databricks experts.

**When:** Thursday, May 7

* 9:00 am PT
* 12:00 pm ET
* 5:00 pm London
* 9:30 pm IST

[Register here](https://usergroups.databricks.com/events/details/databricks-user-groups-bricktalks-presents-supply-chain-management-bricktalk-running-supply-chain-operations-on-databricks-from-dashboards-to-agents/)

Drop any questions below! 👇
Community Alert: Free BrickTalk on Supply Chain Management with Databricks!
Data deletion on the underlying S3 files
Hi. We are designing a data platform on AWS Databricks and we keep hitting an issue: deleting data from the underlying files of our tables. To give an example, the Delta tables in Databricks exist as Parquet files + a Delta log in one of our S3 buckets, so when deleting a record from a table (for example for a "Right to be Forgotten" request) we also need to delete it from the underlying files to be compliant.

We managed to solve this particular case via external tables + VACUUM: make an external Delta table, delete the record from the table, then VACUUM to delete all the orphaned files (let's ignore retention for simplicity). This solves the issue... as long as those files are forever retained and contained in that specific bucket. But what about backups, or lifecycle policies that move our data between different S3 tiers as time passes?

Things that happen inside Databricks, like the creation of downstream tables, are "easier" to control, but how do you deal with what happens on the AWS side to those files? Of course you can create Lambdas to delete the data from those backups, etc., but having two separate deletion processes creates the issue of ensuring consistency between the two, which is a huge headache on its own. I imagine this is a common challenge - does Databricks offer any out-of-the-box solution?
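For reference, the in-Databricks half of the flow described above looks roughly like this (a hedged sketch; the table name and the zero-hour retention are illustrative, and shortening retention below the default requires disabling a safety check):

```python
# Hedged sketch of the delete + VACUUM step. Table name is a placeholder.
spark.sql("DELETE FROM catalog.schema.customers WHERE customer_id = 'erasure-request-123'")

# RETAIN 0 HOURS is only for illustration; it requires disabling Delta's
# retention safety check and breaks time travel for the vacuumed history.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM catalog.schema.customers RETAIN 0 HOURS")

# Note: this only covers the live bucket. S3 backups, versioning, and
# lifecycle-tiered copies still need their own deletion process, which is
# exactly the consistency problem raised in the post.
```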
Exam Passed on April 21st but Certification Still Missing
Databricks One is now renamed as Genie
TLDR:

* **Account-level Genie is now GA** – a single Genie experience shared across all workspaces in an account
* **Unified Genie Chat** – ask once and get answers powered by full context across your data estate, including Genie Spaces, tables, metric views, dashboards, documents, and more
* **Expanded connectors and sources** – native integration with platforms like SharePoint, Confluence, Google Drive, Glean, and others
* **Genie Mobile** – native iOS and Android app, currently available in private preview
* **Product unification** – Databricks One has been renamed to **Genie** as the unified product brand

The next generation of Databricks Genie is here - check this blog out for more details: [https://www.databricks.com/blog/next-generation-databricks-genie](https://www.databricks.com/blog/next-generation-databricks-genie)
🌟 Community Pulse: Your Weekly Roundup! April 20 – 26, 2026
Is there a way to natively mount external Iceberg REST Catalogs (e.g., BigLake) in Unity Catalog?
ai_classify() - Am I going insane?
I need someone to double-check me here because I'm hoping I'm confused. So Databricks has its nice no-code data classification tool in the UI to let users orchestrate AI functionality. Very cool so far. However, one small problem - there's no option in the UI to select a model. It just runs the default offering from the model registry.

So I think maybe that's just not implemented yet and go the SQL route to use ai_classify() directly. Lo and behold, you can't choose the model here either, and the built-in model choice is still opaque! What? Databricks doesn't seem to do any analysis on my workflow to automatically route to an appropriate model. Is it really just using the same model regardless of prompt complexity?

If this is the case, what is the actual use case for this functionality? I have been cheerleading Databricks at work for a while now and have onboarded a ton of users, but how can I endorse using ai_classify() in workflows when the spend has the potential to be so wildly mismatched to the task? I'm currently advising people who want low-code agentic solutions to use n8n, which is a shame because pretty much all the other resources they might want to include are in Databricks. Of course technical users can use ai_query(), but the overlap of people who want to code their own agentic workflows and people who want to use Databricks built-ins is pretty small.
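For comparison, the escape hatch mentioned at the end - ai_query() with an explicit serving endpoint - looks roughly like this (a hedged sketch; the endpoint and table names are examples, not recommendations):

```python
# ai_classify() uses the built-in default model; ai_query() lets you name the
# serving endpoint explicitly, which is the cost-control lever discussed above.
result = spark.sql("""
    SELECT
      review,
      ai_query(
        'databricks-meta-llama-3-3-70b-instruct',   -- example endpoint name
        CONCAT('Classify this review as positive, negative, or neutral: ', review)
      ) AS sentiment
    FROM catalog.schema.reviews
""")
result.display()
```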
Databricks One is now Genie
Tried the Lovable + Databricks connector on a hackathon project
I originally thought the Lovable/Databricks connector was kind of a gimmick. Then I had a hackathon project where all the heavy lifting was in Databricks (data processing, enrichment, a bit of ML), but the result had to be shown as a simple app for non-technical users. Tried Lovable mostly out of curiosity, and honestly, it worked better than I expected for an MVP.

A couple of practical notes in case anyone else tests it:

* the service principal needs access not just to the data, but also to the SQL warehouse / compute
* I got it working fine on Databricks Free Edition
* if you don’t cache responses, repeated queries can get expensive fast because you’re paying for warehouse runtime

I still wouldn’t treat this as my default production setup, but for demos / internal prototypes / idea validation, it was surprisingly useful. I wrote a short article with examples - [https://medium.com/@protmaks/databricks-lovable-a-practical-case-study-and-what-it-costs-to-build-an-app-085f61b07126](https://medium.com/@protmaks/databricks-lovable-a-practical-case-study-and-what-it-costs-to-build-an-app-085f61b07126)
Databricks + Lovable: A Practical Case Study of Building an MVP and Managing Costs
Lakeflow Connect: Data Ingestion from SQL Server to Databricks
45 days left for the Databricks Data Engineer Associate cert! Help me please.
Hey guys, I am not working in Databricks or any other data platform. I have very limited knowledge of cloud and zero knowledge of Databricks, but I am familiar with SQL and Python. I want to learn and pass the Associate exam within 45 days. Is this possible? Where should I start and what should I learn? Please point me to some courses or resources, and share an approach and tips to crack the exam with a good score. Thank you!
Memory error in LightGBM training data processing
[PARTNER BLOG] Access Databricks Data Natively in Excel Using the New Excel Add-in
What's new in AIBI Genie - March 2026
⚡ Inspect: It automatically improves Genie’s accuracy by reviewing the initially generated SQL, authoring smaller SQL statements to verify specific aspects of the query, and generating improved SQL as needed. 📖 Documentation => [https://docs.databricks.com/aws/en/genie/#inspect-mode](https://docs.databricks.com/aws/en/genie/#inspect-mode)

🔥 Conversation sharing: You can share Genie conversations with privacy settings: Private, Reviewable by managers, or Account-wide.

🌊 Space management APIs are GA: Create, Update, Get, List, and Trash APIs for Genie spaces are generally available. 📖 Documentation => [https://docs.databricks.com/aws/en/genie/conversation-api](https://docs.databricks.com/aws/en/genie/conversation-api)

🔊 Benchmark APIs: APIs to run benchmarks and retrieve benchmark results are now available. 📖 Documentation => [https://docs.databricks.com/api/workspace/genie/geniecreateevalrun](https://docs.databricks.com/api/workspace/genie/geniecreateevalrun)

🎸 Workspace-level color palette: Genie spaces integrate with workspace-level color palettes for consistent branding.

📣 Improved context identification: Genie better identifies context from previous messages for more accurate responses.

🎧 Ask Genie to explain chart changes: Users can right-click on bar, line, and area time series visualizations (including multi-series and stacked charts) and ask Genie to explain changes. Genie enters Agent mode to analyze the change and identify top drivers. 📖 Documentation => [https://docs.databricks.com/aws/en/dashboards/genie-spaces#explain-chart-changes](https://docs.databricks.com/aws/en/dashboards/genie-spaces#explain-chart-changes)

🎤 Genie space descriptions on dashboards: Authors can add descriptions to Genie spaces embedded in dashboards. 📖 Documentation => [https://docs.databricks.com/aws/en/dashboards/genie-spaces#genie-space-description](https://docs.databricks.com/aws/en/dashboards/genie-spaces#genie-space-description)

🪕 SQL download settings for full query results: The APIs to download full query results respect workspace-level settings for SQL downloads.

📱 Share Genie space with all account users: The share modal includes an option to share your Genie space with all account users.

Don't forget to watch the first episode of SuperSkills about Genie: [https://youtu.be/2gKS72dmGIk](https://youtu.be/2gKS72dmGIk)
Best practices for initial large-scale ingestion from on‑premises Oracle to Databricks
Getting started with multi table transactions in Databricks SQL
Transactions let you coordinate operations across multiple SQL statements and tables. All changes succeed together or roll back together, ensuring data consistency across your operations and tables.
Migrate SSRS reports from Snowflake to Databricks!
We are being onboarded onto a project (migrating SSRS reports from Snowflake to Databricks) and have never worked on anything similar, as we were mostly in a support role. Can you guys please guide us on how to approach this project, and what needs to be taken care of? If anyone has worked on something similar, can you walk us through the rough process so that we can get a broad idea and move forward? Thanks!
Serverless Notebooks and Jobs Environment Variables [let's design this together]
**Quick scenario**: you spend time getting your notebook working. `OPENAI_API_KEY` is set, `PIP_EXTRA_INDEX_URL` is pointing to your private registry, everything runs. You click Schedule. The job fails. Env var not found. Should these be set at the workspace, folder, or user level? Sound familiar? If so, we should be friends.

It's Justin Breese (PM at Databricks), and I am back to chat about dependency management - researching how to make environment variables work seamlessly across serverless notebooks and jobs, so clicking Schedule just works: no extra config, no surprises. Want them to work across workspaces, projects, etc.?

**I want to talk to you if:**

* 🔁 You re-set env vars every session because they don't persist
* 💥 You've had a notebook-to-job failure caused by a missing env var
* 🔐 Managing API keys or credentials in notebooks feels more manual than it should
* 🏢 You're a workspace admin who wants to set shared config (pip registry, endpoints) once for everyone

Options:

1. 30 minutes, no prep needed. Grab time here: [https://calendar.app.google/CxxpHKBWvxRVQM7i9](https://calendar.app.google/CxxpHKBWvxRVQM7i9)
2. Email me direct feedback and tons of context: [j@databricks.com](mailto:j@databricks.com)
3. Messenger pigeon: Send one to me?
4. Or just drop a comment - even a "yes this is a pain" tells me something useful.

Thanks!
Unity Catalog storage credential fails although same Access Connector works in another credential
Using Global Filters with Master Detail Dashboard
As seen in the screenshot, a number of global filters have been created. Identical parameters are associated with a visualization in a dashboard. How do I link them? The dashboard GUI designer is confusing. We can see that the parameters are all missing in the details panel for the visualization on the bottom half.

https://preview.redd.it/cacowzicetxg1.png?width=1024&format=png&auto=webp&s=51f0d2a71591380b2d05da198d5e998f13ccd1d4

In the dataset itself I tried to link local parameters to the global ones but cannot see how to do it. Here's an example for one of the parameters: ***usage_date***, which matches the name of a global filter. But there is no apparent way to link the two. Where/how can that dataset-specific parameter be linked to the global filter?

https://preview.redd.it/bve4lswihtxg1.png?width=512&format=png&auto=webp&s=4122b90fbddce39edb2225b43ae96f0e91f3e4a9
My credentials aren't showing...
Databricks Pipelines - Pulling Stale Wheel Files
I am building Databricks pipelines which use logic stored in wheel files, however as I push new updates the wheel files are being cached in the pipeline, causing pipeline updates to continue failing even after the update was made. (The wheel file updates in the UC volume, but the pipeline does not pull a fresh copy.)

The only way to resolve this has been to delete the \*\*\*\*\*\*\* pipeline, as a full refresh will not force the cached files to update. Furthermore, even if a full refresh worked, that is a terrible solution. Does anyone know how to force the pipeline to pull the new wheel file? I did not have this issue 2 months ago, so I'm wondering if one of their updates has caused this.
I built a 54-minute hands-on RAG tutorial on Databricks — from PDF loading to retrieval and LLM answers
Hi Everyone

I recently published a hands-on tutorial where I build a basic **RAG pipeline on Databricks** from scratch. The goal of the video is not just to use a high-level RAG framework, but to show what actually happens behind the scenes.

In the video, I cover:

* Loading PDF files inside Databricks
* Extracting text from PDF pages
* Splitting documents into chunks
* Creating embeddings using Databricks embedding endpoints
* Building a simple manual retrieval system using vector similarity
* Creating prompts from retrieved chunks
* Generating grounded answers using Databricks LLM endpoints
* Using `databricks-langchain` for embeddings and chat models

I intentionally kept the implementation simple so that beginners can understand the core mechanics of RAG before moving to more production-level tools like Vector Search, Unity Catalog, MLflow, etc.

Here is the video: [https://youtu.be/7QY1iXPLgRg](https://youtu.be/7QY1iXPLgRg)

Would love to hear feedback from people working with Databricks, RAG, LangChain, or enterprise GenAI systems. Also curious: for production RAG on Databricks, would you prefer starting with a simple manual implementation like this first, or directly using Mosaic AI Vector Search / Databricks Vector Search from the beginning?
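For readers skimming before watching, here is a hedged sketch of what the manual retrieval step in such a pipeline tends to look like: embed the chunks and the question, rank by cosine similarity, then prompt a chat model with the top chunks. The endpoint names and chunks are assumptions, not taken from the video.

```python
import numpy as np
from databricks_langchain import DatabricksEmbeddings, ChatDatabricks

# Assumed pay-per-token endpoint names; substitute whatever your workspace exposes.
emb = DatabricksEmbeddings(endpoint="databricks-gte-large-en")
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")

chunks = ["...chunk 1 from the PDF splitting step...", "...chunk 2..."]  # placeholders
chunk_vecs = np.array(emb.embed_documents(chunks))

question = "What does the document say about renewal terms?"
q_vec = np.array(emb.embed_query(question))

# Cosine similarity, keep the top 3 chunks as context.
scores = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
context = "\n\n".join(chunks[i] for i in scores.argsort()[::-1][:3])

answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)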
How do you reframe data engineering for a CEO who thinks it's "data quality oversight"?
The next generation of Databricks Genie
The new Genie can answer questions beyond the boundaries of a Genie Space, connect to enterprise knowledge sources like Google Drive and Sharepoint, and combine structured and unstructured data to generate insights. Genie now includes all capabilities previously known as Databricks One, marking a significant shift in how business users engage with the platform. Along with account-level access and native iOS and Android apps, Genie is becoming the primary way users experience Databricks - available anytime, anywhere.
Security Analysis Tool for Databricks (Deep-Dive w/ Arun, Principal Security Engineer @ Databricks)
SOC 2, HITRUST, and so many other security things! Security requires a multi-pronged approach. Databricks' Arun Pamulapati did a deep-dive into the Security Analysis Tool, a tool created by field engineers at Databricks to help you improve your organization's Databricks deployments security posture against threats. This is a very technical deep-dive and I hope you enjoy it! Link to repo: [https://github.com/databricks-industry-solutions/security-analysis-tool](https://github.com/databricks-industry-solutions/security-analysis-tool) Link to Databricks' security best practices: [https://www.databricks.com/trust/security-features/best-practices](https://www.databricks.com/trust/security-features/best-practices)
Where do I look for a solid DQ setup?
I would love to see some repos or video tutorials with a clean and easy DQ setup for monitoring data quality and alerting. I find the built-in Databricks capabilities a bit limited; for instance, the Alerts feature cannot send a full table in the email notifications. Can anybody recommend somewhere to look? It's a jungle.
Databricks outage
There’s an outage affecting **Databricks compute services** (clusters, jobs, and serverless). Unfortunately, it affects all cloud providers.

**What’s happening**

* Failures when starting clusters
* Job runs failing or terminating during startup

https://preview.redd.it/yegrpp87yrxg1.png?width=1360&format=png&auto=webp&s=e09c555073e68fd543ca1a7ddb0d5dc2ecd10eda

https://preview.redd.it/bdrstpccyrxg1.png?width=684&format=png&auto=webp&s=4459cb189e95263ecc490484ceb9967c90bbc268
Another week, another databricks outage.
I'm so sick of this. This is, I think, the 3rd one in a few weeks? Has something changed over there? And all we get is a "sorry." I like a lot of the features of Databricks, but reliability feels so half-baked, along with some features that one would assume would be core.
Feature Request: API support for Context-Based Ingress Control IP lists
AWS GovCloud Feature Availability Question
Avoid High Write Costs in Storage when Using Spark Declarative Pipelines
Hi All. Earlier this year a pipeline was turned on using Spark Declarative Pipelines in continuous mode. Immediately we noticed an explosion in storage write costs in the data lake. We lowered these costs massively by making some configuration changes, and I hope our learnings help someone in the future who has the same problem.

Two very important settings to configure are pipelines.trigger.interval and spark.sql.shuffle.partitions. The fundamental issue is that there are changelog files that get updated with every refresh that a table has within the pipeline. The partition setting matters because with the default of 200 partitions there will be 200 changelog files and 200 changelog.crc files that get updated for every refresh of each table. For every refresh there are also four write operations in the data lake to each of these changelog files: flush, append, create, and rename. This means that every refresh generates 1,600 write operations (200 partitions x 4 write operations x 2 changelog types).

If you do not set a trigger interval it defaults to 5 seconds, and we were seeing even shorter intervals at times. So without these configuration changes there will be ~27 million write operations per table per day if you have a pipeline that ingests frequently throughout the day.

We have since updated these settings, and although the partitions help, just changing the pipeline trigger interval to 60 seconds will bring large savings if that sort of latency is acceptable. If you want very low latency then leave the interval as is, but please look into the partitions as a minimum. Not only have our storage write costs decreased, but we have also noticed significantly less compute power required.

Please keep in mind this was for an ingestion of a few million records a day, so the refreshes were frequent, which may not always be the case depending on your data volumes. Either way, I would highly recommend you look at your pipelines and storage write costs specifically to ensure the same is not happening to you.

____________

This is an update following this earlier post: [https://www.reddit.com/r/databricks/comments/1slgxmb/do_you_set_pipelinestriggerinterval_on_spark/8](https://www.reddit.com/r/databricks/comments/1slgxmb/do_you_set_pipelinestriggerinterval_on_spark/8)
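For anyone wanting to apply this, here is a hedged sketch of where those two settings can live in a declarative pipeline definition; the table and source names are placeholders, and both keys can alternatively be set in the pipeline's configuration rather than per table:

```python
import dlt

@dlt.table(
    spark_conf={
        # Refresh at most once per minute instead of the ~5 second default,
        # dramatically reducing the number of changelog rewrites per day.
        "pipelines.trigger.interval": "60 seconds",
        # Fewer shuffle partitions -> fewer changelog / .crc files touched per refresh.
        "spark.sql.shuffle.partitions": "8",
    }
)
def events_silver():
    # Placeholder source; substitute your own streaming source table.
    return spark.readStream.table("catalog.bronze.events_raw")
```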
Heading into the May 2026 Databricks Data Engineer Associate Exam? Read this first.
So if you've been scrolling through older study guides for the Databricks Data Engineer Associate exam — be careful. The syllabus got a pretty big update this month, and the focus has shifted toward the platform's newer declarative features. I spent some time going through the new guidelines. Here's what I found.

Lakeflow is the new standard. The exam has moved away from manual ETL logic. You need to understand Lakeflow Spark Declarative Pipelines (formerly DLT) and how Streaming Tables and Materialized Views actually differ. If your notes still say "DLT" everywhere, time to update them.

DABs are no longer a side topic. Databricks Asset Bundles — basically infrastructure-as-code for workflows — is now a core part of the exam. They want to see that you can deploy through DABs, not just click around the UI.

Unity Catalog is the default assumption. No more legacy Hive Metastore questions. The exam lives in a UC-enabled world now. Three-tier namespace (catalog.schema.table), Volumes for unstructured data, column-level lineage — that's where your time should go.

Serverless Compute is showing up more. When do you pick Serverless SQL Warehouses or Serverless Jobs over classic clusters? That tradeoff — less config overhead vs. less control — is fair game now.

The weightings that surprised me:

→ 31% on Processing (Lakeflow, Spark, Streaming Tables)
→ 18% on Productionizing (DABs, Workflows, deployment)

That's almost half the exam right there. Honestly, if you just understand why Databricks is pushing toward declarative tools — letting the platform handle the boring parts so you can focus on the actual logic — a lot of the questions start to make sense.

For practice material, BricksNotes has an updated practice test that follows the May 2026 format — 45 questions, 90 minutes, same weightings.
→ [bricksnotes.com/blog/databricks-data-engineer-associate-new-exam-guide-may-2026](http://bricksnotes.com/blog/databricks-data-engineer-associate-new-exam-guide-may-2026)

Good luck to everyone testing this month! Drop questions below if you're stuck on any of the new topics — happy to help where I can.
Databricks Lakebase - Modern Enterprise Healthcare Agents with Lakebase Memory
Accidentally submitted my exam
Transitioning from ADF to Databricks Workflows: Best Practices in a Multi-Workspace (dev-prod)
Is there an alternative way to write xlsx files that takes less time?
I am using PySpark as my language to write xlsx files, and I was trying to find a quicker way to write them. I need to write around 20 files and it takes around an hour to write them all. I am not doing any formatting in the xlsx files, but it still takes that long, so does anyone know a quicker way to write those xlsx files?
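Without seeing the current approach, here is a hedged sketch of one common speed-up: if each workbook is small enough to fit on the driver, convert to pandas and write the 20 files concurrently instead of one after another. The table names and output paths below are placeholders, and it assumes openpyxl is installed on the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder list of (source table, output path) pairs.
outputs = [
    (f"catalog.schema.report_{i}", f"/Volumes/catalog/schema/exports/report_{i}.xlsx")
    for i in range(20)
]

def write_xlsx(job):
    table_name, path = job
    # toPandas() pulls the data to the driver; fine for small, report-sized tables.
    spark.table(table_name).toPandas().to_excel(path, index=False, engine="openpyxl")

# Excel writing is mostly single-threaded CPU + I/O on the driver, so writing
# several files at once usually cuts the wall-clock time substantially.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_xlsx, outputs))
```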
Lakeflow jobs file trigger thru overwritten files
OpenAI GPT-5.5 + Codex, now available and fully-governed in Databricks
System Tables are... overcomplicated? + some helpers
Had a chance to play with System Tables a bit more in the last 2 weeks - every meaningful query takes 70-100+ lines of SQL due to horrendous design decisions, and even Genie Code makes mistakes all the time when writing these.

**SCD2 tables lack the basic timestamps**

The advertised SCD2-like tables (jobs/tasks/clusters/warehouses) lack basic timestamping functionality, like __start_time / __end_time; to use them appropriately one has to apply a windowing function every single time. It's even more surprising considering how Databricks promotes the autoCDC functionality, which adds these by default.

**No SCD1-like VIEWs**

One has to remember to use `ROW_NUMBER()` on SCD2-like tables, or suffer from duplicates post-JOIN.

**Job runs / job task runs slicing**

Inconsistency between these and `system.query.history`. The former emits hourly slices while the latter updates the already emitted row in-place. Every time a job run time is needed, one has to use a GROUP BY. Additionally, the `compute_ids` column on job runs doesn't contain all computes attached to its tasks - it's a documented flaw, but still.

Is there any good source for SQL queries against System Tables? The jobs system tables documentation seems to be the only place that lists anything more complicated, but it's still lacking basics like AVG CPU usage per job run, together with the cluster/worker configuration at runtime (i.e. how many workers the autoscaling scaled to, etc.): [https://docs.databricks.com/aws/en/admin/system-tables/jobs](https://docs.databricks.com/aws/en/admin/system-tables/jobs)

Maybe some of you will find these helpful; I wish we were able to create views inside the `system` catalog. First part - 4 SCD1 views for jobs/tasks/clusters/warehouses:

```sql
CREATE OR REPLACE VIEW shared_prod.system.jobs_scd1 AS
SELECT * EXCEPT(rn)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) rn
      FROM system.lakeflow.jobs)
WHERE rn = 1;

CREATE OR REPLACE VIEW shared_prod.system.tasks_scd1 AS
SELECT * EXCEPT(rn)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id, task_key ORDER BY change_time DESC) rn
      FROM system.lakeflow.job_tasks)
WHERE rn = 1;

CREATE OR REPLACE VIEW shared_prod.system.warehouses_scd1 AS
SELECT * EXCEPT(rn)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, warehouse_id ORDER BY change_time DESC) rn
      FROM system.compute.warehouses)
WHERE rn = 1;

CREATE OR REPLACE VIEW shared_prod.system.clusters_scd1 AS
SELECT * EXCEPT(rn)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) rn
      FROM system.compute.clusters)
WHERE rn = 1;
```

Then 2 SCD1 views for job runs / task runs with additional `run_start`, `run_end`, `run_last_seen`, `run_duration_seconds`, and `retries` columns:

```sql
CREATE OR REPLACE VIEW shared_prod.system.job_run_timeline_scd1 AS
WITH base AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id, run_id ORDER BY period_end_time DESC) rn
  FROM system.lakeflow.job_run_timeline
  QUALIFY rn = 1
),
agg AS (
  SELECT workspace_id, job_id, run_id,
         MIN(period_start_time) AS run_start,
         MAX(period_end_time) AS run_last_seen,
         SUM(CASE WHEN result_state IS NOT NULL THEN 1 ELSE 0 END)
           - CASE WHEN MAX_BY(result_state, period_end_time) IS NOT NULL THEN 1 ELSE 0 END AS retries,
         CASE WHEN MAX_BY(result_state, period_end_time) IS NOT NULL THEN MAX(period_end_time) ELSE NULL END AS run_end
  FROM system.lakeflow.job_run_timeline
  GROUP BY ALL
)
SELECT b.* EXCEPT(rn, period_start_time, period_end_time),
       a.run_start,
       a.run_end,
       a.run_last_seen,
       a.retries,
       TIMESTAMPDIFF(SECOND, a.run_start, a.run_end) AS run_duration,
       (a.run_end IS NULL) AS is_running
FROM […truncated]
```
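As a usage sketch against the views defined above (my own example, not from the post), average run duration and retry counts per job become a short join instead of a window-function exercise; the column names assume the views shown above exist:

```python
# Hypothetical usage example against the SCD1 views from the post.
summary = spark.sql("""
    SELECT j.name,
           COUNT(*)            AS runs,
           AVG(r.run_duration) AS avg_run_duration_seconds,
           SUM(r.retries)      AS total_retries
    FROM shared_prod.system.job_run_timeline_scd1 r
    JOIN shared_prod.system.jobs_scd1 j
      ON r.workspace_id = j.workspace_id AND r.job_id = j.job_id
    WHERE r.run_start >= current_date() - INTERVAL 7 DAYS
    GROUP BY j.name
    ORDER BY avg_run_duration_seconds DESC
""")
summary.display()
```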
Best Spark observability tool for run-to-run stability in 2026?
Spent the last 6 weeks tuning Spark configs across a handful of jobs. Executor memory, parallelism, shuffle partitions - went through the usual levers. Runtime improved on most runs but job stability didn't move. The same jobs that ran faster after tuning still fail or slow down randomly under load. The numbers look better on average but the variance is wider than before. A job that used to take 28 min consistently now finishes anywhere between 18 and 55 min.

Checked for skew, GC pressure, shuffle spill. Nothing obvious. Bad runs don't leave much of a trace, just slow tasks with no clear pattern between them. The working theory is that tuning without run-to-run visibility just moves the problem: you improve one metric and introduce instability somewhere else without seeing it. What's missing is a Spark observability tool that shows what shifts between a good run and a bad one, not just aggregate stage times but the specific conditions that differ.

How do you approach stability separately from raw performance? And has a Spark observability tool helped you connect the variance to a root cause?
Last week
93 questions

I built a free 11-tab Cost Observability Dashboard as a Databricks App — open source
Best option for parallel processing
Kind request to reschedule my Databricks Data Engineer Professional Exam
Looking for a Data Engineering Mentor / Enterprise-Level Hands-On Project (Azure Databricks + ADF)
Hi everyone, I’m currently a Data Analyst transitioning into a Data Engineering role, and I’m looking for structured hands-on guidance. Over the past few months, I’ve been learning through YouTube tutorials, documentation, and building small projects. However, I now realize I’m missing real enterprise production experience — understanding how everything fits together end-to-end.

Tech stack I’m focusing on:

• Azure Data Factory (ADF)
• Azure Databricks
• PySpark
• Delta Lake / Medallion Architecture (Bronze–Silver–Gold)

What I’m looking to learn through a real enterprise-style project:

• Proper project structure in Azure Databricks
• Unity Catalog governance
• CI/CD setup using Databricks Asset Bundles
• Orchestration & workflow design
• Batch & Streaming pipelines
• Delta Live Tables (DLT) pipelines
• Optimization & performance tuning
• Error handling, monitoring & logging strategies
• Slowly Changing Dimensions (SCD) implementation
• End-to-end pipeline design from ingestion to serving layer

I’m looking for:

- A mentor / tutor / experienced data engineer
- Short-term structured guidance (~1 month)
- Paid mentorship or project-based learning is completely fine

If you can guide me or know a reliable mentor/platform, please comment or DM me. Thank you — I truly appreciate any help 🙏
Databricks admin and maintenance
What maintenance should we be doing as admins? A query was taking 1 minute, and an LLM suggested a stats update command which improved the time to 0.2s. Makes me think I’m not doing my job!
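For reference, the "stats update" being described is most likely something along these lines (a hedged sketch; the table name is a placeholder, and predictive optimization can take care of much of this automatically for Unity Catalog managed tables):

```python
# Hypothetical maintenance commands of the kind the post refers to.
spark.sql("ANALYZE TABLE catalog.schema.sales COMPUTE STATISTICS FOR ALL COLUMNS")  # refresh optimizer stats
spark.sql("OPTIMIZE catalog.schema.sales")   # compact small files
spark.sql("VACUUM catalog.schema.sales")     # remove unreferenced files (default retention)
```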
Lakebase Makes Databricks Feel More Complete
$200 per year lab subscription
Additional thoughts after spending about 10 hours trying the new Lakeflow Designer
After spending some more time on Lakeflow Designer last night/into the early hours of today, it is now my favorite feature from Databricks this year, and I really hope Databricks builds even more features around it. Some thoughts:

- Just about everything in the pipeline shown was "vibed", but every step has a visual representing the flow of data in a way that technical and tech-savvy business folks can easily audit and/or modify as needed.
- The ability to use AI functions like ai_classify, ai_summarize, and others is the cleanest path for discovery, testing, and getting things into production. AI functions for everyone. You don't even need to know SQL!
- Being able to easily bring data from Excel into a governed environment to be joined with governed tables is very practical, representing how things happen in the real world. And of course, being able to export back to Excel is also nice (or connect via the Excel connector!).
- Is it 100% ready to beat all the tools out there in this category? No. Will it eventually get there? I believe so, and more. Today, you can already begin to do a lot with it.
- If you do try it out and wish it had some other capabilities, I highly encourage you to share your feedback with Databricks, as I know they are actively listening to make this product beneficial for a broad range of customers.

The fact that there are already more than a handful of videos out there showcasing Lakeflow Designer tells me many others are very excited about this as well. In the next few days, I'll be sharing a video or two of my own around this.
Unable to see lakeflow designer option in my free edition databricks account
Databricks certificate Last Name
Hi, I have registered for the Databricks Certified Data Engineer Associate exam, and while registering, the Last Name section was mandatory. I don't have a last name on my legal documents, so I put 'NA'. Is that acceptable?
Passed the Databricks Data Engineer Associate last week - sharing what worked and a free practice test.
Got my DEA cert last week with 82%. Figured I'd write up what I did since I wasted a lot of time early on trying to figure out what was worth studying and what wasn't.

I've been using Databricks at work for about a year. PySpark and Delta Lake stuff mostly. But I had real gaps in Unity Catalog, Lakeflow Declarative Pipelines, and Databricks Asset Bundles because my team doesn't touch those much.

Started with the official exam guide. The November 2025 one from Databricks. Honestly this should be the first thing anyone reads. It breaks down exactly how much of the exam comes from each section. I had no idea 18% was about productionizing and DABs until I read it.

Did some Databricks Academy stuff. It's fine. If you already use Databricks every day you can skip the intro material and just hit the areas you're shaky on.

But the thing that helped the most was taking practice tests. Reading is one thing. Sitting down with a timer and actually answering questions is completely different. I found a free one at [bricksnotes.com](http://bricksnotes.com) that matched the format pretty well. 45 questions, 90 minute timer, same five sections. Took it three times over two weeks and went from 58% to 71% to 84%. It breaks your score down by section so you can see exactly what you're bad at instead of guessing what to study next. The real exam was a bit harder but the format was the same. Scenarios, code to read, answers that all sound reasonable if you don't really know the material.

Some stuff I wish someone told me before:

Actually read the code in the questions. Don't just glance at it. Some questions have small things in the code that completely change the answer.

Unity Catalog permissions come up a lot. GRANT vs ownership vs inheritance. What a metastore admin can do vs a catalog owner. External locations and storage credentials. Know this cold.

Delta Sharing is tested more than I expected. Not just "what is it" but internal vs external sharing, cost stuff, limitations, what recipients can actually do.

Medallion Architecture questions aren't "what are the three layers." They're more like "should this transformation happen in silver or gold and why." They test your judgment not your memory.

Lakeflow Declarative Pipelines questions focus on why you'd use it and how expectations work. You don't need to have built a complex pipeline. You need to understand the advantages over traditional ETL and how streaming tables vs materialized views differ.

DABs came up more than I expected. Know the basic structure and why you'd use them over manually deploying notebooks.

90 minutes is plenty of time. I had 25 minutes left. If you're finishing practice tests comfortably you'll be fine.

I prepped for about 3 weeks. Couple hours a day after work. If you use Databricks already that's enough. If you're starting fresh probably give yourself 6 to 8 weeks.

Happy to answer questions if anyone's prepping right now.
databricks x openui
GitHub - learning and applying
Does anyone have a recommendation for learning and applying Git in the Databricks environment? Our team is struggling with implementing DABs for dev, test, UAT, and prod. We have one pattern that is specifically for a migration project, which incorporates a lot of scripts for curating data. We know we have gaps in understanding around the src and resource folders. In this prototype, we could not roll back a build using GitHub practices. Once we overcome that hurdle, we have to move forecasting models into this single repo. Is it common practice for one repo to hold both data curation and models for 2-3 lines of business, or is it normal to be conflicted about what should live in a single repo?
What to Attend at Data + AI Summit?
It's my first time going and I've been using Databricks for years. I'm a Data Engineer but looking to possibly transition to ML Engineer in the future.

1. When should I arrive? Is the training/cert worth it on Monday? If it is, I guess I'd arrive early.
2. The responses in [this post](https://www.reddit.com/r/databricks/comments/1rfpvvs/data_ai_summit_worth_it/) say the expo is the best part. I see that's on Tuesday/Wednesday. I'm guessing Thursday is pretty quiet?
3. There are so many breakout sessions. I definitely want to learn something and not attend some business focused, hand-wavy ones. How do I choose?
DABS: do like a pro: all the best tips & tricks
Proud to announce that I will be speaking at the Databricks Data+AI Summit: "DABS: do like a pro: all the best tips & tricks". If you want to join me at DAIS, DM me for a discount code. #databricks
Web scraping -> entity resolution -> normalized model -> API serving layer pipeline
Data is fetched from a variety of sources: XML files from FTP server, public JSON API, web scraping HTML pages, downloading PDF pages that need OCR, ... These sources contain data about private companies and their shareholders. Entities need to be resolved: link two address observations if they are the same, link two people observations if they are the same, ... This needs to be brought together into one combined model. This is followed by a very fast serving layer to power my own API that will be directly consumed by users, an app, and an MCP server. There is an initial load of about 10 million company and people rows, as well as 50 million PDF pages that need OCR. Every day about 10k elements are added. Currently I'm doing this in PostgreSQL hosted on Railway, with DuckDB to perform the entity resolution. I have 260 GB of data in total. I have a cron job for each source. These are the schemas: raw (separate schema for each source), xref (entity resolution), core (normalized) and mart (serving layer). I have 1 mono repo with all of the code, most of it is TypeScript with Bun. My problem is that it has become hard to manage. Things feel a bit duct-taped as I have little observability. I don't have a clear overview of the data pipeline. Additionally, doing initial loads can take many hours. I was thinking Databricks could be a unified data platform from which I can manage this. One thing I'm not sure about is how to manage the scraping, as I don't think Databricks is really built for this. Anyone that has had to work on a similar problem? How would you solve this?
Genie Code System Prompt
Anyone figured out the Genie Code system prompt?
Sub-Second Latency in Spark: Real-Time Mode is Generally Available On Databricks
Databricks Lakeflow Designer — Design Visual data preps
Coinbase Scales Real-Time Security
By leveraging Real-Time Mode in Spark Structured Streaming, we’ve achieved an 80%+ reduction in end-to-end latencies, hitting sub-100ms P99s, and streamlining our real-time ML strategy at massive scale. This performance allows us to compute over 250 ML features all powered by a unified Spark engine.
Testing lightweight app performance on Linux clusters (any simple workflow or tools?)
An easier way to build your slowly changing dimensions model in your warehouse
Try it out in your Query Editor or on your Databricks SQL warehouse today!
Uploading file to volume and start ingestion job
Conceptual Modeling Is the Context Engineering Nobody Is Doing
Terraform errors with "provider_config is not a supported block" for databricks_rfa_access_request_destinations
I am trying to create a databricks_rfa_access_request_destinations resource in Terraform, but I'm encountering an error: provider_config is not a supported block. My goal is to dynamically assign a workspace for the resource based on a loop. I'm using the provider_config block to specify the workspace_id, but it seems this is not the correct approach. Here is the code I am using: resource "databricks_rfa_access_request_destinations" "catalog_manager" { for_each = var.catalogs destinations = [for item in var.request_access_destinations: {destination_id = item.id, destination_type = item.type}] securable = { type = "CATALOG" full_name = each.value.name } provider = databricks.accounts provider_config { workspace_id = data.databricks_mws_workspaces.all.ids[var.catalogs[each.value.name].workspace_name] } depends_on = [databricks_catalog.this] } Based on the official Terraform documentation for the databricks_rfa_access_request_destinations resource, it appears that provider_config is indeed a valid argument. However, Terraform is not recognizing it in my implementation. Version information: terraform: Initializing the backend... terraform: Initializing provider plugins... terraform: - Reusing previous version of databricks/databricks from the dependency lock file terraform: - Reusing previous version of hashicorp/google from the dependency lock file terraform: - Reusing previous version of hashicorp/google-beta from the dependency lock file terraform: - Using previously-installed databricks/databricks v1.113.0 terraform: - Using previously-installed hashicorp/google v7.29.0 terraform: - Using previously-installed hashicorp/google-beta v7.29.0 terraform: Terraform has been successfully initialized! Could someone please help me understand why I am getting this error and what is the correct way to dynamically assign a workspace to the databricks_rfa_access_request_destinations resource?
The Lakebase Hub: Official Community Space for Lakebase Insights
Lakeflow Connect | Confluence (GA)
Hi all, Lakeflow Connect's Confluence connector is now GA! The Lakeflow Connect Confluence connector provides a managed, secure, and native ingestion solution for Atlassian Confluence data — ingesting pages, spaces, blogposts, attachment metadata, and more into Delta tables. Try it now: 1. [**Set up Confluence as a data source**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/confluence-source-setup) 2. [**Create a Confluence Connection in Catalog Explorer**](https://docs.databricks.com/aws/en/connect/managed-ingestion#confluence) 3. [**Create the ingestion pipeline via the UI, a Databricks notebook, or the Databricks CLI**](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/confluence-pipeline)
Announcing the Public Preview of Lakeflow Designer
Data Quality on Databricks design
Hey, I am deciding between DQX and Deequ for data quality on Databricks, or maybe even using both. I think Deequ is amazing because of AnomalyCheck, which lets us compare batch to batch and keep the data flow consistent over time, which is very underappreciated, while DQX is amazing at row-level detection. How did you design your data quality on Databricks? I was thinking of using DQX for in-transit data quality checks with hard fails, and Deequ's AnomalyCheck for observation/dashboards/notifications.
Databricks Serverless environment resets after installing custom JFrog packages
I'm trying to install custom packages from our JFrog Artifactory repo in DBX Serverless, and it is somehow not an easy task. It was suggested that I install our Python package via a script in each task run, but that seems kinda weird to me. My goal is to reduce code complexity with a shared package, not duplicate the code on each task run. It was also suggested to always run it as the first cell in a notebook, but I mostly use Python files. I tried to run it as an initial task, but the environment gets reset afterward and my package and the env settings are gone. How can I do this without resetting my environment?
From 150 Lines of MERGE INTO to 7 Lines of SQL: AUTO CDC Comes to Databricks SQL
What is the maximum size that can be read using dbutils.fs.head?
Delta Lake Secrets: What Happens After You Run Write, Update or Merge
I wrote a practical deep dive on Delta Lake that explains what actually happens behind the scenes—not just the basic theory. Most tutorials stop at “Delta supports ACID and Time Travel,” but I wanted to understand *how* it really works. In this blog, I covered: • `_delta_log` and transaction logs • Why Delta never deletes old files immediately • Checkpoints and snapshot mechanism • Data skipping and how Z-Ordering improves performance • History, Restore, and Time Travel • Merge, Update, Delete operations • Convert Parquet to Delta • Optimize and the small file problem • Real PySpark examples for every concept I tried to explain everything in a simple, practical way with real examples instead of documentation-style theory. [https://medium.com/@wnccpdfvz/why-delta-lake-is-faster-than-traditional-data-lakes-5c865f67b66b](https://medium.com/@wnccpdfvz/why-delta-lake-is-faster-than-traditional-data-lakes-5c865f67b66b)
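The topics above boil down to a few lines of PySpark; here is a minimal, hedged illustration of history and time travel (the table name is a placeholder and `spark` is the usual notebook session):

```python
# Two commits to the same Delta table create two versions in _delta_log.
spark.range(5).write.format("delta").mode("overwrite").saveAsTable("tmp_delta_demo")
spark.range(10).write.format("delta").mode("overwrite").saveAsTable("tmp_delta_demo")

# Every commit is visible via DESCRIBE HISTORY, including per-operation metrics.
spark.sql("DESCRIBE HISTORY tmp_delta_demo") \
     .select("version", "operation", "operationMetrics") \
     .show(truncate=False)

# Time travel: the old files stay on storage until VACUUM, so version 0 is still queryable.
spark.sql("SELECT COUNT(*) AS cnt FROM tmp_delta_demo VERSION AS OF 0").show()  # 5 rows at v0
```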
Learning Series | DevOps Essentials for Data Engineering
Lakebase login via REST for a service principal
How Long have you been using Genie Code ? Any thoughts ?
Genie Code in Action: [https://www.youtube.com/watch?v=heouBA5U1bE&list=PLTPXxbhUt-YXHs9ooubi7td9jsPC_DuC7](https://www.youtube.com/watch?v=heouBA5U1bE&list=PLTPXxbhUt-YXHs9ooubi7td9jsPC_DuC7)
Why Unity AI Gateway ?
# Fine-Grained Permissions and Guardrails Fine-grained permissions and guardrails prevent what shouldn't happen in the first place. **Granular access control for tools** When agents call MCP servers to access internal systems, Unity AI Gateway supports [on-behalf-of user](https://docs.databricks.com/aws/en/generative-ai/mcp/external-mcp) execution. The MCP executes with the requesting user's exact permissions, not a shared service account. If a user can't access a Salesforce record, neither can the agent—even with elevated privileges. **Flexible guardrails powered by LLM judges (Beta)** Unity AI Gateway's guardrails use a prompt + model approach—configure them to run on requests, responses, or both: * **PII Detection & Redaction:** Detects and masks emails, SSNs, phone numbers before they reach external models * **Content Safety:** Block toxic, harmful, or inappropriate content with customizable filters * **Prompt Injection Detection:** Catch jailbreak attempts trying to override system instructions * **Data Exfiltration Prevention**: Prevent exposure of training data or proprietary content * **Hallucination Guard:** Validate responses against grounding sources * **Custom Guardrails:** Define your own with a custom prompt and model Each guardrail is backed by an editable prompt and configurable model—not rigid pre-built logic. When violated, Unity AI Gateway can reject the request or mask sensitive data. All actions get logged for audit. This capability is currently rolling out and will be available in all supported regions within the next week. # End-to-End Observability Three teams need answers when AI agents hit production: FinOps wants to know what's costing money, engineering needs to debug failures, security needs audit trails. Unity AI Gateway gives each team what they need from the same unified logging infrastructure.
Supporting File unrecognition in DLT Pipeline.
What are you using for Spark agents with Databricks at scale?
Mid-sized org, around 300 people. Running multiple Databricks workspaces across AWS and Azure, hundreds of Spark jobs daily. Debugging slow jobs, skew, small files, memory spills, shuffle issues takes too much manual time. Spark UI and Databricks monitoring cover the basics but at this scale someone always has to notice something is wrong before investigation starts. That lag is becoming a real problem. Started looking into Spark agents to automate detection and surface bottlenecks without someone manually going through every stage. The volume justifies it but not sure what's production-ready vs still experimental. What are people using for Spark agents at this scale? Does it hold up across multiple workspaces or does it fall apart once jobs get into the hundreds daily?
Ingest codebeamer data in databricks
Hi Everyone, Codebeamer is an Application Lifecycle Management tool. I have been assigned to get all the data out of this application into Databricks, i.e., ingesting all data into Delta tables and creating a data model fit for analytics purposes. I have been given only the Codebeamer APIs to work with. Anyone who has worked on similar projects, let me know your insights, suggestions, and best practices. Thanks in advance.
Managed Delta table: time travel blocked after automatic VACUUM
Unable to make fresh deployments to an agent model serving endpoint due to permission issues
Suggestions for must-review or must-do Databricks end-to-end solutions, aiming for a Data Architect role
Hi all, Can you suggest any GitHub projects or YouTube videos that cover complex and interesting end-to-end Databricks projects, to gain experience and exposure similar to a prod environment?
How Genie Code is Transforming Data Workflows in Databricks
DABs direct deployment engine plan-file tool
Hello, I originally built this for my team's local DevEx and CI/CD stacks, but figured others might get some use out of it too. With the direct deployment engine for Declarative Automation Bundles heading towards GA, I think it is a good time to share it. [dagshund](https://pypi.org/project/dagshund/), a visualizer for DAB plan files. Zero runtime dependencies. Pipe in a plan.json like this: databricks bundle plan -o json | uvx dagshund and depending on what you want you may get: * a colored diff on stdout for quick sanity checks * a markdown summary you can drop into a PR comment * a self-contained interactive HTML report with a resource graph and per-job task DAGs (React Flow + ELK layout, no external assets, just open the file). Live examples [here](https://dagshund-806de7.gitlab.io/). The CI piece came out of always feeling a bit blind when deploying large bundles without a terraform-style plan. It also exposes detailed exit codes so you can gate pipelines on drift or dangerous actions (resource replaces & destroys). The HTML report is for anyone who'd rather see a graph than scroll through a wall of diff text, or who wants more detail on job DAGs than a terminal can fit. It also surfaces lateral dependencies and hierarchy relationships that the plan file mentions but state doesn't track directly, so it's generally just more detailed. We also use them as build artifacts in our PRs. Docs and full feature list on [GitLab](https://gitlab.com/chinchy/dagshund), or on the [GitHub](https://github.com/chinchyisbored/dagshund) mirror. Fair disclaimer: since the direct deployment engine isn't GA yet, expect dagshund to follow along if the plan format shifts before release; sometimes this may take a day or two. Cheers! Edit: I opened up Issues on GitHub in case there is a need for interaction and you don't want to use GitLab for that.
Databricks Hosted Foundation Models usage and costs
Can change a delta table, but not its schema
From HMS to Unity Catalog: A Self-Service Migration Playbook
Ingest data from REST endpoint into Databricks
The Next Era of the Open Lakehouse: Apache Iceberg™ v3 in Public Preview on Databricks
This is great news for Apache Iceberg users on Databricks. V3 is bringing some interesting features from Delta to Apache Iceberg.
Databricks 5 Minute Features: Lakeflow Designer
Lots of noise going around on the Public Preview of Lakeflow Designer yesterday. It is a low code/no code experience for building full scale data pipelines. UI/AI based - but with full code generation behind the scenes. Want to see for yourself what it is and what it can do for you? I got you covered with the latest 5 Minute Features!
Lakeflow Designer for technical and business users
Lakeflow Designer is finally here! A while back, I was able to see the early prototypes, so happy to finally see it become a full product that is available to Databricks customers, and frankly myself. As a former Alteryx user, it is great to have a Databricks native option for building workflows in a way that both developers and less technical business users can easily understand and audit the logic at a granular level. Good integration of both UI & Genie Code. Very good tool for finance in particular for working with data in a governed environment & at scale.
"Get Started with Lakebase" Course
Data in Unity Catalog that can't be previewed
CUSTOMER STORY | Toyota uses Zerobus Ingest for real-time factory data
Genie Code severely regressed over the past 2 days — no longer behaves as before
Lakeflow Designer is now in Public Preview
Hey all, I’m a PM working on Lakeflow Designer at Databricks. Designer is our new no-code, drag-and-drop product for data prep and analytics directly in Databricks. We entered Public Preview today and I’d love to hear any thoughts/feedback. [](https://www.databricks.com/blog/announcing-public-preview-lakeflow-designer) https://i.redd.it/qdg6ywdr30xg1.gif The goal with Designer is to lower the technical barrier to entry for data work. A lot of people need to transform and work with data, but the jump straight into SQL or Python can still be pretty steep. Agents can help and often one-shot things, but then the challenge becomes validating correctness and understanding how the result was produced. We are trying to make that easier with a natural language + drag-and-drop experience where each transformation is broken into discrete visual operators. Another principle we are trying to stick to is avoiding the usual low-code trap where something is easy to build or prototype but hard to productionize. In Designer, every transformation generates production-ready Python under the hood, so there is a more direct path into production workflows. Still early, and we know there is a lot more to improve, but I’d love to hear feedback, especially from anyone who has used other low-code tools. To get started, you can go to the global New menu and click “Visual data prep.” It should be available in your workspace, though an admin may need to enable ‘Lakeflow Designer’ in the Preview portal. Here's the release blog with more details: [https://www.databricks.com/blog/announcing-public-preview-lakeflow-designer](https://www.databricks.com/blog/announcing-public-preview-lakeflow-designer)
Inquiry Regarding Discount Vouchers for Databricks Data Engineer Associate Exam
How to query batch job runs + number of rows inserted to bronze (+ updated, deleted for silver)?
We're using Databricks Autoloader (in batch mode, not streaming mode) for data ingestion of Parquet files from Azure Datalake to bronze tables, and I wonder if we need to set up a custom table to keep track of what job run had which impact on bronze table, or can I get this out of system tables somehow. Same for data loading from bronze to silver btw. Perhaps someone here has a sample query snippet?
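One hedged starting point, rather than a custom tracking table, is Delta's own commit history. A sketch (the table name is a placeholder; `operationMetrics` keys differ by operation type, e.g. appends report `numOutputRows` while MERGE reports the target-row counters):

```python
from pyspark.sql import functions as F

hist = spark.sql("DESCRIBE HISTORY bronze.my_table")

(hist.select(
        "version", "timestamp", "operation",
        "job",  # populated with job-run details when the commit was made by a job
        F.col("operationMetrics")["numOutputRows"].alias("rows_written"),         # appends / writes
        F.col("operationMetrics")["numTargetRowsInserted"].alias("rows_inserted"),# MERGE into silver
        F.col("operationMetrics")["numTargetRowsUpdated"].alias("rows_updated"),
        F.col("operationMetrics")["numTargetRowsDeleted"].alias("rows_deleted"),
     )
     .orderBy(F.col("version").desc())
     .show(truncate=False))
```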
Take Control: Customer-Managed Keys for Lakebase Postgres
Urgent: Need to Switch Exam Format from Onsite to Online Proctored Within 48 Hours
Databricks SQL Alerts V2: Can’t access QUERY_RESULT_VALUE without aggregation?
Hi, I’ve been testing the new SQL Alerts V2 (preview) in Databricks and ran into something I can’t quite explain. In V1 (legacy alerts), I’m able to use Mustache template variables like `{{QUERY_RESULT_VALUE}}` in a custom notification template without any issues. However, in V2, I can’t seem to access the original query result in the template — even though I’m not explicitly using an alert-level aggregation. From the documentation, I understand that when you *do* use an aggregation in an alert, Databricks rewrites the query by wrapping it in a CTE. For example: If the original query is: SELECT 1 AS column_name And you apply a SUM aggregation in the alert, it becomes: WITH q AS (SELECT 1 AS column_name) SELECT SUM(column_name) FROM q; Because of this transformation, the original (pre-aggregated) query result is no longer available to template variables like `QUERY_RESULT_ROWS` or `QUERY_RESULT_VALUE`. That makes sense. However, in my case I’m not using an alert aggregation — only a query-level aggregation: SELECT SUM(fare_amount) AS amount FROM samples.nyctaxi.trips Alert condition: * Trigger when **First row → amount > 4000** When I enable a custom template, I still get this warning: >"The original query result (pre-aggregated) will not be shown in an alert custom body when there is an aggregation on an alert." What’s confusing: * This exact setup works in V1 * In V2, `{{QUERY_RESULT_VALUE}}` doesn’t behave as expected * The warning about aggregation appears even though I’m only using “First row” and not applying aggregation at the alert level So my questions: * Is this a known limitation or behavior in SQL Alerts V2? * Does V2 treat query-level aggregations the same as alert-level aggregations? * Is there any workaround to access the query result in custom templates? I couldn’t find anything clearly documented about this. Thanks!
The real gap isn't connecting Claude to Databricks, it's the 3,000 tokens it costs every time you do
Posted a few days ago asking if manually copying Databricks schemas into Claude was a real pain point. Thread here: [Old post](https://www.reddit.com/r/databricks/comments/1srypxz/im_building_an_opensource_tool_that_gives_claude/) The community was right to push back. ai-dev-kit and the managed MCP already solve the connection problem. I was building something redundant. But digging into both tools after those comments, I found something nobody mentioned: **Every existing tool dumps raw JSON back to Claude.** This is what ai-dev-kit returns for a single table schema: { "table_name": "orders", "columns": [ {"name": "order_id", "type": "LongType", "nullable": false, "metadata": {}, "comment": null}, {"name": "customer_id", "type": "LongType", "nullable": true, "metadata": {}, "comment": null}, {"name": "order_date", "type": "DateType", "nullable": true, "metadata": {}, "comment": null}, {"name": "amount", "type": "DoubleType", "nullable": true, "metadata": {}, "comment": null} ], "partition_columns": ["order_date"], "storage_location": "dbfs:/user/hive/warehouse/...", "table_type": "DELTA" } ~800 tokens. For one table. Two tables + sample rows in a real session = **3,000+ tokens just for context**, before Claude writes a single line of code. If you're iterating — write, fix, optimize, test — that cost repeats every message. This is what the same schema looks like after compression: orders: order_id!bigint customer_id bigint order_date*date amount dbl status str **15 tokens. Same information Claude needs to write correct PySpark.** `!` = primary key. `*` = partition key. Types shortened. Storage paths, nullability metadata, comments — all stripped. Claude never uses any of that for code generation anyway. **What I'm thinking of building:** A thin middleware layer. Not a new MCP server — just a compressor that sits on top of whatever you already use (ai-dev-kit, managed MCP, anything). Intercepts the raw schema response, strips the noise, returns the compressed format. No new auth. No YAML config. No PAT tokens. You keep your existing setup. This just makes each tool call 84% cheaper in tokens. **One honest question before I build it:** Does token bloat from schema fetches actually affect you day to day? Or are you on an API/enterprise plan where token cost isn't something you think about? If most people here are on enterprise plans where this doesn't register, I should know that now rather than after building it.
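For what it's worth, the proposed compression is only a few lines of plain Python. A purely illustrative sketch of the format described above (the type abbreviations and markers follow the post's convention, not any existing library):

```python
# Illustrative compressor: raw schema dict in, compact single line out.
TYPE_ABBREV = {
    "LongType": "bigint", "IntegerType": "int", "DoubleType": "dbl",
    "StringType": "str", "DateType": "date", "TimestampType": "ts",
}

def compress_schema(schema: dict, primary_keys: frozenset = frozenset()) -> str:
    partitions = set(schema.get("partition_columns", []))
    parts = []
    for col in schema["columns"]:
        name, typ = col["name"], TYPE_ABBREV.get(col["type"], col["type"])
        marker = "!" if name in primary_keys else "*" if name in partitions else ""
        parts.append(f"{name}{marker}{typ}" if marker else f"{name} {typ}")
    return f"{schema['table_name']}: " + " ".join(parts)

# compress_schema(raw_schema, primary_keys=frozenset({"order_id"}))
# -> "orders: order_id!bigint customer_id bigint order_date*date amount dbl"
```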
Stop Refreshing. Start Querying.
Are LLM agents good at join order optimization?
Everyone’s excited about LLM agents replacing “traditional systems”, but what if we pointed them at one of the hardest classical problems in data engineering - SQL join order optimization? A new blog from Databricks explores exactly that - and the results are both surprising and humbling. Traditionally, query optimizers rely on decades of research, heuristics, and cost models to decide the best join order (because getting it wrong can absolutely destroy performance). So naturally, the question is - **can LLM agents actually do better**? The answer: sometimes yes, but not in the way you might expect. The research shows that LLM agents can: - Explore join strategies beyond fixed heuristics - Adapt to specific queries and datasets - Even outperform built-in optimizers in certain scenarios (~1.3× improvements reported) LLMs are great at: - reasoning over complex search spaces - generating candidate plans - adapting dynamically While traditional optimizers are still unmatched in: - consistency - guarantees - efficiency at scale So, the future isn’t “LLMs vs systems” - it’s hybrid systems. [Are LLM agents good at join order optimization? | Databricks Blog](https://www.databricks.com/blog/are-llm-agents-good-join-order-optimization)
Bug with development mode
Hi, so basically I like `mode: development` with Databricks Asset Bundles because I can specify a cluster_id, it removes schedulers, it allows concurrent runs: just a faster development loop iteration. But the problem is that I develop code from my local IDE, I change something in the ML pipeline, I deploy the wheel, which holds my data and dependencies, and it installs the wheel globally on the all-purpose cluster. That is a huge problem, because if some error occurs, I need to change it again, redeploy, run it on the all-purpose cluster and verify it's ok, BUT since the wheel was already installed globally on the first deployment, it will be ignored on the second deployment, and the job uses the old code. That's very bad.
🌟 Community Pulse: Your Weekly Roundup! April 13 – 19, 2026
Take Control: Customer-Managed Keys for Lakebase Postgres
Unable to View Tables While Setting Up PostgreSQL CDC via Lakeflow Connect
Best practices for Apache Spark optimization in 2026?
The number of Spark pipelines has been growing across teams, all running on shared infrastructure. What stands out is how optimization starts to drift. Some jobs are well tuned, others aren't, even when they follow similar patterns. It shows up in runtime and resource usage. There are guidelines around partitioning, caching and configs, but each team applies them differently. Ownership plays a role, so does experience. As more pipelines are added, keeping Apache Spark optimization consistent without centralizing everything gets harder. How are teams keeping optimization aligned across pipelines without slowing them down?
LakeFlow Designer: No Code ETL + AI😯(This feature is in Public Preview).
Access flows between Genie and Copilot Studio.
How to Connect Genie to a Copilot Agent in Copilot Studio: A Complete Guide
Platform to practice databricks associate data engineer exam
Too Many Tools Can Slow Good Data Teams Down
Data Loss in Incremental Batch Jobs Due to Latency in delta file write to blob
How Deep clone works
Handling New Columns Using Auto Loader Rescue Mode, but how do I get the newly added column?
Best practices for using autoloader
Passing Score for Databricks associate data engineer
Lakebase: Your Only Guide - What Databricks Users Need to Know
Jobs & Pipelines: is it possible for "Run parameters" to display a value generated in code?
Introducing Genie Agent Mode
Cannot enable GPU for serving endpoint
Solution Accelerator Series | Propensity Scoring
PVS points
Does Lakeflow Connect Have Any Change Tracking Diagnostics?
Table update trigger and File Arrival trigger latency
Steps to become a Databricks Consultant.
How to update alias for catalogs
Implementing a naming convention
Data Engineer Associate Exam Guide May 2025 - Recommended Training
AI playground - Unable to access LLM's
Week of Apr 13
2 questions
Databricks CLI token creation fails with “cannot configure default credentials” after previously working in CI pipeline
I have been generating a Databricks personal access token in my YAML-based CI pipeline using a bash script. The pipeline installs the Databricks CLI and then creates a token using a Service Principal (Azure AD application) credentials. Current working approach (previously working) #!/bin/bash dbx_host="${1}" dbx_client_id="${2}" dbx_client_secret="${3}" # Set the Environment Variables for Databricks authentication export DATABRICKS_HOST=$dbx_host export DATABRICKS_CLIENT_ID=$dbx_client_id export DATABRICKS_CLIENT_SECRET=$dbx_client_secret echo "Creating a new Databricks token" response=$(databricks tokens create \ --lifetime-seconds 31536000 \ --comment "Token for SPN for EDH Data Access. Validity 1 year.") echo "Token Created Successfully" token=$(echo $response | jq -r '.token_value') token_id=$(echo $response | jq -r '.token_info.token_id') expiry_time=$(echo $response | jq -r '.token_info.expiry_time') This used to work fine for generating tokens. Issue Recently, the same pipeline started failing with the following error: Error: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method. Config: host=https://***, account_id=***, workspace_id=***, profile=DEFAULT, azure_tenant_id=***, client_id=***, client_secret=*** Env: DATABRICKS_HOST, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET The documentation link provided in the error message does not really help in identifying what exactly needs to be changed or how to fix this specific CI/CD use case. Has there been a recent change in Databricks CLI authentication (especially unified authentication) that breaks Service Principal authentication using DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET environment variables? Any guidance or migration steps would be appreciated. UPDATE:Added Tenant ID […truncated]
Connecting to Databricks API (hosted LLM in model-serving) via PAT
I'm running code in my local IDE that connects to Databricks's API to pass text into an LLM that is hosted on Databricks. I'm using Personal Access Tokens to get up and running quickly. I'm able to get it working when I add "all scopes" to the PAT, but that is WAAY too much access, and I want to give it just the right access for the task it needs. BUT, I can't figure out which scopes are actually needed. Additional Context: The app is written in Python. It retrieves text from the open internet and then formats it. I want to use an LLM to summarize the content. I would prefer to use an LLM that is hosted on Databricks, via the Databricks API (as opposed to, for instance, using the OpenAI API or Claude API) because I want to use multiple APIs and evaluate them, which Databricks allows. Some of the things I have tried. Databricks lists what the different scopes are and what they do here: mlflow and model-serving clusters, command-execution, custom-llms, dashboards, dataclassification, dataquality, environments, files, forecasting, genie, global-init-scripts, instance-pools, instance-profiles, jobs, knowledge-assistants, libraries, mlflow, model-serving, notifications, pipelines apps, clusters, custom-llms, dataclassification, files, genie, global-init-scripts, jobs, knowledge-assistants, marketplace, mlflow, model-serving, secrets, sql, unity-catalog, workspace Initially, I used the OpenAI SDK (as recommended by an LLM), and then switched to the Databricks SDK (because I had hoped that would resolve my issues). Currently, my requirements.txt is: python-dotenv>=1.0.0 openai>=1.0.0 databricks-sdk>=0.49.0
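For reference, the call itself is small. A sketch of hitting a Databricks-hosted model through the OpenAI-compatible serving endpoint with a PAT (the workspace host and endpoint name are placeholders, and the guess that the `model-serving` scope family from the list above is the relevant one for narrowing the token is exactly that, a guess):

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],                 # the PAT, ideally narrowly scoped
    base_url="https://<workspace-host>/serving-endpoints",  # placeholder workspace URL
)

resp = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",         # example serving endpoint name
    messages=[{"role": "user", "content": "Summarize this article: ..."}],
)
print(resp.choices[0].message.content)
```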
Week of Mar 30
1 question
Week of Mar 23
2 questions
Databricks Truncates Rows Greater Than 64k in SQL
The SQL query should give a result of 100,000 rows, but the output is truncated and only displays 64,000 rows. How do I correct this to show all 100,000 rows?
How to fix a cast invalid input error in Spark?
I'm trying to create & display a spark dataframe in Databricks. This is what my code looks like: df = spark.sql(f''' SELECT A.CUSTOMER_ID FROM TABLE_1 A INNER JOIN TABLE_2 B ON A.PRODUCT_ID = B.PRODUCT_ID WHERE A.ORDER_DATE BETWEEN DATE_SUB(CURRENT_DATE, 182) AND CURRENT_DATE AND A.CUSTOMER_ID IS NOT NULL AND B.ITEM_STATUS IN ('Live', 'Active') AND B.ITEM_CODE IN (2009, 2012) GROUP BY A.CUSTOMER_ID ''') df.display() It returns the following error when run as above: [ CAST_INVALID_INPUT ] The value 'UNKNOWN' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. I understand that this means spark is performing an implicit cast on a string column, and it's failing because that column contains the value 'UNKNOWN', which can't be cast to bigint format. So I've tried removing the value 'UNKNOWN' from every string column my query uses: df = spark.sql(f''' SELECT A.CUSTOMER_ID FROM ( SELECT * FROM TABLE_1 WHERE UPPER(PRODUCT_ID) <> 'UNKNOWN' AND UPPER(ORDER_DATE) <> 'UNKNOWN' AND UPPER(CUSTOMER_ID) <> 'UNKNOWN' ) A INNER JOIN ( SELECT * FROM TABLE_2 WHERE UPPER(PRODUCT_ID) <> 'UNKNOWN' AND UPPER(ITEM_STATUS) <> 'UNKNOWN' AND UPPER(ITEM_CODE) <> 'UNKNOWN' ) B ON A.PRODUCT_ID = B.PRODUCT_ID WHERE A.ORDER_DATE BETWEEN DATE_SUB(CURRENT_DATE, 182) AND CURRENT_DATE AND A.CUSTOMER_ID IS NOT NULL AND B.ITEM_STATUS IN ('Live', 'Active') AND B.ITEM_CODE IN (2009, 2012) GROUP BY A.CUSTOMER_ID ''') df.display() However, I still get the same error message. I don't understand how this is possible since the unknown values have been removed. Can anyone help explain what's happening and how to resolve the error please?
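One hedged possibility, given the query above: the implicit cast usually comes from comparing a string column against integer literals (here `ITEM_CODE IN (2009, 2012)`), and filtering 'UNKNOWN' out in subqueries doesn't help because Spark can reorder predicates. A sketch that avoids the cast entirely, assuming `ITEM_CODE` is stored as a string:

```python
df = spark.sql("""
    SELECT A.CUSTOMER_ID
    FROM TABLE_1 A
    INNER JOIN TABLE_2 B
        ON A.PRODUCT_ID = B.PRODUCT_ID
    WHERE A.ORDER_DATE BETWEEN DATE_SUB(CURRENT_DATE, 182) AND CURRENT_DATE
      AND A.CUSTOMER_ID IS NOT NULL
      AND B.ITEM_STATUS IN ('Live', 'Active')
      AND B.ITEM_CODE IN ('2009', '2012')  -- compare as strings: no implicit cast to BIGINT
      -- alternative: TRY_CAST(B.ITEM_CODE AS BIGINT) IN (2009, 2012) returns NULL for 'UNKNOWN'
    GROUP BY A.CUSTOMER_ID
""")
df.display()
```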
Week of Mar 16
5 questions
Grafana OSS with Databricks
I am new to grafana and would like to understand whether there are any limitations while using grafana oss (free tier) with databricks? Appreciate your inputs here. :-) Thanks!
Xtract Universal and Azure Databricks
I'm new here. A pleasure to meet you. So, my question is whether anybody has advice on, or has had to do, engineering with the tools Xtract Universal and Databricks/Azure Databricks, because I'm on a project where the customer is asking for a direct connection between the replicator tool and the destination (without services in the middle), and in the documentation for both services there's no info available supporting this type of workflow. The customer is asking for it because they have been told by the consultants from Theobald that it connects directly between both, but they didn't provide any type of knowledge base or anything to function as a guide for this. I asked the AI but it didn't give any reliable info to work with. Any advice is appreciated. Thank you in advance.
Databricks table quota exceeded
I use ADF to orchestrate data extraction and load to managed databricks tables. It extracts to file storage and that part works fine. It's the next part, where it registers the external table by pointing it at the storage file, where it fails. This morning I started seeing quota exceeded issues. So, after extract, it just creates a table using: drop table if exists ${catalogName}.landing.$TableName; create table ${catalogName}.landing.$TableName USING PARQUET OPTIONS ( path 'abfss://unity-catalog@$DatalakeName.dfs.core.windows.net/landing/$TableName/$TableName.parquet' ); The final step (it does not get that far) is to copy this into a managed table. The error it returns is: UnityCatalogServiceException: [ErrorClass=QUOTA_EXCEEDED.UC_RESOURCE_QUOTA_EXCEEDED] Cannot create 1 Table(s) in Metastore (estimated count: 1000728, limit: 1000000). It makes no sense though, as we only have around 2000 tables. So, a couple of questions: What would cause this? Have I misinterpreted the definition of databricks table counts and it's not 1:1? Are temp and deleted (historic) tables also included - so maybe a lot of history that needs tidying here? Thanks
How can I speed up Pyspark unit tests?
I have code that I execute with Spark in a Databricks cluster and wrote unit tests to cover all logic including the PySpark code. I searched here , here , and official documentation for testing with PySpark . I made a fixture with an as slim as possible Spark session based on these articles: @pytest.fixture(scope="session") def spark(): """Minimal session-scoped SparkSession.""" spark = ( SparkSession.builder.master("local[*]") .appName("pytest-pyspark") .config("spark.sql.shuffle.partitions", "1") .config("spark.default.parallelism", "1") .config("spark.driver.extraJavaOptions", "-Djava.net.preferIPv4Stack=true") .config("spark.ui.enabled", "false") .config("spark.ui.showConsoleProgress", "false") .config("spark.sql.ui.retainedExecutions", "0") .config("spark.sql.catalogImplementation", "in-memory") .config("spark.driver.host", "127.0.0.1") .config("spark.driver.bindAddress", "127.0.0.1") .config("spark.sql.adaptive.enabled", "false") .config("spark.sql.execution.arrow.pyspark.enabled", "true") .getOrCreate() ) yield spark spark.stop() But every test that uses the Spark session takes 3-5min! The test data is a two rows 5 column dataframe so that is not the issue. How do I run unit tests with small datasets faster? It is slow both locally and in the CI/CD pipeline.
Databricks run SQL query in python without spark
I am trying to run a big Databricks query with a lot of CTEs, etc., but I do not really want to run it in Spark. Some parts of the query that work on the normal SQL warehouse do not work on Spark. I am generating parts of the query dynamically with Python. Previously I was running it in the SQL warehouse and the query was really fast, but due to some requirements I had to run the script generation in Python, since I needed to add parts manually. Now I have to run it in Spark, but Spark is not suited for this type of query, and I was wondering if there is a smart way to do this? E.g. our cluster does not allow self joins for some tables, etc. The GPTs recommended using the databricks-sql-connector, but that also seems weird to me. What is generally the best practice here?
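For completeness, a minimal sketch of the databricks-sql-connector route the GPTs suggested: it submits the generated SQL string to a SQL warehouse rather than to the Spark cluster, so warehouse-only behaviour is preserved (hostname, HTTP path and token below are placeholders):

```python
from databricks import sql

query = "WITH cte AS (SELECT 1 AS x) SELECT * FROM cte"  # the dynamically generated SQL string

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="<token>",                                        # placeholder
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(query)
        rows = cursor.fetchall()

print(rows)
```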
Week of Mar 9
1 question
Week of Mar 2
1 question
Week of Feb 23
3 questions
Retrieve job metadata like job run id and name in a databricks job run
We are using databricks to execute our code. I am trying to make logs that are stored in a table. Amongst other things I also want the job run id and the job/task name so I can go back and check the job based on the logs and vice versa. Does databricks offer this info inside a job run? I found an example where this info is retrieved inside a notebook but I can find anything for databricks. Example for notebooks: # Databricks notebook source from pyspark.sql.types import ( IntegerType, StructField, StructType, TimestampType, StringType ) from pyspark.dbutils import DBUtils import json from delta import DeltaTable import datetime as dt from datetime import datetime dbutils.widgets.dropdown("env", "dev", ["dev", "prod"]) env = dbutils.widgets.get("env") print('env: ',env) def save_jobs_log(log_data, job_log_dir): job_schema = StructType( [ StructField("job_log_id", StringType()), StructField("run_id", StringType()), StructField("job_name", StringType()), StructField("notebookId", StringType()), StructField("user", StringType()), StructField("clusterId", StringType()), StructField("jobParametersCount", StringType()), StructField("startTimestamp", StringType()), StructField("taskKey", StringType()), StructField("operation", StringType()), StructField("target_table", StringType()), StructField("updated_rows", IntegerType()), StructField("processed_ts", TimestampType()), ] ) if not DeltaTable.isDeltaTable(spark, job_log_dir): df = spark.createDataFrame([], schema=job_schema) df.write.format("delta").option("overwriteSchema", "True").mode("append").save(job_log_dir) df = spark.createDataFrame(log_data, schema=job_schema) df.write.format("delta").mode("append").save(job_log_dir) def store_job_logs(df, operation, target_table, job_log_dir): […truncated]
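Two hedged sketches for pulling run metadata inside a job (the parameter names below are placeholders; both the dynamic value references and the context-tags trick should be verified against your runtime):

```python
# (a) Pass dynamic value references as job/task parameters in the job definition,
#     e.g. {"job_run_id": "{{job.run_id}}", "task_name": "{{task.name}}"},
#     then read them in the notebook like any other parameter.
job_run_id = dbutils.widgets.get("job_run_id")
task_name = dbutils.widgets.get("task_name")

# (b) Undocumented but widely used: inspect the notebook context tags at runtime.
import json
ctx = json.loads(
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
)
tags = ctx.get("tags", {})
print(tags.get("jobId"), tags.get("runId"))  # available key names can vary by run type
```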
How to transform DESC into a table with JSON column?
I am using the command DESC EXTENDED with success, and casting it to JSON, for example: DESC EXTENDED mycad.mysch.mytab1 AS JSON; But I need this information for data governance of many tables, in particular the Statistics parameter and the table name. So something like CREATE VIEW mycad.mygovsch.report1 AS (DESC EXTENDED mycad.mysch.mytab1 AS JSON) AS descext UNION ALL (DESC EXTENDED mycad.mysch.mytab2 AS JSON) UNION ALL (DESC EXTENDED mycad.mysch.mytab3 AS JSON) ; ... but, of course, this is not valid syntax: how can I express this kind of query in Databricks? NOTES: I tried many variations, VALUES ('myDesc', (DESCRIBE EXTENDED mycad.mysch.mytab1 AS JSON)); SELECT (DESCRIBE EXTENDED mycad.mysch.mytab1 AS JSON) as myDesc; ...
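DESCRIBE can't be wrapped in a view, but it can be fanned out from Python and the results persisted as a table. A hedged sketch reusing the catalog/schema names from the question:

```python
from functools import reduce
from pyspark.sql import functions as F

tables = ["mycad.mysch.mytab1", "mycad.mysch.mytab2", "mycad.mysch.mytab3"]

# Run DESCRIBE ... AS JSON per table and tag each result with its table name.
dfs = [
    spark.sql(f"DESCRIBE EXTENDED {t} AS JSON").withColumn("table_name", F.lit(t))
    for t in tables
]
report = reduce(lambda a, b: a.unionByName(b), dfs)

# Persist for governance reporting (a table rather than a view, since views can't wrap DESCRIBE).
report.write.mode("overwrite").saveAsTable("mycad.mygovsch.report1")
```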
Configure simba odbc driver DSN set up via system DSN
Can anyone help me configure the Simba ODBC driver via a system DSN? I have to integrate Tosca with Databricks, and for this I have to set up an ODBC connection. This is where I'm stuck, and my scripting is blocked because of it.
Week of Feb 16
1 question
Week of Feb 9
1 question
Week of Feb 2
2 questions
ETL Migration to Databricks via LLM Transpilation
I have several ETL jobs from DataStage in .dsx format. I use a PowerShell script to automatically run the migration for a larger number of files. And one job migrates successfully, while another similar one no longer migrates. My code: $jobsPath = "C:\lakebridge_test\jobs" Get-ChildItem $jobsPath -Filter *.dsx | ForEach-Object { $inputFile = $_.FullName $outputPath = "/Workspace/Users/xxx/lakebridge_out/$($_.BaseName)" $psi = New-Object System.Diagnostics.ProcessStartInfo $psi.FileName = "C:\Users\xxx\Desktop\xxx\lakebridge\databricks_cli_0.258.0_windows_amd64\databricks.exe" $psi.Arguments = @( "labs lakebridge llm-transpile", "--input-source `"$inputFile`"", "--output-ws-folder `"$outputPath`"", "--volume lakebridge_vol", "--catalog-name uc-test", "--schema-name lakebridge_test", "--source-dialect unknown_etl", "--accept-terms=true", "--profile lakebridge" ) -join " " $psi.RedirectStandardInput = $true $psi.RedirectStandardOutput = $true $psi.RedirectStandardError = $true $psi.UseShellExecute = $false $psi.CreateNoWindow = $true $process = [System.Diagnostics.Process]::Start($psi) $process.StandardInput.WriteLine("0") $process.StandardInput.Close() $stdout = $process.StandardOutput.ReadToEnd() $stderr = $process.StandardError.ReadToEnd() $process.WaitForExit() Write-Host "==== $($_.Name) ====" Write-Host $stdout if ($process.ExitCode -ne 0) { Write-Error "FAIL: $($_.Name)" Write-Error $stderr } } It automatically selects: Select a Foundation Model serving endpoint: [0] [Recommended] databricks-claude-sonnet-4-5 and starts the migration process to Databricks. A job consisting of a dataset and a Db2 connector migrates correctly, but a job with dataset → Transformer Stage → Db2 connector fails and returns: Exception: No records found for conversion. Please check if there are any records wit […truncated]
CONVERT TO DELTA fails to merge file schema
This is in Azure Databricks. I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this: CONVERT TO DELTA parquet.`abfss://container@storage_account.dfs.core.windows.net/directory_name`; But it throws this error: SparkException: [DELTA_FAILED_MERGE_SCHEMA_FILE] Failed to merge schema of file abfss://container@storage_account.dfs.core.windows.net/directory_name/file_name_123.parquet: [...] I ran this in an all-purpose cluster with the spark.databricks.delta.mergeSchema.enabled config set to true.
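If the Parquet files really do have diverging (but compatible) schemas, one hedged fallback is to skip the in-place conversion and rewrite the data instead, which costs a full read/write but lets Spark union the file schemas (paths reuse the question's placeholders):

```python
src = "abfss://container@storage_account.dfs.core.windows.net/directory_name"
dst = "abfss://container@storage_account.dfs.core.windows.net/directory_name_delta"

(spark.read
      .option("mergeSchema", "true")   # union column sets across files; truly conflicting types still fail
      .parquet(src)
      .write
      .format("delta")
      .mode("overwrite")
      .save(dst))
```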
Week of Jan 26
1 question
Week of Jan 19
2 questions
Is there a way to bypass hashes in the Databricks CLI when installing packages?
I’m trying to install Databricks Lakebridge to migrate ETL to Databricks. I followed the steps according to the documentation and got stuck at the transpile installation. As prerequisites, Python version 3.10 or higher, Java ≥ 11, and the Databricks CLI are required. In the next step, after the initial profile configuration (with the connection details defined to connect to Databricks using a token), I installed Lakebridge. There was also an issue with hashes there, but clearing the cache was sufficient. Commands below: C:\Users\xxx>py -3.10 -m pip cache purge Files removed: 276 (50.2 MB) C:\Users\xxx>pip cache purge WARNING: No matching packages Files removed: 0 C:\Users\xxx>set PIP_NO_BUILD_ISOLATION=1 C:\Users\xxx>cd Desktop\xxx\lakebridge\databricks_cli_0.258.0_windows_amd64 C:\Users\xxx\Desktop\xxx\lakebridge\databricks_cli_0.258.0_windows_amd64>databricks labs install lakebridge --profile lakebridge 13:47:15 INFO [src/databricks/labs/lakebridge] Successfully Setup Lakebridge Components Locally 13:47:15 INFO [src/databricks/labs/lakebridge] For more information, please visit https://databrickslabs.github.io/lakebridge/ In the next step, according to the documentation, it is required to install Transpile because I need to migrate DataStage jobs to Databricks, but in this case clearing the cache didn’t help anymore and I’m getting what’s shown below in the code: C:\Users\xxx\Desktop\xxx\lakebridge\databricks_cli_0.258.0_windows_amd64>databricks labs lakebridge install-transpile Looking in links: c:\Users\xxx\AppData\Local\Temp\tmppwi__0cz Processing c:\users\xxx\appdata\local\temp\tmppwi__0cz\setuptools-58.1.0-py3-none-any.whl Processing c:\users\xxx\appdata\local\temp\tmppwi__0cz\pip-22.0.4-py3-none-any.whl Installing collected packages: setuptools, pip Successfully installed pip-22.0.4 setuptools-58.1.0 Collecting databricks-bb-plugin Downloading databricks_bb_plugin-0.1.24-py3-none-any.whl (9.5 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0 […truncated]
Ignore merging republished data to final table while using DLT
I am new to DLT and trying to implement a use case where a republished record should not get merged into my curate layer, as the record is already present with me in its latest form. I tried reading the curate table in my DLT and applying filtering on the raw data, but it leads to a circular dependency and is not working. Is there any way to tell the APPLY CHANGES INTO API to ignore the record if the same id with sequence keys is already present in the target dataset?
Week of Jan 12
3 questions
Parallelizing REST-API requests in Databricks
I have a list of IDs and want to make a get request to a REST-API for each of the ids and save the results in a dataframe. If I loop over the list it takes far too long so I tried to parallelize using ThreadPoolExecutor which reduced the execution time significantly. But then I read about pandas udfs and rdds and wondered if I could improve my approach even further. Since I have never really worked with either I cannot tell which approach is the best for my use case. The approaches I thought about were rdds, a pandas udf which takes the id column as a Pandas Series as Input and returns a Pandas Series of the resulting JSONs and a Pandas udf which takes the iterator of the Pandas Series as the input (what exactly is the difference between using iterator and series?). Or is it possible to use the whole dataframe as Input for the Pandas UDF and return the desired outcome df? Does anyone know what the best practice for my use case would be and could go a little bit into detail about the approaches?
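A hedged sketch of the Series-to-Series pandas UDF variant: Spark hands each executor a batch of IDs as a pandas Series, and the ThreadPoolExecutor can stay inside the batch so the I/O concurrency is kept. The endpoint URL is a placeholder:

```python
import pandas as pd
import requests
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def fetch_json(ids: pd.Series) -> pd.Series:
    def call(one_id):
        resp = requests.get(f"https://api.example.com/items/{one_id}", timeout=30)
        return resp.text
    # Keep I/O-bound concurrency within each batch on the executor.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return pd.Series(list(pool.map(call, ids)))

ids_df = spark.createDataFrame([(i,) for i in range(1000)], ["id"])
result = ids_df.repartition(8).withColumn("response", fetch_json("id"))
```

The Iterator[pd.Series] flavour mainly helps when there is expensive per-batch setup (e.g. an authenticated session) to reuse across batches; for plain GET requests the Series-to-Series form is usually enough.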
How to display a greater number of completed jobs in Databrick's Spark UI?
To improve the performance of a Databricks workflow, I need to analyse the Spark UI. However, my workflow has 1295 jobs and the Spark UI on Databricks only shows 904 jobs, as you can see in the following image: Is there any configuration on Databricks that allows displaying more jobs?
Ensure two queries in a Spark declarative pipeline process the same rows when using the availableNow trigger
I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data that was available at the time of the trigger. I have an external Delta live table that I access via Delta Sharing. This table is continuously updated with new events. Some of the events indicate a state change in the entities generating them. I want to enrich all events with the current state. So what I do is: Use auto CDC to create a type 2 slowly changing dimension table tracking the state history. Elsewhere in my pipeline, I join the state from the dimension table with each event. The pipeline runs in triggered mode. Because the pipeline understands the dependencies between tables, the query for (1) completes before the query for (2) starts. This is good. The problem comes because while the query for (1) was running using the availableNow trigger, new events are being added to the external table. So when the query for (2) starts, it sets its own end offset for availableNow , which means the update includes state change events that are not reflected in the state dimension table! I'm effectively using an out of date version of the dimension table at this point. Really I need both these queries to operate on identical rows, one after another. The cheapest thing to do would be to use the same end offset for the availableNow trigger for both queries. Is there a way to achieve this with Spark declarative pipelines? Or failing that, is there a way to make two Spark streaming reads in a regular job use the same end offset? I'm aware I could insert a buffer table to store a consistent and unchanging copy of the input events during the run time of the pipeline trigger, but managing storage and retention for such a table is expensive and annoying. So I am looking for other solutions. An illustration: Incoming data: t event 1 entity 1 has changed to state A 2 entity 1 has even […truncated]
Week of Dec 22, 2025
2 questions
OLE DB or ODBC error: [DataSource.Error] ERROR [HY000] [Microsoft][Hardy] (35)
I get this error when I'm trying to switch my Databricks dataset from DirectQuery to Import. Do you know where this came from? Error message: I'm trying to modify the data sources in order to be able to delete duplicates, because when I do it via Transform data --> Home --> Remove rows --> Remove duplicates I get this error: DataSource.Error : ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: '[Microsoft][Hardy] (134) File 54864cbf-d4be-49e3-8f1a-18c3a022e370: A retriable error occurred while attempting to download a result file from the cloud store but the retry limit had been exceeded. Error Detail: File 54864cbf-d4be-49e3-8f1a-18c3a022e370: An error had occurred while attempting to download the result file.The error is: [Microsoft][DSI] An error occurred while attempting to retrieve the error message for key 'DSSslErrorMessage' with message parameters [' -The revocation status of the certificate or one of the certificates in the certificate chain is unknown. -The certificate chain is not complete. -The revocation status of the certificate or one of the certificates in the certificate chain is either offline or stale.'] and component ID 110: Could not open error message files - Check that "C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Spark ODBC Driver\en-US\DSMessages.xml" or "C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Spark ODBC Driver\DSMessages_en-US.xml" exists and are accessible.. Since the connection has been configured to consider all result file download as retriable errors, we will attempt to retry the download.'. Thanks a lot for your help!
Databricks Notebook debugger says I'm not attached to a cluster, but I am
Environment Azure-Databricks Cluster version 13.3 (see small screenshot) A cell in a Notebook with some breakpoints. Problem When I want to debug that cell ("Debug Cell" option), that option is greyed out with the remark: "The debugger is only available after attaching to a cluster" But as you can see in the screenshot, I am attached (green bullet before the name). I even restarted the Cluster, but with the same result ('greyed-out' debug-cell option) What is going wrong here? Thanks in advance for your help.
Week of Dec 15, 2025
3 questions
HOW TO: [DBT Local] Elementary Anomaly Tests Custom Thresholds
Elementary is a tool that focuses on data quality for dbt models. One of the data quality tests it has is anomaly detection: https://docs.elementary-data.com/data-tests/how-anomaly-detection-works I am implementing freshness anomaly tests for my model: https://docs.elementary-data.com/data-tests/anomaly-detection-tests/freshness-anomalies , and wanted to add custom thresholds so I can know when the test result is a warning vs. an error (by default, it will all be considered warnings). Below is what I tried: 1.1 and 2.8 are values I found by: looking at the past anomaly scores for the table from a table called 'metrics_anomaly_score' that is generated while dbt test runs; from the past anomaly scores, find the 85th percentile which is 1.1 and the 98th percentile which is 2.8, and use them as the warn vs. error thresholds. tests: - elementary.freshness_anomalies: timestamp_column: "CAST(xxxx AS TIMESTAMP)" time_bucket: period: day count: 1 tags: ["elementary", "anomaly"] config: severity: "error" warn_if: ">1.1" error_if: ">2.8" However, after running the dbt job in databricks I see this in the log: 14:40:16 Failure in test elementary_source_freshness_anomalies_xxxx_CAST_xxxx_at_AS_TIMESTAMP_ (models/sources/XXX/XXX.yml) 14:40:16 Got 3 results, configured to fail if >2.8 I analyzed and noticed that it is not using 2.8 as an anomaly score; rather, it is using it as the number of anomalies the test found from all buckets. For example, this test splits the data into multiple buckets and decides if each bucket is an anomaly; among all the buckets, if there are more than 2.8 buckets that are anomalies, it will fail the test -- this is how it interprets the threshold, which is not what I wanted. (but let me know if this is better than the percentile method I was thinking of initially) So I want to ask the experienced dbt developers, have you run into this situation where you need to define a threshold for your elementary anomaly tests to […truncated]
What are the validation checks that are made in order to push data to the reject folder in medallion architecture?
In my dataset, I noticed that the actual data type of a column differs from the expected data type. In this situation, should the data be type-cast during processing, or should such records be moved to a Reject folder in the Bronze layer? Could someone explain the scenarios in which data should be pushed to the Reject folder in a Medallion Architecture–based pipeline?
When should data go to Archive vs Reject in Bronze layer (Medallion Architecture)?
Can anybody help with understanding the Archive and Reject folders in the bronze layer of the Medallion Architecture? Let's say I have 4 folders in Bronze, namely Raw, Stage, Archive and Reject. To what extent can data be pushed to the Reject folder? Suppose data is pushed to the Reject folder; then, for the next stage, which folder should I use the data from?
Week of Dec 8, 2025
6 questions
How to stop Databricks retaining widget selection between runs?
I have a Python notebook in Databricks. Within it I have a multiselect widget, which is defined like this: widget_values = spark.sql(f''' SELECT my_column FROM my_table GROUP BY my_column ORDER BY my_column ''') widget_values = widget_values.collect() widget_values = [i[0] for i in widget_values] if len(widget_values) >0: dbutils.widgets.multiselect("My widget", widget_values[0], widget_values) selection = dbutils.widgets.get("My widget") else: print("No data in my_table") I want the variable 'selection' to be blank if no values are selected in the widget. However, once a value or values has been selected, it gets retained until another selection overwrites it. For example, say I select the value 'A' in the widget and run the notebook. Selection now = 'A'. I then untick 'A' in the widget and run the notebook again. Selection should = '', but it still = 'A'. This happens even if I delete the value 'A' from my_table between the two runs. I have tried adding this code to the line before dbutils.widgets.get, but it doesn't do anything: selection = '' I have also tried starting a new session on the compute I'm using to run the notebook. If I do this, no variables appear in the 'variables' pane until I run the notebook again. But if I then run the notebook with no values selected in the widget, suddenly selection = 'A' again. Does anyone know why Databricks is retaining the selection like this and how I can stop it please?
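A hedged sketch of one workaround: widget state is stored with the notebook, so removing the widget before redefining it stops the previous run's selection from carrying over (the trade-off being that the widget comes back with its declared default rather than blank):

```python
try:
    dbutils.widgets.remove("My widget")   # or dbutils.widgets.removeAll()
except Exception:
    pass  # nothing to remove on the very first run

rows = spark.sql(
    "SELECT my_column FROM my_table GROUP BY my_column ORDER BY my_column"
).collect()
widget_values = [r[0] for r in rows]

if widget_values:
    # Recreated fresh each run, so a stale 'A' from a previous run can no longer persist.
    dbutils.widgets.multiselect("My widget", widget_values[0], widget_values)
    selection = dbutils.widgets.get("My widget")
else:
    selection = ""
```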
Spark (Databricks) fails to read SPSS .sav files extracted from ZIP
I’m reading various file types in Databricks using Spark — including PDF , DOCX , PPTX , XLSX , and CSV . Some inputs are ZIP archives that contain multiple files, including SPSS .sav files. My workflow is: Detect if the input file is a ZIP. Extract it to a temporary directory. Iterate through the extracted files. Based on each file’s extension, call the appropriate reader function. Everything works correctly except for the .sav files. I use code that normally reads SPSS files, def dedup_columns(cols): seen = {} new_cols = [] for col in cols: if col not in seen: seen[col] = 0 new_cols.append(col) else: seen[col] += 1 new_cols.append(f"{col}.{seen[col]}") return new_cols def extract_sav(sav_file_path: str, update_path: bool): """ Extract text from a .csv file stored in GCS """ chunks = [] # Read .sav file into pandas DataFrame # df_pd, meta = pyreadstat.read_sav(sav_file_path) try: result = pyreadstat.read_sav(sav_file_path, apply_value_labels=True) df_pd = result[0] meta = result[1] # Optional: labels = result[2] if it exists except Exception as e: print(f"Error reading {sav_file_path}: {e}") return chunks df_pd.columns = dedup_columns(df_pd.columns) for idx, row in df_pd.itertuples(index=True): values = list(row[1:]) # skip index text = " ".join("" if v is None else str(v) for v in values).strip() if text: chunks.append((sav_file_path, "sav", idx, text)) return chunks but in this case Spark fails and only logs an error. I verified that the .sav files exist after extraction and their sizes are correct, so the files themselves seem fine. def check_file_size(temp_file_path: str): file_path = Path(temp_file_path) if file_path.is_file(): print(f"File {temp_file_path} exists!") print(f"File size: {file_path.stat().st_size} bytes") else: print(f"File {temp_file_path} does NOT exist!") def extract_zip_and_process(dbfs_zip_path: str): """ Extract zip […truncated]
Databricks pipeline fails when executing Python expectations code: update FAILED; _UNCLASSIFIED_PYTHON_COMMAND_ERROR
I'm working on a Databricks pipeline and trying to create and apply expectations to it. I have the code, but I keep getting an error that I cannot resolve. There is not much to go on; I keep trying different approaches, resolving all the errors, and end up with the same failure, and I don't really understand what is going wrong. I've checked whether it's a permission issue, and I have tried displaying the table, which works fine. In the pipeline view I should be able to see my expectation, but because the update does not succeed it's not showing. The error is: Update be7a33 is FAILED. Error class: _UNCLASSIFIED_PYTHON_COMMAND_ERROR

%python
from pyspark.sql.functions import col
from pyspark import pipelines as dp

@dp.table(
    name="orders",
    comment="Orders table with data quality constraints"
)
@dp.expect_all_or_fail(
    "expect_table_row_count_to_be_between", "COUNT(*) > 100",
    "customer_id_not_null", "customer_id IS NOT NULL",
    "expect_column_values_to_be_in_set", "currency IN ('USD', 'EUR', 'GBP')"
)
def orders():
    return dp.read("Xyntrel_bronze.bronze.orders").filter(
        col("customer_id").isNotNull()
    )

I don't understand, because the parser says the code is correct, but on execution it fails.

{
  "timestamp": "2025-12-10T09:13:32.863Z",
  "message": "Update be7a33 is FAILED.",
  "level": "ERROR",
  "error": {
    "exceptions": [
      {
        "message": "",
        "error_class": "_UNCLASSIFIED_PYTHON_COMMAND_ERROR",
        "short_message": ""
      }
    ],
    "fatal": true
  },
  "details": {
    "update_progress": {
      "state": "FAILED"
    }
  },
  "event_type": "update_progress",
  "maturity_level": "STABLE"
}
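As a point of comparison, the DLT-style expectation decorators are documented as taking a single dict that maps an expectation name to a per-row SQL constraint, rather than a flat list of positional strings; a table-level check such as COUNT(*) > 100 is not a per-row constraint and would need a separate mechanism. A hedged sketch of the dict form, assuming the pyspark.pipelines decorators mirror that signature (whether this resolves the _UNCLASSIFIED_PYTHON_COMMAND_ERROR above is an assumption):

from pyspark.sql.functions import col
from pyspark import pipelines as dp

@dp.table(
    name="orders",
    comment="Orders table with data quality constraints"
)
@dp.expect_all_or_fail(
    {
        # Each entry is name -> SQL constraint, evaluated per row.
        "customer_id_not_null": "customer_id IS NOT NULL",
        "currency_in_allowed_set": "currency IN ('USD', 'EUR', 'GBP')",
    }
)
def orders():
    return dp.read("Xyntrel_bronze.bronze.orders").filter(
        col("customer_id").isNotNull()
    )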
PHP ODBC connection to Databricks returns junk bytes
I'm using the SimbaSparkODBC driver provided by DataBricks on Windows to connect to a DataBricks instance which is running in Azure. Most of the sqls are running fine, but sometimes the result contains invalid non-UTF8 symbols which shouldn't be there. I figured out that this happens more likely when the returned text is getting bigger and I was able to reproduce this error with a simple sql. Here the example: SELECT 'aslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkdaslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkdaslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkdaslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkdaslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkdaslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkd' AS test The returned result was one row with one column with this value: aslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh saldkjh aslkdaslkdjh alskdjh lsakjdh lksajhd lksajh dlkajh sdlkjash ldkjh aslkdjh salkdjh laksjhd lkasjh dlksajh dlkajh dlkjah sldkh sa�\�Simba Spark ODBC Driver};UseProxy=1;ThriftTransport=2;SSL=1;ProxyPort=xxxxx;ProxyHost=xxxxxxxxxxxxxx;Port=443;HTTPPath=/sql/1.0/warehouses/xxxxxxxxxxx;Host=adb-xxxxxxxxxxxxx.azuredatabricks.net;AuthMech=3;UID=token;PWD=\���\� 6�8���Q�h��Q���7i*V�a6�TyC��lined The fact that parts of the connection string was in the returned data is really confusing. I'm using the PDO classes, but I've also tested the odbc_connect() functions with the same resul […truncated]
Use an RSA key in Snowflake connection options instead of a password
I want to connect to a Snowflake database from a Databricks notebook. I have an RSA key (.pem file) and I don't want to use the traditional username-and-password method, as it is less secure and exposes the password.

pem_file_path = 'rsa_key.pem'
with open(pem_file_path, 'r') as pem_file:
    pem_private_key = pem_file.read()

# Configure Snowflake options
sfOptions = {
    "sfURL": sfurl,
    "sfUser": sfuser,
    "sfDatabase": sfdatabase1,
    "sfSchema": sfschema,
    "sfWarehouse": sfwarehouse,
    "sfRole": sfrole,
    "pem_private_key": pem_private_key  # Direct content of your unencrypted PEM file
}

# Establish your Snowflake connection (example, assuming you use Spark)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SnowflakeConnection") \
    .getOrCreate()

df = spark.read.format("snowflake") \
    .options(**sfOptions) \
    .option("dbtable", "your_table_name") \
    .load()

df.show()
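A common refinement is to keep the key out of the notebook and workspace files entirely and read it from a Databricks secret scope at runtime. A minimal sketch reusing the sfOptions pattern from the question; the scope and key names are hypothetical, and depending on the Snowflake Spark connector version the pem_private_key option may expect only the base64 body of the key without the BEGIN/END header lines:

# Read the private key from a secret scope instead of a .pem file on disk.
# "snowflake_scope" / "rsa_private_key" are hypothetical names.
pem_private_key = dbutils.secrets.get(scope="snowflake_scope", key="rsa_private_key")

sfOptions = {
    "sfURL": sfurl,
    "sfUser": sfuser,
    "sfDatabase": sfdatabase1,
    "sfSchema": sfschema,
    "sfWarehouse": sfwarehouse,
    "sfRole": sfrole,
    "pem_private_key": pem_private_key,
}

df = (spark.read.format("snowflake")
      .options(**sfOptions)
      .option("dbtable", "your_table_name")
      .load())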
Does Databricks Spark SQL evaluate all CASE branches for UDFs?
I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups. Each UDF branches on IPv4 vs IPv6 using a CASE expression like:

CASE
    WHEN ip_address LIKE '%:%:%' THEN
        -- IPv6 path
        ...
    ELSE
        -- IPv4 path
        inet_aton(ip_address)
END

All test inputs are IPv6, so I expect the IPv4 branch to never be evaluated. But when running:

WITH test_ipv6 AS (
    SELECT * FROM VALUES
        ('3FFE:FFFF:7654:FEDA:1245:BA98:3210:4562', 'Example IPv6'),
        ...
)
SELECT
    ip_address,
    get_geo_location(ip_address),
    get_isp_location(ip_address),
    get_geo_country_code(ip_address),
    get_isp_country_code(ip_address)
FROM test_ipv6;

I get: [INVALID_ARRAY_INDEX] The index 4 is out of bounds. The error originates from my IPv4-only helper UDF:

CREATE OR REPLACE FUNCTION inet_aton(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
    SELECT
        element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) * POW(256, 3) +
        element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) * POW(256, 2) +
        element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) * POW(256, 1) +
        element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) * POW(256, 0)
);

This function should never run for IPv6 inputs, but Databricks appears to evaluate or partially evaluate it anyway. Does Databricks / Spark SQL guarantee lazy evaluation of CASE, or can the optimizer evaluate expressions in branches that are not taken? For SQL UDFs, does the optimizer treat the UDF body as fully transparent and attempt folding/evaluation regardless of the CASE predicate? Is there any official guidance or best practice for safely branching IPv4 vs IPv6 logic inside SQL UDFs?
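One defensive pattern, given that SQL UDF bodies can be inlined into the calling query and the optimizer is not guaranteed to skip the untaken CASE branch, is to make the IPv4 helper null-safe so it returns NULL instead of throwing when fewer than four octets are present. A hedged sketch using try_element_at (available in recent Spark / Databricks Runtime versions); the function name inet_aton_safe is hypothetical, and it is shown here via spark.sql but could equally be run as plain SQL:

# Null-safe variant of the helper: try_element_at returns NULL for an
# out-of-range index, so an IPv6 string yields NULL instead of raising
# INVALID_ARRAY_INDEX when the inlined IPv4 branch is evaluated.
spark.sql(r"""
CREATE OR REPLACE FUNCTION inet_aton_safe(ip_addr STRING)
RETURNS BIGINT
DETERMINISTIC
RETURN (
  SELECT
    try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 1) * POW(256, 3) +
    try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 2) * POW(256, 2) +
    try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 3) * POW(256, 1) +
    try_element_at(regexp_extract_all(ip_addr, '(\\d+)'), 4) * POW(256, 0)
)
""")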