Auto Loader
Recent items mentioning Auto Loader across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Recent discussions highlight Auto Loader's schema evolution capabilities, with one user reporting unexpected job failures even when cloudFiles.schemaEvolutionMode is set to addNewColumns [1]. Another community post explored autoscaling strategies for Auto Loader without SDP [2]. Additionally, the Databricks Data Engineer Associate (DEA) exam, updated for 2026, continues to feature Auto Loader as a key topic [3][6].
Generated daily from the 6 most recent items mentioning Auto Loader. Click any [N] to jump to the source.
Job even fails with .option("cloudFiles.schemaEvolutionMode", "addNewColumns") set?
I'm using Autoloader to ingest data from Parquet files into a bronze table. There are a bunch of existing files that have fewer columns than the newer files. When I start the job with a fresh checkpoint, it first walks through the older files (which is expected), and it fails as soon as it picks up the first file that includes the new columns. According to Genie Code this is expected behaviour, and it recommends enabling the retry option for the specific job task to mitigate this. I also noticed that the data of the file reported in the logs as causing the "issue" wasn't ingested into the table at all?! Here's my question: why should I want a job to fail if I accept schema evolution all the way? Instead it should just silently add the new columns to the schema and move on. Is failing the job and doing a retry (= spinning up the job cluster again) really best practice for this scenario? Feels odd to me. Generally I think Autoloader is really badly documented, and there aren't many tutorials covering all possible edge cases, especially what to do in case files were missed.
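The retry advice in the post matches how the Databricks documentation describes `addNewColumns`, as far as it goes: the stream stops with an `UnknownFieldException` after recording the new columns at the schema location, and the next start reads with the evolved schema. That would also explain why the triggering file only shows up after the restart. A minimal sketch of a restart loop follows. It is plain Python with a simulated stream, so it runs without a cluster; `run_with_restarts`, `looks_like_schema_change`, and `fake_stream` are hypothetical names, and matching on the exception text is an assumption about how the error surfaces in Python.

```python
from typing import Callable

def looks_like_schema_change(exc: Exception) -> bool:
    # Assumption: the failure mentions UnknownFieldException in its class name or message.
    return "UnknownFieldException" in f"{type(exc).__name__}: {exc}"

def run_with_restarts(
    run_once: Callable[[], None],
    is_schema_change: Callable[[Exception], bool],
    max_restarts: int = 3,
) -> int:
    """Call run_once until it returns normally.

    Restarts only when the failure looks like a schema change and the restart
    budget is not exhausted; any other error is re-raised unchanged.
    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            run_once()
            return restarts
        except Exception as exc:
            if not is_schema_change(exc) or restarts >= max_restarts:
                raise
            restarts += 1

# Simulated stream (hypothetical): fails once when it meets new columns, then succeeds.
attempts = []

def fake_stream() -> None:
    attempts.append(1)
    if len(attempts) == 1:
        raise RuntimeError("UnknownFieldException: new columns detected (simulated)")

restarts_needed = run_with_restarts(fake_stream, looks_like_schema_change)
print("restarts needed:", restarts_needed)
```

In a real job, `run_once` would start the Auto Loader query and wait for it to finish. Configuring task retries in the job settings achieves the same effect without custom code, at the cost the poster mentions: a retry may bring the job cluster up again.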
Autoscaling with the autoloader without SDP
Databricks Data Engineer Associate Exam Updated for 2026
The Databricks Data Engineer Associate exam changed on May 4, 2026. The exam now has 7 domains instead of 5. Two new domains were added.

The first new domain is CI/CD. This includes:
• Databricks Repos
• Git integration
• Branching and commits
• Deploying Declarative Automation Bundles
• Using the Databricks CLI
• Moving code from dev to test to production

Databricks Asset Bundles is now called Declarative Automation Bundles, so learn the new name. If you have never used Git or the Databricks CLI inside Databricks, spend some time practicing in the Free Edition. Connect a Git repo, make commits, and deploy bundles. Hands-on practice will help a lot.

The second new domain is Troubleshooting, Monitoring, and Optimization. This includes:
• Reading the Spark UI
• Finding bottlenecks like data skew and excessive shuffling
• Understanding Liquid Clustering
• Predictive optimization
• Troubleshooting cluster and memory issues

Many courses do not teach Spark UI deeply, so try running queries yourself and checking the Spark UI. Compare good queries with inefficient ones to understand the difference.

Some existing domains also changed. Ingestion now includes Lakeflow Connect along with Auto Loader and COPY INTO.

Governance now includes:
• Column-level masking
• Row-level security
• Attribute-based access control

You now need to understand security beyond basic GRANT permissions.

Lakeflow Jobs also tests three trigger types:
• Scheduled
• File arrival
• Table update

Know when to use each one.

Some product names also changed:
• Databricks Asset Bundles → Declarative Automation Bundles
• Delta Live Tables → Lakeflow Declarative Pipelines

The exam uses the new terminology, so update your study material if you are using older resources.

The exam format is still:
• 45 scored questions
• 90 minutes
• $200

There may also be extra unscored questions mixed into the exam.

For preparation, the original Academy courses still help for the old domains. But for the two new domains, hands-on practice is very important. Practice:
• Spark UI
• Git integration
• Databricks CLI
• Deployments using bundles

Also read the latest official exam guide PDF from the Databricks page. Good luck to everyone preparing for the exam.
Native Excel support is now GA
Hey r/databricks! Native Excel ingestion on Databricks is now **Generally Available** across AWS, Azure, and GCP. With this release, you can ingest, parse, and query `.xls` / `.xlsx` / `.xlsm` files directly.

Public docs: [https://docs.databricks.com/aws/en/query/formats/excel](https://docs.databricks.com/aws/en/query/formats/excel)

**📂 What is it?**

Native Excel support that lets you:

* Directly read `.xls`, `.xlsx`, and `.xlsm` files using Spark (`spark.read.excel(...)`) or SQL (`read_files`, `COPY INTO`).
* Upload Excel files through the "Create or modify table" UI and land them as Delta.
* Specify exact sheets and cell ranges (e.g., `"Sheet1!A2:D10"`) for complex layouts.
* Infer schema, headers, and data types automatically, or bring your own.
* Stream Excel files with Auto Loader using `cloudFiles.format = "excel"`.
* List sheets in a workbook programmatically before ingesting.

**🤷 Why?**

Until now, Databricks didn't have a native Excel reader. That meant writing custom Python with pandas / openpyxl to convert Excel → DataFrame → Delta, manually exporting sheets to CSV before you could ingest them, or giving up on workflows because the Databricks file-upload UI rejected `.xlsx`.

GA makes Excel a first-class file format across Spark, SQL, Auto Loader, and the table-creation UI. It also opens the door to Excel ingestion via our managed file connectors ([SharePoint](https://docs.databricks.com/aws/en/ingestion/sharepoint), [Google Drive](https://docs.databricks.com/aws/en/ingestion/google-drive#google-drive-metadata-column), [SFTP](https://docs.databricks.com/aws/en/ingestion/sftp), and more coming soon).

**🧑💻 How do I try it?**

1️⃣ Requirements

* Databricks Runtime 18.1 or above.

2️⃣ Try it in the UI

* Click New → Add Data → Create or modify table.
* Upload an `.xls`, `.xlsx`, or `.xlsm` file.
* Pick the sheet. Adjust header rows or cell range if needed.
* Preview the inferred schema.
* Click Create table. It lands as a Delta table in Unity Catalog.

3️⃣ Try it in Spark (batch)

    # Read the first sheet of a workbook
    df = spark.read.excel("<path to excel file>")

    # Use a header row and a specific sheet + range
    df = (
        spark.read
        .option("headerRows", 1)
        .option("dataAddress", "Sheet1!A1:E10")
        .excel("<path to excel directory or file>")
    )

    df.write.mode("overwrite").saveAsTable("<catalog>.<schema>.my_table")

4️⃣ Try it in SQL with read_files

    CREATE TABLE my_sheet_table AS
    SELECT * FROM read_files(
      "<path to excel directory or file>",
      format => "excel",
      headerRows => 1,
      dataAddress => "Sheet1!A2:D10",
      schemaEvolutionMode => "none"
    );

5️⃣ Try it with COPY INTO

    COPY INTO excel_demo_table
    FROM "<path to excel directory or file>"
    FILEFORMAT = EXCEL;

6️⃣ Try it with Auto Loader (streaming)

    df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "excel")
        .option("cloudFiles.inferColumnTypes", True)
        .option("headerRows", 1)
        .option("cloudFiles.schemaLocation", "<schema location>")
        .load("<path to excel directory or file>")
    )

    (df.writeStream
        .format("delta")
        .option("checkpointLocation", "<checkpoint path>")
        .table("<catalog>.<schema>.excel_stream"))

7️⃣ List sheets in a workbook

    sheets = (
        spark.read
        .option("operation", "listSheets")
        .excel("<path to workbook>")
    )
    sheets.show()  # returns sheetIndex, sheetName

**🎛️ Supported options**

|Option|Description|
|:-|:-|
|`dataAddress`|Cell range in Excel syntax. Examples: `"MySheet!C5:H10"`, `"C5:H10"`, `"Sheet1"`. Defaults to all valid cells on the first sheet.|
|`headerRows`|Number of header rows inside `dataAddress` (0 or 1). Default: 0.|
|`operation`|`"readSheet"` (default) or `"listSh […truncated]
[Passed] Databricks DEA Exam today
Just walked out of the exam and I’m glad to say I passed. I was sweating a bit because the exam content changes on the 4th, so I really didn't want to fail and have to deal with a new syllabus.

I've had Databricks at work since late 2023. I’ve been using it because, well, it’s there, but I was mostly just "vibe coding", picking up some Python and Spark here and there without any real depth. I ran jobs using whatever cluster settings the company gave me without actually knowing what they meant.

If you’ve never touched Databricks, this exam is going to be a pain. Even if you’re good at coding, the internal components and the way everything fits together are hard to grasp just by reading. You really need to get your hands dirty in the workspace to get a "feel" for it.

**Study Routine**

I started with the Databricks Academy stuff, but since I’m juggling work and a toddler, I could only study on weekends. This was a disaster because by the next Saturday, I’d already forgotten what I learned the week before. One month before the exam, I ditched the theory and just hammered Mock Exams.

* Udemy is your friend: I bought practice exams from Derar and Santosh.
* I snagged them at a discounted price. Just wait for the sale if you are not in a hurry.

Personally, Santosh’s exams felt closer to the real thing. I saw maybe 5-6 questions that were almost word-for-word. Derar is also solid; honestly, just solve as many problems as possible.

Since my study time was limited, I focused on reviewing the questions I got wrong. I realized pretty early that Productionizing Data Pipelines was my weak spot. I didn't try to become an expert in it. I just aimed for a 60% "pass" in that section and doubled down on the areas I was actually good at. Don't completely ignore your weak areas though. If you bomb one section too hard, a couple of silly mistakes in other sections will kill your score.

**What's on the exam**

The questions are mostly scenario-based. You have to read the prompts carefully. Some things I remember:

* Autoloader: This came up a lot.
* DLT (now called Lakeflow Spark Declarative Pipelines): should understand what it actually does
* Unity Catalog: Permissions (Granting minimum access) and the actual SQL code for it.
* Delta Sharing: Knowing the difference between sharing with Databricks vs. non-Databricks users.
* Egress Costs: How to avoid them in cross-cloud sharing (Cloudflare R2 was the answer for one).
* SQL Warehouses: Classic vs. Pro vs. Serverless. Know when to use which.
* DABs (Databricks Asset Bundles): I got at least 3 questions on this. Don't skip it.
* Medallion Architecture: It’s not just "what is Bronze/Silver/Gold." They’ll give you a scenario and ask which layer the data should go to next.

Also, those "select two" questions are the absolute worst, super confusing.

I know the syllabus is changing on the 4th, so I’m not sure how much of this will still apply. But honestly, if you have some background and get familiar with the core concepts, it’s a very doable exam. I’ve learned a lot through this process. Good luck to everyone preparing!
Here are 5 topics that showed up much more than I expected in my DEA exam
I took the Databricks Data Engineer Associate exam recently and wanted to share what actually came up because it was quite different from what I spent most of my time studying. I went in thinking Delta Lake theory and platform architecture would be the big topics. They weren't. The exam is way more practical than I expected.

**The first thing** that caught me off guard was how heavily they test Auto Loader. Not just the basics but real scenarios. One question described a pipeline receiving 50,000 new files per day and asked which ingestion method to use and why. You need to understand when Auto Loader makes sense versus COPY INTO, how schema evolution works with mergeSchema, and the difference between directory listing and file notification mode. I probably got six or seven questions just on this one topic.

**The second thing** was lazy evaluation. I knew the concept but I wasn't prepared for how they test it. They give you a block of code with four or five DataFrame transformations and ask what happens when you run the cell. The answer is nothing happens because there is no action at the end. But the way they frame the questions makes you second-guess yourself if you only memorized the definition without really understanding it.

**Third** was Lakeflow expectations. The old name was Delta Live Tables but they use Lakeflow in the exam now. You need to know the three expectation types and when to use each one. They gave me a scenario where the pipeline should log bad records but never drop them and I had to pick the right expectation decorator. Also know the difference between streaming tables and materialized views because that came up more than once.

**Fourth** was Unity Catalog permissions. Not just the three-level naming pattern but actual grant scenarios. Something like: a data analyst needs to read tables in the sales schema but should not be able to create new tables, and you have to pick the correct grant statement. I got at least three or four questions like this.

**Fifth** was MERGE INTO. They really love this command. Upsert scenarios, deduplication, slowly changing dimensions. If you cannot write a MERGE statement from memory with the WHEN MATCHED and WHEN NOT MATCHED clauses, you should spend an hour practicing just that before you sit for the exam.

What surprised me was how little some topics were tested. Cluster configuration was maybe one question. The architecture diagrams with control plane and data plane were one or two questions at most. Delta Sharing was one question. Spark internals like shuffle details were barely mentioned.

The biggest thing I wish I had done differently is spend less time reading documentation and more time actually running code. When you have actually executed a MERGE INTO on a real table and seen the results, the exam question feels like something you have done before instead of something you read about once. I used Databricks Free Edition for all my practice and it was more than enough.

Hope this helps someone who is preparing right now. Feel free to ask anything about the exam in the comments and I will try to answer.
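The lazy evaluation point in the post can be illustrated without a cluster. The sketch below is a toy analogy in plain Python, not Spark's API: `LazyPipeline` is a hypothetical class that records `map` and `filter` steps and runs them only when `collect` is called, which mirrors the split between DataFrame transformations and actions.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]

class LazyPipeline:
    """Toy stand-in for a DataFrame: transformations are recorded, not executed."""

    def __init__(self, rows: Iterable[Any], steps: Tuple[Step, ...] = ()):
        self._rows = rows
        self._steps = steps

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Transformation: returns a new plan, touches no data.
        return LazyPipeline(self._rows, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyPipeline":
        # Transformation: returns a new plan, touches no data.
        return LazyPipeline(self._rows, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: only here are the recorded steps applied to the rows.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out

calls = []

def double(x: int) -> int:
    calls.append(x)  # side effect so we can see when the function really runs
    return x * 2

pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = pipeline.collect()       # the action triggers the work
print(calls_before_action, result, len(calls))
```

A cell that stops at the line building `pipeline` does no work, which is the trap the exam questions set. In PySpark the equivalent actions would be calls such as `show()`, `count()`, `collect()`, or a write.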
News: Databricks News: Catalog and External locations in DABS, Schema Evolution, File Events, Queries Tags
Databricks Runtime 18.1 introduces schema evolution for inserts, managed file events for Auto Loader, and a simplified `TABLE` syntax for querying. The video also demonstrates new features like the AI Gateway for LLM governance, query tags for tracking, and the GA release of the supervisor agent.
News: Databricks Breaking News: Week 50: 8 December 2025 to 14 December 2025 #databricks news
Databricks now supports native reading and writing of Excel files in PySpark, SQL, and Auto Loader, including features like sheet listing and range targeting. Additionally, Databricks Runtime 18 is available in beta, introducing improvements for streaming queries and new system columns for job tables, alongside a new Lakebase experience with project and branching capabilities for transactional databases.
News: From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%
News: Lakeflow Connect: Smarter, Simpler File Ingestion With the Next Generation of Auto Loader
Tutorials: 126. Databricks | Pyspark | Downloading Files from Databricks DBFS Location
News: 125. Databricks | Pyspark| Delta Live Table: Data Quality Check - Expect
Tutorials: 124. Databricks | Pyspark| Delta Live Table: Datasets - Tables and Views