Apache Spark
Recent items mentioning Apache Spark across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Unity Catalog Open APIs now offer expanded interoperability, allowing external engines like Apache Spark, Flink, and DuckDB to create, read, and write to UC managed Delta tables [4][5]. Databricks also introduced Lakeflow Designer for visual data preparation, with a workaround using Genie to convert visual workflows into clean PySpark/SQL notebooks [3]. For data engineering system design, a six-step framework for mastering interviews covers pipeline design, data modeling, and storage, including file formats [1].
Generated daily from the 10 most recent items mentioning Apache Spark.
Community: How I Mastered System Design Interviews
This video teaches a six-step framework for mastering data engineering system design interviews, covering requirements gathering, pipeline design, data modeling, storage and file formats, data quality and observability, and pipeline resilience. It demonstrates how to apply this framework with practical examples and back-of-the-envelope calculations to justify design choices.
Databricks Delivery Solutions Architect Interview Experience Needed (Hiring Manager Round)
Hi everyone, I recently completed the screening round for a Senior Solutions Architect role at Databricks, and after reviewing my profile and discussion, the team mentioned that my experience aligns more closely with the Delivery Solutions Architect role. I’ve now been scheduled for the Hiring Manager round next week. The recruiter shared that the interview will focus on areas like:
* AI/Data & AI architecture discussions
* Customer advisory / pre-sales scenarios
* Driving adoption and execution
* Stakeholder management
* Bias for action and ownership
* Team fit and real-world project experience
I have around 7.9 years of experience in Big Data Engineering, PySpark, SQL, Python, Databricks, and AI/ML-related work. Has anyone recently interviewed for the Delivery Solutions Architect role at Databricks? Would really appreciate any insights on:
* What kind of questions were asked
* Depth of technical discussion expected
* Whether coding/system design is involved
* How much focus is on customer-facing/pre-sales scenarios
* Tips to prepare for the Hiring Manager round
Thanks in advance!
Events: Databricks News: Lakeflow Designer, UV package manager, DABs templates, Genie scheduled tasks
Databricks introduces Lakeflow Designer for visual data preparation, though its generated code is messy; a workaround uses Genie to convert the visual workflow into clean PySpark/SQL notebooks. The UV package manager significantly speeds up package installations on Databricks serverless runtimes, and DABs templates allow for standardized, customizable Databricks Asset Bundles.
Expanded interoperability with Unity Catalog Open APIs
Unity Catalog Open APIs now offer expanded interoperability, with external access to UC managed Delta tables in Beta and credential vending generally available with M2M OAuth support. External engines like Apache Spark, Flink, and DuckDB can now create, read, and write to UC managed Delta tables, leveraging Delta Lake's new catalog commits feature for safe concurrent writes and auditability.
Tutorials: How to use Meta Conversions API on Databricks to activate first-party data
The Databricks Meta Conversions API app enables users to send conversion events from the Databricks Lakehouse directly to Meta Ads Manager. It provides a guided setup to connect Databricks to Meta using a pixel ID and access token, allowing for quick testing with sample data, deploying customizable notebooks, or setting up automated jobs for continuous data flow.
Day to Day life of DATABRICKS ENGINEER
I need help from you guys: which coding language should I focus on as a data engineer with 4 years of experience: PySpark, Python, or SQL? If possible, can anyone please share their daily tasks, i.e. what they do regularly as a Databricks engineer?
Tips for integrating data quality tests?
I've been brought on as a data engineering consultant for a small to mid-sized company who has a poorly built architecture in Databricks. There's currently no documentation or clear architecture, so I've been spending weeks trying to untangle everything. They now want me to start implementing data quality checks because as of now there's no testing within the process at all and they're unsure if their outputs are even correct. Currently the data they want me to test are just raw files uploaded into Databricks tables on an irregular schedule, all with different granularity and logic that will require more complex checks than just null checks and unique primary keys. What is the best starting point for this? They have jobs and jobs that run jobs but no pipelines established, and I don't think I have the power to change that yet, so I think that takes DLT off the table unless I can prove it's worth the refactor. My first thought was integrating pyspark testing scripts to run within the jobs, but there has to be a more sophisticated way to do this?
How to group connected materials and aggregate stocks in PySpark/SQL on Databricks
How can I solve my current issue? I'll give you some sample raw data and expected data below. The process can use SQL and Python; I'm currently using a Databricks notebook. As you can see in the data, I don't have a direct link between all of the materials. I have A1, B3; any material is fine as long as it's within the same group. But in the group column, if I already have group A1, it still needs to be A1 even if I add another material within that group. I want to retain it. So my solution is to save it into ADLS, and for the STOCKS column I also want to aggregate the stocks within the same group. Sample: https://i.sstatic.net/b2Q05lUr.png Any help would be greatly appreciated. Thanks!
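One way to read the question above is as a connected-components problem: materials that are linked, directly or through other materials, belong to one group, and stocks are summed per group. The sketch below uses a union-find structure in plain Python to illustrate that idea. The sample data sits behind an image link, so every material name, link, and quantity here is hypothetical, as are the function and parameter names; it is an illustration of the technique, not the poster's solution.

```python
from collections import defaultdict


def group_materials(links, stocks, existing_groups=None):
    """Assign a group label and a total stock to every material.

    links: iterable of (material_a, material_b) pairs known to be connected.
    stocks: dict of material -> stock quantity.
    existing_groups: dict of material -> label that was already assigned
        and should be retained (hypothetical input for this sketch).
    """
    existing_groups = existing_groups or {}
    parent = {}

    def find(x):
        # Return the root of x's component, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for material in stocks:
        find(material)  # register materials that have no links at all
    for a, b in links:
        union(a, b)

    members = defaultdict(list)
    for material in list(parent):
        members[find(material)].append(material)

    result = {}
    for mats in members.values():
        # Retain a label that already exists in the group; otherwise fall
        # back to the smallest material id as the label.
        kept = sorted(existing_groups[m] for m in mats if m in existing_groups)
        label = kept[0] if kept else min(mats)
        total = sum(stocks.get(m, 0) for m in mats)
        for m in mats:
            result[m] = (label, total)
    return result


# Hypothetical sample data.
links = [("A1", "B3"), ("B3", "C7"), ("D2", "E5")]
stocks = {"A1": 10, "B3": 5, "C7": 2, "D2": 1, "E5": 4, "F9": 7}
grouped = group_materials(links, stocks, existing_groups={"A1": "A1"})
```

On Databricks the same idea would normally be expressed with a graph library or an iterative join rather than a driver-side loop; the pure-Python version only shows the grouping logic on a small input.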
News: Databricks News: watermark-based incremental ingestion, MCP in AI gateway, Genie, Vector Search
Databricks now offers watermark-based incremental ingestion from SQL databases without change data feed, allowing for efficient data updates and soft deletion handling. The AI Gateway supports custom MCP servers, enabling integration with external APIs like GitHub for enhanced AI application development.
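The general idea behind watermark-based incremental ingestion can be sketched in a few lines of plain Python: remember the highest value of a monotonically increasing column seen so far, and on each run pull only rows above it. This is a generic illustration, not the Databricks feature's implementation; the column names `updated_at` and `is_deleted`, the function names, and the sample rows are all assumptions made for the example.

```python
from datetime import datetime


def incremental_batch(source_rows, last_watermark, watermark_col="updated_at"):
    """Return the rows changed since last_watermark and the advanced watermark."""
    changed = [r for r in source_rows if r[watermark_col] > last_watermark]
    new_watermark = max((r[watermark_col] for r in changed), default=last_watermark)
    return changed, new_watermark


def apply_batch(target, changed):
    """Upsert changed rows into target, keyed by id.

    A soft-deleted row arrives as an ordinary update with its flag set,
    so it stays in the target and is only marked as deleted.
    """
    for row in changed:
        target[row["id"]] = dict(row)
    return target


# Hypothetical source table.
source = [
    {"id": 1, "name": "a", "is_deleted": False, "updated_at": datetime(2026, 1, 1)},
    {"id": 2, "name": "b", "is_deleted": False, "updated_at": datetime(2026, 1, 2)},
]
target = {}

# First run: everything is newer than the initial watermark.
batch, watermark = incremental_batch(source, datetime.min)
apply_batch(target, batch)

# Later, row 2 is soft-deleted in the source, which bumps its updated_at.
source[1] = {"id": 2, "name": "b", "is_deleted": True, "updated_at": datetime(2026, 1, 3)}
batch2, watermark2 = incremental_batch(source, watermark)
apply_batch(target, batch2)
```

The strict `>` comparison keeps the sketch simple, but it would miss a row committed later with a timestamp equal to the stored watermark; real implementations usually need an overlap window or a tie-breaking column.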
The learning order that actually works for Databricks. I wasted 3 months before figuring this out.
I want to share something that I wish someone told me when I started learning Databricks because it would have saved me months of confusion. When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor but the moment I opened a blank notebook I had no idea where to start. Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything. Here is the order that worked for me and I genuinely believe it works for most people. Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database this stage takes a few days not weeks. Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run SELECT VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table so you get versioning, schema enforcement, and ACID transactions without configuring anything. Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing. Then build pipelines. 
Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps. Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation but once you have built a pipeline you understand exactly why it matters. The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there everything fell into place faster. One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version that tutorial was written for Community Edition which no longer exists. Just open a notebook and start writing code. Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.
Marimo on Databricks
My workflow for a long time involved me switching back/forth between vscode and browser/databricks ui. I like to write my "production code" in normal python, but notebooks are great for exploration, spikes, visualization, triage etc. I could write a small dissertation but for various reasons I don't really like jupyter, and databricks notebooks have their own problems with commented magic commands etc. This led me to check out [marimo](https://marimo.io/), and wow, these are so cool. Code that runs in normal python, merges cleanly, has visualizations, widgets, the app runs locally and doesn't glitch out, and even the vscode extension works nicely. The problem was, the databricks support wasn't great. It just felt a bit dated. It required a warehouse for sql, doesn't seem to really support serverless, and there were just so many opportunities to plug databricks into Marimo. This led me to create [marimo-databricks-connect](https://github.com/brookpatten/marimo-databricks-connect) [pypi](https://pypi.org/project/marimo-databricks-connect/) I tried to plug in "all the things" databricks into the place where they go in Marimo. I'm pretty happy with the result.
- Connect to databricks using databricks-connect & spark (not sql warehouse)
- Authenticate/configure spark using the default databricks-connect process (env vars, .databrickscfg etc), no additional auth config.
- Execution of both python & sql cells
- Autocomplete Catalog/Schema/Table/Column Names
- Browsing of catalogs/schemas/tables/columns in the marimo data sources view
- Browsing of external locations, volumes, dbfs, workspace in the marimo storage browser
- Notebook widgets to monitor and control specific instances of databricks capabilities (clusters, workflows, vector search, apps etc)
- Widgets to browse & explore databricks capabilities (compute, workflows, unity catalog)
- Works in local marimo (marimo edit notebook.py) and in the vscode extension
- Deploy as a databricks app to provide an alternative web based marimo UI.
I'm working on adding serving endpoints as AI providers to the notebooks too. In particular what I like to use this for is creating "command center" notebooks for given processes that can include some normal pyspark/sql code to query/triage, widgets to monitor/control various databricks resources, visualizations to monitor dq etc. I just wanted to share and see what the community thinks, would you use it? contributions are welcome. throwaway account because i'm doxing myself via gh repo.
Help reading data
I am working on a Python data project for which I need to read data from parquet files stored in a volume as well as delta tables. Downstream I need the data in a pandas DataFrame. To read the parquet I have used pd.read_parquet(); this however is really slow compared to when I read the file from my machine. With the delta table, it is quick when read as a pyspark DataFrame, but the toPandas() operation is also slow. I realise I am probably doing it naively, I wondered if someone had some advice. Edit: Some additional info. The table and parquet are about 7GB. The .toPandas() operation doesn't complete after an hour and read_parquet takes about 20mins.
Tutorials: Apache Spark Streaming Real-Time Mode - Latency Demo
The video demonstrates how to deploy and run Apache Spark Streaming in Real-Time Mode (RTM) using a Declarative Automation Bundle. It shows that RTM significantly reduces P50 and P95 latencies compared to microbatch mode, achieving 26ms and 50ms respectively in a simplified setup without an external messaging bus.
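For readers unfamiliar with the P50 and P95 figures quoted above, the sketch below shows one common way to derive them from per-record latency samples, using the nearest-rank definition of a percentile. The sample latencies are made up for illustration and are not measurements from the video; the function name is likewise an assumption.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical end-to-end latencies in milliseconds.
latencies_ms = [8, 9, 10, 11, 12, 13, 15, 18, 22, 40]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

P50 describes the typical record, while P95 exposes the slow tail, which is why streaming latency comparisons usually report both. Other percentile definitions interpolate between neighbouring samples and can give slightly different values on small inputs.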
Tutorials: Air Traffic Control with Apache Spark Structured Streaming Real-Time Mode
The video demonstrates building a real-time air traffic control application using Apache Spark Structured Streaming Real-Time Mode, Lakehouse, and Databricks Apps. This system processes live flight telemetry, detects congestion, and generates alerts with sub-second end-to-end latency, all within a single Databricks platform.
Splitting string into respective columns
I am trying to read the log files and split the data into multiple columns using Databricks and Python. So far I have been able to split the string into an array, but I am not able to put the elements into columns like: type, date, time, comment.

Input (column content):
INFO 2025-10-10 08:01:23 Starting Spark application
WARN 2025-10-10 08:02:01 Memory usage is high: 75%
ERROR 2025-10-10 08:03:05 Task failed for partition 3

Code so far:
    from pyspark.sql.functions import split
    df_split = (
        df.select(split('content', r' ', 0).alias('row'))
    )
    df_split.select('row').display()

Output (column row):
[INFO, 2025-10-10, 08:01:23, Starting, Spark, application]
[WARN, 2025-10-10, 08:02:01, Memory, usage, is, high:, 75%]
[ERROR, 2025-10-10, 08:03:05, Task, failed, for, partition, 3]

I need this to be displayed in the following format:
type   date        time      comments
INFO   2025-10-10  08:01:23  Starting Spark application
WARN   2025-10-10  08:02:01  Memory usage is high: 75%
ERROR  2025-10-10  08:03:05  Task failed for partition 3
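The core of the problem above is that splitting on every space also breaks the free-text comment apart. A bounded split, which stops after the first three separators, keeps the remainder together as the fourth field. The sketch below shows that idea in plain Python on the log lines from the post; the function name is hypothetical. In PySpark, `pyspark.sql.functions.split` also takes a `limit` argument, so passing 4 instead of 0 and then selecting array items 0 to 3 under column aliases should behave analogously, though that variant is not run here.

```python
def parse_log_line(line):
    """Split a log line into type, date, time, and the remaining comment text."""
    # maxsplit=3 stops after three separators, so the comment keeps its spaces.
    level, date, time, comments = line.split(" ", 3)
    return {"type": level, "date": date, "time": time, "comments": comments}


# Log lines taken from the post above.
lines = [
    "INFO 2025-10-10 08:01:23 Starting Spark application",
    "WARN 2025-10-10 08:02:01 Memory usage is high: 75%",
    "ERROR 2025-10-10 08:03:05 Task failed for partition 3",
]

rows = [parse_log_line(line) for line in lines]
```

This assumes every line has at least four space-separated fields; a malformed line would raise a `ValueError` on unpacking, which a production parser would need to handle.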
UnityCatalog 0.4.1
The Unity Catalog Spark connector now supports atomic REPLACE TABLE AS SELECT and Dynamic Partition Overwrite for managed Delta tables, and a new credential-scoped file system to prevent out-of-memory errors. This release also adds support for the VARIANT data type in the UC client and fixes a critical security vulnerability (CVE-2026-27478) requiring new server configuration for existing deployments with authorization enabled.
Delta Lake 4.2.0
This release enhances Unity Catalog managed tables with support for REPLACE TABLE, RTAS, Dynamic Partition Overwrite, and improved streaming read options like `startingTimestamp` and `skipChangeCommits`. It also introduces GA support for Variant columns, Geospatial types with data skipping, and collated strings, alongside fixes for Variant stats and decimal predicates.
News: Databricks News: AUTO CDC, Workspace skills, Ask Genie, and Type widening
Databricks introduces AUTO CDC for efficient change data feed processing, notebook and governed tags for better organization, and workspace skills for Ask Genie to customize its responses. Databricks also adds type widening for streaming tables, allowing data types to automatically adjust to larger incoming values.
Tutorials: 54 Zerobus Ingest Lakeflow Standard Connector | Ingest Streaming data directly into Delta Table
The video demonstrates how to use Databricks Zerobus Ingest, a push-based API, to directly stream various data types like IoT, event, and telemetry data into Unity Catalog Delta tables. It highlights Zerobus Ingest's ability to simplify streaming ingestion by eliminating the need for intermediate message buses and managing their infrastructure.
News: Databricks News: Excel add-in, Metrics Views UI, and Quality Monitoring
Databricks announced Lake Watch for cybersecurity, new dynamic dropdown filters in SQL editor, and improved quality monitoring with null value scanning and automated alerts. The video also demonstrates a new UI for defining metric views, an Excel add-in for data preview and import, and the ability to publish dashboards as public web pages.
Releases: Introducing Pantheon - Agentic Engineering At Scale
Pantheon is a Databricks application that uses a multi-agent system to generate Lakeflow pipelines for data engineering, allowing users to define data ingestion and transformation rules through a conversational interface. It automates the design, validation, and code generation for lakehouse pipelines, enabling citizen engineers to build robust data solutions without deep PySpark knowledge.
News: Databricks News: Free Tier, Multi-statement transactions, Declarative Automation Bundles, Genie Code
Databricks now offers a free tier for Lakeflow Connect, providing 100 DBUs per day per workspace, and has introduced multi-statement transactions in Unity Catalog that ensure atomicity with rollback capabilities. The platform also announced a Databricks One mobile app, a new AI runtime with pre-installed tools for GPU use cases, and enhanced Genie Code that understands project structure for automated development tasks. Additionally, Databricks Asset Bundles are now called Declarative Automation Bundles and use a faster direct engine, and a new 5X-Large SQL warehouse is available for processing terabytes of data.
Tutorials: 53 Lakeflow Connect SQL Server Managed Connector | Ingest Data using Databricks native connectors
The video demonstrates how to ingest data from SQL Server into Databricks using Lakeflow Connect's managed connector, covering the setup of a SQL Server database, user permissions, and enabling change tracking/change data capture (CT/CDC). It then walks through configuring the Databricks connection, creating gateway and ingestion pipelines, and showcasing how SCD Type 2 changes are automatically managed.
News: Databricks News: unit testing, OneLake federation, scoped access tokens
Databricks now allows creating Unity Catalog domains for business users, running JAR tasks on serverless compute, and federating OneLake data directly into Databricks. The platform also introduces in-workspace Python unit testing, new data connectors like HubSpot and TikTok Ads, and scoped personal access tokens for enhanced security.
News: Databricks News: Catalog and External locations in DABS, Schema Evolution, File Events, Queries Tags
Databricks Runtime 18.1 introduces schema evolution for inserts, managed file events for Auto Loader, and a simplified `TABLE` syntax for querying. The video also demonstrates new features like the AI Gateway for LLM governance, query tags for tracking, and the GA release of the supervisor agent.
Delta Lake 4.1.0
Delta Lake 4.1.0 enhances Unity Catalog integration with improved support for catalog-managed tables, including atomic CTAS and conflict-free feature enablement for Deletion Vectors and Column Mapping. It also introduces a new Spark V2 connector based on Delta Kernel API for streaming reads and server-side planning capabilities.
Tutorials: Databricks End-To-End Project | Zero-To-Expert | Streaming, AI, Lakeflow, Unity Catalog, AI/BI
This video demonstrates building an end-to-end restaurant analytics platform on Databricks, covering streaming and batch data ingestion, AI-powered sentiment analysis, and dashboard creation. It teaches how to use Unity Catalog, Lakeflow Connect for CDC, Spark Declarative Pipelines for real-time data from Event Hub, and how to construct a medallion architecture with fact and dimension tables.
News: Databricks Breaking News: 2026 Week 6: 2 February 2026 to 8 February 2026
Databricks introduces agentic data quality monitoring with anomaly detection, LLM judge UI builder for MLflow, and new SQL warehouse features including a default option and activity details. The platform also enhances its assistant to connect with MCP servers, improves Google Sheets integration with pivot table functionality, and adds direct Git deployment and tagging for Databricks apps.
News: Databricks Breaking News: 2026 Week 5: 26 January 2026 to 1 February 2026
Databricks now allows triggering materialized views or streaming tables on update, automatically detecting source changes and refreshing the pipeline. MLflow traces can now be stored in Unity Catalog using OpenTelemetry, providing a centralized logging system for experiment data.
News: Databricks Breaking News: 2026 Week 4: 19 January 2026 to 25 January 2026
Databricks introduces temporary tables that are Unity Catalog managed, materialized, and allow DML operations, automatically cleaning up after a session or seven days. Materialized views now support refresh policies like incremental strict, which verifies if a view can be incrementally refreshed before deployment.
News: Databricks Breaking News: 2026 Week 3: 12 January 2026 to 18 January 2026
Databricks Runtime 18 is now Generally Available, offering Spark 4.1 and improved identifier/parameter marker availability. New features include Lakeflow Connect for row filtering during ingestion, Codex models (GPT Codex Max and Mini) for code development, and Databricks One improvements like favorites and data preview in Gen Rooms.
Delta Lake 4.0.1
The "managed table" feature is renamed to `catalogManaged` (breaking change for `catalogOwned-preview` and `ucTableId`), and Unity Catalog now supports OAuth authentication for catalogs. This release also fixes a `NoSuchMethodError` when running `REORG TABLE … APPLY (PURGE)` with Spark 4.0.1 and enables creating UC-managed Delta tables where properties are sent to the UC server.
News: Databricks Breaking News: Week 2026 02: 5 January 2026 to 11 January 2026 #databricks news
Databricks now allows changing catalog and schema during dashboard deployments, addressing a previous issue with environment-specific configurations. The Databricks CLI has a breaking change with plan version 2, altering the structure of deployment plans.
This release adds support for multiple constraints at once, generates Symlink Manifests for external engines, and introduces GCS auto-registration. It also includes fixes for schema evolution in merge operations, improved error reporting, and enhanced handling of empty tables.
News: Databricks Breaking News: Week 2026 01: 29 December 2025 to 4 January 2026 #databricks news
Databricks now supports deploying asset bundles from a generated plan, enabling CI/CD integration for review and approval. Unity Catalog introduces new secret grants, and Runtime 18 brings "everywhere" implementations for literal string coalescing, parameter markers, and identifiers, along with window functions in metric views and general availability for SQL scripting.
This release introduces several API changes and integrates `delta_kernel` for improved stats parsing performance. It also fixes issues with schema evolution during merge operations and null handling in scalar extraction.
Releases: Databricks Breaking News: Week 52: 22 December 2025 to 28 December 2025 #databricks news
Databricks introduces a direct mode for asset bundles, offering faster deployments without Terraform, and the Databricks Assistant agent mode is now in public preview, capable of multi-step notebook editing and data analysis. Other updates include single-use refresh tokens for enhanced security, partition columns now included in Parquet files for improved compatibility, and new dashboard features like custom labels, flexible sorting, and Microsoft Teams integration for scheduled reports.
News: Databricks Breaking News: Week 51: 15 December 2025 to 21 December 2025 #databricks news
Databricks introduces new Lakeflow Connect features, including custom logic for declarative pipelines and new connectors for incremental data import from sources like Confluence, PostgreSQL, and MySQL. The platform also announces the deprecation of legacy features like Hive Metastore and DBFS for new accounts, alongside updates to Lakehouse ACLs, job scheduling from notebooks, flexible node types for cluster deployment, and expanded resource assignment in Databricks apps.
News: Databricks Breaking News: Week 50: 8 December 2025 to 14 December 2025 #databricks news
Databricks now supports native reading and writing of Excel files in PySpark, SQL, and Auto Loader, including features like sheet listing and range targeting. Additionally, Databricks Runtime 18 is available in beta, introducing improvements for streaming queries and new system columns for job tables, alongside a new Lakebase experience with project and branching capabilities for transactional databases.
Tutorials: 52 Lakeflow Spark Declarative Pipelines | New Pipeline Code Editor | AUTO CDC |External Target Sinks
Databricks' Lakeflow Spark Declarative Pipelines (SDP), formerly Delta Live Tables (DLT), offers a unified solution for data ingestion, transformation, and orchestration, now open-sourced with Apache Spark 4.1. The video demonstrates using the new pipeline code editor to build SDPs in Python and SQL, showcasing features like AUTO CDC (formerly APPLY CHANGES) and external target sinks.
Tutorials: 34 Write PySpark Unit Test Cases using PyTest module | Setup PyTest with PySpark
The video demonstrates how to write PySpark unit test cases using the Pytest module. It covers setting up Pytest, creating fixtures for Spark sessions, and writing test functions to validate PySpark transformations and filters.
News: Why YouTube NOT Udemy? #dataengineering #easewithdata #pyspark #databricks
The creator explains they offer free data engineering content on YouTube because they struggled to find good, affordable learning resources when they were starting out. They aim to provide high-quality, demo-rich content for free to prevent others from facing similar difficulties with paid, low-quality courses.
Tutorials: 33 What is Spark Connect? | Spark Connect vs Spark Session | Setup Spark Connect Server with Cluster
Spark Connect decouples the client and server, allowing remote connection to Spark clusters using DataFrame APIs from various IDEs and languages, unlike Spark Session which tightly couples them and supports low-level RDD APIs. The video demonstrates setting up a Spark 3.5 cluster, starting a Spark Connect server, and running PySpark DataFrame operations remotely from VS Code.
Community: Apache Spark Was Hard Until I Learned These 30 Concepts!
The video explains 30 key Apache Spark concepts, starting with a comparison to MapReduce to highlight Spark's in-memory processing and DAG-based execution model. It then details Spark's cluster architecture, job execution flow (driver, executors, tasks), and memory management within executor containers.
Tutorials: 04_2 - Setup PySpark in Local Machine with Jupyter Lab | PySpark Local Machine Setup
The video demonstrates setting up PySpark with Jupyter Lab on a local machine using Docker, first as a standalone instance and then as a multi-node cluster. It walks through installing Docker Desktop, pulling a PySpark Jupyter Lab image from Docker Hub, configuring ports, and verifying the setup by running a basic PySpark job.
News: Databricks: What’s new in October 2025 #databricks news
Databricks introduces Databricks One, a new business-focused experience with consumer access for dashboards and Genie, alongside updates to Genie for defining relations and extended API endpoints. The platform also adds features like easy conversion of external to managed tables, enhanced Databricks Asset Bundles with policy integration and script execution, and new system tables for MLflow tracking and data classification results.
Tutorials: Databricks + Cursor IDE: Step-by-Step AI Coding Tutorial
The video demonstrates using Cursor IDE for AI-enhanced Databricks development, focusing on setting up Databricks Connect and leveraging Cursor rules and context for efficient code generation and testing. It shows how to structure projects, write Python and PySpark code, and create unit tests, highlighting the importance of providing clear instructions to the AI agent.
News: Databricks: What’s new in September 2025? #databricks
Databricks now supports geospatial data types (geography and geometry) with new functions for visualization and spatial operations, and introduces serverless GPU clusters for distributed GPU code execution. The platform also offers enhanced notebook features like side-by-side editing and a notebook-specific search, along with new options for managing serverless environments, SQL warehouses, and access requests in Unity Catalog.
Tutorials: Delta Lake Masterclass | Azure Databricks | PySpark | From Zero-To-Expert
This video provides a comprehensive masterclass on Delta Lake using Azure Databricks and PySpark, covering its core concepts, internal workings, and practical applications. It demonstrates how Delta Lake solves data lake problems like lack of ACID support, DML operations, and schema enforcement, and teaches features like time travel, concurrency control, and optimization techniques.
Tutorials: 51 Setup Azure DevOps Pipeline with Databricks Asset Bundles (DABs) | Complete CICD Process
The video demonstrates how to set up an Azure DevOps pipeline to deploy Databricks Asset Bundles (DABs) to higher environments like QA. It covers configuring service principal permissions, setting up Azure pipeline variables for environment-specific details, and writing the YAML pipeline code to validate and deploy Databricks assets.
Tutorials: 50 Databricks Asset Bundles | Configure Production grade DABs | CICD using DABs (IAC)
The video demonstrates how to configure and deploy Databricks Asset Bundles (DABs) for managing Databricks assets like notebooks, jobs, and pipelines across different environments. It covers creating a structured DAB project, defining resources and targets in YAML, and deploying using both the Databricks UI and CLI, including setting up environment-specific configurations and variables.
Tutorials: 49 Databricks CLI | Install and Authenticate Databricks CLI | U2M and M2M Authentication
The video demonstrates how to install the Databricks CLI on Windows and authenticate it using both User-to-Machine (U2M) and Machine-to-Machine (M2M) methods. It then shows how to run various CLI commands to interact with Databricks workspaces and account consoles, such as listing catalogs, creating schemas, and managing groups.
Tutorials: 48 Databricks GIT Folders | Configure GIT repository with Databricks using Azure DevOps Repo
Tutorials: 47 AIBI Genie Space in Databricks | Use Natural Language to Query data
Events: Bringing Declarative Pipelines to the Apache Spark™ Open Source Project
Databricks announces the contribution of Spark Declarative Pipelines to Apache Spark, allowing users to build end-to-end production pipelines with a few lines of SQL. This new feature simplifies data processing by abstracting away complex technologies, enabling focus on data value.
Events: Introducing Apache Spark 4.0
Apache Spark 4.0 introduces SQL UDFs, a new pipe syntax, the VARIANT data type, and makes ANSI mode the default. It also enhances Spark Connect with support for Swift, Rust, and Go, adds a Python data source API, and reimagines streaming state with a new transformWithState API.
News: GPU Accelerated Spark Connect
This video demonstrates how to accelerate Spark Connect using GPUs for both Spark SQL and ML workloads. It details the architecture, deployment, and benchmark results showing significant speedups and cost savings compared to CPU-only execution.

