Apache Spark
Recent items mentioning Apache Spark across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Recent activity around Apache Spark highlights its continued integration and optimization within the Databricks ecosystem. Databricks now offers a decision framework for ETL migration, built around Spark Declarative Pipelines and notebooks [4], while Unity Catalog extends fine-grained access controls to external engines like Apache Spark [7]. Additionally, Spatial SQL is now Generally Available on Databricks, bringing native geospatial data types and Apache Spark 4.2 compatibility for geo columns [9].
Generated daily from the 10 most recent items mentioning Apache Spark. Click any [N] to jump to the source.
Tutorials: Mastering Joins In Apache Spark: Complete Deep Dive
The video provides a deep dive into four Apache Spark physical join strategies: Sort Merge Join, Broadcast Hash Join, Shuffle Hash Join, and Broadcast Nested Loop Join. For each join, it explains the conditions for Spark's selection, visualizes its step-by-step internal mechanics, and demonstrates its appearance in Spark's physical plan and UI.
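To make two of the strategies named above concrete, here is a minimal pure-Python sketch of the idea behind a broadcast hash join and a sort-merge join. This is not Spark code and is not taken from the video: the function names and row layout are hypothetical, and Spark's real operators work on partitioned, serialized data rather than lists of dicts.

```python
from collections import defaultdict


def broadcast_hash_join(small, large, key):
    """Build a hash table on the small side, then probe it with each row of the large side."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    # No sorting or shuffling of the large side is needed: one lookup per row.
    return [{**s, **l} for l in large for s in table.get(l[key], [])]


def sort_merge_join(left, right, key):
    """Sort both sides on the join key, then walk two cursors forward in step."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with the whole run of equal keys on the right.
            k = j
            while k < len(right) and right[k][key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out
```

Both functions return the same inner-join rows. The difference is the cost profile: the hash variant needs the small side to fit in memory, while the merge variant pays for sorting both inputs.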
StatusCode.UNIMPLEMENTED error: DatabricksConnect library using AKS/PySpark to calling Spark cluster
How does Databricks handle registration and discovery of custom PySpark data sources in SDPs?
A Decision Framework for ETL Migration to Databricks
Databricks ETL migration offers three paths (Lakehouse, Spark Declarative Pipelines, and notebooks) that address different scenarios and are often used in combination. A four-stage framework (assess, quick wins, modernize, optimize) and tools like Lakebridge and AI-assisted conversion enable incremental migration and automate mechanical translation.
PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON
python-v1.6.1: Column Mapping write support
This release adds support for writing Delta tables with column mapping enabled. It also introduces a new API for stats-free append writes and allows switching nanosecond timestamps at runtime in Python.
DataFlint on Databricks - the Open Source Spark UI Upgrade Apache Spark Has Needed for Years
News: Unity Catalog Fine-Grained Access Controls on External Engines
Unity Catalog enables fine-grained access controls (FGAC) defined once to be enforced consistently across Databricks and external engines like Apache Spark. External engines can also create and write to UC-managed tables, benefiting from centralized governance, automatic optimization, and transactional safety.
Delta Lake 4.3.0
Databricks practitioners can now integrate Spark with the Unity Catalog Delta REST API for managed Delta tables and selectively replace data using new `replaceOn` and `replaceUsing` DataFrame APIs. UniForm for Iceberg conversion is now atomic and incremental, and Delta Sharing supports streaming and Change Data Feed for shared tables.
UnityCatalog 0.5.0
This release introduces a new UC Delta API for managing Delta tables via REST, enabling various engines to use Unity Catalog as a Delta-native catalog. The UC Spark connector now has separate artifacts for Spark 4.0.x and 4.1.x compatibility, and its credential-scoped file system is enabled by default.
Events: Databricks News: CLI v 1.0.0, AI-tools, databricks Docker, DABs UI sync, mutators
The video demonstrates new Databricks features, including the GA release of CLI 1.0.0, UI sync for DABs, Python mutators for bundle extension, and new Docker image options for custom runtimes. It also covers serverless pipeline orchestration, enhanced autoscaling for Lakebase and apps, serverless interactive execution timeout, and auto-scoping for access tokens.
Geospatial Unbounded: Spatial SQL GA with AI/BI Maps, Delta Sharing, and Iceberg v3
Spatial SQL is now Generally Available on Databricks, bringing native geospatial data types, 90+ ST_* functions, and AI/BI Dashboards that render maps natively. This release also includes major performance improvements, open lakehouse support via Delta Sharing and Iceberg v3, and Apache Spark 4.2 compatibility for geo columns.
Apache Spark’s Real-Time Mode Use Case Deep Dive: Gaming Sessionization
Apache Spark Real-Time Mode for Gaming: A Better Way to Do Real-Time Sessionization
Apache Spark Real-Time Mode now enables real-time gaming sessionization for millions of active device sessions, replacing custom applications with sub-second precision for both input processing and timer-driven output. Learn how transformWithState timers power proactive, timer-driven heartbeats, generating output on a schedule independent of incoming data.
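The timer-driven pattern described above can be illustrated with a small pure-Python simulation. This is only a sketch of the concept, not the transformWithState API: the class name, method names, and status strings are hypothetical, and timestamps are plain numbers in seconds.

```python
class SessionTracker:
    """Keeps the last-seen time per device and produces output whenever a timer fires."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # device_id -> timestamp of the latest event

    def on_event(self, device_id, ts):
        # Input path: each incoming event refreshes the device's session.
        self.last_seen[device_id] = ts

    def on_timer(self, now):
        # Timer path: runs on a schedule, whether or not new events arrived.
        out = []
        for device_id, seen in sorted(self.last_seen.items()):
            if now - seen >= self.timeout_s:
                out.append((device_id, "session_closed"))
            else:
                out.append((device_id, "heartbeat"))
        # Drop state for sessions that were just closed.
        for device_id, status in out:
            if status == "session_closed":
                del self.last_seen[device_id]
        return out
```

The point of the sketch is that on_timer emits heartbeats and closes idle sessions even when on_event has not been called since the previous tick.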
Apache Spark Masterclass (In-Person, Bengaluru) | 6 June
Converting stored procedures to PySpark
Tutorials: The New Databricks Lakeflow Designer Is a Game Changer!
Databricks Lakeflow Designer is a visual data preparation tool that allows users to create, add, and transform data using a no-code drag-and-drop UI or AI-powered Genie Code. The video demonstrates how to import data from various sources, profile data, perform complex transformations like data type conversions and sentiment analysis, and then deploy the resulting production-ready PySpark code for scheduling or integration into existing pipelines.
Handle case issue in column names
I am loading data using PySpark with spark_reader.load(data_path). However, in some cases the data can be very messy, with fields using different casing in each row (including inside nested structs). Here is an example of the data:

[
    {"field_1": "1", "Field_2": 1, "field_3": "b", "field_4": [{"A": 1, "b": 2}, {"A": 3, "b": 4}], },
    {"Field_1": "2", "Field_2": 2, "Field_3": "BB", "Field_4": [{"a": 1, "B": 2}, {"a": 3, "B": 4}], },
]

In this case, the load fails with the following error:

pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: `field_1`, `field_3`, `field_4`

I can't find a clean way to handle this case. I tried the following workaround:

raw = spark.read.text(data_path)
normalized_rdd = raw.rdd.mapPartitions(_normalize_partition)
raw_df = spark.read.json(normalized_rdd)

Here _normalize_partition is a Python function that normalizes the column names. However, this does not work in my case because I use Databricks serverless compute, where the use of .rdd is not allowed:

[NOT_IMPLEMENTED] Using custom code using PySpark RDDs is not allowed on serverless compute.
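One possible direction, sketched here in plain Python, is to normalize key casing on each raw JSON string before any schema is inferred. The helper names below are hypothetical, and the sketch assumes each record is available as one valid JSON string. How it would be wired into a serverless job (for example as a UDF over the text column, followed by parsing against an explicit schema) is an assumption that has not been verified here.

```python
import json


def normalize_keys(value):
    """Recursively lowercase every object key in a parsed JSON value."""
    if isinstance(value, dict):
        # If two keys differ only by case, the later one silently wins.
        return {k.lower(): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value


def normalize_json_line(line):
    """Parse one JSON record, lowercase its keys, and serialize it again."""
    return json.dumps(normalize_keys(json.loads(line)), sort_keys=True)
```

Because the function works on strings rather than on an RDD, it does not depend on the .rdd API that the error message rejects.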
Building a Spark Streaming Real-Time Mode (RTM) Pipeline — Millisecond Streaming with Kafka
I recently built a fully working real-time transaction enrichment pipeline using PySpark RTM paired with Kafka, achieving end-to-end latency in the milliseconds. The article covers:
- Real-Time Mode (RTM) fundamentals
- Kafka integration with Spark Structured Streaming
- Millisecond-latency pipeline architecture
- Real-time transaction enrichment patterns
Blog: https://blog.devgenius.io/building-a-spark-streaming-real-time-mode-rtm-pipeline-millisecond-streaming-with-kafka-dda74e9ef284
Community: How I Mastered System Design Interviews
This video teaches a six-step framework for mastering data engineering system design interviews, covering requirements gathering, pipeline design, data modeling, storage and file formats, data quality and observability, and pipeline resilience. It demonstrates how to apply this framework with practical examples and back-of-the-envelope calculations to justify design choices.
Events: Databricks News: Lakeflow Designer, UV package manager, DABs templates, Genie scheduled tasks
Databricks introduces Lakeflow Designer for visual data preparation, though its generated code is messy; a workaround uses Genie to convert the visual workflow into clean PySpark/SQL notebooks. The UV package manager significantly speeds up package installations on Databricks serverless runtimes, and DABs templates allow for standardized, customizable Databricks Asset Bundles.
Expanded interoperability with Unity Catalog Open APIs
Unity Catalog Open APIs now offer expanded interoperability, with external access to UC managed Delta tables in Beta and credential vending generally available with M2M OAuth support. External engines like Apache Spark, Flink, and DuckDB can now create, read, and write to UC managed Delta tables, leveraging Delta Lake's new catalog commits feature for safe concurrent writes and auditability.
Tutorials: How to use Meta Conversions API on Databricks to activate first-party data
The Databricks Meta Conversions API app enables users to send conversion events from the Databricks Lakehouse directly to Meta Ads Manager. It provides a guided setup to connect Databricks to Meta using a pixel ID and access token, allowing for quick testing with sample data, deploying customizable notebooks, or setting up automated jobs for continuous data flow.
News: Databricks News: watermark-based incremental ingestion, MCP in AI gateway, Genie, Vector Search
Databricks now offers watermark-based incremental ingestion from SQL databases without change data feed, allowing for efficient data updates and soft deletion handling. The AI Gateway supports custom MCP servers, enabling integration with external APIs like GitHub for enhanced AI application development.
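The general idea of watermark-based incremental ingestion can be sketched in a few lines of plain Python. This illustrates the technique only, not the Databricks feature: the function name, the `updated_at` cursor column, and the `is_deleted` flag used in the usage note are all hypothetical.

```python
def incremental_pull(source_rows, last_watermark, cursor="updated_at"):
    """Return the rows changed since the previous run, plus the new watermark."""
    # Only rows whose cursor value is strictly newer than the stored watermark are read.
    changed = [r for r in source_rows if r[cursor] > last_watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r[cursor] for r in changed), default=last_watermark)
    return changed, new_watermark
```

A soft delete fits this model when the source marks a row as deleted (for example `is_deleted = True`) and bumps its cursor column, so the deletion arrives as an ordinary changed row.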
Tutorials: Apache Spark Streaming Real-Time Mode - Latency Demo
The video demonstrates how to deploy and run Apache Spark Streaming in Real-Time Mode (RTM) using a Declarative Automation Bundle. It shows that RTM significantly reduces P50 and P95 latencies compared to micro-batch mode, achieving 26ms and 50ms respectively in a simplified setup without an external messaging bus.
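For readers unfamiliar with the P50 and P95 figures, this is one way to compute them from a list of per-record latencies using only the Python standard library. The function name is hypothetical, and any input samples are illustrative rather than measurements from the video.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return (P50, P95) for a sequence of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

P50 is the median latency, while P95 is the value that 95% of records stay at or below, which makes it more sensitive to occasional slow records.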
Tutorials: Air Traffic Control with Apache Spark Structured Streaming Real-Time Mode
The video demonstrates building a real-time air traffic control application using Apache Spark Structured Streaming Real-Time Mode, Lakehouse, and Databricks Apps. This system processes live flight telemetry, detects congestion, and generates alerts with sub-second end-to-end latency, all within a single Databricks platform.
UnityCatalog 0.4.1
The Unity Catalog Spark connector now supports atomic REPLACE TABLE AS SELECT and Dynamic Partition Overwrite for managed Delta tables, and a new credential-scoped file system to prevent OOM errors in long-running sessions. This release also adds support for the VARIANT data type and fixes a critical security vulnerability (CVE-2026-27478) that allowed user impersonation, requiring new server configuration for existing deployments with authorization enabled.
Delta Lake 4.2.0
Databricks practitioners gain enhanced Unity Catalog support with new REPLACE TABLE/RTAS and Dynamic Partition Overwrite capabilities, alongside improved streaming reads for catalog-managed tables including `startingTimestamp` and `skipChangeCommits` options. This release also introduces general availability for Variant columns and support for Geospatial and Collations table features, while fixing several bugs related to data skipping, DML operations, and decimal predicates.
News: Databricks News: AUTO CDC, Workspace skills, Ask Genie, and Type widening
Databricks introduces Auto CDC for efficient change data feed processing, notebook and govern tags for better organization, and workspace skills for Ask Genie to customize its responses. Databricks also adds type widening for streaming tables, allowing data types to automatically adjust to larger incoming values.
Tutorials: 54 Zerobus Ingest Lakeflow Standard Connector | Ingest Streaming data directly into Delta Table
The video demonstrates how to use Databricks Zerobus Ingest, a push-based API, to stream data such as IoT, event, and telemetry data directly into Unity Catalog Delta tables. It highlights how Zerobus Ingest simplifies streaming ingestion by removing the need for intermediate message buses and the infrastructure required to run them.
News: Databricks News: Excel add-in, Metrics Views UI, and Quality Monitoring
Databricks announced Lakewatch for cybersecurity, new dynamic dropdown filters in the SQL editor, and improved quality monitoring with null value scanning and automated alerts. The video also demonstrates a new UI for defining metric views, an Excel add-in for data preview and import, and the ability to publish dashboards as public web pages.
Releases: Introducing Pantheon - Agentic Engineering At Scale
Pantheon is a Databricks application that uses a multi-agent system to generate Lakeflow pipelines for data engineering, allowing users to define data ingestion and transformation rules through a conversational interface. It automates the design, validation, and code generation for lakehouse pipelines, enabling citizen engineers to build robust data solutions without deep PySpark knowledge.
News: Databricks News: Free Tier, Multi-statement transactions, Declarative Automation Bundles, Genie Code
Databricks now offers a free tier for Lakeflow Connect, providing 100 DBUs per day per workspace, and has introduced multi-statement transactions in Unity Catalog that ensure atomicity with rollback capabilities. The platform also announced a Databricks One mobile app, a new AI runtime with pre-installed tools for GPU use cases, and enhanced Genie Code that understands project structure for automated development tasks. Additionally, Databricks Asset Bundles are now called Declarative Automation Bundles and use a faster direct engine, and a new 5X-Large SQL warehouse is available for processing terabytes of data.
Tutorials: 53 Lakeflow Connect SQL Server Managed Connector | Ingest Data using Databricks native connectors
The video demonstrates how to ingest data from SQL Server into Databricks using Lakeflow Connect's managed connector, covering the setup of a SQL Server database, user permissions, and enabling change tracking/change data capture (CT/CDC). It then walks through configuring the Databricks connection, creating gateway and ingestion pipelines, and showcasing how SCD Type 2 changes are automatically managed.
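To show what "SCD Type 2" means in practice, here is a minimal pure-Python sketch of how one change is applied to a history table. It is not the Lakeflow Connect implementation: the function name and the `start_at` / `end_at` column names are hypothetical, and timestamps are plain numbers.

```python
def apply_scd2(history, change, key, ts):
    """Close the current version for the changed key (if any) and append the new version."""
    updated = []
    for row in history:
        if row[key] == change[key] and row["end_at"] is None:
            # The previously current version stops being valid at the change time.
            row = {**row, "end_at": ts}
        updated.append(row)
    # The new version is current, which is marked by an open-ended end_at.
    updated.append({**change, "start_at": ts, "end_at": None})
    return updated
```

Older versions are never overwritten, so the table can answer both "what is the value now" (rows with no end_at) and "what was the value at time T".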
News: Databricks News: unit testing, OneLake federation, scoped access tokens
Databricks now allows creating Unity Catalog domains for business users, running JAR tasks on serverless compute, and federating OneLake data directly into Databricks. The platform also introduces in-workspace Python unit testing, new data connectors like HubSpot and TikTok Ads, and scoped personal access tokens for enhanced security.
News: Databricks News: Catalog and External locations in DABS, Schema Evolution, File Events, Queries Tags
Databricks Runtime 18.1 introduces schema evolution for inserts, managed file events for Auto Loader, and a simplified `TABLE` syntax for querying. The video also demonstrates new features like the AI Gateway for LLM governance, query tags for tracking, and the GA release of the supervisor agent.
Delta Lake 4.1.0
Delta Lake 4.1.0 introduces enhanced support for Unity Catalog managed tables, including batch/streaming read/write and conflict-free feature enablement for Deletion Vectors and Column Mapping. It also requires Java 17 and Spark 4.0.1+, dropping support for Spark 3.5.
Tutorials: Databricks End-To-End Project | Zero-To-Expert | Streaming, AI, Lakeflow, Unity Catalog, AI/BI
This video demonstrates building an end-to-end restaurant analytics platform on Databricks, covering streaming and batch data ingestion, AI-powered sentiment analysis, and dashboard creation. It teaches how to use Unity Catalog, Lakeflow Connect for CDC, and Spark Declarative Pipelines for real-time data from Event Hub, and how to construct a medallion architecture with fact and dimension tables.
News: Databricks Breaking News: 2026 Week 6: 2 February 2026 to 8 February 2026
Databricks introduces agentic data quality monitoring with anomaly detection, LLM judge UI builder for MLflow, and new SQL warehouse features including a default option and activity details. The platform also enhances its assistant to connect with MCP servers, improves Google Sheets integration with pivot table functionality, and adds direct Git deployment and tagging for Databricks apps.
News: Databricks Breaking News: 2026 Week 5: 26 January 2026 to 1 February 2026
Databricks now allows triggering materialized views or streaming tables on update, automatically detecting source changes and refreshing the pipeline. MLflow traces can now be stored in Unity Catalog using OpenTelemetry, providing a centralized logging system for experiment data.
News: Databricks Breaking News: 2026 Week 4: 19 January 2026 to 25 January 2026
Databricks introduces temporary tables that are Unity Catalog managed, materialized, and allow DML operations, automatically cleaning up after a session or seven days. Materialized views now support refresh policies like incremental strict, which verifies if a view can be incrementally refreshed before deployment.
News: Databricks Breaking News: 2026 Week 3: 12 January 2026 to 18 January 2026
Databricks Runtime 18 is now Generally Available, offering Spark 4.1 and broader availability of identifiers and parameter markers. New features include row filtering during ingestion in Lakeflow Connect, Codex models (GPT Codex Max and Mini) for code development, and Databricks One improvements like favorites and data preview in Genie rooms.
Delta Lake 4.0.1
The "managed table" feature is renamed to `catalogManaged` (breaking change for `catalogOwned-preview` users) and Unity Catalog OAuth authentication is now supported. This release also fixes a `NoSuchMethodError` when running `REORG TABLE … APPLY (PURGE)` with Spark 4.0.1 and enables creating UC-managed Delta tables.
News: Databricks Breaking News: Week 2026 02: 5 January 2026 to 11 January 2026 #databricks news
Databricks now allows changing catalog and schema during dashboard deployments, addressing a previous issue with environment-specific configurations. The Databricks CLI has a breaking change with plan version 2, altering the structure of deployment plans.
This release adds support for multiple constraints at once, generates Symlink Manifests for external engines, and introduces GCS auto-registration. It also includes fixes for schema evolution in merge operations, improved error reporting, and enhanced handling of empty tables.
News: Databricks Breaking News: Week 2026 01: 29 December 2025 to 4 January 2026 #databricks news
Databricks now supports deploying asset bundles from a generated plan, enabling CI/CD integration for review and approval. Unity Catalog introduces new secret grants, and Runtime 18 brings "everywhere" implementations for literal string coalescing, parameter markers, and identifiers, along with window functions in metric views and general availability for SQL scripting.
This release introduces several API changes and integrates `delta_kernel` for improved stats parsing performance. It also fixes issues with schema evolution during merge operations and null handling in scalar extraction.
Releases: Databricks Breaking News: Week 52: 22 December 2025 to 28 December 2025 #databricks news
Databricks introduces a direct mode for asset bundles, offering faster deployments without Terraform, and the Databricks Assistant agent mode is now in public preview, capable of multi-step notebook editing and data analysis. Other updates include single-use refresh tokens for enhanced security, partition columns now included in Parquet files for improved compatibility, and new dashboard features like custom labels, flexible sorting, and Microsoft Teams integration for scheduled reports.
News: Databricks Breaking News: Week 51: 15 December 2025 to 21 December 2025 #databricks news
Databricks introduces new Lakeflow Connect features, including custom logic for declarative pipelines and new connectors for incremental data import from sources like Confluence, PostgreSQL, and MySQL. The platform also announces the deprecation of legacy features like Hive Metastore and DBFS for new accounts, alongside updates to Lakehouse ACLs, job scheduling from notebooks, flexible node types for cluster deployment, and expanded resource assignment in Databricks apps.
News: Databricks Breaking News: Week 50: 8 December 2025 to 14 December 2025 #databricks news
Databricks now supports native reading and writing of Excel files in PySpark, SQL, and Auto Loader, including features like sheet listing and range targeting. Additionally, Databricks Runtime 18 is available in beta, introducing improvements for streaming queries and new system columns for job tables, alongside a new Lakebase experience with project and branching capabilities for transactional databases.
Tutorials: 52 Lakeflow Spark Declarative Pipelines | New Pipeline Code Editor | AUTO CDC |External Target Sinks
Databricks' Lakeflow Spark Declarative Pipelines (SDP), formerly Delta Live Tables (DLT), offers a unified solution for data ingestion, transformation, and orchestration, now open-sourced with Apache Spark 4.1. The video demonstrates using the new pipeline code editor to build SDPs in Python and SQL, showcasing features like AUTO CDC (formerly APPLY CHANGES) and external target sinks.
Tutorials: 34 Write PySpark Unit Test Cases using PyTest module | Setup PyTest with PySpark
The video demonstrates how to write PySpark unit test cases using the pytest module. It covers setting up pytest, creating fixtures for Spark sessions, and writing test functions to validate PySpark transformations and filters.
News: Why YouTube NOT Udemy? #dataengineering #easewithdata #pyspark #databricks
The creator explains they offer free data engineering content on YouTube because they struggled to find good, affordable learning resources when they were starting out. They aim to provide high-quality, demo-rich content for free to prevent others from facing similar difficulties with paid, low-quality courses.
Tutorials: 33 What is Spark Connect? | Spark Connect vs Spark Session | Setup Spark Connect Server with Cluster
Spark Connect decouples the client and server, allowing remote connection to Spark clusters using DataFrame APIs from various IDEs and languages, unlike Spark Session which tightly couples them and supports low-level RDD APIs. The video demonstrates setting up a Spark 3.5 cluster, starting a Spark Connect server, and running PySpark DataFrame operations remotely from VS Code.
Community: Apache Spark Was Hard Until I Learned These 30 Concepts!
The video explains 30 key Apache Spark concepts, starting with a comparison to MapReduce to highlight Spark's in-memory processing and DAG-based execution model. It then details Spark's cluster architecture, job execution flow (driver, executors, tasks), and memory management within executor containers.
Tutorials: 04_2 - Setup PySpark in Local Machine with Jupyter Lab | PySpark Local Machine Setup
The video demonstrates setting up PySpark with Jupyter Lab on a local machine using Docker, first as a standalone instance and then as a multi-node cluster. It walks through installing Docker Desktop, pulling a PySpark Jupyter Lab image from Docker Hub, configuring ports, and verifying the setup by running a basic PySpark job.
News: Databricks: What’s new in October 2025 #databricks news
Databricks introduces Databricks One, a new business-focused experience with consumer access for dashboards and Genie, alongside updates to Genie for defining relations and extended API endpoints. The platform also adds features like easy conversion of external to managed tables, enhanced Databricks Asset Bundles with policy integration and script execution, and new system tables for MLflow tracking and data classification results.
Tutorials: Databricks + Cursor IDE: Step-by-Step AI Coding Tutorial
The video demonstrates using Cursor IDE for AI-enhanced Databricks development, focusing on setting up Databricks Connect and leveraging Cursor rules and context for efficient code generation and testing. It shows how to structure projects, write Python and PySpark code, and create unit tests, highlighting the importance of providing clear instructions to the AI agent.
News: Databricks: What’s new in September 2025? #databricks
Databricks now supports geospatial data types (geography and geometry) with new functions for visualization and spatial operations, and introduces serverless GPU clusters for distributed GPU code execution. The platform also offers enhanced notebook features like side-by-side editing and a notebook-specific search, along with new options for managing serverless environments, SQL warehouses, and access requests in Unity Catalog.
Tutorials: Delta Lake Masterclass | Azure Databricks | PySpark | From Zero-To-Expert
This video provides a comprehensive masterclass on Delta Lake using Azure Databricks and PySpark, covering its core concepts, internal workings, and practical applications. It demonstrates how Delta Lake solves data lake problems like lack of ACID support, DML operations, and schema enforcement, and teaches features like time travel, concurrency control, and optimization techniques.