Apache Spark

Recent items mentioning Apache Spark across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

60 recent items · 9 releases · 4 news · 38 videos · 9 community threads

What's happening in Apache Spark
AI synthesis · updated 21h ago

Recent activity around Apache Spark highlights its continued integration and optimization within the Databricks ecosystem. Databricks now offers a decision framework for migrating ETL workloads to its platform that uses Spark Declarative Pipelines and notebooks [4], while Unity Catalog extends fine-grained access controls to external engines such as Apache Spark [7]. In addition, Spatial SQL is now Generally Available on Databricks, bringing native geospatial data types and Apache Spark 4.2 compatibility for geo columns [9].

Generated daily from the 10 most recent items mentioning Apache Spark. Click any [N] to jump to the source.

Databricks Community · Data Engineering

StatusCode.UNIMPLEMENTED error: DatabricksConnect library using AKS/PySpark to calling Spark cluster

00 · yesterday
Databricks Community · Data Engineering · answered

How does Databricks handle registration and discovery of custom PySpark data sources in SDPs?

00 · 3d ago
Databricks Community · Data Engineering · answered

PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

00 · 1w ago
Databricks Community · Community Articles

DataFlint on Databricks - the Open Source Spark UI Upgrade Apache Spark Has Needed for Years

00 · 1w ago
Databricks Community · Technical Blog

Apache Spark’s Real-Time Mode Use Case Deep Dive: Gaming Sessionization

00 · 4w ago
Databricks Community · Data Engineering

Apache Spark Masterclass (In-Person, Bengaluru) | 6 June

00 · 1mo ago
Databricks Community · Community Articles

Converting stored procedures to PySpark

00 · 1mo ago
Stack Overflow

Handle case issue in column names

I am loading data using PySpark with spark_reader.load(data_path). However, in some cases the data can be very messy, with field names whose letter case differs from one row to the next (including inside nested structs). Here is an example of the data:

    [
      {
        "field_1": "1",
        "Field_2": 1,
        "field_3": "b",
        "field_4": [{"A": 1, "b": 2}, {"A": 3, "b": 4}]
      },
      {
        "Field_1": "2",
        "Field_2": 2,
        "Field_3": "BB",
        "Field_4": [{"a": 1, "B": 2}, {"a": 3, "B": 4}]
      }
    ]

In this case, the load fails with the following error:

    pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: `field_1`, `field_3`, `field_4`

I can't find a clean way to handle this case. I tried the following workaround:

    raw = spark.read.text(data_path)
    normalized_rdd = raw.rdd.mapPartitions(_normalize_partition)
    raw_df = spark.read.json(normalized_rdd)

Here _normalize_partition is a Python function that normalizes the column names. However, this does not work in my case because I use Databricks serverless compute, where .rdd is not allowed:

    [NOT_IMPLEMENTED] Using custom code using PySpark RDDs is not allowed on serverless compute.
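The question does not show its `_normalize_partition` helper, so the following is only a minimal sketch of what the per-record normalization might look like, written in plain Python with no Spark dependency. The function names `normalize_keys` and `normalize_json_line` are hypothetical, and the sketch assumes each input string holds one complete JSON record and that lowercasing is the desired normalization.

```python
import json
from typing import Any


def normalize_keys(value: Any) -> Any:
    """Recursively lowercase every dict key in a parsed JSON value.

    Assumption: lowercasing is the wanted normalization. If one object holds
    keys that differ only by case (e.g. "a" and "A"), the later one wins.
    """
    if isinstance(value, dict):
        return {key.lower(): normalize_keys(child) for key, child in value.items()}
    if isinstance(value, list):
        return [normalize_keys(item) for item in value]
    # Scalars (str, int, float, bool, None) are returned unchanged.
    return value


def normalize_json_line(line: str) -> str:
    """Parse one JSON record, lowercase its keys, and serialize it again."""
    return json.dumps(normalize_keys(json.loads(line)), sort_keys=True)


if __name__ == "__main__":
    record = '{"Field_1": "2", "Field_4": [{"a": 1, "B": 2}]}'
    print(normalize_json_line(record))
    # prints {"field_1": "2", "field_4": [{"a": 1, "b": 2}]}
```

Because this logic works on one string at a time, it does not itself need the RDD API. One option that might fit serverless compute is to wrap `normalize_json_line` in a Python UDF applied to the `value` column produced by `spark.read.text`, then parse the result with `from_json` and an explicit schema; that wiring is an assumption and is not tested here.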

apache-spark · pyspark · databricks
00 · Nakeuh · 1mo ago
Reddit · Tutorial

Building a Spark Streaming Real-Time Mode (RTM) Pipeline — Millisecond Streaming with Kafka

I recently built a fully working real-time transaction enrichment pipeline using PySpark RTM paired with Kafka, achieving end-to-end latency in the milliseconds. The article covers:

- Real-Time Mode (RTM) fundamentals
- Kafka integration with Spark Structured Streaming
- Millisecond-latency pipeline architecture
- Real-time transaction enrichment patterns

Blog: https://blog.devgenius.io/building-a-spark-streaming-real-time-mode-rtm-pipeline-millisecond-streaming-with-kafka-dda74e9ef284

72 · databuff_16 · 1mo ago