brickster.ai

Apache Spark

Recent items mentioning Apache Spark across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

60 recent items · 6 releases · 1 news · 44 videos · 9 community threads

What's happening in Apache Spark
AI synthesis · updated 4h ago

Unity Catalog Open APIs now offer expanded interoperability, allowing external engines like Apache Spark, Flink, and DuckDB to create, read, and write to UC managed Delta tables [4][5]. Databricks also introduced Lakeflow Designer for visual data preparation, with a workaround using Genie to convert visual workflows into clean PySpark/SQL notebooks [3]. For data engineering system design, a six-step framework for mastering interviews covers pipeline design, data modeling, and storage, including file formats [1].

Generated daily from the 10 most recent items mentioning Apache Spark. Click any [N] to jump to the source.

Reddit · Help

Databricks Delivery Solutions Architect Interview Experience Needed (Hiring Manager Round)

Hi everyone,

I recently completed the screening round for a Senior Solutions Architect role at Databricks, and after reviewing my profile and discussion, the team mentioned that my experience aligns more closely with the Delivery Solutions Architect role. I’ve now been scheduled for the Hiring Manager round next week.

The recruiter shared that the interview will focus on areas like:

* AI/Data & AI architecture discussions
* Customer advisory / pre-sales scenarios
* Driving adoption and execution
* Stakeholder management
* Bias for action and ownership
* Team fit and real-world project experience

I have around 7.9 years of experience in Big Data Engineering, PySpark, SQL, Python, Databricks, and AI/ML-related work.

Has anyone recently interviewed for the Delivery Solutions Architect role at Databricks? Would really appreciate any insights on:

* What kind of questions were asked
* Depth of technical discussion expected
* Whether coding/system design is involved
* How much focus is on customer-facing/pre-sales scenarios
* Tips to prepare for the Hiring Manager round

Thanks in advance!

28Immediate_Music_7419yesterday
Reddit

Expanded interoperability with Unity Catalog Open APIs. External engines like Apache Spark, Flink, and DuckDB can now create, read, and write to UC managed Delta tables.

293Youssef_Mrini4d ago
Reddit · Help

Day to Day life of DATABRICKS ENGINEER

Which language should I focus on most as a data engineer with 4 years of experience: PySpark, Python, or SQL? If possible, could anyone share what their regular daily tasks look like as a Databricks engineer?

27BeingExpert801w ago
Reddit · Help

Tips for integrating data quality tests?

I've been brought on as a data engineering consultant for a small to mid-sized company that has a poorly built architecture in Databricks. There's currently no documentation or clear architecture, so I've been spending weeks trying to untangle everything. They now want me to start implementing data quality checks because as of now there's no testing within the process at all and they're unsure if their outputs are even correct.

Currently the data they want me to test are just raw files uploaded into Databricks tables on an irregular schedule, all with different granularity and logic that will require more complex checks than just null checks and unique primary keys. What is the best starting point for this? They have jobs and jobs that run jobs but no pipelines established, and I don't think I have the power to change that yet, so I think that takes DLT off the table unless I can prove it's worth the refactor. My first thought was integrating PySpark testing scripts to run within the jobs, but there has to be a more sophisticated way to do this?

611FiftyShadesOfBlack1w ago
Stack Overflow

How to group connected materials and aggregate stocks in PySpark/SQL on Databricks

How can I solve my current issue? I'll give you some sample raw data and expected data below. The process can use SQL and Python; I'm currently using a Databricks notebook.

As you can see in the data, I don't have a direct link between all of the materials. I have A1, B3; any material is fine as long as it's within the same group. But in the group column, if I already have group A1, it still needs to stay A1 even if I add another material to that group; I want to retain it. So my solution is to save it into ADLS. For the STOCKS column, I also want to aggregate the stocks within the same group.

Sample: https://i.sstatic.net/b2Q05lUr.png

Any help would be greatly appreciated. Thanks!
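The sample data for this question lives in an image, so the following is only a minimal sketch under assumed inputs: a list of material-to-material links, a per-material stock mapping, and an optional mapping of group labels assigned on earlier runs. All names and values here are hypothetical. It treats the problem as connected components (union-find), keeps an existing group label when a component already has one, and sums stocks per group.

```python
from collections import defaultdict


def group_materials(links, stocks, existing_groups=None):
    """Group materials joined by any chain of links and sum stocks per group.

    links: iterable of (material_a, material_b) pairs (assumed input shape).
    stocks: dict mapping material -> stock quantity.
    existing_groups: optional dict mapping material -> label kept from earlier runs.
    """
    existing_groups = existing_groups or {}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    for material in stocks:
        find(material)  # register materials that have no link at all

    members = defaultdict(list)
    for material in list(parent):
        members[find(material)].append(material)

    group_of, totals = {}, {}
    for component in members.values():
        # Retain a previously assigned label if any member has one;
        # otherwise fall back to the smallest material id in the component.
        kept = sorted(existing_groups[m] for m in component if m in existing_groups)
        label = kept[0] if kept else min(component)
        for material in component:
            group_of[material] = label
        totals[label] = sum(stocks.get(m, 0) for m in component)
    return group_of, totals
```

This runs on the driver, so it only suits link tables small enough to collect. For larger data the same connected-components idea would need a distributed implementation.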

python · pyspark · databricks · data-engineering
00Eleazzz2w ago
Reddit · Discussion

The learning order that actually works for Databricks. I wasted 3 months before figuring this out.

I want to share something that I wish someone told me when I started learning Databricks because it would have saved me months of confusion.

When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor but the moment I opened a blank notebook I had no idea where to start.

Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything.

Here is the order that worked for me and I genuinely believe it works for most people.

Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database this stage takes a few days, not weeks.

Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run SELECT VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table so you get versioning, schema enforcement, and ACID transactions without configuring anything.

Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing.

Then build pipelines. Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow Declarative Pipelines (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps.

Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation but once you have built a pipeline you understand exactly why it matters.

The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there everything fell into place faster.

One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version, that tutorial was written for Community Edition, which no longer exists. Just open a notebook and start writing code.

Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.
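As a rough illustration of the "Delta Lake through SQL" stage, the sketch below lists the statements in the order the post describes. The table name `demo_people` and the sample rows are hypothetical, and `run_steps` assumes a notebook where a SparkSession named `spark` already exists and where new tables default to Delta.

```python
def delta_basics_sql(table="demo_people"):
    """Return SQL statements, in order, for a first pass over Delta Lake features."""
    return [
        f"CREATE TABLE IF NOT EXISTS {table} (id INT, name STRING)",
        f"INSERT INTO {table} VALUES (1, 'Ada'), (2, 'Grace')",
        f"UPDATE {table} SET name = 'Ada L.' WHERE id = 1",
        f"DELETE FROM {table} WHERE id = 2",
        f"DESCRIBE HISTORY {table}",               # one row per table version
        f"SELECT * FROM {table} VERSION AS OF 1",  # time travel to an earlier version
    ]


def run_steps(spark, table="demo_people"):
    """Run each statement on an existing SparkSession and show its result."""
    for statement in delta_basics_sql(table):
        print(statement)
        spark.sql(statement).show(truncate=False)
```

In a notebook this would be called as `run_steps(spark)`. On a freshly created table, version 1 should be the state right after the INSERT, provided nothing else has written to the table in between.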

8518InevitableClassic2612w ago
Reddit · General

Marimo on Databricks

My workflow for a long time involved switching back and forth between VS Code and the browser/Databricks UI. I like to write my "production code" in normal Python, but notebooks are great for exploration, spikes, visualization, triage etc. I could write a small dissertation, but for various reasons I don't really like Jupyter, and Databricks notebooks have their own problems with commented magic commands etc.

This led me to check out [marimo](https://marimo.io/), and wow, these are so cool. Code that runs in normal Python, merges cleanly, has visualizations and widgets, the app runs locally and doesn't glitch out, and even the VS Code extension works nicely.

The problem was, the Databricks support wasn't great. It just felt a bit dated. It required a warehouse for SQL, doesn't seem to really support serverless, and there were just so many opportunities to plug Databricks into marimo.

This led me to create [marimo-databricks-connect](https://github.com/brookpatten/marimo-databricks-connect) ([pypi](https://pypi.org/project/marimo-databricks-connect/)).

I tried to plug "all the things" Databricks into the place where they go in marimo. I'm pretty happy with the result.

- Connect to Databricks using databricks-connect & Spark (not a SQL warehouse)
- Authenticate/configure Spark using the default databricks-connect process (env vars, .databrickscfg etc), no additional auth config
- Execution of both Python & SQL cells
- Autocomplete of catalog/schema/table/column names
- Browsing of catalogs/schemas/tables/columns in the marimo data sources view
- Browsing of external locations, volumes, DBFS, and workspace in the marimo storage browser
- Notebook widgets to monitor and control specific instances of Databricks capabilities (clusters, workflows, vector search, apps etc)
- Widgets to browse & explore Databricks capabilities (compute, workflows, Unity Catalog)
- Works in local marimo (marimo edit notebook.py) and in the VS Code extension
- Deploy as a Databricks app to provide an alternative web-based marimo UI

I'm working on adding serving endpoints as AI providers to the notebooks too.

In particular, what I like to use this for is creating "command center" notebooks for given processes that can include some normal PySpark/SQL code to query/triage, widgets to monitor/control various Databricks resources, visualizations to monitor DQ etc.

I just wanted to share and see what the community thinks. Would you use it? Contributions are welcome.

Throwaway account because I'm doxing myself via the GitHub repo.

2017yes_my_name_is_brook2w ago
Reddit · Help

Help reading data

I am working on a Python data project for which I need to read data from Parquet files stored in a volume as well as from Delta tables. Downstream I need the data in a pandas DataFrame. To read the Parquet files I have used pd.read_parquet(); this, however, is really slow compared to reading the file from my machine. With the Delta table, it is quick when read as a PySpark DataFrame, but the toPandas() operation is also slow. I realise I am probably doing it naively, so I wondered if someone had some advice.

Edit: Some additional info. The table and the Parquet files are about 7GB. The .toPandas() operation doesn't complete after an hour and read_parquet takes about 20 minutes.

49Soudain_Seul2w ago
Stack Overflow · answered

Splitting string into respective columns

I am trying to read log files and split the data into multiple columns using Databricks and Python. So far I have been able to split the string into an array, but I am not able to put the elements into columns (type, date, time, comment).

Input:

    content
    INFO 2025-10-10 08:01:23 Starting Spark application
    WARN 2025-10-10 08:02:01 Memory usage is high: 75%
    ERROR 2025-10-10 08:03:05 Task failed for partition 3

My attempt:

    from pyspark.sql.functions import split

    df_split = (
        df.select(split('content', r' ', 0).alias('row'))
    )
    df_split.select('row').display()

Output:

    row
    [INFO, 2025-10-10, 08:01:23, Starting, Spark, application]
    [WARN, 2025-10-10, 08:02:01, Memory, usage, is, high:, 75%]
    [ERROR, 2025-10-10, 08:03:05, Task, failed, for, partition, 3]

I need this to be displayed in the following format:

    type   date        time      comments
    INFO   2025-10-10  08:01:23  Starting Spark application
    WARN   2025-10-10  08:02:01  Memory usage is high: 75%
    ERROR  2025-10-10  08:03:05  Task failed for partition 3
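One possible approach, sketched here rather than taken from the accepted answer, is to cap the number of splits so that the free-text comment stays in one piece. `parse_log_line` shows the idea in plain Python, and `split_log_columns` applies it in PySpark through the `limit` argument of `split` (available from Spark 3.0). Both function names are hypothetical, and the DataFrame is assumed to have a single string column named `content`, as in the question.

```python
def parse_log_line(line):
    """Split one log line into type, date, time and the remaining comment text."""
    parts = line.split(" ", 3)          # at most 3 splits, so at most 4 parts
    parts += [None] * (4 - len(parts))  # pad short or malformed lines with None
    return dict(zip(("type", "date", "time", "comments"), parts))


def split_log_columns(df):
    """PySpark version of the same idea; pyspark is imported only when called."""
    from pyspark.sql import functions as F

    # limit=4 caps the array at four elements; the last one keeps the rest of the line.
    parts = F.split(F.col("content"), " ", 4)
    return df.select(
        parts.getItem(0).alias("type"),
        parts.getItem(1).alias("date"),
        parts.getItem(2).alias("time"),
        parts.getItem(3).alias("comments"),
    )
```

The attempt in the question passes a limit of 0, which splits on every space, so the comment text ends up spread across several array elements.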

string · pyspark · text · databricks · multiple-columns
02Mr.Singh2w ago