brickster.ai

Data Quality

Recent items mentioning Data Quality across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

56 recent items · 4 news · 46 videos · 6 community threads

What's happening in Data Quality
AI synthesis · updated 4h ago

Data quality is increasingly recognized as foundational for AI strategy, with a recent blog post declaring it "the AI strategy" itself, emphasizing fixes to transactional systems and unified data for value creation [3]. Practitioners are actively seeking guidance on integrating data quality tests [6] and building data quality monitors on Databricks Apps [5]. Furthermore, data quality and observability are highlighted as critical components in mastering data engineering system design interviews [1].

Generated daily from the 7 most recent items mentioning Data Quality. Click any [N] to jump to the source.

Reddit · General

The next generation of Databricks Genie just launched. Here is what data engineers actually need to know.

I have been following Genie since it first launched with AI/BI last year. Back then, I honestly thought it was mostly for business users. A chatbot on top of your data that could answer basic questions in plain English. Useful, but not something I thought data engineers really needed to care much about.

After seeing the new 2026 version, I completely changed my mind. Genie is no longer just a business chatbot. The biggest change is Genie Code, which is basically an AI agent designed for data professionals. It can generate pipelines, debug failures, create dashboards, monitor systems, and work directly with Lakeflow and Unity Catalog. That part caught my attention immediately because it moves beyond simple Q&A and starts touching actual engineering workflows.

What surprised me most is how connected the whole system has become. It can pull context from dashboards, Genie Spaces, apps, metadata, documentation, and external systems like GitHub, Jira, and Confluence through MCP. Instead of only searching tables, it tries to understand relationships across the environment. That feels very different from the first version.

The operational side is also interesting. Genie Code can monitor pipelines, investigate failures, help with DBR upgrades, and respond to issues before teams even notice them. The more I read about it, the more it felt less like a chatbot and more like an assistant sitting beside the engineering team.

But honestly, the biggest takeaway for me is not the AI itself. It is what this means for data engineers. A lot of people immediately jump to “AI will replace data engineers,” but I think the opposite is happening. These systems are only as good as the data foundation underneath them. If metadata is incomplete, if tables are messy, if naming conventions are inconsistent, or if documentation is missing, the AI layer will give poor answers confidently. That means clean data modeling, governance, metadata, documentation, and data quality are becoming even more important than before. The engineers building those foundations become more valuable, not less. I think the role is slowly shifting away from spending hours writing repetitive boilerplate transformations and more toward building trustworthy, AI-ready data systems.

One thing I keep noticing while learning Databricks through BricksNotes and the wider community is that the platform is moving very quickly toward AI-native data engineering. Features like Unity Catalog, Lakeflow, and now Genie all connect together. It feels like understanding metadata and governance is becoming just as important as understanding Spark itself.

Also interesting that Genie now has a full mobile experience on iOS and Android. Business users can access dashboards, apps, and chat directly from their phones, which means the underlying data quality matters even more because people are going to depend on these systems everywhere, not only during work hours.

Curious if anyone here is already using Genie or Genie Code in production. I would genuinely like to hear how the answer quality has been and whether your teams are changing how they approach metadata and documentation because of it.

5719InevitableClassic2611w ago
Databricks Community · Technical Blog

Three MCPs, One Answer: Building a Data Quality Monitor on Databricks Apps

001w ago
Reddit · Help

Tips for integrating data quality tests?

I've been brought on as a data engineering consultant for a small to mid-sized company that has a poorly built architecture in Databricks. There's currently no documentation or clear architecture, so I've been spending weeks trying to untangle everything. They now want me to start implementing data quality checks because, as of now, there's no testing within the process at all and they're unsure if their outputs are even correct. Currently the data they want me to test are just raw files uploaded into Databricks tables on an irregular schedule, all with different granularity and logic that will require more complex checks than just null checks and unique primary keys. What is the best starting point for this? They have jobs, and jobs that run jobs, but no pipelines established, and I don't think I have the power to change that yet, so I think that takes DLT off the table unless I can prove it's worth the refactor. My first thought was integrating PySpark testing scripts to run within the jobs, but there has to be a more sophisticated way to do this?
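One possible starting point that needs no pipeline refactor is to keep each check as a named SQL predicate and count all failures in a single pass from inside the existing job. The sketch below is plain Python that only builds the query text; the `Rule` class, the helper names, the table `raw.orders`, and the three rules are hypothetical, and running the query assumes a `spark` session and the Spark SQL `count_if` aggregate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A named check; `predicate` is a SQL boolean that is TRUE for a bad row."""
    name: str
    predicate: str


def build_check_query(table: str, rules: list[Rule]) -> str:
    """Build one SELECT that returns the row count plus a failure count per rule."""
    if not rules:
        raise ValueError("at least one rule is required")
    columns = ["count(*) AS total_rows"]
    for rule in rules:
        columns.append(f"count_if({rule.predicate}) AS {rule.name}")
    return f"SELECT {', '.join(columns)} FROM {table}"


def failed_rules(result_row: dict[str, int]) -> list[str]:
    """Return the names of rules that matched at least one row."""
    return [
        name for name, failures in result_row.items()
        if name != "total_rows" and failures > 0
    ]


# Hypothetical table and rules, for illustration only.
RULES = [
    Rule("null_order_id", "order_id IS NULL"),
    Rule("negative_amount", "amount < 0"),
    Rule("ship_before_order", "shipped_at < ordered_at"),
]
query = build_check_query("raw.orders", RULES)

# Inside a job task (assumption: a `spark` session is available):
#   row = spark.sql(query).first().asDict()
#   if failed_rules(row):
#       raise RuntimeError(f"data quality failures: {failed_rules(row)}")
```

Because rules are data instead of code, cross-column and table-specific logic can be added per table without touching the job structure, and the same rule list could later be carried over to pipeline expectations if a refactor is approved.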

611 · FiftyShadesOfBlack · 1w ago
Reddit · Discussion

The learning order that actually works for Databricks. I wasted 3 months before figuring this out.

I want to share something that I wish someone told me when I started learning Databricks because it would have saved me months of confusion.

When I first opened Databricks, I did what most people do. I went straight to PySpark because every tutorial said that is what data engineers use. I spent weeks trying to understand RDDs, DataFrames, transformations, actions, lazy evaluation, and the DAG all at once. I could follow along with the instructor but the moment I opened a blank notebook I had no idea where to start.

Then I took a step back and tried something different. I started with SQL. Databricks runs SQL natively. I already knew SQL from a previous job. Within an hour I was querying tables, running aggregations, building views. I felt productive for the first time in weeks. That confidence changed everything.

Here is the order that worked for me and I genuinely believe it works for most people.

Start with SQL on existing tables. Databricks has sample datasets built in. Run SELECT statements. Do GROUP BY. Write JOINs. Get comfortable navigating data. If you already know SQL from any database this stage takes a few days not weeks.

Then learn Delta Lake through SQL. Create tables. Insert data. Update rows. Delete rows. Run DESCRIBE HISTORY and see the transaction log. Run SELECT VERSION AS OF and experience time travel. This is where Databricks starts to feel different from other databases. Every table you create is automatically a Delta table so you get versioning, schema enforcement, and ACID transactions without configuring anything.

Then move to PySpark DataFrames. Now that you understand what the data looks like and how Delta tables work, PySpark makes way more sense. You understand what df.filter does because you already did WHERE in SQL. You understand what df.groupBy does because you already did GROUP BY. Lazy evaluation clicks faster because you have context for what the transformations are actually doing.

Then build pipelines. Take what you learned and chain it together. Read from a source. Transform. Write to a Delta table. Schedule it. Monitor it. This is where Lakeflow (the new name for Delta Live Tables) comes in. But it makes no sense if you skip the previous steps.

Then governance. Unity Catalog, permissions, data quality expectations. This feels like admin work when you learn it in isolation but once you have built a pipeline you understand exactly why it matters.

The mistake I made was trying to learn PySpark before I understood the data model. I was writing code without knowing what it produced. Once I started with SQL and built up from there everything fell into place faster.

One more thing. If you are on Free Edition you do not need to configure clusters. It is serverless. If a tutorial tells you to create a cluster and choose a runtime version that tutorial was written for Community Edition which no longer exists. Just open a notebook and start writing code.

Hope this helps someone who is feeling overwhelmed right now. Happy to answer any questions in the comments.
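The point about lazy evaluation can be made concrete with a toy model. This is plain Python, not PySpark: `LazyFrame`, `group_count`, and the sample `trips` rows are invented for illustration. It only mimics the idea that transformations (the WHERE and GROUP BY equivalents) record a plan, and nothing runs until an action such as `collect` is called.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations only record steps."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (name, function) steps, not yet executed

    def filter(self, predicate):
        # Comparable to df.filter(...) or SQL WHERE: returns a new frame, runs nothing.
        step = ("filter", lambda rows: [r for r in rows if predicate(r)])
        return LazyFrame(self._rows, self._plan + (step,))

    def group_count(self, key):
        # Comparable to df.groupBy(key).count() or SQL GROUP BY with COUNT(*).
        def run(rows):
            counts = {}
            for r in rows:
                counts[r[key]] = counts.get(r[key], 0) + 1
            return [{key: k, "count": n} for k, n in counts.items()]
        return LazyFrame(self._rows, self._plan + (("group_count", run),))

    def explain(self):
        # List the recorded steps without executing them.
        return [name for name, _ in self._plan]

    def collect(self):
        # The action: only now does the recorded plan execute, step by step.
        rows = self._rows
        for _, run in self._plan:
            rows = run(rows)
        return rows


# Hypothetical sample data.
trips = [
    {"city": "Lisbon", "fare": 12.0},
    {"city": "Porto", "fare": 7.5},
    {"city": "Lisbon", "fare": 3.0},
]
plan = LazyFrame(trips).filter(lambda r: r["fare"] > 5).group_count("city")
print(plan.explain())  # steps recorded so far, nothing computed yet
print(plan.collect())  # the plan runs here
```

Someone who has already written the equivalent SELECT ... WHERE ... GROUP BY knows what result to expect, which leaves only the deferred execution as the new idea to absorb.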

8518InevitableClassic2612w ago
Reddit · Help

Can't append results of streaming group-by aggregations

Hi, I'm relatively new to Databricks. I have a medallion architecture with the following components:

- cor_project (catalog)
  - bronze (schema)
    - raw_swell_metrics (table)
    - data (volume)
      - landing
      - checkpoints
        - raw_swell_metrics
  - silver (schema)
    - swell_metrics (table)
    - quarantine_swell_metrics (table)
    - data (volume)
      - checkpoints
        - swell_metrics
        - quarantine_swell_metrics
  - gold (schema)
    - wave_daily_summary (table)
    - data (volume)
      - checkpoints
        - wave_daily_summary

The flow is as follows: add file(s) to bronze.data.landing -> manually execute job -> read only new file(s) and add them to bronze.raw_swell_metrics -> read only new rows in bronze.raw_swell_metrics (transform and data quality) and add them to swell_metrics or quarantine_swell_metrics -> read only new rows in silver.swell_metrics (transform) and add them to gold.wave_daily_summary.

The data is uploaded every month with a new file. The data flows correctly from landing to silver.swell_metrics. It fails when I'm transforming it to gold.

Code:

    from pyspark.sql import functions as F

    # Stream new rows from the silver table.
    df_silver_swell_metrics = (
        spark.readStream
            .format("delta")
            .table("cor_project.silver.swell_metrics")
    )

    # Daily summary per coast: max/min wave height plus the readings taken at those extremes.
    df_silver_swell_metrics_transformed = (
        df_silver_swell_metrics
        .groupBy(
            F.date_trunc("day", "datetime").alias("day"),
            "coast_name"
        ).agg(
            F.max("wave_height_m").alias("max_wave_height_m"),
            F.expr("max_by(wave_period_s, wave_height_m)").alias("max_wave_period_s"),
            F.expr("max_by(wave_direction_deg, wave_height_m)").alias("max_wave_direction_deg"),
            F.expr("max_by(wind_speed_ms, wave_height_m)").alias("max_wave_wind_speed_ms"),
            F.expr("max_by(wind_direction_deg, wave_height_m)").alias("max_wave_wind_direction_deg"),
            F.min("wave_height_m").alias("min_wave_height_m"),
            F.expr("min_by(wave_period_s, wave_height_m)").alias("min_wave_period_s"),
            F.expr("min_by(wave_direction_deg, wave_height_m)").alias("min_wave_direction_deg"),
            F.expr("min_by(wind_speed_ms, wave_height_m)").alias("min_wave_wind_speed_ms"),
            F.expr("min_by(wind_direction_deg, wave_height_m)").alias("min_wave_wind_direction_deg"),
            F.avg("wave_height_m").alias("avg_wave_height_m")
        )
    )

    # Rename the grouping column and cast the metrics for the gold schema.
    df_gold_wave_daily_summary = (
        df_silver_swell_metrics_transformed
        .select(
            F.col("day").alias("date"),
            F.col("coast_name"),
            F.col("max_wave_height_m").cast("float"),
            F.col("max_wave_period_s").cast("float"),
            F.col("max_wave_direction_deg").cast("float"),
            F.col("max_wave_wind_speed_ms").cast("float"),
            F.col("max_wave_wind_direction_deg").cast("float"),
            F.col("min_wave_height_m").cast("float"),
            F.col("min_wave_period_s").cast("float"),
            F.col("min_wave_direction_deg").cast("float"),
            F.col("min_wave_wind_speed_ms").cast("float"),
            F.col("min_wave_wind_direction_deg").cast("float"),
            F.col("avg_wave_height_m").cast("float")
        )
    )

    # Default output mode is append, which is what triggers the error below.
    # NOTE: `ambiente` is not defined in this snippet (presumably the environment
    # name, set elsewhere in the notebook).
    (
        df_gold_wave_daily_summary.writeStream
            .format("delta")
            .trigger(availableNow=True)
            .option("checkpointLocation", f"/Volumes/cor_{ambiente}/gold/data/checkpoints/wave_daily_summary")
            .toTable("cor_project.gold.wave_daily_summary")
    )

This generates the following error:

    [STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION] Invalid streaming output mode: append. This output mode is not supported for streaming aggregations without watermark on streaming DataFrames/DataSets. SQLSTATE: 42KDE

I have tried including a watermark; it works, but it doesn't load the records for the last day. Any idea how to solve it? Thanks for any advice.
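The missing last day is consistent with how append mode is documented to behave for event-time windows. A window's aggregate is written only after the watermark (the latest event time seen minus the delay, set with `DataFrame.withWatermark` in PySpark) has moved past the end of that window, so the newest day stays in state until later data arrives. The sketch below is a plain-Python simulation of that rule, not Spark code. `simulate_append_mode`, the one-hour delay, and the sample timestamps are hypothetical, and it simplifies real Spark, which applies an updated watermark in the following micro-batch.

```python
from datetime import datetime, timedelta

ONE_DAY = timedelta(days=1)


def simulate_append_mode(batches, delay):
    """Simulate append-mode output for daily event-time windows.

    batches: list of micro-batches, each a list of event datetimes.
    delay: watermark delay threshold as a timedelta.
    Returns (emitted_days, pending_days) as sorted lists of day-start datetimes.
    """
    watermark = None       # undefined until some data has been seen
    max_event_time = None
    open_days = set()      # windows still held in state, not yet written
    emitted = []
    for batch in batches:
        for ts in batch:
            day = datetime(ts.year, ts.month, ts.day)
            if watermark is not None and day + ONE_DAY <= watermark:
                continue   # too late: that window was already finalized
            open_days.add(day)
            if max_event_time is None or ts > max_event_time:
                max_event_time = ts
        if max_event_time is not None:
            watermark = max_event_time - delay
        # A day is written only once the watermark has passed the end of that day.
        closed = sorted(d for d in open_days if d + ONE_DAY <= watermark)
        emitted.extend(closed)
        open_days.difference_update(closed)
    return emitted, sorted(open_days)


# Hypothetical monthly loads: the tail of January, then the first February rows.
january = [datetime(2024, 1, 29, 6), datetime(2024, 1, 30, 6), datetime(2024, 1, 31, 18)]
february = [datetime(2024, 2, 1, 6)]
delay = timedelta(hours=1)

emitted, pending = simulate_append_mode([january], delay)
print("emitted:", [d.date().isoformat() for d in emitted])
print("pending:", [d.date().isoformat() for d in pending])
```

With only the January batch, 29 and 30 January are emitted and 31 January stays pending; passing `[january, february]` releases 31 January because the February rows push the watermark past the end of that day. If the latest day has to land immediately, approaches commonly used instead include writing in update mode through `foreachBatch` with a Delta `MERGE`, or recomputing the gold table as a batch job. Neither has been checked against the setup described in this post.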

34 · guillermo_hre · 2w ago
Reddit · Help

How do you reframe data engineering for a CEO who thinks it's "data quality oversight"?

10golly10-3w ago