
Data Quality

Recent items mentioning Data Quality across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

49 recent items · 8 news · 40 videos · 1 community thread

What's happening in Data Quality

AI synthesis · updated 9d ago

Recent Databricks blogs highlight data quality as a critical factor for successful AI initiatives, emphasizing that poor data quality is a common pitfall to avoid when tying AI investments to business outcomes [1]. A modern data governance architecture, combining automated lineage and RBAC, is key to ensuring data quality and regulatory compliance at scale [3]. These strategies underscore the importance of clean, governed data and strategic platform consolidation for driving company growth with AI [1].

Generated daily from the 5 most recent items mentioning Data Quality. Click any [N] to jump to the source.

Data + AI Foundations

Enterprise Data Strategy Roadmap for Business Outcomes

* A robust enterprise data strategy connects organizational data assets to specific business objectives through governance, architecture, and analytics frameworks that scale with evolving business needs.
* Effective data governance, data quality management, and master data manage…

Databricks Staff · 3w ago

Data + AI Foundations

Data Governance Architecture: A Complete Blueprint for Modern Organizations

This blueprint details a complete data governance architecture, outlining the policies, roles, and technologies needed to manage data assets. It emphasizes a modern strategy combining automated lineage, RBAC, and federated models to ensure data quality and regulatory compliance at scale.

Databricks Staff · 1mo ago

Reddit · Help

Pipelines - how are you handling significant schema changes?

Hello,

Right now with pipelines you can set things up to gracefully handle column additions and safe type conversions (a more specific type widened to a more general one). For column removals or downcasting, even the most liberal schema evolution setting will throw a failure and require the user to manually triage by doing a full reload.

I get why this is being done and I agree with the overall philosophy. We should lean on planning and communication, because significant schema changes mean there is likely an impact downstream and/or to data quality/meaning.

But that means we have breaking changes that need to be triaged. In a scenario where we have separate Ops teams, potentially operating in a different country, we would need to do "staged" failure and reprocessing of the various pipelines. Example: a major change in pipeline A means it breaks and requires a full refresh. But A feeds B, which feeds C, which feeds D. In some situations, that means staged triaging of those expected failures (fix A, then fix B, then fix C, then fix D), which means labor and perceived extended downtime.

- How is everyone managing those kinds of situations?
- Is there any possibility of ever setting the pipeline objects to allow for any form of schema change, falling back to a full load as a "last resort" so things auto-heal? (I get why this is not a great option, but at some point this feels like what we'd do manually anyway.)

Thank you!

21lofat · 1mo ago
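
The "last resort" fallback asked about in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: `refresh_chain`, `BreakingSchemaChange`, and the two callables passed in are hypothetical names, not Databricks APIs, and whatever actually triggers an incremental update or a full refresh is left to the caller.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class BreakingSchemaChange(Exception):
    """Hypothetical stand-in for the failure a pipeline raises on a
    column removal or downcast that schema evolution cannot absorb."""


def refresh_chain(upstreams, incremental, full_refresh):
    """Refresh pipelines in dependency order, upstream first.

    upstreams:    dict mapping a pipeline name to the set of pipelines it reads from
    incremental:  caller-supplied function(name) that runs a normal update
    full_refresh: caller-supplied function(name) that runs a full reload
    Returns a dict recording which mode each pipeline ended up using.
    """
    modes = {}
    # static_order() yields every pipeline after all of its upstreams.
    for name in TopologicalSorter(upstreams).static_order():
        try:
            incremental(name)
            modes[name] = "incremental"
        except BreakingSchemaChange:
            # Last resort: reload this pipeline before moving downstream.
            full_refresh(name)
            modes[name] = "full"
    return modes
```

For the chain in the post, `upstreams = {"B": {"A"}, "C": {"B"}, "D": {"C"}}` visits A, then B, then C, then D, so each downstream pipeline is only attempted after its source has been updated or reloaded. Because a full reload discards incremental state, a team might reasonably prefer to alert and wait for approval inside the `except` branch instead of reloading automatically.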

Platform

Governing AI agents at scale with Unity Catalog

Unity Catalog now governs AI agents at scale, providing a unified layer for identity-aware access, runtime policies, and full auditability across all agent interactions. This extends data governance to AI systems, improving observability, compliance, and trust for models, servers, and data within the lakehouse.

David Nasi · 1mo ago

Community

How I Mastered System Design Interviews

This video teaches a six-step framework for mastering data engineering system design interviews, covering requirements gathering, pipeline design, data modeling, storage and file formats, data quality and observability, and pipeline resilience. It demonstrates how to apply this framework with practical examples and back-of-the-envelope calculations to justify design choices.

Data Quality

Top 10 AI Business Solutions Driving Company Growth

Enterprise Data Strategy Roadmap for Business Outcomes

Data Governance Architecture: A Complete Blueprint for Modern Organizations

Pipelines - how are you handling significant schema changes?

Governing AI agents at scale with Unity Catalog

How I Mastered System Design Interviews

Data + AI Executive Series: Fast 5 — Scaling Real-Time Ops with Databricks at Aer Lingus

Data quality is the AI strategy

Effective strategies to enhance data quality management

How data transformation improves data quality and analysis

Effective strategies to improve data quality across your organization

OpenClaw, Databricks Agentic Data Monitoring & more! | AI Newsround - February 2026 | Advancing AI

Databricks Breaking News: 2026 Week 6: 2 February 2026 to 8 February 2026

Lakeflow Connect: The Game-Changer for Complex Event-Driven Architectures

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

The Upcoming Apache Spark™ 4.1: The Next Chapter in Unified Analytics

Lakeflow in Production: CI/CD, Testing and Monitoring at Scale

Lakeflow Observability: From UI Monitoring to Deep Analytics

Automating Engineering with AI - LLMs in Metadata Driven Frameworks

Lakeflow Connect: Smarter, Simpler File Ingestion With the Next Generation of Auto Loader

125. Databricks | Pyspark| Delta Live Table: Data Quality Check - Expect

Cross-Platform Data Lineage with OpenLineage

Sponsored: Lightup Data | How McDonald's Leveraged Lightup Data Quality

Labcorp Data Platform Journey: From Selection to Go-Live in Six Months

Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM

De-Risking Language Models for Faster Adoption

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

Sponsored: Matillion | Using Matillion to Boost Productivity w/ Lakehouse and your Full Data Stack

Sponsored by: Fivetran | Fivetran and Catalyst Enable Businesses & Solve Critical Market Challenges

Sponsored by: Anomalo | Scaling Data Quality with Unsupervised Machine Learning Methods

Sponsored: Accenture | Databricks Enables Employee Data Domain to Align People w/ Business Outcomes

Taking Control of Streaming Healthcare Data

Powering Up the Business with a Lakehouse

Open Source Powers the Modern Data Stack

Mapping Data Quality Concerns to Data Lake Zones

Connecting the Dots with DataHub: Lakehouse and Beyond

How unsupervised machine learning can scale data quality monitoring in Databricks

OvalEdge: End-To-End Data Governance

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse

Agile Data Engineering: Reliability and Continuous Delivery at Scale

Building Production-Ready Recommender Systems with Feature Stores

Cleanlab: AI to Find and Fix Errors in ML Datasets

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating a DWH Develop

Discover Data Lakehouse With End-to-End Lineage

ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

Computational Data Governance at Scale