Data Quality

Recent items mentioning Data Quality across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.

49 recent items · 8 news · 40 videos · 1 community thread

What's happening in Data Quality
AI synthesis · updated 9d ago

Recent Databricks blogs highlight data quality as a critical factor for successful AI initiatives, emphasizing that poor data quality is a common pitfall to avoid when tying AI investments to business outcomes [1]. A modern data governance architecture, combining automated lineage and RBAC, is key to ensuring data quality and regulatory compliance at scale [3]. These strategies underscore the importance of clean, governed data and strategic platform consolidation for driving company growth with AI [1].

Generated daily from the 5 most recent items mentioning Data Quality. Click any [N] to jump to the source.

Reddit · Help

Pipelines - how are you handling significant schema changes?

Hello,

Right now with pipelines you can set things up to gracefully handle column additions and safe type conversions (more specific type to more general type). For things like column removals or downcasting, even the most liberal schema evolution setting will throw a failure and require the user to triage manually by doing a full reload.

I get why this is being done and I agree with the overall philosophy. We should lean on planning and communication, because significant schema changes mean there is likely an impact downstream and/or to data quality/meaning.

But... that means we have breaking changes that need to be triaged. In a scenario where we have separate Ops teams, potentially operating in a different country, that means we would need to do "staged" failure and reprocessing of the various pipelines.

Example: a major change in pipeline A means it breaks and requires a full refresh. But A feeds B, which feeds C, which feeds D. In some situations, that means staged triaging of those expected failures (fix A, then fix B, then fix C, then fix D), which means labor and perceived extended downtime.

- How is everyone managing those kinds of situations?
- Is there any possibility of ever setting the pipeline objects to allow for any form of schema change, falling back to a full load as a "last resort" so things auto-heal? (I get why this is not a great option, but at some point this feels like what we'd do manually anyway.)

Thank you!

21 · lofat · 1mo ago
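The two ideas in the post above (which schema changes are safe versus breaking, and the staged A, B, C, D reprocessing) can be sketched in plain Python. This is only an illustration and calls no Databricks API. The names `SAFE_WIDENINGS`, `classify_schema_change`, and `refresh_order` are hypothetical, as are the type names and the list of widening conversions treated as safe. It also assumes the pipeline dependencies form a DAG.

```python
from collections import deque

# Hypothetical set of (old_type, new_type) widenings treated as safe.
# The real rules depend on the platform and its schema evolution settings.
SAFE_WIDENINGS = {("int", "bigint"), ("float", "double"), ("int", "double")}


def classify_schema_change(old, new):
    """Compare two {column: type} mappings.

    Returns "none", "safe" (additions or widenings only), or
    "breaking" (a column was removed or a type changed unsafely).
    """
    changed = False
    for col, old_type in old.items():
        if col not in new:
            return "breaking"  # column removal
        new_type = new[col]
        if new_type != old_type:
            if (old_type, new_type) not in SAFE_WIDENINGS:
                return "breaking"  # e.g. downcasting
            changed = True  # safe widening
    if any(col not in old for col in new):
        changed = True  # column addition
    return "safe" if changed else "none"


def refresh_order(downstream, broken):
    """Return the broken pipeline and everything fed by it, upstream first.

    `downstream` maps a pipeline name to the pipelines it feeds.
    """
    # Step 1: collect every pipeline reachable from the broken one.
    affected = {broken}
    queue = deque([broken])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Step 2: Kahn's algorithm on the affected subgraph, so a pipeline
    # is refreshed only after all of its affected inputs.
    indegree = {name: 0 for name in affected}
    for name in affected:
        for child in downstream.get(name, []):
            indegree[child] += 1
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for child in downstream.get(name, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order


if __name__ == "__main__":
    chain = {"A": ["B"], "B": ["C"], "C": ["D"]}
    if classify_schema_change({"id": "bigint"}, {"id": "int"}) == "breaking":
        print(refresh_order(chain, "A"))
```

With an ordering like this, a single orchestrated job could in principle run the full refreshes in sequence rather than each team triaging its own failure. Whether a "last resort" fallback of that kind is acceptable remains the judgment call the post raises.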