Lineage
Recent items mentioning Lineage across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Recent discussions highlight the importance of lineage in diverse data contexts, from a new Rust SQL engine offering column lineage [3] to the need for pre-built governance and monitoring infrastructure, including lineage, for successful AI application deployments [2]. Even within Databricks' own Lakeflow Spark Declarative Pipelines, users are seeking workarounds for functionality such as `pivot()` that could benefit from clearer lineage tracking [1].
Generated daily from the 3 most recent items mentioning Lineage.
pivot() workarounds in Lakeflow Spark Declarative Pipelines
Problem: In Lakeflow Spark Declarative Pipelines, the `pivot()` function is not supported. The `pivot` operation in Spark requires eager loading of the input data to compute the output schema, and pipelines do not support that capability. Source: [https://docs.databricks.com/aws/en/ldp/limitations](https://docs.databricks.com/aws/en/ldp/limitations)

# How can this be mitigated?

**Workaround 1: Rewrite PIVOT Using CASE WHEN**

This is the most common workaround. You manually expand the pivot into conditional aggregations.

Original query:

    SELECT *
    FROM sales_data
    PIVOT (
      SUM(sales) FOR region IN ('North', 'South', 'East', 'West')
    )

Rewritten without PIVOT:

    SELECT
      product,
      SUM(CASE WHEN region = 'North' THEN sales ELSE 0 END) AS North,
      SUM(CASE WHEN region = 'South' THEN sales ELSE 0 END) AS South,
      SUM(CASE WHEN region = 'East' THEN sales ELSE 0 END) AS East,
      SUM(CASE WHEN region = 'West' THEN sales ELSE 0 END) AS West
    FROM sales_data
    GROUP BY product

This works in Lakeflow pipelines because the output schema is fully determined at parse time, with no eager data loading required. Two details to keep in mind: PIVOT implicitly groups by every column not used in the aggregate or the FOR clause, so the rewrite assumes `sales_data` contains only `product`, `region`, and `sales`; and with `ELSE 0`, a product with no rows for a region shows 0 where PIVOT would return NULL. Omit the `ELSE 0` if the NULL must be preserved.

**Workaround 2: Rewrite PIVOT Using aggregate FILTER**

Databricks SQL supports the `FILTER(WHERE ...)` clause on aggregates, which is a cleaner alternative to CASE WHEN.

Original PIVOT query:

    SELECT year, region, q1, q2, q3, q4
    FROM sales
    PIVOT (
      SUM(sales) AS sales
      FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4)
    )

Rewritten with FILTER:

    SELECT
      year,
      region,
      SUM(sales) FILTER(WHERE quarter = 1) AS q1,
      SUM(sales) FILTER(WHERE quarter = 2) AS q2,
      SUM(sales) FILTER(WHERE quarter = 3) AS q3,
      SUM(sales) FILTER(WHERE quarter = 4) AS q4
    FROM sales
    GROUP BY year, region

This syntax is often more readable than repeated CASE WHEN expressions, especially with multiple aggregations.

**Multi-Column PIVOT Rewrite**

For pivoting on multiple columns simultaneously:

    SELECT *
    FROM sales
    PIVOT (
      SUM(sales) AS sales
      FOR (quarter, region) IN (
        (1, 'east') AS q1_east, (1, 'west') AS q1_west,
        (2, 'east') AS q2_east, (2, 'west') AS q2_west
      )
    )

Rewritten:

    SELECT
      year,
      SUM(sales) FILTER(WHERE quarter = 1 AND region = 'east') AS q1_east,
      SUM(sales) FILTER(WHERE quarter = 1 AND region = 'west') AS q1_west,
      SUM(sales) FILTER(WHERE quarter = 2 AND region = 'east') AS q2_east,
      SUM(sales) FILTER(WHERE quarter = 2 AND region = 'west') AS q2_west
    FROM sales
    GROUP BY year

**Multiple Aggregations**

You can also rewrite PIVOTs that use multiple aggregate functions.

Original query:

    SELECT *
    FROM (SELECT year, quarter, sales FROM sales) AS s
    PIVOT (
      SUM(sales) AS total, AVG(sales) AS avg
      FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4)
    )

Rewritten:

    SELECT
      year,
      SUM(sales) FILTER(WHERE quarter = 1) AS q1_total,
      AVG(sales) FILTER(WHERE quarter = 1) AS q1_avg,
      SUM(sales) FILTER(WHERE quarter = 2) AS q2_total,
      AVG(sales) FILTER(WHERE quarter = 2) AS q2_avg,
      SUM(sales) FILTER(WHERE quarter = 3) AS q3_total,
      AVG(sales) FILTER(WHERE quarter = 3) AS q3_avg,
      SUM(sales) FILTER(WHERE quarter = 4) AS q4_total,
      AVG(sales) FILTER(WHERE quarter = 4) AS q4_avg
    FROM sales
    GROUP BY year

**Summary**

Both approaches produce the same pivoted output (apart from the `ELSE 0` versus NULL difference noted under Workaround 1) and work fully within SDP pipelines with complete lineage tracking.
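The two workarounds can be compared on a small data set. The following is a minimal sketch, not a Databricks example: it uses Python's built-in `sqlite3` module as a local stand-in for Databricks SQL (an assumption made only so the code runs anywhere; aggregate `FILTER` needs SQLite 3.30 or later). The sample rows and the helper names `filter_exprs`, `case_exprs`, and `run` are made up for illustration. The pivot values are listed in code, which is what makes the output schema known before any data is read.

```python
import sqlite3

# Pivot values are fixed in code, so the output columns are known up front.
REGIONS = ["North", "South", "East", "West"]


def filter_exprs(keys):
    """One SUM(...) FILTER (WHERE ...) column per pivot value (hypothetical helper)."""
    return ",\n  ".join(
        f"SUM(sales) FILTER (WHERE region = '{k}') AS {k}" for k in keys
    )


def case_exprs(keys, else_zero):
    """One SUM(CASE WHEN ...) column per pivot value, with or without ELSE 0."""
    tail = " ELSE 0" if else_zero else ""
    return ",\n  ".join(
        f"SUM(CASE WHEN region = '{k}' THEN sales{tail} END) AS {k}" for k in keys
    )


def run(conn, exprs):
    """Wrap the generated columns in a GROUP BY query and fetch all rows."""
    sql = (
        f"SELECT product,\n  {exprs}\n"
        "FROM sales_data GROUP BY product ORDER BY product"
    )
    return conn.execute(sql).fetchall()


# SQLite in memory is an assumption for local testing, not the pipeline engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (product TEXT, region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        ("widget", "North", 10.0),
        ("widget", "North", 5.0),
        ("widget", "South", 7.0),
        ("gadget", "East", 3.0),
    ],
)

filter_rows = run(conn, filter_exprs(REGIONS))
case_null_rows = run(conn, case_exprs(REGIONS, else_zero=False))
case_zero_rows = run(conn, case_exprs(REGIONS, else_zero=True))

print("FILTER and CASE without ELSE agree:", filter_rows == case_null_rows)
print("CASE with ELSE 0:", case_zero_rows)
```

In this sketch the `FILTER` form and the `CASE WHEN` form without `ELSE` return the same rows, with NULL for missing regions, while the `ELSE 0` variant returns 0 in those cells. The string-built SQL is acceptable here only because the pivot values are hard-coded constants.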
AI Applications: Tools, Use Cases, and Platforms
AI applications span four capability tiers, each with distinct data requirements and evaluation frameworks, and enterprise deployments often stall due to inadequate data infrastructure. Production-grade model development, from prompt engineering to pretraining, is increasingly accessible with open-source LLMs, but requires pre-built governance and monitoring infrastructure for successful deployment at scale.
Show HN: Rocky – Rust SQL engine with branches, replay, column lineage
Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing.

The governance waveplan (column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies) landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.

The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG.

Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph: dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.

A few things I think are interesting:

- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.
- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.
- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.
- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.
- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.
- Schema-grounded AI. Generated SQL goes through the compiler, so AI suggestions type-check before they can land.

What Rocky isn't:

- Not a warehouse. It's the control plane on top.
- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.
- Not dbt Cloud: no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.

Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.

I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.

--- top comments ---

[Xiaoher-C] The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality. Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary. Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.

[ramon156] If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?

[mollerhoj] It's a bit confusing to claim that "The things your current stack can't give you because it doesn't own the DAG" and use DataBricks as your example: DataBricks inclu […truncated]
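The "lineage diff" that Xiaoher-C asks about can be pictured as a set comparison between two column-to-downstream maps. The sketch below is purely illustrative and is not Rocky's API or output format: the `Lineage` type, the `lineage_diff` function, and the sample maps are all hypothetical, and it assumes lineage for each branch is already available as a mapping from a column name to the set of models it feeds.

```python
from typing import Dict, Set, Tuple

# Hypothetical shape: column name -> set of downstream model names.
Lineage = Dict[str, Set[str]]


def lineage_diff(base: Lineage, head: Lineage) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return {column: (added_downstream, removed_downstream)} for changed columns."""
    changes: Dict[str, Tuple[Set[str], Set[str]]] = {}
    for column in sorted(set(base) | set(head)):
        before = base.get(column, set())
        after = head.get(column, set())
        if before != after:
            # Models only in head were added; models only in base were removed.
            changes[column] = (after - before, before - after)
    return changes


# The example from the comment: customer.email goes from A/B/C to A/B/D.
base = {"customer.email": {"A", "B", "C"}, "customer.id": {"A"}}
head = {"customer.email": {"A", "B", "D"}, "customer.id": {"A"}}

diff = lineage_diff(base, head)
for column, (added, removed) in diff.items():
    print(f"{column}: added {sorted(added)}, removed {sorted(removed)}")
```

Columns whose downstream set is unchanged, such as `customer.id` here, are left out of the result, which keeps a review comment focused on the impact that actually moved.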
How AI improves data lineage at scale
Discover how AI accelerates data lineage with automated docs, testing, and scalable governance.
Tutorials: How to use Recursive CTEs in Databricks
The video demonstrates how to use recursive CTEs in Databricks to traverse hierarchical data structures of unknown depth, such as data lineage or organizational charts. It shows how to write a recursive CTE in SQL, highlighting the `RECURSIVE` keyword and the union of an anchor member and a recursive member.
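A minimal sketch of the anchor-plus-recursive-member pattern applied to lineage follows. It is not taken from the video: the `lineage_edges` table, its rows, and the table names are invented for illustration, and SQLite (through Python's `sqlite3`) stands in for Databricks SQL so the query can be run locally. The `WITH RECURSIVE ... UNION ALL` shape is the part that carries over. The sketch assumes the lineage graph has no cycles, since a cycle would make this `UNION ALL` form recurse without end.

```python
import sqlite3

# In-memory SQLite is an assumption for local testing only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_edges (upstream TEXT, downstream TEXT)")
conn.executemany(
    "INSERT INTO lineage_edges VALUES (?, ?)",
    [
        ("raw.orders", "silver.orders"),
        ("silver.orders", "gold.daily_revenue"),
        ("silver.orders", "gold.customer_ltv"),
        ("gold.customer_ltv", "ml.churn_features"),
        ("raw.customers", "silver.customers"),
    ],
)

DOWNSTREAM_SQL = """
WITH RECURSIVE downstream_of (table_name, depth) AS (
    -- Anchor member: direct children of the starting table.
    SELECT downstream, 1
    FROM lineage_edges
    WHERE upstream = ?
    UNION ALL
    -- Recursive member: children of every table found so far.
    SELECT e.downstream, d.depth + 1
    FROM lineage_edges AS e
    JOIN downstream_of AS d ON e.upstream = d.table_name
)
SELECT table_name, depth
FROM downstream_of
ORDER BY depth, table_name
"""

rows = conn.execute(DOWNSTREAM_SQL, ("raw.orders",)).fetchall()
for table_name, depth in rows:
    print(f"{'  ' * (depth - 1)}{table_name} (depth {depth})")
```

The depth column is carried along so that the traversal works for hierarchies of unknown depth and the result still shows how far each table sits from the starting point.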
News: From Raw Data to Real-Time Retention: Powering Customer Health Scores on Databricks