Snowflake
Recent items mentioning Snowflake across the Databricks ecosystem — releases, news, videos, and community Q&A. Updated hourly.
Recent community discussions highlight ongoing performance comparisons between Snowflake and Databricks, with a new blog post and video benchmarking the two platforms [1]. Additionally, Matterbeam, a new company-wide write-ahead log for data, was showcased on Hacker News [2]. The competitive landscape also includes Microsoft Fabric, with community members debating whether it is "good enough" compared to Databricks [3].
Generated daily from the 4 most recent items mentioning Snowflake. Click any [N] to jump to the source.
[BLOG + video] Snowflake and Databricks benchmarks
We put Snowflake and Databricks head-to-head across 5 scenarios.

Snowflake won 4 out of 5 scenarios:

- Sequential queries: 34% faster, 17% cheaper (at $2/credit)
- Concurrent queries: 38% faster, 39% cheaper
- Cold start: 54% faster (Databricks startup time: ~7 sec. Snowflake: sub-second. Every. Single. Time.)
- DML (delete + insert): 59% faster, 32% cheaper thanks to elite query pruning that treated 6B rows like 6M

Databricks claimed the one that matters for data engineers:

- CTAS (Create Table As Select): 58% faster, 71% cheaper when writing billions of rows across multiple table shapes

If your workload is heavy on dbt materializations, large table builds, or data pipeline writes, Databricks has a real edge here. If your workload is analysts running queries, dashboards, and incremental refreshes, Snowflake Standard looks compelling, even vs. Databricks Enterprise pricing.

- Read the full methodology and results: https://select.dev/posts/snowflake-vs-databricks-showdown
- Take a look at the repo: https://github.com/get-select/snowflake-databricks-benchmark
Show HN: Matterbeam, a company-wide write-ahead log for your data
Hey HN. I'm Michael, founder of Matterbeam. Been chewing on the core ideas of it for over ten years, building toward it for three.

demo: https://www.youtube.com/watch?v=YuhujARUmhA
whitepaper: https://matterbeam.com/whitepaper

Short version: companies build their data infra on point-to-point pipelines and one place to put all the data. Source A goes to warehouse B. Team C wants the same data shaped differently? Build another. Eventually it's a mess of brittle ETL nobody wants to touch.

Matterbeam puts existing ideas together in a different way. Source data is collected as immutable, time-ordered facts into a log. Destinations replay and transform those facts, from any point in time, into the target they need. One source, many uses.

My last startup was acquired by Pluralsight in 2014. I ended up leading product architecture and data there for about five years, working with really brilliant product and data people who I would have said were doing everything _right_. Yet no one in the company was happy with data. It made me question whether something more fundamental was broken.

A key inspiration came from Martin Kleppmann's 2015 talk "Turning the Database Inside Out" (https://www.youtube.com/watch?v=fU9hR3kiOK0). Most databases internally do something interesting: a write-ahead log (durable, append-only, time-ordered) serves as the source of truth, and derived structures (B-trees, indexes, materialized views) are created from it, optimized to serve different read patterns. What if you took that pattern and blew it up to org scale? Your uses become materializations. Warehouse, RAG vector db, graph db, any new use created when needed with a late transform and a new emitter.

A few comparisons:

We aren't Kafka. Kafka is lower-level. My first attempt at this was at Pluralsight using Kafka as the log. It was crazy expensive and complicated to operate. For Matterbeam we built cloud-native: object storage gives durability, ephemeral compute avoids coordination, and we don't need 100ms latency for most jobs. Allowed us to avoid a lot of Kafka's complexity.

We aren't Fivetran. Fivetran is a managed pipeline. We're a utility. One customer replaced Fivetran when they brought us in. Saved them money, but that wasn't the goal. Suddenly projects they estimated at five months started taking two days. A two-year migration compressed into months. Their PMs started asking to use Matterbeam for everything.

We aren't a warehouse or lake. Snowflake and Databricks are great at what they're great at. The push to centralize all data in these systems was a mistake. We aim to be the layer underneath. Basically fulfill the original promise of the data lake: collect without a use case, materialize when you figure out what you need, in the shape and system you need.

What's broken:

This doesn't fit cleanly into "what does this replace" buckets. Most people agree data is broken but then lament "data is hard" or some form of "my team isn't doing it right." Nobody's actively looking to solve the deeper problem. Hard to find new customers even with glowing testimonials.

Connector coverage. Fivetran has hundreds. We have way fewer in production. We're working on it, we're using AI, and you can write your own pretty quickly. Still, if your stack needs fifty SaaS integrations on day one, we struggle.

We're early. Handful of paid customers. Not large-enterprise-ready: no SOC 2, HIPAA, etc. yet.

Also, conscious decision not to be open source. Long list of reasons, separate post.

I'd love feedback on:

- How would you position or market this? It feels like category creation, which I know is hard.
- Does the mental model land, or is there a piece where you go WAT?
- If you've built CDC-into-warehouse, Kafka-plus-schema-registry, or rolled a data backbone, what's the part you'd have wanted an easier solution for?

Blog, testimonials, marketing video on the site. I'll be watching the thread. Be brutal, I can take it (I think).
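The log-and-replay model described in the post can be pictured with a small in-memory sketch. This is an illustration only, not Matterbeam's implementation or API: the names `Fact`, `FactLog`, and `materialize` are hypothetical, and a real system would back the log with durable storage rather than a Python list.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional


@dataclass(frozen=True)
class Fact:
    """One record in the log (hypothetical shape); fields cannot be reassigned."""
    offset: int
    source: str
    payload: dict


class FactLog:
    """Append-only, ordered log. Facts are only ever added, never changed."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def append(self, source: str, payload: dict) -> Fact:
        # Copy the payload so later changes by the caller do not alter the log.
        fact = Fact(offset=len(self._facts), source=source, payload=dict(payload))
        self._facts.append(fact)
        return fact

    def replay(self, from_offset: int = 0) -> Iterator[Fact]:
        """Yield facts in log order, starting from any earlier offset."""
        yield from self._facts[from_offset:]


def materialize(
    log: FactLog,
    transform: Callable[[Fact], Optional[Any]],
    from_offset: int = 0,
) -> list:
    """Build one destination by replaying the log through a late transform.

    A transform that returns None drops the fact for this destination.
    """
    rows = []
    for fact in log.replay(from_offset):
        row = transform(fact)
        if row is not None:
            rows.append(row)
    return rows


log = FactLog()
log.append("crm", {"user": "ada", "plan": "free"})
log.append("billing", {"user": "ada", "amount": 20})
log.append("crm", {"user": "bob", "plan": "pro"})

# One source of facts, several independently shaped destinations.
warehouse_rows = materialize(log, lambda f: {"offset": f.offset, **f.payload})
revenue_events = materialize(
    log, lambda f: f.payload["amount"] if f.source == "billing" else None
)
late_joiner = materialize(log, lambda f: f.payload.get("user"), from_offset=2)

print(len(warehouse_rows), revenue_events, late_joiner)  # 3 [20] ['bob']
```

Each destination is a pure function of the log and a transform, so a new use case means adding another `materialize` call rather than another point-to-point pipeline.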
Is Fabric just “good enough,” or does Databricks still win?
I’ve been at a few Microsoft-centric events lately and keep hearing this: *“What’s the difference between Fabric, Databricks, and Snowflake, and when would you choose each?”* I'm curious what tips the scale for you and what your one-line answer would be.

* Team skillsets?
* Data scale/complexity?
* Cost control?
* Governance?
* AI/ML needs?
Show HN: Mljar Studio – local AI data analyst that saves analysis as notebooks
Hi HN, I’ve been working on mljar-supervised (open-source AutoML for tabular data) for a few years. Recently I built a desktop app around it called MLJAR Studio.

The idea is simple: you talk to your data in natural language, the AI generates Python code, executes it locally, and the whole conversation becomes a reproducible notebook (*.ipynb file). So instead of just chatting with data, you end up with something you can inspect, modify, and rerun.

What MLJAR Studio does:

- Sets up a local Python environment automatically, runs on Mac, Windows, and Linux
- Installs missing packages during the conversation
- Built-in AutoML for tabular data (classification, regression, multiclass)
- Works with standard Python libraries (pandas, matplotlib, etc.)
- Works with any data file: CSV, Excel, Stata, Parquet ...
- Connects to PostgreSQL, MySQL, SQL Server, Snowflake, Databricks, and Supabase

For AI: use Ollama locally (zero data egress), bring your own OpenAI key, or use the MLJAR AI add-on.

I built this because I wanted something between Jupyter Notebook (flexible but manual) and AI tools that generate code but don’t preserve the workflow. Most tools I tried either hide too much or don’t give reproducible results, and they are cloud-based.

Demos:

- 60-second demo: https://youtu.be/BjxpZYRiY4c
- Full 3-minute analysis: https://youtu.be/1DHMMxaNJxI

Pricing is $199 one-time, with a 7-day trial. Curious if this is useful for others doing real data work, or if I’m solving my own problem here. Happy to answer questions.

--- top comments ---

[MSaiRam10] Notebooks as the output format is funny because notebooks are famously bad for reproducibility. Out of order execution, hidden state, etc. You're solving "chat isn't reproducible" with a format that also isn't really

[hasyimibhar] How does this compare to open source Deepnote[0]? We used the cloud version (BYOC) at my previous company to replace self-hosted Jupyter notebooks, and it's pretty great.
[0] https://github.com/deepnote/deepnote

[2ndorderthought] This is one of those product areas I would call high-risk without a human in the loop. So I am glad you kept a person in the loop. It's really easy to lose tons of money making decisions based on bad statistics or models. Anyone remember how much money Zillow lost because of automatic time series models? I do have concerns about the workflow. Data people aren't usually the best programmers. Models hallucinate and make mistakes, sometimes subtle, sometimes not. Can you think of a way to prevent data scientists from having to be expert code reviewers? I feel like taking away the code gives them the chance to find and fix mistakes in their reasoning but I have no evidence for that.

[amirathi] Really cool. If somebody doesn't want to adopt a new platform, take a look at open source Jupyter MCP Server[1]. Once integrated with Claude, it can execute code on the live notebook kernel. I just let Claude write notebooks, run top to bottom, debug & fix errors & only ping me when everything is working.
[1] https://github.com/datalayer/jupyter-mcp-server

[trymamboapp] "AI saves analysis as notebooks" is fighting the wrong fight ig. The reproducibility issue with notebooks isn't the format. It's out-of-order cell execution and silent kernel state. LLM generation makes that worse: the model has no memory of what state existed when it wrote cell 7, and neither does the user.
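The out-of-order execution concern raised in the comments can be checked mechanically. The sketch below is not part of MLJAR Studio; it reads the `cells`, `cell_type`, `execution_count`, and `source` fields of the standard `.ipynb` JSON format and flags code cells whose counters do not match a clean top-to-bottom run. The helper name `out_of_order_cells` is hypothetical, and treating empty code cells as never executed is an assumption.

```python
import json


def out_of_order_cells(notebook: dict) -> list[int]:
    """Return indexes of code cells whose execution_count is not 1, 2, 3, ...
    in document order, as a fresh restart-and-run-all would produce."""
    problems = []
    expected = 1
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        if not source.strip():
            continue  # assumption: empty cells are never executed
        if cell.get("execution_count") != expected:
            problems.append(index)
        expected += 1
    return problems


# A tiny notebook where the third cell was run out of sequence.
NOTEBOOK_JSON = """
{
  "cells": [
    {"cell_type": "code", "execution_count": 1, "source": "x = 1"},
    {"cell_type": "markdown", "source": "A note"},
    {"cell_type": "code", "execution_count": 3, "source": ["y = x + 1"]},
    {"cell_type": "code", "execution_count": null, "source": ""}
  ]
}
"""

notebook = json.loads(NOTEBOOK_JSON)
print(out_of_order_cells(notebook))  # [2]
```

An empty result only shows the counters are consistent with a linear run; it does not prove the saved outputs came from the code currently in the cells.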
Show HN: Rocky – Rust SQL engine with branches, replay, column lineage
Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing.

The governance waveplan (column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies) landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.

The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG.

Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph: dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.

A few things I think are interesting:

- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.
- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.
- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.
- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.
- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.
- Schema-grounded AI. Generated SQL goes through the compiler: AI suggestions type-check before they can land.

What Rocky isn't:

- Not a warehouse; it's the control plane on top.
- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.
- Not dbt Cloud: no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.

Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.

I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.

--- top comments ---

[Xiaoher-C] The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality. Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary. Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.

[ramon156] If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?

[mollerhoj] It's a bit confusing to claim that "The things your current stack can't give you because it doesn't own the DAG" and use Databricks as your example: Databricks inclu […truncated]
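The "lineage diff" idea from the first comment can be sketched as a comparison of downstream sets between two lineage graphs. This is a generic illustration, not Rocky's data model or CLI: the `Lineage` mapping, the `downstream` and `lineage_diff` helpers, and the column names are all hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: column -> columns that read from it directly.
Lineage = dict[str, set[str]]


def downstream(lineage: Lineage, column: str) -> set[str]:
    """Return every column reachable from `column` (breadth-first walk)."""
    seen: set[str] = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, set()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def lineage_diff(base: Lineage, branch: Lineage, column: str) -> dict[str, set[str]]:
    """Compare the downstream impact of one column between two graphs."""
    before = downstream(base, column)
    after = downstream(branch, column)
    return {"added": after - before, "removed": before - after}


# Downstream impact of customer.email changes from A/B/C to A/B/D.
main_graph: Lineage = {
    "customer.email": {"a.email", "b.email"},
    "b.email": {"c.email"},
}
branch_graph: Lineage = {
    "customer.email": {"a.email", "b.email"},
    "b.email": {"d.email"},
}

diff = lineage_diff(main_graph, branch_graph, "customer.email")
print(sorted(diff["added"]), sorted(diff["removed"]))  # ['d.email'] ['c.email']
```

A code review tool could surface the `added` and `removed` sets per changed column, assuming the compiler can export lineage for both branches.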
Migrate SSRS reports from Snowflake to databricks..!
We are being onboarded onto a project (migrating SSRS reports from Snowflake to Databricks) and have never worked on anything similar, as we were mostly in a support role. Can you please guide us on how to approach this project and what needs to be taken care of? If anyone has worked on something similar, could you outline the rough process so that we can get a broad idea and move forward? Thanks!
Databricks vs. Snowflake Weekly
--- top comments ---

[noashavit] It’s hard keeping up with leading players in a fast-paced market like data. I built this artifact to scrape relevant sources (docs, site pages, press releases, etc.) and surface the most recent, trusted, and relevant news I should know about (with clear priority for product updates). Use it to keep up with these or other leading players in an industry you pay close attention to.
News: Bayada’s Snowflake-to-Databricks Migration: Transforming Data for Speed & Efficiency
Delta Lake 3.3.0
Delta Lake 3.3.0 introduces Identity Columns, faster VACUUM LITE, and the ability to enable Row Tracking on existing tables for row-level lineage. It also allows enabling UniForm Iceberg on existing tables without data rewrites and supports reading tables with Type Widening enabled in Delta Kernel.


