Description
Spark Performance Tuning

Welcome back to another Apache Spark tutorial! In this hands-on Apache Spark performance optimization tutorial, we dive deep into techniques for fixing data skew, focusing on Adaptive Query Execution (AQE) and broadcast joins. AQE, a feature introduced in Spark 3.0, uses runtime statistics to select the most efficient query plan, optimizing shuffle partitions, join strategies, and skewed joins. We discuss how Spark coalesces shuffle partitions, converts sort-merge joins into broadcast joins, and splits larger partitions into smaller ones to optimize skewed joins.

We walk through the Spark documentation to understand the properties that must be set to true for Spark to dynamically handle skew in a sort-merge join. Then we look at an example joining two datasets, transaction and customer, to analyze how the join looks with and without AQE. By the end of this video, you will have a solid understanding of AQE, how to optimize skewed joins, and how to set up a Spark session to handle data skew.

Key Takeaways:
Understanding Adaptive Query Execution (AQE) and its benefits.
How to optimize shuffle partitions and joins using AQE.
Setting up a Spark …
Description from YouTube. Full content on the video page.
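The description mentions properties that have to be enabled for Spark to handle skew in a sort-merge join. As a rough sketch of what such a session setup might look like in PySpark: the property names below come from the Spark SQL performance tuning documentation, but the values, the app name, and the helper names `is_skewed` and `build_session` are illustrative assumptions, not taken from the video.

```python
# Illustrative AQE settings for skewed sort-merge joins.
# Property names are real Spark SQL configs; the values shown are the
# documented defaults or plain "true" flags, not tuned recommendations.
AQE_SKEW_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def is_skewed(partition_bytes, median_bytes, factor=5,
              threshold_bytes=256 * 1024 * 1024):
    """Hypothetical helper mirroring the documented skew rule:
    a partition counts as skewed when it is larger than
    factor * median partition size AND larger than the byte threshold."""
    return (partition_bytes > factor * median_bytes
            and partition_bytes > threshold_bytes)


def build_session(app_name="aqe-skew-demo"):
    """Hypothetical helper: build a SparkSession with the settings above.
    pyspark is imported lazily so the rest of the sketch runs without it."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in AQE_SKEW_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With these settings, a 1 GiB shuffle partition next to a 64 MiB median would qualify as skewed (it exceeds both 5 × 64 MiB and 256 MiB), so AQE would be allowed to split it during a sort-merge join. A 300 MiB partition next to a 100 MiB median would not, because it is under five times the median.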
More from Afaque Ahmad
Community: How I Mastered System Design Interviews
This video teaches a six-step framework for mastering data engineering system design interviews, covering requirements gathering, pipeline design, data modeling, storage and file formats, data quality and observability, and pipeline resilience. It demonstrates how to apply this framework with practical examples and back-of-the-envelope calculations to justify design choices.
Tutorials: Databricks End-To-End Project | Zero-To-Expert | Streaming, AI, Lakeflow, Unity Catalog, AI/BI
This video demonstrates building an end-to-end restaurant analytics platform on Databricks, covering streaming and batch data ingestion, AI-powered sentiment analysis, and dashboard creation. It teaches how to use Unity Catalog, Lakeflow Connect for CDC, Spark declarative pipelines for real-time data from Event Hub, and how to construct a medallion architecture with fact and dimension tables.
Community: How Much DSA Do You Need To Crack Data Engineering Interviews?
Data engineers need to understand DSA concepts at an easy-to-medium level, focusing on practical applications like Big O intuition, arrays, hashmaps, and basic trees/graphs, rather than advanced algorithms. The video provides a practical DSA roadmap, differentiating between "must-knows," "good-to-knows" for stronger product/infra roles, and "overkill" topics for most classic data engineering interviews.
Community: Will AI REPLACE Data Engineers?
AI will not replace data engineers, but it will shift their role from typing code to designing solutions, guiding AI tools, and verifying outputs. Data engineers should focus on core coding fundamentals, system and product thinking, and effectively using AI and other tools.
Community: Apache Spark Was Hard Until I Learned These 30 Concepts!
The video explains 30 key Apache Spark concepts, starting with a comparison to MapReduce to highlight Spark's in-memory processing and DAG-based execution model. It then details Spark's cluster architecture, job execution flow (driver, executors, tasks), and memory management within executor containers.
Tutorials: Delta Lake Masterclass | Azure Databricks | PySpark | From Zero-To-Expert
This video provides a comprehensive masterclass on Delta Lake using Azure Databricks and PySpark, covering its core concepts, internal workings, and practical applications. It demonstrates how Delta Lake solves data lake problems like lack of ACID support, DML operations, and schema enforcement, and teaches features like time travel, concurrency control, and optimization techniques.