docs/_posts/2025-06-13-first-lesson.md
31 additions & 9 deletions
@@ -1,16 +1,38 @@
 ---
-layout: default
 title: "First Lesson: Partition vs Broadcast"
-date: 2025-06-13
+date: 2025-06-13 12:00:00 -0800
+categories:
+- spark
+- optimization
+tags:
+- performance
+excerpt: "How switching from partitioned joins to broadcast joins reduced shuffle writes from 8GB to 500MB"
 ---

-## First Lesson: Partition vs Broadcast

-When joining a large 10 GB DataFrame with a small 200 MB lookup table, I discovered that:

-- Using a **broadcast join** with `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "300MB")` dropped shuffle write from 8 GB to 500 MB.
-- Conversely, relying on default partitioned joins caused redundant shuffles across executors.
+We have a Spark job on Databricks that joins an enormous table (player tracking records of per-frame coordinates) with a dimension table (position number to player uid mapping), then deduplicates the result with a window function.
+
+## The Deduplication Challenge
+
+The core logic used a window function to handle duplicate records (pseudo-SQL):
+
+```sql
+WITH ranked AS (
+    SELECT
+        tracking.*,
+        lineup.fielder_id,
+        lineup.position_alpha,
+        ROW_NUMBER() OVER (
+            PARTITION BY game_id, pitch_uid, position_num, event_time
+            ORDER BY processed_year DESC, processed_month DESC, processed_day DESC
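The query in the diff is cut off before it shows what happens to the row numbers. As a rough illustration of what this kind of `ROW_NUMBER()` deduplication usually amounts to, here is a minimal plain-Python sketch. It is not Spark code and not the author's implementation. The column names come from the SQL above; the sample values, the `dedup_latest` helper, and the assumption that only the first-ranked row in each partition is kept are all hypothetical.

```python
from operator import itemgetter

# Hypothetical sample rows: column names are taken from the SQL in the diff,
# the values are invented for illustration.
rows = [
    {"game_id": 1, "pitch_uid": "p1", "position_num": 7, "event_time": 100,
     "processed_year": 2025, "processed_month": 6, "processed_day": 12, "x": 1.0},
    {"game_id": 1, "pitch_uid": "p1", "position_num": 7, "event_time": 100,
     "processed_year": 2025, "processed_month": 6, "processed_day": 13, "x": 1.5},
    {"game_id": 1, "pitch_uid": "p1", "position_num": 8, "event_time": 100,
     "processed_year": 2025, "processed_month": 6, "processed_day": 12, "x": 9.0},
]

# Mirrors PARTITION BY game_id, pitch_uid, position_num, event_time
partition_key = itemgetter("game_id", "pitch_uid", "position_num", "event_time")
# Mirrors ORDER BY processed_year DESC, processed_month DESC, processed_day DESC
order_key = itemgetter("processed_year", "processed_month", "processed_day")


def dedup_latest(records):
    """Keep the most recently processed record in each partition.

    Assumption: the truncated query goes on to keep only row number 1.
    Ties are resolved in favor of the first record seen.
    """
    latest = {}
    for rec in records:
        key = partition_key(rec)
        if key not in latest or order_key(rec) > order_key(latest[key]):
            latest[key] = rec
    return list(latest.values())


deduped = dedup_latest(rows)
```

With the sample rows, the two records that share a partition key collapse into the one processed on the later day, and the record for the other position is left alone.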