Commit c38c0d2

Refactor lesson post and enhance styling: update content structure and improve code block appearance
1 parent 0e22433 commit c38c0d2

2 files changed

Lines changed: 52 additions & 13 deletions

docs/_posts/2025-06-13-first-lesson.md

Lines changed: 24 additions & 9 deletions
@@ -5,19 +5,34 @@ categories:
 - spark
 - optimization
 tags:
-- broadcast
-- partitioning
 - performance
 excerpt: "How switching from partitioned joins to broadcast joins reduced shuffle writes from 8GB to 500MB"
 ---
 
-## First Lesson: Partition vs Broadcast
 
-When joining a large 10 GB DataFrame with a small 200 MB lookup table, I discovered that:
 
-- Using a **broadcast join** with `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "300MB")` dropped shuffle write from 8 GB to 500 MB.
-- Conversely, relying on default partitioned joins caused redundant shuffles across executors.
+We have a Spark job on Databricks that joins an enormous table (player tracking records of per-frame coordinates) with a dimension table (position number to player uid mapping), and then de-duplicates the result using a window function.
+
+## The Deduplication Challenge
+
+The core logic uses a window function to handle duplicate records (pseudo-SQL):
+
+```sql
+WITH ranked AS (
+  SELECT
+    tracking.*,
+    lineup.fielder_id,
+    lineup.position_alpha,
+    ROW_NUMBER() OVER (
+      PARTITION BY game_id, pitch_uid, position_num, event_time
+      ORDER BY processed_year DESC, processed_month DESC, processed_day DESC
+    ) AS rn
+  FROM hawkeye_tracking tracking
+  JOIN hawkeye_lineup lineup ON (...)
+)
+SELECT * EXCEPT (rn)
+FROM ranked
+WHERE rn = 1
+```
+
 
-<aside class="callout">
-💡 **Tip:** Always tune `spark.sql.shuffle.partitions = executors * cores_per_executor` after switching join strategies.
-</aside>
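
The join-then-deduplicate logic that the added post describes can be sketched in plain Python to make the `ROW_NUMBER() ... WHERE rn = 1` semantics concrete. This is only an illustration, not the job's actual code: the function name `join_and_dedup` and the lookup dictionary are hypothetical, and the join key `(game_id, position_num)` is an assumption because the diff elides the `ON (...)` clause.

```python
def join_and_dedup(tracking_rows, lineup_by_key):
    """Inner-join tracking rows to lineup entries, then keep the newest row per key.

    tracking_rows: iterable of dicts with the columns named in the SQL.
    lineup_by_key: dict keyed by (game_id, position_num); this join key is an
                   ASSUMPTION, since the source elides the ON clause.
    """
    best = {}  # partition key -> (recency tuple, joined row)
    for row in tracking_rows:
        lineup = lineup_by_key.get((row["game_id"], row["position_num"]))
        if lineup is None:
            continue  # inner join: drop tracking rows without a lineup match

        joined = {**row, **lineup}

        # Mirrors PARTITION BY game_id, pitch_uid, position_num, event_time
        key = (row["game_id"], row["pitch_uid"],
               row["position_num"], row["event_time"])

        # Mirrors ORDER BY processed_year DESC, processed_month DESC,
        # processed_day DESC: the largest tuple is the row with rn = 1.
        recency = (row["processed_year"], row["processed_month"],
                   row["processed_day"])

        # On ties the first row seen wins; ROW_NUMBER() is equally arbitrary.
        if key not in best or recency > best[key][0]:
            best[key] = (recency, joined)

    return [joined for _, joined in best.values()]
```

Calling `join_and_dedup(rows, lineup)` with two rows that share the same partition key returns only the one with the latest processed date, carrying `fielder_id` and `position_alpha` from the lineup entry.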

docs/assets/css/style.css

Lines changed: 28 additions & 4 deletions
@@ -5,11 +5,35 @@ body {
   margin: auto;
   padding: 1rem;
 }
-pre {
-  background: #f5f5f5;
-  padding: 0.5rem;
-  overflow-x: auto;
+
+/* Override the theme's pre styling with more specific selectors */
+.content pre[class*="language-"] {
+  background: #f5f5f5 !important;
+  border: 1px solid #e0e0e0 !important;
+  border-radius: 6px !important;
+  padding: 1em !important;
+  margin: 1.5em 0 !important;
+  line-height: 1.5 !important;
 }
+
+/* Also target the code element inside pre */
+.content pre[class*="language-"] code {
+  background: none !important;
+  padding: 0 !important;
+  color: #333333 !important;
+  font-size: 0.9em !important;
+}
+
+/* Fallback for any plain pre tags */
+.content pre:not([class]) {
+  background: #f5f5f5 !important;
+  border: 1px solid #e0e0e0 !important;
+  border-radius: 6px !important;
+  padding: 1em !important;
+  margin: 1.5em 0 !important;
+  line-height: 1.5 !important;
+}
+
 .callout {
   border-left: 4px solid #007acc;
   background: #f0f8ff;
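
The stylesheet comment says the new rules rely on "more specific selectors" to override the theme. A rough way to check that claim is to compare CSS specificity tuples `(ids, classes/attributes, types)`. The helper below is a deliberately simplified sketch: the name `specificity` is hypothetical, and it only understands the selector features used in this diff (classes, attribute selectors, `:not(...)`, and type selectors), not the full selector grammar.

```python
import re


def specificity(selector):
    """Return an (ids, classes_and_attrs, types) tuple for a simple selector.

    Simplified on purpose: handles only #id, .class, [attr], :not(...) and
    type selectors separated by combinators. Not a full CSS parser.
    """
    # :not() itself adds nothing; its argument counts as if written inline.
    sel = re.sub(r":not\(([^)]*)\)", r"\1", selector)

    # Pull attribute selectors out first so quoted values cannot be
    # mistaken for classes or type names.
    attrs = re.findall(r"\[[^\]]*\]", sel)
    sel = re.sub(r"\[[^\]]*\]", "", sel)

    ids = len(re.findall(r"#[\w-]+", sel))
    classes = len(re.findall(r"\.[\w-]+", sel))
    # A type selector starts the selector or follows a combinator.
    types = len(re.findall(r"(?:^|[\s>+~])[a-zA-Z][\w-]*", sel))

    return (ids, classes + len(attrs), types)
```

Tuples compare left to right, so `specificity('.content pre[class*="language-"]') > specificity('pre')` holds, which is consistent with the new rule winning over a bare `pre` rule. Whether it also beats the theme depends on the theme's own selectors and on any `!important` declarations there, neither of which this diff shows.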
