README.md: 17 additions & 7 deletions
@@ -1,21 +1,31 @@
-# Spark Tuning Notes
+# Data Engineering Notes
 
-A collection of Apache Spark performance optimization lessons learned from real-world production workloads.
+A collection of practical insights and lessons learned from real-world data engineering workloads, covering Apache Spark optimization, pipeline architecture, and big data technologies.
 
 ## 📖 Read the Blog
 
-Visit the live blog: **[mag1cfrog.github.io/spark-tuning-notes](https://mag1cfrog.github.io/spark-tuning-notes/)**
+Visit the live blog: **[mag1cfrog.github.io/data-engineering-notes](https://mag1cfrog.github.io/data-engineering-notes/)**
 
-## 🚀 What You'll Learn
-
-This blog documents practical Spark optimization techniques gained from working with massive datasets in production environments, including:
+This blog documents practical data engineering techniques gained from working with massive datasets in production environments, including:
 
+### Apache Spark Optimization
 - **Window Function Performance**: How a simple `ROW_NUMBER()` window function caused 391 GB of disk spilling and 33+ minute execution times
 - **Hash vs Sort Aggregation**: Understanding when Spark falls back to expensive sort-based aggregation and how to avoid it
 - **Memory Management**: Identifying and fixing disk spilling issues that kill performance
 - **Query Profile Analysis**: Reading Databricks query profiles to identify bottlenecks
 - **Alternative Approaches**: Using functions like `max_by()` to replace expensive window operations
 
+### Data Pipeline Engineering
+- Pipeline design patterns and best practices
+- Data quality and monitoring strategies
+- Stream processing architectures
+- Performance optimization techniques
+
+### Big Data Technologies
+- Comparative analysis of data processing frameworks
+- Infrastructure and deployment considerations
+- Scalability patterns and anti-patterns
+
 ## 🎯 Featured Case Study
 
 **From 33 minutes to 12 minutes**: Learn how replacing a window function with `max_by()` eliminated massive sort operations and reduced execution time by 2.7x on a 2.3 billion row dataset.
@@ -42,6 +52,6 @@ Have your own Spark optimization stories? Contributions are welcome! This blog a