Commit 72cc381

Merge pull request #9 from mag1cfrog/refactor/rename-repo
Refactor/rename repo
2 parents 22b953d + fdbc801 commit 72cc381

23 files changed

Lines changed: 297 additions & 123 deletions

README.md

Lines changed: 17 additions & 7 deletions
@@ -1,21 +1,31 @@
-# Spark Tuning Notes
+# Data Engineering Notes
 
-A collection of Apache Spark performance optimization lessons learned from real-world production workloads.
+A collection of practical insights and lessons learned from real-world data engineering workloads, covering Apache Spark optimization, pipeline architecture, and big data technologies.
 
 ## 📖 Read the Blog
 
-Visit the live blog: **[mag1cfrog.github.io/spark-tuning-notes](https://mag1cfrog.github.io/spark-tuning-notes/)**
+Visit the live blog: **[mag1cfrog.github.io/data-engineering-notes](https://mag1cfrog.github.io/data-engineering-notes/)**
 
-## 🚀 What You'll Learn
-
-This blog documents practical Spark optimization techniques gained from working with massive datasets in production environments, including:
+This blog documents practical data engineering techniques gained from working with massive datasets in production environments, including:
 
+### Apache Spark Optimization
 - **Window Function Performance**: How a simple `ROW_NUMBER()` window function caused 391 GB of disk spilling and 33+ minute execution times
 - **Hash vs Sort Aggregation**: Understanding when Spark falls back to expensive sort-based aggregation and how to avoid it
 - **Memory Management**: Identifying and fixing disk spilling issues that kill performance
 - **Query Profile Analysis**: Reading Databricks query profiles to identify bottlenecks
 - **Alternative Approaches**: Using functions like `max_by()` to replace expensive window operations
 
+### Data Pipeline Engineering
+- Pipeline design patterns and best practices
+- Data quality and monitoring strategies
+- Stream processing architectures
+- Performance optimization techniques
+
+### Big Data Technologies
+- Comparative analysis of data processing frameworks
+- Infrastructure and deployment considerations
+- Scalability patterns and anti-patterns
+
 ## 🎯 Featured Case Study
 
 **From 33 minutes to 12 minutes**: Learn how replacing a window function with `max_by()` eliminated massive sort operations and reduced execution time by 2.7x on a 2.3 billion row dataset.
@@ -42,6 +52,6 @@ Have your own Spark optimization stories? Contributions are welcome! This blog a
 
 ## 🔗 Links
 
-- **Live Blog**: [mag1cfrog.github.io/spark-tuning-notes](https://mag1cfrog.github.io/spark-tuning-notes/)
+- **Live Blog**: [mag1cfrog.github.io/data-engineering-notes](https://mag1cfrog.github.io/data-engineering-notes/)
 - **Astro Documentation**: [docs.astro.build](https://docs.astro.build)
 - **Apache Spark**: [spark.apache.org](https://spark.apache.org)
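
Note on the README's featured case study: the text says that replacing a `ROW_NUMBER()` window function with `max_by()` eliminated large sort operations. The sketch below is plain JavaScript, not Spark code, and only illustrates the general idea under stated assumptions: picking the latest row per key by sorting each group versus keeping a running maximum in one pass. The sample rows, column names (`userId`, `eventTime`, `status`) and both helper functions are hypothetical and do not come from the blog post.

```javascript
// Hypothetical sample rows; column names are illustrative only.
const rows = [
  { userId: 'a', eventTime: 3, status: 'active' },
  { userId: 'a', eventTime: 7, status: 'churned' },
  { userId: 'b', eventTime: 5, status: 'active' },
  { userId: 'b', eventTime: 2, status: 'trial' },
];

// Window-style approach: collect each group, sort it, keep the first row.
// Loosely analogous to filtering on ROW_NUMBER() = 1 within each partition.
function latestBySort(input) {
  const groups = new Map();
  for (const row of input) {
    if (!groups.has(row.userId)) groups.set(row.userId, []);
    groups.get(row.userId).push(row);
  }
  const result = {};
  for (const [key, group] of groups) {
    group.sort((x, y) => y.eventTime - x.eventTime); // full sort per group
    result[key] = group[0].status;
  }
  return result;
}

// Aggregation-style approach: one pass, keeping only the best row per key.
// Loosely analogous to max_by(status, eventTime) with a GROUP BY on userId.
function latestByMaxBy(input) {
  const best = new Map();
  for (const row of input) {
    const current = best.get(row.userId);
    if (current === undefined || row.eventTime > current.eventTime) {
      best.set(row.userId, row);
    }
  }
  const result = {};
  for (const [key, row] of best) result[key] = row.status;
  return result;
}

console.log(latestBySort(rows));
console.log(latestByMaxBy(rows));
```

Both functions return the same mapping for this input. The second never materializes or sorts a whole group, which is the property the README attributes to the `max_by()` rewrite.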

astro.config.mjs

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ import customToc from "astro-custom-toc";
 // https://astro.build/config
 export default defineConfig({
 site: 'https://mag1cfrog.github.io',
-base: '/spark-tuning-notes/',
+base: '/data-engineering-notes/',
 output: 'static',
 integrations: [
 customToc(
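
Note on the `base` change: as a quick sanity check, the `site` and `base` values in this hunk can be resolved with the standard `URL` constructor to confirm that they match the blog URLs in the README diff. The helper `deployedUrl` below is hypothetical and not part of Astro; this is a minimal sketch of how the two settings combine, not a description of Astro's internals.

```javascript
// Values copied from the astro.config.mjs hunk in this commit.
const site = 'https://mag1cfrog.github.io';
const oldBase = '/spark-tuning-notes/';
const newBase = '/data-engineering-notes/';

// Hypothetical helper: resolve a base path against the site origin
// using the standard URL constructor.
function deployedUrl(siteOrigin, basePath) {
  return new URL(basePath, siteOrigin).href;
}

console.log(deployedUrl(site, oldBase));
console.log(deployedUrl(site, newBase));
```

The second value should equal the new "Live Blog" link in README.md. Any hard-coded links that still use the old path would need the same update.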

package-lock.json

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
-"name": "spark-tuning-notes",
+"name": "data-engineering-notes",
 "type": "module",
 "version": "0.0.1",
 "scripts": {
-23.2 KB
Binary file not shown.
-22.3 KB
Binary file not shown.
File renamed without changes.

src/assets/img/post1/dbx_query_operation_11.jpg renamed to src/assets/img/spark-window-optimization/dbx_query_operation_11.jpg

File renamed without changes.

src/assets/img/post1/dbx_query_profile_1.jpg renamed to src/assets/img/spark-window-optimization/dbx_query_profile_1.jpg

File renamed without changes.

src/assets/img/post1/dbx_query_profile_2.jpg renamed to src/assets/img/spark-window-optimization/dbx_query_profile_2.jpg

File renamed without changes.
