Battle-tested patterns for making Apache Spark jobs fast and cheap. Each technique is paired with reproducible benchmarks on a local stack, with no cloud required. The same patterns scale to EMR, Databricks, Dataproc, and Glue.
Sushma K L — Senior Data Engineer 📍 Bengaluru, India 💼 LinkedIn • 🐙 GitHub • ✉️ sushmakl95@gmail.com
8+ years of shipping PySpark data pipelines on AWS Glue, Databricks, and EMR. This playbook condenses the optimization patterns I reach for most often: the ones that consistently deliver 5-50× speed/cost wins on real production workloads.
Most Spark tuning content is either (a) vendor marketing or (b) a dump of configuration flags without the why. This repo is different:
- Every technique has a benchmark. Not "this should be faster" — actual measured numbers.
- Every benchmark is reproducible on a laptop. Docker + synthetic data. No cloud bill.
- Every optimization explains the internals. Why does broadcast join beat shuffle? Why does AQE help with skew? You leave understanding, not memorizing.
If you're a data engineer who wants to stop guessing and start measuring — this is for you.
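As a taste of the "explain the internals" approach, here is a back-of-envelope sketch of why a broadcast join can beat a shuffle join, and why it stops winning once the small side grows. It is a simplified model of my own for illustration, not code from this repo: the function names are hypothetical, and it counts only bytes moved over the network, ignoring serialization, sorting, and spill.

```python
# Rough network-cost model for two join strategies.
# Assumption: only bytes crossing the network are counted; real jobs also
# pay for serialization, sorting, spill, and driver memory.

GB = 1024 ** 3
MB = 1024 ** 2


def shuffle_join_bytes(big_bytes: int, small_bytes: int, executors: int) -> int:
    """Both sides are hash-partitioned on the join key, so roughly
    (executors - 1) / executors of each table lands on a remote executor."""
    remote_fraction = (executors - 1) / executors
    return int((big_bytes + small_bytes) * remote_fraction)


def broadcast_join_bytes(small_bytes: int, executors: int) -> int:
    """Only the small side moves: one full copy per executor.
    The big side is joined where it already sits, with no shuffle."""
    return small_bytes * executors


if __name__ == "__main__":
    for small in (50 * MB, 10 * GB):
        s = shuffle_join_bytes(100 * GB, small, executors=20)
        b = broadcast_join_bytes(small, executors=20)
        print(f"small side {small / MB:>8.0f} MB | "
              f"shuffle {s / GB:6.1f} GB | broadcast {b / GB:6.1f} GB")
```

In this model the broadcast cost grows with the executor count while the shuffle cost does not, which is one way to see why Spark only broadcasts automatically below a size threshold.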
| # | Technique | Typical impact | When to use |
|---|---|---|---|
| 1 | File sizing & coalesce strategy | 3-10× read speed | Small-files problem on S3/HDFS |
| 2 | Partition pruning | 5-50× | Time-series + columnar storage |
| 3 | Z-ORDER / Liquid Clustering | 10-40× | Multi-dimensional filter predicates |
| 4 | Broadcast joins | 5-20× | Small dim × big fact |
| 5 | Adaptive Query Execution (AQE) | 2-5× | Skewed joins, dynamic shuffle partitions |
| 6 | Skew handling (salting) | 10-100× | Heavily skewed keys |
| 7 | Executor & memory tuning | 1.5-3× | All workloads — foundational |
| 8 | Caching & checkpointing | 2-10× | Multi-use intermediate DataFrames |
| 9 | Predicate pushdown | 3-20× | Filtering on columnar formats |
| 10 | Storage format choice | 3-8× | Parquet vs ORC vs Delta vs Iceberg |
| 11 | Liquid Clustering (Delta 3.2+) | 2-5× on incremental OPTIMIZE | New Delta tables — replaces Z-ORDER; evolvable clustering keys |
| 12 | Deletion Vectors (Delta 3.x) | 5-20× on DELETE/MERGE | GDPR deletes, CDC merge-on-read workloads |
| 13 | Apache DataFusion Comet | 2-4× on analytic scans | Heavy parquet + aggregation workloads, no code change |
Each technique has its own notebook in notebooks/ showing before/after measurements.
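To make one row of the table concrete, here is a pure-Python sketch of the idea behind salting (technique 6). It does not use Spark and is not taken from the notebooks. The helper names are hypothetical, and the salt is assigned round-robin so the result is deterministic, whereas a Spark job would typically use a random salt. The point it illustrates: every row for one join key goes to the same task, so the largest key group bounds the slowest task. Salting splits that group, at the price of replicating the matching rows on the small side.

```python
from collections import Counter
from itertools import count


def salt_keys(keys, hot_keys, num_salts):
    """Turn each hot key into num_salts distinct (key, salt) keys.
    Non-hot keys keep a fixed salt of 0 and are left effectively unchanged."""
    next_salt = count()
    salted = []
    for key in keys:
        if key in hot_keys:
            # Round-robin salt (assumption: chosen here for determinism).
            salted.append((key, next(next_salt) % num_salts))
        else:
            salted.append((key, 0))
    return salted


def replicate_small_side(rows, hot_keys, num_salts):
    """Emit each hot-key row once per salt so every salted key on the
    big side still finds its match after the join."""
    out = []
    for key, value in rows:
        salts = range(num_salts) if key in hot_keys else range(1)
        out.extend(((key, salt), value) for salt in salts)
    return out


def largest_group(keys):
    """Row count of the most frequent key: a lower bound on the
    input size of the slowest join task."""
    return max(Counter(keys).values())


if __name__ == "__main__":
    # Synthetic skew: one key holds 90% of the rows.
    big_side = ["IN"] * 9000 + ["US"] * 600 + ["DE"] * 400
    small_side = [("IN", "India"), ("US", "United States"), ("DE", "Germany")]

    salted = salt_keys(big_side, hot_keys={"IN"}, num_salts=8)
    print("largest group before:", largest_group(big_side))
    print("largest group after: ", largest_group(salted))
    print("small side rows after replication:",
          len(replicate_small_side(small_side, {"IN"}, 8)))
```

The trade-off is visible in the last line: the small side grows by the salt factor for every hot key, so salting pays off only when the skew is severe.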
# Install
git clone https://github.com/sushmakl95/spark-performance-optimization-playbook.git
cd spark-performance-optimization-playbook
make install-dev
# Generate synthetic dataset (1GB, ~10M rows)
make seed-data
# Run the whole benchmark suite
make bench-all

Expected output (shortened, on an 8-core laptop):
[bench] 01_file_sizing before=142.3s after= 14.8s speedup= 9.6x
[bench] 02_partition_pruning before=87.2s after= 3.1s speedup=28.1x
[bench] 03_zorder_clustering before=60.1s after= 5.4s speedup=11.1x
[bench] 04_broadcast_join before=48.7s after= 4.9s speedup= 9.9x
[bench] 05_aqe_skew before=120.3s after= 39.8s speedup= 3.0x
[bench] 06_skew_salting before=240.1s after= 9.2s speedup=26.1x
[bench] 07_executor_tuning before=58.4s after= 22.1s speedup= 2.6x
[bench] 08_caching before=36.9s after= 7.8s speedup= 4.7x
[bench] 09_predicate_pushdown before=45.7s after= 6.3s speedup= 7.3x
[bench] 10_storage_formats before=82.3s after= 9.1s speedup= 9.0x
These numbers are from my M-series laptop. Your mileage will vary, but the relative improvements tend to hold across hardware.
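For readers who want to reproduce a before/after line by hand, the measurement boils down to something like the sketch below. It is a hypothetical stand-in, not the harness shipped in src/playbook/: the function names are mine, it takes the median of a few runs to damp warm-up noise, and the report format only approximates the output above.

```python
import statistics
import time


def time_it(fn, repeats=3):
    """Run fn several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(before_s, after_s):
    """Ratio of baseline time to optimized time."""
    if after_s <= 0:
        raise ValueError("after_s must be positive")
    return before_s / after_s


def report_line(name, before_s, after_s):
    """Format one result in a style similar to the benchmark output."""
    return (f"[bench] {name} before={before_s:.1f}s "
            f"after={after_s:.1f}s speedup={speedup(before_s, after_s):.1f}x")


if __name__ == "__main__":
    # Toy workloads standing in for the unoptimized and optimized Spark jobs.
    slow = time_it(lambda: sum(i * i for i in range(200_000)))
    fast = time_it(lambda: sum(i * i for i in range(20_000)))
    print(report_line("toy_example", slow, fast))
```

With Spark, remember that transformations are lazy: the timed function needs to end in an action (a write or a count, for example), otherwise only plan construction is measured.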
spark-performance-optimization-playbook/
├── notebooks/ # Jupyter notebooks — one per technique
├── src/playbook/ # Reusable utilities (benchmark harness, etc.)
├── data/generators/ # Synthetic data generators
├── benchmarks/ # Automated benchmark scripts
├── docs/ # Deep-dive articles per technique
├── scripts/ # Local dev helpers
└── infra/ # Reference IaC snippets (EMR, Glue, Databricks)
# Just partition pruning
python -m benchmarks.run_bench --technique partition_pruning
# With custom data scale
python -m benchmarks.run_bench --technique broadcast_join --scale 5x
# Compare before/after side by side (includes Spark UI metrics)
python -m benchmarks.compare --technique zorder_clustering --output report.md

The default is PySpark local mode (embedded Spark, no cluster). Everything runs on one machine, which is good enough to observe the ratios but not the absolute numbers.
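For orientation, a local-mode session of the kind described here can be built roughly as follows. This is a sketch under assumptions, not the repo's actual setup: `LOCAL_CONF`, `build_local_session`, the app name, and the shuffle partition count are illustrative choices of mine. The configuration keys themselves are standard Spark SQL settings. PySpark is imported lazily so the module still loads on a machine without Spark installed.

```python
# Illustrative settings for a laptop run (assumption: not the repo defaults).
LOCAL_CONF = {
    # Adaptive Query Execution and its skew / partition-coalescing features.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # The default of 200 shuffle partitions is oversized for a small local dataset.
    "spark.sql.shuffle.partitions": "16",
}


def build_local_session(app_name="playbook-bench", cores="*", conf=None):
    """Create (or reuse) an embedded SparkSession running in local mode.

    cores="*" uses every available core; pass a number to cap it.
    Entries in conf override LOCAL_CONF.
    """
    # Lazy import: requires `pip install pyspark` only when actually called.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(f"local[{cores}]").appName(app_name)
    for key, value in {**LOCAL_CONF, **(conf or {})}.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    spark = build_local_session()
    print(spark.conf.get("spark.sql.adaptive.enabled"))
    spark.stop()
```

Local mode runs the driver and executors in a single JVM, so shuffles never cross a network. That is why the ratios carry over better than the absolute timings.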
For realistic distributed behavior, spin up the included Docker Compose (1 driver + 3 workers):
docker compose -f compose/docker-compose.yml up -d
# Spark UI at http://localhost:4040 during jobs

- Data engineers tuning Glue/EMR/Databricks jobs and unsure which lever matters most
- Analytics engineers hitting perf walls with dbt + Spark
- Platform teams writing internal tuning guides
- Interview candidates preparing for Spark-heavy system design rounds
- ❌ A Spark introduction: assumes you know DataFrame, join, window, and partitioning basics
- ❌ A vendor-specific tuning guide: patterns apply across platforms
- ❌ A list of configs without explanation: every setting is paired with the internals
- Spark SQL Performance Tuning (canonical)
- Databricks: Adaptive Query Execution
- Holden Karau & Rachel Warren, High Performance Spark (O'Reilly)
- Jacek Laskowski, Mastering Spark SQL (free online book)
MIT — see LICENSE.