Commit 2730aef

Add: Spark writes multiple parquet files after repartition
Closes #237
1 parent fc00fb4 commit 2730aef

1 file changed

Lines changed: 13 additions & 0 deletions

---
id: a71d2105aa
question: Why does Spark write multiple parquet files after repartitioning a DataFrame?
sort_order: 64
---

Spark processes data in partitions. When you write a DataFrame to disk, Spark writes each partition as a separate output file. For example:

```python
trips.repartition(4).write.parquet("output/")
```

This creates four parquet files because the DataFrame now has four partitions. This behavior enables Spark to write data in parallel and can improve performance on large datasets.
