
Commit 23a9e88

Add: Count records per partition file in Spark
Closes #232

1 file changed

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
---
id: bcafec775a
question: How many records are stored in each partition/parquet file when writing
  a Spark DataFrame with repartition?
sort_order: 65
---

When you repartition a DataFrame and write it to Parquet, Spark writes one Parquet file per partition. The total row count N is spread across those files, so each partition file holds roughly N / num_partitions rows, where num_partitions is the number of partitions you repartitioned to. With `repartition(n)` (no column arguments) Spark distributes rows round-robin, so the per-file counts come out nearly uniform; with `repartition(col)` rows are hash-partitioned on the column values, and skewed data can produce uneven files.

Example:
```
df.repartition(4).write.parquet("output/")
```
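
For a concrete, self-contained illustration, here is a minimal PySpark sketch; the 1,000,000-row figure and the "partition-counts" app name are illustrative assumptions, not from the original question:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-counts").getOrCreate()

# 1,000,000 rows split round-robin into 4 partitions: each of the
# 4 Parquet files should hold roughly 1,000,000 / 4 = 250,000 rows.
df = spark.range(1_000_000)
df.repartition(4).write.mode("overwrite").parquet("output/")
```
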
To see how many rows are in each partition file, read the output and count rows per input file:
```
spark.read.parquet("output/").groupBy(input_file_name()).count().show()
```
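
If you want to check the distribution before writing anything, here is a sketch that counts rows per in-memory partition with `spark_partition_id()`, assuming the `df` from the example above:

```
from pyspark.sql.functions import spark_partition_id

# Tag each row with the id of the partition it lives in, then count
# rows per partition; with repartition(4) the four counts should be
# nearly equal.
df.repartition(4).withColumn("pid", spark_partition_id()).groupBy("pid").count().show()
```
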
Notes:
- The function `input_file_name()` helps identify which file a row came from. You may need to import it in PySpark:
```
from pyspark.sql.functions import input_file_name
```
- The counts shown by the command above correspond to the quiz options and will vary with the dataset size and the number of partitions you write. If you want more uniform file sizes, adjust the number of partitions, or choose between `coalesce` and `repartition` as appropriate (see the sketch below).
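
A rough sketch of that last point, reusing the `df` and `spark` from the examples above; the output paths `output_repartitioned/` and `output_coalesced/` are hypothetical names. `repartition(n)` performs a full shuffle and yields n near-uniform partitions, while `coalesce(n)` merges existing partitions without a shuffle, which is cheaper but can leave uneven files:

```
from pyspark.sql.functions import input_file_name

# Full shuffle: 8 near-uniform partitions -> 8 similar-sized files.
df.repartition(8).write.mode("overwrite").parquet("output_repartitioned/")

# No shuffle: existing partitions are merged down to 2, so the two
# resulting files may differ noticeably in size.
df.coalesce(2).write.mode("overwrite").parquet("output_coalesced/")

# Compare the per-file row counts of the two layouts.
for path in ("output_repartitioned/", "output_coalesced/"):
    spark.read.parquet(path).groupBy(input_file_name()).count().show(truncate=False)
```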
