Skip to content

Revert "Implement write.parquet.row-group-size-bytes in the pyarrow w…#13

Closed
stephrb wants to merge 3 commits into
mainfrom
revert-12-sbuck/implement-row-group-size-bytes
Closed

Revert "Implement write.parquet.row-group-size-bytes in the pyarrow w…#13
stephrb wants to merge 3 commits into
mainfrom
revert-12-sbuck/implement-row-group-size-bytes

Conversation

@stephrb

@stephrb stephrb commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

intended to merge to develop instead of main

Stephen Buck and others added 3 commits June 3, 2026 10:00
The pyiceberg writer has historically ignored
write.parquet.row-group-size-bytes (logging 'not implemented') and used
only write.parquet.row-group-limit (rows). For wide tables that means a
single row group ends up at gigabytes — e.g. 337 cols × 1,048,576 default
rows ≈ 1.7 GiB uncompressed per row group — which drives the polars /
pyarrow reader's decode peak into the tens of GiB on production reads.

Now write_file resolves row_group_size as
min(row_group_limit, row_group_size_bytes / bytes_per_row), where
bytes_per_row is approximated from the in-memory arrow_table's nbytes.
This matches Spark / parquet-mr 'whichever limit fires first' semantics
and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB)
actually take effect.
…e-bytes

Implement write.parquet.row-group-size-bytes in the pyarrow writer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant