Commit a707099
Stephen Buck
Implement write.parquet.row-group-size-bytes in the pyarrow writer
The pyiceberg writer has historically ignored
write.parquet.row-group-size-bytes (logging 'not implemented') and used
only write.parquet.row-group-limit (rows). For wide tables that means a
single row group ends up at gigabytes — e.g. 337 cols × 1,048,576 default
rows ≈ 1.7 GiB uncompressed per row group — which drives the polars /
pyarrow reader's decode peak into the tens of GiB on production reads.
Now write_file resolves row_group_size as
min(row_group_limit, row_group_size_bytes / bytes_per_row), where
bytes_per_row is approximated from the in-memory arrow_table's nbytes.
This matches Spark / parquet-mr 'whichever limit fires first' semantics
and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB)
actually take effect.1 parent c84017d commit a707099
2 files changed
Lines changed: 104 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2404 | 2404 | | |
2405 | 2405 | | |
2406 | 2406 | | |
| 2407 | + | |
| 2408 | + | |
| 2409 | + | |
| 2410 | + | |
| 2411 | + | |
| 2412 | + | |
| 2413 | + | |
| 2414 | + | |
| 2415 | + | |
| 2416 | + | |
2407 | 2417 | | |
2408 | 2418 | | |
2409 | 2419 | | |
2410 | 2420 | | |
2411 | | - | |
| 2421 | + | |
2412 | 2422 | | |
2413 | 2423 | | |
2414 | 2424 | | |
2415 | 2425 | | |
| 2426 | + | |
| 2427 | + | |
| 2428 | + | |
| 2429 | + | |
| 2430 | + | |
2416 | 2431 | | |
2417 | 2432 | | |
2418 | 2433 | | |
| |||
2436 | 2451 | | |
2437 | 2452 | | |
2438 | 2453 | | |
| 2454 | + | |
2439 | 2455 | | |
2440 | 2456 | | |
2441 | 2457 | | |
| |||
2564 | 2580 | | |
2565 | 2581 | | |
2566 | 2582 | | |
2567 | | - | |
2568 | 2583 | | |
2569 | 2584 | | |
2570 | 2585 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
67 | 67 | | |
68 | 68 | | |
69 | 69 | | |
| 70 | + | |
70 | 71 | | |
71 | 72 | | |
72 | 73 | | |
73 | 74 | | |
74 | 75 | | |
75 | 76 | | |
76 | 77 | | |
| 78 | + | |
77 | 79 | | |
78 | 80 | | |
79 | 81 | | |
| |||
2319 | 2321 | | |
2320 | 2322 | | |
2321 | 2323 | | |
| 2324 | + | |
| 2325 | + | |
| 2326 | + | |
| 2327 | + | |
| 2328 | + | |
| 2329 | + | |
| 2330 | + | |
| 2331 | + | |
| 2332 | + | |
| 2333 | + | |
| 2334 | + | |
| 2335 | + | |
| 2336 | + | |
| 2337 | + | |
| 2338 | + | |
| 2339 | + | |
| 2340 | + | |
| 2341 | + | |
| 2342 | + | |
| 2343 | + | |
| 2344 | + | |
| 2345 | + | |
| 2346 | + | |
| 2347 | + | |
| 2348 | + | |
| 2349 | + | |
| 2350 | + | |
| 2351 | + | |
| 2352 | + | |
| 2353 | + | |
| 2354 | + | |
| 2355 | + | |
| 2356 | + | |
| 2357 | + | |
| 2358 | + | |
| 2359 | + | |
| 2360 | + | |
| 2361 | + | |
| 2362 | + | |
| 2363 | + | |
| 2364 | + | |
| 2365 | + | |
| 2366 | + | |
| 2367 | + | |
| 2368 | + | |
| 2369 | + | |
| 2370 | + | |
| 2371 | + | |
| 2372 | + | |
| 2373 | + | |
| 2374 | + | |
| 2375 | + | |
| 2376 | + | |
| 2377 | + | |
| 2378 | + | |
| 2379 | + | |
| 2380 | + | |
| 2381 | + | |
| 2382 | + | |
| 2383 | + | |
| 2384 | + | |
| 2385 | + | |
| 2386 | + | |
| 2387 | + | |
| 2388 | + | |
| 2389 | + | |
| 2390 | + | |
| 2391 | + | |
| 2392 | + | |
| 2393 | + | |
| 2394 | + | |
| 2395 | + | |
| 2396 | + | |
| 2397 | + | |
| 2398 | + | |
| 2399 | + | |
| 2400 | + | |
| 2401 | + | |
| 2402 | + | |
| 2403 | + | |
| 2404 | + | |
| 2405 | + | |
| 2406 | + | |
| 2407 | + | |
| 2408 | + | |
0 commit comments