Commit 706390f
# Rationale for this change
Closes #2152, addresses the long-standing memory problem reported in
#1004 and re-discovered by dlt-hub#3753.
`Table.append(df)` and `Table.overwrite(df)` currently require a fully
materialised `pa.Table`. For large or unbounded inputs this means
loading the entire dataset into memory before writing — fatal at any
non-trivial scale and a recurring complaint going back to #1004 (Aug
2024). The reference Java implementation has streaming append;
iceberg-go shipped it in iceberg-go#369 (Apr 2025). Python is the last
major SDK without it.
This PR adds `pa.RecordBatchReader` as a valid input to
`Table.append/overwrite` (and `Transaction.append/overwrite`). The
reader is consumed lazily, microbatched into Parquet files via the new
`bin_pack_record_batches` helper, and committed in a single snapshot via
the existing `fast_append` pipeline.
```python
reader = pa.RecordBatchReader.from_batches(schema, batch_iter)
tbl.append(reader) # ← streams, doesn't materialise
tbl.overwrite(reader) # ← also supported
```
## Scope (unpartitioned only)
Streaming into a partitioned table raises `NotImplementedError` pointing
back to #2152. Partitioned support is genuinely the harder case — it
needs design discussion around partition cardinality bounds,
per-partition rolling writers, and idempotency on retry — so I'm
proposing to land in three reviewable PRs:
1. **This PR** — API + unpartitioned + buffered byte-budget bin-packing.
2. **PR2 (next)** — switch internals to a rolling `pq.ParquetWriter` +
`OutputStream.tell()` for constant-memory streaming. No API change.
Detailed plan below.
3. **PR3 (later)** — partitioned streaming, after design discussion on
#2152.
This mirrors iceberg-go#369's staging: ship the unpartitioned API first,
iterate from there.
## Implementation
The streaming path reuses the existing `WriteTask` → `write_file` →
`fast_append` pipeline. The only new primitive is
`bin_pack_record_batches` (sibling of the existing
`bin_pack_arrow_table`):
- Accumulates incoming `RecordBatch`es into an in-memory buffer.
- Flushes when `sum(batch.nbytes) >= write.target-file-size-bytes`.
- Each flushed buffer becomes one parquet file via the existing
`write_parquet` task.
- Schema check (`_check_pyarrow_schema_compatible`) runs against
`reader.schema` before the snapshot producer opens — schema mismatches
fail before any data file is written, so no orphans.
## Acknowledged trade-offs
**Memory**: peak memory is bounded by `N_workers ×
write.target-file-size-bytes` (default 8 × 512 MiB ≈ 4 GiB), not
constant. This is materially better than today's "materialise
everything" but isn't yet "constant memory streaming". PR2 fixes this.
**Byte semantics**: `write.target-file-size-bytes` is currently
interpreted as **uncompressed in-memory Arrow bytes**
(`RecordBatch.nbytes` — the bin-packing weight), not compressed on-disk
Parquet bytes. The resulting files are typically 3-10× smaller than the
property suggests after zstd / dictionary / RLE encoding. This matches
the existing `pa.Table` write path (`bin_pack_arrow_table` uses the same
accounting) — this PR doesn't change pyiceberg's existing semantics, it
only documents them in the docstrings of both helpers and the
`Transaction.append/overwrite` `Note:` blocks. PR2 fixes this too.
**Retry**: `pa.RecordBatchReader` is single-pass, so a failed catalog
commit leaves the reader drained and a naive retry writes zero rows.
Documented in the `Note:` block — callers needing at-least-once
semantics should reconstruct the reader on each attempt via a factory
callable, or use the two-stage `add_files` pattern (whose input is a
replayable list of paths).
## PR2 — proposed scope (FYI, not in this PR)
Drop the buffer-and-flush approach and use a rolling `pq.ParquetWriter`
driven by `OutputStream.tell()` (added in #2998 specifically for this
kind of use case):
```python
# sketch
writer = pq.ParquetWriter(fos, schema, **kwargs)
for batch in reader:
writer.write_batch(batch)
if fos.tell() >= target_file_size: # compressed on-disk bytes
writer.close()
finalize_data_file(...)
# open next file
fos = io.new_output(next_path).create(overwrite=True)
writer = pq.ParquetWriter(fos, schema, **kwargs)
writer.close()
```
What this delivers:
- **Constant memory**: `O(1 batch)` per worker (~10s of MB) regardless
of `target_file_size`. The 4 GiB peak in this PR drops to ~50-100 MB.
- **Spec-correct byte semantics**: `write.target-file-size-bytes`
becomes actual on-disk compressed bytes, matching the Java/Spark/Flink
writers and the spec.
- **No public API change**: same `tx.append(reader)` /
`tx.overwrite(reader)` — internals only.
Open design questions for PR2 (will surface on the issue thread before
coding):
- **Parallelism**: a single rolling writer is serial. Either accept that
for streaming (memory-vs-throughput trade), or add a hybrid (N rolling
writers fed via a queue) and pick a default that matches today's
`executor.map(write_parquet, tasks)` parallelism.
- **Backwards compat**: switching `bin_pack_arrow_table` to the same
rolling-writer mechanism would also tighten the `pa.Table` path's byte
semantics. That changes file-size characteristics for every existing
pyiceberg writer. Probably worth a separate change with a deprecation
note, or a feature flag.
- **`add_files` interaction**: rolling writes produce data files we know
about directly; we shouldn't go through the parquet-footer round-trip in
`_dataframe_to_data_files`. Means a small refactor in the streaming-only
path.
## Are these changes tested?
Yes, comprehensively at four layers.
**1. Unit tests** (`tests/io/test_pyarrow.py`) — 4 new tests for
`bin_pack_record_batches` covering single-bin, microbatched, empty
input, and lazy generator consumption.
**2. End-to-end behaviour tests**
(`tests/catalog/test_catalog_behaviors.py`) — 8 new tests parametrised
across all three in-process catalog backends (`memory`, `sql`,
`sql_without_rowcount`) → 24 test runs covering append, overwrite,
microbatch verification (multiple files in one snapshot), empty reader,
partitioned-table-raises, invalid-input-rejected,
reader-consumed-exactly-once, and schema-mismatch-writes-no-files.
**3. Integration tests**
(`tests/integration/test_writes/test_writes.py`) — 6 new Spark-readback
tests for v1 + v2 format versions covering append, overwrite, and
multi-file microbatch. Proves Spark can read tables written via the
streaming path against the docker-compose stack.
**4. Smoke test on a real production stack** — verified end-to-end
against AWS Glue + S3 in our staging account: 100 k-row streaming append
in 17 s, 20-file microbatched commit, Athena read-back (`COUNT(*)` and
`MAX(id)` matched the input exactly), schema-mismatch rejection leaving
no orphan files.
Full unit suite: 3 647 passed. Full integration suite: 122 passed, 1
skipped.
## Are there any user-facing changes?
Yes, intentionally:
- `Transaction.append(df)`, `Transaction.overwrite(df)`,
`Table.append(df)`, `Table.overwrite(df)` accept `pa.Table |
pa.RecordBatchReader`.
- The `ValueError` raised on bad input changes from `"Expected PyArrow
table, got: ..."` to `"Expected pa.Table or pa.RecordBatchReader, got:
..."`. Updated `test_invalid_arguments` accordingly.
- New module-level helper `bin_pack_record_batches` in
`pyiceberg.io.pyarrow` (sibling of `bin_pack_arrow_table`).
- `bin_pack_arrow_table` gained its first docstring, documenting the
existing uncompressed-Arrow-bytes accounting.
- Docs: new "Streaming writes from a RecordBatchReader" subsection in
`mkdocs/docs/api.md`.
- Docstrings on `Transaction.append/overwrite` document retry semantics
and the byte-semantics caveat.
## Related
- Closes #2152
- Addresses #1004 (closed by reporter without a fix)
- Reference implementation: iceberg-go#369
- Downstream consumer hitting the same problem: dlt-hub/dlt#3753
(independent rediscovery of the same approach)
- Builds on the maintainer-blessed pattern from #1742's review
(`_dataframe_to_data_files` + `fast_append.append_data_file()`, no
separate `write_parquet` API)
- Companion fix (already merged separately): test-state isolation in
`test_write_optional_list`
- PR2 will build on `OutputStream.tell()` from #2998
---------
Co-authored-by: Paul Mathew <paul.mathew@aircall.io>
1 parent 43d1f1f commit 706390f
7 files changed
Lines changed: 498 additions & 30 deletions
File tree
- mkdocs/docs
- pyiceberg
- io
- table
- tests
- catalog
- integration/test_writes
- io
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
365 | 365 | | |
366 | 366 | | |
367 | 367 | | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
| 373 | + | |
| 374 | + | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
368 | 379 | | |
369 | 380 | | |
370 | 381 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2675 | 2675 | | |
2676 | 2676 | | |
2677 | 2677 | | |
| 2678 | + | |
| 2679 | + | |
| 2680 | + | |
| 2681 | + | |
| 2682 | + | |
| 2683 | + | |
| 2684 | + | |
| 2685 | + | |
| 2686 | + | |
| 2687 | + | |
| 2688 | + | |
| 2689 | + | |
2678 | 2690 | | |
2679 | 2691 | | |
2680 | 2692 | | |
| |||
2690 | 2702 | | |
2691 | 2703 | | |
2692 | 2704 | | |
| 2705 | + | |
| 2706 | + | |
| 2707 | + | |
| 2708 | + | |
| 2709 | + | |
| 2710 | + | |
| 2711 | + | |
| 2712 | + | |
| 2713 | + | |
| 2714 | + | |
| 2715 | + | |
| 2716 | + | |
| 2717 | + | |
| 2718 | + | |
| 2719 | + | |
| 2720 | + | |
| 2721 | + | |
| 2722 | + | |
| 2723 | + | |
| 2724 | + | |
| 2725 | + | |
| 2726 | + | |
| 2727 | + | |
| 2728 | + | |
| 2729 | + | |
| 2730 | + | |
| 2731 | + | |
| 2732 | + | |
| 2733 | + | |
| 2734 | + | |
| 2735 | + | |
| 2736 | + | |
| 2737 | + | |
| 2738 | + | |
| 2739 | + | |
2693 | 2740 | | |
2694 | 2741 | | |
2695 | 2742 | | |
| |||
2809 | 2856 | | |
2810 | 2857 | | |
2811 | 2858 | | |
2812 | | - | |
| 2859 | + | |
2813 | 2860 | | |
2814 | 2861 | | |
2815 | 2862 | | |
2816 | 2863 | | |
2817 | | - | |
| 2864 | + | |
| 2865 | + | |
| 2866 | + | |
| 2867 | + | |
| 2868 | + | |
| 2869 | + | |
| 2870 | + | |
| 2871 | + | |
| 2872 | + | |
| 2873 | + | |
2818 | 2874 | | |
2819 | 2875 | | |
2820 | | - | |
| 2876 | + | |
2821 | 2877 | | |
2822 | 2878 | | |
2823 | 2879 | | |
| |||
2837 | 2893 | | |
2838 | 2894 | | |
2839 | 2895 | | |
| 2896 | + | |
| 2897 | + | |
| 2898 | + | |
| 2899 | + | |
| 2900 | + | |
| 2901 | + | |
| 2902 | + | |
| 2903 | + | |
| 2904 | + | |
| 2905 | + | |
| 2906 | + | |
| 2907 | + | |
| 2908 | + | |
| 2909 | + | |
| 2910 | + | |
| 2911 | + | |
| 2912 | + | |
2840 | 2913 | | |
2841 | 2914 | | |
2842 | 2915 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
450 | 450 | | |
451 | 451 | | |
452 | 452 | | |
453 | | - | |
| 453 | + | |
| 454 | + | |
| 455 | + | |
| 456 | + | |
| 457 | + | |
| 458 | + | |
454 | 459 | | |
455 | | - | |
| 460 | + | |
| 461 | + | |
| 462 | + | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
| 466 | + | |
| 467 | + | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
| 482 | + | |
| 483 | + | |
| 484 | + | |
| 485 | + | |
| 486 | + | |
| 487 | + | |
| 488 | + | |
| 489 | + | |
| 490 | + | |
| 491 | + | |
| 492 | + | |
| 493 | + | |
| 494 | + | |
| 495 | + | |
| 496 | + | |
456 | 497 | | |
457 | 498 | | |
458 | | - | |
| 499 | + | |
459 | 500 | | |
460 | 501 | | |
461 | 502 | | |
| |||
466 | 507 | | |
467 | 508 | | |
468 | 509 | | |
469 | | - | |
470 | | - | |
| 510 | + | |
| 511 | + | |
471 | 512 | | |
472 | 513 | | |
473 | 514 | | |
| |||
478 | 519 | | |
479 | 520 | | |
480 | 521 | | |
481 | | - | |
482 | | - | |
483 | | - | |
484 | | - | |
485 | | - | |
486 | | - | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
487 | 530 | | |
488 | 531 | | |
489 | 532 | | |
| |||
555 | 598 | | |
556 | 599 | | |
557 | 600 | | |
558 | | - | |
| 601 | + | |
559 | 602 | | |
560 | 603 | | |
561 | 604 | | |
562 | 605 | | |
563 | 606 | | |
564 | 607 | | |
565 | | - | |
| 608 | + | |
| 609 | + | |
| 610 | + | |
| 611 | + | |
| 612 | + | |
| 613 | + | |
| 614 | + | |
| 615 | + | |
| 616 | + | |
| 617 | + | |
| 618 | + | |
| 619 | + | |
| 620 | + | |
| 621 | + | |
| 622 | + | |
| 623 | + | |
| 624 | + | |
| 625 | + | |
| 626 | + | |
| 627 | + | |
| 628 | + | |
| 629 | + | |
| 630 | + | |
| 631 | + | |
| 632 | + | |
| 633 | + | |
| 634 | + | |
| 635 | + | |
| 636 | + | |
| 637 | + | |
| 638 | + | |
| 639 | + | |
| 640 | + | |
| 641 | + | |
| 642 | + | |
| 643 | + | |
| 644 | + | |
566 | 645 | | |
567 | 646 | | |
568 | 647 | | |
| |||
571 | 650 | | |
572 | 651 | | |
573 | 652 | | |
574 | | - | |
| 653 | + | |
575 | 654 | | |
576 | 655 | | |
577 | 656 | | |
| |||
585 | 664 | | |
586 | 665 | | |
587 | 666 | | |
588 | | - | |
589 | | - | |
| 667 | + | |
| 668 | + | |
590 | 669 | | |
591 | 670 | | |
592 | 671 | | |
| |||
606 | 685 | | |
607 | 686 | | |
608 | 687 | | |
609 | | - | |
610 | | - | |
| 688 | + | |
| 689 | + | |
611 | 690 | | |
612 | 691 | | |
613 | 692 | | |
| |||
1373 | 1452 | | |
1374 | 1453 | | |
1375 | 1454 | | |
1376 | | - | |
| 1455 | + | |
| 1456 | + | |
| 1457 | + | |
| 1458 | + | |
| 1459 | + | |
| 1460 | + | |
1377 | 1461 | | |
1378 | | - | |
| 1462 | + | |
| 1463 | + | |
| 1464 | + | |
| 1465 | + | |
| 1466 | + | |
1379 | 1467 | | |
1380 | 1468 | | |
1381 | | - | |
| 1469 | + | |
1382 | 1470 | | |
1383 | 1471 | | |
1384 | 1472 | | |
| |||
1401 | 1489 | | |
1402 | 1490 | | |
1403 | 1491 | | |
1404 | | - | |
| 1492 | + | |
1405 | 1493 | | |
1406 | 1494 | | |
1407 | 1495 | | |
1408 | 1496 | | |
1409 | 1497 | | |
1410 | 1498 | | |
1411 | | - | |
| 1499 | + | |
| 1500 | + | |
| 1501 | + | |
| 1502 | + | |
| 1503 | + | |
1412 | 1504 | | |
1413 | 1505 | | |
1414 | 1506 | | |
| |||
1417 | 1509 | | |
1418 | 1510 | | |
1419 | 1511 | | |
1420 | | - | |
| 1512 | + | |
1421 | 1513 | | |
1422 | 1514 | | |
1423 | 1515 | | |
| |||
0 commit comments