feat: add streaming flag to ArrowScan.to_record_batches
When `streaming=True`, batches are yielded as they are produced by PyArrow
without materializing entire files into memory. Files are still processed
sequentially, preserving file ordering. The inner method handles the
global limit correctly when called with all tasks, avoiding double-counting.
This addresses the OOM issue in #3036 for single-file-at-a-time streaming.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
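The behavioral difference the commit message describes can be sketched as follows. This is an illustrative sketch only, not pyiceberg's actual `ArrowScan` code; `read_file` and `to_record_batches` here are simplified stand-ins for the real per-file PyArrow reader and the real method:

```python
# Illustrative sketch -- not pyiceberg's actual internals. It contrasts
# materializing each file's batches with streaming them as produced.
from typing import Iterator, List


def read_file(name: str) -> Iterator[str]:
    """Stand-in for PyArrow yielding record batches from one data file."""
    for i in range(3):
        yield f"{name}-batch{i}"


def to_record_batches(files: List[str], streaming: bool = False) -> Iterator[str]:
    if streaming:
        # Yield each batch as soon as it is produced; at most one batch
        # is alive at a time, so memory use stays bounded per file.
        for f in files:
            yield from read_file(f)
    else:
        # Materialize a whole file's batches before yielding any of them,
        # which is what can exhaust memory on very large files.
        for f in files:
            batches = list(read_file(f))
            yield from batches


# Files are processed sequentially in both modes, preserving ordering.
print(list(to_record_batches(["a", "b"], streaming=True)))
```

Both modes yield the same batches in the same order; only the peak memory profile differs.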
mkdocs/docs/api.md: 16 additions & 0 deletions
@@ -362,6 +362,13 @@ for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```
 
+
+By default, each file's batches are materialized in memory before being yielded. For large files that may exceed available memory, use `streaming=True` to yield batches as they are produced without materializing entire files:
+
+```python
+for buf in tbl.scan().to_arrow_batch_reader(streaming=True, batch_size=1000):
+    print(f"Buffer contains {len(buf)} rows")
+```
 
 To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:
 
 ```python
@@ -1635,6 +1642,15 @@ table.scan(
 ).to_arrow_batch_reader(batch_size=1000)
 ```
 
+
+Use `streaming=True` to avoid materializing entire files in memory. This yields batches as they are produced by PyArrow, one file at a time:
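The commit message also notes that the global row limit must be applied once across all tasks rather than per file, to avoid double-counting. A minimal sketch of that bookkeeping, with hypothetical names (not pyiceberg's real implementation), treating each batch as a plain list of rows:

```python
# Hypothetical sketch of one global row limit shared across sequentially
# streamed files; not pyiceberg's actual code.
from typing import Iterator, List, Sequence


def stream_with_limit(files: Sequence[List[List[int]]], limit: int) -> Iterator[List[int]]:
    """Yield batches (lists of rows) file by file until `limit` rows are emitted."""
    remaining = limit
    for batches in files:              # files are processed in order
        for batch in batches:
            if remaining <= 0:
                return
            batch = batch[:remaining]  # truncate the batch that crosses the limit
            remaining -= len(batch)
            yield batch


files = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
rows = [r for b in stream_with_limit(files, limit=5) for r in b]
print(rows)  # [1, 2, 3, 4, 5] -- the limit is counted once across both files
```

Because `remaining` is decremented across file boundaries, the second file only contributes the one row still allowed, rather than restarting the count.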