feat: add concurrent_files flag for bounded concurrent streaming

Add _bounded_concurrent_batches() with proper lock discipline:
- Queue backpressure caps memory (scan.max-buffered-batches, default 16)
- Semaphore limits concurrent file reads (concurrent_files param)
- Cancel event with timeouts on all blocking ops (no lock over IO)
- Error propagation and early termination support

When streaming=True and concurrent_files > 1, batches are yielded
as they arrive from parallel file reads. File ordering is not
guaranteed (documented).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
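
The mechanism listed in the commit message (bounded queue, semaphore, cancel event, error propagation) can be illustrated with a small standalone sketch. This is not the `_bounded_concurrent_batches()` implementation from this commit: the function name `bounded_concurrent_batches_sketch`, the `read_file` callback, and the `poll_timeout` parameter are hypothetical, and starting one thread per file is a simplification chosen for brevity.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator, Sequence


def bounded_concurrent_batches_sketch(
    files: Sequence[Any],
    read_file: Callable[[Any], Iterable[Any]],  # hypothetical: yields one file's batches in order
    concurrent_files: int = 4,
    max_buffered_batches: int = 16,
    poll_timeout: float = 0.05,  # hypothetical: how often blocked workers re-check cancel
) -> Iterator[Any]:
    """Yield batches from many files as they arrive (illustrative sketch only)."""
    out: "queue.Queue[tuple]" = queue.Queue(maxsize=max_buffered_batches)  # backpressure
    slots = threading.Semaphore(concurrent_files)  # caps files being read at once
    cancel = threading.Event()  # set when the consumer stops or an error is raised

    def put(item: tuple) -> bool:
        # Retry with a timeout so a full queue never blocks past cancellation.
        while not cancel.is_set():
            try:
                out.put(item, timeout=poll_timeout)
                return True
            except queue.Full:
                pass
        return False

    def worker(f: Any) -> None:
        acquired = False
        try:
            # Wait for a read slot, waking up periodically to notice cancellation.
            while not acquired and not cancel.is_set():
                acquired = slots.acquire(timeout=poll_timeout)
            if acquired:
                for batch in read_file(f):  # no lock is held while reading
                    if not put(("batch", batch)):
                        break
        except Exception as exc:
            put(("error", exc))  # hand the error to the consumer
        finally:
            if acquired:
                slots.release()
            put(("done", None))

    threads = [threading.Thread(target=worker, args=(f,), daemon=True) for f in files]
    for t in threads:
        t.start()
    remaining = len(threads)
    try:
        while remaining:
            kind, payload = out.get()
            if kind == "done":
                remaining -= 1
            elif kind == "error":
                raise payload
            else:
                yield payload
    finally:
        # Runs on normal completion, on error, and when the consumer closes the generator.
        cancel.set()
        for t in threads:
            t.join()
```

Under these assumptions, each file's batches stay in order because a single worker reads that file sequentially, while batches from different files interleave in whatever order the workers reach the queue. Closing the generator early sets the cancel event, so workers blocked on the semaphore or the full queue exit within roughly one `poll_timeout`.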
mkdocs/docs/api.md
27 additions & 0 deletions
@@ -369,6 +369,22 @@ for buf in tbl.scan().to_arrow_batch_reader(streaming=True, batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```
 
+For maximum throughput, use `concurrent_files` to read multiple files in parallel while streaming. Batches are yielded as they arrive from any file; ordering across files is not guaranteed:
+
+```python
+for buf in tbl.scan().to_arrow_batch_reader(streaming=True, concurrent_files=4, batch_size=1000):
+When using `concurrent_files > 1`, batches from different files may be interleaved. Within each file, batches are always in row order. See the ordering semantics table in the [Apache Arrow section](#apache-arrow) above for details.