Feature Request / Improvement
Problem
PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:
- Tables with equality deletes are unreadable (hard
ValueError)
- CoW deletes OOM on large Parquet files (~1GB)
- CoW overwrite OOMs (same pattern as delete)
- Upsert uses O(n²) row-by-row comparison
- Compaction not implemented, requires external sort (infeasible in-memory for large tables)
- Orphan file deletion OOMs (LEFT ANTI JOIN of millions of paths)
The list goes on (documented below) with operations that don't scale in the typical single-node environment of PyIceberg. These block PyIceberg from achieving feature parity with Java Iceberg for V2/V3.
Proposed Solution
Integrate Apache DataFusion as an optional execution engine (pip install 'pyiceberg[pyiceberg-core]') behind an automatic engine-resolution layer. When installed, compute-heavy operations use DataFusion's spill-to-disk execution (bounded memory). When not installed, the existing PyArrow path remains unchanged (works for small data, OOMs gracefully on large).
No existing behavior changes. No forced dependency. DuckDB-style UX where only a developer only needs to configure a memory budget if they so choose.
Design Doc
Support for PyIceberg DataFusion Integration
Operations Unblocked
Related Issues
PyIceberg
iceberg-rust
datafusion-python
Feature Request / Improvement
Problem
PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:
ValueError)The list goes on (documented below) with operations that don't scale in the typical single-node environment of PyIceberg. These block PyIceberg from achieving feature parity with Java Iceberg for V2/V3.
Proposed Solution
Integrate Apache DataFusion as an optional execution engine (pip install 'pyiceberg[pyiceberg-core]') behind an automatic engine-resolution layer. When installed, compute-heavy operations use DataFusion's spill-to-disk execution (bounded memory). When not installed, the existing PyArrow path remains unchanged (works for small data, OOMs gracefully on large).
No existing behavior changes. No forced dependency. DuckDB-style UX where only a developer only needs to configure a memory budget if they so choose.
Design Doc
Support for PyIceberg DataFusion Integration
Operations Unblocked
Related Issues
PyIceberg
pyiceberg-corefrom thepyarrowextra #3356 (execution path isolation)DeleteFileIndexfor equality deletes)REPLACEAPI, prerequisite for compaction)DeleteFileIndexfor positional deletes, merged foundation)iceberg-rust
datafusion-python