Integrate DataFusion as execution engine for compute-heavy operations #3554

### Feature Request / Improvement

# Problem
PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:
- Tables with equality deletes are unreadable (hard `ValueError`)
- CoW deletes OOM on large Parquet files (~1GB)
- CoW overwrite OOMs (same pattern as delete)
- Upsert uses O(n²) row-by-row comparison
- Compaction is not implemented; it requires an external sort, which is infeasible in memory for large tables
- Orphan file deletion OOMs (LEFT ANTI JOIN of millions of paths)

Other operations, listed under Operations Unblocked below, also fail to scale in PyIceberg's typical single-node environment. Together, these gaps block PyIceberg from reaching feature parity with Java Iceberg for V2/V3.
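To make the upsert item concrete, here is a minimal in-memory sketch of the hash-join shape, which replaces the row-by-row comparison with one build pass and one probe pass. The function name `hash_join_upsert` and the list-of-dicts row format are hypothetical, chosen only for illustration; this is not PyIceberg's or DataFusion's API, and it does not show spilling to disk.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def hash_join_upsert(
    target: Sequence[Row], source: Sequence[Row], key_cols: Sequence[str]
) -> Tuple[List[Row], List[Row]]:
    """Classify source rows as updates or inserts (hypothetical helper).

    Build phase: index target rows by join key, O(len(target)).
    Probe phase: one dict lookup per source row, O(len(source)).
    """
    # Build: map each join key to the existing target row.
    index = {tuple(row[c] for c in key_cols): row for row in target}

    updates: List[Row] = []
    inserts: List[Row] = []
    for row in source:
        key = tuple(row[c] for c in key_cols)
        existing = index.get(key)
        if existing is None:
            inserts.append(row)   # key not in target: new row
        elif existing != row:
            updates.append(row)   # key matches but values differ
        # identical rows need no write at all
    return updates, inserts


target_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source_rows = [{"id": 2, "v": "b"}, {"id": 1, "v": "z"}, {"id": 3, "v": "c"}]
updates, inserts = hash_join_upsert(target_rows, source_rows, ["id"])
```

With these inputs, the row for `id` 1 is classified as an update, the row for `id` 3 as an insert, and the unchanged row for `id` 2 is skipped.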

# Proposed Solution
Integrate Apache DataFusion as an optional execution engine (`pip install 'pyiceberg[pyiceberg-core]'`) behind an automatic engine-resolution layer. When it is installed, compute-heavy operations use DataFusion's spill-to-disk execution, which keeps memory bounded. When it is not installed, the existing PyArrow path remains unchanged: it works for small data and still runs out of memory on large data.

No existing behavior changes, and no dependency is forced. The UX is DuckDB-style: the only thing a developer may need to configure is a memory budget, and only if they choose to.
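One possible shape for the engine-resolution layer is sketched below. The names `Engine`, `module_available`, and `resolve_engine` are hypothetical, and probing for a module named `datafusion` is an assumption; the design doc is the authority on the actual mechanism.

```python
import importlib.util
from enum import Enum
from typing import Callable, Optional


class Engine(Enum):
    PYARROW = "pyarrow"
    DATAFUSION = "datafusion"


def module_available(name: str) -> bool:
    """Return True if `name` looks importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def resolve_engine(
    preferred: Optional[Engine] = None,
    probe: Callable[[str], bool] = module_available,
    module_name: str = "datafusion",  # assumption: module to probe for
) -> Engine:
    """Pick an execution engine (hypothetical resolution rules).

    - An explicit PYARROW request is always honoured.
    - DATAFUSION is used whenever the optional dependency is present.
    - An explicit DATAFUSION request without the dependency is an error.
    - Otherwise fall back to the existing PyArrow path.
    """
    if preferred is Engine.PYARROW:
        return Engine.PYARROW
    if probe(module_name):
        return Engine.DATAFUSION
    if preferred is Engine.DATAFUSION:
        raise ImportError(
            f"{module_name!r} is not installed; install the optional extra"
        )
    return Engine.PYARROW
```

Passing `probe` as a parameter keeps the fallback logic testable without installing or uninstalling anything.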

# Design Doc
[Support for PyIceberg DataFusion Integration](https://docs.google.com/document/d/1p3Imyhlw_KZq9asP6Wz9VFj9sny1hcqelY9LC0c6J0Y/edit?usp=sharing)

# Operations Unblocked
- [ ] Equality delete read resolution
- [ ] Streaming CoW delete/overwrite
- [ ] Table compaction (sort + rewrite)
- [ ] Orphan file deletion
- [ ] Upsert via hash join
- [ ] Equality-to-positional conversion
- [ ] Position delete compaction
- [ ] Full MoR compaction
- [ ] Z-Order / Hilbert sorting
- [ ] DV compaction
- [ ] Incremental compaction
- [ ] Sort-order enforcement on write
- [ ] Dynamic partition overwrite (bounded memory)
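For the orphan file deletion item, the underlying LEFT ANTI JOIN can be illustrated in plain Python: keep every path found in storage that no table metadata references. The function name `find_orphans` and the example paths are hypothetical. This version holds the referenced set in memory, which is exactly the limit a spilling engine is meant to lift.

```python
from typing import Iterable, Iterator


def find_orphans(
    storage_paths: Iterable[str], referenced_paths: Iterable[str]
) -> Iterator[str]:
    """Yield storage paths that have no match in referenced_paths.

    Equivalent to: storage LEFT ANTI JOIN referenced ON path.
    """
    referenced = set(referenced_paths)  # build side, held in memory
    for path in storage_paths:          # probe side, streamed
        if path not in referenced:
            yield path


listed = ["s3://bucket/t/data/a.parquet", "s3://bucket/t/data/b.parquet"]
referenced = ["s3://bucket/t/data/a.parquet"]
orphans = list(find_orphans(listed, referenced))
```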

# Related Issues

## PyIceberg
- #1078 (MoR support epic)
- #1210 / #3270 (equality delete reads)
- #3356 (execution path isolation)
- #1092 (data compaction)
- #1200 (orphan file deletion)
- #3285 (`DeleteFileIndex` for equality deletes)
- #3319 / #3320 (commit retry, prerequisite for safe compaction commits)
- #3130 / #3131 (`REPLACE` API, prerequisite for compaction)
- #1818 (V3 tracking, DV compaction)
- #1808 (positional delete write support)
- #2918 (`DeleteFileIndex` for positional deletes, merged foundation)

## iceberg-rust
- [iceberg-rust#2186](https://github.com/apache/iceberg-rust/issues/2186) (MoR scan-side delete reconciliation)
- [iceberg-rust#2205](https://github.com/apache/iceberg-rust/issues/2205) (equality delete reader)
- [iceberg-rust#1530](https://github.com/apache/iceberg-rust/issues/1530) (delete file support in scan)
- [iceberg-rust#2269](https://github.com/apache/iceberg-rust/issues/2269) (DataFusion write actions)

## datafusion-python
- [datafusion-python#1217](https://github.com/apache/datafusion-python/issues/1217) (FFI boundary stability)

