I want to pull together some related threads and frame a bigger picture.
I've been working on the File Format API (#3100, PR #3119) to decouple format handling from the write path.
While tracing through both the read and write paths I noticed that
PyArrow is hardwired as the only execution engine, and that bottleneck shows up in multiple places:
- ArrowScan.to_record_batches with multithreaded workers
- scan_iceberg() already bypasses PyArrow for file reads

Context
Right now every output method goes through to_arrow():
DuckDB has its own highly optimized C++ Parquet reader,
and DataFusion and Polars have Rust-native ones. None of them can use their own readers today;
they all get a pre-materialized pa.Table from PyArrow.
Currently, in the code
If you look at the read path, there's a clean split point:
plan_files() returns Iterable[FileScanTask] — file paths, delete files, partition info —
all engine-agnostic. The engine-specific part starts when ArrowScan takes those tasks and actually reads the Parquet files.
Polars already does this: scan_iceberg() calls PyIceberg for the plan,
then reads the files with its own Rust engine. The idea is to formalize that handoff.
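To make that split point concrete, here is a minimal sketch of the handoff. The FileScanTask fields are a simplified stand-in for pyiceberg's actual class, and the hardcoded plan is purely illustrative; in real code the tasks would come from DataScan.plan_files().

```python
from dataclasses import dataclass, field

# Simplified stand-in for pyiceberg's FileScanTask: everything an engine
# needs to read one data file, with no PyArrow types involved.
@dataclass
class FileScanTask:
    file_path: str
    delete_files: list[str] = field(default_factory=list)
    partition: dict[str, str] = field(default_factory=dict)

def plan_files() -> list[FileScanTask]:
    # In real PyIceberg this would come from DataScan.plan_files();
    # hardcoded here so the sketch is runnable.
    return [
        FileScanTask("s3://bucket/tbl/data/00000.parquet", partition={"day": "2024-01-01"}),
        FileScanTask("s3://bucket/tbl/data/00001.parquet", partition={"day": "2024-01-02"}),
    ]

# Any engine (PyArrow, DuckDB, DataFusion, ...) only needs the task list;
# how it actually reads the Parquet bytes is entirely its own business.
def describe_scan(tasks: list[FileScanTask]) -> list[str]:
    return [f"read {t.file_path} (deletes: {len(t.delete_files)})" for t in tasks]

print(describe_scan(plan_files()))
```

The point of the sketch: everything up to and including plan_files() is engine-agnostic metadata work, and everything after it is replaceable.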
What this could look like
I propose something like an IcebergEngine ABC following the same pattern as Catalog (ABC + factory + registration):
PyArrowEngine would wrap the existing ArrowScan with zero behavior change. DataScan.to_arrow() would delegate to the engine.
Something like a table property pyiceberg.engine would select the engine (default "pyarrow").
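A minimal sketch of the ABC + factory + registration pattern, mirroring how Catalog resolution works today. All names here (IcebergEngine, register_engine, load_engine, the pyiceberg.engine property) are proposals from this post, not existing pyiceberg APIs, and the PyArrow engine is stubbed:

```python
from abc import ABC, abstractmethod

class IcebergEngine(ABC):
    """Proposed base class: turns planned file scan tasks into output."""

    @abstractmethod
    def to_arrow(self, tasks):
        ...

# Registry keyed by engine name, in the spirit of the Catalog registry.
_ENGINES: dict[str, type[IcebergEngine]] = {}

def register_engine(name: str, cls: type[IcebergEngine]) -> None:
    _ENGINES[name] = cls

def load_engine(properties: dict[str, str]) -> IcebergEngine:
    # The proposed pyiceberg.engine table property selects the
    # implementation, defaulting to the current PyArrow path.
    name = properties.get("pyiceberg.engine", "pyarrow")
    return _ENGINES[name]()

class PyArrowEngine(IcebergEngine):
    def to_arrow(self, tasks):
        # Would delegate to the existing ArrowScan; stubbed for the sketch.
        return f"pyarrow read {len(tasks)} tasks"

register_engine("pyarrow", PyArrowEngine)

engine = load_engine({})  # no property set, falls back to "pyarrow"
print(engine.to_arrow(["t1", "t2"]))
```

Under this shape, DataScan.to_arrow() would call load_engine(table.properties) and delegate, so the default behavior is byte-for-byte the current ArrowScan path.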
The File Format APIs I'm building in #3100 (FileFormatWriter, FileFormatReadBuilder, FileFormatModel, FileFormatFactory) would stay internal
to the PyArrow engine. DuckDB/DataFusion/Polars would use their own readers.
How this relates to #3100
#3100 separates Parquet/ORC/Avro handling inside the PyArrow engine as part of the file format layer. The engine abstraction proposed here would be a layer above:
which engine runs the scan. The idea is FileIO for storage, FileFormatModel for format, IcebergEngine for execution.
Rollout
I'd like to land #3100 first since it establishes the pattern. After that, the engine work could start with just the ABC + a PyArrowEngine
that wraps ArrowScan. Prototyping with the preferred next engine (DuckDB/DataFusion) would be next, then the rest.
Questions I'd like input on
Naming: IcebergEngine vs ExecutionBackend vs something else? Java uses module separation, so there's no direct naming precedent.
Granularity: Should read and write be separate interfaces? A DuckDB engine might be read-only initially.
Return type: pa.Table directly, or a lazy RecordBatchIterator with to_arrow()?
Credentials: Engines with their own I/O (DuckDB, DataFusion) need S3/GCS creds from FileIO.properties. What's the cleanest way to propagate them?
Scope: Data path only? Metadata (catalogs, snapshots) is already engine-agnostic, so I don't think it needs change.
First engine after PyArrow: DuckDB and DataFusion both have native Parquet readers and Arrow interop. Any preference on which to prototype first?
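On the credentials question, one low-tech option is a per-engine translation table from FileIO property keys to the engine's native settings. The left-hand keys below are pyiceberg's documented s3.* FileIO properties; the right-hand names follow DuckDB's httpfs S3 settings, but treat the exact mapping as an assumption to verify against the DuckDB docs:

```python
# Maps pyiceberg FileIO property keys to DuckDB httpfs setting names.
# The DuckDB names are assumptions; check them before relying on this.
_FILEIO_TO_DUCKDB = {
    "s3.access-key-id": "s3_access_key_id",
    "s3.secret-access-key": "s3_secret_access_key",
    "s3.session-token": "s3_session_token",
    "s3.endpoint": "s3_endpoint",
    "s3.region": "s3_region",
}

def duckdb_settings(fileio_properties: dict[str, str]) -> dict[str, str]:
    """Translate whatever credentials FileIO was configured with; ignore the rest."""
    return {
        _FILEIO_TO_DUCKDB[k]: v
        for k, v in fileio_properties.items()
        if k in _FILEIO_TO_DUCKDB
    }

props = {"s3.access-key-id": "AKIA...", "s3.region": "us-east-1", "warehouse": "s3://wh"}
print(duckdb_settings(props))
# {'s3_access_key_id': 'AKIA...', 's3_region': 'us-east-1'}
```

The nice property of this shape is that each engine owns its own mapping, so FileIO.properties stays the single source of truth and no engine-specific keys leak into table configuration.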
Happy to write up more detail on any of this or prototype a piece of it.
Mainly looking for feedback on whether this direction makes sense and what the priorities should be.