I want to pull together some related threads and frame a bigger picture.
I've been working on the File Format API (#3100, PR #3119) to decouple format handling from the write path.
While tracing through both the read and write paths I noticed that
PyArrow is hardwired as the only execution engine, and that bottleneck shows up in multiple places:
- ArrowScan.to_record_batches with multithreaded workers
- scan_iceberg() already bypasses PyArrow for file reads

Context
Right now every output method goes through to_arrow():
DuckDB has its own highly optimized C++ Parquet reader,
and DataFusion and Polars have Rust-native ones. None of them can use their own readers today;
they all get a pre-materialized pa.Table from PyArrow.
Currently, in the code
If you look at the read path, there's a clean split point:
plan_files() returns Iterable[FileScanTask] — file paths, delete files, partition info —
all engine-agnostic. The engine-specific part starts when ArrowScan takes those tasks and actually reads the Parquet files.
Polars already does this: scan_iceberg() calls PyIceberg for the plan,
then reads the files with its own Rust engine. The idea is to formalize that handoff.
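To make that split point concrete, here is a minimal sketch of the handoff. The FileScanTask fields are a simplified stand-in for pyiceberg's actual class, and the hardcoded plan is purely illustrative; in real code the tasks would come from DataScan.plan_files().

```python
from dataclasses import dataclass, field

# Simplified stand-in for pyiceberg's FileScanTask: everything an engine
# needs to read one data file, with no PyArrow types involved.
@dataclass
class FileScanTask:
    file_path: str
    delete_files: list[str] = field(default_factory=list)
    partition: dict[str, str] = field(default_factory=dict)

def plan_files() -> list[FileScanTask]:
    # In real PyIceberg this would come from DataScan.plan_files();
    # hardcoded here so the sketch is runnable.
    return [
        FileScanTask("s3://bucket/tbl/data/00000.parquet", partition={"day": "2024-01-01"}),
        FileScanTask("s3://bucket/tbl/data/00001.parquet", partition={"day": "2024-01-02"}),
    ]

# Any engine (PyArrow, DuckDB, DataFusion, ...) only needs the task list;
# how it actually reads the Parquet bytes is entirely its own business.
def describe_scan(tasks: list[FileScanTask]) -> list[str]:
    return [f"read {t.file_path} (deletes: {len(t.delete_files)})" for t in tasks]

print(describe_scan(plan_files()))
```

The point of the sketch: everything up to and including plan_files() is engine-agnostic metadata work, and everything after it is replaceable.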
What this could look like
I propose something like an IcebergEngine ABC following the same pattern as Catalog (ABC + factory + registration):
PyArrowEngine would wrap the existing ArrowScan with zero behavior change. DataScan.to_arrow() would delegate to the engine.
Something like a table property pyiceberg.engine would select the engine (default "pyarrow").
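A minimal sketch of the ABC + factory + registration pattern, mirroring how Catalog resolution works today. All names here (IcebergEngine, register_engine, load_engine, the pyiceberg.engine property) are proposals from this post, not existing pyiceberg APIs, and the PyArrow engine is stubbed:

```python
from abc import ABC, abstractmethod

class IcebergEngine(ABC):
    """Proposed base class: turns planned file scan tasks into output."""

    @abstractmethod
    def to_arrow(self, tasks):
        ...

# Registry keyed by engine name, in the spirit of the Catalog registry.
_ENGINES: dict[str, type[IcebergEngine]] = {}

def register_engine(name: str, cls: type[IcebergEngine]) -> None:
    _ENGINES[name] = cls

def load_engine(properties: dict[str, str]) -> IcebergEngine:
    # The proposed pyiceberg.engine table property selects the
    # implementation, defaulting to the current PyArrow path.
    name = properties.get("pyiceberg.engine", "pyarrow")
    return _ENGINES[name]()

class PyArrowEngine(IcebergEngine):
    def to_arrow(self, tasks):
        # Would delegate to the existing ArrowScan; stubbed for the sketch.
        return f"pyarrow read {len(tasks)} tasks"

register_engine("pyarrow", PyArrowEngine)

engine = load_engine({})  # no property set, falls back to "pyarrow"
print(engine.to_arrow(["t1", "t2"]))
```

Under this shape, DataScan.to_arrow() would call load_engine(table.properties) and delegate, so the default behavior is byte-for-byte the current ArrowScan path.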
The File Format APIs I'm building in #3100 (FileFormatWriter, FileFormatReadBuilder, FileFormatModel, FileFormatFactory) would stay internal
to the PyArrow engine. DuckDB/DataFusion/Polars would use their own readers.
How this relates to #3100
#3100 separates Parquet/ORC/Avro handling inside the PyArrow engine as part of the file format layer. The engine abstraction proposed here would be a layer above:
which engine runs the scan. The idea is FileIO for storage, FileFormatModel for format, IcebergEngine for execution.
Rollout
I'd like to land #3100 first since it establishes the pattern. After that, the engine work could start with just the ABC + a PyArrowEngine
that wraps ArrowScan. Prototyping with the preferred next engine (DuckDB/DataFusion) would be next, then the rest.
Questions I'd like input on
Naming: IcebergEngine vs ExecutionBackend vs something else? Java uses module separation, so there's no direct naming precedent.
Granularity: Should read and write be separate interfaces? A DuckDB engine might be read-only initially.
Return type: pa.Table directly, or a lazy RecordBatchIterator with to_arrow()?
Credentials: Engines with their own I/O (DuckDB, DataFusion) need S3/GCS creds from FileIO.properties. What's the cleanest way to propagate them?
Scope: Data path only? Metadata (catalogs, snapshots) is already engine-agnostic, so I don't think it needs change.
First engine after PyArrow: DuckDB and DataFusion both have native Parquet readers and Arrow interop. Any preference on which to prototype first?
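On the credentials question, one low-tech option is a per-engine translation table from FileIO property keys to the engine's native settings. The left-hand keys below are pyiceberg's documented s3.* FileIO properties; the right-hand names follow DuckDB's httpfs S3 settings, but treat the exact mapping as an assumption to verify against the DuckDB docs:

```python
# Maps pyiceberg FileIO property keys to DuckDB httpfs setting names.
# The DuckDB names are assumptions; check them before relying on this.
_FILEIO_TO_DUCKDB = {
    "s3.access-key-id": "s3_access_key_id",
    "s3.secret-access-key": "s3_secret_access_key",
    "s3.session-token": "s3_session_token",
    "s3.endpoint": "s3_endpoint",
    "s3.region": "s3_region",
}

def duckdb_settings(fileio_properties: dict[str, str]) -> dict[str, str]:
    """Translate whatever credentials FileIO was configured with; ignore the rest."""
    return {
        _FILEIO_TO_DUCKDB[k]: v
        for k, v in fileio_properties.items()
        if k in _FILEIO_TO_DUCKDB
    }

props = {"s3.access-key-id": "AKIA...", "s3.region": "us-east-1", "warehouse": "s3://wh"}
print(duckdb_settings(props))
# {'s3_access_key_id': 'AKIA...', 's3_region': 'us-east-1'}
```

The nice property of this shape is that each engine owns its own mapping, so FileIO.properties stays the single source of truth and no engine-specific keys leak into table configuration.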
Happy to write up more detail on any of this or prototype a piece of it.
Mainly looking for feedback on whether this direction makes sense and what the priorities should be.