Feature Request / Improvement
Problem Statement
When different compute engines or hardware accelerators (e.g., GPU via RAPIDS Accelerator and CPU) read the same Iceberg table concurrently, they need different values for read.split.target-size. GPU readers benefit from significantly larger splits (e.g., 2GB) to saturate device memory bandwidth, while CPU readers perform better with the current default (128MB).
Today there are only two ways to control read.split.target-size:
- Table property (read.split.target-size) -- requires DDL (ALTER TABLE ... SET TBLPROPERTIES), affects all readers globally, and is unsuitable when GPU and CPU workloads hit the same table simultaneously.
- Read option (split-size) -- requires source code changes to every DataFrameReader call site, which is impractical for ad-hoc SQL queries and shared notebooks. Hardware accelerators like RAPIDS Accelerator for Apache Spark are designed as drop-in replacements that require no application code changes, so requiring per-call-site read options defeats this benefit.
Neither approach allows a Spark session to declare "all my reads should use split size X" without modifying table metadata or application code.
Proposed Solution
Add a Spark session configuration key that overrides the table property for split size:
spark.sql.iceberg.split-size
This follows the existing pattern used by other Iceberg session configs such as spark.sql.iceberg.vectorization.enabled and spark.sql.iceberg.data-planning-mode.
The resolution precedence would be (highest priority first; the first source that is set wins):
- Read option (split-size)
- Table-scoped session configuration (spark.sql.iceberg.split-size.<catalog>.<database>.<table>)
- Global session configuration (spark.sql.iceberg.split-size)
- Table property (read.split.target-size)
- Default (128MB)
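To make the ordering concrete, here is a minimal sketch of the lookup in Python. It is only an illustration of the precedence described above, not the actual SparkReadConf / SparkConfParser code: the function name resolve_split_size and the plain-dict inputs are hypothetical, and the real change would be written in Java against the existing parser.

```python
from typing import Mapping

# Key names are taken from this proposal; the 128MB default is the value cited above.
READ_OPTION = "split-size"
SESSION_KEY = "spark.sql.iceberg.split-size"
TABLE_PROPERTY = "read.split.target-size"
DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024


def resolve_split_size(
    read_options: Mapping[str, str],
    session_conf: Mapping[str, str],
    table_properties: Mapping[str, str],
    table_ident: str,
) -> int:
    """Return the split size in bytes from the first source that sets it.

    Hypothetical helper: table_ident is "<catalog>.<database>.<table>".
    """
    candidates = [
        read_options.get(READ_OPTION),                    # 1. read option
        session_conf.get(f"{SESSION_KEY}.{table_ident}"), # 2. table-scoped session conf
        session_conf.get(SESSION_KEY),                    # 3. global session conf
        table_properties.get(TABLE_PROPERTY),             # 4. table property
    ]
    for value in candidates:
        if value is not None:
            return int(value)
    return DEFAULT_SPLIT_SIZE                             # 5. default


# A GPU session with a global override and one per-table override.
session = {
    SESSION_KEY: "2147483648",
    f"{SESSION_KEY}.my_catalog.my_db.large_table": "1073741824",
}
table_props = {TABLE_PROPERTY: "268435456"}

global_hit = resolve_split_size({}, session, table_props, "my_catalog.my_db.other_table")
table_hit = resolve_split_size({}, session, table_props, "my_catalog.my_db.large_table")
```

With these inputs the session values win over the table property, and the table-scoped key wins over the global one for large_table. An explicit split-size read option would still override both.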
Usage
-- GPU session: use 2GB splits for all table reads
SET spark.sql.iceberg.split-size = 2147483648;
-- CPU session: keep the default or set a different value
SET spark.sql.iceberg.split-size = 134217728;
-- Override split size for a specific table in this session
SET spark.sql.iceberg.split-size.my_catalog.my_db.large_table = 1073741824;
No DDL or code changes needed -- each session gets its own split size, with optional per-table granularity.
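For reference, the byte values in the SET statements above are plain powers of two. The short Python check below spells them out and gives a rough feel for how the split count changes. The 1 TiB scan size is a made-up figure, and the count ignores file boundaries and other planning details, so treat it as a back-of-the-envelope estimate only.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values used in the SET statements above.
cpu_split = 128 * MIB      # 134217728
table_split = 1 * GIB      # 1073741824
gpu_split = 2 * GIB        # 2147483648

# Hypothetical scan size, chosen only for illustration.
scan_bytes = 1024 ** 4     # 1 TiB


def approx_splits(total_bytes: int, split_size: int) -> int:
    """Ceiling division: upper-level estimate of how many splits a scan produces."""
    return -(-total_bytes // split_size)


cpu_splits = approx_splits(scan_bytes, cpu_split)
gpu_splits = approx_splits(scan_bytes, gpu_split)
```

Under these assumptions the same scan goes from 8192 splits at 128MB to 512 splits at 2GB, which is the kind of difference a GPU session wants without touching the table.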
Alternative: Engine-scoped table property overrides
A further extension could add an engine or hardware type qualifier to the read.split.target-size table property itself (e.g., read.split.target-size.gpu), allowing a single table to declare optimal split sizes for different hardware profiles without session-level configuration. This would let table owners encode hardware-aware defaults directly in metadata.
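A rough sketch of how such a qualified property could be resolved is below. Nothing here exists today: the read.split.target-size.gpu key comes from the example above, the function name and the engine argument are hypothetical, and falling back to the unqualified property when no qualified value is set is an assumption about how the extension would behave.

```python
from typing import Mapping, Optional

TABLE_PROPERTY = "read.split.target-size"
DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024  # 128MB default cited in this issue


def table_split_size(
    table_properties: Mapping[str, str],
    engine: Optional[str] = None,
) -> int:
    """Hypothetical lookup: prefer the engine-qualified property, then the plain one."""
    if engine is not None:
        qualified = table_properties.get(f"{TABLE_PROPERTY}.{engine}")
        if qualified is not None:
            return int(qualified)
    # Assumed fallback: unqualified table property, then the default.
    return int(table_properties.get(TABLE_PROPERTY, DEFAULT_SPLIT_SIZE))


props = {
    TABLE_PROPERTY: "134217728",
    f"{TABLE_PROPERTY}.gpu": "2147483648",
}

gpu_size = table_split_size(props, engine="gpu")
cpu_size = table_split_size(props)
```

The reader would still need some way to tell Iceberg which qualifier applies to it, which is part of why this is listed as an alternative rather than the main proposal.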
Scope
- Spark only (all supported shims: 3.4, 3.5, 4.0)
- Files affected: SparkSQLProperties, SparkReadConf, SparkConfParser
- Backward compatible: no behavior change unless the new session key is explicitly set
Query engine
Spark
Willingness to contribute