Skip to content

Spark: Allow session-level split size override without DDL or source code changes #16153

@gerashegalov

Description

@gerashegalov

Feature Request / Improvement

Problem Statement

When different compute engines or hardware accelerators (e.g., GPU via RAPIDS Accelerator and CPU) read the same Iceberg table concurrently, they need different values for read.split.target-size. GPU readers benefit from significantly larger splits (e.g., 2GB) to saturate device memory bandwidth, while CPU readers perform better with the current default (128MB).

Today there are only two ways to control read.split.target-size:

  1. Table property (read.split.target-size) -- requires DDL (ALTER TABLE ... SET TBLPROPERTIES), affects all readers globally, and is unsuitable when GPU and CPU workloads hit the same table simultaneously.
  2. Read option (split-size) -- requires source code changes to every DataFrameReader call site, which is impractical for ad-hoc SQL queries and shared notebooks. Hardware accelerators like RAPIDS Accelerator for Apache Spark are designed as drop-in replacements that require no application code changes, so requiring per-call-site read options defeats this benefit.

Neither approach allows a Spark session to declare "all my reads should use split size X" without modifying table metadata or application code.

Proposed Solution

Add a Spark session configuration key that overrides the table property for split size:

spark.sql.iceberg.split-size

This follows the existing pattern used by other Iceberg session configs such as spark.sql.iceberg.vectorization.enabled and spark.sql.iceberg.data-planning-mode.

The resolution precedence would be:

  1. Read option (split-size)
  2. Table-scoped session configuration (spark.sql.iceberg.split-size.<catalog>.<database>.<table>)
  3. Global session configuration (spark.sql.iceberg.split-size)
  4. Table property (read.split.target-size)
  5. Default (128MB)

Usage

-- GPU session: use 2GB splits for all table reads
SET spark.sql.iceberg.split-size = 2147483648;

-- CPU session: keep the default or set a different value
SET spark.sql.iceberg.split-size = 134217728;

-- Override split size for a specific table in this session
SET spark.sql.iceberg.split-size.my_catalog.my_db.large_table = 1073741824;

No DDL or code changes needed -- each session gets its own split size, with optional per-table granularity.

Alternative: Engine-scoped table property overrides

A further extension could add an engine or hardware type qualifier to the read.split.target-size table property itself (e.g., read.split.target-size.gpu), allowing a single table to declare optimal split sizes for different hardware profiles without session-level configuration. This would let table owners encode hardware-aware defaults directly in metadata.

Scope

  • Spark only (all supported shims: 3.4, 3.5, 4.0)
  • Files affected: SparkSQLProperties, SparkReadConf, SparkConfParser
  • Backward compatible: no behavior change unless the new session key is explicitly set

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Metadata

Metadata

Assignees

No one assigned

    Labels

    improvementPR that improves existing functionality

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions