Skip to content

feat: update iceberg integration to support v0.11.0#6192

Closed
gweaverbiodev wants to merge 2 commits into
Eventual-Inc:mainfrom
gweaverbiodev:gw/pyice11
Closed

feat: update iceberg integration to support v0.11.0#6192
gweaverbiodev wants to merge 2 commits into
Eventual-Inc:mainfrom
gweaverbiodev:gw/pyice11

Conversation

@gweaverbiodev
Copy link
Copy Markdown
Contributor

@gweaverbiodev gweaverbiodev commented Feb 12, 2026

Changes Made

Makes required changes to support pyiceberg v0.11.0 which now properly supports Decimal type (see apache/iceberg-python#2515). Excludes v0.9.1 and v0.10.0 that have the Decimal type issue.

There is a breaking change in pyiceberg that requires a partition field to have a different name than any field in the schema. My initial attempt is to add a name field that requires explicit specification by the user unless there is some safe way to generate this name automatically for the user? Perhaps it could be <source field name>_<field id>?

@gweaverbiodev gweaverbiodev changed the title update iceberg integration to support v0.11.0 [feat] update iceberg integration to support v0.11.0 Feb 12, 2026
@gweaverbiodev gweaverbiodev changed the title [feat] update iceberg integration to support v0.11.0 feat: update iceberg integration to support v0.11.0 Feb 12, 2026
@github-actions github-actions Bot added the feat label Feb 12, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR updates Daft's Iceberg integration to support PyIceberg v0.11.0, which fixes Decimal type handling and introduces a breaking change requiring partition fields to have explicit names distinct from schema field names.

The implementation adds an optional name parameter to PartitionField across Python and Rust layers. When creating Iceberg tables, users must now provide explicit partition field names (e.g., name="a_identity" instead of reusing the source field name).

Critical Issue Found:

  • The make_partition_field helper function in daft/io/scan.py was not updated to accept and pass the name parameter, which will break reading existing Iceberg tables with partitions
  • The call site in iceberg_scan.py:109 extracts the name from PyIceberg partition fields but cannot pass it through

Breaking Changes:

  • Users creating partitioned Iceberg tables must now provide explicit name parameters
  • The PR title should follow conventional commit format with ! to indicate breaking changes (per rule 81e59850-f921-405c-a13e-213b428b8d9b)

What Works Well:

  • Rust implementation properly threads the name field through all layers
  • Test coverage updated for all partition transform types
  • Version constraints correctly exclude problematic v0.9.1 and v0.10.0

Confidence Score: 1/5

  • This PR contains a critical bug that will break reading partitioned Iceberg tables
  • The make_partition_field helper function was not updated to support the new name parameter, which means reading existing partitioned Iceberg tables will fail to preserve partition field names. This is a blocking issue that must be fixed before merge.
  • daft/io/scan.py and daft/io/iceberg/iceberg_scan.py require immediate fixes to support the name parameter

Important Files Changed

Filename Overview
daft/io/partitioning.py Adds optional name parameter to PartitionField with property accessor - clean implementation
daft/catalog/__iceberg.py Updates partition field conversion to use explicit name parameter, adds validation error for missing names
tests/catalog/test_iceberg.py Updates all partition tests to provide explicit name parameters, assertions updated to check new names
daft/io/scan.py Missing name parameter in make_partition_field - will break reading partitioned Iceberg tables
daft/io/iceberg/iceberg_scan.py Extracts partition field name but doesn't pass it to make_partition_field - needs update

Sequence Diagram

sequenceDiagram
    participant User
    participant Catalog
    participant IcebergCatalog
    participant PyIceberg
    
    Note over User,PyIceberg: Creating Partitioned Table
    User->>Catalog: create_table(schema, partition_fields)
    Note right of User: partition_fields must have explicit names
    Catalog->>IcebergCatalog: _partition_fields_to_pyiceberg_spec()
    IcebergCatalog->>IcebergCatalog: validate pf.name is not None
    IcebergCatalog->>PyIceberg: create PyIcebergPartitionField(name=pf.name)
    PyIceberg-->>IcebergCatalog: PartitionSpec
    IcebergCatalog->>PyIceberg: create_table(schema, partition_spec)
    PyIceberg-->>User: Table
    
    Note over User,PyIceberg: Reading Partitioned Table
    User->>Catalog: read_iceberg(table)
    Catalog->>IcebergScanOperator: new(table)
    IcebergScanOperator->>PyIceberg: get partition spec
    PyIceberg-->>IcebergScanOperator: PartitionSpec with names
    IcebergScanOperator->>IcebergScanOperator: _iceberg_partition_field_to_daft(pfield)
    Note right of IcebergScanOperator: Extracts name from pfield.name
    IcebergScanOperator->>IcebergScanOperator: make_partition_field(field, source, tfm)
    Note right of IcebergScanOperator: BUG: name not passed here
    IcebergScanOperator-->>User: DataFrame
Loading

Last reviewed commit: d4514ff

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 12, 2026

Additional Comments (3)

daft/io/scan.py
missing name parameter - this will break reading existing Iceberg tables with partitions

The function needs to accept and pass the name parameter to PyPartitionField. Currently called from iceberg_scan.py:109 which extracts the name from PyIceberg partition fields, but that name won't be passed through.

def make_partition_field(
    field: Field, source_field: Field | None = None, transform: PyPartitionTransform | None = None, name: str | None = None
) -> PyPartitionField:
    return PyPartitionField(
        field._field,
        source_field._field if source_field is not None else None,
        transform,
        name,
    )

tests/io/test_partitioning.py
consider testing the name property that was added to PartitionField

    partition_field = PartitionField.create(field, source_field, transform, name="test_name")
    assert partition_field.field == field
    assert partition_field.source_field == source_field
    assert partition_field.transform == transform
    assert partition_field.name == "test_name"

daft/io/iceberg/iceberg_scan.py
pass name to make_partition_field to preserve partition field names from PyIceberg

    return make_partition_field(result_field, daft_field, transform=tfm, name=name)

@kevinjqliu
Copy link
Copy Markdown

There is a breaking change in pyiceberg that requires a partition field to have a different name than any field in the schema. My initial attempt is to add a name field that requires explicit specification by the user unless there is some safe way to generate this name automatically for the user? Perhaps it could be _?

this is the naming scheme pyiceberg uses
https://github.com/apache/iceberg-python/blob/7d4a8ef65d92ea52953136457d5ec277728dad9d/pyiceberg/partitioning.py#L380-L406

Heres the java reference
https://github.com/apache/iceberg/blob/ed39cee30d4e229cb71bbcbc7faa081375fcb179/api/src/main/java/org/apache/iceberg/PartitionSpec.java

TDLR

  • for identity transform, the default name is the column name itself
  • for other transforms, the default is columnName + "_<transform>"

@gweaverbiodev
Copy link
Copy Markdown
Contributor Author

thanks @kevinjqliu closing this one in favor of that simpler automated naming with clean history #6200

@gweaverbiodev gweaverbiodev deleted the gw/pyice11 branch February 13, 2026 06:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants