feat: update iceberg integration to support v0.11.0 #6192
Greptile Overview
Greptile Summary
This PR updates Daft's Iceberg integration to support PyIceberg v0.11.0, which fixes Decimal type handling and introduces a breaking change requiring partition fields to have explicit names distinct from schema field names. The implementation adds an optional
Critical Issue Found:
Breaking Changes:
What Works Well:
Confidence Score: 1/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Catalog
participant IcebergCatalog
participant PyIceberg
Note over User,PyIceberg: Creating Partitioned Table
User->>Catalog: create_table(schema, partition_fields)
Note right of User: partition_fields must have explicit names
Catalog->>IcebergCatalog: _partition_fields_to_pyiceberg_spec()
IcebergCatalog->>IcebergCatalog: validate pf.name is not None
IcebergCatalog->>PyIceberg: create PyIcebergPartitionField(name=pf.name)
PyIceberg-->>IcebergCatalog: PartitionSpec
IcebergCatalog->>PyIceberg: create_table(schema, partition_spec)
PyIceberg-->>User: Table
Note over User,PyIceberg: Reading Partitioned Table
User->>Catalog: read_iceberg(table)
Catalog->>IcebergScanOperator: new(table)
IcebergScanOperator->>PyIceberg: get partition spec
PyIceberg-->>IcebergScanOperator: PartitionSpec with names
IcebergScanOperator->>IcebergScanOperator: _iceberg_partition_field_to_daft(pfield)
Note right of IcebergScanOperator: Extracts name from pfield.name
IcebergScanOperator->>IcebergScanOperator: make_partition_field(field, source, tfm)
Note right of IcebergScanOperator: BUG: name not passed here
IcebergScanOperator-->>User: DataFrame
Last reviewed commit: d4514ff
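The validation step in the diagram ("validate pf.name is not None") together with the PyIceberg v0.11.0 rule that partition field names must differ from schema field names can be sketched roughly as follows. This is a minimal illustration, not Daft's actual code: `PartitionFieldSketch` and `validate_partition_names` are hypothetical names, and the real check would operate on Daft and PyIceberg types.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PartitionFieldSketch:
    """Hypothetical stand-in for a partition field; not Daft's real class."""

    source: str                 # schema column the partition is derived from
    transform: str              # free-form label, e.g. "identity" or "day"
    name: Optional[str] = None  # explicit partition field name, if given


def validate_partition_names(
    schema_names: Sequence[str],
    fields: Sequence[PartitionFieldSketch],
) -> list[str]:
    """Return the partition names, or raise if one is missing or collides."""
    taken = set(schema_names)
    names: list[str] = []
    for pf in fields:
        if pf.name is None:
            raise ValueError(
                f"partition field on '{pf.source}' needs an explicit name"
            )
        if pf.name in taken:
            raise ValueError(
                f"partition field name '{pf.name}' collides with an existing name"
            )
        taken.add(pf.name)  # also rejects duplicates among partition fields
        names.append(pf.name)
    return names
```

For example, `validate_partition_names(["id", "ts"], [PartitionFieldSketch("ts", "day", "ts_day")])` passes, while a field named `ts` or a field with no name raises `ValueError`.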
Additional Comments (3)
The function needs to accept and pass the
This is the naming scheme pyiceberg uses. Here's the Java reference. TL;DR
Thanks @kevinjqliu, closing this one in favor of that simpler automated naming with clean history: #6200
Changes Made
Makes required changes to support `pyiceberg` v0.11.0, which now properly supports Decimal type (see apache/iceberg-python#2515). Excludes v0.9.1 and v0.10.0, which have the Decimal type issue.

There is a breaking change in `pyiceberg` that requires a partition field to have a different name than any field in the schema. My initial attempt is to add a `name` field that requires explicit specification by the user, unless there is some safe way to generate this name automatically for the user? Perhaps it could be `<source field name>_<field id>`?
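The `<source field name>_<field id>` idea floated above could look roughly like this. It is only a sketch of the proposal in this description, not what Daft or PyIceberg actually does: `auto_partition_name` is a hypothetical helper, and the field id `1000` in the usage note is just an example value.

```python
from __future__ import annotations

from typing import Iterable


def auto_partition_name(
    source_name: str,
    field_id: int,
    schema_names: Iterable[str],
) -> str:
    """Build '<source field name>_<field id>' and check it against the schema."""
    candidate = f"{source_name}_{field_id}"
    if candidate in set(schema_names):
        # The generated name is not guaranteed to be unique, so fail loudly
        # rather than silently pick something else.
        raise ValueError(
            f"generated partition name '{candidate}' collides with a schema field"
        )
    return candidate
```

With schema fields `id` and `ts`, `auto_partition_name("ts", 1000, ["id", "ts"])` yields `ts_1000`. The collision check matters because a schema could already contain a column with that exact name.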