Commit 0774698

koenvo authored and MoIMC committed
Fallback for upsert when arrow cannot compare source rows with target rows (apache#1878)
<!-- Fixes apache#1711 -->

Upsert operations in PyIceberg rely on Arrow joins between source and target rows. However, Arrow Acero cannot compare certain complex types, such as `struct`, `list`, and `map`, unless they are part of the join key. When such types exist in non-join columns, the upsert fails with an error like:

```
ArrowInvalid: Data type struct<...> is not supported in join non-key field venue_geo
```

This PR introduces a **fallback mechanism**: if Arrow fails to join due to unsupported types, we fall back to comparing only the key columns. Non-key complex fields are ignored in the join condition, but still retained in the final upserted data.

---

Before this change:

```python
txn.upsert(df, join_cols=["match_id"])
```

> ❌ ArrowInvalid: Data type struct<...> is not supported in join non-key field venue_geo

---

After this change:

```python
txn.upsert(df, join_cols=["match_id"])
```

> ✅ Successfully inserts or updates the record, skipping complex field comparison during join

---

Tested: yes.

- A test was added to reproduce the failure scenario with complex non-key fields.
- The new behavior is verified by asserting that the upsert completes successfully using the fallback logic.

---

> ℹ️ **Note**
> This change does not affect users who do not include complex types in their schemas. For those who do, it improves resilience while preserving data correctness.

---

User-facing changes: yes. Upserts involving complex non-key columns (like `struct`, `list`, or `map`) no longer fail. They now succeed by skipping unsupported comparisons during the join phase.
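The fallback described in the commit message can be modeled with a small, standard-library-only sketch. It is not PyIceberg's implementation: rows are plain dicts instead of Arrow tables, `UnsupportedComparison` is a hypothetical stand-in for `ArrowInvalid`, and all function names are invented for illustration. It also assumes that, in fallback mode, every source row whose key already exists in the target is treated as an update candidate, which the commit message does not state explicitly.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


class UnsupportedComparison(Exception):
    """Hypothetical stand-in for the ArrowInvalid error quoted above."""


def key_of(row: Row, join_cols: Sequence[str]) -> Tuple[Any, ...]:
    """Build a hashable join key from the key columns of a row."""
    return tuple(row[col] for col in join_cols)


def changed_rows_full_compare(
    source: List[Row], target: List[Row], join_cols: Sequence[str]
) -> List[Row]:
    """Toy join that rejects complex (dict/list) values in non-key columns."""
    for row in list(source) + list(target):
        for name, value in row.items():
            if name not in join_cols and isinstance(value, (dict, list)):
                raise UnsupportedComparison(
                    f"complex type is not supported in join non-key field {name}"
                )
    target_by_key = {key_of(r, join_cols): r for r in target}
    # Keep source rows that match a target key but differ in some other column.
    return [
        r
        for r in source
        if key_of(r, join_cols) in target_by_key
        and r != target_by_key[key_of(r, join_cols)]
    ]


def changed_rows_keys_only(
    source: List[Row], target: List[Row], join_cols: Sequence[str]
) -> List[Row]:
    """Fallback: match on key columns only, ignoring all other columns."""
    target_keys = {key_of(r, join_cols) for r in target}
    return [r for r in source if key_of(r, join_cols) in target_keys]


def rows_to_update(
    source: List[Row], target: List[Row], join_cols: Sequence[str]
) -> List[Row]:
    """Try the full comparison first; fall back to key-only matching on failure."""
    try:
        return changed_rows_full_compare(source, target, join_cols)
    except UnsupportedComparison:
        return changed_rows_keys_only(source, target, join_cols)


# "venue_geo" is a struct-like non-key column, as in the error message above.
target = [{"match_id": 1, "score": 2, "venue_geo": {"lat": 52.0}}]
source = [
    {"match_id": 1, "score": 3, "venue_geo": {"lat": 52.0}},
    {"match_id": 2, "score": 0, "venue_geo": {"lat": 48.0}},
]
updates = rows_to_update(source, target, join_cols=["match_id"])
```

In this model the row with `match_id` 1 is returned as an update candidate with its `venue_geo` value intact, while the row with `match_id` 2 has no key match and would be handled as an insert. The cost of the key-only path is that a matched row is rewritten even when none of its values changed.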
1 parent 2d0acf1 commit 0774698

1 file changed

Lines changed: 2 additions & 2 deletions

tests/table/test_upsert.py

@@ -674,7 +674,7 @@ def test_upsert_with_nulls(catalog: Catalog) -> None:
 
     schema = pa.schema(
         [
-            ("foo", pa.string()),
+            ("foo", pa.large_string()),
             ("bar", pa.int32()),
             ("baz", pa.bool_()),
         ]
@@ -702,7 +702,7 @@ def test_upsert_with_nulls(catalog: Catalog) -> None:
     upd = table.upsert(data_without_null, join_cols=["foo"])
     assert upd.rows_updated == 1
     assert upd.rows_inserted == 0
-    assert table.scan().to_arrow() == pa.Table.from_pylist(
+    assert table.scan().to_arrow().combine_chunks() == pa.Table.from_pylist(
         [
             {"foo": "apple", "bar": 7, "baz": False},
             {"foo": "banana", "bar": None, "baz": False},
