Commit fe7c749

Authored by GayathriSrividya (Gayathri Srividya Rajavarapu) and kevinjqliu
fix: correct NOT STARTS WITH projection for truncated partitions (#3528)
closes #3493

## Summary

Fixes incorrect projection of `NOT STARTS WITH` predicates for truncated string/binary partition fields. The current implementation unsafely truncates the filter literal without checking its length relative to the truncate width.

## Root Cause

The `TruncateTransform.project` method calls `_truncate_array`, which blindly truncates the literal for both `STARTS WITH` and `NOT STARTS WITH` predicates:

```python
elif isinstance(pred, BoundNotStartsWith):
    return NotStartsWith(Reference(name), _transform_literal(transform, boundary))
```

For `NOT STARTS WITH "hello"` with `truncate[2]`, this produces:

- Current (unsafe): `NOT STARTS WITH "he"`
- Problem: the truncated partition `"he"` holds every value starting with "he" ("hello", "heat", "hear", etc.), so pruning it would drop rows such as "heat" that do satisfy `NOT STARTS WITH "hello"`

## Solution

Add special handling for `BoundNotStartsWith` in the `project` method, following the Java/Go reference behavior:

- **prefix_length < truncate_width**: Keep the original `NOT STARTS WITH` literal (safe)
- **prefix_length == truncate_width**: Project to `!=` instead (safe inequality check)
- **prefix_length > truncate_width**: Return `None` (no inclusive projection possible)

### pyiceberg/transforms.py

- Add explicit `NOT STARTS WITH` handling before calling `_truncate_array`
- Check literal length vs truncate width and apply the correct projection rule

### tests/test_transforms.py

- Rename `test_projection_truncate_string_not_starts_with` to `test_projection_truncate_string_not_starts_with_longer_literal` and expect `None` (prefix_length > width is unsafe)
- Add `test_projection_truncate_string_not_starts_with_equal_width_literal` (prefix_length == width → `!=`)
- Add `test_projection_truncate_string_not_starts_with_shorter_literal` (prefix_length < width → original literal)

## Validation

- `make lint` ✓ (all pre-commit hooks pass)
- `pytest tests/test_transforms.py` → 280 passed ✓
- All 13 string truncate projection tests pass

---------

Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
1 parent a38bbe3 commit fe7c749
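
The root cause described in the commit message can be reproduced with plain Python strings. This is a minimal sketch, not PyIceberg code: `rows`, `truncate`, and the other names are hypothetical stand-ins, and string slicing is assumed to model `truncate[2]`.

```python
# Hypothetical stand-ins for illustration; this does not call PyIceberg.
WIDTH = 2
rows = ["hello", "help", "heat", "world"]


def truncate(value: str, width: int = WIDTH) -> str:
    """Partition value that truncate[width] would assign to a string."""
    return value[:width]


# Rows the query NOT STARTS WITH "hello" is supposed to return.
wanted = [r for r in rows if not r.startswith("hello")]

# Old projection: truncate the literal to "he" and keep NOT STARTS WITH
# on the partition value. Only partitions passing this filter are scanned.
kept_partitions = {
    truncate(r) for r in rows if not truncate(r).startswith(truncate("hello"))
}

# Rows that survive partition pruning and then pass the row-level filter.
returned = [r for r in wanted if truncate(r) in kept_partitions]

lost = sorted(set(wanted) - set(returned))
print(lost)  # ['heat', 'help']
```

Partition `"he"` is pruned even though it contains "help" and "heat", both of which satisfy the original predicate.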

2 files changed

Lines changed: 27 additions & 4 deletions

pyiceberg/transforms.py

Lines changed: 10 additions & 1 deletion
@@ -813,7 +813,16 @@ def project(self, name: str, pred: BoundPredicate) -> UnboundPredicate | None:
             return _truncate_number(name, pred, self.transform(field_type))
         elif isinstance(field_type, (BinaryType, StringType)):
             if isinstance(pred, BoundLiteralPredicate):
-                return _truncate_array(name, pred, self.transform(field_type))
+                if isinstance(pred, BoundNotStartsWith):
+                    literal_width = len(pred.literal.value)
+                    if literal_width < self.width:
+                        return pred.as_unbound(name, pred.literal.value)
+                    elif literal_width == self.width:
+                        return NotEqualTo(name, pred.literal.value)
+                    else:
+                        return None
+                else:
+                    return _truncate_array(name, pred, self.transform(field_type))
 
     def strict_project(self, name: str, pred: BoundPredicate) -> UnboundPredicate | None:
         field_type = pred.term.ref().field.field_type
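
The three-way branch in this hunk can be modelled without PyIceberg to check that it is inclusive, meaning it never prunes a partition that holds a qualifying row. The sketch below is an assumption-laden model: `project_not_starts_with`, `partition_matches`, and `is_inclusive` are hypothetical helpers, predicates are represented as `(operator, literal)` tuples, and slicing stands in for the truncate transform.

```python
from itertools import product
from typing import List, Optional, Tuple

# (operator, literal), or None meaning "no projection, scan every partition".
Projection = Optional[Tuple[str, str]]


def project_not_starts_with(prefix: str, width: int) -> Projection:
    """Model of the length-based rule for NOT STARTS WITH on truncate[width]."""
    if len(prefix) < width:
        return ("not_starts_with", prefix)  # literal passes through unchanged
    if len(prefix) == width:
        return ("not_eq", prefix)  # partition value must differ from the prefix
    return None  # prefix longer than the width: nothing can be pruned safely


def partition_matches(projection: Projection, partition: str) -> bool:
    """True if the partition must be scanned under the projected predicate."""
    if projection is None:
        return True
    op, lit = projection
    if op == "not_starts_with":
        return not partition.startswith(lit)
    return partition != lit


def is_inclusive(prefix: str, width: int, values: List[str]) -> bool:
    """Every row satisfying NOT STARTS WITH prefix must sit in a kept partition."""
    projection = project_not_starts_with(prefix, width)
    return all(
        partition_matches(projection, v[:width])
        for v in values
        if not v.startswith(prefix)
    )


# All strings of length 1 to 4 over the alphabet {a, b}.
values = ["".join(p) for n in range(1, 5) for p in product("ab", repeat=n)]
results = {p: is_inclusive(p, 2, values) for p in ["a", "ab", "aba"]}
print(results)  # {'a': True, 'ab': True, 'aba': True}
```

The exhaustive check covers one prefix per branch (shorter than, equal to, and longer than the width of 2); it is a sanity check over a small alphabet, not a proof.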

tests/test_transforms.py

Lines changed: 17 additions & 3 deletions
@@ -1025,10 +1025,24 @@ def test_projection_truncate_string_starts_with(bound_reference_str: BoundRefere
     ) == StartsWith(term="name", literal=literal("he"))
 
 
-def test_projection_truncate_string_not_starts_with(bound_reference_str: BoundReference) -> None:
+def test_projection_truncate_string_not_starts_with_longer_literal(bound_reference_str: BoundReference) -> None:
+    # Not a valid projection: return None because pruning on "he" could drop qualifying rows like "help".
+    assert TruncateTransform(2).project("name", BoundNotStartsWith(term=bound_reference_str, literal=literal("hello"))) is None
+
+
+def test_projection_truncate_string_not_starts_with_equal_width_literal(bound_reference_str: BoundReference) -> None:
+    # Valid projection: improve NOT STARTS WITH "he" to partition != "he".
+    assert TruncateTransform(2).project(
+        "name", BoundNotStartsWith(term=bound_reference_str, literal=literal("he"))
+    ) == NotEqualTo(term="name", literal=literal("he"))
+
+
+def test_projection_truncate_string_not_starts_with_shorter_literal(bound_reference_str: BoundReference) -> None:
+    # Valid projection: pass the NOT STARTS WITH literal "h" through unchanged.
     assert TruncateTransform(2).project(
-        "name", BoundNotStartsWith(term=bound_reference_str, literal=literal("hello"))
-    ) == NotStartsWith(term="name", literal=literal("he"))
+        "name", BoundNotStartsWith(term=bound_reference_str, literal=literal("h"))
+    ) == NotStartsWith(term="name", literal=literal("h"))
 
 
 def _test_projection(lhs: UnboundPredicate | None, rhs: UnboundPredicate | None) -> None:
