Skip to content

Commit 10fbafc

Browse files
committed
This PR extends PyIceberg geospatial support in three areas:
1. Adds geospatial bounds metric computation from WKB values (geometry + geography). 2. Adds spatial predicate expression/binding support (`st-contains`, `st-intersects`, `st-within`, `st-overlaps`) with conservative evaluator behavior. 3. Improves Arrow/Parquet interoperability for GeoArrow WKB, including explicit handling of geometry vs planar-geography ambiguity at the schema-compatibility boundary. This increment is compatibility-first and does **not** introduce new runtime dependencies. Base `geometry`/`geography` types existed, but there were still practical gaps: - Geospatial columns were not contributing spec-encoded bounds in data-file metrics. - Spatial predicates were not modeled end-to-end in expression binding/visitor plumbing. - GeoArrow metadata can be ambiguous for `geometry` vs `geography(..., "planar")`, causing false compatibility failures during import/add-files flows. - Added pure-Python geospatial utilities in `pyiceberg/utils/geospatial.py`: - WKB envelope extraction - antimeridian-aware geography envelope merge - Iceberg geospatial bound serialization/deserialization - Added `GeospatialStatsAggregator` and geospatial aggregate helpers in `pyiceberg/io/pyarrow.py`. - Updated write/import paths to compute geospatial bounds from actual row values (not Parquet binary min/max stats): - `write_file(...)` - `parquet_file_to_data_file(...)` - Prevented incorrect partition inference from geospatial envelope bounds. - Added expression types in `pyiceberg/expressions/__init__.py`: - `STContains`, `STIntersects`, `STWithin`, `STOverlaps` - bound counterparts and JSON parsing support - Added visitor dispatch/plumbing in `pyiceberg/expressions/visitors.py`. - Behavior intentionally conservative in this increment: - row-level expression evaluator raises `NotImplementedError` - manifest/metrics evaluators return conservative might-match defaults - translation paths preserve spatial predicates where possible - Added GeoArrow WKB decoding helper in `pyiceberg/io/pyarrow.py` to map extension metadata to Iceberg geospatial types. - Added boundary-only compatibility option in `pyiceberg/schema.py`: - `_check_schema_compatible(..., allow_planar_geospatial_equivalence=False)` - Enabled that option only in `_check_pyarrow_schema_compatible(...)` to allow: - `geometry` <-> `geography(..., "planar")` when CRS strings match - while still rejecting spherical geography mismatches - Added one-time warning log when `geoarrow-pyarrow` is unavailable and code falls back to binary. - Updated user docs: `mkdocs/docs/geospatial.md` - Added decisions record: `mkdocs/docs/dev/geospatial-types-decisions-v1.md` Added/updated tests across: - `tests/utils/test_geospatial.py` - `tests/io/test_pyarrow_stats.py` - `tests/io/test_pyarrow.py` - `tests/expressions/test_spatial_predicates.py` - `tests/integration/test_geospatial.py` Coverage includes: - geospatial bound encoding/decoding (XY/XYZ/XYM/XYZM) - geography antimeridian behavior - geospatial metrics generation from write/import paths - spatial predicate modeling/binding/translation behavior - planar ambiguity compatibility guardrails - warning behavior for missing `geoarrow-pyarrow` - No user-facing API removals. - New compatibility relaxation is intentionally scoped to Arrow/Parquet schema-compatibility boundary only. - Core schema/type compatibility remains strict elsewhere. - No spatial pushdown/row execution implementation in this PR. - Spatial predicate execution semantics. - Spatial predicate pushdown/pruning. - Runtime WKB <-> WKT conversion strategy.
1 parent 9687d08 commit 10fbafc

File tree

11 files changed

+1588
-11
lines changed

11 files changed

+1588
-11
lines changed

mkdocs/docs/geospatial.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -84,11 +84,11 @@ point_wkb = bytes.fromhex("0101000000000000000000000000000000000000")
8484

8585
1. **WKB/WKT Conversion**: Converting between WKB bytes and WKT strings requires external libraries (like Shapely). PyIceberg does not include this conversion to avoid heavy dependencies.
8686

87-
2. **Spatial Predicates**: Spatial filtering (e.g., ST_Contains, ST_Intersects) is not yet supported for query pushdown.
87+
2. **Spatial Predicates Execution**: Spatial predicate APIs (`st-contains`, `st-intersects`, `st-within`, `st-overlaps`) are available in expression trees and binding. Row-level execution and metrics/pushdown evaluation are not implemented yet.
8888

89-
3. **Bounds Metrics**: Geometry/geography columns do not currently contribute to data file bounds metrics.
89+
3. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support.
9090

91-
4. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support.
91+
4. **GeoArrow planar ambiguity**: In GeoArrow metadata, `geometry` and `geography(..., 'planar')` can be encoded identically (no explicit edge metadata). PyIceberg resolves this ambiguity at the Arrow/Parquet schema-compatibility boundary by treating them as compatible when CRS matches, while keeping core schema compatibility strict elsewhere.
9292

9393
## Format Version
9494

pyiceberg/expressions/__init__.py

Lines changed: 172 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@
2929
from pyiceberg.expressions.literals import AboveMax, BelowMin, Literal, literal
3030
from pyiceberg.schema import Accessor, Schema
3131
from pyiceberg.typedef import IcebergBaseModel, IcebergRootModel, L, LiteralValue, StructProtocol
32-
from pyiceberg.types import DoubleType, FloatType, NestedField
32+
from pyiceberg.types import DoubleType, FloatType, GeographyType, GeometryType, NestedField
3333
from pyiceberg.utils.singleton import Singleton
3434

3535

@@ -48,6 +48,16 @@ def _to_literal(value: L | Literal[L]) -> Literal[L]:
4848
return literal(value)
4949

5050

51+
def _to_bytes(value: bytes | bytearray | memoryview) -> bytes:
52+
if isinstance(value, bytes):
53+
return value
54+
if isinstance(value, bytearray):
55+
return bytes(value)
56+
if isinstance(value, memoryview):
57+
return value.tobytes()
58+
raise TypeError(f"Expected bytes-like value, got {type(value)}")
59+
60+
5161
class BooleanExpression(IcebergBaseModel, ABC):
5262
"""An expression that evaluates to a boolean."""
5363

@@ -109,6 +119,14 @@ def handle_primitive_type(cls, v: Any, handler: ValidatorFunctionWrapHandler) ->
109119
return StartsWith(**v)
110120
elif field_type == "not-starts-with":
111121
return NotStartsWith(**v)
122+
elif field_type == "st-contains":
123+
return STContains(**v)
124+
elif field_type == "st-intersects":
125+
return STIntersects(**v)
126+
elif field_type == "st-within":
127+
return STWithin(**v)
128+
elif field_type == "st-overlaps":
129+
return STOverlaps(**v)
112130

113131
# Set
114132
elif field_type == "in":
@@ -1106,3 +1124,156 @@ def __invert__(self) -> StartsWith:
11061124
@property
11071125
def as_bound(self) -> type[BoundNotStartsWith]: # type: ignore
11081126
return BoundNotStartsWith
1127+
1128+
1129+
class SpatialPredicate(UnboundPredicate, ABC):
1130+
type: TypingLiteral["st-contains", "st-intersects", "st-within", "st-overlaps"] = Field(alias="type")
1131+
term: UnboundTerm
1132+
value: bytes = Field()
1133+
model_config = ConfigDict(populate_by_name=True, frozen=True, arbitrary_types_allowed=True)
1134+
1135+
def __init__(
1136+
self,
1137+
term: str | UnboundTerm,
1138+
geometry: bytes | bytearray | memoryview | None = None,
1139+
**kwargs: Any,
1140+
) -> None:
1141+
if geometry is None and "value" in kwargs:
1142+
geometry = kwargs["value"]
1143+
if geometry is None:
1144+
raise TypeError("Spatial predicates require WKB bytes")
1145+
1146+
super().__init__(term=_to_unbound_term(term), value=_to_bytes(geometry))
1147+
1148+
@property
1149+
def geometry(self) -> bytes:
1150+
return self.value
1151+
1152+
def bind(self, schema: Schema, case_sensitive: bool = True) -> BoundSpatialPredicate:
1153+
bound_term = self.term.bind(schema, case_sensitive)
1154+
if not isinstance(bound_term.ref().field.field_type, (GeometryType, GeographyType)):
1155+
raise TypeError(
1156+
f"Spatial predicates can only be bound against geometry/geography fields: {bound_term.ref().field}"
1157+
)
1158+
return self.as_bound(bound_term, self.geometry)
1159+
1160+
def __eq__(self, other: Any) -> bool:
1161+
if isinstance(other, self.__class__):
1162+
return self.term == other.term and self.geometry == other.geometry
1163+
return False
1164+
1165+
def __str__(self) -> str:
1166+
return f"{str(self.__class__.__name__)}(term={repr(self.term)}, geometry={self.geometry!r})"
1167+
1168+
def __repr__(self) -> str:
1169+
return f"{str(self.__class__.__name__)}(term={repr(self.term)}, geometry={self.geometry!r})"
1170+
1171+
@property
1172+
@abstractmethod
1173+
def as_bound(self) -> type[BoundSpatialPredicate]: ...
1174+
1175+
1176+
class BoundSpatialPredicate(BoundPredicate, ABC):
1177+
value: bytes = Field()
1178+
1179+
def __init__(self, term: BoundTerm, geometry: bytes | bytearray | memoryview):
1180+
super().__init__(term=term, value=_to_bytes(geometry))
1181+
1182+
@property
1183+
def geometry(self) -> bytes:
1184+
return self.value
1185+
1186+
def __eq__(self, other: Any) -> bool:
1187+
if isinstance(other, self.__class__):
1188+
return self.term == other.term and self.geometry == other.geometry
1189+
return False
1190+
1191+
def __str__(self) -> str:
1192+
return f"{self.__class__.__name__}(term={str(self.term)}, geometry={self.geometry!r})"
1193+
1194+
def __repr__(self) -> str:
1195+
return f"{str(self.__class__.__name__)}(term={repr(self.term)}, geometry={self.geometry!r})"
1196+
1197+
@property
1198+
@abstractmethod
1199+
def as_unbound(self) -> type[SpatialPredicate]: ...
1200+
1201+
1202+
class BoundSTContains(BoundSpatialPredicate):
1203+
def __invert__(self) -> BooleanExpression:
1204+
return Not(child=self)
1205+
1206+
@property
1207+
def as_unbound(self) -> type[STContains]:
1208+
return STContains
1209+
1210+
1211+
class BoundSTIntersects(BoundSpatialPredicate):
1212+
def __invert__(self) -> BooleanExpression:
1213+
return Not(child=self)
1214+
1215+
@property
1216+
def as_unbound(self) -> type[STIntersects]:
1217+
return STIntersects
1218+
1219+
1220+
class BoundSTWithin(BoundSpatialPredicate):
1221+
def __invert__(self) -> BooleanExpression:
1222+
return Not(child=self)
1223+
1224+
@property
1225+
def as_unbound(self) -> type[STWithin]:
1226+
return STWithin
1227+
1228+
1229+
class BoundSTOverlaps(BoundSpatialPredicate):
1230+
def __invert__(self) -> BooleanExpression:
1231+
return Not(child=self)
1232+
1233+
@property
1234+
def as_unbound(self) -> type[STOverlaps]:
1235+
return STOverlaps
1236+
1237+
1238+
class STContains(SpatialPredicate):
1239+
type: TypingLiteral["st-contains"] = Field(default="st-contains", alias="type")
1240+
1241+
def __invert__(self) -> BooleanExpression:
1242+
return Not(child=self)
1243+
1244+
@property
1245+
def as_bound(self) -> type[BoundSTContains]:
1246+
return BoundSTContains
1247+
1248+
1249+
class STIntersects(SpatialPredicate):
1250+
type: TypingLiteral["st-intersects"] = Field(default="st-intersects", alias="type")
1251+
1252+
def __invert__(self) -> BooleanExpression:
1253+
return Not(child=self)
1254+
1255+
@property
1256+
def as_bound(self) -> type[BoundSTIntersects]:
1257+
return BoundSTIntersects
1258+
1259+
1260+
class STWithin(SpatialPredicate):
1261+
type: TypingLiteral["st-within"] = Field(default="st-within", alias="type")
1262+
1263+
def __invert__(self) -> BooleanExpression:
1264+
return Not(child=self)
1265+
1266+
@property
1267+
def as_bound(self) -> type[BoundSTWithin]:
1268+
return BoundSTWithin
1269+
1270+
1271+
class STOverlaps(SpatialPredicate):
1272+
type: TypingLiteral["st-overlaps"] = Field(default="st-overlaps", alias="type")
1273+
1274+
def __invert__(self) -> BooleanExpression:
1275+
return Not(child=self)
1276+
1277+
@property
1278+
def as_bound(self) -> type[BoundSTOverlaps]:
1279+
return BoundSTOverlaps

pyiceberg/expressions/visitors.py

Lines changed: 101 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,12 @@
4747
BoundNotStartsWith,
4848
BoundPredicate,
4949
BoundSetPredicate,
50+
BoundSpatialPredicate,
5051
BoundStartsWith,
52+
BoundSTContains,
53+
BoundSTIntersects,
54+
BoundSTOverlaps,
55+
BoundSTWithin,
5156
BoundTerm,
5257
BoundUnaryPredicate,
5358
Not,
@@ -326,6 +331,18 @@ def visit_starts_with(self, term: BoundTerm, literal: LiteralValue) -> T:
326331
def visit_not_starts_with(self, term: BoundTerm, literal: LiteralValue) -> T:
327332
"""Visit bound NotStartsWith predicate."""
328333

334+
def visit_st_contains(self, term: BoundTerm, geometry: bytes) -> T:
335+
raise NotImplementedError(f"{self.__class__.__name__} does not implement st-contains")
336+
337+
def visit_st_intersects(self, term: BoundTerm, geometry: bytes) -> T:
338+
raise NotImplementedError(f"{self.__class__.__name__} does not implement st-intersects")
339+
340+
def visit_st_within(self, term: BoundTerm, geometry: bytes) -> T:
341+
raise NotImplementedError(f"{self.__class__.__name__} does not implement st-within")
342+
343+
def visit_st_overlaps(self, term: BoundTerm, geometry: bytes) -> T:
344+
raise NotImplementedError(f"{self.__class__.__name__} does not implement st-overlaps")
345+
329346
def visit_unbound_predicate(self, predicate: UnboundPredicate) -> T:
330347
"""Visit an unbound predicate.
331348
@@ -421,6 +438,26 @@ def _(expr: BoundNotStartsWith, visitor: BoundBooleanExpressionVisitor[T]) -> T:
421438
return visitor.visit_not_starts_with(term=expr.term, literal=expr.literal)
422439

423440

441+
@visit_bound_predicate.register(BoundSTContains)
442+
def _(expr: BoundSTContains, visitor: BoundBooleanExpressionVisitor[T]) -> T:
443+
return visitor.visit_st_contains(term=expr.term, geometry=expr.geometry)
444+
445+
446+
@visit_bound_predicate.register(BoundSTIntersects)
447+
def _(expr: BoundSTIntersects, visitor: BoundBooleanExpressionVisitor[T]) -> T:
448+
return visitor.visit_st_intersects(term=expr.term, geometry=expr.geometry)
449+
450+
451+
@visit_bound_predicate.register(BoundSTWithin)
452+
def _(expr: BoundSTWithin, visitor: BoundBooleanExpressionVisitor[T]) -> T:
453+
return visitor.visit_st_within(term=expr.term, geometry=expr.geometry)
454+
455+
456+
@visit_bound_predicate.register(BoundSTOverlaps)
457+
def _(expr: BoundSTOverlaps, visitor: BoundBooleanExpressionVisitor[T]) -> T:
458+
return visitor.visit_st_overlaps(term=expr.term, geometry=expr.geometry)
459+
460+
424461
def rewrite_not(expr: BooleanExpression) -> BooleanExpression:
425462
return visit(expr, _RewriteNotVisitor())
426463

@@ -514,6 +551,18 @@ def visit_starts_with(self, term: BoundTerm, literal: LiteralValue) -> bool:
514551
def visit_not_starts_with(self, term: BoundTerm, literal: LiteralValue) -> bool:
515552
return not self.visit_starts_with(term, literal)
516553

554+
def visit_st_contains(self, term: BoundTerm, geometry: bytes) -> bool:
555+
raise NotImplementedError("st-contains row-level evaluation is not implemented")
556+
557+
def visit_st_intersects(self, term: BoundTerm, geometry: bytes) -> bool:
558+
raise NotImplementedError("st-intersects row-level evaluation is not implemented")
559+
560+
def visit_st_within(self, term: BoundTerm, geometry: bytes) -> bool:
561+
raise NotImplementedError("st-within row-level evaluation is not implemented")
562+
563+
def visit_st_overlaps(self, term: BoundTerm, geometry: bytes) -> bool:
564+
raise NotImplementedError("st-overlaps row-level evaluation is not implemented")
565+
517566
def visit_true(self) -> bool:
518567
return True
519568

@@ -762,6 +811,18 @@ def visit_not_starts_with(self, term: BoundTerm, literal: LiteralValue) -> bool:
762811

763812
return ROWS_MIGHT_MATCH
764813

814+
def visit_st_contains(self, term: BoundTerm, geometry: bytes) -> bool:
815+
return ROWS_MIGHT_MATCH
816+
817+
def visit_st_intersects(self, term: BoundTerm, geometry: bytes) -> bool:
818+
return ROWS_MIGHT_MATCH
819+
820+
def visit_st_within(self, term: BoundTerm, geometry: bytes) -> bool:
821+
return ROWS_MIGHT_MATCH
822+
823+
def visit_st_overlaps(self, term: BoundTerm, geometry: bytes) -> bool:
824+
return ROWS_MIGHT_MATCH
825+
765826
def visit_true(self) -> bool:
766827
return ROWS_MIGHT_MATCH
767828

@@ -905,6 +966,8 @@ def visit_bound_predicate(self, predicate: BoundPredicate) -> BooleanExpression:
905966
pred = predicate.as_unbound(field.name, predicate.literal)
906967
elif isinstance(predicate, BoundSetPredicate):
907968
pred = predicate.as_unbound(field.name, predicate.literals)
969+
elif isinstance(predicate, BoundSpatialPredicate):
970+
raise NotImplementedError("Spatial predicate translation is not supported when source columns are missing")
908971
else:
909972
raise ValueError(f"Unsupported predicate: {predicate}")
910973

@@ -926,6 +989,8 @@ def visit_bound_predicate(self, predicate: BoundPredicate) -> BooleanExpression:
926989
return predicate.as_unbound(file_column_name, predicate.literal)
927990
elif isinstance(predicate, BoundSetPredicate):
928991
return predicate.as_unbound(file_column_name, predicate.literals)
992+
elif isinstance(predicate, BoundSpatialPredicate):
993+
return predicate.as_unbound(file_column_name, predicate.geometry)
929994
else:
930995
raise ValueError(f"Unsupported predicate: {predicate}")
931996

@@ -1065,6 +1130,18 @@ def visit_starts_with(self, term: BoundTerm, literal: LiteralValue) -> list[tupl
10651130
def visit_not_starts_with(self, term: BoundTerm, literal: LiteralValue) -> list[tuple[str, str, Any]]:
10661131
return []
10671132

1133+
def visit_st_contains(self, term: BoundTerm, geometry: bytes) -> list[tuple[str, str, Any]]:
1134+
return []
1135+
1136+
def visit_st_intersects(self, term: BoundTerm, geometry: bytes) -> list[tuple[str, str, Any]]:
1137+
return []
1138+
1139+
def visit_st_within(self, term: BoundTerm, geometry: bytes) -> list[tuple[str, str, Any]]:
1140+
return []
1141+
1142+
def visit_st_overlaps(self, term: BoundTerm, geometry: bytes) -> list[tuple[str, str, Any]]:
1143+
return []
1144+
10681145
def visit_true(self) -> list[tuple[str, str, Any]]:
10691146
return [] # Not supported
10701147

@@ -1153,6 +1230,18 @@ def _is_nan(self, val: Any) -> bool:
11531230
# In the case of None or other non-numeric types
11541231
return False
11551232

1233+
def visit_st_contains(self, term: BoundTerm, geometry: bytes) -> bool:
1234+
return ROWS_MIGHT_MATCH
1235+
1236+
def visit_st_intersects(self, term: BoundTerm, geometry: bytes) -> bool:
1237+
return ROWS_MIGHT_MATCH
1238+
1239+
def visit_st_within(self, term: BoundTerm, geometry: bytes) -> bool:
1240+
return ROWS_MIGHT_MATCH
1241+
1242+
def visit_st_overlaps(self, term: BoundTerm, geometry: bytes) -> bool:
1243+
return ROWS_MIGHT_MATCH
1244+
11561245

11571246
class _InclusiveMetricsEvaluator(_MetricsEvaluator):
11581247
struct: StructType
@@ -1739,6 +1828,18 @@ def visit_starts_with(self, term: BoundTerm, literal: LiteralValue) -> bool:
17391828
def visit_not_starts_with(self, term: BoundTerm, literal: LiteralValue) -> bool:
17401829
return ROWS_MIGHT_NOT_MATCH
17411830

1831+
def visit_st_contains(self, term: BoundTerm, geometry: bytes) -> bool:
1832+
return ROWS_MIGHT_NOT_MATCH
1833+
1834+
def visit_st_intersects(self, term: BoundTerm, geometry: bytes) -> bool:
1835+
return ROWS_MIGHT_NOT_MATCH
1836+
1837+
def visit_st_within(self, term: BoundTerm, geometry: bytes) -> bool:
1838+
return ROWS_MIGHT_NOT_MATCH
1839+
1840+
def visit_st_overlaps(self, term: BoundTerm, geometry: bytes) -> bool:
1841+
return ROWS_MIGHT_NOT_MATCH
1842+
17421843
def _get_field(self, field_id: int) -> NestedField:
17431844
field = self.struct.field(field_id=field_id)
17441845
if field is None:

0 commit comments

Comments
 (0)