Define validation policy for antimeridian-crossing bboxes and geometries #531

@sethfitz

Description

Problem

Validation of bbox columns and feature geometries does not currently define behavior at the 180°/-180° meridian. Today, bbox validation in PySpark checks latitude completeness, ordering (ymin <= ymax), and range ([-90, 90]); Pydantic checks only completeness. Neither validator constrains xmin, xmax, or geometry coordinate ranges.

The decision affects two distinct surfaces:

  • Bbox column encoding — how a row whose feature spans 180° is represented in the per-row bbox struct that downstream readers use for spatial pushdown via Parquet statistics.
  • Feature geometry encoding — whether a LineString/Polygon that crosses 180° is stored as one geometry with coordinates spanning the meridian, or split into a Multi-geometry on either side.

These can be decided independently, but the prior art treats them as a pair.

Hard constraint: GeoParquet and reader compatibility

Whatever convention we adopt has to round-trip through GeoParquet readers without producer-side workarounds. Concretely:

  • Bbox struct values have to be interpretable by readers that follow the GeoParquet spec (which defers to RFC 7946 §5.2).
  • Feature geometries have to be parseable by Shapely, GEOS, and PROJ without coordinate values that violate their domain assumptions. Shapely treats longitude as a Cartesian x; predicates and set operations (contains, intersects, union) produce wrong results when an input geometry has vertices on both sides of 180° in a single ring.

This rules out any encoding that emits longitudes outside [-180, 180] in either the bbox or the geometry, regardless of how convenient it would be internally.
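A minimal sketch of what a range check compatible with this constraint could look like, in plain Python. `BBox` and `bbox_range_errors` are hypothetical names for illustration and are not part of the existing PySpark or Pydantic validators. The sketch assumes the RFC 7946 convention is adopted, so it deliberately does not require xmin <= xmax.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical stand-in for the per-row bbox struct."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def bbox_range_errors(b: BBox) -> list[str]:
    """Return a list of range violations; an empty list means none found."""
    errors = []
    # Longitudes must stay inside [-180, 180]. NaN fails this comparison too.
    for name in ("xmin", "xmax"):
        value = getattr(b, name)
        if not -180.0 <= value <= 180.0:
            errors.append(f"{name}={value} outside [-180, 180]")
    # Latitudes must stay inside [-90, 90] and be ordered.
    for name in ("ymin", "ymax"):
        value = getattr(b, name)
        if not -90.0 <= value <= 90.0:
            errors.append(f"{name}={value} outside [-90, 90]")
    if b.ymin > b.ymax:
        errors.append("ymin > ymax")
    # Assumption: xmin > xmax is allowed and means "crosses 180°",
    # so there is intentionally no longitude ordering check here.
    return errors
```

Under these assumptions, `bbox_range_errors(BBox(177.0, -20.0, -178.0, -16.0))` returns an empty list, while a bbox with `xmax=182.0` is reported as out of range.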

Prior art

RFC 7946 (GeoJSON) — two relevant sections:

  • §5.2 (bboxes): an antimeridian-crossing bbox has its west-edge longitude greater than its east-edge longitude. The Fiji example: [177.0, -20.0, -178.0, -16.0] is a 5° span across the meridian; [-178.0, -20.0, 177.0, -16.0] is the complementary 355° span. The ordering inversion is the signal, not an error.
  • §3.1.9 (geometries): "Any geometry that crosses the antimeridian SHOULD be represented by cutting it in two such that neither part's representation crosses the antimeridian." SHOULD, not MUST — a recommendation, but the strongest one the RFC offers.
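The §5.2 convention can be sketched as two small helpers: one that reads the ordering inversion as the crossing signal, and one that computes the longitude span either way. The function names are illustrative only.

```python
def crosses_antimeridian(xmin: float, xmax: float) -> bool:
    """Under RFC 7946 §5.2, west edge > east edge signals a crossing bbox."""
    return xmin > xmax


def lon_span(xmin: float, xmax: float) -> float:
    """Degrees of longitude covered, travelling east from xmin to xmax."""
    if crosses_antimeridian(xmin, xmax):
        # East from xmin up to 180°, then from -180° up to xmax.
        return (180.0 - xmin) + (xmax + 180.0)
    return xmax - xmin
```

With the Fiji values, `lon_span(177.0, -178.0)` gives 5.0 and `lon_span(-178.0, 177.0)` gives 355.0, matching the two spans quoted from the RFC.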

GeoParquet — adopted RFC 7946's bbox convention explicitly (opengeospatial/geoparquet#112, resolved in PR #145). The spec acknowledges spatial-filter pushdown is degraded for crossing bboxes: a row with xmin > xmax matches when x >= xmin OR x <= xmax, which Parquet column statistics can't express directly. Readers either special-case the wraparound or accept that crossing rows aren't pruned.
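A rough sketch of the wraparound special case a reader would need, written as plain Python rather than as a Parquet filter expression. The helper names are hypothetical; the logic is the `x >= xmin OR x <= xmax` rule described above, extended to bbox-vs-bbox overlap by splitting a crossing range into two ordinary intervals.

```python
def lon_in_range(x: float, xmin: float, xmax: float) -> bool:
    """Longitude membership test that honours the xmin > xmax convention."""
    if xmin > xmax:
        # Crossing bbox: the valid band wraps through 180°/-180°.
        return x >= xmin or x <= xmax
    return xmin <= x <= xmax


def lon_intervals(xmin: float, xmax: float) -> list[tuple[float, float]]:
    """Express a longitude range as one or two non-wrapping intervals."""
    if xmin > xmax:
        return [(xmin, 180.0), (-180.0, xmax)]
    return [(xmin, xmax)]


def lon_ranges_intersect(a_xmin: float, a_xmax: float,
                         b_xmin: float, b_xmax: float) -> bool:
    """True if two longitude ranges overlap; either may cross 180°."""
    return any(
        a_lo <= b_hi and b_lo <= a_hi
        for a_lo, a_hi in lon_intervals(a_xmin, a_xmax)
        for b_lo, b_hi in lon_intervals(b_xmin, b_xmax)
    )
```

The `or` in `lon_in_range` is the part that min/max column statistics cannot express as a single range, which is why crossing rows are either special-cased like this or left unpruned.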

STAC — same convention as GeoJSON/GeoParquet. stac-check ships a geometry_validation.bbox_antimeridian rule that flags non-conforming bboxes.

PostGIS — provides ST_ShiftLongitude and ST_WrapX for operations on crossing geometries; the canonical answer is to shift into 0-360 space for computation, not to store outside [-180, 180].
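The same "shift for computation, store in [-180, 180]" idea can be sketched in Python. This is an illustration of the approach, not a reimplementation of the PostGIS functions; the helper names are hypothetical, and the caller is assumed to already know that the feature straddles 180°.

```python
def shift_lon(lon: float) -> float:
    """Map [-180, 0) onto [180, 360); leave [0, 180] unchanged."""
    return lon + 360.0 if lon < 0 else lon


def unshift_lon(lon: float) -> float:
    """Map (180, 360) back onto (-180, 0); leave [0, 180] unchanged."""
    return lon - 360.0 if lon > 180.0 else lon


def crossing_lon_extent(lons: list[float]) -> tuple[float, float]:
    """West and east edges for longitudes known to straddle 180°.

    Assumption: the input really does cross the antimeridian. For a feature
    around Greenwich this shift would produce the wrong extent.
    """
    shifted = [shift_lon(x) for x in lons]
    # min/max are computed in 0-360 space, then wrapped back for storage.
    return unshift_lon(min(shifted)), unshift_lon(max(shifted))
```

For example, `crossing_lon_extent([179.0, -179.5, 177.0, -178.0])` returns `(177.0, -178.0)`: the computation happens in 0-360 space, but the stored edges stay inside [-180, 180] and follow the west > east convention.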

Shapely / GEOS — no native antimeridian awareness. Community guidance (Towards Data Science article and related libraries like antimeridian) is to split at 180° before computation. A geometry that crosses 180° as a single ring is treated by Shapely as wrapping the long way around the globe.
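To make "split at 180° before computation" concrete, here is a dependency-free sketch that splits a line's coordinate list into parts that each stay on one side. It is not the API of Shapely or of the antimeridian library. It rests on two labeled assumptions: a longitude jump of more than 180° between consecutive vertices means the segment takes the short way across the antimeridian, and the crossing latitude is found by straight-line interpolation in lon/lat rather than along a great circle.

```python
Coord = tuple[float, float]  # (lon, lat)


def split_at_antimeridian(coords: list[Coord]) -> list[list[Coord]]:
    """Split a line into parts so that no part crosses 180°."""
    parts = [[coords[0]]]
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x2 - x1) > 180.0:  # heuristic: segment crosses the antimeridian
            # Unwrap x2 so the segment is continuous in longitude.
            x2u = x2 + 360.0 if x1 > 0 else x2 - 360.0
            edge = 180.0 if x1 > 0 else -180.0
            # Guard the degenerate case where both ends sit on the meridian.
            t = 0.0 if x2u == x1 else (edge - x1) / (x2u - x1)
            y = y1 + t * (y2 - y1)
            parts[-1].append((edge, y))   # close the current part at ±180
            parts.append([(-edge, y)])    # reopen on the other side
        parts[-1].append((x2, y2))
    return parts
```

Splitting `[(179.0, 0.0), (-179.0, 2.0)]` yields two parts, `[(179.0, 0.0), (180.0, 1.0)]` and `[(-180.0, 1.0), (-179.0, 2.0)]`, which could then be stored as a MultiLineString. Polygons need more care (ring closure and orientation) and are not covered by this sketch.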

Considerations

Pushdown vs. correctness. Bbox-on-row exists to drive spatial pruning via Parquet row-group stats. Any encoding that lets a single bbox represent both sides of the meridian (RFC 7946's xmin > xmax, or world-spanning [-180, 180]) degrades pushdown for crossing features. The trade-off is between "rare false positives on Pacific queries" (RFC 7946 convention) and "every Pacific feature false-positives globally" (world-spanning).

Schema shape. Changing bbox from a single struct to a list of structs would preserve pushdown per-part but breaks the column statistics contract the existing distribution relies on. Any encoding that keeps bbox as a single struct stays compatible with current readers.

Geometry/bbox coupling. GeoParquet bboxes are per-feature, not per-geometry, so splitting a crossing geometry into Multi-parts on either side doesn't help the bbox encoding question — the feature-level bbox still has to represent the full extent. The two decisions can be taken independently, but neither resolves the other.

Geometry-level validation. Out of scope for this issue. We don't currently validate geometries beyond their type. Coordinate-level geometry checks (including antimeridian-split enforcement) would require introducing Apache Sedona; that's a separate decision.

Producer responsibility. The schema validates what's emitted; it doesn't transform raw inputs. Whichever convention is chosen, producer pipelines need to know whether their inputs (often raw geometries crossing 180°) need to be split or re-encoded before they hit the validator. Worth confirming what the current Overture production pipeline emits before locking in checks that would reject existing data.

Registry alignment. The inverted index at s3://overturemaps-us-west-2/registry/ carries per-row bbox. The catalog's collections.parquet carries per-partition bbox. Both should follow the same convention as the row-level bbox column for spatial queries to compose.
