Skip to content

Commit c448799

Browse files
committed
Add user notes for bbox strategy.
1 parent c75901b commit c448799

1 file changed

Lines changed: 26 additions & 0 deletions

File tree

src/openpois/io/geohash_partition.py

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -182,6 +182,32 @@ def write_label_partitioned_dataset(
182182
Partition values that are not alphanumeric (e.g., ``"Fast Food
183183
Restaurant"``) are URL-encoded in the directory name. DuckDB's
184184
``hive_partitioning=1`` decodes these transparently at read time.
185+
186+
Output files include a GeoParquet 1.1 ``covering.bbox`` struct column
187+
(``write_covering_bbox=True``) and use ``row_group_size=100_000`` so
188+
spatial bbox predicates can prune row groups. Gotchas for consumers:
189+
190+
* Predicate pushdown only fires when filters reference the ``bbox``
191+
struct fields directly — e.g.,
192+
``WHERE bbox.xmin <= ? AND bbox.xmax >= ? AND bbox.ymin <= ? AND
193+
bbox.ymax >= ?``. ``ST_Intersects(geometry, …)`` alone will not
194+
trigger bbox pruning in DuckDB; pass both predicates or rely on
195+
GeoPandas's ``read_parquet(..., bbox=...)`` which adds the struct
196+
filter automatically.
197+
* Minimum reader versions: DuckDB ≥ 0.10, PyArrow ≥ 15, GeoPandas
198+
≥ 1.0, GDAL ≥ 3.8. Older readers see the ``bbox`` column as a
199+
plain struct and silently skip the pruning optimization.
200+
* The ``bbox`` column appears in ``DESCRIBE`` output and in
201+
``SELECT *`` results. Downstream pipelines that materialize all
202+
columns will pick it up; drop it explicitly if it conflicts with a
203+
schema contract.
204+
* Writing requires ``geopandas >= 1.0``; calling this function from
205+
an older environment will fail with an unexpected-kwarg error on
206+
``write_covering_bbox``.
207+
* ``row_group_size=100_000`` is fine for small partitions (they end
208+
up as a single row group) but it caps pruning granularity for very
209+
large partitions. Tune downward if a single file grows past ~5M
210+
rows and viewport queries still feel slow.
185211
"""
186212
output_dir = Path(output_dir)
187213

0 commit comments

Comments
 (0)