Commit 6bdc9a5

Update partitioning strategy to optimize for label-based queries. Moving forward, we will use PMTiles for local display.
1 parent 128e530 commit 6bdc9a5

7 files changed

Lines changed: 571 additions & 42 deletions

.claude/CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ Style: Black (format-on-save in VSCode). Lint: flake8 + pylint, configured in `p
 - [docs/data-sources.md](docs/data-sources.md) — URLs, auth, schema quirks for every source
 - [docs/taxonomy-setup.md](docs/taxonomy-setup.md) — crosswalk CSVs, build_taxonomy.py, frontend sync
 - [docs/data-versioning.md](docs/data-versioning.md) — `versions:` block, path resolution, external references
+- [docs/partitioning-strategy.md](docs/partitioning-strategy.md) — Hive layout of the partitioned Parquet (`shared_label` for conflated, `primary_tag` for OSM), query patterns, when each layout applies
 - [docs/turnover-model-methodology.md](docs/turnover-model-methodology.md) — statistical derivation of the POI turnover model with ZIE extension
 
 ## Running to-do
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Partitioning strategy

How the rated OSM snapshot and the conflated dataset are laid out on disk, and why.

## Why this layout

Historically both datasets were Hive-partitioned by a 4-character geohash (~1,000–3,000 cells over CONUS) and uploaded to S3 so the web frontend could fetch just the cells covering a map viewport. The current use case is different: **local, nationwide queries filtered primarily by destination type**, with spatial filters as a frequent secondary slice.

Geohash partitioning is actively bad for that pattern — a nationwide "all pharmacies" query has to open every geohash directory. Partitioning by destination type gives near-zero scan for type-filtered queries (one file instead of ~1,500), and we retain spatial efficiency by sorting each partition by geohash so bbox / state / region filters prune via Parquet row-group min/max stats.

Confirmed on the real data: `WHERE shared_label = 'Pharmacy'` on the 17.8 M-row conflated set scans `1/93` files in ~5 ms.
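
To make the row-group pruning argument concrete, here is a minimal pure-Python sketch. It only imitates Parquet min/max statistics; the helper names, the row-group size of 3 and the sample geohashes are illustrative assumptions, not part of the pipeline.

```python
def row_group_stats(sorted_keys, group_size):
    """Split sorted keys into fixed-size groups and record (min, max) per group."""
    groups = [
        sorted_keys[i:i + group_size]
        for i in range(0, len(sorted_keys), group_size)
    ]
    return [(g[0], g[-1]) for g in groups]


def groups_to_read(stats, prefix):
    """Indices of groups whose [min, max] range may hold a key starting with prefix."""
    upper = prefix + "\uffff"  # sorts after every key that starts with prefix
    return [i for i, (lo, hi) in enumerate(stats) if hi >= prefix and lo <= upper]


# Hypothetical geohash-6 values, sorted as they would be inside one partition.
keys = sorted([
    "9q5cs1", "9q5ctr", "9q8yyk", "9q9p1d", "c23nb6",
    "dp3wjz", "dr5reg", "dr5ru7", "drt2zp",
])
stats = row_group_stats(keys, group_size=3)
print(stats)
print(groups_to_read(stats, "9q5"))  # only the first group overlaps the prefix
print(groups_to_read(stats, "dr5"))  # only the last group overlaps the prefix
```

Because the keys are sorted, each group covers a narrow, non-overlapping geohash range, so a prefix filter touches one group here. With unsorted rows every group's range would tend to span the whole key space and nothing could be skipped.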

## Layouts

### Conflated (`conflated_partitioned/`)

| | |
|---|---|
| Path | `~/data/openpois/conflation/<versions.conflation>/conflated_partitioned/` |
| Partition column | `shared_label` (URL-encoded in dir name; DuckDB `hive_partitioning=1` decodes transparently) |
| Partitions | 93 (incl. one `shared_label=` bucket for ~720 k unlabeled POIs that don't map to any crosswalk entry) |
| Rows | 17,788,585 total for `20260423` |
| Within-partition sort | ascending `geohash` (precision 6, retained as a column) |
| Dropped at write | `shared_label` (lives in the Hive dir name) |
| On-disk size | ~2.7 GB for `20260423` |
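
Since the partition value lives only in the directory name, labels containing spaces or slashes have to be percent-encoded on write and decoded on read. The round trip can be sketched with the standard library alone, assuming plain percent-encoding with no safe characters. The exact encoding applied by `write_label_partitioned_dataset` is not shown in this document, and the labels other than `Pharmacy` are made up for illustration.

```python
from urllib.parse import quote, unquote


def partition_dir(column, value):
    """Build a Hive-style directory name, percent-encoding the value."""
    return f"{column}={quote(value, safe='')}"


def parse_partition_dir(dirname):
    """Split 'column=value' and decode the value back to its original form."""
    column, _, encoded = dirname.partition("=")
    return column, unquote(encoded)


# Hypothetical labels; the empty string yields a bare `shared_label=` name.
for label in ["Pharmacy", "Fast Food", "Bar/Pub", ""]:
    dirname = partition_dir("shared_label", label)
    print(repr(label), "->", dirname, "->", parse_partition_dir(dirname))
```

Encoding the slash matters most: left unescaped, it would create a nested directory instead of a single partition.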

### Rated OSM snapshot (`osm_snapshot_partitioned/`)

| | |
|---|---|
| Path | `~/data/openpois/snapshots/osm/<versions.osm_data>/osm_snapshot_partitioned/` |
| Partition column | derived `primary_tag` ∈ {shop, healthcare, leisure, amenity, tourism, office, craft, historic} |
| Partitions | 8 |
| Rows | 8,708,504 total for `20260417`. Distribution: amenity 4.90 M, leisure 2.22 M, shop 0.79 M, tourism 0.38 M, office 0.16 M, historic 0.12 M, healthcare 0.11 M, craft 0.03 M |
| Within-partition sort | ascending `geohash` (precision 6, retained as a column) |
| Dropped at write | `primary_tag` (lives in the Hive dir name) |
| On-disk size | ~1.2 GB for `20260417` (down from 1.9 GB under the old geohash layout) |

## `primary_tag` derivation (OSM)

~1.9% of rated OSM POIs carry more than one top-level tag (e.g., OSM id `25603734` has both `shop=convenience` and `amenity=fuel`). To pick one partition per POI we apply the same **first-non-null priority** already used by [assign_osm_shared_label()](../../src/openpois/conflation/taxonomy.py), sourced from [`config.yaml` `download.osm.filter_keys`](../../config.yaml):

```
shop > healthcare > leisure > amenity > tourism > office > craft > historic
```

This keeps OSM-only queries and conflation-side labeling consistent: a shop+amenity POI sits under `primary_tag=shop/` and the conflation side labels it via the `shop` crosswalk. All filter-key tag columns (`shop`, `amenity`, etc.) are retained inside the files, so a secondary filter like `primary_tag = 'shop' AND shop = 'bakery'` still works within the one partition that was opened.

Every POI in the rated snapshot has at least one filter-key tag populated (guaranteed by the PBF filtering step in [scripts/osm_snapshot/download.py](../../scripts/osm_snapshot/download.py)), so no null / `__unlabeled__` bucket is needed.
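
The first-non-null rule can be illustrated on a single POI represented as a plain dict. This is a sketch under assumptions: the real `compute_primary_osm_tag` operates on a GeoDataFrame and its implementation is not reproduced here, and the `primary_tag` helper and the sample POIs are hypothetical.

```python
# Priority order as listed above (from `download.osm.filter_keys`).
FILTER_KEYS = [
    "shop", "healthcare", "leisure", "amenity",
    "tourism", "office", "craft", "historic",
]


def primary_tag(tags, filter_keys=FILTER_KEYS):
    """Return the first key in priority order that has a non-empty value, else None."""
    for key in filter_keys:
        value = tags.get(key)
        if value is not None and value != "":
            return key
    return None


# A POI tagged both shop=convenience and amenity=fuel lands in the shop partition.
poi = {"shop": "convenience", "amenity": "fuel", "name": "Corner Stop"}
print(primary_tag(poi))

# healthcare outranks amenity, so a doubly tagged pharmacy goes under healthcare.
print(primary_tag({"amenity": "pharmacy", "healthcare": "pharmacy"}))
```

Keeping the priority list in one place means the partitioning step and the conflation labeling cannot drift apart as long as both read the same config key.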

## How to query

All examples use DuckDB with `hive_partitioning=1`, which URL-decodes partition values back to their original form. `CONFLATED` and `OSM` in the SQL snippets stand for the glob strings defined below; interpolate them into the query text as quoted string literals (for example with an f-string passed to `duckdb.sql`) before running.

```python
import duckdb

CONFLATED = "~/data/openpois/conflation/20260423/conflated_partitioned/**/*.parquet"
OSM = "~/data/openpois/snapshots/osm/20260417/osm_snapshot_partitioned/**/*.parquet"
```

**Type-only, nationwide — reads one file.**

```sql
SELECT COUNT(*) FROM read_parquet(CONFLATED, hive_partitioning=1)
WHERE shared_label = 'Pharmacy';
```

**Type + spatial bbox via `geohash` prefix — row-group pruning inside one partition.**

```sql
SELECT name, geohash
FROM read_parquet(CONFLATED, hive_partitioning=1)
WHERE shared_label = 'Pharmacy'
  AND geohash LIKE '9q5%'; -- western US geohash-3 cell
```

For lat/lon bboxes, convert to geohash prefixes with `pygeohash.bbox`/`expand`. A ZXY or state-level filter can usually be expressed as a small disjunction of `geohash LIKE` prefixes.
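
As a library-free illustration of turning a bbox into a disjunction of `geohash LIKE` prefixes, the sketch below implements standard geohash encoding and enumerates the precision-3 cells that intersect a box. The function names and the example bbox (roughly the Los Angeles area) are assumptions for illustration, and the code does not handle boxes that cross the antimeridian.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def encode(lat, lon, precision):
    """Standard geohash: interleave lon/lat bisection bits, 5 bits per character."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, nbits, even = [], 0, 0, True
    while len(chars) < precision:
        if even:  # even bit positions refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:  # odd bit positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        even = not even
        nbits += 1
        if nbits == 5:
            chars.append(BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def bbox_prefixes(min_lat, min_lon, max_lat, max_lon, precision):
    """Geohash cells of the given precision that intersect the bbox."""
    lon_bits = (5 * precision + 1) // 2
    lat_bits = (5 * precision) // 2
    dlon, dlat = 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits
    # Index of the first and last grid cell touched along each axis.
    i0, i1 = math.floor((min_lon + 180) / dlon), math.floor((max_lon + 180) / dlon)
    j0, j1 = math.floor((min_lat + 90) / dlat), math.floor((max_lat + 90) / dlat)
    cells = set()
    for i in range(i0, i1 + 1):
        for j in range(j0, j1 + 1):
            # Encode the cell centre so the point never sits on a cell boundary.
            cells.add(encode(-90 + (j + 0.5) * dlat, -180 + (i + 0.5) * dlon, precision))
    return sorted(cells)


prefixes = bbox_prefixes(33.7, -118.7, 34.4, -117.6, precision=3)
where = " OR ".join(f"geohash LIKE '{p}%'" for p in prefixes)
print(where)
```

The resulting string can be appended to a `WHERE shared_label = ...` clause. Prefix cells over-cover the bbox, so add an exact lat/lon predicate afterwards if the result must be clipped precisely.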

**Secondary filter inside an OSM partition.**

```sql
SELECT COUNT(*) FROM read_parquet(OSM, hive_partitioning=1)
WHERE primary_tag = 'shop' AND shop = 'bakery'; -- one file scanned
```

**Joining conflated and OSM (e.g., type breakdown by OSM tag).**

```sql
SELECT c.shared_label, o.primary_tag, COUNT(*)
FROM read_parquet(CONFLATED, hive_partitioning=1) c
JOIN read_parquet(OSM, hive_partitioning=1) o USING (osm_id)
WHERE c.shared_label = 'Pharmacy'
GROUP BY 1, 2;
```

## When NOT to use this layout

The geohash-partitioned layout is a better fit for **small-bbox, many-types-at-once** queries — which is exactly the web-map viewport case we moved away from. If the S3 / map-viewport path comes back, the helpers are still in place: see `add_geohash_columns` and `write_partitioned_dataset` in [src/openpois/io/geohash_partition.py](../../src/openpois/io/geohash_partition.py), and the original S3 upload step in [scripts/conflation/upload_to_s3.py](../../scripts/conflation/upload_to_s3.py). Swap the function calls in the two `format_for_upload.py` scripts back to the geohash variants.

## Maintenance

**Regenerate after a new conflation or snapshot run:**

```bash
python -u scripts/osm_snapshot/format_for_upload.py 2>&1 | tee ~/data/openpois/logs/osm_repartition_<version>.log
python -u scripts/conflation/format_for_upload.py 2>&1 | tee ~/data/openpois/logs/conflated_repartition_<version>.log
```

Each script deletes the existing partitioned directory at its versioned path and rewrites it. Geohash precision is controlled by `upload.geohash_precision_sort` in [config.yaml](../../config.yaml) (currently 6, i.e. cells roughly 1.2 km wide × 0.6 km tall at the equator).
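
The cell dimensions quoted for a given precision follow directly from how geohash splits its bits between longitude and latitude. A minimal sketch of that arithmetic, assuming a spherical Earth with about 111.32 km per degree at the equator (the helper name is hypothetical):

```python
import math


def geohash_cell_km(precision, lat_deg=0.0):
    """Approximate (width_km, height_km) of a geohash cell at the given latitude."""
    lon_bits = (5 * precision + 1) // 2  # longitude gets the extra bit when 5*precision is odd
    lat_bits = (5 * precision) // 2
    km_per_deg = 111.32                  # approx. length of one degree at the equator
    width = 360.0 / 2 ** lon_bits * km_per_deg * math.cos(math.radians(lat_deg))
    height = 180.0 / 2 ** lat_bits * km_per_deg
    return width, height


for p in (4, 6):
    w, h = geohash_cell_km(p)
    print(f"precision {p}: {w:.2f} km wide x {h:.2f} km tall at the equator")
```

This gives about 39 km by 20 km for precision 4 (the old partition key) and about 1.2 km by 0.6 km for precision 6 (the current sort key). Cell width shrinks with the cosine of latitude, so cells over CONUS are narrower than the equatorial figure.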

**Where the code lives:**

- [src/openpois/io/geohash_partition.py](../../src/openpois/io/geohash_partition.py) — `add_geohash_column`, `compute_primary_osm_tag`, `write_label_partitioned_dataset` (plus the older geohash-partition helpers).
- [scripts/conflation/format_for_upload.py](../../scripts/conflation/format_for_upload.py) — conflated partitioning entry point.
- [scripts/osm_snapshot/format_for_upload.py](../../scripts/osm_snapshot/format_for_upload.py) — OSM partitioning entry point.
- [tests/test_geohash_partition.py](../../tests/test_geohash_partition.py) — unit tests + a DuckDB Hive-decode round-trip.

**S3 upload is currently disabled.** `scripts/conflation/upload_to_s3.py` is not run as part of this flow. The `upload.latest_url_*` / `upload.s3_*` keys in `config.yaml` are stale but harmless; clean them up in a later pass if the frontend integration is formally retired.

config.yaml

Lines changed: 1 addition & 2 deletions
@@ -243,8 +243,7 @@ upload:
   s3_prefix_conflation: "snapshots/conflated"
   latest_url_osm: "https://openpois-public.s3.us-west-2.amazonaws.com/snapshots/osm/20260423/osm_snapshot_partitioned/"
   latest_url_conflation: "https://openpois-public.s3.us-west-2.amazonaws.com/snapshots/conflated/20260423/conflated_partitioned/"
-  geohash_precision_partition: 4 # ~39 km x 20 km cells; ~1,000–3,000 cells over CONUS
-  geohash_precision_sort: 6 # ~0.6 km x 1.2 km; fine-grained sort within each partition
+  geohash_precision_sort: 6 # ~0.6 km x 1.2 km; within-partition sort key for spatial row-group pruning
   # PMTiles generation — single-zoom archive at z14 for both OSM and conflated.
   # Site's View.minZoom is 14; OpenLayers over-zooms past z14 natively so
   # z15-20 render as lossless geometric scale-ups of the z14 tile.
Lines changed: 25 additions & 19 deletions
@@ -1,24 +1,25 @@
 """
-Spatially partition the conflated POI dataset for optimized web map viewport queries.
+Partition the conflated POI dataset by destination type for local queries.
 
-Reads conflated.parquet, adds a geohash-4 partition column computed from each POI's
-centroid, sorts rows within partitions by geohash-6 for spatial locality, and writes
-a Hive-style partitioned dataset:
+Reads conflated.parquet, adds a geohash sort key from each POI's centroid,
+and writes a Hive-style dataset partitioned by `shared_label`:
 
     conflated_partitioned/
-        geohash_prefix=9q/
-            part-0.parquet
-        geohash_prefix=dr/
-            part-0.parquet
+        shared_label=Pharmacy/part-0.parquet
+        shared_label=Restaurant/part-0.parquet
         ...
 
-Clients can fetch only the geohash cells covering their map viewport, avoiding a full
-dataset scan.
+Rows within each partition are sorted by the `geohash` column so spatial
+filters prune via Parquet row-group min/max stats. Queries like
+``WHERE shared_label = 'Pharmacy'`` read a single partition file.
 """
 import geopandas as gpd
 from config_versioned import Config
 
-from openpois.io.geohash_partition import add_geohash_columns, write_partitioned_dataset
+from openpois.io.geohash_partition import (
+    add_geohash_column,
+    write_label_partitioned_dataset,
+)
 
 # -----------------------------------------------------------------------------
 # Configuration
@@ -30,8 +31,8 @@
 OUTPUT_DIR = config.get_file_path("conflation", "partitioned")
 OVERWRITE = True
 
-PRECISION_PARTITION = config.get("upload", "geohash_precision_partition")
 PRECISION_SORT = config.get("upload", "geohash_precision_sort")
+PARTITION_COL = "shared_label"
 
 # -----------------------------------------------------------------------------
 # Main workflow
@@ -42,15 +43,20 @@
 gdf = gpd.read_parquet(INPUT_PATH)
 print(f"Loaded {len(gdf):,} POIs")
 
-print("Computing geohash columns from centroids ...")
-gdf = add_geohash_columns(
+print(f"Computing geohash-{PRECISION_SORT} sort column from centroids ...")
+gdf = add_geohash_column(gdf, precision = PRECISION_SORT)
+
+write_label_partitioned_dataset(
     gdf,
-    precision_partition = PRECISION_PARTITION,
-    precision_sort = PRECISION_SORT,
+    output_dir = OUTPUT_DIR,
+    partition_col = PARTITION_COL,
+    sort_col = "geohash",
+    overwrite = OVERWRITE,
 )
 
-write_partitioned_dataset(gdf, output_dir = OUTPUT_DIR, overwrite = OVERWRITE)
-
 n_partitions = sum(1 for _ in OUTPUT_DIR.iterdir() if _.is_dir())
-print(f"Done. Wrote {len(gdf):,} rows across {n_partitions} geohash partitions.")
+print(
+    f"Done. Wrote {len(gdf):,} rows across {n_partitions} "
+    f"{PARTITION_COL} partitions."
+)
 print(f"Output: {OUTPUT_DIR}")
Lines changed: 37 additions & 19 deletions
@@ -1,24 +1,31 @@
 """
-Spatially partition the rated OSM snapshot for optimized web map viewport queries.
+Partition the rated OSM snapshot by top-level tag for local queries.
 
-Reads osm_snapshot_rated.parquet, adds a geohash-4 partition column computed
-from each POI's centroid, sorts rows within partitions by geohash-6 for spatial
-locality, and writes a Hive-style partitioned dataset:
+Reads osm_snapshot_rated.parquet, derives a `primary_tag` per POI via first-
+non-null across the configured `download.osm.filter_keys` priority order
+(shop > healthcare > leisure > amenity > tourism > office > craft > historic,
+matching the priority in `openpois.conflation.taxonomy.assign_osm_shared_label`),
+adds a geohash sort key from each POI's centroid, and writes a Hive-style
+dataset:
 
     osm_snapshot_partitioned/
-        geohash_prefix=9q/
-            part-0.parquet
-        geohash_prefix=dr/
-            part-0.parquet
+        primary_tag=amenity/part-0.parquet
+        primary_tag=shop/part-0.parquet
         ...
 
-Clients can fetch only the geohash cells covering their map viewport, avoiding
-a full dataset scan.
+Rows within each partition are sorted by the `geohash` column so spatial
+filters prune via Parquet row-group min/max stats. Queries like
+``WHERE primary_tag = 'shop' AND shop = 'bakery'`` read a single partition
+file.
 """
 import geopandas as gpd
 from config_versioned import Config
 
-from openpois.io.geohash_partition import add_geohash_columns, write_partitioned_dataset
+from openpois.io.geohash_partition import (
+    add_geohash_column,
+    compute_primary_osm_tag,
+    write_label_partitioned_dataset,
+)
 
 # -----------------------------------------------------------------------------
 # Configuration
@@ -30,8 +37,9 @@
 OUTPUT_DIR = config.get_file_path("snapshot_osm", "partitioned")
 OVERWRITE = True
 
-PRECISION_PARTITION = config.get("upload", "geohash_precision_partition")
+FILTER_KEYS = config.get("download", "osm", "filter_keys")
 PRECISION_SORT = config.get("upload", "geohash_precision_sort")
+PARTITION_COL = "primary_tag"
 
 # -----------------------------------------------------------------------------
 # Main workflow
@@ -42,15 +50,25 @@
 gdf = gpd.read_parquet(INPUT_PATH)
 print(f"Loaded {len(gdf):,} POIs")
 
-print("Computing geohash columns from centroids ...")
-gdf = add_geohash_columns(
-    gdf,
-    precision_partition = PRECISION_PARTITION,
-    precision_sort = PRECISION_SORT,
+print(f"Deriving {PARTITION_COL} from filter_keys {FILTER_KEYS} ...")
+gdf = compute_primary_osm_tag(
+    gdf, filter_keys = FILTER_KEYS, out_col = PARTITION_COL
 )
 
-write_partitioned_dataset(gdf, output_dir = OUTPUT_DIR, overwrite = OVERWRITE)
+print(f"Computing geohash-{PRECISION_SORT} sort column from centroids ...")
+gdf = add_geohash_column(gdf, precision = PRECISION_SORT)
+
+write_label_partitioned_dataset(
+    gdf,
+    output_dir = OUTPUT_DIR,
+    partition_col = PARTITION_COL,
+    sort_col = "geohash",
+    overwrite = OVERWRITE,
+)
 
 n_partitions = sum(1 for _ in OUTPUT_DIR.iterdir() if _.is_dir())
-print(f"Done. Wrote {len(gdf):,} rows across {n_partitions} geohash partitions.")
+print(
+    f"Done. Wrote {len(gdf):,} rows across {n_partitions} "
+    f"{PARTITION_COL} partitions."
+)
 print(f"Output: {OUTPUT_DIR}")
