Skip to content

Commit 55bb0dc

Browse files
committed
Optimize history matching code.
1 parent 0605094 commit 55bb0dc

2 files changed

Lines changed: 143 additions & 68 deletions

File tree

.claude/TODO.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ Short running list of in-progress / upcoming work. Edit freely; trim older compl
1212
- *Overture release*`scripts/overture/download.py` already resolves a concrete release (pinned or auto-detected) inside `download_overture_snapshot`; currently only the `.parts/<release>/` directory records it and `.parts/` is deleted on success. Surface the resolved release by writing `~/data/openpois/snapshots/overture/<version>/download_metadata.json` with `{"release": "2026-04-15.0", ...}` before the cleanup step. `_resolve_overture_release` reads that file ahead of the `.parts/` heuristic.
1313
- *Turnover-model commit*`scripts/models/osm_turnover.py` should capture `git rev-parse HEAD` at training time and either (a) extend `config.write_self("model_output")` to include a `git_commit` entry or (b) drop a `git_commit.txt` next to the model artifacts. `_resolve_model_commit` reads that value instead of the publish-time HEAD, which is the right fingerprint if code has changed between training and publishing.
1414
- Publishing behaviour: if any of the three files is missing, keep the current fallback (and print a visible warning) so old pipeline runs still publish cleanly.
15+
- [ ] **DuckDB `ST_Distance_Sphere` returns wrong distances in v1.4.1.** Added 2026-05-19. The bundled spherical distance is off by ~25 % at continental scale (NYC → LA registers as ~4,900 km vs the correct ~3,940 km) and similarly inflated at small scales (a Seattle 65 m pair reads as 43 m). [src/openpois/conflation/change_detection.py](../src/openpois/conflation/change_detection.py) used to depend on it for the R1 current-OSM-survivor filter and silently produced ~4 spurious suppressions per Seattle run as a result; the implementation has been switched to a sklearn BallTree haversine query. Any *new* use of `ST_Distance_Sphere` anywhere in the pipeline should be audited — prefer BallTree or shapely-on-projected-CRS. Tracked separately from the WSL2 httpfs pin below; both should be revisited when we bump DuckDB.
1516
- [ ] Watch for a DuckDB release that fixes the WSL2 httpfs "Information loss on integer cast" crash (issue #21669, fix PR #21395). Once a tagged release ships with the fix and a full `scripts/overture/download.py` run on WSL2 completes, we can unpin from `duckdb==1.4.1` and revert the per-part download to a single-query DuckDB scan. Added 2026-04-17.
1617
- [ ] Auto-check taxonomy changes whenever we switch to a new Overture Maps version (detect new/removed L0/L1/L2 categories vs. `taxonomy_crosswalk_overture_maps.csv` and flag gaps). Added 2026-04-16.
1718
- [ ] Watch for Overture L0/L1 → flat `basic_category` migration (~June 2026). Crosswalk CSV + `assign_overture_shared_label` will need updating. See [docs/taxonomy-setup.md](docs/taxonomy-setup.md).

src/openpois/conflation/change_detection.py

Lines changed: 142 additions & 68 deletions
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,7 @@
3030
import pandas as pd
3131
import pyarrow.parquet as pq
3232
from rapidfuzz import fuzz
33+
from sklearn.neighbors import BallTree
3334

3435
from openpois.conflation.ghost_osm import _is_token_subset_or_superset
3536
from openpois.conflation.match import (
@@ -240,6 +241,9 @@ def find_shadow_matches(
240241
].reset_index(drop = True)
241242

242243

244+
_EARTH_RADIUS_M = 6_371_000.0
245+
246+
243247
def apply_current_survivor_filter(
244248
matches: pd.DataFrame,
245249
unmatched_overture: gpd.GeoDataFrame,
@@ -248,21 +252,39 @@ def apply_current_survivor_filter(
248252
radius_m: float,
249253
name_similarity_threshold: float,
250254
test_bbox: dict | None = None,
251-
duckdb_memory_limit: str = "6GB",
255+
query_chunk_size: int = 50_000,
252256
verbose: bool = True,
253257
) -> tuple[pd.DataFrame, int]:
254258
"""Drop shadow matches where the POI is still present in the live
255259
OSM snapshot under a different geometry / spelling.
256260
257-
For each match, spatial-joins the Overture POI's centroid against
258-
the rated OSM snapshot for any feature within ``radius_m``. If any
259-
such feature's ``name`` token_set_ratio against the Overture name
260-
is ≥ ``name_similarity_threshold``, the match is dropped — the POI
261-
isn't gone, the primary matcher just missed it.
262-
263-
Returns ``(kept_matches, n_dropped)``. The DuckDB spatial join is
264-
bounded by ``test_bbox`` when given so the Seattle A/B path stays
265-
fast.
261+
For each Overture row in ``matches``, find the live rated-snapshot
262+
POIs within ``radius_m`` and check whether any name token-set-
263+
matches the Overture name at ≥ ``name_similarity_threshold``. If
264+
so, drop the match — the primary matcher just missed it.
265+
266+
Scales to nationwide via:
267+
268+
- A single BallTree over rated-snapshot centroids (haversine
269+
metric) instead of a DuckDB cross-join. Build is O(M log M),
270+
query is O(log M) per match. The earlier DuckDB
271+
``ST_Distance_Sphere`` implementation also returned distances
272+
that were ~50% off in the v1.4.1 pin we use today (the
273+
bundled formula is buggy — NYC → LA registers as ~4,900 km
274+
versus the correct ~3,940 km), which silently broke the 50 m
275+
gate. BallTree haversine here is the reference implementation.
276+
- DuckDB is still used for the *load*: it computes centroids in
277+
SQL and emits just ``(lon, lat, name)``, avoiding the ~8 GB
278+
peak from materializing 8.7 M shapely Points via GeoPandas.
279+
The optional ``test_bbox`` pushes down into that SQL for the
280+
Seattle path.
281+
- Snapshot rows with empty names are dropped before the tree
282+
build — they can't satisfy the name gate anyway, and this
283+
typically halves the tree size on real OSM data.
284+
- The match side is queried in chunks of ``query_chunk_size`` so
285+
peak memory stays bounded regardless of match count.
286+
287+
Returns ``(kept_matches, n_dropped)``.
266288
"""
267289
if matches.empty:
268290
return matches.copy(), 0
@@ -275,80 +297,132 @@ def apply_current_survivor_filter(
275297
f"{radius_m}m, name>={name_similarity_threshold}"
276298
)
277299

278-
ov_lons = unmatched_overture.geometry.x.to_numpy()
279-
ov_lats = unmatched_overture.geometry.y.to_numpy()
280-
bbox = test_bbox or {
281-
"xmin": float(np.min(ov_lons[ov_idx_arr])) - 0.01,
282-
"ymin": float(np.min(ov_lats[ov_idx_arr])) - 0.01,
283-
"xmax": float(np.max(ov_lons[ov_idx_arr])) + 0.01,
284-
"ymax": float(np.max(ov_lats[ov_idx_arr])) + 0.01,
285-
}
286-
300+
# ---------------- snapshot load + centroid extraction ----------
301+
# DuckDB extracts (lon, lat, name) directly from the parquet,
302+
# computing centroids server-side in C and emitting three thin
303+
# columns. This avoids materializing 8.7M shapely Point objects
304+
# (which costs ~8 GB peak when going through GeoPandas) and lets
305+
# us push the bbox filter into the SQL for the Seattle path.
306+
# Filters out unnamed rows up front — they can't satisfy the
307+
# name gate anyway, typically halves the tree size.
308+
if verbose:
309+
print(
310+
f" Loading rated snapshot centroids from "
311+
f"{rated_snapshot_path} ..."
312+
)
313+
bbox_clause = ""
314+
if test_bbox is not None:
315+
bbox_clause = (
316+
"AND lon BETWEEN "
317+
f"{test_bbox['xmin']} AND {test_bbox['xmax']} "
318+
"AND lat BETWEEN "
319+
f"{test_bbox['ymin']} AND {test_bbox['ymax']}"
320+
)
287321
con = duckdb.connect()
288-
con.execute(f"SET memory_limit = '{duckdb_memory_limit}'")
289-
con.execute("INSTALL spatial; LOAD spatial;")
290-
ov_subset = pd.DataFrame({
291-
"match_idx": np.arange(len(matches)),
292-
"ov_name": _to_str_array(
293-
unmatched_overture["name"]
294-
)[ov_idx_arr],
295-
"ov_lon": ov_lons[ov_idx_arr],
296-
"ov_lat": ov_lats[ov_idx_arr],
297-
})
298-
ov_subset.to_parquet("/tmp/cd_r1_ov.parquet")
299-
300-
nearby = con.execute(f"""
301-
SELECT ov.match_idx, ov.ov_name,
302-
s.name AS osm_name,
303-
ST_Distance_Sphere(
304-
ST_Point(ov.ov_lon, ov.ov_lat),
305-
ST_Centroid(s.geometry)
306-
) AS dist_m
307-
FROM read_parquet('/tmp/cd_r1_ov.parquet') ov
308-
JOIN read_parquet('{rated_snapshot_path}') s
309-
ON ST_Distance_Sphere(
310-
ST_Point(ov.ov_lon, ov.ov_lat),
311-
ST_Centroid(s.geometry)
312-
) <= {radius_m}
313-
AND ST_X(ST_Centroid(s.geometry))
314-
BETWEEN {bbox['xmin']} AND {bbox['xmax']}
315-
AND ST_Y(ST_Centroid(s.geometry))
316-
BETWEEN {bbox['ymin']} AND {bbox['ymax']}
317-
""").fetch_df()
318-
con.close()
322+
try:
323+
con.execute("INSTALL spatial; LOAD spatial;")
324+
snap = con.execute(f"""
325+
WITH centroids AS (
326+
SELECT
327+
ST_X(ST_Centroid(geometry)) AS lon,
328+
ST_Y(ST_Centroid(geometry)) AS lat,
329+
COALESCE(name, '') AS name
330+
FROM read_parquet('{rated_snapshot_path}')
331+
)
332+
SELECT lon, lat, name
333+
FROM centroids
334+
WHERE name <> ''
335+
AND lon IS NOT NULL
336+
AND lat IS NOT NULL
337+
{bbox_clause}
338+
""").fetch_df()
339+
finally:
340+
con.close()
341+
342+
snap_lons = snap["lon"].to_numpy()
343+
snap_lats = snap["lat"].to_numpy()
344+
snap_names = snap["name"].to_numpy()
345+
del snap
346+
gc.collect()
319347

320348
if verbose:
321349
print(
322-
f" {len(nearby):,} nearby-OSM candidate rows; "
323-
f"computing token_set_ratio ..."
350+
f" Snapshot rows in scope (named, non-NaN, in bbox): "
351+
f"{len(snap_lons):,}"
324352
)
325353

326-
if not len(nearby):
354+
if len(snap_lons) == 0:
327355
return matches.copy(), 0
328356

329-
nearby["sim"] = [
330-
fuzz.token_set_ratio(str(a), str(b))
331-
for a, b in zip(
332-
nearby["ov_name"].astype(str),
333-
nearby["osm_name"].astype(str),
357+
# ---------------- BallTree build ------------------------------
358+
if verbose:
359+
print(
360+
f" Building BallTree on {len(snap_lons):,} points "
361+
f"(haversine, ~140 MB at 8.7 M points) ..."
362+
)
363+
snap_coords_rad = np.column_stack([
364+
np.deg2rad(snap_lats), np.deg2rad(snap_lons),
365+
])
366+
tree = BallTree(snap_coords_rad, metric = "haversine")
367+
del snap_coords_rad
368+
gc.collect()
369+
370+
# ---------------- per-match query + name check ----------------
371+
ov_lons = unmatched_overture.geometry.x.to_numpy()[ov_idx_arr]
372+
ov_lats = unmatched_overture.geometry.y.to_numpy()[ov_idx_arr]
373+
ov_names = _to_str_array(unmatched_overture["name"])[ov_idx_arr]
374+
radius_rad = radius_m / _EARTH_RADIUS_M
375+
376+
suppress_idx_set: set[int] = set()
377+
n_pairs_scored = 0
378+
n_chunks = (len(matches) + query_chunk_size - 1) // query_chunk_size
379+
if verbose:
380+
print(
381+
f" Querying for {len(matches):,} matches in "
382+
f"{n_chunks} chunk(s) of up to {query_chunk_size:,} ..."
334383
)
335-
]
336-
suppress_idx = (
337-
nearby[nearby["sim"] >= name_similarity_threshold]
338-
["match_idx"]
339-
.unique()
340-
)
384+
385+
for start in range(0, len(matches), query_chunk_size):
386+
end = min(start + query_chunk_size, len(matches))
387+
chunk_coords = np.column_stack([
388+
np.deg2rad(ov_lats[start:end]),
389+
np.deg2rad(ov_lons[start:end]),
390+
])
391+
neighbor_idx_per_match = tree.query_radius(
392+
chunk_coords, r = radius_rad,
393+
)
394+
395+
for local_i, idx_arr in enumerate(neighbor_idx_per_match):
396+
if len(idx_arr) == 0:
397+
continue
398+
ov_name = ov_names[start + local_i]
399+
if not ov_name:
400+
continue
401+
global_i = start + local_i
402+
for snap_i in idx_arr:
403+
snap_name = snap_names[snap_i]
404+
if not snap_name:
405+
continue
406+
n_pairs_scored += 1
407+
sim = fuzz.token_set_ratio(ov_name, snap_name)
408+
if sim >= name_similarity_threshold:
409+
suppress_idx_set.add(global_i)
410+
break # short-circuit; one hit is enough
341411

342412
if verbose:
343-
print(f" R1 suppressed: {len(suppress_idx)}")
413+
print(
414+
f" Scored {n_pairs_scored:,} name-similarity pairs; "
415+
f"R1 suppressed: {len(suppress_idx_set)}"
416+
)
344417

345-
if not len(suppress_idx):
418+
if not suppress_idx_set:
346419
return matches.copy(), 0
347420

421+
suppress_idx_arr = np.fromiter(suppress_idx_set, dtype = int)
348422
keep_mask = np.ones(len(matches), dtype = bool)
349-
keep_mask[suppress_idx] = False
423+
keep_mask[suppress_idx_arr] = False
350424
kept = matches.loc[keep_mask].reset_index(drop = True)
351-
return kept, int(len(suppress_idx))
425+
return kept, int(len(suppress_idx_set))
352426

353427

354428
def apply_shadow_match(

0 commit comments

Comments
 (0)