Commit 0481f63

Merge pull request #29 from henryspatialanalysis/feature/change-detection
OSM change detection: incorporate OSM changes and closures in conflation
2 parents 5313b2f + 55bb0dc commit 0481f63

14 files changed

Lines changed: 2730 additions & 17 deletions

.claude/TODO.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,11 +6,13 @@ Short running list of in-progress / upcoming work. Edit freely; trim older compl
66

77
## Upcoming
88

9+
- [ ] **Per-region calibration knob for the change-detection penalty.** Added 2026-05-19. Today `conflation.change_detection.default_delta` is a single global scalar (with per-`shared_label` overrides from the fitted turnover model). The model was fit on national OSM-history data, so the per-group δ values are a national average of OSM editor reliability. That assumption breaks in regions where OSM is sparse or stale — e.g., a "Restaurant deletion" in a rural county where OSM has low edit traffic may not be an actual closure, just an unmaintained entry. We should add a release-valve: allow `default_delta` (and ideally the per-`shared_label` deltas) to be overridden per state (or per Census place/county). Cleanest landing spot is a new optional CSV at `directories.model_output.regional_overrides` keyed by `(state_fips, shared_label) → delta_override`, and `change_detection.load_delta_lookup` would merge it in after the national values. Until we have a vetted set from a non-Seattle region we don't have data to calibrate this, but the hook should be in place. Tracking against the asymmetric-blindness problem documented in the May 2026 plan at `~/.claude/plans/our-current-deduplication-strategy-wild-graham.md`.
910
- [ ] **Auto-capture the three per-version README fields** so the publish step doesn't need `publish.version_metadata` overrides. Added 2026-04-24. Today `build_version_readme` in [src/openpois/publish/build_readme.py](../src/openpois/publish/build_readme.py) falls back to config overrides or best-effort guesses; aim is for the pipeline to write authoritative values alongside the data it produces, and the publish step to just read them.
1011
- *OSM snapshot date*: `scripts/osm_snapshot/download.py` should write a `~/data/openpois/snapshots/osm/<version>/download_metadata.json` containing `{"downloaded_at": "<ISO date>", "pbf_url": "..."}` after the PBF download completes. `_resolve_osm_snapshot_date` then reads that file before falling back to the version string.
1112
- *Overture release*: `scripts/overture/download.py` already resolves a concrete release (pinned or auto-detected) inside `download_overture_snapshot`; currently only the `.parts/<release>/` directory records it and `.parts/` is deleted on success. Surface the resolved release by writing `~/data/openpois/snapshots/overture/<version>/download_metadata.json` with `{"release": "2026-04-15.0", ...}` before the cleanup step. `_resolve_overture_release` reads that file ahead of the `.parts/` heuristic.
1213
- *Turnover-model commit*: `scripts/models/osm_turnover.py` should capture `git rev-parse HEAD` at training time and either (a) extend `config.write_self("model_output")` to include a `git_commit` entry or (b) drop a `git_commit.txt` next to the model artifacts. `_resolve_model_commit` reads that value instead of the publish-time HEAD, which is the right fingerprint if code has changed between training and publishing.
1314
- Publishing behaviour: if any of the three files is missing, keep the current fallback (and print a visible warning) so old pipeline runs still publish cleanly.
15+
- [ ] **DuckDB `ST_Distance_Sphere` returns wrong distances in v1.4.1.** Added 2026-05-19. The bundled spherical distance is off by ~25% at continental scale (NYC → LA registers as ~4,900 km vs the correct ~3,940 km) and is also wrong at small scales (a Seattle pair 65 m apart reads as 43 m). [src/openpois/conflation/change_detection.py](../src/openpois/conflation/change_detection.py) used to depend on it for the R1 current-OSM-survivor filter and silently produced ~4 spurious suppressions per Seattle run as a result; the implementation has been switched to a sklearn BallTree haversine query. Any *new* use of `ST_Distance_Sphere` anywhere in the pipeline should be audited; prefer BallTree or shapely-on-projected-CRS. Tracked separately from the WSL2 httpfs pin below; both should be revisited when we bump DuckDB.
1416
- [ ] Watch for a DuckDB release that fixes the WSL2 httpfs "Information loss on integer cast" crash (issue #21669, fix PR #21395). Once a tagged release ships with the fix and a full `scripts/overture/download.py` run on WSL2 completes, we can unpin from `duckdb==1.4.1` and revert the per-part download to a single-query DuckDB scan. Added 2026-04-17.
1517
- [ ] Auto-check taxonomy changes whenever we switch to a new Overture Maps version (detect new/removed L0/L1/L2 categories vs. `taxonomy_crosswalk_overture_maps.csv` and flag gaps). Added 2026-04-16.
1618
- [ ] Watch for Overture L0/L1 → flat `basic_category` migration (~June 2026). Crosswalk CSV + `assign_overture_shared_label` will need updating. See [docs/taxonomy-setup.md](docs/taxonomy-setup.md).
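
The NYC → LA figure quoted in the `ST_Distance_Sphere` item can be sanity-checked with a plain haversine computation. This is a minimal standard-library sketch, not the BallTree query the pipeline uses; the function name, the city coordinates, and the mean Earth radius constant are assumptions chosen for illustration.

```python
import math

# Assumption: mean Earth radius in metres (spherical approximation).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate city-centre coordinates (illustrative, not from the repo).
nyc_la_km = haversine_m(40.7128, -74.0060, 34.0522, -118.2437) / 1000
print(f"NYC -> LA: {nyc_la_km:,.0f} km")
```

A check like this is cheap enough to keep as a regression test next to any new spherical-distance call, so a library regression shows up as a failed assertion rather than as silent suppressions.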

config.yaml

Lines changed: 52 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,15 @@
11
# Versioned directories (used with config.get_dir_path())
22
versions:
3-
osm_data: "20260416"
3+
osm_data: "20260515"
44
model_output: "20260422_by_shared_label"
55
snapshot_osm: "20260417"
66
snapshot_overture: "20260423"
77
conflation: "20260423"
88
source_coop: "2026-04-23-v0" # Source Cooperative upload folder (YYYY-MM-DD-v<IDX>); bump v<IDX> only for same-day re-uploads
9+
# Ghost POI dataset reconstructed from OSM history (one row per
10+
# detected previous-state event). Pinned to the same value as
11+
# ``osm_data`` since it is derived from the same history parquets.
12+
ghost_osm: "20260515"
913

1014
# Settings for downloading data
1115
download:
@@ -188,6 +192,11 @@ directories:
188192
partitioned: conflated_partitioned
189193
pmtiles: conflated.pmtiles
190194
summary_by_label: summary_by_label.csv
195+
ghost_osm:
196+
versioned: true
197+
path: ~/data/openpois/ghost_osm
198+
files:
199+
ghosts: ghosts.parquet
191200
testing:
192201
versioned: false
193202
path: ~/data/openpois/testing
@@ -222,6 +231,48 @@ conflation:
222231
ymin: 47.50
223232
xmax: -122.25
224233
ymax: 47.70
234+
# Change-detection feature: use OSM history to penalize Overture POIs
235+
# that co-locate with a "ghost" — a previous state of an OSM element
236+
# (primary-tag deletion, lifecycle-prefix addition, or substantial
237+
# rename). Disabled by default for clean A/B testing.
238+
change_detection:
239+
enabled: false
240+
# Minimum composite score for an Overture × ghost shadow match.
241+
# Same scale as the main matcher's min_match_score.
242+
min_shadow_match_score: 0.50
243+
# rapidfuzz.fuzz.token_set_ratio threshold below which an OSM name
244+
# change is considered a "substantial rename" rather than a typo
245+
# fix. Range 0-100. Lower = stricter (fewer events emitted).
246+
name_change_similarity_threshold: 50
247+
# Fallback delta for ghosts whose shared_label isn't in the fitted
248+
# model's per-group params. Equals sigmoid(logit_delta_0) for the
249+
# current 20260422_by_shared_label fit (logit_delta_0 = -2.72).
250+
default_delta: 0.062
251+
# Hard gate on Overture-name vs ghost-prior-name token_set_ratio
252+
# (0-100), applied *before* the composite-score-based shadow
253+
# matcher. The default 0 keeps the loose matcher: any spatial +
254+
# type + composite match above ``min_shadow_match_score`` will
255+
# fire, even when Overture's name doesn't lexically match the
256+
# OSM ghost. This is intentional. A higher value would only
257+
# fire when Overture is showing the *same name* OSM closed,
258+
# which we explored in May 2026 (decision rule A) and rejected
259+
# because it loses the bulk of real closures where Overture has
260+
# already updated to a different current name at a churned
261+
# address (Sleep Train → Roosevelt Square etc.). Knob retained
262+
# for future data-quality-only modes; leave at 0 for production
263+
# change detection.
264+
min_prior_name_match_score: 0
265+
suppress_if_current_survivor:
266+
# Belt-and-suspenders post-filter: drop the penalty if a
267+
# *current* OSM POI within radius_m has name token_set_ratio
268+
# >= threshold against the Overture name. Catches cases where
269+
# the POI is still in OSM under different geometry (e.g.,
270+
# node remapped to a building way) and the primary matcher
271+
# missed it. Kept enabled because it's cheap and orthogonal
272+
# to min_prior_name_match_score.
273+
enabled: true
274+
radius_m: 50
275+
name_similarity_threshold: 70
225276

226277
# Settings for publishing snapshots to Source Cooperative
227278
# (https://source.coop/henryspatialanalysis/openpois). Source Coop is
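
The `default_delta` comment above states that the value equals `sigmoid(logit_delta_0)` with `logit_delta_0 = -2.72`. The arithmetic can be reproduced as follows; this is only a check of the stated relationship, and the helper name `sigmoid` is an assumption rather than a function from the repository.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Value quoted in the config comment for the 20260422_by_shared_label fit.
logit_delta_0 = -2.72
default_delta = sigmoid(logit_delta_0)
print(round(default_delta, 3))
```

If the turnover model is refit, rerunning this with the new intercept gives the value to place in `conflation.change_detection.default_delta`.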
scripts/conflation/apply_change_detection.py

Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
1+
#!/usr/bin/env python
2+
"""
3+
Apply the change-detection penalty to a baseline conflated dataset.
4+
5+
Reads:
6+
- the baseline ``conflated.parquet`` (no change detection)
7+
- ``ghosts.parquet`` (from ``scripts/conflation/build_ghosts.py``)
8+
- ``fitted_params.csv`` for the active ``model_output`` version
9+
10+
Writes a new conflated parquet (suffix ``_cd`` by default) whose
11+
unmatched-Overture rows have had ``conf_mean`` re-weighted by
12+
``δ_group`` for any spatial+name+taxonomy match against a ghost. Audit
13+
columns are appended (``shadow_*`` + ``original_conf_mean``) so the
14+
demoted rows can be inspected by hand.
15+
16+
Usage:
17+
python scripts/conflation/apply_change_detection.py \
18+
--baseline-suffix=baseline --output-suffix=cd [--test]
19+
20+
Both ``--baseline-suffix`` and ``--output-suffix`` are inserted into
21+
the conflated filename before ``.parquet`` (e.g. ``conflated_cd.parquet``).
22+
"""
23+
from __future__ import annotations
24+
25+
import argparse
26+
import time
27+
from pathlib import Path
28+
29+
from config_versioned import Config
30+
31+
from openpois.conflation.change_detection import apply_shadow_match
32+
33+
34+
def _suffixed_path(base_path: Path, suffix: str | None) -> Path:
35+
"""Insert ``suffix`` before the parquet extension."""
36+
if not suffix:
37+
return base_path
38+
return base_path.with_name(
39+
f"{base_path.stem}_{suffix}{base_path.suffix}"
40+
)
41+
42+
43+
def main() -> None:
44+
parser = argparse.ArgumentParser(
45+
description = (
46+
"Apply change-detection penalty to a baseline conflated "
47+
"dataset using OSM-history-derived ghost POIs."
48+
)
49+
)
50+
parser.add_argument(
51+
"--baseline-suffix",
52+
default = "baseline",
53+
help = (
54+
"Suffix inserted into the input parquet filename "
55+
"(default: 'baseline' → conflated_baseline.parquet). "
56+
"Pass an empty string to read conflated.parquet directly."
57+
),
58+
)
59+
parser.add_argument(
60+
"--output-suffix",
61+
default = "cd",
62+
help = (
63+
"Suffix inserted into the output parquet filename "
64+
"(default: 'cd' → conflated_cd.parquet)."
65+
),
66+
)
67+
parser.add_argument(
68+
"--test",
69+
action = "store_true",
70+
help = (
71+
"Restrict ghosts to the configured conflation.test_bbox. "
72+
"Use when the baseline was produced with --test."
73+
),
74+
)
75+
parser.add_argument(
76+
"--min-prior-name-score",
77+
type = float,
78+
default = None,
79+
help = (
80+
"Override config's min_prior_name_match_score. Higher "
81+
"values require a stricter Overture-name vs ghost-prior-"
82+
"name token_set_ratio match before a penalty fires. "
83+
"The config default of 0 disables this gate; a higher "
84+
"value implements decision rule A (name-match required)."
85+
),
86+
)
87+
parser.add_argument(
88+
"--no-survivor-filter",
89+
action = "store_true",
90+
help = (
91+
"Disable the current-OSM-survivor post-filter for this "
92+
"run. Used for ablation against the vetted set."
93+
),
94+
)
95+
args = parser.parse_args()
96+
97+
config = Config("~/repos/openpois/config.yaml")
98+
99+
conflated_base = config.get_file_path("conflation", "conflated")
100+
baseline_path = _suffixed_path(
101+
conflated_base, args.baseline_suffix,
102+
)
103+
output_path = _suffixed_path(
104+
conflated_base, args.output_suffix,
105+
)
106+
ghosts_path = config.get_file_path("ghost_osm", "ghosts")
107+
108+
model_dir = Path(config.get_dir_path("model_output"))
109+
fitted_params_path = model_dir / config.get(
110+
"directories", "model_output", "files", "fitted_params",
111+
)
112+
113+
cd_cfg = config.get("conflation", "change_detection")
114+
min_match_score = float(cd_cfg["min_shadow_match_score"])
115+
default_delta = float(cd_cfg["default_delta"])
116+
min_prior_name_match_score = float(
117+
cd_cfg.get("min_prior_name_match_score", 0)
118+
)
119+
if args.min_prior_name_score is not None:
120+
min_prior_name_match_score = float(args.min_prior_name_score)
121+
122+
survivor_filter = cd_cfg.get("suppress_if_current_survivor") or {}
123+
if args.no_survivor_filter:
124+
survivor_filter = dict(survivor_filter)
125+
survivor_filter["enabled"] = False
126+
print("Current-OSM-survivor filter disabled for this run.")
127+
128+
max_radius_m = float(config.get("conflation", "max_radius_m"))
129+
default_radius_m = float(
130+
config.get("conflation", "default_radius_m")
131+
)
132+
distance_weight = float(config.get("conflation", "distance_weight"))
133+
name_weight = float(config.get("conflation", "name_weight"))
134+
type_weight = float(config.get("conflation", "type_weight"))
135+
identifier_weight = float(
136+
config.get("conflation", "identifier_weight")
137+
)
138+
139+
# R1 needs the rated snapshot; no other auxiliary inputs are
140+
# needed by the simplified pipeline.
141+
rated_snapshot_path = config.get_file_path(
142+
"snapshot_osm", "rated_snapshot",
143+
)
144+
145+
test_bbox = (
146+
config.get("conflation", "test_bbox") if args.test else None
147+
)
148+
149+
print(f"Baseline: {baseline_path}")
150+
print(f"Ghosts: {ghosts_path}")
151+
print(f"Fitted params: {fitted_params_path}")
152+
print(f"Output: {output_path}")
153+
print(f"Rated snapshot (survivor filter): {rated_snapshot_path}")
154+
print(
155+
f"min_match_score={min_match_score} "
156+
f"max_radius_m={max_radius_m} "
157+
f"default_delta={default_delta} "
158+
f"min_prior_name_match_score={min_prior_name_match_score}"
159+
)
160+
if args.test:
161+
print(f"Test bbox: {test_bbox}")
162+
163+
t0 = time.time()
164+
summary = apply_shadow_match(
165+
conflated_path = baseline_path,
166+
ghosts_path = ghosts_path,
167+
fitted_params_path = fitted_params_path,
168+
output_path = output_path,
169+
min_match_score = min_match_score,
170+
max_radius_m = max_radius_m,
171+
default_radius_m = default_radius_m,
172+
distance_weight = distance_weight,
173+
name_weight = name_weight,
174+
type_weight = type_weight,
175+
identifier_weight = identifier_weight,
176+
default_delta = default_delta,
177+
test_bbox = test_bbox,
178+
rated_snapshot_path = rated_snapshot_path,
179+
survivor_filter = survivor_filter,
180+
min_prior_name_match_score = min_prior_name_match_score,
181+
)
182+
elapsed = time.time() - t0
183+
184+
print(f"\nApplied change-detection in {elapsed:.0f}s")
185+
print(f" Total conflated rows: {summary['n_total']:,}")
186+
print(
187+
f" Unmatched Overture rows: "
188+
f"{summary['n_unmatched_overture']:,}"
189+
)
190+
print(f" Ghosts considered: {summary['n_ghosts']:,}")
191+
print(
192+
f" Shadow matches (final): "
193+
f"{summary['n_shadow_matches']:,}"
194+
)
195+
print(
196+
f" Dropped by survivor filter: "
197+
f"{summary['n_survivor_dropped']}"
198+
)
199+
print(
200+
f" Mean penalty factor (Δ/old): "
201+
f"{summary['mean_penalty_factor']:.4f}"
202+
)
203+
print(f" Output: {output_path}")
204+
205+
206+
if __name__ == "__main__":
207+
main()
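
The suffix handling described in the docstring (a suffix is inserted before `.parquet`, and an empty suffix reads the unsuffixed file) can be tried in isolation. The sketch below re-declares the helper under a public name so it runs standalone; the example directory is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def suffixed_path(base_path: Path, suffix: str | None) -> Path:
    """Insert ``suffix`` before the file extension; return the path as-is if empty."""
    if not suffix:
        return base_path
    return base_path.with_name(f"{base_path.stem}_{suffix}{base_path.suffix}")


# Hypothetical location, for illustration only.
base = Path("/tmp/openpois/conflation/conflated.parquet")

baseline = suffixed_path(base, "baseline")  # .../conflated_baseline.parquet
output = suffixed_path(base, "cd")          # .../conflated_cd.parquet
unsuffixed = suffixed_path(base, "")        # .../conflated.parquet
print(baseline.name, output.name, unsuffixed.name)
```

Because an empty string is falsy, `--baseline-suffix=""` and an omitted suffix both resolve to the same unsuffixed file.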

scripts/conflation/build_ghosts.py

Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
#!/usr/bin/env python
2+
"""
3+
Build the ghost-OSM POI dataset from OSM history.
4+
5+
A ghost is a previous state of an OSM node that we believe no longer
6+
reflects ground truth (primary tag deleted, lifecycle prefix added, or
7+
substantial rename). The output Parquet feeds the change-detection
8+
pass in ``scripts/conflation/conflate.py``.
9+
10+
Config keys used (config.yaml):
11+
versions.osm_data, versions.ghost_osm — pinned together
12+
directories.osm_data.osm_versions
13+
directories.osm_data.osm_changes
14+
directories.ghost_osm.ghosts
15+
download.osm.filter_keys — POI tag keys
16+
conflation.change_detection.name_change_similarity_threshold
17+
18+
Usage:
19+
python scripts/conflation/build_ghosts.py
20+
"""
21+
from __future__ import annotations
22+
23+
import time
24+
25+
from config_versioned import Config
26+
27+
from openpois.conflation.ghost_osm import build_ghosts
28+
29+
30+
def main() -> None:
31+
config = Config("~/repos/openpois/config.yaml")
32+
33+
versions_path = config.get_file_path("osm_data", "osm_versions")
34+
changes_path = config.get_file_path("osm_data", "osm_changes")
35+
output_path = config.get_file_path("ghost_osm", "ghosts")
36+
37+
filter_keys = config.get("download", "osm", "filter_keys")
38+
name_threshold = float(
39+
config.get(
40+
"conflation", "change_detection",
41+
"name_change_similarity_threshold",
42+
)
43+
)
44+
45+
print(f"Versions path: {versions_path}")
46+
print(f"Changes path: {changes_path}")
47+
print(f"Output path: {output_path}")
48+
print(f"POI keys: {filter_keys}")
49+
print(f"Name similarity threshold: {name_threshold}")
50+
51+
t0 = time.time()
52+
ghosts = build_ghosts(
53+
versions_path = versions_path,
54+
changes_path = changes_path,
55+
poi_keys = filter_keys,
56+
name_change_similarity_threshold = name_threshold,
57+
)
58+
elapsed = time.time() - t0
59+
print(f"\nBuilt {len(ghosts):,} ghosts in {elapsed:.0f}s")
60+
61+
if len(ghosts):
62+
event_counts = (
63+
ghosts["event_type"].value_counts().to_dict()
64+
)
65+
print("Event-type breakdown:")
66+
for et, n in sorted(event_counts.items(), key = lambda kv: -kv[1]):
67+
print(f" {et}: {n:,}")
68+
69+
sl_total = int((ghosts["shared_label"] != "").sum())
70+
print(
71+
f"shared_label assigned: {sl_total:,}/{len(ghosts):,} "
72+
f"({100 * sl_total / max(len(ghosts), 1):.1f}%)"
73+
)
74+
75+
output_path.parent.mkdir(parents = True, exist_ok = True)
76+
ghosts.to_parquet(output_path, compression = "zstd")
77+
print(f"\nWrote {output_path}")
78+
config.write_self("ghost_osm")
79+
80+
81+
if __name__ == "__main__":
82+
main()
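
The three ghost triggers named in the docstring (primary tag deleted, lifecycle prefix added, substantial rename) can be illustrated on a pair of tag dictionaries. This is a rough sketch and not the logic in `openpois.conflation.ghost_osm`: the function name, the event-type strings, and the prefix list are hypothetical, and `difflib.SequenceMatcher` stands in for `rapidfuzz.fuzz.token_set_ratio`, so the scores will differ from the real pipeline.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Illustrative subset of OSM lifecycle prefixes (assumption, not the repo's list).
LIFECYCLE_PREFIXES = ("disused:", "abandoned:", "demolished:", "removed:", "was:")


def classify_ghost_event(
    prev_tags: dict[str, str],
    curr_tags: dict[str, str],
    poi_keys: list[str],
    name_threshold: float = 50.0,
) -> str | None:
    """Return a hypothetical event label, or None if no ghost event is detected."""
    for key in poi_keys:
        if key not in prev_tags:
            continue
        # e.g. amenity=restaurant became disused:amenity=restaurant
        if any(f"{prefix}{key}" in curr_tags for prefix in LIFECYCLE_PREFIXES):
            return "lifecycle_prefix_added"
        if key not in curr_tags:
            return "primary_tag_deleted"
    prev_name, curr_name = prev_tags.get("name"), curr_tags.get("name")
    if prev_name and curr_name:
        # Stand-in similarity on a 0-100 scale; low similarity means a real rename.
        similarity = 100 * SequenceMatcher(
            None, prev_name.lower(), curr_name.lower()
        ).ratio()
        if similarity < name_threshold:
            return "substantial_rename"
    return None


keys = ["amenity", "shop"]
deleted = classify_ghost_event({"amenity": "cafe", "name": "A"}, {"name": "A"}, keys)
prefixed = classify_ghost_event({"amenity": "cafe"}, {"disused:amenity": "cafe"}, keys)
renamed = classify_ghost_event(
    {"shop": "bed", "name": "Sleep Train"},
    {"shop": "bed", "name": "Roosevelt Square"},
    keys,
)
typo_fix = classify_ghost_event(
    {"amenity": "cafe", "name": "Starbucks"},
    {"amenity": "cafe", "name": "Starbuck's"},
    keys,
)
print(deleted, prefixed, renamed, typo_fix)
```

The last case shows why the threshold exists: a punctuation fix keeps the similarity high and emits no event, whereas a genuinely different name falls below the threshold.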