Add a "current OSM survivor" sanity check to changed points.
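
The decision rule behind the survivor check can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in this commit: `haversine_m`, `name_similarity` and `has_current_survivor` are hypothetical names, and `name_similarity` is a `difflib`-based stand-in for the rapidfuzz `token_set_ratio` that the real filter uses, so its scores will differ from rapidfuzz's. The 50 m radius and threshold of 70 mirror the config defaults added below.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Stand-in for token_set_ratio (assumption): a 0-100 score on
    the sorted, de-duplicated, lowercased tokens of each name."""
    ta = " ".join(sorted(set(a.lower().split())))
    tb = " ".join(sorted(set(b.lower().split())))
    if not ta or not tb:
        return 0.0
    return 100.0 * SequenceMatcher(None, ta, tb).ratio()


def has_current_survivor(
    ov_name, ov_lon, ov_lat, current_osm,
    radius_m = 50.0, threshold = 70.0,
):
    """True if any current OSM POI is within radius_m of the Overture
    POI and its name scores >= threshold against the Overture name.

    current_osm is an iterable of (name, lon, lat) tuples.
    """
    for osm_name, lon, lat in current_osm:
        if haversine_m(ov_lon, ov_lat, lon, lat) > radius_m:
            continue
        if name_similarity(ov_name, osm_name) >= threshold:
            return True
    return False
```

A shadow match whose Overture POI has such a survivor would be dropped, so no penalty is applied: the POI is still in OSM and the primary matcher simply missed it.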

njhenry · njhenry · commit 060509419210 · 2026-05-19T16:40:37.000-07:00
diff --git a/.claude/TODO.md b/.claude/TODO.md
@@ -6,6 +6,7 @@ Short running list of in-progress / upcoming work. Edit freely; trim older compl
 
 ## Upcoming
 
+- [ ] **Per-region calibration knob for the change-detection penalty.** Added 2026-05-19. Today `conflation.change_detection.default_delta` is a single global scalar (with per-`shared_label` overrides from the fitted turnover model). The model was fit on national OSM-history data, so the per-group δ values are a national average of OSM editor reliability. That assumption breaks in regions where OSM is sparse or stale — e.g., a "Restaurant deletion" in a rural county where OSM has low edit traffic may not be an actual closure, just an unmaintained entry. We should add a release-valve: allow `default_delta` (and ideally the per-`shared_label` deltas) to be overridden per state (or per Census place/county). Cleanest landing spot is a new optional CSV at `directories.model_output.regional_overrides` keyed by `(state_fips, shared_label) → delta_override`, and `change_detection.load_delta_lookup` would merge it in after the national values. Until we have a vetted set from a non-Seattle region we don't have data to calibrate this, but the hook should be in place. Tracking against the asymmetric-blindness problem documented in the May 2026 plan at `~/.claude/plans/our-current-deduplication-strategy-wild-graham.md`.
 - [ ] **Auto-capture the three per-version README fields** so the publish step doesn't need `publish.version_metadata` overrides. Added 2026-04-24. Today `build_version_readme` in [src/openpois/publish/build_readme.py](../src/openpois/publish/build_readme.py) falls back to config overrides or best-effort guesses; aim is for the pipeline to write authoritative values alongside the data it produces, and the publish step to just read them.
     - *OSM snapshot date* — `scripts/osm_snapshot/download.py` should write a `~/data/openpois/snapshots/osm/<version>/download_metadata.json` containing `{"downloaded_at": "<ISO date>", "pbf_url": "..."}` after the PBF download completes. `_resolve_osm_snapshot_date` then reads that file before falling back to the version string.
     - *Overture release* — `scripts/overture/download.py` already resolves a concrete release (pinned or auto-detected) inside `download_overture_snapshot`; currently only the `.parts/<release>/` directory records it and `.parts/` is deleted on success. Surface the resolved release by writing `~/data/openpois/snapshots/overture/<version>/download_metadata.json` with `{"release": "2026-04-15.0", ...}` before the cleanup step. `_resolve_overture_release` reads that file ahead of the `.parts/` heuristic.
diff --git a/config.yaml b/config.yaml
@@ -248,6 +248,31 @@ conflation:
     # model's per-group params. Equals sigmoid(logit_delta_0) for the
     # current 20260422_by_shared_label fit (logit_delta_0 = -2.72).
     default_delta: 0.062
+    # Hard gate on Overture-name vs ghost-prior-name token_set_ratio
+    # (0-100), applied *before* the composite-score-based shadow
+    # matcher. The default 0 keeps the loose matcher: any spatial +
+    # type + composite match above ``min_shadow_match_score`` will
+    # fire, even when Overture's name doesn't lexically match the
+    # OSM ghost. This is intentional. With a higher value the penalty
+    # fires only when Overture still shows the *same name* OSM closed,
+    # which we explored in May 2026 (decision rule A) and rejected
+    # because it loses the bulk of real closures where Overture has
+    # already updated to a different current name at a churned
+    # address (Sleep Train → Roosevelt Square etc.). Knob retained
+    # for future data-quality-only modes; leave at 0 for production
+    # change detection.
+    min_prior_name_match_score: 0
+    suppress_if_current_survivor:
+      # Belt-and-suspenders post-filter: drop the penalty if a
+      # *current* OSM POI within radius_m has name token_set_ratio
+      # >= threshold against the Overture name. Catches cases where
+      # the POI is still in OSM under different geometry (e.g.,
+      # node remapped to a building way) and the primary matcher
+      # missed it. Kept enabled because it's cheap and orthogonal
+      # to min_prior_name_match_score.
+      enabled: true
+      radius_m: 50
+      name_similarity_threshold: 70
 
 # Settings for publishing snapshots to Source Cooperative
 # (https://source.coop/henryspatialanalysis/openpois). Source Coop is
diff --git a/scripts/conflation/apply_change_detection.py b/scripts/conflation/apply_change_detection.py
@@ -72,6 +72,26 @@ def main() -> None:
             "Use when the baseline was produced with --test."
         ),
     )
+    parser.add_argument(
+        "--min-prior-name-score",
+        type = float,
+        default = None,
+        help = (
+            "Override config's min_prior_name_match_score. Higher "
+            "values require a stricter Overture-name vs ghost-prior-"
+            "name token_set_ratio match before a penalty fires. "
+            "The config default of 0 disables the gate; values > 0 "
+            "implement decision rule A (name-match required)."
+        ),
+    )
+    parser.add_argument(
+        "--no-survivor-filter",
+        action = "store_true",
+        help = (
+            "Disable the current-OSM-survivor post-filter for this "
+            "run. Used for ablation against the vetted set."
+        ),
+    )
     args = parser.parse_args()
 
     config = Config("~/repos/openpois/config.yaml")
@@ -93,6 +113,17 @@ def main() -> None:
     cd_cfg = config.get("conflation", "change_detection")
     min_match_score = float(cd_cfg["min_shadow_match_score"])
     default_delta = float(cd_cfg["default_delta"])
+    min_prior_name_match_score = float(
+        cd_cfg.get("min_prior_name_match_score", 0)
+    )
+    if args.min_prior_name_score is not None:
+        min_prior_name_match_score = float(args.min_prior_name_score)
+
+    survivor_filter = cd_cfg.get("suppress_if_current_survivor") or {}
+    if args.no_survivor_filter:
+        survivor_filter = dict(survivor_filter)
+        survivor_filter["enabled"] = False
+        print("Current-OSM-survivor filter disabled for this run.")
 
     max_radius_m = float(config.get("conflation", "max_radius_m"))
     default_radius_m = float(
@@ -105,6 +136,12 @@ def main() -> None:
         config.get("conflation", "identifier_weight")
     )
 
+    # The current-OSM-survivor filter (R1) reads the rated snapshot;
+    # the simplified pipeline needs no other auxiliary inputs.
+    rated_snapshot_path = config.get_file_path(
+        "snapshot_osm", "rated_snapshot",
+    )
+
     test_bbox = (
         config.get("conflation", "test_bbox") if args.test else None
     )
@@ -113,10 +150,12 @@ def main() -> None:
     print(f"Ghosts:   {ghosts_path}")
     print(f"Fitted params: {fitted_params_path}")
     print(f"Output:   {output_path}")
+    print(f"Rated snapshot (survivor filter): {rated_snapshot_path}")
     print(
         f"min_match_score={min_match_score} "
         f"max_radius_m={max_radius_m} "
-        f"default_delta={default_delta}"
+        f"default_delta={default_delta} "
+        f"min_prior_name_match_score={min_prior_name_match_score}"
     )
     if args.test:
         print(f"Test bbox: {test_bbox}")
@@ -136,6 +175,9 @@ def main() -> None:
         identifier_weight = identifier_weight,
         default_delta = default_delta,
         test_bbox = test_bbox,
+        rated_snapshot_path = rated_snapshot_path,
+        survivor_filter = survivor_filter,
+        min_prior_name_match_score = min_prior_name_match_score,
     )
     elapsed = time.time() - t0
 
@@ -147,9 +189,13 @@ def main() -> None:
     )
     print(f"  Ghosts considered:          {summary['n_ghosts']:,}")
     print(
-        f"  Shadow matches:             "
+        f"  Shadow matches (final):     "
         f"{summary['n_shadow_matches']:,}"
     )
+    print(
+        f"  Dropped by survivor filter: "
+        f"{summary['n_survivor_dropped']:,}"
+    )
     print(
         f"  Mean penalty factor (Δ/old): "
         f"{summary['mean_penalty_factor']:.4f}"
diff --git a/src/openpois/conflation/change_detection.py b/src/openpois/conflation/change_detection.py
@@ -24,10 +24,12 @@
 import gc
 from pathlib import Path
 
+import duckdb
 import geopandas as gpd
 import numpy as np
 import pandas as pd
 import pyarrow.parquet as pq
+from rapidfuzz import fuzz
 
 from openpois.conflation.ghost_osm import _is_token_subset_or_superset
 from openpois.conflation.match import (
@@ -86,6 +88,7 @@ def find_shadow_matches(
     name_weight: float,
     type_weight: float,
     identifier_weight: float,
+    min_prior_name_match_score: float = 0.0,
 ) -> pd.DataFrame:
     """Run a single-pass match between Overture rows and ghost rows.
 
@@ -98,6 +101,14 @@ def find_shadow_matches(
     bit arrays) so type_score is binary on exact ``shared_label``
     equality — the change-detection penalty is conservative and
     should only fire when taxonomy genuinely matches.
+
+    ``min_prior_name_match_score`` is an additional hard gate on the
+    Overture-name vs ghost-prior-name token_set_ratio (0–100). When
+    > 0, candidate pairs below that threshold, or missing either
+    name, are dropped *before* the composite-score selection runs.
+    Subset/superset pairs pass regardless. A high floor (e.g. 70)
+    requires a strong direct name match, trading most of the recall
+    for much higher precision. Default 0 disables the gate.
     """
     if len(unmatched_overture) == 0 or len(ghosts) == 0:
         return pd.DataFrame(
@@ -143,6 +154,38 @@ def find_shadow_matches(
 
     ov_labels = _to_str_array(unmatched_overture["shared_label"])
 
+    # Optional pre-gate: drop candidate pairs whose Overture-name vs
+    # ghost-prior-name token_set_ratio is below the configured floor.
+    # Subset/superset pairs pass regardless (a short subset like
+    # "CVS" vs "CVS Pharmacy" can dip below threshold on token-set
+    # ratio but is obviously the same business). This is the "tighten
+    # matcher" alternative — when set high (e.g. 70) it trades most
+    # recall for high precision and removes the need for downstream
+    # suppression rules.
+    if min_prior_name_match_score > 0 and not candidates.empty:
+        cand_osm_idx = candidates["osm_idx"].to_numpy()
+        cand_ov_idx = candidates["overture_idx"].to_numpy()
+        keep = np.zeros(len(candidates), dtype = bool)
+        for i in range(len(candidates)):
+            gname = ghost_names[cand_osm_idx[i]]
+            oname = ov_names[cand_ov_idx[i]]
+            if not gname or not oname:
+                continue
+            if _is_token_subset_or_superset(gname, oname):
+                keep[i] = True
+                continue
+            sim = fuzz.token_set_ratio(gname, oname)
+            if sim >= min_prior_name_match_score:
+                keep[i] = True
+        candidates = candidates.loc[keep].reset_index(drop = True)
+        if candidates.empty:
+            return pd.DataFrame(
+                columns = [
+                    "osm_idx", "overture_idx",
+                    "composite_score", "distance_m",
+                ]
+            )
+
     # All-zero L0 bits → only exact shared_label match scores 1.0
     # (broad-group bitmask overlap collapses to 0 because all bits
     # are 0). Keeps the secondary pass conservative.
@@ -197,6 +240,117 @@ def find_shadow_matches(
     ].reset_index(drop = True)
 
 
+def apply_current_survivor_filter(
+    matches: pd.DataFrame,
+    unmatched_overture: gpd.GeoDataFrame,
+    *,
+    rated_snapshot_path: Path,
+    radius_m: float,
+    name_similarity_threshold: float,
+    test_bbox: dict | None = None,
+    duckdb_memory_limit: str = "6GB",
+    verbose: bool = True,
+) -> tuple[pd.DataFrame, int]:
+    """Drop shadow matches where the POI is still present in the live
+    OSM snapshot under a different geometry / spelling.
+
+    For each match, spatial-joins the Overture POI's centroid against
+    the rated OSM snapshot for any feature within ``radius_m``. If any
+    such feature's ``name`` token_set_ratio against the Overture name
+    is ≥ ``name_similarity_threshold``, the match is dropped — the POI
+    isn't gone, the primary matcher just missed it.
+
+    Returns ``(kept_matches, n_dropped)``. The DuckDB spatial join is
+    bounded by ``test_bbox`` when given so the Seattle A/B path stays
+    fast.
+    """
+    if matches.empty:
+        return matches.copy(), 0
+
+    ov_idx_arr = matches["overture_idx"].to_numpy().astype(int)
+
+    if verbose:
+        print(
+            f"  R1 (current-OSM-survivor): radius="
+            f"{radius_m}m, name>={name_similarity_threshold}"
+        )
+
+    ov_lons = unmatched_overture.geometry.x.to_numpy()
+    ov_lats = unmatched_overture.geometry.y.to_numpy()
+    bbox = test_bbox or {
+        "xmin": float(np.min(ov_lons[ov_idx_arr])) - 0.01,
+        "ymin": float(np.min(ov_lats[ov_idx_arr])) - 0.01,
+        "xmax": float(np.max(ov_lons[ov_idx_arr])) + 0.01,
+        "ymax": float(np.max(ov_lats[ov_idx_arr])) + 0.01,
+    }
+
+    con = duckdb.connect()
+    con.execute(f"SET memory_limit = '{duckdb_memory_limit}'")
+    con.execute("INSTALL spatial; LOAD spatial;")
+    ov_subset = pd.DataFrame({
+        "match_idx": np.arange(len(matches)),
+        "ov_name": _to_str_array(
+            unmatched_overture["name"]
+        )[ov_idx_arr],
+        "ov_lon": ov_lons[ov_idx_arr],
+        "ov_lat": ov_lats[ov_idx_arr],
+    })
+    ov_subset.to_parquet("/tmp/cd_r1_ov.parquet")
+
+    nearby = con.execute(f"""
+        SELECT ov.match_idx, ov.ov_name,
+               s.name AS osm_name,
+               ST_Distance_Sphere(
+                   ST_Point(ov.ov_lon, ov.ov_lat),
+                   ST_Centroid(s.geometry)
+               ) AS dist_m
+        FROM read_parquet('/tmp/cd_r1_ov.parquet') ov
+        JOIN read_parquet('{rated_snapshot_path}') s
+          ON ST_Distance_Sphere(
+                 ST_Point(ov.ov_lon, ov.ov_lat),
+                 ST_Centroid(s.geometry)
+             ) <= {radius_m}
+         AND ST_X(ST_Centroid(s.geometry))
+             BETWEEN {bbox['xmin']} AND {bbox['xmax']}
+         AND ST_Y(ST_Centroid(s.geometry))
+             BETWEEN {bbox['ymin']} AND {bbox['ymax']}
+    """).fetch_df()
+    con.close()
+
+    if verbose:
+        print(
+            f"    {len(nearby):,} nearby-OSM candidate rows; "
+            f"computing token_set_ratio ..."
+        )
+
+    if not len(nearby):
+        return matches.copy(), 0
+
+    nearby["sim"] = [
+        fuzz.token_set_ratio(str(a), str(b))
+        for a, b in zip(
+            nearby["ov_name"].astype(str),
+            nearby["osm_name"].astype(str),
+        )
+    ]
+    suppress_idx = (
+        nearby[nearby["sim"] >= name_similarity_threshold]
+        ["match_idx"]
+        .unique()
+    )
+
+    if verbose:
+        print(f"    R1 suppressed: {len(suppress_idx)}")
+
+    if not len(suppress_idx):
+        return matches.copy(), 0
+
+    keep_mask = np.ones(len(matches), dtype = bool)
+    keep_mask[suppress_idx] = False
+    kept = matches.loc[keep_mask].reset_index(drop = True)
+    return kept, int(len(suppress_idx))
+
+
 def apply_shadow_match(
     conflated_path: Path,
     ghosts_path: Path,
@@ -212,6 +366,9 @@ def apply_shadow_match(
     identifier_weight: float,
     default_delta: float,
     test_bbox: dict | None = None,
+    rated_snapshot_path: Path | None = None,
+    survivor_filter: dict | None = None,
+    min_prior_name_match_score: float = 0.0,
     verbose: bool = True,
 ) -> dict:
     """Post-process a conflated dataset with the change-detection penalty.
@@ -316,9 +473,40 @@ def apply_shadow_match(
             name_weight = name_weight,
             type_weight = type_weight,
             identifier_weight = identifier_weight,
+            min_prior_name_match_score = min_prior_name_match_score,
+        )
+        if verbose:
+            print(
+                f"  Shadow matches (pre-survivor-filter): "
+                f"{len(matches):,}"
+            )
+
+    # -- Current-OSM-survivor filter ----------------------------------
+    n_survivor_dropped = 0
+    if (
+        survivor_filter
+        and bool(survivor_filter.get("enabled", False))
+        and rated_snapshot_path is not None
+        and len(matches) > 0
+    ):
+        if verbose:
+            print("Applying current-OSM-survivor filter ...")
+        matches, n_survivor_dropped = apply_current_survivor_filter(
+            matches = matches,
+            unmatched_overture = unmatched_ov,
+            rated_snapshot_path = rated_snapshot_path,
+            radius_m = float(survivor_filter.get("radius_m", 50)),
+            name_similarity_threshold = float(
+                survivor_filter.get("name_similarity_threshold", 70)
+            ),
+            test_bbox = test_bbox,
+            verbose = verbose,
         )
         if verbose:
-            print(f"  Shadow matches: {len(matches):,}")
+            print(
+                f"  Shadow matches (post-survivor-filter): "
+                f"{len(matches):,} (dropped {n_survivor_dropped})"
+            )
 
     # -- Build audit columns -------------------------------------------
     n = len(conflated)
@@ -409,6 +597,7 @@ def apply_shadow_match(
         "n_unmatched_overture": int(len(ov_global_idx)),
         "n_ghosts": int(len(ghosts)),
         "n_shadow_matches": int(len(matches)),
+        "n_survivor_dropped": int(n_survivor_dropped),
         "mean_penalty_factor": (
             float(
                 (new_conf_mean[shadow_matched]
diff --git a/vetting_viz/seattle_evaluation.py b/vetting_viz/seattle_evaluation.py