| 1 | +# Change detection |
| 2 | + |
| 3 | +OSM edit history is used to downweight Overture POIs whose location has seen a |
| 4 | +recent closure / rename / lifecycle event in OSM. This is a post-processing |
| 5 | +pass on the conflated dataset; the no-CD baseline is preserved as |
| 6 | +``conflated_baseline.parquet`` and the CD-applied result becomes the canonical |
| 7 | +``conflated.parquet`` that downstream partition / PMTiles / publish steps |
| 8 | +consume. |
| 9 | + |
| 10 | +## Pipeline |
| 11 | + |
| 12 | +Four stages, each separately runnable and individually inspectable: |
| 13 | + |
| 14 | +```text |
| 15 | + OSM history parquets |
| 16 | + (osm_versions, osm_changes) |
| 17 | + │ |
| 18 | + ▼ |
| 19 | + 1. build_ghosts.py |
| 20 | + for each (element, version) emit |
| 21 | + at most one of: |
| 22 | + • hard_delete |
| 23 | + • lifecycle_prefix_added |
| 24 | + • primary_tag_deleted |
| 25 | + • substantial_rename |
| 26 | + │ |
| 27 | + ▼ ghosts.parquet |
| 28 | +─── 2. conflate.py (--output-suffix=baseline) ────────────────────────── |
| 29 | + rated_osm ──► match ──► conflated_baseline.parquet |
| 30 | + overture ─────┘ (unchanged from the no-CD pipeline) |
| 31 | + │ |
| 32 | + ▼ |
| 33 | +─── 3. apply_change_detection.py ────────────────────────────────────── |
| 34 | + conflated_baseline.parquet |
| 35 | + │ |
| 36 | + shadow-matcher |
| 37 | + (reuses match.py scoring) |
| 38 | + │ |
| 39 | + │ matches |
| 40 | + ▼ |
| 41 | + R1: current-OSM-survivor filter |
| 42 | + (drop if a live OSM POI with the |
| 43 | + same name lives within 50 m) |
| 44 | + │ |
| 45 | + ▼ |
| 46 | + new_conf = old_conf × δ_group |
| 47 | + audit columns appended |
| 48 | + │ |
| 49 | + ▼ conflated.parquet (canonical) |
| 50 | +─── 4. downstream (unchanged) ───────────────────────────────────────── |
| 51 | + summarize.py · format_for_upload.py · prepare_pmtiles.py · publish |
| 52 | +``` |
| 53 | + |
| 54 | +## How to run it |
| 55 | + |
| 56 | +The Makefile target wires the three new sub-steps together and is the |
| 57 | +canonical entry point for national runs: |
| 58 | + |
| 59 | +```bash |
| 60 | +conda activate openpois |
| 61 | + |
| 62 | +make conflate # full CONUS |
| 63 | +make conflate TEST=1 # Seattle bbox dry run |
| 64 | +``` |
| 65 | + |
| 66 | +Each sub-step writes its own log under `~/data/openpois/logs/`. Sub-targets |
| 67 | +exist for partial re-runs: |
| 68 | + |
| 69 | +| Target | When to use | |
| 70 | +|---|---| |
| 71 | +| `make build_ghosts` | Re-derive ghosts after bumping `versions.osm_data`. | |
| 72 | +| `make conflate_baseline` | Re-run matching only; reuses the existing ghost build. | |
| 73 | +| `make apply_cd` | Re-apply the CD penalty (e.g., after tuning the δ source or `min_prior_name_match_score`). | |
| 74 | +| `make conflate` | End-to-end. Runs the three steps above in order. | |
| 75 | + |
| 76 | +For one-off A/B experiments outside the make flow, see |
| 77 | +[scripts/conflation/apply_change_detection.py](../scripts/conflation/apply_change_detection.py) |
| 78 | +— it accepts `--baseline-suffix`, `--output-suffix`, `--no-survivor-filter`, |
| 79 | +and `--min-prior-name-score` for ablation. |
| 80 | + |
| 81 | +## Stage details |
| 82 | + |
| 83 | +### 1. Ghost extraction |
| 84 | + |
| 85 | +[src/openpois/conflation/ghost_osm.py](../src/openpois/conflation/ghost_osm.py) |
| 86 | +walks `osm_changes.parquet` in one flat pass, maintaining a per-element rolling |
| 87 | +tag dictionary. For each `(element, version)` it emits at most one *ghost*, |
| 88 | +taking the first event type that matches in this priority order: |
| 89 | + |
| 90 | +1. `hard_delete` — `visible` flipped `true → false`. Fires regardless of name. |
| 91 | +2. `lifecycle_prefix_added` — a `disused:` / `was:` / `demolished:` / |
| 92 | + `abandoned:` / `removed:` / `razed:` key appeared. Only fires when the |
| 93 | + prior state was un-named (avoids the noise of named lifecycle retagging). |
| 94 | +3. `primary_tag_deleted` — a POI tag key was deleted. Subject to the same no-prior-name gate as `lifecycle_prefix_added`. |
| 95 | +4. `substantial_rename` — `name` changed with `rapidfuzz.token_set_ratio < 50` |
| 96 | + **and** neither name is a token-level subset/superset of the other |
| 97 | + (guards "Walgreens" ↔ "Walgreens Pharmacy"). |
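The priority order above can be sketched as a single classifier function. This is a minimal illustration, not the code in `ghost_osm.py`: the `classify_event` signature, the `POI_KEYS` subset, and the `difflib`-based `name_similarity` (a standard-library stand-in for `rapidfuzz.token_set_ratio`) are all assumptions made for the example.

```python
from difflib import SequenceMatcher

LIFECYCLE_PREFIXES = (
    "disused:", "was:", "demolished:", "abandoned:", "removed:", "razed:",
)
# Hypothetical subset of primary POI keys; the real list lives in the pipeline.
POI_KEYS = {"amenity", "shop", "leisure", "tourism"}


def name_similarity(a: str, b: str) -> float:
    """Stand-in 0-100 similarity (the pipeline uses rapidfuzz.token_set_ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_token_subset(a: str, b: str) -> bool:
    """True when one name's tokens are a subset of the other's."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta


def classify_event(prev_tags, new_tags, prev_visible, new_visible,
                   rename_threshold=50.0):
    """Return the first matching ghost event type, or None."""
    # 1. hard_delete fires regardless of name.
    if prev_visible and not new_visible:
        return "hard_delete"

    prev_name = prev_tags.get("name")
    new_name = new_tags.get("name")

    # 2 and 3 only fire when the prior state was un-named.
    if prev_name is None:
        added = set(new_tags) - set(prev_tags)
        if any(key.startswith(LIFECYCLE_PREFIXES) for key in added):
            return "lifecycle_prefix_added"
        removed = set(prev_tags) - set(new_tags)
        if removed & POI_KEYS:
            return "primary_tag_deleted"

    # 4. substantial_rename: low similarity and no subset/superset relation.
    if prev_name and new_name and prev_name != new_name:
        dissimilar = name_similarity(prev_name, new_name) < rename_threshold
        if dissimilar and not is_token_subset(prev_name, new_name):
            return "substantial_rename"

    return None
```

Because each branch returns as soon as it matches, a version that both deletes the element and changes tags yields only `hard_delete`, which matches the "at most one ghost per version" rule.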
| 98 | + |
| 99 | +The current implementation covers nodes only: ways and relations would require |
| 100 | +geometry reconstruction beyond what the per-version parquets capture. |
| 101 | + |
| 102 | +Critical upstream fix: the OSM history ingestion in |
| 103 | +[src/openpois/io/osm_history_pbf.py](../src/openpois/io/osm_history_pbf.py) |
| 104 | +uses a two-pass filter (`osmium tags-filter` → ID list → `osmium getid |
| 105 | +--with-history`). A single `tags-filter` pass silently drops every deletion |
| 106 | +version (those rows carry no tags), which makes `hard_delete` impossible to |
| 107 | +observe. The two-pass approach recovers ~600 k node deletions nationwide. |
| 108 | + |
| 109 | +### 2. Baseline conflation |
| 110 | + |
| 111 | +[scripts/conflation/conflate.py](../scripts/conflation/conflate.py) runs the |
| 112 | +existing matcher unchanged. The only addition is `--output-suffix=baseline`, |
| 113 | +which writes `conflated_baseline.parquet` so the no-CD result is preserved |
| 114 | +side-by-side with the CD-applied canonical output. |
| 115 | + |
| 116 | +### 3. Shadow matching + penalty |
| 117 | + |
| 118 | +[src/openpois/conflation/change_detection.py](../src/openpois/conflation/change_detection.py) |
| 119 | +does three things: |
| 120 | + |
| 121 | +1. **Shadow matching** — for each unmatched-Overture row from the baseline, |
| 122 | + find ghosts within the per-`shared_label` radius (BallTree on ghost |
| 123 | + centroids, haversine metric). Reuses the composite scoring from |
| 124 | + [src/openpois/conflation/match.py](../src/openpois/conflation/match.py) |
| 125 | + with `distance_weight=0`, `name_weight=0.5`, `type_weight=0.3`, |
| 126 | + `identifier_weight=0.2`. Type score is binary on exact `shared_label` |
| 127 | + equality. Greedy one-to-one above `min_shadow_match_score = 0.50`. |
| 128 | + Subset/superset name pairs are dropped as obvious same-entity matches. |
| 129 | + |
| 130 | +2. **R1 current-OSM-survivor filter** — for each surviving shadow match, |
| 131 | + spatial-query the live rated snapshot for OSM POIs within 50 m of the |
| 132 | + Overture centroid. If any has `token_set_ratio ≥ 70` against the Overture |
| 133 | +   name, the match is dropped: the POI is presumably still in OSM under a |
| 134 | +   different geometry or spelling, and the primary matcher simply missed it. |
| 135 | +   Implemented via DuckDB centroid extraction + sklearn `BallTree` haversine query; |
| 136 | + nationwide cost ~90 s, ~3-5 GB peak memory. |
| 137 | + |
| 138 | +3. **Penalty** — multiply Overture's `conf_mean` by the fitted δ for the |
| 139 | + ghost's `shared_label`. δ is the per-group `delta` posterior mean from the |
| 140 | + `RandomByTypeModel` fit (read from `fitted_params.csv`); falls back to |
| 141 | + `default_delta` (0.062) for groups absent from the fit. Audit columns are |
| 142 | + appended on penalized rows: `shadow_matched`, `shadow_ghost_id`, |
| 143 | + `shadow_event_type`, `shadow_event_timestamp`, `shadow_score`, |
| 144 | + `shadow_distance_m`, `original_conf_mean`. |
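The scoring and penalty steps above reduce to a small amount of arithmetic. The sketch below assumes the composite score is a plain weighted sum of component scores on a 0-1 scale, which is an assumption about `match.py` rather than something stated here; the function names and the dict-based row are hypothetical, and only two of the audit columns are shown.

```python
# Weights and thresholds quoted in the text; distance_weight is 0, so it is omitted.
WEIGHTS = {"name": 0.5, "type": 0.3, "identifier": 0.2}
MIN_SHADOW_MATCH_SCORE = 0.50
DEFAULT_DELTA = 0.062


def shadow_score(name_sim, overture_label, ghost_label, identifier_sim):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    # Type score is binary on exact shared_label equality.
    type_score = 1.0 if overture_label == ghost_label else 0.0
    return (
        WEIGHTS["name"] * name_sim
        + WEIGHTS["type"] * type_score
        + WEIGHTS["identifier"] * identifier_sim
    )


def apply_penalty(row, ghost_label, fitted_delta, default_delta=DEFAULT_DELTA):
    """Return a copy of row with conf_mean scaled by the group's delta."""
    # Groups absent from the fit fall back to default_delta.
    delta = fitted_delta.get(ghost_label, default_delta)
    out = dict(row)
    out["original_conf_mean"] = row["conf_mean"]  # audit column
    out["conf_mean"] = row["conf_mean"] * delta
    out["shadow_matched"] = True  # audit column
    return out


if __name__ == "__main__":
    score = shadow_score(0.6, "cafe", "cafe", 0.0)
    if score >= MIN_SHADOW_MATCH_SCORE:
        print(apply_penalty({"conf_mean": 0.8}, "cafe", {"cafe": 0.1}))
```

With these weights, a pair with matching `shared_label` and no identifier overlap needs a name score of at least 0.4 to reach the 0.50 threshold.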
| 145 | + |
| 146 | +### 4. Downstream consumption |
| 147 | + |
| 148 | +`summarize.py`, `format_for_upload.py`, `prepare_pmtiles.py`, and |
| 149 | +`publish/upload_to_source_coop.py` all read `conflated.parquet` by config and |
| 150 | +require no changes. They now consume the CD-applied output by default. The |
| 151 | +no-CD archive (`conflated_baseline.parquet`) is left on disk for spot-checks |
| 152 | +and ablation. |
| 153 | + |
| 154 | +## Tunables |
| 155 | + |
| 156 | +Under `conflation.change_detection` in [config.yaml](../config.yaml): |
| 157 | + |
| 158 | +| Knob | Default | Effect | |
| 159 | +|---|---|---| |
| 160 | +| `enabled` | `false` | Reserved; the production gate is the matcher itself, not this flag. | |
| 161 | +| `min_shadow_match_score` | `0.50` | Composite score threshold for the shadow matcher. | |
| 162 | +| `name_change_similarity_threshold` | `50` | Below this `token_set_ratio`, a name change becomes a `substantial_rename` ghost. | |
| 163 | +| `default_delta` | `0.062` | Fallback δ for `shared_label` values absent from the fitted model. Equals `sigmoid(logit_delta_0)` for the current fit. | |
| 164 | +| `min_prior_name_match_score` | `0` | Hard gate on Overture-vs-prior-name `token_set_ratio` before any composite scoring. **Leave at 0** — values ≥ 70 produce high precision but miss real closures where Overture has updated to a different current business name at a churned address. | |
| 165 | +| `suppress_if_current_survivor.enabled` | `true` | R1 filter on/off. | |
| 166 | +| `suppress_if_current_survivor.radius_m` | `50` | R1 search radius (meters). | |
| 167 | +| `suppress_if_current_survivor.name_similarity_threshold` | `70` | R1 token_set_ratio gate. | |
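To make the three `suppress_if_current_survivor` knobs concrete, here is a rough sketch of the R1 check for one Overture row. It is a brute-force loop with a hand-written haversine formula, whereas the pipeline uses a `BallTree` query; `has_current_survivor`, the dict layout, and the `difflib` similarity (standing in for `token_set_ratio`) are assumptions for illustration.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Stand-in 0-100 similarity (the pipeline uses token_set_ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def has_current_survivor(overture, live_pois, radius_m=50.0, name_threshold=70.0):
    """True if a live OSM POI with a similar name lies within radius_m."""
    for poi in live_pois:
        near = haversine_m(
            overture["lat"], overture["lon"], poi["lat"], poi["lon"]
        ) <= radius_m
        if near and name_similarity(overture["name"], poi["name"]) >= name_threshold:
            return True
    return False


if __name__ == "__main__":
    overture = {"name": "Cafe Vita", "lat": 47.6062, "lon": -122.3321}
    live = [{"name": "Caffe Vita", "lat": 47.6064, "lon": -122.3321}]
    # A True result means the shadow match would be dropped.
    print(has_current_survivor(overture, live))
```

Raising `radius_m` or lowering `name_similarity_threshold` makes the filter suppress more shadow matches, so fewer Overture rows end up penalized.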
| 168 | + |
| 169 | +## Validation |
| 170 | + |
| 171 | +Last hand-vetted Seattle A/B (May 2026, 290 reviewed POIs): |
| 172 | + |
| 173 | +| | Baseline | With CD | |
| 174 | +|---|---|---| |
| 175 | +| Demoted Overture rows | 0 | 293 | |
| 176 | +| Vetted true-drops captured | — | 221 | |
| 177 | +| Vetted false-drops still penalized | — | 58 | |
| 178 | +| Precision (vs vetted truth) | — | 79.2 % | |
| 179 | + |
| 180 | +The remaining ~21 % false-discovery rate (58 of the 279 vetted penalized rows) is |
| 181 | +the cost of catching the broad "churn at this address" signal; see the open |
| 182 | +per-region calibration TODO for the planned follow-up. |
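The precision figure in the table follows directly from the two vetted counts. A short check, using only the numbers reported above (the variable names are illustrative):

```python
# Counts from the Seattle A/B table.
true_drops = 221   # vetted true-drops captured
false_drops = 58   # vetted false-drops still penalized

vetted_penalized = true_drops + false_drops
precision = true_drops / vetted_penalized
false_discovery_rate = 1 - precision

print(f"vetted penalized rows: {vetted_penalized}")
print(f"precision: {precision:.1%}")
print(f"false-discovery rate: {false_discovery_rate:.1%}")
```

Note that this is one minus precision (the share of penalized rows that were wrong), which is a different quantity from a false-positive rate computed over all rows that should not have been penalized.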
| 183 | + |
| 184 | +## Known limits |
| 185 | + |
| 186 | +- **Asymmetric blindness.** OSM history captures closures cleanly but is |
| 187 | + silent on new openings. A real closure (e.g., a node tagged "Calvary |
| 188 | + Chapel" was deleted) plus a different current business at the same address |
| 189 | + (Overture shows "Redemption Church") reads as evidence Overture is stale, |
| 190 | + even when Overture is right. This explains ~75 % of the residual false |
| 191 | + positives on the Seattle vetting set and is intrinsic — fixing it requires |
| 192 | + data we don't currently ingest (Overture POI creation timestamps, |
| 193 | + per-region prior calibration, or ground-truth surveys). |
| 194 | + |
| 195 | +- **Single national δ per `shared_label`.** The fitted turnover model is |
| 196 | + national-average, so the penalty magnitude is wrong in regions where OSM |
| 197 | + mapping is sparse or stale. A per-state override mechanism is tracked in |
| 198 | + [.claude/TODO.md](../.claude/TODO.md). |
| 199 | + |
| 200 | +- **DuckDB v1.4.1 has a buggy `ST_Distance_Sphere`.** The bundled spherical |
| 201 | + distance returns values ~25 % too high at continental scale and ~30-50 % |
| 202 | + off at small scales. The pipeline does **not** use `ST_Distance_Sphere` |
| 203 | + anywhere — distance work goes through `sklearn.neighbors.BallTree` with |
| 204 | + the haversine metric — but any new code touching this area should avoid it |
| 205 | + until the DuckDB pin is bumped. Tracked in |
| 206 | + [.claude/TODO.md](../.claude/TODO.md). |
| 207 | + |
| 208 | +## Vetting tool |
| 209 | + |
| 210 | +[vetting_viz/](../vetting_viz/) is a single-page Leaflet app for hand-vetting |
| 211 | +the demoted-POI CSV produced by `diff_change_detection.py`. Run via: |
| 212 | + |
| 213 | +```bash |
| 214 | +conda run -n openpois python -m http.server --directory vetting_viz 8765 |
| 215 | +# → http://localhost:8765/ → Load CSV → seattle_demoted_pois_v6.csv |
| 216 | +``` |
| 217 | + |
| 218 | +Markers are colored by vetting status; clicking a point opens the full row |
| 219 | +plus a radio for tagging. Export to CSV when done; reload to resume the |
| 220 | +session. |
| 221 | + |
| 222 | +## File map |
| 223 | + |
| 224 | +| Path | Role | |
| 225 | +|---|---| |
| 226 | +| [src/openpois/io/osm_history_pbf.py](../src/openpois/io/osm_history_pbf.py) | Two-pass filter to retain deletion versions. | |
| 227 | +| [src/openpois/conflation/ghost_osm.py](../src/openpois/conflation/ghost_osm.py) | `_scan_all_changes`, event-type detection. | |
| 228 | +| [scripts/conflation/build_ghosts.py](../scripts/conflation/build_ghosts.py) | Stage 1 driver. | |
| 229 | +| [src/openpois/conflation/change_detection.py](../src/openpois/conflation/change_detection.py) | Shadow matcher, R1 filter, δ penalty. | |
| 230 | +| [scripts/conflation/apply_change_detection.py](../scripts/conflation/apply_change_detection.py) | Stage 3 driver. | |
| 231 | +| [scripts/conflation/diff_change_detection.py](../scripts/conflation/diff_change_detection.py) | Demoted-POI CSV producer. | |
| 232 | +| [vetting_viz/](../vetting_viz/) | Manual review UI. | |
| 233 | +| [Makefile](../Makefile) | `make conflate` orchestrator + sub-targets. | |
| 234 | +| [config.yaml](../config.yaml) → `conflation.change_detection` | All tunables. | |