Commit d0b20be

Add documentation related to snapshot conflation.
1 parent 0481f63 commit d0b20be

4 files changed

Lines changed: 317 additions & 4 deletions

.claude/CLAUDE.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,7 @@ Style: Black (format-on-save in VSCode). Lint: flake8 + pylint, configured in `p
5454
- [docs/package-versioning.md](docs/package-versioning.md) — semver bumps for the Python package + Vue site + Sphinx docs (distinct from data versioning)
5555
- [docs/partitioning-strategy.md](docs/partitioning-strategy.md) — Hive layout of the partitioned Parquet (`shared_label` for conflated, `primary_tag` for OSM), query patterns, when each layout applies
5656
- [docs/turnover-model-methodology.md](docs/turnover-model-methodology.md) — statistical derivation of the POI turnover model with ZIE extension
57+
- [docs/change-detection.md](docs/change-detection.md) — OSM-history-derived ghost POIs, shadow matching, and the per-`shared_label` δ penalty applied to Overture. Canonical entry point is `make conflate`, which runs the three-step `build_ghosts` → `conflate.py --output-suffix=baseline` → `apply_change_detection.py` pipeline so all national runs include the CD penalty by default.
5758

5859
## Running to-do
5960

.claude/skills/conflate-snapshots/SKILL.md

Lines changed: 22 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ upload for web consumption.
1212

1313
- Rated OSM snapshot (`osm_snapshot_rated.parquet`) at `versions.snapshot_osm` — produced by [skills/full-data-pull](../full-data-pull/SKILL.md) step 3.
1414
- Overture snapshot (`overture_snapshot.parquet`) at `versions.snapshot_overture`.
15+
- OSM history parquets (`osm_versions.parquet`, `osm_changes.parquet`) at `versions.osm_data` — produced by [skills/model-history-pipeline](../model-history-pipeline/SKILL.md). Required by the change-detection step in stage 4.
1516
- **Fresh Source Cooperative temp credentials** in `.env.json` at the repo root. Tokens expire in ~1 hour.
1617

1718
> ⚠️ **Credential refresh check.** Source Cooperative uses short-lived AWS
@@ -37,12 +38,27 @@ upload for web consumption.
3738

3839
3. **Sync taxonomy if crosswalks changed** — run the [sync-taxonomy](../sync-taxonomy/SKILL.md) skill. It regenerates `site/public/taxonomy.html` and `site/src/taxonomy.generated.js`, and detects drift in the hand-maintained display labels.
3940

40-
4. **Run conflation** — ~22M POIs; peak RSS ~10 GB projected (actual peak prints at each phase via the `log_rss` lines in stdout; record the result here after each full run):
41+
4. **Run the conflation pipeline.** The canonical entry point is `make conflate`, which orchestrates three sub-steps so every national run gets the OSM-history change-detection penalty automatically (see [docs/change-detection.md](../../../docs/change-detection.md) for the design and tunables):
42+
1. `build_ghosts.py` — reconstruct ghost POIs from OSM history (`ghosts.parquet` under `versions.ghost_osm`).
43+
2. `conflate.py --output-suffix=baseline` — OSM × Overture matching, writes `conflated_baseline.parquet` (no-CD archive).
44+
3. `apply_change_detection.py` — shadow-match unmatched Overture against the ghosts and apply the per-`shared_label` δ penalty; writes the canonical `conflated.parquet`.
45+
4146
```bash
42-
python scripts/conflation/conflate.py # full run
43-
python scripts/conflation/conflate.py --test # Seattle bbox dry run
47+
make conflate # full CONUS, ~22M POIs, peak RSS ~10 GB (matching)
48+
# + ~5 GB (CD step) projected
49+
make conflate TEST=1 # Seattle bbox dry run
50+
51+
# Sub-targets for partial re-runs:
52+
make build_ghosts # ghosts only
53+
make conflate_baseline # matching only (writes conflated_baseline.parquet)
54+
make apply_cd # CD pass only (reads baseline, writes conflated.parquet)
4455
```
45-
Outputs: `conflated.parquet`, `match_diagnostics.parquet`.
56+
57+
Outputs:
58+
- `conflated.parquet` — canonical output that downstream steps consume (CD applied).
59+
- `conflated_baseline.parquet` — same shape but without the CD penalty; kept on disk for spot-checks.
60+
- `ghosts.parquet` under `versions.ghost_osm` — see [docs/change-detection.md](../../../docs/change-detection.md).
61+
- `match_diagnostics.parquet`.
4662

4763
5. **Match-rate sanity check**:
4864
```bash
@@ -103,6 +119,8 @@ upload for web consumption.
103119
- Matching: [src/openpois/conflation/match.py](../../../src/openpois/conflation/match.py)
104120
- Merging: [src/openpois/conflation/merge.py](../../../src/openpois/conflation/merge.py)
105121
- Taxonomy assignment: [src/openpois/conflation/taxonomy.py](../../../src/openpois/conflation/taxonomy.py)
122+
- Change-detection (ghost emission + shadow matching + R1): [src/openpois/conflation/ghost_osm.py](../../../src/openpois/conflation/ghost_osm.py), [src/openpois/conflation/change_detection.py](../../../src/openpois/conflation/change_detection.py)
106123
- Publish orchestration: [scripts/publish/upload_to_source_coop.py](../../../scripts/publish/upload_to_source_coop.py)
107124
- Source Coop S3 adapter: [src/openpois/io/source_coop.py](../../../src/openpois/io/source_coop.py)
108125
- Conflation algorithm docs: [scripts/conflation/README.md](../../../scripts/conflation/README.md)
126+
- Change-detection design: [docs/change-detection.md](../../../docs/change-detection.md)

Makefile

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,66 @@ site_preview:
4949
@cp -r docs/_build/html site/dist/docs
5050
@python -m http.server 4173 --directory site/dist;
5151

52+
# -----------------------------------------------------------------------------
53+
# Conflation pipeline (canonical entry point for all national runs)
54+
#
55+
# `make conflate` runs the three steps that produce the published
56+
# conflated.parquet:
57+
#
58+
# 1. build_ghosts.py - reconstruct "ghost" POI dataset
59+
# from OSM history (deletions,
60+
# primary-tag removals, lifecycle
61+
# prefixes, substantial renames).
62+
# 2. conflate.py - OSM x Overture matching as before,
63+
# written to conflated_baseline.parquet
64+
# so the pre-CD result is archived.
65+
# 3. apply_change_detection.py - penalize Overture POIs that shadow-
66+
# match a ghost; emits the canonical
67+
# conflated.parquet that downstream
68+
# summarize / format_for_upload /
69+
# prepare_pmtiles / publish steps
70+
# consume.
71+
#
72+
# Each sub-step tees a per-run log under ~/data/openpois/logs/.
73+
#
74+
# Pass TEST=1 to scope to the Seattle bbox:
75+
# make conflate # full CONUS
76+
# make conflate TEST=1 # Seattle bbox dry run
77+
#
78+
# Sub-targets (build_ghosts / conflate_baseline / apply_cd) are exposed
79+
# for partial re-runs when one stage is being iterated on.
80+
81+
TEST ?=
82+
TEST_FLAG := $(if $(TEST),--test,)
83+
LOG_DIR := $(HOME)/data/openpois/logs
84+
LOG_TS := $(shell date +%Y%m%d_%H%M%S)
85+
86+
.PHONY: conflate build_ghosts conflate_baseline apply_cd
87+
88+
build_ghosts:
89+
@mkdir -p $(LOG_DIR)
90+
@$(CONDA_PYTHON) -u scripts/conflation/build_ghosts.py \
91+
2>&1 | tee $(LOG_DIR)/build_ghosts_$(LOG_TS).log
92+
93+
conflate_baseline:
94+
@mkdir -p $(LOG_DIR)
95+
@$(CONDA_PYTHON) -u scripts/conflation/conflate.py \
96+
--output-suffix=baseline $(TEST_FLAG) \
97+
2>&1 | tee $(LOG_DIR)/conflate_baseline_$(LOG_TS).log
98+
99+
apply_cd:
100+
@mkdir -p $(LOG_DIR)
101+
@$(CONDA_PYTHON) -u scripts/conflation/apply_change_detection.py \
102+
--baseline-suffix=baseline --output-suffix="" $(TEST_FLAG) \
103+
2>&1 | tee $(LOG_DIR)/apply_cd_$(LOG_TS).log
104+
105+
conflate: build_ghosts conflate_baseline apply_cd
106+
@echo
107+
@echo "Conflation pipeline complete."
108+
@echo " Canonical output: ~/data/openpois/conflation/<version>/conflated.parquet"
109+
@echo " (no-CD archive: conflated_baseline.parquet)"
110+
@echo " Logs under: $(LOG_DIR)/{build_ghosts,conflate_baseline,apply_cd}_$(LOG_TS).log"
111+
52112
# Convenience target to print all of the available targets in this file
53113
# From https://stackoverflow.com/questions/4219255
54114
.PHONY: list

docs/change-detection.md

Lines changed: 234 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,234 @@
1+
# Change detection
2+
3+
OSM edit history is used to downweight Overture POIs whose location has seen a
4+
recent closure / rename / lifecycle event in OSM. This is a post-processing
5+
pass on the conflated dataset; the no-CD baseline is preserved as
6+
``conflated_baseline.parquet`` and the CD-applied result becomes the canonical
7+
``conflated.parquet`` that downstream partition / PMTiles / publish steps
8+
consume.
9+
10+
## Pipeline
11+
12+
Four stages, each separately runnable and individually inspectable:
13+
14+
```text
15+
OSM history parquets
16+
(osm_versions, osm_changes)
17+
18+
19+
1. build_ghosts.py
20+
for each (element, version) emit
21+
at most one of:
22+
• hard_delete
23+
• lifecycle_prefix_added
24+
• primary_tag_deleted
25+
• substantial_rename
26+
27+
▼ ghosts.parquet
28+
─── 2. conflate.py (--output-suffix=baseline) ──────────────────────────
29+
rated_osm ──► match ──► conflated_baseline.parquet
30+
overture ─────┘ (unchanged from the no-CD pipeline)
31+
32+
33+
─── 3. apply_change_detection.py ──────────────────────────────────────
34+
conflated_baseline.parquet
35+
36+
shadow-matcher
37+
(reuses match.py scoring)
38+
39+
│ matches
40+
41+
R1: current-OSM-survivor filter
42+
(drop if a live OSM POI with the
43+
same name lives within 50 m)
44+
45+
46+
new_conf = old_conf × δ_group
47+
audit columns appended
48+
49+
▼ conflated.parquet (canonical)
50+
─── 4. downstream (unchanged) ─────────────────────────────────────────
51+
summarize.py · format_for_upload.py · prepare_pmtiles.py · publish
52+
```
53+
54+
## How to run it
55+
56+
The Makefile target wires the three new sub-steps together and is the
57+
canonical entry point for national runs:
58+
59+
```bash
60+
conda activate openpois
61+
62+
make conflate # full CONUS
63+
make conflate TEST=1 # Seattle bbox dry run
64+
```
65+
66+
Each sub-step writes its own log under `~/data/openpois/logs/`. Sub-targets
67+
exist for partial re-runs:
68+
69+
| Target | When to use |
70+
|---|---|
71+
| `make build_ghosts` | Re-derive ghosts after bumping `versions.osm_data`. |
72+
| `make conflate_baseline` | Re-run matching only; reuses the existing ghost build. |
73+
| `make apply_cd` | Re-apply the CD penalty (e.g., after tuning the δ source or `min_prior_name_match_score`). |
74+
| `make conflate` | End-to-end. Runs the three steps above in order. |
75+
76+
For one-off A/B experiments outside the make flow, see
77+
[scripts/conflation/apply_change_detection.py](../scripts/conflation/apply_change_detection.py)
78+
— it accepts `--baseline-suffix`, `--output-suffix`, `--no-survivor-filter`,
79+
and `--min-prior-name-score` for ablation.
80+
81+
## Stage details
82+
83+
### 1. Ghost extraction
84+
85+
[src/openpois/conflation/ghost_osm.py](../src/openpois/conflation/ghost_osm.py)
86+
walks `osm_changes.parquet` in one flat pass, maintaining a per-element rolling
87+
tag dictionary. For each `(element, version)` it emits at most one *ghost* of
88+
the priority-ordered event types:
89+
90+
1. `hard_delete` — `visible` flipped `true → false`. Fires regardless of name.
91+
2. `lifecycle_prefix_added` — a `disused:` / `was:` / `demolished:` /
92+
`abandoned:` / `removed:` / `razed:` key appeared. Only fires when the
93+
prior state was un-named (avoids the noise of named lifecycle retagging).
94+
3. `primary_tag_deleted` — a POI tag key was Deleted. Same no-prior-name gate.
95+
4. `substantial_rename` — `name` changed with `rapidfuzz.token_set_ratio < 50`
96+
**and** neither name is a token-level subset/superset of the other
97+
(guards "Walgreens" ↔ "Walgreens Pharmacy").
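
The priority order above can be sketched in a few lines. This is a minimal illustration, not the code in `ghost_osm.py`: `classify_event`, `POI_KEYS`, and `token_overlap_ratio` are hypothetical names, the POI key set is a placeholder, and the similarity function is a plain token-Jaccard stand-in for `rapidfuzz`'s `token_set_ratio` so that the block runs on the standard library alone.

```python
from typing import Callable, Mapping, Optional

LIFECYCLE_PREFIXES = (
    "disused:", "was:", "demolished:", "abandoned:", "removed:", "razed:",
)
# Placeholder set of POI tag keys (assumption; the project defines its own list).
POI_KEYS = frozenset({"amenity", "shop", "leisure", "tourism"})


def _tokens(name: str) -> set:
    return set(name.lower().split())


def token_overlap_ratio(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (token Jaccard, NOT token_set_ratio)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return 100.0 * len(ta & tb) / len(ta | tb)


def is_token_subset(a: str, b: str) -> bool:
    """True when one name's tokens are a subset/superset of the other's."""
    ta, tb = _tokens(a), _tokens(b)
    return ta <= tb or tb <= ta


def classify_event(
    prior_tags: Mapping[str, str],
    new_tags: Mapping[str, str],
    visible: bool,
    prior_visible: bool = True,
    similarity: Callable[[str, str], float] = token_overlap_ratio,
    rename_threshold: float = 50.0,
) -> Optional[str]:
    """Return at most one ghost event type for a single (element, version)."""
    prior_name = prior_tags.get("name")

    # 1. hard_delete: visible flipped true -> false, regardless of name.
    if prior_visible and not visible:
        return "hard_delete"

    # 2 and 3 share the no-prior-name gate.
    if not prior_name:
        added = set(new_tags) - set(prior_tags)
        if any(key.startswith(LIFECYCLE_PREFIXES) for key in added):
            return "lifecycle_prefix_added"
        removed = set(prior_tags) - set(new_tags)
        if removed & POI_KEYS:
            return "primary_tag_deleted"

    # 4. substantial_rename: low similarity AND no subset/superset relation.
    new_name = new_tags.get("name")
    if prior_name and new_name and prior_name != new_name:
        if (similarity(prior_name, new_name) < rename_threshold
                and not is_token_subset(prior_name, new_name)):
            return "substantial_rename"
    return None
```

Passing the similarity function as a parameter keeps the ordering logic testable without the fuzzy-matching dependency; in a real run the `rapidfuzz` scorer would be supplied instead of the stand-in.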
98+
99+
Nodes only in the current implementation: ways and relations would require
100+
geometry reconstruction beyond what the per-version parquets capture.
101+
102+
Critical upstream fix: the OSM history ingestion in
103+
[src/openpois/io/osm_history_pbf.py](../src/openpois/io/osm_history_pbf.py)
104+
uses a two-pass filter (`osmium tags-filter` → ID list → `osmium getid
105+
--with-history`). A single `tags-filter` pass silently drops every deletion
106+
version (those rows carry no tags), which makes `hard_delete` impossible to
107+
observe. The two-pass approach recovers ~600 k node deletions nationwide.
108+
109+
### 2. Baseline conflation
110+
111+
[scripts/conflation/conflate.py](../scripts/conflation/conflate.py) runs the
112+
existing matcher unchanged. The only addition is `--output-suffix=baseline`,
113+
which writes `conflated_baseline.parquet` so the no-CD result is preserved
114+
side-by-side with the CD-applied canonical output.
115+
116+
### 3. Shadow matching + penalty
117+
118+
[src/openpois/conflation/change_detection.py](../src/openpois/conflation/change_detection.py)
119+
does three things:
120+
121+
1. **Shadow matching** — for each unmatched-Overture row from the baseline,
122+
find ghosts within the per-`shared_label` radius (BallTree on ghost
123+
centroids, haversine metric). Reuses the composite scoring from
124+
[src/openpois/conflation/match.py](../src/openpois/conflation/match.py)
125+
with `distance_weight=0`, `name_weight=0.5`, `type_weight=0.3`,
126+
`identifier_weight=0.2`. Type score is binary on exact `shared_label`
127+
equality. Greedy one-to-one above `min_shadow_match_score = 0.50`.
128+
Subset/superset name pairs are dropped as obvious same-entity matches.
129+
130+
2. **R1 current-OSM-survivor filter** — for each surviving shadow match,
131+
spatial-query the live rated snapshot for OSM POIs within 50 m of the
132+
Overture centroid. If any has `token_set_ratio ≥ 70` against the Overture
133+
name, the match is dropped: the POI is presumably still in OSM under different
134+
geometry / spelling and the primary matcher just missed it. Implemented
135+
via DuckDB centroid extraction + sklearn `BallTree` haversine query;
136+
nationwide cost ~90 s, ~3-5 GB peak memory.
137+
138+
3. **Penalty** — multiply Overture's `conf_mean` by the fitted δ for the
139+
ghost's `shared_label`. δ is the per-group `delta` posterior mean from the
140+
`RandomByTypeModel` fit (read from `fitted_params.csv`); falls back to
141+
`default_delta` (0.062) for groups absent from the fit. Audit columns are
142+
appended on penalized rows: `shadow_matched`, `shadow_ghost_id`,
143+
`shadow_event_type`, `shadow_event_timestamp`, `shadow_score`,
144+
`shadow_distance_m`, `original_conf_mean`.
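
As a rough sketch of steps 1 and 3 above, the snippet below combines the stated weights into a composite score, assigns matches greedily one-to-one, and applies the δ multiplier. The function names and the input shapes are hypothetical, the component scores are assumed to be scaled to [0, 1], and the threshold is treated as inclusive, which the text does not confirm. The real scoring lives in `match.py` and may differ.

```python
from typing import Dict, Iterable, Mapping, Tuple

# Weights quoted in the text; distance_weight is 0, so distance is omitted.
NAME_WEIGHT, TYPE_WEIGHT, IDENTIFIER_WEIGHT = 0.5, 0.3, 0.2
MIN_SHADOW_MATCH_SCORE = 0.50
DEFAULT_DELTA = 0.062


def composite_score(name_score: float, same_label: bool,
                    identifier_score: float) -> float:
    """Weighted sum; the type component is binary on shared_label equality."""
    type_score = 1.0 if same_label else 0.0
    return (NAME_WEIGHT * name_score
            + TYPE_WEIGHT * type_score
            + IDENTIFIER_WEIGHT * identifier_score)


def greedy_one_to_one(
    candidates: Iterable[Tuple[str, str, float]],
    min_score: float = MIN_SHADOW_MATCH_SCORE,
) -> Dict[str, Tuple[str, float]]:
    """Take (overture_id, ghost_id, score) triples best-first.

    Each Overture row and each ghost is used at most once.
    """
    matches: Dict[str, Tuple[str, float]] = {}
    used_ghosts = set()
    for overture_id, ghost_id, score in sorted(
            candidates, key=lambda c: c[2], reverse=True):
        if score < min_score:
            break  # sorted descending, so nothing later can qualify
        if overture_id in matches or ghost_id in used_ghosts:
            continue
        matches[overture_id] = (ghost_id, score)
        used_ghosts.add(ghost_id)
    return matches


def penalized_confidence(conf_mean: float, ghost_label: str,
                         fitted_delta: Mapping[str, float],
                         default_delta: float = DEFAULT_DELTA) -> float:
    """new_conf = old_conf * delta for the ghost's shared_label."""
    return conf_mean * fitted_delta.get(ghost_label, default_delta)
```

With these weights, a same-label pair needs a name score of at least 0.4 to reach the 0.50 threshold when no identifier agrees, so type agreement alone (0.3) never produces a shadow match.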
145+
146+
### 4. Downstream consumption
147+
148+
`summarize.py`, `format_for_upload.py`, `prepare_pmtiles.py`, and
149+
`publish/upload_to_source_coop.py` all read `conflated.parquet` by config and
150+
require no changes. They now consume the CD-applied output by default. The
151+
no-CD archive (`conflated_baseline.parquet`) is left on disk for spot-checks
152+
and ablation.
153+
154+
## Tunables
155+
156+
Under `conflation.change_detection` in [config.yaml](../config.yaml):
157+
158+
| Knob | Default | Effect |
159+
|---|---|---|
160+
| `enabled` | `false` | Reserved; the production gate is the matcher itself, not this flag. |
161+
| `min_shadow_match_score` | `0.50` | Composite score threshold for the shadow matcher. |
162+
| `name_change_similarity_threshold` | `50` | Below this `token_set_ratio`, a name change becomes a `substantial_rename` ghost. |
163+
| `default_delta` | `0.062` | Fallback δ for `shared_label` values absent from the fitted model. Equals `sigmoid(logit_delta_0)` for the current fit. |
164+
| `min_prior_name_match_score` | `0` | Hard gate on Overture-vs-prior-name `token_set_ratio` before any composite scoring. **Leave at 0** — values ≥ 70 produce high precision but miss real closures where Overture has updated to a different current business name at a churned address. |
165+
| `suppress_if_current_survivor.enabled` | `true` | R1 filter on/off. |
166+
| `suppress_if_current_survivor.radius_m` | `50` | R1 search radius (meters). |
167+
| `suppress_if_current_survivor.name_similarity_threshold` | `70` | R1 token_set_ratio gate. |
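
To make the `default_delta` row concrete, the relationship `default_delta = sigmoid(logit_delta_0)` can be inverted to see which logit-scale intercept a fallback of 0.062 implies. This is only arithmetic on the documented default; the actual fitted `logit_delta_0` is not quoted here, so the value below is derived, not read from `fitted_params.csv`.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


DEFAULT_DELTA = 0.062

# Intercept implied by the documented fallback (roughly -2.72 on the logit scale).
implied_logit_delta_0 = logit(DEFAULT_DELTA)

# Round trip: applying sigmoid recovers the configured default.
recovered_delta = sigmoid(implied_logit_delta_0)
```

If the model is refit and `logit_delta_0` moves, `default_delta` would need to be updated to match for the "equals" claim in the table to keep holding.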
168+
169+
## Validation
170+
171+
Last hand-vetted Seattle A/B (May 2026, 290 reviewed POIs):
172+
173+
| | Baseline | With CD |
174+
|---|---|---|
175+
| Demoted Overture rows | 0 | 293 |
176+
| Vetted true-drops captured | n/a | 221 |
177+
| Vetted false-drops still penalized | n/a | 58 |
178+
| Precision (vs vetted truth) | n/a | 79.2 % |
179+
180+
The remaining ~21 % false-discovery rate (1 − precision) is the cost of
181+
catching the broad "churn at this address" signal; see the open per-region
182+
calibration TODO for the planned follow-up.
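
For reference, the precision figure follows directly from the two vetted rows of the table. The snippet only reproduces that arithmetic; it assumes the denominator is the sum of vetted true-drops and vetted false-drops, since the table does not say how the remaining demoted rows were classified.

```python
true_drops = 221   # vetted true-drops captured
false_drops = 58   # vetted false-drops still penalized

vetted_total = true_drops + false_drops          # 279
precision = true_drops / vetted_total
false_discovery_rate = false_drops / vetted_total

print(f"precision={precision:.1%} fdr={false_discovery_rate:.1%}")
# prints: precision=79.2% fdr=20.8%
```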
183+
184+
## Known limits
185+
186+
- **Asymmetric blindness.** OSM history captures closures cleanly but is
187+
silent on new openings. A real closure (e.g., a node tagged "Calvary
188+
Chapel" was deleted) plus a different current business at the same address
189+
(Overture shows "Redemption Church") reads as evidence Overture is stale,
190+
even when Overture is right. This explains ~75 % of the residual false
191+
positives on the Seattle vetting set and is intrinsic — fixing it requires
192+
data we don't currently ingest (Overture POI creation timestamps,
193+
per-region prior calibration, or ground-truth surveys).
194+
195+
- **Single national δ per `shared_label`.** The fitted turnover model is
196+
national-average, so the penalty magnitude is wrong in regions where OSM
197+
mapping is sparse or stale. A per-state override mechanism is tracked in
198+
[.claude/TODO.md](../.claude/TODO.md).
199+
200+
- **DuckDB v1.4.1 has a buggy `ST_Distance_Sphere`.** The bundled spherical
201+
distance returns values ~25 % too high at continental scale and ~30-50 %
202+
off at small scales. The pipeline does **not** use `ST_Distance_Sphere`
203+
anywhere — distance work goes through `sklearn.neighbors.BallTree` with
204+
the haversine metric — but any new code touching this area should avoid it
205+
until the DuckDB pin is bumped. Tracked in
206+
[.claude/TODO.md](../.claude/TODO.md).
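
When new code in this area needs a distance and `ST_Distance_Sphere` is off the table, a plain haversine function is enough for spot-checks. The sketch below uses only the standard library; the function names and the mean Earth radius constant are assumptions for illustration, and the brute-force radius filter is a stand-in for the `BallTree` query the pipeline actually uses.

```python
import math
from typing import Iterable, List, Tuple

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed constant)

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(center: LatLon, points: Iterable[LatLon],
                  radius_m: float = 50.0) -> List[LatLon]:
    """Brute-force radius filter; fine for spot-checks, not for national runs."""
    return [p for p in points if haversine_m(*center, *p) <= radius_m]
```

Note that scikit-learn's `BallTree` with `metric="haversine"` works in radians on a unit sphere, so a radius in meters has to be divided by the Earth radius before querying and the returned distances multiplied back.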
207+
208+
## Vetting tool
209+
210+
[vetting_viz/](../vetting_viz/) is a single-page Leaflet app for hand-vetting
211+
the demoted-POI CSV produced by `diff_change_detection.py`. Run via:
212+
213+
```bash
214+
conda run -n openpois python -m http.server --directory vetting_viz 8765
215+
# → http://localhost:8765/ → Load CSV → seattle_demoted_pois_v6.csv
216+
```
217+
218+
Markers are colored by vetting status; clicking a point opens the full row
219+
plus a radio for tagging. Export to CSV when done; reload to resume the
220+
session.
221+
222+
## File map
223+
224+
| Path | Role |
225+
|---|---|
226+
| [src/openpois/io/osm_history_pbf.py](../src/openpois/io/osm_history_pbf.py) | Two-pass filter to retain deletion versions. |
227+
| [src/openpois/conflation/ghost_osm.py](../src/openpois/conflation/ghost_osm.py) | `_scan_all_changes`, event-type detection. |
228+
| [scripts/conflation/build_ghosts.py](../scripts/conflation/build_ghosts.py) | Stage 1 driver. |
229+
| [src/openpois/conflation/change_detection.py](../src/openpois/conflation/change_detection.py) | Shadow matcher, R1 filter, δ penalty. |
230+
| [scripts/conflation/apply_change_detection.py](../scripts/conflation/apply_change_detection.py) | Stage 3 driver. |
231+
| [scripts/conflation/diff_change_detection.py](../scripts/conflation/diff_change_detection.py) | Demoted-POI CSV producer. |
232+
| [vetting_viz/](../vetting_viz/) | Manual review UI. |
233+
| [Makefile](../Makefile) | `make conflate` orchestrator + sub-targets. |
234+
| [config.yaml](../config.yaml) → `conflation.change_detection` | All tunables. |
