
Commit e9c5c4f

Merge pull request #34 from henryspatialanalysis/lifecycle/may-2026-release
Lifecycle/may 2026 release
2 parents 14d6280 + f86b99d commit e9c5c4f

8 files changed

Lines changed: 273 additions & 43 deletions


.claude/skills/conflate-snapshots/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ upload for web consumption.
 
 - Rated OSM snapshot (`osm_snapshot_rated.parquet`) at `versions.snapshot_osm` — produced by [skills/full-data-pull](../full-data-pull/SKILL.md) step 3.
 - Overture snapshot (`overture_snapshot.parquet`) at `versions.snapshot_overture`.
-- OSM history parquets (`osm_versions.parquet`, `osm_changes.parquet`) at `versions.osm_data`, produced by [skills/model-history-pipeline](../model-history-pipeline/SKILL.md). Required by the change-detection step in stage 4.
+- OSM history parquets (`osm_versions.parquet`, `osm_changes.parquet`) at `versions.osm_data`, **regenerated each month** by [skills/full-data-pull](../full-data-pull/SKILL.md) step 2 (via `scripts/osm_data/download_history.py`). The full re-fit pipeline at [skills/model-history-pipeline](../model-history-pipeline/SKILL.md) is only invoked when re-fitting λ. Required by the change-detection step in stage 4.
 - **Fresh Source Cooperative temp credentials** in `.env.json` at the repo root. Tokens expire in ~1 hour.
 
 > ⚠️ **Credential refresh check.** Source Cooperative uses short-lived AWS

.claude/skills/full-data-pull/SKILL.md

Lines changed: 11 additions & 3 deletions
@@ -5,14 +5,15 @@ description: Use when the user wants to refresh the independent POI snapshots (O
 
 # Full data pull
 
-Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR, VI, GU, MP, AS) and applies the rating model to OSM so conflation can run.
+Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR, VI, GU, MP, AS), refreshes the OSM history that drives ghost reconstruction, and applies the rating model to OSM so conflation can run.
 
 ## Prerequisites
 
 - conda env `openpois` active.
 - For OSM: `osmium` in env bin (resolved automatically via `Path(sys.executable).parent / "osmium"`).
 - Boundary cache at `directories.boundary` (auto-downloads on first use).
 - A fitted model exists for the OSM rating step (see [skills/model-history-pipeline](../model-history-pipeline/SKILL.md)).
+- For OSM history: a fresh Geofabrik OAuth cookie file at `download.osm.history_cookie_file` (Netscape format; any OSM account works). See [docs/data-sources.md](../../docs/data-sources.md#osm-history-geofabrik-full-history-pbfs).
 
 ## Steps
 
@@ -21,23 +22,30 @@ Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR,
    versions:
      snapshot_osm: "YYYYMMDD"
      snapshot_overture: "YYYYMMDD"
+     osm_data: "YYYYMMDD" # bumps each month — history is refreshed for ghosts
+     ghost_osm: "YYYYMMDD" # pinned to osm_data; bumps in lockstep
    ```
-   See [docs/data-versioning.md](../../docs/data-versioning.md).
+   `model_output` does **not** bump unless you're re-fitting λ from scratch (see [skills/model-history-pipeline](../model-history-pipeline/SKILL.md)). See [docs/data-versioning.md](../../docs/data-versioning.md).
 
 2. **Run the downloads** (independent — order doesn't matter, can run in parallel):
 
    ```bash
    python scripts/osm_snapshot/download.py # 4 Geofabrik PBFs → osm_snapshot.parquet
    python scripts/overture/download.py # DuckDB over S3 → overture_snapshot.parquet
+   python scripts/osm_data/download_history.py # 4 internal OSH PBFs → osm_versions.parquet + osm_changes.parquet
    ```
-   The snapshot loader pulls 4 extracts in sequence: `us`, `pr`, `usvi`, `american_oceania`. Per-source details, auth, and schema quirks are in [docs/data-sources.md](../../docs/data-sources.md).
+   Each loader pulls 4 extracts in sequence: `us`, `pr`, `usvi`, `american_oceania`. Per-source details, auth, and schema quirks are in [docs/data-sources.md](../../docs/data-sources.md).
 
    **Gotcha — interrupted snapshot runs**: all 4 extracts share `~/data/openpois/snapshots/osm/<v>/parse_chunks/`. If a run dies between extracts, leftover chunks from extract N may be silently mistaken for extract N+1's parsed output on resume (the parser short-circuits on existing chunks). Before resuming an interrupted snapshot run, nuke the work dir:
    ```bash
    rm -rf ~/data/openpois/snapshots/osm/{version}/parse_chunks/
    ```
    This forces a clean re-parse of whichever extract was in flight; completed extracts (which write their own per-extract intermediate parquet next to the final output) are still skipped.
 
+   **Gotcha — `download_history.py` is for ghost regeneration only**: do **not** re-run `scripts/osm_data/format_tabular.py` or `scripts/models/osm_turnover.py` in the monthly cycle — those are part of the model-fit pipeline, which stays pinned to `versions.model_output`. The monthly history refresh only feeds `build_ghosts.py` (invoked by `make conflate`).
+
+   **Gotcha — per-territory 404 tolerance**: if Geofabrik stops publishing a territory's `*-internal.osh.pbf`, the loader logs a warning, skips that extract, and continues. The territory's POIs still flow through downstream stages but the rater falls back to the global-mean δ for its `shared_label`s.
+
 3. **Apply the rating model to OSM** → `osm_snapshot_rated.parquet`:
    ```bash
    python scripts/osm_snapshot/apply_model.py

CHANGELOG.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Changelog
+
+## 2026-05-21-v0
+
+### Snapshot inputs
+
+| Source                 | Value                                       |
+| ---------------------- | ------------------------------------------- |
+| OSM snapshot date      | 2026-05-21                                  |
+| Overture release       | `2026-05-20.0` (pinned)                     |
+| OSM snapshot rows      | 8,799,633                                   |
+| Overture snapshot rows | 13,458,763                                  |
+| Boundary footprint     | US + all territories (PR, USVI, GU, MP, AS) |
+
+### Conflated output
+
+| Metric                       | This run    | Prior        | Δ                       |
+| ---------------------------- | ----------- | ------------ | ----------------------- |
+| Total rows                   | 17,989,377  | 17,788,585   | +200,792 (+1.13%)       |
+| Matched OSM × Overture       | 2,696,484   | 2,677,091    | +19,393 (+0.72%)        |
+| OSM-only                     | 6,103,149   | 6,031,413    | +71,736 (+1.19%)        |
+| Overture-only                | 9,189,744   | 9,080,081    | +109,663 (+1.21%)       |
+| Shadow-matched (CD penalty)  | 47,925      | n/a          | new — first run with change detection |
+| Shared labels                | 93          | 93           | unchanged set           |
+
+### Methods changes vs. prior release
+
+- **Change detection (new).** Post-conflation pass that reconstructs "ghost" POIs from OSM history (deleted or renamed nodes) and uses them to penalize unmatched Overture POIs that shadow-match a ghost. Penalty multiplies the Overture row's `conf_mean` by the per-`shared_label` δ from the fitted turnover model. 47,925 rows penalized this run. Adds audit columns to every conflated row: `shadow_matched`, `shadow_ghost_id`, `shadow_event_type`, `shadow_event_timestamp`, `shadow_score`, `shadow_distance_m`, `original_conf_mean`. **PR #29**; design in `docs/change-detection.md`.
+- **US territory expansion.** Spatial footprint widened from CONUS + PR to include all US territories (PR, USVI, GU, MP, AS). Affects both snapshots and the conflation domain. **PR #31**.
+- **Wider metadata propagation.** Additional OSM and Overture metadata fields now flow through to the conflated parquet (website, wikidata, wikipedia, etc.). **PR #30**.
+- **PMTiles re-tuned.** Zoom range narrowed to Z10–Z14 with `--drop-densest-as-needed`, so feature drops cascade through low zooms instead of failing tile builds. Site updated with zoom-aware point styling. **PR #33**.
+- **Covering bbox in partitioned parquet.** GeoParquet 1.1 `bbox` struct column emitted via `write_covering_bbox=True`, enabling DuckDB row-group pruning on viewport queries. **PR #32**.
+- **Overture release pinned.** `download.overture.release_date` set to `2026-05-20.0` (was `null` = auto-detect latest). Future runs against the same pin are deterministic.
+- **Pipeline memory hardening (uncommitted on `lifecycle/may-2026-release`).** Both `apply_change_detection.py` and the partitioned-write helper hit the 24 GB WSL cap on nationwide inputs. The CD writer now mutates in place and streams the output parquet in row-group chunks via `pyarrow.parquet.ParquetWriter`; the geohash partition writer drops one full-partition copy (numpy `argsort` + `iloc` instead of pandas `sort_values`) and streams large partitions in chunks. See `src/openpois/conflation/change_detection.py` and `src/openpois/io/geohash_partition.py`.
+
+### Taxonomy changes
+
+**Overture crosswalk** (`src/openpois/conflation/data/taxonomy_crosswalk_overture_maps.csv`, uncommitted on `lifecycle/may-2026-release`): 7 new entries under `services_and_business.family_service`, previously unmapped and dropped from the partitioned output.
+
+| Overture sub-category         | Maps to            |
+| ----------------------------- | ------------------ |
+| `funeral_service`             | Other Professional |
+| `adoption_service`            | Other Professional |
+| `family_service_center`       | Other Professional |
+| `nanny_service`               | Other Professional |
+| `genealogist`                 | Other Professional |
+| `elder_care_planning`         | Other Professional |
+| `mobility_equipment_service`  | Other Shop         |
+
+This is the proximate cause of the +22,715 row jump (+8.45%) in **Other Professional**.
+
+No OSM-side taxonomy changes since 2026-04-23.
+
+### Top label-level row-count changes
+
+| Shared label        | This run    | Prior       | Δ rows    | Δ %     | Δ matched |
+| ------------------- | ----------- | ----------- | --------- | ------- | --------- |
+| Specialty Store     | 1,026,395   | 917,422     | +108,973  | +11.88% | +753      |
+| Other Amenity       | 3,858,315   | 3,819,068   | +39,247   | +1.03%  | +3,124    |
+| Clothing Store      | 317,177     | 288,506     | +28,671   | +9.94%  | +779      |
+| Other Professional  | 291,500     | 268,785     | +22,715   | +8.45%  | 0         |
+| Other Healthcare    | 995,881     | 1,016,112   | −20,231   | −1.99%  | +54       |
+| (unlabeled)         | 701,209     | 719,862     | −18,653   | −2.59%  | +1,506    |
+| Car Dealer          | 182,314     | 164,517     | +17,797   | +10.82% | +521      |
+| Restaurant          | 718,472     | 702,020     | +16,452   | +2.34%  | +1,092    |
+| Supermarket         | 193,777     | 179,783     | +13,994   | +7.78%  | +361      |
+| Recreation          | 1,302,776   | 1,293,338   | +9,438    | +0.73%  | −510      |
+
+Drivers:
+- Most positive movers (Specialty Store, Clothing Store, Car Dealer, Supermarket, Bakery, Charging Station) track Overture's snapshot growth (+2.5% overall) landing in shared labels with moderate base counts.
+- **Other Professional** also reflects the new `family_service` crosswalk entries above.
+- **Other Healthcare** dropping by ~20k against a larger Overture snapshot is worth a closer look — likely an Overture taxonomy reshuffle inside `health_and_medical` upstream. Flagged for QA, not blocking.
+
+### Version pins
+
+| Key                          | This run                 | Prior                    |
+| ---------------------------- | ------------------------ | ------------------------ |
+| `versions.conflation`        | 20260521                 | 20260423                 |
+| `versions.snapshot_osm`      | 20260521                 | 20260417                 |
+| `versions.snapshot_overture` | 20260521                 | 20260423                 |
+| `versions.osm_data`          | 20260521                 | 20260515                 |
+| `versions.ghost_osm`         | 20260521                 | 20260515                 |
+| `versions.model_output`      | 20260422_by_shared_label | 20260422_by_shared_label (unchanged — model not refit this cycle) |
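
The Δ columns in the changelog tables follow directly from the "This run" and "Prior" counts. A minimal sketch of that arithmetic, using figures from the "Conflated output" and "Top label-level row-count changes" tables; the helper name `format_delta` is hypothetical and not part of the repository, and Python prints an ASCII minus where the changelog uses "−":

```python
def format_delta(current: int, prior: int) -> str:
    """Format an absolute and relative change, e.g. '+200,792 (+1.13%)'."""
    diff = current - prior
    pct = 100.0 * diff / prior  # assumes prior is non-zero
    return f"{diff:+,} ({pct:+.2f}%)"


# Figures copied from the changelog tables.
matched, osm_only, overture_only = 2_696_484, 6_103_149, 9_189_744
total_now, total_prior = 17_989_377, 17_788_585

# The three match classes partition the conflated output.
assert matched + osm_only + overture_only == total_now

total_delta = format_delta(total_now, total_prior)
healthcare_delta = format_delta(995_881, 1_016_112)
print(total_delta)
print(healthcare_delta)
```

The partition check also shows that matched plus OSM-only rows equal the 8,799,633 OSM snapshot rows listed under "Snapshot inputs".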

config.yaml

Lines changed: 9 additions & 8 deletions
@@ -1,15 +1,16 @@
 # Versioned directories (used with config.get_dir_path())
 versions:
-  osm_data: "20260515"
+  osm_data: "20260521"
   model_output: "20260422_by_shared_label"
-  snapshot_osm: "20260417"
-  snapshot_overture: "20260423"
-  conflation: "20260423"
-  source_coop: "2026-04-23-v0" # Source Cooperative upload folder (YYYY-MM-DD-v<IDX>); bump v<IDX> only for same-day re-uploads
+  snapshot_osm: "20260521"
+  snapshot_overture: "20260521"
+  conflation: "20260521"
+  source_coop: "2026-05-21-v0" # Source Cooperative upload folder (YYYY-MM-DD-v<IDX>); bump v<IDX> only for same-day re-uploads
   # Ghost POI dataset reconstructed from OSM history (one row per
   # detected previous-state event). Pinned to the same value as
-  # ``osm_data`` since it is derived from the same history parquets.
-  ghost_osm: "20260515"
+  # ``osm_data`` since it is derived from the same history parquets,
+  # and regenerated together with the monthly snapshot refresh.
+  ghost_osm: "20260521"
 
 # Settings for downloading data
 download:
@@ -74,7 +75,7 @@ download:
       'website','wikidata','wikipedia'
     ]
   overture:
-    release_date: null # null = auto-detect latest
+    release_date: "2026-05-20.0" # pin for determinism; null = auto-detect latest
     s3_bucket: "overturemaps-us-west-2"
     s3_region: "us-west-2"
     # DuckDB resource caps for the per-part S3 scans and the final polygon
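
The config comments state that `ghost_osm` is pinned to `osm_data` and that the two bump in lockstep. A minimal sketch of a pre-flight check for that invariant, assuming the `versions` block has already been loaded into a plain dict; `check_version_pins` is a hypothetical helper, not something the repository is known to provide:

```python
def check_version_pins(versions: dict) -> list:
    """Return human-readable problems found in the versions block."""
    problems = []
    # ghost_osm is derived from the same history parquets as osm_data,
    # so the two pins are expected to be identical.
    if versions.get("ghost_osm") != versions.get("osm_data"):
        problems.append(
            f"ghost_osm ({versions.get('ghost_osm')}) is not pinned to "
            f"osm_data ({versions.get('osm_data')})"
        )
    return problems


# Values taken from the updated config.yaml in this commit.
versions = {
    "osm_data": "20260521",
    "model_output": "20260422_by_shared_label",
    "snapshot_osm": "20260521",
    "snapshot_overture": "20260521",
    "conflation": "20260521",
    "ghost_osm": "20260521",
}
pin_problems = check_version_pins(versions)
stale_problems = check_version_pins({**versions, "ghost_osm": "20260515"})
print(pin_problems, stale_problems)
```

Returning a list instead of raising lets a caller report every mismatch at once if more invariants are added later.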

scripts/publish/upload_to_source_coop.py

Lines changed: 27 additions & 0 deletions
@@ -79,6 +79,14 @@ def parse_args() -> argparse.Namespace:
             "default because these rarely change."
         ),
     )
+    parser.add_argument(
+        "--skip-changelog", action = "store_true",
+        help = (
+            "Skip uploading CHANGELOG.md to the repo top level. Default "
+            "is to upload it on every run so the public copy stays in "
+            "sync with the latest per-release deltas."
+        ),
+    )
     parser.add_argument(
         "--skip-latest-mirror", action = "store_true",
         help = (
@@ -204,6 +212,25 @@ def main() -> None:
             f"deleted {summary['deleted']} stale object(s)."
         )
 
+    # -------------------------------------------------------------------------
+    # Top-level CHANGELOG.md (per-release deltas, updated every run by default)
+    # -------------------------------------------------------------------------
+    if not args.skip_changelog:
+        changelog_path = CONFIG_PATH.parent / "CHANGELOG.md"
+        if changelog_path.exists():
+            upload_bytes(
+                client = client,
+                data = changelog_path.read_bytes(),
+                bucket = bucket,
+                key = f"{repo_prefix}/CHANGELOG.md",
+                content_type = "text/markdown; charset=utf-8",
+                dry_run = args.dry_run,
+            )
+        else:
+            print(
+                f"Skipping CHANGELOG.md upload — {changelog_path} not found."
+            )
+
     # -------------------------------------------------------------------------
     # Top-level README + LICENSE (opt-in)
     # -------------------------------------------------------------------------
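
The new flag follows an opt-out pattern: the changelog upload runs by default and `--skip-changelog` turns it off. A minimal, self-contained sketch of how such a `store_true` flag behaves; only the flag name comes from the diff, and the `argv` parameter is an assumption added here so the parser can be exercised without a command line:

```python
import argparse


def parse_args(argv = None) -> argparse.Namespace:
    """Parse only the changelog flag; the real script defines more options."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-changelog", action = "store_true",
        help = "Skip uploading CHANGELOG.md (uploaded by default).",
    )
    return parser.parse_args(argv)


default_args = parse_args([])
skip_args = parse_args(["--skip-changelog"])

# store_true flags default to False, so the upload branch runs unless
# the flag is passed explicitly.
print(not default_args.skip_changelog)
print(not skip_args.skip_changelog)
```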

src/openpois/conflation/change_detection.py

Lines changed: 75 additions & 26 deletions
@@ -647,42 +647,91 @@ def apply_shadow_match(
         new_conf_lower[target_global] = np.nan
         new_conf_upper[target_global] = np.nan
 
-    # -- Stitch into output --------------------------------------------
-    out = conflated.copy()
-    out["conf_mean"] = new_conf_mean
-    out["conf_lower"] = new_conf_lower
-    out["conf_upper"] = new_conf_upper
-    out["shadow_matched"] = shadow_matched
-    out["shadow_ghost_id"] = shadow_ghost_id
-    out["shadow_event_type"] = shadow_event_type
-    out["shadow_event_timestamp"] = shadow_event_timestamp.values
-    out["shadow_score"] = shadow_score
-    out["shadow_distance_m"] = shadow_distance_m
-    out["original_conf_mean"] = original_conf_mean
+    # -- Summary scalars are derived now, while pre- and post-penalty
+    # conf_mean views are both still alive. Lengths are captured here
+    # so we can free the heavy intermediates before the parquet write.
+    mean_penalty_factor = (
+        float(
+            (new_conf_mean[shadow_matched]
+             / np.where(
+                 original_conf_mean[shadow_matched] == 0, 1,
+                 original_conf_mean[shadow_matched],
+             )
+            ).mean()
+        )
+        if shadow_matched.any() else float("nan")
+    )
+    n_ghosts_in = int(len(ghosts))
+    n_shadow_matches = int(len(matches))
+
+    # On nationwide data the original "copy then write" path peaked
+    # past the 24 GB WSL cap (≈18M-row shapely-geometry GDF + a full
+    # .copy() + the pyarrow Table materialized inside to_parquet +
+    # the 9M-row unmatched_ov subset). Free the scratch state before
+    # the write peak.
+    del ghosts, matches
+    if "unmatched_ov" in locals():
+        del unmatched_ov # noqa: F821 -- only bound in the else branch
+    gc.collect()
+
+    # Mutate conflated in place rather than allocating a full copy:
+    # the un-penalized baseline isn't needed after this point and the
+    # audit data is held in standalone numpy arrays.
+    conflated["conf_mean"] = new_conf_mean
+    conflated["conf_lower"] = new_conf_lower
+    conflated["conf_upper"] = new_conf_upper
+    conflated["shadow_matched"] = shadow_matched
+    conflated["shadow_ghost_id"] = shadow_ghost_id
+    conflated["shadow_event_type"] = shadow_event_type
+    conflated["shadow_event_timestamp"] = shadow_event_timestamp.values
+    conflated["shadow_score"] = shadow_score
+    conflated["shadow_distance_m"] = shadow_distance_m
+    conflated["original_conf_mean"] = original_conf_mean
+
+    # Pandas copied each array on assignment, so the standalone
+    # references are now redundant -- drop them before the write.
+    del (
+        new_conf_mean, new_conf_lower, new_conf_upper,
+        shadow_matched, shadow_ghost_id, shadow_event_type,
+        shadow_event_timestamp, shadow_score, shadow_distance_m,
+        original_conf_mean,
+    )
+    gc.collect()
 
     output_path = Path(output_path)
     output_path.parent.mkdir(parents = True, exist_ok = True)
     if verbose:
         print(f"Writing {output_path} ...")
-    out.to_parquet(output_path, compression = "zstd")
+
+    # Stream the write in row-group chunks. The default
+    # GeoDataFrame.to_parquet materializes a full pyarrow Table
+    # alongside the live GDF, doubling peak memory; on 18M-row
+    # nationwide inputs that exceeds the 24 GB WSL cap.
+    from geopandas.io.arrow import _geopandas_to_arrow
+
+    chunk_rows = 2_000_000
+    sample_tbl = _geopandas_to_arrow(conflated.iloc[:1])
+    schema = sample_tbl.schema
+    del sample_tbl
+    with pq.ParquetWriter(
+        str(output_path), schema, compression = "zstd",
+    ) as writer:
+        for start in range(0, n, chunk_rows):
+            end = min(start + chunk_rows, n)
+            chunk_tbl = _geopandas_to_arrow(
+                conflated.iloc[start:end]
+            )
+            writer.write_table(chunk_tbl)
+            del chunk_tbl
+            gc.collect()
 
     summary = {
         "n_total": int(n),
         "n_unmatched_overture": int(len(ov_global_idx)),
-        "n_ghosts": int(len(ghosts)),
-        "n_shadow_matches": int(len(matches)),
+        "n_ghosts": n_ghosts_in,
+        "n_shadow_matches": n_shadow_matches,
         "n_survivor_dropped": int(n_survivor_dropped),
-        "mean_penalty_factor": (
-            float(
-                (new_conf_mean[shadow_matched]
-                 / np.where(
-                     original_conf_mean[shadow_matched] == 0, 1,
-                     original_conf_mean[shadow_matched],
-                 )
-                ).mean()
-            )
-            if shadow_matched.any() else float("nan")
-        ),
+        "mean_penalty_factor": mean_penalty_factor,
     }
 
     # Confirm read-back schema integrity.
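
The streamed write walks the frame in fixed-size row ranges, so only one chunk's Arrow table is alive at a time. A dependency-free sketch of just the range arithmetic used by that loop; `iter_row_chunks` is a hypothetical name, and the 18,000,000 figure is a round stand-in for the "≈18M-row" input mentioned in the code comments:

```python
from typing import Iterator, Tuple


def iter_row_chunks(n: int, chunk_rows: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges that cover range(n) exactly once."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, n, chunk_rows):
        # The last chunk is clipped so it never runs past n.
        yield start, min(start + chunk_rows, n)


small = list(iter_row_chunks(5_000_000, 2_000_000))
nationwide = list(iter_row_chunks(18_000_000, 2_000_000))
print(small)
print(len(nationwide))
```

In the diff, each `(start, end)` pair selects `conflated.iloc[start:end]`, which is converted to an Arrow table and passed to `writer.write_table` before being freed.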
