Commit f86b99d

Update changelog as a standard step in the monthly update cycle.
1 parent 4e2b7e6 commit f86b99d

4 files changed

Lines changed: 122 additions & 4 deletions

.claude/skills/conflate-snapshots/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ upload for web consumption.
 
 - Rated OSM snapshot (`osm_snapshot_rated.parquet`) at `versions.snapshot_osm` — produced by [skills/full-data-pull](../full-data-pull/SKILL.md) step 3.
 - Overture snapshot (`overture_snapshot.parquet`) at `versions.snapshot_overture`.
-- OSM history parquets (`osm_versions.parquet`, `osm_changes.parquet`) at `versions.osm_data` produced by [skills/model-history-pipeline](../model-history-pipeline/SKILL.md). Required by the change-detection step in stage 4.
+- OSM history parquets (`osm_versions.parquet`, `osm_changes.parquet`) at `versions.osm_data` **regenerated each month** by [skills/full-data-pull](../full-data-pull/SKILL.md) step 2 (via `scripts/osm_data/download_history.py`). The full re-fit pipeline at [skills/model-history-pipeline](../model-history-pipeline/SKILL.md) is only invoked when re-fitting λ. Required by the change-detection step in stage 4.
 - **Fresh Source Cooperative temp credentials** in `.env.json` at the repo root. Tokens expire in ~1 hour.
 
 > ⚠️ **Credential refresh check.** Source Cooperative uses short-lived AWS

.claude/skills/full-data-pull/SKILL.md

Lines changed: 11 additions & 3 deletions
@@ -5,14 +5,15 @@ description: Use when the user wants to refresh the independent POI snapshots (O
 
 # Full data pull
 
-Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR, VI, GU, MP, AS) and applies the rating model to OSM so conflation can run.
+Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR, VI, GU, MP, AS), refreshes the OSM history that drives ghost reconstruction, and applies the rating model to OSM so conflation can run.
 
 ## Prerequisites
 
 - conda env `openpois` active.
 - For OSM: `osmium` in env bin (resolved automatically via `Path(sys.executable).parent / "osmium"`).
 - Boundary cache at `directories.boundary` (auto-downloads on first use).
 - A fitted model exists for the OSM rating step (see [skills/model-history-pipeline](../model-history-pipeline/SKILL.md)).
+- For OSM history: a fresh Geofabrik OAuth cookie file at `download.osm.history_cookie_file` (Netscape format; any OSM account works). See [docs/data-sources.md](../../docs/data-sources.md#osm-history-geofabrik-full-history-pbfs).
 
 ## Steps
 
@@ -21,23 +22,30 @@ Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR,
 versions:
   snapshot_osm: "YYYYMMDD"
   snapshot_overture: "YYYYMMDD"
+  osm_data: "YYYYMMDD"   # bumps each month — history is refreshed for ghosts
+  ghost_osm: "YYYYMMDD"  # pinned to osm_data; bumps in lockstep
 ```
-See [docs/data-versioning.md](../../docs/data-versioning.md).
+`model_output` does **not** bump unless you're re-fitting λ from scratch (see [skills/model-history-pipeline](../model-history-pipeline/SKILL.md)). See [docs/data-versioning.md](../../docs/data-versioning.md).
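
Reviewer sketch (not part of the commit): the pinning rule in this step, where `ghost_osm` tracks `osm_data` and the date-style keys use `YYYYMMDD`, could be checked mechanically before a run. The example assumes the `versions` block has already been loaded into a plain dict; `check_version_pins` is a hypothetical helper name, not a function in the repo.

```python
from datetime import datetime


def check_version_pins(versions: dict) -> list:
    """Return a list of human-readable problems; an empty list means the pins look consistent."""
    problems = []
    # Date-style keys named in this skill; each should parse as YYYYMMDD.
    for key in ("snapshot_osm", "snapshot_overture", "osm_data", "ghost_osm"):
        value = versions.get(key)
        try:
            datetime.strptime(str(value), "%Y%m%d")
        except ValueError:
            problems.append(f"versions.{key}={value!r} is not a YYYYMMDD date")
    # ghost_osm is pinned to osm_data, so the two must bump in lockstep.
    if versions.get("ghost_osm") != versions.get("osm_data"):
        problems.append("versions.ghost_osm must equal versions.osm_data")
    return problems
```

An unedited template (all values still the literal `"YYYYMMDD"`) fails the date check for each key, which makes a forgotten bump easy to spot.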
 
 2. **Run the downloads** (independent — order doesn't matter, can run in parallel):
 
 ```bash
 python scripts/osm_snapshot/download.py      # 4 Geofabrik PBFs → osm_snapshot.parquet
 python scripts/overture/download.py          # DuckDB over S3 → overture_snapshot.parquet
+python scripts/osm_data/download_history.py  # 4 internal OSH PBFs → osm_versions.parquet + osm_changes.parquet
 ```
-The snapshot loader pulls 4 extracts in sequence: `us`, `pr`, `usvi`, `american_oceania`. Per-source details, auth, and schema quirks are in [docs/data-sources.md](../../docs/data-sources.md).
+Each loader pulls 4 extracts in sequence: `us`, `pr`, `usvi`, `american_oceania`. Per-source details, auth, and schema quirks are in [docs/data-sources.md](../../docs/data-sources.md).
 
 **Gotcha — interrupted snapshot runs**: all 4 extracts share `~/data/openpois/snapshots/osm/<v>/parse_chunks/`. If a run dies between extracts, leftover chunks from extract N may be silently mistaken for extract N+1's parsed output on resume (the parser short-circuits on existing chunks). Before resuming an interrupted snapshot run, nuke the work dir:
 ```bash
 rm -rf ~/data/openpois/snapshots/osm/{version}/parse_chunks/
 ```
 This forces a clean re-parse of whichever extract was in flight; completed extracts (which write their own per-extract intermediate parquet next to the final output) are still skipped.
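
Reviewer sketch (not part of the commit): the same cleanup can be expressed in Python, which may be convenient if the resume logic ever needs to call it directly. This is a minimal example assuming the directory layout quoted in the gotcha (`<snapshot root>/<version>/parse_chunks/`); `reset_parse_chunks` is a hypothetical name and nothing like it is shown in the diff.

```python
import shutil
from pathlib import Path


def reset_parse_chunks(snapshot_root, version: str) -> bool:
    """Delete <snapshot_root>/<version>/parse_chunks if present; return True if it was removed."""
    chunk_dir = Path(snapshot_root) / version / "parse_chunks"
    if not chunk_dir.is_dir():
        return False  # nothing to clean, so a resume is already safe
    # Remove the shared work dir only; per-extract parquets next to the final output are untouched.
    shutil.rmtree(chunk_dir)
    return True
```

Called as `reset_parse_chunks(Path.home() / "data/openpois/snapshots/osm", "20260521")`, it would mirror the `rm -rf` command for that version.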
 
+**Gotcha — `download_history.py` is for ghost regeneration only**: do **not** re-run `scripts/osm_data/format_tabular.py` or `scripts/models/osm_turnover.py` in the monthly cycle — those are part of the model-fit pipeline, which stays pinned to `versions.model_output`. The monthly history refresh only feeds `build_ghosts.py` (invoked by `make conflate`).
+
+**Gotcha — per-territory 404 tolerance**: if Geofabrik stops publishing a territory's `*-internal.osh.pbf`, the loader logs a warning, skips that extract, and continues. The territory's POIs still flow through downstream stages but the rater falls back to the global-mean δ for its `shared_label`s.
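
Reviewer sketch (not part of the commit): the 404-tolerance behaviour described above can be illustrated with a small loop. The actual loader is not shown in this diff, so the function below is an assumption about its shape: `download_available` and the injected `fetch` callable are hypothetical, and only the extract names come from the skill text.

```python
import logging
from urllib.error import HTTPError

logger = logging.getLogger(__name__)

# Extract names as listed in step 2 of the skill.
EXTRACTS = ("us", "pr", "usvi", "american_oceania")


def download_available(extracts, fetch):
    """Call fetch(name) for each extract, skipping any that answer HTTP 404."""
    completed, skipped = [], []
    for name in extracts:
        try:
            fetch(name)
        except HTTPError as err:
            if err.code != 404:
                raise  # anything other than "not published" is still fatal
            logger.warning("extract %s returned 404; skipping", name)
            skipped.append(name)
        else:
            completed.append(name)
    return completed, skipped
```

Returning the skipped names gives downstream stages a way to record which territories fell back to the global-mean δ.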
+
 3. **Apply the rating model to OSM** → `osm_snapshot_rated.parquet`:
 ```bash
 python scripts/osm_snapshot/apply_model.py

CHANGELOG.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Changelog
+
+## 2026-05-21-v0
+
+### Snapshot inputs
+
+| Source                 | Value                                       |
+| ---------------------- | ------------------------------------------- |
+| OSM snapshot date      | 2026-05-21                                  |
+| Overture release       | `2026-05-20.0` (pinned)                     |
+| OSM snapshot rows      | 8,799,633                                   |
+| Overture snapshot rows | 13,458,763                                  |
+| Boundary footprint     | US + all territories (PR, USVI, GU, MP, AS) |
+
+### Conflated output
+
+| Metric                       | This run    | Prior        | Δ                       |
+| ---------------------------- | ----------- | ------------ | ----------------------- |
+| Total rows                   | 17,989,377  | 17,788,585   | +200,792 (+1.13%)       |
+| Matched OSM × Overture       | 2,696,484   | 2,677,091    | +19,393 (+0.72%)        |
+| OSM-only                     | 6,103,149   | 6,031,413    | +71,736 (+1.19%)        |
+| Overture-only                | 9,189,744   | 9,080,081    | +109,663 (+1.21%)       |
+| Shadow-matched (CD penalty)  | 47,925      | n/a          | new — first run with change detection |
+| Shared labels                | 93          | 93           | unchanged set           |
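
Reviewer sketch (not part of the commit): the figures in the "Conflated output" table can be cross-checked with a few lines of arithmetic. The counts below are copied from the table; the variable and function names are illustrative only. The check confirms that the three disjoint categories sum to the reported totals, and that matched plus OSM-only rows equal the OSM snapshot row count from the "Snapshot inputs" table.

```python
# Counts copied from the "Conflated output" table: (this run, prior).
rows = {
    "matched": (2_696_484, 2_677_091),
    "osm_only": (6_103_149, 6_031_413),
    "overture_only": (9_189_744, 9_080_081),
}
total = (17_989_377, 17_788_585)
osm_snapshot_rows = 8_799_633  # from the "Snapshot inputs" table


def delta(current: int, prior: int):
    """Return (absolute change, percent change rounded to 2 decimals)."""
    change = current - prior
    return change, round(100 * change / prior, 2)


# The three disjoint categories should add up to the reported totals.
assert sum(cur for cur, _ in rows.values()) == total[0]
assert sum(prev for _, prev in rows.values()) == total[1]
# Every OSM row ends up either matched or OSM-only.
assert rows["matched"][0] + rows["osm_only"][0] == osm_snapshot_rows

for name, (cur, prev) in rows.items():
    print(name, delta(cur, prev))
print("total", delta(*total))
```

A check like this could run as part of changelog generation so that a transcription error in any cell fails loudly.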
+
+### Methods changes vs. prior release
+
+- **Change detection (new).** Post-conflation pass that reconstructs "ghost" POIs from OSM history (deleted or renamed nodes) and uses them to penalize unmatched Overture POIs that shadow-match a ghost. Penalty multiplies the Overture row's `conf_mean` by the per-`shared_label` δ from the fitted turnover model. 47,925 rows penalized this run. Adds audit columns to every conflated row: `shadow_matched`, `shadow_ghost_id`, `shadow_event_type`, `shadow_event_timestamp`, `shadow_score`, `shadow_distance_m`, `original_conf_mean`. **PR #29**; design in `docs/change-detection.md`.
+- **US territory expansion.** Spatial footprint widened from CONUS + PR to include all US territories (PR, USVI, GU, MP, AS). Affects both snapshots and the conflation domain. **PR #31**.
+- **Wider metadata propagation.** Additional OSM and Overture metadata fields now flow through to the conflated parquet (website, wikidata, wikipedia, etc.). **PR #30**.
+- **PMTiles re-tuned.** Zoom range narrowed to Z10–Z14 with `--drop-densest-as-needed`, so feature drops cascade through low zooms instead of failing tile builds. Site updated with zoom-aware point styling. **PR #33**.
+- **Covering bbox in partitioned parquet.** GeoParquet 1.1 `bbox` struct column emitted via `write_covering_bbox=True`, enabling DuckDB row-group pruning on viewport queries. **PR #32**.
+- **Overture release pinned.** `download.overture.release_date` set to `2026-05-20.0` (was `null` = auto-detect latest). Future runs against the same pin are deterministic.
+- **Pipeline memory hardening (uncommitted on `lifecycle/may-2026-release`).** Both `apply_change_detection.py` and the partitioned-write helper hit the 24 GB WSL cap on nationwide inputs. The CD writer now mutates in place and streams the output parquet in row-group chunks via `pyarrow.parquet.ParquetWriter`; the geohash partition writer drops one full-partition copy (numpy `argsort` + `iloc` instead of pandas `sort_values`) and streams large partitions in chunks. See `src/openpois/conflation/change_detection.py` and `src/openpois/io/geohash_partition.py`.
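
Reviewer sketch (not part of the commit): the change-detection penalty in the first bullet reduces to one multiplication per shadow-matched row. The example below works on plain dicts rather than the project's dataframes, and the δ values in the usage note are made up for illustration. `apply_shadow_penalty` is a hypothetical name; the real logic lives in `src/openpois/conflation/change_detection.py`, which this diff does not show.

```python
def apply_shadow_penalty(row: dict, delta_by_label: dict) -> dict:
    """Return a copy of row with conf_mean scaled by its label's δ if it was shadow-matched."""
    out = dict(row)
    # Keep the pre-penalty value for auditing; the changelog lists this column on every row.
    out["original_conf_mean"] = row["conf_mean"]
    if row.get("shadow_matched"):
        out["conf_mean"] = row["conf_mean"] * delta_by_label[row["shared_label"]]
    return out
```

For instance, with an assumed δ of 0.5 for `"Restaurant"`, a shadow-matched row with `conf_mean` 0.8 comes back with `conf_mean` 0.4 and `original_conf_mean` 0.8, while an unmatched row keeps its score.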
+
+### Taxonomy changes
+
+**Overture crosswalk** (`src/openpois/conflation/data/taxonomy_crosswalk_overture_maps.csv`, uncommitted on `lifecycle/may-2026-release`): 7 new entries under `services_and_business.family_service`, previously unmapped and dropped from the partitioned output.
+
+| Overture sub-category         | Maps to            |
+| ----------------------------- | ------------------ |
+| `funeral_service`             | Other Professional |
+| `adoption_service`            | Other Professional |
+| `family_service_center`       | Other Professional |
+| `nanny_service`               | Other Professional |
+| `genealogist`                 | Other Professional |
+| `elder_care_planning`         | Other Professional |
+| `mobility_equipment_service`  | Other Shop         |
+
+This is the proximate cause of the +22,715 row jump (+8.45%) in **Other Professional**.
+
+No OSM-side taxonomy changes since 2026-04-23.
+
+### Top label-level row-count changes
+
+| Shared label        | This run    | Prior       | Δ rows    | Δ %     | Δ matched |
+| ------------------- | ----------- | ----------- | --------- | ------- | --------- |
+| Specialty Store     | 1,026,395   | 917,422     | +108,973  | +11.88% | +753      |
+| Other Amenity       | 3,858,315   | 3,819,068   | +39,247   | +1.03%  | +3,124    |
+| Clothing Store      | 317,177     | 288,506     | +28,671   | +9.94%  | +779      |
+| Other Professional  | 291,500     | 268,785     | +22,715   | +8.45%  | 0         |
+| Other Healthcare    | 995,881     | 1,016,112   | −20,231   | −1.99%  | +54       |
+| (unlabeled)         | 701,209     | 719,862     | −18,653   | −2.59%  | +1,506    |
+| Car Dealer          | 182,314     | 164,517     | +17,797   | +10.82% | +521      |
+| Restaurant          | 718,472     | 702,020     | +16,452   | +2.34%  | +1,092    |
+| Supermarket         | 193,777     | 179,783     | +13,994   | +7.78%  | +361      |
+| Recreation          | 1,302,776   | 1,293,338   | +9,438    | +0.73%  | −510      |
+
+Drivers:
+- Most positive movers (Specialty Store, Clothing Store, Car Dealer, Supermarket, Bakery, Charging Station) track Overture's snapshot growth (+2.5% overall) landing in shared labels with moderate base counts.
+- **Other Professional** also reflects the new `family_service` crosswalk entries above.
+- **Other Healthcare** dropping by ~20k against a larger Overture snapshot is worth a closer look — likely an Overture taxonomy reshuffle inside `health_and_medical` upstream. Flagged for QA, not blocking.
+
+### Version pins
+
+| Key                       | This run                 | Prior                    |
+| ------------------------- | ------------------------ | ------------------------ |
+| `versions.conflation`     | 20260521                 | 20260423                 |
+| `versions.snapshot_osm`   | 20260521                 | 20260417                 |
+| `versions.snapshot_overture` | 20260521              | 20260423                 |
+| `versions.osm_data`       | 20260521                 | 20260515                 |
+| `versions.ghost_osm`      | 20260521                 | 20260515                 |
+| `versions.model_output`   | 20260422_by_shared_label | 20260422_by_shared_label (unchanged — model not refit this cycle) |

scripts/publish/upload_to_source_coop.py

Lines changed: 27 additions & 0 deletions
@@ -79,6 +79,14 @@ def parse_args() -> argparse.Namespace:
             "default because these rarely change."
         ),
     )
+    parser.add_argument(
+        "--skip-changelog", action = "store_true",
+        help = (
+            "Skip uploading CHANGELOG.md to the repo top level. Default "
+            "is to upload it on every run so the public copy stays in "
+            "sync with the latest per-release deltas."
+        ),
+    )
     parser.add_argument(
         "--skip-latest-mirror", action = "store_true",
         help = (
@@ -204,6 +212,25 @@ def main() -> None:
             f"deleted {summary['deleted']} stale object(s)."
         )
 
+    # -------------------------------------------------------------------------
+    # Top-level CHANGELOG.md (per-release deltas, updated every run by default)
+    # -------------------------------------------------------------------------
+    if not args.skip_changelog:
+        changelog_path = CONFIG_PATH.parent / "CHANGELOG.md"
+        if changelog_path.exists():
+            upload_bytes(
+                client = client,
+                data = changelog_path.read_bytes(),
+                bucket = bucket,
+                key = f"{repo_prefix}/CHANGELOG.md",
+                content_type = "text/markdown; charset=utf-8",
+                dry_run = args.dry_run,
+            )
+        else:
+            print(
+                f"Skipping CHANGELOG.md upload — {changelog_path} not found."
+            )
+
     # -------------------------------------------------------------------------
     # Top-level README + LICENSE (opt-in)
     # -------------------------------------------------------------------------
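
Reviewer sketch (not part of the commit): the decision logic added to `main()` is easy to exercise in isolation. The example below re-creates only the `--skip-changelog` and `--dry-run` flags and replaces the upload with a pure function that reports which key would be written. `plan_changelog_upload` is a hypothetical helper, and the `repo_prefix` value used in any call is an assumption; `upload_bytes`, `client`, and `bucket` from the real script are deliberately left out.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse just the two flags relevant to the changelog step."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip-changelog", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(argv)


def plan_changelog_upload(args, changelog_exists: bool, repo_prefix: str):
    """Return the object key that would be uploaded, or None if the step is skipped."""
    # Mirrors the diff: skip when the flag is set or the file is missing.
    if args.skip_changelog or not changelog_exists:
        return None
    return f"{repo_prefix}/CHANGELOG.md"
```

Separating the decision from the side effect this way would let the default-on behaviour be unit-tested without S3 credentials.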
