
Commit 5178a7c

committed
Publish to Source Cooperative instead of personal S3.
1 parent 6bdc9a5 commit 5178a7c

29 files changed

Lines changed: 1499 additions & 484 deletions

.claude/CLAUDE.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -29,7 +29,7 @@ Style: Black (format-on-save in VSCode). Lint: flake8 + pylint, configured in `p
2929
| Fit λ from OSM history, rate current snapshots | [skills/model-history-pipeline](skills/model-history-pipeline/SKILL.md) |
3030
| Iterate model variants on a pinned history run | [skills/iterate-model-types](skills/iterate-model-types/SKILL.md) |
3131
| Refresh the POI snapshots (OSM / Overture) | [skills/full-data-pull](skills/full-data-pull/SKILL.md) |
32-
| Conflate OSM + Overture, partition, upload to S3 | [skills/conflate-snapshots](skills/conflate-snapshots/SKILL.md) |
32+
| Conflate OSM + Overture, partition, publish to Source Cooperative | [skills/conflate-snapshots](skills/conflate-snapshots/SKILL.md) |
3333
| Bump the frontend to the new data version | [skills/update-site](skills/update-site/SKILL.md) |
3434
| Post-run QA on any of the above | [skills/verify-pipeline-run](skills/verify-pipeline-run/SKILL.md) |
3535

.claude/TODO.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,11 @@ Short running list of in-progress / upcoming work. Edit freely; trim older compl
66

77
## Upcoming
88

9+
- [ ] **Auto-capture the three per-version README fields** so the publish step doesn't need `publish.version_metadata` overrides. Added 2026-04-24. Today `build_version_readme` in [src/openpois/publish/build_readme.py](../src/openpois/publish/build_readme.py) falls back to config overrides or best-effort guesses; aim is for the pipeline to write authoritative values alongside the data it produces, and the publish step to just read them.
10+
- *OSM snapshot date*: `scripts/osm_snapshot/download.py` should write a `~/data/openpois/snapshots/osm/<version>/download_metadata.json` containing `{"downloaded_at": "<ISO date>", "pbf_url": "..."}` after the PBF download completes. `_resolve_osm_snapshot_date` then reads that file before falling back to the version string.
11+
- *Overture release*: `scripts/overture/download.py` already resolves a concrete release (pinned or auto-detected) inside `download_overture_snapshot`; currently only the `.parts/<release>/` directory records it and `.parts/` is deleted on success. Surface the resolved release by writing `~/data/openpois/snapshots/overture/<version>/download_metadata.json` with `{"release": "2026-04-15.0", ...}` before the cleanup step. `_resolve_overture_release` reads that file ahead of the `.parts/` heuristic.
12+
- *Turnover-model commit*: `scripts/models/osm_turnover.py` should capture `git rev-parse HEAD` at training time and either (a) extend `config.write_self("model_output")` to include a `git_commit` entry or (b) drop a `git_commit.txt` next to the model artifacts. `_resolve_model_commit` reads that value instead of the publish-time HEAD, which is the right fingerprint if code has changed between training and publishing.
13+
- Publishing behaviour: if any of the three files is missing, keep the current fallback (and print a visible warning) so old pipeline runs still publish cleanly.
914
- [ ] Watch for a DuckDB release that fixes the WSL2 httpfs "Information loss on integer cast" crash (issue #21669, fix PR #21395). Once a tagged release ships with the fix and a full `scripts/overture/download.py` run on WSL2 completes, we can unpin from `duckdb==1.4.1` and revert the per-part download to a single-query DuckDB scan. Added 2026-04-17.
1015
- [ ] Auto-check taxonomy changes whenever we switch to a new Overture Maps version (detect new/removed L0/L1/L2 categories vs. `taxonomy_crosswalk_overture_maps.csv` and flag gaps). Added 2026-04-16.
1116
- [ ] Watch for Overture L0/L1 → flat `basic_category` migration (~June 2026). Crosswalk CSV + `assign_overture_shared_label` will need updating. See [docs/taxonomy-setup.md](docs/taxonomy-setup.md).
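The sidecar-file idea in the first TODO item above could look roughly like the sketch below. The file name and JSON keys come from the TODO text; the helper names `write_download_metadata` and `resolve_osm_snapshot_date`, their signatures, and the exact fallback behaviour are assumptions for illustration, not the repository's actual code.

```python
import json
import warnings
from datetime import date
from pathlib import Path

METADATA_NAME = "download_metadata.json"


def write_download_metadata(version_dir: Path, pbf_url: str, downloaded_at: date) -> Path:
    """Write the sidecar file next to the snapshot (hypothetical helper)."""
    version_dir.mkdir(parents=True, exist_ok=True)
    path = version_dir / METADATA_NAME
    payload = {"downloaded_at": downloaded_at.isoformat(), "pbf_url": pbf_url}
    path.write_text(json.dumps(payload, indent=2))
    return path


def resolve_osm_snapshot_date(version_dir: Path) -> str:
    """Prefer the sidecar file; otherwise warn and guess from the folder name."""
    path = version_dir / METADATA_NAME
    if path.exists():
        return json.loads(path.read_text())["downloaded_at"]
    warnings.warn(f"{path} is missing; deriving the date from the version string")
    version = version_dir.name  # assumed to be YYYYMMDD, e.g. "20260416"
    return f"{version[:4]}-{version[4:6]}-{version[6:8]}"
```

Keeping the fallback plus a visible warning means older pipeline runs that predate the sidecar file would still publish, as the last TODO sub-item asks.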

.claude/docs/data-versioning.md

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -10,11 +10,11 @@ versions:
1010
model_output: "20260416_by_leisure" # fitted model artifacts (suffix indicates variant)
1111
snapshot_osm: "20260416" # OSM current-state snapshot
1212
snapshot_overture: "20260417" # Overture snapshot
13-
aws: "20260416" # S3 prefix for uploaded data
1413
conflation: "20260417" # conflated output
14+
source_coop: "2026-04-17-v0" # Source Cooperative upload folder (see below)
1515
```
1616
17-
Each key corresponds to a `directories.<key>` entry in `config.yaml` with `versioned: true`.
17+
Each key corresponds to a `directories.<key>` entry in `config.yaml` with `versioned: true`, except `source_coop`, which only names the remote folder.
1818

1919
## Path resolution
2020

@@ -36,8 +36,9 @@ config.get_file_path("osm_data", "osm_versions")
3636

3737
## Naming conventions
3838

39-
- **Dates**: `YYYYMMDD`, e.g., `20260416`.
39+
- **Local dates**: `YYYYMMDD`, e.g., `20260416`.
4040
- **Model variants**: `{date}_by_{group_key}` (e.g., `20260416_by_leisure`, `20260416_by_amenity`) or `{date}_constant`. See [skills/iterate-model-types](../skills/iterate-model-types/SKILL.md).
41+
- **Source Coop folder**: `YYYY-MM-DD-v<IDX>`. Default `v0` for every fresh publish; only bump `v1`, `v2`, … if republishing under the same calendar date (e.g. a hot-fix). The Source Coop upload script writes the per-version README into this folder, so the suffix must be unique per upload round.
4142
- **Independent cadences**: snapshot versions can (and should) differ across sources — Overture releases ~monthly. Don't force them to match.
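As an illustration of the `YYYY-MM-DD-v<IDX>` convention in the list above, the following sketch picks the next unused folder name for a given date. The function name `next_source_coop_folder` and the idea of passing in a list of existing remote folder names are assumptions; the real publish script may choose the suffix differently (the docs say it is set by hand in `versions.source_coop`).

```python
import re
from datetime import date
from typing import Iterable

# Matches names such as "2026-04-17-v0"; group 1 is the date, group 2 the index.
_FOLDER_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-v(\d+)$")


def next_source_coop_folder(day: date, existing: Iterable[str]) -> str:
    """Return `<day>-v0`, or the next free index if `day` was already published."""
    prefix = day.isoformat()
    used = []
    for name in existing:
        match = _FOLDER_RE.match(name)
        if match and match.group(1) == prefix:
            used.append(int(match.group(2)))
    index = max(used) + 1 if used else 0
    return f"{prefix}-v{index}"
```

For a fresh publish date this yields a `v0` folder, and a same-day hot-fix gets `v1`, which keeps each upload round's README in its own folder.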
4243

4344
## External references (hand-update when bumping)
@@ -46,16 +47,15 @@ Version strings appear in these places outside `versions:` — grep before any c
4647

4748
| File | References |
4849
|---|---|
49-
| [config.yaml](../../config.yaml) | `upload.latest_url_osm`, `upload.latest_url_conflation` (full URL with date) |
50-
| [site/src/constants.js](../../site/src/constants.js) | `OSM_S3_BASE`, `CONFLATED_S3_BASE` |
51-
| [site/public/about.html](../../site/public/about.html) | Hardcoded S3 browse links in the data-access section |
50+
| [site/src/constants.js](../../site/src/constants.js) | `OSM_PMTILES_URL`, `CONFLATED_PMTILES_URL` (full `data.source.coop` URLs) |
51+
| [site/public/about.html](../../site/public/about.html) | Hardcoded Source Coop browse links in the data-access section |
5252
| `osm_data.apply_model.model_stub` (config.yaml) | Which model family [scripts/osm_snapshot/apply_model.py](../../scripts/osm_snapshot/apply_model.py) ingests |
5353

54-
[skills/update-site](../skills/update-site/SKILL.md) covers the frontend side; [skills/conflate-snapshots](../skills/conflate-snapshots/SKILL.md) covers the upload + config side.
54+
[skills/update-site](../skills/update-site/SKILL.md) covers the frontend side; [skills/conflate-snapshots](../skills/conflate-snapshots/SKILL.md) covers the publish + config side.
5555

5656
## Workflow
5757

58-
1. Bump the relevant `versions.*` keys before running a pipeline.
58+
1. Bump the relevant `versions.*` keys before running a pipeline. For a public release, also bump `versions.source_coop` to the new `YYYY-MM-DD-v0`.
5959
2. Run the pipeline — outputs land in the versioned directory.
60-
3. After upload, update `upload.latest_url_*` and the frontend references.
61-
4. Old versions stay on disk / S3 — delete manually when confident nothing references them.
60+
3. After publishing, update the frontend references in `site/src/constants.js` and `site/public/about.html`.
61+
4. Old local versions stay on disk — delete manually when confident nothing references them. Old Source Coop folders stay published indefinitely and serve as an immutable archive.

.claude/docs/partitioning-strategy.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -96,7 +96,7 @@ GROUP BY 1, 2;
9696

9797
## When NOT to use this layout
9898

99-
The geohash-partitioned layout is a better fit for **small-bbox, many-types-at-once** queries — which is exactly the web-map viewport case we moved away from. If the S3 / map-viewport path comes back, the helpers are still in place: see `add_geohash_columns` and `write_partitioned_dataset` in [src/openpois/io/geohash_partition.py](../../src/openpois/io/geohash_partition.py), and the original S3 upload step in [scripts/conflation/upload_to_s3.py](../../scripts/conflation/upload_to_s3.py). Swap the function calls in the two `format_for_upload.py` scripts back to the geohash variants.
99+
The geohash-partitioned layout is a better fit for **small-bbox, many-types-at-once** queries — which is exactly the web-map viewport case we moved away from. If the map-viewport path comes back, the helpers are still in place: see `add_geohash_columns` and `write_partitioned_dataset` in [src/openpois/io/geohash_partition.py](../../src/openpois/io/geohash_partition.py), and the Source Cooperative publish step in [scripts/publish/upload_to_source_coop.py](../../scripts/publish/upload_to_source_coop.py). Swap the function calls in the two `format_for_upload.py` scripts back to the geohash variants.
100100

101101
## Maintenance
102102

@@ -107,7 +107,7 @@ python -u scripts/osm_snapshot/format_for_upload.py 2>&1 | tee ~/data/openpois
107107
python -u scripts/conflation/format_for_upload.py 2>&1 | tee ~/data/openpois/logs/conflated_repartition_<version>.log
108108
```
109109

110-
Each script deletes the existing partitioned directory at its versioned path and rewrites it. Geohash precision is controlled by `upload.geohash_precision_sort` in [config.yaml](../../config.yaml) (currently 6 ≈ 0.6 × 1.2 km).
110+
Each script deletes the existing partitioned directory at its versioned path and rewrites it. Geohash precision is controlled by `publish.geohash_precision_sort` in [config.yaml](../../config.yaml) (currently 6 ≈ 0.6 × 1.2 km).
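The "6 ≈ 0.6 × 1.2 km" figure can be checked with a short calculation. This is a sketch under a spherical-Earth assumption (about 111.32 km per degree); the function name `geohash_cell_size_km` is hypothetical and not part of the repository.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree on a spherical Earth


def geohash_cell_size_km(precision: int, latitude: float = 0.0) -> tuple:
    """Return (height_km, width_km) of a geohash cell at the given latitude."""
    total_bits = 5 * precision           # each geohash character encodes 5 bits
    lon_bits = (total_bits + 1) // 2     # bits alternate, starting with longitude
    lat_bits = total_bits // 2
    lat_span_deg = 180.0 / 2 ** lat_bits
    lon_span_deg = 360.0 / 2 ** lon_bits
    height_km = lat_span_deg * KM_PER_DEGREE
    # East-west distance per degree shrinks with the cosine of the latitude.
    width_km = lon_span_deg * KM_PER_DEGREE * math.cos(math.radians(latitude))
    return height_km, width_km


height, width = geohash_cell_size_km(6)
print(f"precision 6 at the equator: {height:.2f} km x {width:.2f} km")
```

At the equator this gives roughly 0.61 km north-south by 1.22 km east-west; cells get narrower east-west at higher latitudes.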
111111

112112
**Where the code lives:**
113113

@@ -116,4 +116,4 @@ Each script deletes the existing partitioned directory at its versioned path and
116116
- [scripts/osm_snapshot/format_for_upload.py](../../scripts/osm_snapshot/format_for_upload.py) — OSM partitioning entry point.
117117
- [tests/test_geohash_partition.py](../../tests/test_geohash_partition.py) — unit tests + a DuckDB Hive-decode round-trip.
118118

119-
**S3 upload is currently disabled**: `scripts/conflation/upload_to_s3.py` is not run as part of this flow. The `upload.latest_url_*` / `upload.s3_*` keys in `config.yaml` are stale but harmless; clean them up in a later pass if the frontend integration is formally retired.
119+
The Source Cooperative publish flow ([scripts/publish/upload_to_source_coop.py](../../scripts/publish/upload_to_source_coop.py)) uploads these same partitioned trees to `<version>/osm-parquet/` and `<version>/conflated-parquet/`. PMTiles generation remains downstream of partitioning.

.claude/skills/conflate-snapshots/SKILL.md

Lines changed: 37 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,21 +1,33 @@
11
---
22
name: conflate-snapshots
3-
description: Use when the user wants to match rated OSM POIs with Overture POIs into a unified dataset, partition it for web consumption, and push to S3. Triggers: "run conflation", "push new conflated data to S3", "bump conflation version", "reconflate with new parameters", "re-upload the partitioned parquet".
3+
description: Use when the user wants to match rated OSM POIs with Overture POIs into a unified dataset, partition it for web consumption, and push to Source Cooperative. Triggers: "run conflation", "publish new data", "push new conflated data to Source Cooperative", "bump conflation version", "reconflate with new parameters", "re-upload the partitioned parquet".
44
---
55

6-
# Conflate snapshots + publish to S3
6+
# Conflate snapshots + publish to Source Cooperative
77

8-
Taxonomy-aware matching between rated OSM and Overture, then partition and upload for web consumption.
8+
Taxonomy-aware matching between rated OSM and Overture, then partition and
9+
upload for web consumption.
910

1011
## Prerequisites
1112

1213
- Rated OSM snapshot (`osm_snapshot_rated.parquet`) at `versions.snapshot_osm` — produced by [skills/full-data-pull](../full-data-pull/SKILL.md) step 3.
1314
- Overture snapshot (`overture_snapshot.parquet`) at `versions.snapshot_overture`.
14-
- AWS credentials configured for the `openpois-public` bucket (region `us-west-2`).
15+
- **Fresh Source Cooperative temp credentials** in `.env.json` at the repo root. Tokens expire in ~1 hour.
16+
17+
> ⚠️ **Credential refresh check.** Source Cooperative uses short-lived AWS
18+
> credentials (`aws_access_key_id` starting with `ASIA…`). **Before** running
19+
> step 7, ask the user to regenerate them at
20+
> <https://source.coop/repositories/henryspatialanalysis/openpois/manage>
21+
> and overwrite `~/repos/openpois/.env.json`. The upload script will warn if
22+
> the file looks stale, but it cannot tell whether the token itself has
23+
> expired until it actually fails.
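A staleness check of the kind the warning describes might look like the sketch below. It assumes `.env.json` is a flat JSON object with an `aws_access_key_id` key and uses the file's modification time as a proxy for token age; the function name `credential_warnings` and the one-hour threshold are illustrative, and the actual upload script may check something different.

```python
import json
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_SECONDS = 60 * 60  # tokens are described as expiring in about one hour


def credential_warnings(env_path: Path, now: Optional[float] = None) -> List[str]:
    """Return human-readable warnings; an empty list means nothing looks stale."""
    if not env_path.exists():
        return [f"{env_path} not found"]
    problems = []
    current = time.time() if now is None else now
    age = current - env_path.stat().st_mtime
    if age > MAX_AGE_SECONDS:
        problems.append(f"{env_path.name} was written {age / 60:.0f} min ago")
    key_id = json.loads(env_path.read_text()).get("aws_access_key_id", "")
    if not key_id.startswith("ASIA"):
        problems.append("aws_access_key_id does not look like a temporary key (no ASIA prefix)")
    return problems
```

As the note above says, a check like this can only flag a file that looks old; a token that was already expired when the file was saved will only surface when the upload itself fails.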
1524
1625
## Steps
1726

18-
1. **Bump `versions.conflation` and `versions.aws`** in `config.yaml`. These typically track together since the upload uses the conflation output.
27+
1. **Bump `versions.conflation` and `versions.source_coop`** in `config.yaml`.
28+
`versions.source_coop` is the remote folder name — `YYYY-MM-DD-vN`. Keep
29+
`vN` at `v0`; only bump `v1`, `v2`, … if you re-upload under the same
30+
calendar date.
1931

2032
2. **Review conflation parameters** (`config.yaml` → `conflation`):
2133
- `min_match_score` (default 0.50) — raises/lowers match acceptance
@@ -52,33 +64,34 @@ Taxonomy-aware matching between rated OSM and Overture, then partition and uploa
5264
python -u scripts/conflation/prepare_pmtiles.py \
5365
2>&1 | tee ~/data/openpois/logs/pmtiles_conflated_<version>.log
5466
```
55-
Properties and zoom range are configured under `upload.pmtiles` in
67+
Properties and zoom range are configured under `publish.pmtiles` in
5668
`config.yaml`.
5769

58-
7. **Upload to S3** — pushes partitioned parquet AND the matching `.pmtiles`
59-
(single file at `…/<version>/<name>.pmtiles`) under `versions.aws`.
60-
```bash
61-
python scripts/osm_snapshot/upload_to_s3.py # OSM parquet + pmtiles
62-
python scripts/conflation/upload_to_s3.py # conflated parquet + pmtiles
63-
```
64-
To upload only the PMTiles (e.g., after regenerating tiles without touching
65-
the parquet), use:
70+
7. **Publish to Source Cooperative** — uploads OSM + conflated parquet,
71+
both PMTiles, and a freshly-rendered per-version `README.md` under
72+
`<repo>/<versions.source_coop>/`. Confirm the credential check above first.
6673
```bash
67-
python scripts/osm_snapshot/upload_pmtiles_to_s3.py [--s3-version YYYYMMDD]
68-
python scripts/conflation/upload_pmtiles_to_s3.py [--s3-version YYYYMMDD]
69-
```
74+
# Preview everything that would be uploaded:
75+
python scripts/publish/upload_to_source_coop.py --dry-run
76+
77+
# Real upload (datasets + version README):
78+
python -u scripts/publish/upload_to_source_coop.py \
79+
2>&1 | tee ~/data/openpois/logs/publish_<version>.log
7080

71-
8. **Update latest-URL pointers** in `config.yaml`:
72-
```yaml
73-
upload:
74-
latest_url_osm: "https://openpois-public.s3.us-west-2.amazonaws.com/snapshots/osm/YYYYMMDD/osm_snapshot_partitioned/"
75-
latest_url_conflation: "https://openpois-public.s3.us-west-2.amazonaws.com/snapshots/conflated/YYYYMMDD/conflated_partitioned/"
81+
# If the top-level README or LICENSE changed:
82+
python scripts/publish/upload_to_source_coop.py --update-top-level
7683
```
84+
`--skip-osm-parquet`, `--skip-conflated-parquet`, and `--skip-pmtiles`
85+
allow partial reuploads (e.g. after regenerating PMTiles alone).
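To make the effect of the partial-reupload flags in step 7 concrete, here is a minimal `argparse` sketch. The flag names are the ones listed above; treating them as simple boolean switches, the helper names, and the component labels returned by `planned_uploads` (the `pmtiles` label in particular) are assumptions, not the real script's internals.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in step 7 (behaviour is assumed)."""
    parser = argparse.ArgumentParser(description="Publish sketch")
    parser.add_argument("--dry-run", action="store_true",
                        help="list what would be uploaded without uploading")
    parser.add_argument("--skip-osm-parquet", action="store_true")
    parser.add_argument("--skip-conflated-parquet", action="store_true")
    parser.add_argument("--skip-pmtiles", action="store_true")
    return parser


def planned_uploads(args: argparse.Namespace) -> List[str]:
    """Return the components that remain after applying the skip flags."""
    components = []
    if not args.skip_osm_parquet:
        components.append("osm-parquet")
    if not args.skip_conflated_parquet:
        components.append("conflated-parquet")
    if not args.skip_pmtiles:
        components.append("pmtiles")
    components.append("README.md")  # the per-version README is always rendered
    return components
```

For example, regenerating only the tiles would correspond to passing both parquet skip flags, which leaves `["pmtiles", "README.md"]` in this sketch.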
7786

7887
## Verification
7988

8089
- `summary_by_label.csv` match rates should resemble the prior run; large drifts mean a parameter or crosswalk regression.
8190
- `match_diagnostics.parquet` for per-pair forensics on surprising matches.
91+
- Spot-check the version landing page at
92+
<https://source.coop/henryspatialanalysis/openpois/> and confirm the
93+
per-version `README.md` renders with the expected OSM date, Overture
94+
release, and row counts.
8295
- See [skills/verify-pipeline-run](../verify-pipeline-run/SKILL.md).
8396

8497
## Next
@@ -90,4 +103,6 @@ Taxonomy-aware matching between rated OSM and Overture, then partition and uploa
90103
- Matching: [src/openpois/conflation/match.py](../../../src/openpois/conflation/match.py)
91104
- Merging: [src/openpois/conflation/merge.py](../../../src/openpois/conflation/merge.py)
92105
- Taxonomy assignment: [src/openpois/conflation/taxonomy.py](../../../src/openpois/conflation/taxonomy.py)
106+
- Publish orchestration: [scripts/publish/upload_to_source_coop.py](../../../scripts/publish/upload_to_source_coop.py)
107+
- Source Coop S3 adapter: [src/openpois/io/source_coop.py](../../../src/openpois/io/source_coop.py)
93108
- Conflation algorithm docs: [scripts/conflation/README.md](../../../scripts/conflation/README.md)

.claude/skills/full-data-pull/SKILL.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
22
name: full-data-pull
3-
description: Use when the user wants to refresh the independent POI snapshots (OSM, Overture) and rate the OSM snapshot for conflation. Triggers: "refresh all snapshots", "do a new data pull", "download new OSM/Overture", "monthly data refresh", "pull the latest POI data". Does NOT include conflation or S3 upload — those live in conflate-snapshots.
3+
description: Use when the user wants to refresh the independent POI snapshots (OSM, Overture) and rate the OSM snapshot for conflation. Triggers: "refresh all snapshots", "do a new data pull", "download new OSM/Overture", "monthly data refresh", "pull the latest POI data". Does NOT include conflation or Source Cooperative publishing — those live in conflate-snapshots.
44
---
55

66
# Full data pull
