
Commit 746ed86

Update machine-readable documentation.
1 parent 9cee798 commit 746ed86

11 files changed

Lines changed: 621 additions & 58 deletions


.claude/CLAUDE.md

Lines changed: 35 additions & 58 deletions
Original file line number | Diff line number | Diff line change
@@ -1,82 +1,59 @@
11
# CLAUDE.md
22

3-
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
3+
Guidance for Claude Code working in this repository. Deep-dives live in [skills/](skills/) and [docs/](docs/); this file is orientation.
44

5-
## Environment Setup
5+
## Environment setup
66

77
```bash
8-
make build_env # Create conda environment from environment.yml
9-
make install_package # Install openpois in editable mode (pip install -e .)
8+
make build_env # Create conda env from environment.yml (name: openpois, Python 3.10+)
9+
make install_package # pip install -e . (editable install)
1010
```
1111

12-
The conda environment is named `openpois` and requires Python 3.10+.
12+
Python executable: `/home/nathenry/miniforge3/envs/openpois/bin/python`.
1313

14-
## Common Commands
14+
## Common commands
1515

1616
```bash
1717
pytest # Run tests
18-
make export_env # Export conda environment to environment.yml after adding dependencies
18+
make export_env # Export conda env to environment.yml after adding deps
1919
```
2020

21-
Code style is enforced by Black (format on save in VSCode). Linting via flake8 and pylint, both configured in `pyproject.toml`.
21+
Style: Black (format-on-save in VSCode). Lint: flake8 + pylint, configured in `pyproject.toml`.
2222

23-
## Architecture
23+
## Architecture at a glance
2424

25-
**openpois** models POI (Point of Interest) stability over time using historical OpenStreetMap data. The workflow is:
25+
**openpois** models POI stability over time from OpenStreetMap history, and produces unified OSM + Overture + Foursquare snapshots for web consumption. Work splits into the pipelines and follow-up tasks below, each with its own skill:
2626

27-
1. **Download OSM history** — two options depending on scope:
28-
- **US + Puerto Rico (default, `src/openpois/io/osm_history_pbf.py`)** — downloads Geofabrik full-history PBFs (`us-internal.osh.pbf` + `puerto-rico-internal.osh.pbf`), runs `osmium tags-filter --omit-referenced` then `osmium time-filter`, and streams the result through pyosmium into `osm_versions.parquet` + `osm_changes.parquet`. Requires an OSM-account OAuth cookie jar for Geofabrik's internal server. Entry point: `scripts/osm_data/download_history.py`.
29-
- **City-scale fallback (`src/openpois/io/osm_history.py`)** — queries the Overpass API for element IDs in a bounding box, then fetches per-element histories from the OSM API. Seattle-scoped by default; Overpass cannot serve US-wide histories. Entry point: `scripts/osm_data/download.py`.
30-
2. **Format observations** (`src/openpois/osm/format_observations.py`) — converts raw OSM version histories into observation records (one row per version) with flags for tag changes and deletions
31-
3. **Model change rates** (`src/openpois/models/`) — fits an empirical Bayes model using PyTorch to estimate per-group POI change rates (λ) as a Poisson process
32-
4. **Visualize stability** (`src/openpois/osm/change_plots.py`) — plots how long POI tags remain unchanged
27+
| Pipeline | Skill |
28+
|---|---|
29+
| Fit λ from OSM history, rate current snapshots | [skills/model-history-pipeline](skills/model-history-pipeline/SKILL.md) |
30+
| Iterate model variants on a pinned history run | [skills/iterate-model-types](skills/iterate-model-types/SKILL.md) |
31+
| Refresh the three POI snapshots (OSM / Overture / FSQ) | [skills/full-data-pull](skills/full-data-pull/SKILL.md) |
32+
| Conflate OSM + Overture, partition, upload to S3 | [skills/conflate-snapshots](skills/conflate-snapshots/SKILL.md) |
33+
| Bump the frontend to the new data version | [skills/update-site](skills/update-site/SKILL.md) |
34+
| Post-run QA on any of the above | [skills/verify-pipeline-run](skills/verify-pipeline-run/SKILL.md) |
3335

34-
The **scripts/** directory contains end-to-end pipelines that call library functions using settings from `config.yaml`. They are not part of the installed package and serve as reference implementations.
36+
## Where things live
3537

36-
### Key classes and files
38+
| Path | Purpose |
39+
|---|---|
40+
| [src/openpois/io/](../src/openpois/io/) | I/O adapters: OSM history/snapshot, Overture, Foursquare, Census boundary |
41+
| [src/openpois/osm/](../src/openpois/osm/) | OSM-specific transforms: `format_observations`, `change_plots` |
42+
| [src/openpois/models/](../src/openpois/models/) | PyTorch empirical Bayes: `EventRate`, `ModelFitter`, model registry |
43+
| [src/openpois/conflation/](../src/openpois/conflation/) | OSM×Overture matching: `taxonomy`, `match`, `merge` |
44+
| [scripts/](../scripts/) | End-to-end pipelines using config.yaml — not installed, reference only |
45+
| [site/](../site/) | Vue 3 + Vite frontend |
3746

38-
- `EventRate` (`models/event_rate.py`) — wraps a constant or time-varying λ; computes change probabilities via integration
39-
- `ModelFitter` (`models/model_fitter.py`) — fits λ using PyTorch L-BFGS optimizer with optional priors; supports parameter draws for uncertainty
40-
- `pytorch_setup()` / `prepare_data_for_model()` (`models/setup.py`) — initializes torch (GPU/CPU) and prepares filtered, grouped observation data
41-
- `download_osm_history()` (`io/osm_history_pbf.py`) — US+PR history pipeline entry: Geofabrik full-history PBFs → osmium tags-filter (`--omit-referenced`) → osmium time-filter → pyosmium stream → `osm_versions.parquet` + `osm_changes.parquet`. Requires `download.osm.history_cookie_file` to point at a Netscape-format cookie jar with valid Geofabrik OAuth cookies.
42-
- `download_element_histories()` (`io/osm_history.py`) — legacy city-scale entry point (Overpass, `download.osm.history_bbox` config key, Seattle-scoped; Overpass cannot serve US-wide histories)
47+
## Reference docs
4348

44-
### Configuration
49+
- [docs/data-sources.md](docs/data-sources.md) — URLs, auth, schema quirks for every source
50+
- [docs/taxonomy-setup.md](docs/taxonomy-setup.md) — crosswalk CSVs, build_taxonomy.py, frontend sync
51+
- [docs/data-versioning.md](docs/data-versioning.md): `versions:` block, path resolution, external references
4552

46-
`config.yaml` holds all shared settings (spatial boundary, date ranges, OSM tag keys, model hyperparameters, output directory paths with versioning). The `config_versioned` package (external dependency) reads this file. Scripts load config at startup; library functions accept parameters directly.
53+
## Running to-do
4754

48-
- `.get()` raises `ValueError` for null config values — pass `fail_if_none=False` for optional fields like `release_date: null`
55+
[TODO.md](TODO.md) — curated running list. Not auto-synced to git status.
4956

50-
## POI Snapshot Downloads
57+
## Config gotcha worth surfacing
5158

52-
Three separate utilities download current snapshots covering the 50 US states + DC + Puerto Rico (separate from the historical OSM workflow):
53-
54-
### Spatial boundary (`src/openpois/io/boundary.py`)
55-
- Single source of truth for the US+PR extent used by all three snapshot downloaders
56-
- Downloads the Census 1:20M cartographic state shapefile (`cb_2023_us_state_20m`) on first use; cached under `directories.boundary`
57-
- `get_us_pr_boundary()` returns `(boundary_gdf, coarse_bboxes)` — a single-row dissolved+buffered polygon (EPSG:4326) plus a list of bboxes for predicate pushdown
58-
- Buffering is done in `EPSG:6933` (World Equal-Area Cylindrical) so the `coastline_buffer_m` (default 100 m) is accurate across CONUS / AK / HI / PR. Because `.dissolve()` removes internal state borders, the uniform outward buffer effectively only expands coastline; land-border expansion into CA/MX is negligible.
59-
- `coarse_bboxes` splits the Aleutians at the antimeridian into two bboxes (Near Islands at +172°E vs. rest of AK at negative longitudes)
60-
61-
### OSM (`src/openpois/io/osm_snapshot.py`)
62-
- `download_pbf` / `filter_pbf` / `parse_pbf_to_geodataframe` / `download_osm_snapshot`
63-
- Two Geofabrik extracts: `us-latest.osm.pbf` (~11 GB, 50 states incl. AK+HI) + `puerto-rico-latest.osm.pbf` (PR is NOT in the US extract) → osmium tags-filter → pyosmium parse → concat → GeoParquet
64-
- Geofabrik extracts are pre-cut to admin boundaries, so no polygon post-filter is needed
65-
- `osmium` is in the conda env bin but NOT on shell PATH; code resolves it via `Path(sys.executable).parent / "osmium"`
66-
- Run: `python scripts/osm_snapshot/download.py`
67-
68-
### Overture Maps (`src/openpois/io/overture.py`)
69-
- DuckDB + httpfs + spatial extensions; queries public S3 directly, no auth
70-
- **Two-stage spatial filter:** DuckDB `WHERE` clause ORs one disjunct per coarse bbox (predicate pushdown on Overture's `bbox` struct column), then a GeoPandas `sjoin(predicate='within')` post-filter against the exact US+PR polygon
71-
- `taxonomy` field is a named STRUCT: use `taxonomy.hierarchy[1]` (not `taxonomy[1]`)
72-
- `brand` is a singular struct (not array); geometry is native DuckDB GEOMETRY type requiring `LOAD spatial` and `ST_X()/ST_Y()`
73-
- L0 category names (Feb 2026+): `food_and_drink`, `shopping`, `arts_and_entertainment`, `sports_and_recreation`, `health_care`
74-
- Run: `python scripts/overture/download.py`
75-
76-
### Foursquare OS Places (`src/openpois/io/foursquare.py`)
77-
- PyIceberg `RestCatalog`; requires `warehouse="places"` parameter
78-
- Catalog: `uri=https://catalog.h3-hub.foursquare.com/iceberg`, namespace=`datasets`, tables=`places_os` / `categories_os`
79-
- Table is **unpartitioned** (no `dt` column); release date inferred from `last_updated_at` in partition metadata
80-
- Row filter: `country IN ('US', 'PR') AND date_closed IS NULL` — Foursquare uses ISO alpha-2 codes, so PR must be listed explicitly; PyIceberg has no spatial predicate support, so an exact `sjoin(predicate='within')` post-filter runs after the rows are loaded
81-
- `fsq_category_ids` arrives as numpy/pyarrow array — use `len(x) == 0` not `if not x:`
82-
- Token in `FSQ_PORTAL_TOKEN` env var; run: `python scripts/foursquare/download.py`
59+
`config_versioned.Config.get()` raises `ValueError` on null values. For optional fields (e.g., `release_date: null`), pass `fail_if_none=False`. Prefer `config.get_file_path(section, file_key)` over composing `get_dir_path()` + `get()` manually.

.claude/TODO.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
# openpois — running to-do
2+
3+
Short running list of in-progress / upcoming work. Edit freely; trim older completed items when the list gets long. Date items `YYYY-MM-DD` when added.
4+
5+
## In progress
6+
7+
_(no items — add some when work is underway)_
8+
9+
## Upcoming
10+
11+
- [ ] Watch for Overture L0/L1 → flat `basic_category` migration (~June 2026). Crosswalk CSV + `assign_overture_shared_label` will need updating. See [docs/taxonomy-setup.md](docs/taxonomy-setup.md).
12+
13+
## Recently done
14+
15+
_(trim after a few weeks)_
16+
17+
---
18+
19+
**Agent note:** When uncommitted changes are present in the repo, do not assume they belong in "In progress" here — confirm with the user first. This file is curated, not auto-synced to git status.

.claude/docs/data-sources.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
# Data sources
2+
3+
Reference for every external data source openpois ingests. For the workflow that orchestrates these, see the skills under [.claude/skills/](../skills/).
4+
5+
## OSM history (Geofabrik full-history PBFs)
6+
7+
**Used by**: the historical modeling pipeline ([skills/model-history-pipeline](../skills/model-history-pipeline/SKILL.md)).
8+
9+
- **URLs**:
10+
- `download.osm.history_pbf_url` → `https://osm-internal.download.geofabrik.de/north-america/us-internal.osh.pbf`
11+
- `download.osm.pr_history_pbf_url` → `.../us/puerto-rico-internal.osh.pbf`
12+
- **Auth**: OAuth — any OSM account works. Produce a Netscape-format cookie jar (browser export or Geofabrik's `oauth_cookie_client.py`). Path: `download.osm.history_cookie_file` (default `~/data/openpois/.creds/geofabrik_cookies.txt`).
13+
- **Pipeline**: `osmium tags-filter --omit-referenced` → `osmium time-filter` → pyosmium streams to `osm_versions.parquet` + `osm_changes.parquet`.
14+
- **Entry**: [src/openpois/io/osm_history_pbf.py](../../src/openpois/io/osm_history_pbf.py) (`download_osm_history`).
15+
- **Config**: `download.osm.start_date`, `end_date`, `date_interval_days`, `filter_keys`, `extract_keys`.
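
The two osmium steps in the pipeline above can be sketched as command lists ready for `subprocess.run`. This is a minimal sketch and not the code in `osm_history_pbf.py`: the function name, the intermediate file names, and the example tag expression are hypothetical, and the osmium binary is passed in rather than resolved.

```python
from pathlib import Path


def build_history_filter_commands(
    osmium: Path,
    history_pbf: Path,
    work_dir: Path,
    filter_keys: list[str],
    start: str,
    end: str,
) -> list[list[str]]:
    """Return the tags-filter and time-filter command lines, in run order."""
    tagged = work_dir / "tagged.osh.pbf"      # hypothetical intermediate name
    windowed = work_dir / "windowed.osh.pbf"  # hypothetical output name
    tags_cmd = [
        str(osmium), "tags-filter", str(history_pbf),
        *filter_keys,         # e.g. "nw/amenity" (assumed expression form)
        "--omit-referenced",  # leave out objects that are only referenced by matches
        "-o", str(tagged),
    ]
    time_cmd = [
        str(osmium), "time-filter", str(tagged),
        start, end,           # ISO timestamps bounding the history window
        "-o", str(windowed),
    ]
    return [tags_cmd, time_cmd]
```

Each list can be handed to `subprocess.run(cmd, check=True)` in order; the second command reads the file the first one writes.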
16+
17+
## OSM snapshot (Geofabrik standard PBFs)
18+
19+
**Used by**: current-state snapshot (`osm_snapshot.parquet`).
20+
21+
- **URLs**:
22+
- US: `https://download.geofabrik.de/north-america/us-latest.osm.pbf` (~11 GB, 50 states incl. AK+HI)
23+
- PR: `https://download.geofabrik.de/north-america/us/puerto-rico-latest.osm.pbf` (**PR is not in the US extract**)
24+
- **Auth**: none (public).
25+
- **Pipeline**: `osmium tags-filter` → pyosmium parse → concat US+PR → GeoParquet.
26+
- **Entry**: [src/openpois/io/osm_snapshot.py](../../src/openpois/io/osm_snapshot.py).
27+
- **Quirks**:
28+
- `osmium` is in the conda env's `bin/` but **not** on shell PATH. Code resolves via `Path(sys.executable).parent / "osmium"`.
29+
- Geofabrik extracts are pre-cut to admin boundaries → no polygon post-filter needed.
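
The `osmium` lookup described in the quirks above can be sketched as follows. The function name is hypothetical, and the `shutil.which` fallback is an addition for illustration that the docs do not describe.

```python
import shutil
import sys
from pathlib import Path


def resolve_osmium() -> Path:
    """Locate osmium next to the running interpreter, else fall back to PATH."""
    candidate = Path(sys.executable).parent / "osmium"
    if candidate.exists():
        return candidate
    on_path = shutil.which("osmium")  # fallback is an assumption, not from the docs
    if on_path is None:
        raise FileNotFoundError(f"osmium not found at {candidate} or on PATH")
    return Path(on_path)
```

Resolving relative to `sys.executable` works because conda installs console tools into the same `bin/` directory as the environment's Python, whether or not the environment is activated in the calling shell.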
30+
31+
## Overture Maps
32+
33+
**Used by**: current-state Overture snapshot (`overture_snapshot.parquet`).
34+
35+
- **URL**: public S3 at `s3://overturemaps-us-west-2/`.
36+
- **Auth**: none (DuckDB + httpfs queries directly).
37+
- **Pipeline**: two-stage spatial filter — DuckDB `WHERE` clause ORs one disjunct per coarse bbox (predicate pushdown on Overture's `bbox` struct), then GeoPandas `sjoin(predicate='within')` against the exact US+PR polygon.
38+
- **Entry**: [src/openpois/io/overture.py](../../src/openpois/io/overture.py).
39+
- **Schema quirks (as of Feb 2026 schema)**:
40+
- `taxonomy` is a named STRUCT `{primary, hierarchy[], alternates[]}` — use `taxonomy.hierarchy[1]` **not** `taxonomy[1]`.
41+
- `brand` is a singular struct, **not** a `brands[]` array.
42+
- L0 category names: `food_and_drink`, `shopping`, `arts_and_entertainment`, `sports_and_recreation`, `health_care`.
43+
- Geometry is native DuckDB GEOMETRY — must `LOAD spatial;` and use `ST_X()` / `ST_Y()`.
44+
- **Upcoming migration (~June 2026)**: L0/L1 hierarchy → flat `basic_category`. Crosswalk CSV + `assign_overture_shared_label` will need updating.
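
The first stage of the two-stage spatial filter above (one OR'd disjunct per coarse bbox) might be built like this. It is a sketch under assumptions: the function name is hypothetical, the bbox struct is assumed to expose `xmin`, `xmax`, `ymin`, `ymax` fields, and the overlap test is one reasonable choice rather than the predicate `overture.py` is known to use.

```python
Bbox = tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def bbox_where_clause(coarse_bboxes: list[Bbox]) -> str:
    """OR together one overlap test per coarse bbox for a DuckDB WHERE clause."""
    if not coarse_bboxes:
        raise ValueError("need at least one bbox")
    disjuncts = [
        # A feature overlaps the box when neither lies wholly to one side.
        f"(bbox.xmin <= {max_lon} AND bbox.xmax >= {min_lon} "
        f"AND bbox.ymin <= {max_lat} AND bbox.ymax >= {min_lat})"
        for min_lon, min_lat, max_lon, max_lat in coarse_bboxes
    ]
    return " OR ".join(disjuncts)
```

The clause only narrows the scan cheaply; rows that pass it still need the exact `sjoin(predicate='within')` check against the US+PR polygon.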
45+
46+
## Foursquare OS Places
47+
48+
**Used by**: current-state Foursquare snapshot (`foursquare_snapshot.parquet`).
49+
50+
- **Catalog**: `https://catalog.h3-hub.foursquare.com/iceberg` — PyIceberg `RestCatalog`.
51+
- **Auth**: `FSQ_PORTAL_TOKEN` env var.
52+
- **Params** (all in config.yaml):
53+
- `warehouse="places"` (required)
54+
- `namespace="datasets"`
55+
- `places_table="places_os"`, `categories_table="categories_os"`
56+
- **Pipeline**: row filter `country IN ('US', 'PR') AND date_closed IS NULL` → PyIceberg scan → sjoin against exact US+PR polygon (PyIceberg has no spatial predicates).
57+
- **Entry**: [src/openpois/io/foursquare.py](../../src/openpois/io/foursquare.py).
58+
- **Quirks**:
59+
- Table is **unpartitioned** (no `dt` column); release date inferred from `last_updated_at` in partition metadata.
60+
- `fsq_category_ids` arrives as numpy/pyarrow array — use `len(x) == 0`, **not** `if not x:`.
61+
- Puerto Rico rows carry the alpha-2 code `'PR'`, not `'US'`. Before 2026-04-16 the filter was `'US'`-only and silently dropped them.
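
One way to guard against the Puerto Rico regression noted above is to build the row filter from a country list and refuse a list that omits `'PR'`. The helper below is hypothetical and is not taken from `foursquare.py`; it only reproduces the filter string quoted in the pipeline description.

```python
def build_row_filter(countries: list[str]) -> str:
    """Build the scan filter, refusing a country list that would drop Puerto Rico."""
    # Upper-case and de-duplicate while keeping the caller's order.
    codes = list(dict.fromkeys(c.upper() for c in countries))
    if "PR" not in codes:
        raise ValueError("country list must include 'PR'; it is not covered by 'US'")
    quoted = ", ".join(f"'{c}'" for c in codes)
    return f"country IN ({quoted}) AND date_closed IS NULL"
```

Failing loudly here turns a silent row drop into an error at scan-construction time.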
62+
63+
## Census boundary
64+
65+
**Used by**: all three snapshot downloaders (spatial clipping).
66+
67+
- **URL**: `download.general.boundary.source_url` → `https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_20m.zip` (1:20M cartographic, 50 states + DC + PR).
68+
- **Auth**: none.
69+
- **Pipeline**: download ZIP → cache under `directories.boundary` (first-use) → dissolve → buffer outward by `coastline_buffer_m` (default 100 m) in EPSG:6933 (equal-area, so buffer accurate across CONUS/AK/HI/PR).
70+
- **Entry**: [src/openpois/io/boundary.py](../../src/openpois/io/boundary.py) (`get_us_pr_boundary`).
71+
- **Returns**: `(boundary_gdf, coarse_bboxes)` — single-row dissolved+buffered polygon (EPSG:4326) plus a list of bboxes for predicate pushdown.
72+
- **Antimeridian**: Aleutians split into two bboxes (Near Islands at +172°E vs. rest of AK at negative longitudes).
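
The antimeridian split above can be illustrated with a small helper. This is a sketch, assuming a bbox that wraps is written with its west edge greater than its east edge; the function name is hypothetical and `boundary.py` may derive its bboxes differently.

```python
Bbox = tuple[float, float, float, float]  # (west, south, east, north), degrees


def split_at_antimeridian(bbox: Bbox) -> list[Bbox]:
    """Split a bbox whose west edge is east of its east edge (it wraps at 180)."""
    west, south, east, north = bbox
    if west <= east:
        return [bbox]  # does not cross the antimeridian
    # One box up to +180 and one starting at -180, sharing the latitude range.
    return [(west, south, 180.0, north), (-180.0, south, east, north)]
```

A box starting at +172°E and ending at a negative longitude therefore becomes two boxes, each of which is a valid min/max range for predicate pushdown.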
73+
74+
## Legacy: Overpass-based OSM history
75+
76+
Still wired up but superseded by the PBF pipeline. Queries Overpass API for element IDs in a bbox, then fetches per-element histories from the OSM API.
77+
78+
- **Config**: `download.osm.history_bbox` (Seattle-scoped; Overpass can't serve US-wide histories).
79+
- **Entry**: [src/openpois/io/osm_history.py](../../src/openpois/io/osm_history.py) (`download_element_histories`).
80+
- **Script**: `scripts/osm_data/download.py`.
81+
- **When to use**: city-scale testing, or if Geofabrik OAuth is unavailable.

.claude/docs/data-versioning.md

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
# Data versioning
2+
3+
Every pipeline output is versioned via a single `versions:` block in [config.yaml](../../config.yaml). The external `config_versioned` package resolves these into filesystem paths.
4+
5+
## Source of truth
6+
7+
```yaml
8+
versions:
9+
osm_data: "20260416" # historical PBF pipeline outputs
10+
model_output: "20260416_by_leisure" # fitted model artifacts (suffix indicates variant)
11+
snapshot_osm: "20260416" # OSM current-state snapshot
12+
snapshot_overture: "20260417" # Overture snapshot
13+
snapshot_foursquare: "20260416" # Foursquare snapshot
14+
aws: "20260416" # S3 prefix for uploaded data
15+
conflation: "20260417" # conflated output
16+
```
17+
18+
Each key corresponds to a `directories.<key>` entry in `config.yaml` with `versioned: true`.
19+
20+
## Path resolution
21+
22+
External `config_versioned.Config` API:
23+
24+
```python
25+
config.get_dir_path("osm_data")
26+
# → ~/data/openpois/osm_data/20260416/
27+
28+
config.get_file_path("osm_data", "osm_versions")
29+
# → ~/data/openpois/osm_data/20260416/osm_versions.parquet
30+
```
31+
32+
**Prefer `get_file_path` over composing `get_dir_path()` + `get()` manually.**
33+
34+
`.get()` raises `ValueError` on null values — pass `fail_if_none=False` for optional fields like `download.overture.release_date: null` and `download.foursquare.release_date: null`.
35+
36+
`config.write_self(section)` snapshots the effective config into the output directory — used by model and conflation scripts to record the state of a run.
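
The directory layout implied by the path examples above is base / key / version / file. The stand-in below makes that explicit with plain `pathlib`; it is not the `config_versioned` implementation, and the dictionaries and function names are hypothetical placeholders for values that really live in `config.yaml`.

```python
from pathlib import Path

# Hypothetical stand-in values; the real ones live in config.yaml.
BASE = Path("~/data/openpois")
VERSIONS = {"osm_data": "20260416", "snapshot_overture": "20260417"}
FILES = {"osm_data": {"osm_versions": "osm_versions.parquet"}}


def versioned_dir(key: str) -> Path:
    """Join base / key / version, the layout implied by the examples above."""
    return BASE / key / VERSIONS[key]


def versioned_file(key: str, file_key: str) -> Path:
    """Resolve a named file inside the versioned directory."""
    return versioned_dir(key) / FILES[key][file_key]
```

Because the version is looked up in one place, bumping a `versions.*` entry moves every derived path at once, which is the reason to prefer a single file-path call over composing the pieces by hand.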
37+
38+
## Naming conventions
39+
40+
- **Dates**: `YYYYMMDD`, e.g., `20260416`.
41+
- **Model variants**: `{date}_by_{group_key}` (e.g., `20260416_by_leisure`, `20260416_by_amenity`) or `{date}_constant`. See [skills/iterate-model-types](../skills/iterate-model-types/SKILL.md).
42+
- **Independent cadences**: snapshot versions can (and should) differ across sources — Overture releases ~monthly, Foursquare separately. Don't force them to match.
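
The naming conventions above are regular enough to check mechanically. The parser below is a hypothetical helper, not part of openpois, and it assumes group keys contain only word characters.

```python
import re
from datetime import datetime

_VERSION_RE = re.compile(
    r"^(?P<date>\d{8})(?:_(?P<variant>constant|by_(?P<group>\w+)))?$"
)


def parse_version(version: str) -> dict:
    """Split a version string into its date and optional model-variant parts."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    datetime.strptime(match["date"], "%Y%m%d")  # raises ValueError on a bad date
    return {
        "date": match["date"],
        "variant": match["variant"],  # None for plain dated versions
        "group_key": match["group"],  # None unless the variant is by_<group_key>
    }
```

Running this over every value in the `versions:` block before a pipeline starts would catch typos such as a seven-digit date or a misspelled `_by_` suffix.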
43+
44+
## External references (hand-update when bumping)
45+
46+
Version strings appear in these places outside `versions:` — grep before any cross-source version change:
47+
48+
| File | References |
49+
|---|---|
50+
| [config.yaml](../../config.yaml) | `upload.latest_url_osm`, `upload.latest_url_conflation` (full URL with date) |
51+
| [site/src/constants.js](../../site/src/constants.js) | `OSM_S3_BASE`, `FSQ_S3_BASE`, `CONFLATED_S3_BASE` |
52+
| [site/public/about.html](../../site/public/about.html) | Hardcoded S3 browse links in the data-access section |
53+
| `osm_data.apply_model.model_stub` (config.yaml) | Which model family [scripts/osm_snapshot/apply_model.py](../../scripts/osm_snapshot/apply_model.py) ingests |
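
The "grep before any cross-source version change" step can also be scripted. The sketch below scans a list of files for a version string; the function name is hypothetical, and the file list is whatever the table above names in your checkout.

```python
from pathlib import Path


def find_version_references(
    version: str, files: list[Path]
) -> list[tuple[Path, int, str]]:
    """Return (file, line number, line) for every line mentioning the version."""
    hits = []
    for path in files:
        if not path.is_file():
            continue  # skip files missing from this checkout
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if version in line:
                hits.append((path, number, line.strip()))
    return hits
```

For example, `find_version_references("20260416", [Path("config.yaml"), Path("site/src/constants.js"), Path("site/public/about.html")])` lists each line that still carries the old date after a bump.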
54+
55+
[skills/update-site](../skills/update-site/SKILL.md) covers the frontend side; [skills/conflate-snapshots](../skills/conflate-snapshots/SKILL.md) covers the upload + config side.
56+
57+
## Workflow
58+
59+
1. Bump the relevant `versions.*` keys before running a pipeline.
60+
2. Run the pipeline — outputs land in the versioned directory.
61+
3. After upload, update `upload.latest_url_*` and the frontend references.
62+
4. Old versions stay on disk / S3 — delete manually when confident nothing references them.
