|
1 | 1 | # CLAUDE.md |
2 | 2 |
|
3 | | -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 3 | +Guidance for Claude Code working in this repository. Deep-dives live in [skills/](skills/) and [docs/](docs/); this file is orientation. |
4 | 4 |
|
5 | | -## Environment Setup |
| 5 | +## Environment setup |
6 | 6 |
|
7 | 7 | ```bash |
8 | | -make build_env # Create conda environment from environment.yml |
9 | | -make install_package # Install openpois in editable mode (pip install -e .) |
| 8 | +make build_env # Create conda env from environment.yml (name: openpois, Python 3.10+) |
| 9 | +make install_package # pip install -e . (editable install) |
10 | 10 | ``` |
11 | 11 |
|
12 | | -The conda environment is named `openpois` and requires Python 3.10+. |
| 12 | +Python executable: `/home/nathenry/miniforge3/envs/openpois/bin/python`. |
13 | 13 |
|
14 | | -## Common Commands |
| 14 | +## Common commands |
15 | 15 |
|
16 | 16 | ```bash |
17 | 17 | pytest # Run tests |
18 | | -make export_env # Export conda environment to environment.yml after adding dependencies |
| 18 | +make export_env # Export conda env to environment.yml after adding deps |
19 | 19 | ``` |
20 | 20 |
|
21 | | -Code style is enforced by Black (format on save in VSCode). Linting via flake8 and pylint, both configured in `pyproject.toml`. |
| 21 | +Style: Black (format-on-save in VSCode). Lint: flake8 + pylint, configured in `pyproject.toml`. |
22 | 22 |
|
23 | | -## Architecture |
| 23 | +## Architecture at a glance |
24 | 24 |
|
25 | | -**openpois** models POI (Point of Interest) stability over time using historical OpenStreetMap data. The workflow is: |
| 25 | +**openpois** models POI stability over time from OpenStreetMap history, and produces unified OSM + Overture + Foursquare snapshots for web consumption. Work splits into the following pipelines, each documented in a skill:
26 | 26 |
|
27 | | -1. **Download OSM history** — two options depending on scope: |
28 | | - - **US + Puerto Rico (default, `src/openpois/io/osm_history_pbf.py`)** — downloads Geofabrik full-history PBFs (`us-internal.osh.pbf` + `puerto-rico-internal.osh.pbf`), runs `osmium tags-filter --omit-referenced` then `osmium time-filter`, and streams the result through pyosmium into `osm_versions.parquet` + `osm_changes.parquet`. Requires an OSM-account OAuth cookie jar for Geofabrik's internal server. Entry point: `scripts/osm_data/download_history.py`. |
29 | | - - **City-scale fallback (`src/openpois/io/osm_history.py`)** — queries the Overpass API for element IDs in a bounding box, then fetches per-element histories from the OSM API. Seattle-scoped by default; Overpass cannot serve US-wide histories. Entry point: `scripts/osm_data/download.py`. |
30 | | -2. **Format observations** (`src/openpois/osm/format_observations.py`) — converts raw OSM version histories into observation records (one row per version) with flags for tag changes and deletions |
31 | | -3. **Model change rates** (`src/openpois/models/`) — fits an empirical Bayes model using PyTorch to estimate per-group POI change rates (λ) as a Poisson process |
32 | | -4. **Visualize stability** (`src/openpois/osm/change_plots.py`) — plots how long POI tags remain unchanged |
| 27 | +| Pipeline | Skill | |
| 28 | +|---|---| |
| 29 | +| Fit λ from OSM history, rate current snapshots | [skills/model-history-pipeline](skills/model-history-pipeline/SKILL.md) | |
| 30 | +| Iterate model variants on a pinned history run | [skills/iterate-model-types](skills/iterate-model-types/SKILL.md) | |
| 31 | +| Refresh the three POI snapshots (OSM / Overture / FSQ) | [skills/full-data-pull](skills/full-data-pull/SKILL.md) | |
| 32 | +| Conflate OSM + Overture, partition, upload to S3 | [skills/conflate-snapshots](skills/conflate-snapshots/SKILL.md) | |
| 33 | +| Bump the frontend to the new data version | [skills/update-site](skills/update-site/SKILL.md) | |
| 34 | +| Post-run QA on any of the above | [skills/verify-pipeline-run](skills/verify-pipeline-run/SKILL.md) | |
33 | 35 |
|
34 | | -The **scripts/** directory contains end-to-end pipelines that call library functions using settings from `config.yaml`. They are not part of the installed package and serve as reference implementations. |
| 36 | +## Where things live |
35 | 37 |
|
36 | | -### Key classes and files |
| 38 | +| Path | Purpose | |
| 39 | +|---|---| |
| 40 | +| [src/openpois/io/](../src/openpois/io/) | I/O adapters: OSM history/snapshot, Overture, Foursquare, Census boundary | |
| 41 | +| [src/openpois/osm/](../src/openpois/osm/) | OSM-specific transforms: `format_observations`, `change_plots` | |
| 42 | +| [src/openpois/models/](../src/openpois/models/) | PyTorch empirical Bayes: `EventRate`, `ModelFitter`, model registry | |
| 43 | +| [src/openpois/conflation/](../src/openpois/conflation/) | OSM×Overture matching: `taxonomy`, `match`, `merge` | |
| 44 | +| [scripts/](../scripts/) | End-to-end pipelines using config.yaml — not installed, reference only | |
| 45 | +| [site/](../site/) | Vue 3 + Vite frontend | |
37 | 46 |
|
38 | | -- `EventRate` (`models/event_rate.py`) — wraps a constant or time-varying λ; computes change probabilities via integration |
39 | | -- `ModelFitter` (`models/model_fitter.py`) — fits λ using PyTorch L-BFGS optimizer with optional priors; supports parameter draws for uncertainty |
40 | | -- `pytorch_setup()` / `prepare_data_for_model()` (`models/setup.py`) — initializes torch (GPU/CPU) and prepares filtered, grouped observation data |
41 | | -- `download_osm_history()` (`io/osm_history_pbf.py`) — US+PR history pipeline entry: Geofabrik full-history PBFs → osmium tags-filter (`--omit-referenced`) → osmium time-filter → pyosmium stream → `osm_versions.parquet` + `osm_changes.parquet`. Requires `download.osm.history_cookie_file` to point at a Netscape-format cookie jar with valid Geofabrik OAuth cookies. |
42 | | -- `download_element_histories()` (`io/osm_history.py`) — legacy city-scale entry point (Overpass, `download.osm.history_bbox` config key, Seattle-scoped; Overpass cannot serve US-wide histories) |
| 47 | +## Reference docs |
43 | 48 |
|
44 | | -### Configuration |
| 49 | +- [docs/data-sources.md](docs/data-sources.md) — URLs, auth, schema quirks for every source |
| 50 | +- [docs/taxonomy-setup.md](docs/taxonomy-setup.md) — crosswalk CSVs, build_taxonomy.py, frontend sync |
| 51 | +- [docs/data-versioning.md](docs/data-versioning.md) — `versions:` block, path resolution, external references |
45 | 52 |
|
46 | | -`config.yaml` holds all shared settings (spatial boundary, date ranges, OSM tag keys, model hyperparameters, output directory paths with versioning). The `config_versioned` package (external dependency) reads this file. Scripts load config at startup; library functions accept parameters directly. |
| 53 | +## Running to-do |
47 | 54 |
|
48 | | -- `.get()` raises `ValueError` for null config values — pass `fail_if_none=False` for optional fields like `release_date: null` |
| 55 | +[TODO.md](TODO.md) — curated running list. Not auto-synced to git status. |
49 | 56 |
|
50 | | -## POI Snapshot Downloads |
| 57 | +## Config gotcha worth surfacing |
51 | 58 |
|
52 | | -Three separate utilities download current snapshots covering the 50 US states + DC + Puerto Rico (separate from the historical OSM workflow): |
53 | | - |
54 | | -### Spatial boundary (`src/openpois/io/boundary.py`) |
55 | | -- Single source of truth for the US+PR extent used by all three snapshot downloaders |
56 | | -- Downloads the Census 1:20M cartographic state shapefile (`cb_2023_us_state_20m`) on first use; cached under `directories.boundary` |
57 | | -- `get_us_pr_boundary()` returns `(boundary_gdf, coarse_bboxes)` — a single-row dissolved+buffered polygon (EPSG:4326) plus a list of bboxes for predicate pushdown |
58 | | -- Buffering is done in `EPSG:6933` (World Equal-Area Cylindrical) so the `coastline_buffer_m` (default 100 m) is accurate across CONUS / AK / HI / PR. Because `.dissolve()` removes internal state borders, the uniform outward buffer effectively only expands coastline; land-border expansion into CA/MX is negligible. |
59 | | -- `coarse_bboxes` splits the Aleutians at the antimeridian into two bboxes (Near Islands at +172°E vs. rest of AK at negative longitudes) |
60 | | - |
61 | | -### OSM (`src/openpois/io/osm_snapshot.py`) |
62 | | -- `download_pbf` / `filter_pbf` / `parse_pbf_to_geodataframe` / `download_osm_snapshot` |
63 | | -- Two Geofabrik extracts: `us-latest.osm.pbf` (~11 GB, 50 states incl. AK+HI) + `puerto-rico-latest.osm.pbf` (PR is NOT in the US extract) → osmium tags-filter → pyosmium parse → concat → GeoParquet |
64 | | -- Geofabrik extracts are pre-cut to admin boundaries, so no polygon post-filter is needed |
65 | | -- `osmium` is in the conda env bin but NOT on shell PATH; code resolves it via `Path(sys.executable).parent / "osmium"` |
66 | | -- Run: `python scripts/osm_snapshot/download.py` |
67 | | - |
68 | | -### Overture Maps (`src/openpois/io/overture.py`) |
69 | | -- DuckDB + httpfs + spatial extensions; queries public S3 directly, no auth |
70 | | -- **Two-stage spatial filter:** DuckDB `WHERE` clause ORs one disjunct per coarse bbox (predicate pushdown on Overture's `bbox` struct column), then a GeoPandas `sjoin(predicate='within')` post-filter against the exact US+PR polygon |
71 | | -- `taxonomy` field is a named STRUCT: use `taxonomy.hierarchy[1]` (not `taxonomy[1]`) |
72 | | -- `brand` is a singular struct (not array); geometry is native DuckDB GEOMETRY type requiring `LOAD spatial` and `ST_X()/ST_Y()` |
73 | | -- L0 category names (Feb 2026+): `food_and_drink`, `shopping`, `arts_and_entertainment`, `sports_and_recreation`, `health_care` |
74 | | -- Run: `python scripts/overture/download.py` |
75 | | - |
76 | | -### Foursquare OS Places (`src/openpois/io/foursquare.py`) |
77 | | -- PyIceberg `RestCatalog`; requires `warehouse="places"` parameter |
78 | | -- Catalog: `uri=https://catalog.h3-hub.foursquare.com/iceberg`, namespace=`datasets`, tables=`places_os` / `categories_os` |
79 | | -- Table is **unpartitioned** (no `dt` column); release date inferred from `last_updated_at` in partition metadata |
80 | | -- Row filter: `country IN ('US', 'PR') AND date_closed IS NULL` — Foursquare uses ISO alpha-2 codes, so PR must be listed explicitly; PyIceberg has no spatial predicate support, so an exact `sjoin(predicate='within')` post-filter runs after the rows are loaded |
81 | | -- `fsq_category_ids` arrives as numpy/pyarrow array — use `len(x) == 0` not `if not x:` |
82 | | -- Token in `FSQ_PORTAL_TOKEN` env var; run: `python scripts/foursquare/download.py` |
| 59 | +`config_versioned.Config.get()` raises `ValueError` on null values. For optional fields (e.g., `release_date: null`), pass `fail_if_none=False`. Prefer `config.get_file_path(section, file_key)` over composing `get_dir_path()` + `get()` manually. |
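
The null-handling gotcha above can be sketched roughly as follows. This is a minimal illustration, not the real library: `StandInConfig` is a hypothetical stand-in that mimics only the behaviour described here (a `ValueError` on null values unless `fail_if_none=False`), and the key path `download` / `overture` / `release_date` and the `"latest"` fallback are assumptions chosen for the example. The actual `config_versioned.Config.get()` signature may differ.

```python
# Stand-in for illustration only: NOT the real config_versioned.Config.
# It mimics just the behaviour described above; the real API may differ.
class StandInConfig:
    def __init__(self, settings):
        self._settings = settings

    def get(self, *keys, fail_if_none=True):
        # Walk the nested dict one key at a time.
        value = self._settings
        for key in keys:
            value = value[key]
        # Null values are an error unless the caller opts out.
        if value is None and fail_if_none:
            raise ValueError(f"Config value at {keys} is null")
        return value


# Hypothetical config contents with an optional, null-valued field.
config = StandInConfig({"download": {"overture": {"release_date": None}}})

# Optional field: opt out of the null check and handle None explicitly.
release_date = config.get(
    "download", "overture", "release_date", fail_if_none=False
)
if release_date is None:
    release_date = "latest"  # hypothetical fallback value

# Required field: keep the default so a null value fails loudly.
try:
    config.get("download", "overture", "release_date")
    raised = False
except ValueError:
    raised = True
```

Leaving `fail_if_none` at its default for required fields keeps a missing setting from silently propagating as `None` into a pipeline run.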