
Commit b78dcbb

Search across the entire US (including Puerto Rico) for data.
1 parent 59ef035 commit b78dcbb

15 files changed

Lines changed: 680 additions & 126 deletions


.claude/CLAUDE.md

Lines changed: 22 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -24,47 +24,56 @@ Code style is enforced by Black (format on save in VSCode). Linting via flake8 a
2424

2525
**openpois** models POI (Point of Interest) stability over time using historical OpenStreetMap data. The workflow is:
2626

27-
1. **Download OSM history** (`src/openpois/osm/download.py`) — queries the Overpass API for element histories within a bounding box and date range, producing version/change tables
27+
1. **Download OSM history** (`src/openpois/io/osm_history.py`) — queries the Overpass API for element histories within a bounding box and date range, producing version/change tables
2828
2. **Format observations** (`src/openpois/osm/format_observations.py`) — converts raw OSM version histories into observation records (one row per version) with flags for tag changes and deletions
2929
3. **Model change rates** (`src/openpois/models/`) — fits an empirical Bayes model using PyTorch to estimate per-group POI change rates (λ) as a Poisson process
3030
4. **Visualize stability** (`src/openpois/osm/change_plots.py`) — plots how long POI tags remain unchanged
3131

32-
The **exploratory/** scripts are end-to-end pipelines that call library functions using settings from `config.yaml`. They are not part of the installed package and serve as reference implementations.
32+
The **scripts/** directory contains end-to-end pipelines that call library functions using settings from `config.yaml`. They are not part of the installed package and serve as reference implementations.
3333

3434
### Key classes and files
3535

3636
- `EventRate` (`models/event_rate.py`) — wraps a constant or time-varying λ; computes change probabilities via integration
3737
- `ModelFitter` (`models/model_fitter.py`) — fits λ using PyTorch L-BFGS optimizer with optional priors; supports parameter draws for uncertainty
3838
- `pytorch_setup()` / `prepare_data_for_model()` (`models/setup.py`) — initializes torch (GPU/CPU) and prepares filtered, grouped observation data
39-
- `download_element_histories()` (`osm/download.py`) — main entry point for OSM history acquisition (Overpass, Seattle bbox only — do NOT modify for nationwide use)
39+
- `download_element_histories()` (`io/osm_history.py`) — main entry point for OSM history acquisition (Overpass, `download.osm.history_bbox` config key, Seattle-scoped — do NOT repurpose for nationwide use; Overpass cannot serve US-wide histories)
4040

4141
### Configuration
4242

43-
`config.yaml` holds all shared settings (bounding box, date ranges, OSM tag keys, model hyperparameters, output directory paths with versioning). The `config_versioned` package (external dependency) reads this file. Exploratory scripts load config at startup; library functions accept parameters directly.
43+
`config.yaml` holds all shared settings (spatial boundary, date ranges, OSM tag keys, model hyperparameters, output directory paths with versioning). The `config_versioned` package (external dependency) reads this file. Scripts load config at startup; library functions accept parameters directly.
4444

4545
- `.get()` raises `ValueError` for null config values — pass `fail_if_none=False` for optional fields like `release_date: null`
4646

4747
## POI Snapshot Downloads
4848

49-
Three separate utilities download current US-wide snapshots (separate from the historical OSM workflow):
49+
Three separate utilities download current snapshots covering the 50 US states + DC + Puerto Rico (separate from the historical OSM workflow):
5050

51-
### OSM (`src/openpois/osm/snapshot.py`)
51+
### Spatial boundary (`src/openpois/io/boundary.py`)
52+
- Single source of truth for the US+PR extent used by all three snapshot downloaders
53+
- Downloads the Census 1:20M cartographic state shapefile (`cb_2023_us_state_20m`) on first use; cached under `directories.boundary`
54+
- `get_us_pr_boundary()` returns `(boundary_gdf, coarse_bboxes)` — a single-row dissolved+buffered polygon (EPSG:4326) plus a list of bboxes for predicate pushdown
55+
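- Buffering is done in `EPSG:6933` (WGS 84 / NSIDC EASE-Grid 2.0 Global, a cylindrical equal-area projection) so that `coastline_buffer_m` (default 100 m) is specified in metres rather than degrees across CONUS / AK / HI / PR; the projection preserves area, not distance, so the buffer width is only approximate at high latitudes. Because `.dissolve()` removes internal state borders, the uniform outward buffer effectively only expands coastline; land-border expansion into CA/MX is negligible.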
56+
- `coarse_bboxes` splits the Aleutians at the antimeridian into two bboxes (Near Islands at +172°E vs. rest of AK at negative longitudes)
57+
58+
### OSM (`src/openpois/io/osm_snapshot.py`)
5259
- `download_pbf` / `filter_pbf` / `parse_pbf_to_geodataframe` / `download_osm_snapshot`
53-
- Geofabrik US extract (~11 GB) → osmium tags-filter → pyosmium parse → GeoParquet
60+
- Two Geofabrik extracts: `us-latest.osm.pbf` (~11 GB, 50 states incl. AK+HI) + `puerto-rico-latest.osm.pbf` (PR is NOT in the US extract) → osmium tags-filter → pyosmium parse → concat → GeoParquet
61+
- Geofabrik extracts are pre-cut to admin boundaries, so no polygon post-filter is needed
5462
- `osmium` is in the conda env bin but NOT on shell PATH; code resolves it via `Path(sys.executable).parent / "osmium"`
55-
- Run: `python exploratory/osm_snapshot/download.py`
63+
- Run: `python scripts/osm_snapshot/download.py`
5664

57-
### Overture Maps (`src/openpois/overture/download.py`)
65+
### Overture Maps (`src/openpois/io/overture.py`)
5866
- DuckDB + httpfs + spatial extensions; queries public S3 directly, no auth
67+
- **Two-stage spatial filter:** DuckDB `WHERE` clause ORs one disjunct per coarse bbox (predicate pushdown on Overture's `bbox` struct column), then a GeoPandas `sjoin(predicate='within')` post-filter against the exact US+PR polygon
5968
- `taxonomy` field is a named STRUCT: use `taxonomy.hierarchy[1]` (not `taxonomy[1]`)
6069
- `brand` is a singular struct (not array); geometry is native DuckDB GEOMETRY type requiring `LOAD spatial` and `ST_X()/ST_Y()`
6170
- L0 category names (Feb 2026+): `food_and_drink`, `shopping`, `arts_and_entertainment`, `sports_and_recreation`, `health_care`
62-
- Run: `python exploratory/overture/download.py`
71+
- Run: `python scripts/overture/download.py`
6372

64-
### Foursquare OS Places (`src/openpois/foursquare/download.py`)
73+
### Foursquare OS Places (`src/openpois/io/foursquare.py`)
6574
- PyIceberg `RestCatalog`; requires `warehouse="places"` parameter
6675
- Catalog: `uri=https://catalog.h3-hub.foursquare.com/iceberg`, namespace=`datasets`, tables=`places_os` / `categories_os`
6776
- Table is **unpartitioned** (no `dt` column); release date inferred from `last_updated_at` in partition metadata
68-
- Row filter: `country = 'US' AND date_closed IS NULL` (no dt filter)
77+
- Row filter: `country IN ('US', 'PR') AND date_closed IS NULL` — Foursquare uses ISO alpha-2 codes, so PR must be listed explicitly; PyIceberg has no spatial predicate support, so an exact `sjoin(predicate='within')` post-filter runs after the rows are loaded
6978
- `fsq_category_ids` arrives as numpy/pyarrow array — use `len(x) == 0` not `if not x:`
70-
- Token in `FSQ_PORTAL_TOKEN` env var; run: `python exploratory/foursquare/download.py`
79+
- Token in `FSQ_PORTAL_TOKEN` env var; run: `python scripts/foursquare/download.py`
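The first stage of the two-stage spatial filter described for Overture (one `WHERE` disjunct per coarse bbox) can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: `BBox`, `bbox_where_clause`, `in_any_bbox`, and the coordinates in `COARSE_BBOXES` are hypothetical, and the `bbox.xmin` / `bbox.xmax` / `bbox.ymin` / `bbox.ymax` field names are an assumption about the Overture `bbox` struct layout. The exact `sjoin(predicate='within')` refinement is not shown.

```python
from typing import NamedTuple, Sequence


class BBox(NamedTuple):
    """Axis-aligned bounding box in EPSG:4326 degrees."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def bbox_where_clause(bboxes: Sequence[BBox], column: str = "bbox") -> str:
    """Build a SQL predicate that ORs one overlap test per coarse bbox.

    Assumes the table exposes a struct column with xmin/xmax/ymin/ymax fields.
    """
    if not bboxes:
        raise ValueError("at least one bbox is required")
    disjuncts = [
        f"({column}.xmin <= {b.xmax} AND {column}.xmax >= {b.xmin} "
        f"AND {column}.ymin <= {b.ymax} AND {column}.ymax >= {b.ymin})"
        for b in bboxes
    ]
    return " OR ".join(disjuncts)


def in_any_bbox(lon: float, lat: float, bboxes: Sequence[BBox]) -> bool:
    """Coarse check: is the point inside at least one bbox?"""
    return any(
        b.xmin <= lon <= b.xmax and b.ymin <= lat <= b.ymax for b in bboxes
    )


# Illustrative values only: the Aleutians are split at the antimeridian,
# so the islands at positive longitudes get their own bbox.
COARSE_BBOXES = [
    BBox(172.0, 51.0, 180.0, 54.0),    # islands west of the antimeridian
    BBox(-180.0, 17.0, -64.0, 72.0),   # everything at negative longitudes
]

WHERE_CLAUSE = bbox_where_clause(COARSE_BBOXES)
```

Splitting at the antimeridian keeps each bbox narrow; a single bbox spanning +172° to -64° would have to wrap around most of the globe, which defeats the pushdown.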

config.yaml

Lines changed: 29 additions & 13 deletions
@@ -1,27 +1,38 @@
11
# Versioned directories (used with config.get_dir_path())
22
versions:
3-
osm_data: "20260313"
4-
model_output: "20260315_constant"
5-
snapshot_osm: "20260313"
6-
snapshot_overture: "20260313"
7-
snapshot_foursquare: "20260313"
8-
aws: "20260318"
9-
conflation: "20260318"
3+
osm_data: "20260416"
4+
model_output: "20260315"
5+
snapshot_osm: "20260416"
6+
snapshot_overture: "20260416"
7+
snapshot_foursquare: "20260416"
8+
aws: "20260416"
9+
conflation: "20260416"
1010

1111
# Settings for downloading data
1212
download:
1313
general:
1414
timeout: 1_000
15-
bbox:
16-
xmin: -125.0
17-
ymin: 24.5
18-
xmax: -66.9
19-
ymax: 49.4
15+
# Census 1:20M cartographic state boundary file; includes 50 states + DC
16+
# + PR. Used by all three snapshot downloads to restrict POIs to the US
17+
# plus Puerto Rico. The coastline buffer expands the dissolved polygon
18+
# outward by N metres so near-shore POIs are retained; internal state
19+
# borders disappear on dissolve so the buffer only affects the coast.
20+
boundary:
21+
source_url: "https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_20m.zip"
22+
coastline_buffer_m: 100
2023
osm:
2124
start_date: 2016-01-01
2225
end_date: 2025-12-31
2326
date_interval_days: 7
27+
# Seattle-area bbox used only by the Overpass-based historical download.
28+
# Overpass cannot serve US-wide histories; keep this scoped to a city.
29+
history_bbox:
30+
xmin: -122.45
31+
ymin: 47.50
32+
xmax: -122.25
33+
ymax: 47.70
2434
pbf_url: "https://download.geofabrik.de/north-america/us-latest.osm.pbf"
35+
pr_pbf_url: "https://download.geofabrik.de/north-america/us/puerto-rico-latest.osm.pbf"
2536
overwrite_download: false
2637
overwrite_filter: false
2738
source_label: "osm"
@@ -76,7 +87,7 @@ osm_data:
7687
- last_obs_timestamp
7788
- last_tag_timestamp
7889
apply_model:
79-
model_stub: '20260315'
90+
model_stub: '20260319_discounted'
8091

8192
# Settings for exploratory/models/pytorch_simple.py
8293
osm_turnover_model:
@@ -119,10 +130,15 @@ directories:
119130
files:
120131
raw_pbf: us-latest.osm.pbf
121132
filtered_pbf: us-pois.osm.pbf
133+
raw_pr_pbf: puerto-rico-latest.osm.pbf
134+
filtered_pr_pbf: puerto-rico-pois.osm.pbf
122135
snapshot: osm_snapshot.parquet
123136
rated_snapshot: osm_snapshot_rated.parquet
124137
partitioned: osm_snapshot_partitioned
125138
pmtiles: osm_snapshot.pmtiles
139+
boundary:
140+
versioned: false
141+
path: ~/data/openpois/boundary
126142
snapshot_overture:
127143
versioned: true
128144
path: ~/data/openpois/snapshots/overture

scripts/foursquare/download.py

Lines changed: 22 additions & 5 deletions
@@ -1,10 +1,12 @@
11
"""
2-
Download the current US Foursquare OS Places snapshot as a GeoParquet file.
2+
Download the current US+PR Foursquare OS Places snapshot as a GeoParquet file.
33
44
Authenticates to the Foursquare Places Portal Apache Iceberg REST catalog
5-
using a portal token, loads the unpartitioned places_os table filtered to US
6-
records with no closed date, joins against categories_os to resolve L1
7-
category names, and saves the result as a GeoParquet file.
5+
using a portal token, loads the unpartitioned places_os table filtered to
6+
places whose country is 'US' or 'PR' with no closed date, joins against
7+
categories_os to resolve L1 category names, applies an exact within-polygon
8+
filter against the US+PR Census boundary, and saves the result as a
9+
GeoParquet file.
810
911
Authentication:
1012
Set the FSQ_PORTAL_TOKEN environment variable before running:
@@ -20,13 +22,17 @@
2022
download.foursquare.categories_table — categories table name ("categories_os")
2123
download.foursquare.token_env_var — env var name for the portal token
2224
download.foursquare.l1_category_names — L1 category filter list
25+
download.general.boundary.source_url — Census state-boundary zip URL
26+
download.general.boundary.coastline_buffer_m — outward coastline buffer (m)
27+
directories.boundary — cache directory for boundary file
2328
directories.snapshot_foursquare — output directory
2429
2530
Output file:
26-
foursquare_snapshot.parquet — GeoParquet with ~8.3M US POIs
31+
foursquare_snapshot.parquet — GeoParquet with US+PR POIs
2732
Columns: fsq_place_id, name, fsq_category_ids, geometry, source
2833
"""
2934
from config_versioned import Config
35+
from openpois.io.boundary import get_us_pr_boundary
3036
from openpois.io.foursquare import download_foursquare_snapshot
3137

3238
# -----------------------------------------------------------------------------
@@ -44,6 +50,11 @@
4450
CATEGORIES_TABLE = config.get("download", "foursquare", "categories_table")
4551
TOKEN_ENV_VAR = config.get("download", "foursquare", "token_env_var")
4652
L1_CATEGORIES = config.get("download", "foursquare", "l1_category_names")
53+
BOUNDARY_URL = config.get("download", "general", "boundary", "source_url")
54+
COASTLINE_BUFFER_M = config.get(
55+
"download", "general", "boundary", "coastline_buffer_m"
56+
)
57+
BOUNDARY_DIR = config.get_dir_path("boundary")
4758

4859
SAVE_DIR = config.get_dir_path("snapshot_foursquare")
4960
SAVE_DIR.mkdir(parents=True, exist_ok=True)
@@ -55,6 +66,11 @@
5566
# -----------------------------------------------------------------------------
5667

5768
if __name__ == "__main__":
69+
boundary_gdf, _ = get_us_pr_boundary(
70+
source_url = BOUNDARY_URL,
71+
cache_dir = BOUNDARY_DIR,
72+
coastline_buffer_m = COASTLINE_BUFFER_M,
73+
)
5874
gdf = download_foursquare_snapshot(
5975
output_path=OUTPUT_PATH,
6076
l1_category_names=L1_CATEGORIES,
@@ -64,6 +80,7 @@
6480
places_table=PLACES_TABLE,
6581
categories_table=CATEGORIES_TABLE,
6682
token_env_var=TOKEN_ENV_VAR,
83+
boundary_gdf=boundary_gdf,
6784
release_date=RELEASE_DATE,
6885
)
6986
print(f"Saved {len(gdf):,} Foursquare POIs to {OUTPUT_PATH}")
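The Foursquare row filter and the empty-array check mentioned in `CLAUDE.md` can be sketched without any third-party dependency. This is a hedged illustration: `build_row_filter` and `has_categories` are hypothetical helpers, not functions from `openpois.io.foursquare`, and the sketch only builds the filter string rather than running a PyIceberg scan.

```python
from typing import Optional, Sequence, Sized


def build_row_filter(countries: Sequence[str]) -> str:
    """Build a row filter keeping open places in the given countries.

    Country codes are ISO alpha-2, so Puerto Rico ('PR') must be listed
    explicitly alongside 'US'.
    """
    if not countries:
        raise ValueError("at least one country code is required")
    for code in countries:
        if len(code) != 2 or not code.isalpha():
            raise ValueError(f"not an ISO alpha-2 code: {code!r}")
    quoted = ", ".join(f"'{code.upper()}'" for code in countries)
    return f"country IN ({quoted}) AND date_closed IS NULL"


def has_categories(category_ids: Optional[Sized]) -> bool:
    """Length-based emptiness check.

    `len(x)` works for lists, tuples, and array types alike, whereas
    truth-testing a multi-element numpy array raises an error.
    """
    return category_ids is not None and len(category_ids) > 0


ROW_FILTER = build_row_filter(["US", "PR"])
```

Because the string filter cannot express a spatial predicate, rows passing this filter would still need the exact within-polygon check against the boundary afterwards.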

scripts/osm_data/data_viz.py

Lines changed: 5 additions & 6 deletions
@@ -49,7 +49,6 @@
4949
END_DATE = pd.Timestamp(config.get("download", "osm", "end_date"), tz='UTC')
5050
MODEL_BASE = config.get_dir_path("model_output").parent
5151
MODEL_STUB = config.get("osm_data", "apply_model", "model_stub")
52-
ADJ_FACTOR = 1.0
5352

5453
max_days = 365 * 10
5554
VIZ_DIR.mkdir(parents=True, exist_ok=True)
@@ -82,7 +81,7 @@ def fig_save(
8281
**kwargs
8382
)
8483

85-
def get_preds_dict(model_stub: str | None, adj_factor: float = 1.0) -> dict[str, pd.DataFrame]:
84+
def get_preds_dict(model_stub: str | None) -> dict[str, pd.DataFrame]:
8685
"""
8786
Load model predictions from the model output directory.
8887
"""
@@ -98,9 +97,9 @@ def get_preds_df(model_stub: str, subset: str | None = None) -> Path:
9897
return None
9998
return pd.read_csv(preds_fp).assign(
10099
year = pd.col('t2'),
101-
conf_mean = (1.0 - pd.col('p_mean')) * adj_factor,
102-
conf_lower = (1.0 - pd.col('p_upper')) * adj_factor,
103-
conf_upper = (1.0 - pd.col('p_lower')) * adj_factor,
100+
conf_mean = (1.0 - pd.col('p_mean')),
101+
conf_lower = (1.0 - pd.col('p_upper')),
102+
conf_upper = (1.0 - pd.col('p_lower')),
104103
)
105104
preds = dict()
106105
preds["constant"] = get_preds_df(model_stub)
@@ -115,7 +114,7 @@ def get_preds_df(model_stub: str, subset: str | None = None) -> Path:
115114

116115
if __name__ == "__main__":
117116
# Read model predictions
118-
preds = get_preds_dict(MODEL_STUB, adj_factor = ADJ_FACTOR)
117+
preds = get_preds_dict(MODEL_STUB)
119118
# Read observations
120119
# Drop the first observation for each POI (when the POI was first added) - the last
121120
# observation timestamp will be missing for these rows

scripts/osm_data/download.py

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
66
element via the OSM API.
77
88
Config keys used (config.yaml):
9-
download.general.bbox — WGS-84 bbox [xmin, ymin, xmax, ymax]
9+
download.osm.history_bbox — WGS-84 bbox [xmin, ymin, xmax, ymax]
1010
download.general.timeout — request timeout in seconds
1111
download.osm.start_date — earliest snapshot date (min: 2012-09-13)
1212
download.osm.end_date — latest snapshot date
@@ -35,7 +35,7 @@
3535
config = Config("~/repos/openpois/config.yaml")
3636

3737
TIMEOUT = config.get("download", "general", "timeout")
38-
BBOX = config.get("download", "general", "bbox")
38+
BBOX = config.get("download", "osm", "history_bbox")
3939
# Earliest option is September 13, 2012
4040
START_DATE = datetime.datetime.combine(
4141
config.get("download", "osm", "start_date"), datetime.time.min
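The `download.osm.history_bbox` key stores the bbox as xmin/ymin/xmax/ymax, whereas Overpass QL expects bounding boxes in (south, west, north, east) order. A minimal sketch of that reordering, assuming the bbox has been loaded as a plain mapping; `to_overpass_bbox` is a hypothetical helper and may not match how the repository builds its queries.

```python
from typing import Mapping


def to_overpass_bbox(bbox: Mapping[str, float]) -> str:
    """Convert an xmin/ymin/xmax/ymax mapping to Overpass (S,W,N,E) order."""
    xmin, ymin, xmax, ymax = (
        bbox[key] for key in ("xmin", "ymin", "xmax", "ymax")
    )
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("bbox minimums must be smaller than maximums")
    # Overpass order is latitude first: south, west, north, east.
    return f"{ymin},{xmin},{ymax},{xmax}"


# Values copied from the history_bbox block added to config.yaml.
SEATTLE_BBOX = {"xmin": -122.45, "ymin": 47.50, "xmax": -122.25, "ymax": 47.70}
OVERPASS_BBOX = to_overpass_bbox(SEATTLE_BBOX)  # "47.5,-122.45,47.7,-122.25"
```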

scripts/osm_snapshot/download.py

Lines changed: 18 additions & 9 deletions
@@ -1,20 +1,23 @@
11
"""
2-
Download the current US OpenStreetMap POI snapshot as a GeoParquet file.
2+
Download the current US+PR OpenStreetMap POI snapshot as a GeoParquet file.
33
4-
Downloads the Geofabrik North America PBF extract (~11 GB), uses osmium
5-
tags-filter to extract nodes and ways matching the configured tag keys, then
6-
parses the result with pyosmium into a GeoDataFrame and saves as GeoParquet.
7-
Incremental: skips the PBF download or filter step if output files already
8-
exist (controlled by overwrite_download and overwrite_filter config flags).
4+
Downloads two Geofabrik PBF extracts: the US extract (~11 GB, covers all
5+
50 states incl. AK + HI) and the Puerto Rico extract. It then uses
6+
osmium tags-filter to extract nodes and ways matching the configured tag
7+
keys, parses the result with pyosmium into GeoDataFrames, concatenates the
8+
US + PR results, and saves as GeoParquet. Incremental: skips any PBF
9+
download or filter step whose output file already exists (controlled by
10+
overwrite_download and overwrite_filter config flags).
911
1012
Note: osmium is resolved from the conda env bin rather than the shell PATH;
1113
no manual PATH modification is needed.
1214
1315
Config keys used (config.yaml):
14-
download.osm.pbf_url — Geofabrik PBF URL (North America extract)
16+
download.osm.pbf_url — Geofabrik US PBF URL (50 states)
17+
download.osm.pr_pbf_url — Geofabrik Puerto Rico PBF URL
1518
download.osm.filter_keys — OSM tag keys to retain (e.g. amenity, shop)
1619
download.osm.extract_keys — tag keys to include as output columns
17-
download.osm.overwrite_download — re-download PBF even if it already exists
20+
download.osm.overwrite_download — re-download PBFs even if they already exist
1821
download.osm.overwrite_filter — re-run osmium filter even if output exists
1922
download.osm.source_label — value written to the "source" column
2023
download.osm.keep_all_keys — retain all discovered tag columns in output
@@ -24,7 +27,7 @@
2427
directories.snapshot_osm — output directory; also used for temp PBF files
2528
2629
Output file:
27-
osm_snapshot.parquet — GeoParquet with ~7.8M US POIs (nodes + area centroids)
30+
osm_snapshot.parquet — GeoParquet with US+PR POIs (nodes + area centroids)
2831
Columns: osm_id, osm_type, name, geometry, last_edited, source,
2932
plus all extract_keys columns
3033
"""
@@ -38,6 +41,7 @@
3841
config = Config("~/repos/openpois/config.yaml")
3942

4043
PBF_URL = config.get("download", "osm", "pbf_url")
44+
PR_PBF_URL = config.get("download", "osm", "pr_pbf_url")
4145
FILTER_KEYS = config.get("download", "osm", "filter_keys")
4246
EXTRACT_KEYS = config.get("download", "osm", "extract_keys")
4347
OVERWRITE_DOWNLOAD = config.get("download", "osm", "overwrite_download")
@@ -54,6 +58,8 @@
5458

5559
RAW_PBF = config.get_file_path("snapshot_osm", "raw_pbf")
5660
FILTERED_PBF = config.get_file_path("snapshot_osm", "filtered_pbf")
61+
RAW_PR_PBF = config.get_file_path("snapshot_osm", "raw_pr_pbf")
62+
FILTERED_PR_PBF = config.get_file_path("snapshot_osm", "filtered_pr_pbf")
5763
OUTPUT_PATH = config.get_file_path("snapshot_osm", "snapshot")
5864

5965

@@ -66,6 +72,9 @@
6672
pbf_url = PBF_URL,
6773
raw_pbf_path = RAW_PBF,
6874
filtered_pbf_path = FILTERED_PBF,
75+
pr_pbf_url = PR_PBF_URL,
76+
raw_pr_pbf_path = RAW_PR_PBF,
77+
filtered_pr_pbf_path = FILTERED_PR_PBF,
6978
output_path = OUTPUT_PATH,
7079
filter_keys = FILTER_KEYS,
7180
extract_keys = EXTRACT_KEYS,
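Two behaviours in the docstring above lend themselves to a short sketch: resolving `osmium` from the interpreter's bin directory instead of the shell PATH, and skipping a download or filter step when its output already exists. The helper names `resolve_env_binary` and `needs_run` are hypothetical; the library's actual functions may be structured differently.

```python
import sys
from pathlib import Path


def resolve_env_binary(name: str = "osmium") -> Path:
    """Return the path a tool would have next to the running interpreter.

    In a conda env the Python executable and CLI tools share a bin
    directory, so this works even when the env is not on the shell PATH.
    """
    return Path(sys.executable).parent / name


def needs_run(output_path: Path, overwrite: bool) -> bool:
    """Decide whether a step should run: always when overwriting,
    otherwise only when its output file is missing."""
    return overwrite or not output_path.exists()


OSMIUM_BIN = resolve_env_binary()
```

Applied per file, a check like `needs_run` would let the US and Puerto Rico extracts be downloaded and filtered independently, so a completed 11 GB download is not repeated when only the smaller extract is missing.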
