Commit 2335dfc

Merge pull request #31 from henryspatialanalysis/feature/us-overseas-territories
Expand project scope to US + all overseas territories.
2 parents dd2166a + 45cbd55 commit 2335dfc

15 files changed

Lines changed: 808 additions & 399 deletions

.claude/docs/data-sources.md

Lines changed: 15 additions & 10 deletions
@@ -6,35 +6,40 @@ Reference for every external data source openpois ingests. For the workflow that
66

77
**Used by**: the historical modeling pipeline ([skills/model-history-pipeline](../skills/model-history-pipeline/SKILL.md)).
88

9-
- **URLs**:
9+
- **URLs** (passed via `HistoryExtract` specs from `scripts/osm_data/download_history.py`):
1010
- `download.osm.history_pbf_url` → `https://osm-internal.download.geofabrik.de/north-america/us-internal.osh.pbf`
11-
- `download.osm.pr_history_pbf_url` → `.../us/puerto-rico-internal.osh.pbf`
11+
- `download.osm.pr_history_pbf_url` → `.../north-america/us/puerto-rico-internal.osh.pbf`
12+
- `download.osm.usvi_history_pbf_url` → `.../north-america/us/us-virgin-islands-internal.osh.pbf`
13+
- `download.osm.american_oceania_history_pbf_url` → `.../australia-oceania/american-oceania-internal.osh.pbf` (covers Guam, NMI, American Samoa, plus uninhabited US Pacific possessions)
1214
- **Auth**: OAuth — any OSM account works. Produce a Netscape-format cookie jar (browser export or Geofabrik's `oauth_cookie_client.py`). Path: `download.osm.history_cookie_file` (default `~/data/openpois/.creds/geofabrik_cookies.txt`).
13-
- **Pipeline**: `osmium tags-filter --omit-referenced` → `osmium time-filter` → pyosmium streams to `osm_versions.parquet` + `osm_changes.parquet`.
15+
- **Pipeline**: per-extract loop: `osmium tags-filter --omit-referenced` → `osmium time-filter` → pyosmium streams to intermediate `*_versions.parquet` + `*_changes.parquet`. Iterative N-way dedup (`_concat_history`) drops `(type, id)` overlap across extracts before writing the final `osm_versions.parquet` + `osm_changes.parquet`.
16+
- **Per-extract failure tolerance**: if a territory's history PBF returns HTTP 404 (e.g. Geofabrik stops publishing it), the loader logs a warning and continues without that territory's history; the rater then falls back to the global-mean δ for that territory's `shared_label`s.
1417
- **Entry**: [src/openpois/io/osm_history_pbf.py](../../src/openpois/io/osm_history_pbf.py) (`download_osm_history`).
1518
- **Config**: `download.osm.start_date`, `end_date`, `filter_keys`, `extract_keys`.
1619

1720
## OSM snapshot (Geofabrik standard PBFs)
1821

1922
**Used by**: current-state snapshot (`osm_snapshot.parquet`).
2023

21-
- **URLs**:
24+
- **URLs** (passed via `SnapshotExtract` specs from `scripts/osm_snapshot/download.py`):
2225
- US: `https://download.geofabrik.de/north-america/us-latest.osm.pbf` (~11 GB, 50 states incl. AK+HI)
2326
- PR: `https://download.geofabrik.de/north-america/us/puerto-rico-latest.osm.pbf` (**PR is not in the US extract**)
27+
- USVI: `https://download.geofabrik.de/north-america/us/us-virgin-islands-latest.osm.pbf`
28+
- American Oceania: `https://download.geofabrik.de/australia-oceania/american-oceania-latest.osm.pbf` (covers Guam, NMI, American Samoa, and uninhabited US Pacific possessions; Geofabrik does not publish per-territory PBFs for the inhabited western Pacific territories)
2429
- **Auth**: none (public).
25-
- **Pipeline**: `osmium tags-filter` → pyosmium parse → concat US+PR → GeoParquet.
30+
- **Pipeline**: per-extract loop — `osmium tags-filter` → pyosmium parse → write intermediate parquets → concat all intermediates → GeoParquet.
2631
- **Entry**: [src/openpois/io/osm_snapshot.py](../../src/openpois/io/osm_snapshot.py).
2732
- **Quirks**:
2833
- `osmium` is in the conda env's `bin/` but **not** on shell PATH. Code resolves via `Path(sys.executable).parent / "osmium"`.
29-
- Geofabrik extracts are pre-cut to admin boundaries → no polygon post-filter needed.
34+
- Geofabrik extracts are pre-cut to admin boundaries → no polygon post-filter needed. `american-oceania-latest.osm.pbf` ships a few non-target uninhabited US Pacific possessions (Wake, Midway, Howland, Baker, Jarvis, Palmyra, Kingman); they contain near-zero POIs and pass through as bonus coverage.
3035

3136
## Overture Maps
3237

3338
**Used by**: current-state Overture snapshot (`overture_snapshot.parquet`).
3439

3540
- **URL**: public S3 at `s3://overturemaps-us-west-2/`.
3641
- **Auth**: none (DuckDB + httpfs queries directly).
37-
- **Pipeline**: per-part resumable download → exact-polygon filter, all inside DuckDB. Each of the 16 `part-*.parquet` files streams through a fresh DuckDB connection into a local parquet intermediate under `.parts/<release>/`; coarse-bbox `WHERE` pushes down on Overture's `bbox` struct. Once every part is present, a final `COPY` applies `ST_Within` against the dissolved US+PR polygon and writes the GeoParquet. No pandas materialization; crashed runs resume by skipping existing intermediates.
42+
- **Pipeline**: per-part resumable download → exact-polygon filter, all inside DuckDB. Each of the 16 `part-*.parquet` files streams through a fresh DuckDB connection into a local parquet intermediate under `.parts/<release>/`; coarse-bbox `WHERE` pushes down on Overture's `bbox` struct. Once every part is present, a final `COPY` applies `ST_Within` against the dissolved US + territories polygon and writes the GeoParquet. No pandas materialization; crashed runs resume by skipping existing intermediates.
3843
- **Entry**: [src/openpois/io/overture.py](../../src/openpois/io/overture.py). Returns a `Path`, not a `GeoDataFrame`.
3944
- **DuckDB version pin**: `environment.yml` pins `duckdb==1.4.1`. 1.4.4+ and every 1.5.x crash mid-scan on WSL2 with "Information loss on integer cast" in `HTTPFileSystem::ReadInternal` — tracked as DuckDB issue #21669, fix merged to main but not in any tagged release as of 2026-04-17. See [memory: project_duckdb_pin.md] for the bump checklist.
4045
- **Schema quirks (as of Feb 2026 schema)**:
@@ -48,10 +53,10 @@ Reference for every external data source openpois ingests. For the workflow that
4853

4954
**Used by**: both snapshot downloaders (spatial clipping).
5055

51-
- **URL**: `download.general.boundary.source_url` → `https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_20m.zip` (1:20M cartographic, 50 states + DC + PR).
56+
- **URL**: `download.general.boundary.source_url` → `https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_5m.zip` (1:5M cartographic, 50 states + DC + 5 inhabited territories: PR, VI, GU, MP, AS). Note: the 1:20M variant used previously does **not** include territories.
5257
- **Auth**: none.
53-
- **Pipeline**: download ZIP → cache under `directories.boundary` (first-use) → dissolve → buffer outward by `coastline_buffer_m` (default 100 m) in EPSG:6933 (equal-area, so buffer accurate across CONUS/AK/HI/PR).
58+
- **Pipeline**: download ZIP → cache under `directories.boundary` (first-use) → dissolve → buffer outward by `coastline_buffer_m` (default 100 m) in EPSG:6933 (equal-area, so buffer accurate across CONUS / AK / HI / Caribbean territories / western Pacific territories).
5459
- **Entry**: [src/openpois/io/boundary.py](../../src/openpois/io/boundary.py) (`get_us_pr_boundary`).
5560
- **Returns**: `(boundary_gdf, coarse_bboxes)` — single-row dissolved+buffered polygon (EPSG:4326) plus a list of bboxes for predicate pushdown.
56-
- **Antimeridian**: Aleutians split into two bboxes (Near Islands at +172°E vs. rest of AK at negative longitudes).
61+
- **Antimeridian**: two bboxes returned, split via per-part centroid at lon=0. The negative-longitude bbox covers CONUS, AK mainland, HI, PR, USVI, and American Samoa (~170°W). The positive-longitude bbox covers the Aleutian Near Islands (~172°E), Guam (~144°E), and the Northern Mariana Islands (~145°E).
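
The two-bbox split described in the Antimeridian bullet can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `boundary.py`: the function name `split_bboxes_by_hemisphere` is hypothetical, each polygon part is reduced to a list of `(lon, lat)` vertices, the centroid longitude is approximated by the vertex mean rather than a true area-weighted centroid, and the sample coordinates are rough.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]                # (lon, lat) in EPSG:4326
BBox = Tuple[float, float, float, float]   # (min_lon, min_lat, max_lon, max_lat)


def split_bboxes_by_hemisphere(parts: List[List[Point]]) -> Dict[str, BBox]:
    """Group polygon parts by the sign of their mean longitude; one bbox per group."""
    groups: Dict[str, List[Point]] = {"west": [], "east": []}
    for ring in parts:
        # Vertex mean as a cheap stand-in for the part's centroid longitude.
        mean_lon = sum(lon for lon, _ in ring) / len(ring)
        groups["west" if mean_lon < 0 else "east"].extend(ring)

    bboxes: Dict[str, BBox] = {}
    for key, points in groups.items():
        if not points:
            continue  # no parts fell in this hemisphere
        lons = [lon for lon, _ in points]
        lats = [lat for _, lat in points]
        bboxes[key] = (min(lons), min(lats), max(lons), max(lats))
    return bboxes


# Rough, illustrative rectangles standing in for real boundary parts.
parts = [
    [(-125.0, 24.0), (-66.0, 24.0), (-66.0, 49.0), (-125.0, 49.0)],        # CONUS-like
    [(-171.0, -14.6), (-169.0, -14.6), (-169.0, -14.0), (-171.0, -14.0)],  # American Samoa
    [(144.6, 13.2), (145.0, 13.2), (145.0, 13.7), (144.6, 13.7)],          # Guam
    [(172.4, 52.7), (173.5, 52.7), (173.5, 53.0), (172.4, 53.0)],          # Near Islands
]
bboxes = split_bboxes_by_hemisphere(parts)
```

Because neither resulting bbox crosses ±180°, each one can be used as an ordinary min/max range filter for predicate pushdown.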
5762

.claude/skills/full-data-pull/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ description: Use when the user wants to refresh the independent POI snapshots (O
55

66
# Full data pull
77

8-
Downloads the snapshot sources (50 US states + DC + PR) and applies the rating model to OSM so conflation can run.
8+
Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR, VI, GU, MP, AS) and applies the rating model to OSM so conflation can run.
99

1010
## Prerequisites
1111

README.md

Lines changed: 6 additions & 3 deletions
@@ -1,7 +1,9 @@
11
# OpenPOIs
22

3-
A unified, confidence-scored open dataset of U.S. points of interest, built
4-
from [OpenStreetMap](https://www.openstreetmap.org) and
3+
A unified, confidence-scored open dataset of U.S. points of interest —
4+
covering the 50 states, DC, and the 5 inhabited U.S. territories (Puerto
5+
Rico, US Virgin Islands, Guam, Northern Mariana Islands, American Samoa) —
6+
built from [OpenStreetMap](https://www.openstreetmap.org) and
57
[Overture Maps](https://overturemaps.org).
68

79
![OpenPOIs interactive map](docs/_static/hero.png)
@@ -22,7 +24,8 @@ OpenPOIs conflates points of interest from OpenStreetMap and Overture Maps
2224
into a single unified dataset, then attaches a per-POI confidence score
2325
estimating the probability that the place still exists. Confidence comes from
2426
a Bayesian turnover model fit on OSM tag-edit history. The published dataset
25-
covers the United States and is refreshed monthly, following the Overture Maps monthly release cycle.
27+
covers the United States and its 5 inhabited territories, and is refreshed
28+
monthly, following the Overture Maps monthly release cycle.
2629

2730
This repository contains the Python library used to produce the data, the
2831
end-to-end pipelines that download and conflate sources, and the Vue

config.yaml

Lines changed: 39 additions & 8 deletions
@@ -15,25 +15,42 @@ versions:
1515
download:
1616
general:
1717
timeout: 1_000
18-
# Census 1:20M cartographic state boundary file; includes 50 states + DC
19-
# + PR. Used by all three snapshot downloads to restrict POIs to the US
20-
# plus Puerto Rico. The coastline buffer expands the dissolved polygon
21-
# outward by N metres so near-shore POIs are retained; internal state
22-
# borders disappear on dissolve so the buffer only affects the coast.
18+
# Census 1:5M cartographic state boundary file; includes 50 states + DC
19+
# + 5 inhabited US territories (PR, USVI, GU, MP, AS). The 1:20M variant
20+
# used previously only covered states + DC + PR. Used by all three
21+
# snapshot downloads to restrict POIs to the US footprint. The coastline
22+
# buffer expands the dissolved polygon outward by N metres so near-shore
23+
# POIs are retained; internal state borders disappear on dissolve so the
24+
# buffer only affects the coast.
2325
boundary:
24-
source_url: "https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_20m.zip"
26+
source_url: "https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_5m.zip"
2527
coastline_buffer_m: 100
2628
osm:
2729
start_date: 2016-01-01
2830
end_date: 2025-12-31
31+
# Snapshot extracts. Geofabrik publishes a dedicated PBF for PR and USVI
32+
# under north-america/us/, but the western-Pacific inhabited territories
33+
# (Guam, Northern Mariana Islands, American Samoa) have no per-territory
34+
# files. They are bundled into a single `american-oceania` extract that
35+
# also includes the uninhabited US Pacific possessions (Wake, Midway,
36+
# Howland, Baker, Jarvis, Palmyra, Kingman) — those contribute near-zero
37+
# POIs and are accepted as bonus coverage.
2938
pbf_url: "https://download.geofabrik.de/north-america/us-latest.osm.pbf"
3039
pr_pbf_url: "https://download.geofabrik.de/north-america/us/puerto-rico-latest.osm.pbf"
40+
usvi_pbf_url: "https://download.geofabrik.de/north-america/us/us-virgin-islands-latest.osm.pbf"
41+
american_oceania_pbf_url: "https://download.geofabrik.de/australia-oceania/american-oceania-latest.osm.pbf"
3142
# Full-history PBFs live on Geofabrik's OAuth-protected internal server.
3243
# Any OSM account grants access; generate a Netscape-format cookie jar by
3344
# logging in at https://osm-internal.download.geofabrik.de/ and exporting
34-
# cookies, or by running Geofabrik's oauth_cookie_client.py.
45+
# cookies, or by running Geofabrik's oauth_cookie_client.py. The
46+
# `usvi_*` and `american_oceania_*` history URLs follow the same path
47+
# convention as the snapshot URLs; if any are missing on the server, the
48+
# history loader logs a warning and continues without that territory's
49+
# history (the rater then falls back to the global-mean delta).
3550
history_pbf_url: "https://osm-internal.download.geofabrik.de/north-america/us-internal.osh.pbf"
3651
pr_history_pbf_url: "https://osm-internal.download.geofabrik.de/north-america/us/puerto-rico-internal.osh.pbf"
52+
usvi_history_pbf_url: "https://osm-internal.download.geofabrik.de/north-america/us/us-virgin-islands-internal.osh.pbf"
53+
american_oceania_history_pbf_url: "https://osm-internal.download.geofabrik.de/australia-oceania/american-oceania-internal.osh.pbf"
3754
history_cookie_file: "~/data/openpois/.creds/geofabrik_cookies.txt"
3855
overwrite_download: true
3956
overwrite_filter: true
@@ -139,7 +156,7 @@ directories:
139156
versioned: true
140157
path: ~/data/openpois/osm_data
141158
files:
142-
# US+PR full-history pipeline (PBF-based)
159+
# US + territories full-history pipeline (PBF-based)
143160
osm_changes: osm_changes.parquet
144161
osm_versions: osm_versions.parquet
145162
raw_history_pbf: us-internal.osh.pbf
@@ -148,10 +165,20 @@ directories:
148165
raw_pr_history_pbf: puerto-rico-internal.osh.pbf
149166
filtered_pr_history_pbf: puerto-rico-pois.osh.pbf
150167
time_filtered_pr_history_pbf: puerto-rico-pois-timefilt.osh.pbf
168+
raw_usvi_history_pbf: us-virgin-islands-internal.osh.pbf
169+
filtered_usvi_history_pbf: us-virgin-islands-pois.osh.pbf
170+
time_filtered_usvi_history_pbf: us-virgin-islands-pois-timefilt.osh.pbf
171+
raw_american_oceania_history_pbf: american-oceania-internal.osh.pbf
172+
filtered_american_oceania_history_pbf: american-oceania-pois.osh.pbf
173+
time_filtered_american_oceania_history_pbf: american-oceania-pois-timefilt.osh.pbf
151174
us_versions: us_osm_versions.parquet
152175
us_changes: us_osm_changes.parquet
153176
pr_versions: pr_osm_versions.parquet
154177
pr_changes: pr_osm_changes.parquet
178+
usvi_versions: usvi_osm_versions.parquet
179+
usvi_changes: usvi_osm_changes.parquet
180+
american_oceania_versions: american_oceania_osm_versions.parquet
181+
american_oceania_changes: american_oceania_osm_changes.parquet
155182
# Modelling-ready observations (one row per POI version × shared_label)
156183
osm_observations: osm_observations.parquet
157184
model_output:
@@ -171,6 +198,10 @@ directories:
171198
filtered_pbf: us-pois.osm.pbf
172199
raw_pr_pbf: puerto-rico-latest.osm.pbf
173200
filtered_pr_pbf: puerto-rico-pois.osm.pbf
201+
raw_usvi_pbf: us-virgin-islands-latest.osm.pbf
202+
filtered_usvi_pbf: us-virgin-islands-pois.osm.pbf
203+
raw_american_oceania_pbf: american-oceania-latest.osm.pbf
204+
filtered_american_oceania_pbf: american-oceania-pois.osm.pbf
174205
snapshot: osm_snapshot.parquet
175206
rated_snapshot: osm_snapshot_rated.parquet
176207
partitioned: osm_snapshot_partitioned
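
The snapshot keys added in this file follow a visible naming pattern: the main US extract uses unprefixed keys (`pbf_url`, `filtered_pbf`), and each territory extract inserts its prefix (`usvi_pbf_url`, `raw_usvi_pbf`, `filtered_usvi_pbf`). A minimal sketch of that convention is shown here; the helper `snapshot_keys` is hypothetical and not part of openpois, and the unprefixed `raw_pbf` key is assumed by analogy because it does not appear in the hunks above.

```python
from typing import Dict, Optional


def snapshot_keys(prefix: Optional[str]) -> Dict[str, str]:
    """Return the config key names for one snapshot extract.

    `prefix=None` stands for the main US extract, whose keys carry no prefix.
    """
    infix = f"{prefix}_" if prefix else ""
    return {
        "url": f"{infix}pbf_url",             # lives under download.osm
        "raw": f"raw_{infix}pbf",             # lives under the snapshot directory's files
        "filtered": f"filtered_{infix}pbf",
    }


# One entry per extract named in the config comments.
all_keys = {
    name: snapshot_keys(prefix)
    for name, prefix in [
        ("us", None),
        ("pr", "pr"),
        ("usvi", "usvi"),
        ("american_oceania", "american_oceania"),
    ]
}
```

Generating key names this way is only one option; listing them explicitly, as the config does, keeps every path greppable.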

docs/api.rst

Lines changed: 8 additions & 5 deletions
@@ -54,11 +54,14 @@ io
5454
openpois.io.osm_history_pbf
5555
~~~~~~~~~~~~~~~~~~~~~~~~~~~
5656

57-
Download Geofabrik full-history PBFs (US + Puerto Rico), filter to POI tags
58-
with ``osmium tags-filter``, time-window with ``osmium time-filter``, and parse
59-
with pyosmium into per-version and per-change Parquet tables suitable for the
60-
change-rate model. Uses an OAuth cookie jar against Geofabrik's internal
61-
server.
57+
Download Geofabrik full-history PBFs (US + inhabited territories: Puerto
58+
Rico, US Virgin Islands, plus Guam / NMI / American Samoa via the
59+
``american-oceania`` extract), filter to POI tags with ``osmium tags-filter``,
60+
time-window with ``osmium time-filter``, and parse with pyosmium into
61+
per-version and per-change Parquet tables suitable for the change-rate
62+
model. Uses an OAuth cookie jar against Geofabrik's internal server.
63+
Per-extract failure tolerance: missing-on-server (HTTP 404) territory PBFs
64+
are skipped with a warning rather than aborting the run.
6265

6366
.. automodule:: openpois.io.osm_history_pbf
6467
:members:

docs/workflows.rst

Lines changed: 13 additions & 8 deletions
@@ -83,9 +83,10 @@ saving snippet CSVs to the ``testing/`` directory for column inspection.
8383
Pipeline 2: OSM Historical Change-Rate Model
8484
--------------------------------------------
8585

86-
This pipeline downloads OpenStreetMap full-history PBFs (US + Puerto Rico)
87-
and fits a Poisson change-rate model to estimate how quickly different POI
88-
categories become outdated.
86+
This pipeline downloads OpenStreetMap full-history PBFs (US + inhabited
87+
territories: Puerto Rico, US Virgin Islands, plus Guam / NMI / American
88+
Samoa via the ``american-oceania`` extract) and fits a Poisson change-rate
89+
model to estimate how quickly different POI categories become outdated.
8990

9091
**Step 1 — Download full-history PBFs**
9192

@@ -94,11 +95,15 @@ categories become outdated.
9495
python scripts/osm_data/download_history.py
9596
9697
Requires the Geofabrik OAuth cookie jar described in *Prerequisites* above.
97-
Downloads the US-mainland and Puerto Rico full-history extracts, filters
98-
each with ``osmium tags-filter`` (POI tag keys only) and ``osmium
99-
time-filter`` (the ``download.osm.start_date`` / ``end_date`` window), then
100-
parses with pyosmium into per-version and per-change Parquet tables.
101-
Outputs: ``osm_versions.parquet`` and ``osm_changes.parquet``.
98+
Downloads each Geofabrik full-history extract in turn, filters each with
99+
``osmium tags-filter`` (POI tag keys only) and ``osmium time-filter`` (the
100+
``download.osm.start_date`` / ``end_date`` window), then parses with
101+
pyosmium into per-extract Parquet tables and concatenates with iterative
102+
``(type, id)`` dedup. If a territory's history PBF is missing on the server
103+
(HTTP 404), the loader logs a warning and continues; that territory's
104+
snapshot/Overture coverage is unaffected and the rater falls back to the
105+
global-mean delta. Outputs: ``osm_versions.parquet`` and
106+
``osm_changes.parquet``.
102107

103108
See :mod:`openpois.io.osm_history_pbf`.
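
The iterative ``(type, id)`` dedup mentioned in Step 1 can be sketched without any dataframe library. This is a simplified illustration and not the project's `_concat_history`: the function name `concat_history` is hypothetical, rows are plain dictionaries, and the rule that the earlier extract wins on overlap is an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]


def concat_history(extracts: Iterable[List[Row]]) -> List[Row]:
    """Concatenate extracts, dropping elements already supplied by an earlier extract."""
    seen: Set[Tuple[object, object]] = set()
    combined: List[Row] = []
    for rows in extracts:
        # Keys from the current extract are held back until the extract is
        # finished, so multiple versions of one element are all kept.
        new_keys: Set[Tuple[object, object]] = set()
        for row in rows:
            key = (row["type"], row["id"])
            if key in seen:
                continue  # element already covered by an earlier extract
            combined.append(row)
            new_keys.add(key)
        seen |= new_keys
    return combined


us = [
    {"type": "node", "id": 1, "version": 1},
    {"type": "node", "id": 1, "version": 2},
]
pr = [
    {"type": "node", "id": 1, "version": 2},  # overlaps the US extract, dropped
    {"type": "node", "id": 7, "version": 1},
    {"type": "way", "id": 1, "version": 1},   # same id but different type, kept
]
merged = concat_history([us, pr])
```

Keying on the pair rather than on `id` alone matters because OSM nodes, ways, and relations each have their own id space.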
104109
