- `download.osm.american_oceania_history_pbf_url` → `.../australia-oceania/american-oceania-internal.osh.pbf` (covers Guam, NMI, American Samoa, plus uninhabited US Pacific possessions)
- **Auth**: OAuth; any OSM account works. Produce a Netscape-format cookie jar (browser export or Geofabrik's `oauth_cookie_client.py`). Path: `download.osm.history_cookie_file` (default `~/data/openpois/.creds/geofabrik_cookies.txt`).
- **Pipeline**: per-extract loop: `osmium tags-filter --omit-referenced` → `osmium time-filter` → pyosmium streams to intermediate `*_versions.parquet` + `*_changes.parquet`. Iterative N-way dedup (`_concat_history`) drops `(type, id)` overlap across extracts before writing the final `osm_versions.parquet` + `osm_changes.parquet`.
+
- **Per-extract failure tolerance**: if a territory's history PBF returns HTTP 404 (e.g. Geofabrik stops publishing it), the loader logs a warning and continues without that territory's history; the rater then falls back to the global-mean δ for that territory's `shared_label`s.
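
The per-extract failure tolerance described above could look roughly like the following. This is a minimal sketch, not the loader's actual code: `download_extracts`, `_http_fetch`, the logger name, and the `.osh.pbf` naming of local files are all hypothetical, and cookie-jar handling is omitted.

```python
import logging
import urllib.error
import urllib.request
from pathlib import Path
from typing import Callable

logger = logging.getLogger("openpois.history")  # hypothetical logger name


def _http_fetch(url: str, dest: Path) -> None:
    """Stream one URL to disk (OAuth cookie handling omitted in this sketch)."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        while chunk := resp.read(1 << 20):
            out.write(chunk)


def download_extracts(
    urls: dict[str, str],
    out_dir: Path,
    fetch: Callable[[str, Path], None] = _http_fetch,
) -> dict[str, Path]:
    """Download each extract; skip (with a warning) any that returns HTTP 404."""
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded: dict[str, Path] = {}
    for name, url in urls.items():
        dest = out_dir / f"{name}.osh.pbf"
        try:
            fetch(url, dest)
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise  # only a missing extract is tolerated
            logger.warning("History extract %r returned 404; continuing without it", name)
            continue
        downloaded[name] = dest
    return downloaded
```

Only 404 is swallowed here; any other HTTP error still aborts the run, which keeps auth failures (for example an expired cookie jar) visible.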
- American Oceania: `https://download.geofabrik.de/australia-oceania/american-oceania-latest.osm.pbf` (covers Guam, NMI, American Samoa, and uninhabited US Pacific possessions; Geofabrik does not publish per-territory PBFs for the inhabited western Pacific territories)
- `osmium` is in the conda env's `bin/` but **not** on shell PATH. Code resolves via `Path(sys.executable).parent / "osmium"`.
-
- Geofabrik extracts are pre-cut to admin boundaries → no polygon post-filter needed.
+
- Geofabrik extracts are pre-cut to admin boundaries → no polygon post-filter needed. `american-oceania-latest.osm.pbf` ships a few non-target uninhabited US Pacific possessions (Wake, Midway, Howland, Baker, Jarvis, Palmyra, Kingman); they contain near-zero POIs and pass through as bonus coverage.
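
Because `osmium` is not on the shell PATH, any subprocess call needs the binary's full path. A minimal sketch of the resolution rule quoted above follows; the helper names `resolve_osmium` and `osmium_cmd` and the `nwr/amenity` filter expression are illustrative assumptions, not the project's code.

```python
import sys
from pathlib import Path


def resolve_osmium() -> Path:
    """Locate osmium next to the running interpreter (the conda env's bin/)."""
    return Path(sys.executable).parent / "osmium"


def osmium_cmd(*args: str) -> list[str]:
    """Build an argv list suitable for subprocess.run(cmd, check=True)."""
    return [str(resolve_osmium()), *args]


# Illustrative call only; file names and the filter expression are made up.
cmd = osmium_cmd(
    "tags-filter", "in.osh.pbf", "nwr/amenity",
    "--omit-referenced", "-o", "out.osh.pbf",
)
```

Passing an argv list rather than a shell string avoids depending on the caller's PATH or shell quoting.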
- **Pipeline**: per-part resumable download → exact-polygon filter, all inside DuckDB. Each of the 16 `part-*.parquet` files streams through a fresh DuckDB connection into a local parquet intermediate under `.parts/<release>/`; coarse-bbox `WHERE` pushes down on Overture's `bbox` struct. Once every part is present, a final `COPY` applies `ST_Within` against the dissolved US+PR polygon and writes the GeoParquet. No pandas materialization; crashed runs resume by skipping existing intermediates.
+
- **Pipeline**: per-part resumable download → exact-polygon filter, all inside DuckDB. Each of the 16 `part-*.parquet` files streams through a fresh DuckDB connection into a local parquet intermediate under `.parts/<release>/`; coarse-bbox `WHERE` pushes down on Overture's `bbox` struct. Once every part is present, a final `COPY` applies `ST_Within` against the dissolved US + territories polygon and writes the GeoParquet. No pandas materialization; crashed runs resume by skipping existing intermediates.
- **Entry**: [src/openpois/io/overture.py](../../src/openpois/io/overture.py). Returns a `Path`, not a `GeoDataFrame`.
- **DuckDB version pin**: `environment.yml` pins `duckdb==1.4.1`. 1.4.4+ and every 1.5.x crash mid-scan on WSL2 with "Information loss on integer cast" in `HTTPFileSystem::ReadInternal`. This is tracked as DuckDB issue #21669; the fix is merged to main but not in any tagged release as of 2026-04-17. See [memory: project_duckdb_pin.md] for the bump checklist.
- **Schema quirks (as of Feb 2026 schema)**:
@@ -48,10 +53,10 @@ Reference for every external data source openpois ingests. For the workflow that
**Used by**: both snapshot downloaders (spatial clipping).
-
- **URL**: `download.general.boundary.source_url` → `https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_20m.zip` (1:20M cartographic, 50 states + DC + PR).
+
- **URL**: `download.general.boundary.source_url` → `https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_5m.zip` (1:5M cartographic, 50 states + DC + 5 inhabited territories: PR, VI, GU, MP, AS). Note: the 1:20M variant used previously does **not** include territories.
- **Auth**: none.
-
- **Pipeline**: download ZIP → cache under `directories.boundary` (first-use) → dissolve → buffer outward by `coastline_buffer_m` (default 100 m) in EPSG:6933 (equal-area, so the buffer is accurate across CONUS/AK/HI/PR).
+
- **Pipeline**: download ZIP → cache under `directories.boundary` (first-use) → dissolve → buffer outward by `coastline_buffer_m` (default 100 m) in EPSG:6933 (equal-area, so the buffer is accurate across CONUS / AK / HI / Caribbean territories / western Pacific territories).
- **Returns**: `(boundary_gdf, coarse_bboxes)`: a single-row dissolved+buffered polygon (EPSG:4326) plus a list of bboxes for predicate pushdown.
-
- **Antimeridian**: Aleutians split into two bboxes (Near Islands at +172°E vs. rest of AK at negative longitudes).
+
- **Antimeridian**: two bboxes returned, split via per-part centroid at lon=0. The negative-longitude bbox covers CONUS, AK mainland, HI, PR, USVI, and American Samoa (~170°W). The positive-longitude bbox covers the Aleutian Near Islands (~172°E), Guam (~144°E), and the Northern Mariana Islands (~145°E).
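
The two-bbox split can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the project's implementation: `split_bboxes_at_lon0` is a hypothetical name, each part's bbox midpoint stands in for its true centroid, and the coordinates are rough values chosen for the example.

```python
from typing import Iterable

BBox = tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def split_bboxes_at_lon0(part_bounds: Iterable[BBox]) -> list[BBox]:
    """Group parts by the sign of their centre longitude, then union each group."""
    west: list[BBox] = []
    east: list[BBox] = []
    for b in part_bounds:
        centre_lon = (b[0] + b[2]) / 2  # bbox midpoint used as a centroid proxy
        (west if centre_lon < 0 else east).append(b)

    merged: list[BBox] = []
    for group in (west, east):
        if group:  # skip a hemisphere with no parts
            merged.append((
                min(b[0] for b in group),
                min(b[1] for b in group),
                max(b[2] for b in group),
                max(b[3] for b in group),
            ))
    return merged


# Rough, illustrative part bounds (not taken from the Census file).
parts = [
    (-125.0, 24.0, -66.0, 50.0),   # CONUS
    (-171.0, -15.0, -168.0, -11.0),  # American Samoa
    (144.6, 13.2, 145.0, 13.7),    # Guam
    (172.3, 52.3, 174.2, 53.1),    # Aleutian Near Islands
]
coarse_bboxes = split_bboxes_at_lon0(parts)
```

Splitting at lon=0 rather than at ±180° keeps each bbox from spanning nearly the whole globe, so the pushed-down `WHERE` stays selective.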
.claude/skills/full-data-pull/SKILL.md (1 addition & 1 deletion)
@@ -5,7 +5,7 @@ description: Use when the user wants to refresh the independent POI snapshots (O
# Full data pull
-
Downloads the snapshot sources (50 US states + DC + PR) and applies the rating model to OSM so conflation can run.
+
Downloads the snapshot sources (50 US states + DC + 5 inhabited territories: PR, VI, GU, MP, AS) and applies the rating model to OSM so conflation can run.