Commit 383e0b1

Update Overture download to resolve httpfs bug in DuckDB 1.4.1 through 1.5.2.
1 parent 746ed86 commit 383e0b1

11 files changed

Lines changed: 834 additions & 213 deletions

.claude/TODO.md

Lines changed: 4 additions & 2 deletions
@@ -4,16 +4,18 @@ Short running list of in-progress / upcoming work. Edit freely; trim older compl
 
 ## In progress
 
-_(no items — add some when work is underway)_
-
 ## Upcoming
 
+- [ ] Watch for a DuckDB release that fixes the WSL2 httpfs "Information loss on integer cast" crash (issue #21669, fix PR #21395). Once a tagged release ships with the fix and a full `scripts/overture/download.py` run on WSL2 completes, we can unpin from `duckdb==1.4.1` and revert the per-part download to a single-query DuckDB scan. Added 2026-04-17.
+- [ ] Auto-check taxonomy changes whenever we switch to a new Overture Maps version (detect new/removed L0/L1/L2 categories vs. `taxonomy_crosswalk_overture_maps.csv` and flag gaps). Added 2026-04-16.
 - [ ] Watch for Overture L0/L1 → flat `basic_category` migration (~June 2026). Crosswalk CSV + `assign_overture_shared_label` will need updating. See [docs/taxonomy-setup.md](docs/taxonomy-setup.md).
 
 ## Recently done
 
 _(trim after a few weeks)_
 
+- [x] Fix: CONUS Overture download crashed DuckDB on httpfs scans — 2026-04-17. Refactored [src/openpois/io/overture.py](../src/openpois/io/overture.py) to per-part resumable download + final filter-in-DuckDB; pinned `duckdb==1.4.1` to dodge bug #21669. Full run produced 13,054,244 POIs.
+
 ---
 
 **Agent note:** When uncommitted changes are present in the repo, do not assume they belong in "In progress" here — confirm with the user first. This file is curated, not auto-synced to git status.

.claude/docs/data-sources.md

Lines changed: 3 additions & 2 deletions
@@ -34,8 +34,9 @@ Reference for every external data source openpois ingests. For the workflow that
 
 - **URL**: public S3 at `s3://overturemaps-us-west-2/`.
 - **Auth**: none (DuckDB + httpfs queries directly).
-- **Pipeline**: two-stage spatial filter — DuckDB `WHERE` clause ORs one disjunct per coarse bbox (predicate pushdown on Overture's `bbox` struct), then GeoPandas `sjoin(predicate='within')` against the exact US+PR polygon.
-- **Entry**: [src/openpois/io/overture.py](../../src/openpois/io/overture.py).
+- **Pipeline**: per-part resumable download → exact-polygon filter, all inside DuckDB. Each of the 16 `part-*.parquet` files streams through a fresh DuckDB connection into a local parquet intermediate under `.parts/<release>/`; coarse-bbox `WHERE` pushes down on Overture's `bbox` struct. Once every part is present, a final `COPY` applies `ST_Within` against the dissolved US+PR polygon and writes the GeoParquet. No pandas materialization; crashed runs resume by skipping existing intermediates.
+- **Entry**: [src/openpois/io/overture.py](../../src/openpois/io/overture.py). Returns a `Path`, not a `GeoDataFrame`.
+- **DuckDB version pin**: `environment.yml` pins `duckdb==1.4.1`. 1.4.4+ and every 1.5.x crash mid-scan on WSL2 with "Information loss on integer cast" in `HTTPFileSystem::ReadInternal` — tracked as DuckDB issue #21669, fix merged to main but not in any tagged release as of 2026-04-17. See [memory: project_duckdb_pin.md] for the bump checklist.
 - **Schema quirks (as of Feb 2026 schema)**:
 - `taxonomy` is a named STRUCT `{primary, hierarchy[], alternates[]}` — use `taxonomy.hierarchy[1]` **not** `taxonomy[1]`.
 - `brand` is a singular struct, **not** a `brands[]` array.
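
The resume behaviour described in the new **Pipeline** bullet (skip any part whose intermediate already exists) can be sketched without DuckDB. This is a minimal illustration, not the code in `src/openpois/io/overture.py`: `download_parts_resumable` and `fetch_part` are hypothetical names, and `fetch_part` stands in for whatever per-part DuckDB `COPY` the real module runs. Writing to a temporary name and renaming on success is an assumption added here so that a crashed write is never mistaken for a finished part.

```python
import os
from pathlib import Path
from typing import Callable, Iterable, List


def download_parts_resumable(
    part_names: Iterable[str],
    parts_dir: Path,
    fetch_part: Callable[[str, Path], None],
) -> List[str]:
    """Fetch every part that has no finished intermediate in parts_dir.

    fetch_part(name, tmp_path) is a hypothetical callable that writes one
    part to tmp_path (a per-part DuckDB COPY in the real pipeline).
    Returns the names that were actually fetched on this call.
    """
    parts_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in part_names:
        final_path = parts_dir / name
        if final_path.exists():
            continue  # finished on an earlier run, so skip it
        tmp_path = parts_dir / (name + ".tmp")
        fetch_part(name, tmp_path)
        # Assumption: rename only after a complete write, so a crash
        # mid-write leaves a .tmp file rather than a truncated part.
        os.replace(tmp_path, final_path)
        fetched.append(name)
    return fetched
```

Calling the function a second time with the same directory returns an empty list, which is the property that lets an interrupted run pick up where it stopped.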

.claude/scheduled_tasks.lock

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+{"sessionId":"6fe31aa6-c607-4ab5-b54d-07e6afb28372","pid":37405,"acquiredAt":1776408925552}

.claude/skills/full-data-pull/SKILL.md

Lines changed: 2 additions & 2 deletions
@@ -48,9 +48,9 @@ Downloads the three snapshot sources (50 US states + DC + PR) and applies the ra
 
 ## Verification
 
-Hand off to [skills/verify-pipeline-run](../verify-pipeline-run/SKILL.md). Baseline totals (as of 2026-04-16):
+Hand off to [skills/verify-pipeline-run](../verify-pipeline-run/SKILL.md). Baseline totals (as of 2026-04-17):
 - OSM: ~7.78M POIs
-- Overture: ~7.23M POIs
+- Overture: ~13.05M POIs (jumped from ~7.23M after widening `download.overture.taxonomy_allowlist` to include `services_and_business` + `lifestyle_services` sub-branches)
 - Foursquare: ~8.32M POIs
 
 Flag >5% drops — Foursquare in particular has had silent country-filter regressions (PR alpha-2 code quirk).
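
The ">5% drop" rule in the Verification section above can be expressed as a small check. This is only a sketch: `flag_drops` and the `BASELINE` dictionary are hypothetical names, and the baseline values are the rounded figures quoted in the skill file rather than exact counts.

```python
# Rounded baselines from the skill file (assumed, not exact row counts).
BASELINE = {
    "osm": 7_780_000,
    "overture": 13_050_000,
    "foursquare": 8_320_000,
}


def flag_drops(baseline, observed, threshold=0.05):
    """Return {source: fractional_drop} for sources down by more than threshold."""
    flagged = {}
    for source, base in baseline.items():
        drop = (base - observed[source]) / base
        if drop > threshold:
            flagged[source] = drop
    return flagged
```

An Overture count that falls back to the old ~7.23M level would be flagged against the new baseline, while a Foursquare count of 8.0M (about a 3.8% drop) would not.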

.claude/skills/verify-pipeline-run/SKILL.md

Lines changed: 3 additions & 3 deletions
@@ -9,9 +9,9 @@ Post-run QA runbook. Pick the subsection that matches what just ran.
 
 ## Snapshots (OSM / Overture / Foursquare)
 
-Baseline row counts (2026-04-16):
+Baseline row counts (2026-04-17):
 - OSM: ~7.78M
-- Overture: ~7.23M
+- Overture: ~13.05M (up from ~7.23M after widening `taxonomy_allowlist`; pre-2026-04-17 runs will be lower)
 - Foursquare: ~8.32M
 
 Check:
@@ -23,7 +23,7 @@ pd.read_parquet(path).shape[0]
 Flag >5% drops. Known regression patterns:
 - **Foursquare**: PR alpha-2 code — filter must be `country IN ('US', 'PR')`, not `'US'` only.
 - **OSM**: PR is a *separate* PBF — confirm both `us-latest.osm.pbf` and `puerto-rico-latest.osm.pbf` got downloaded, filtered, and concat'd.
-- **Overture**: coarse-bbox pushdown + exact `sjoin` — drop means the Aleutian antimeridian split was lost or the Census boundary failed to load.
+- **Overture**: coarse-bbox pushdown + final DuckDB `ST_Within` — drop means the Aleutian antimeridian split was lost or the Census boundary failed to load. If the run crashed with "Information loss on integer cast", the DuckDB pin was bumped off 1.4.1 (see [docs/data-sources.md](../../docs/data-sources.md) → Overture Maps).
 
 ## Model output
 

config.yaml

Lines changed: 7 additions & 0 deletions
@@ -65,6 +65,13 @@ download:
     release_date: null # null = auto-detect latest
     s3_bucket: "overturemaps-us-west-2"
     s3_region: "us-west-2"
+    # DuckDB resource caps for the per-part S3 scans and the final polygon
+    # filter. Peak host RAM ~= workers * memory_limit, peak CPU ~= workers *
+    # threads. Scale per-worker values down if raising workers above 1.
+    duckdb:
+      memory_limit: "4GB"
+      threads: 2
+      workers: 2
     # (L0, L1) allowlist. L1 = null means "all of this L0".
     # Entries intentionally exclude office/B2B-style L1s (corporate offices,
     # media services, etc.), transit/parking/airports (covered elsewhere), and
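
The sizing rule in the new config comment (peak host RAM ≈ workers × memory_limit, peak CPU ≈ workers × threads) can be worked through for the values in this hunk. The helper names `parse_size` and `peak_resources` are hypothetical, and treating `GB` as 10^9 bytes is an assumption about how the limit string is interpreted.

```python
import re

# Assumption: decimal units (1 GB = 10**9 bytes).
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a string such as '4GB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def peak_resources(memory_limit, threads, workers):
    """Return (peak_ram_bytes, peak_threads) under the rule in the comment."""
    return workers * parse_size(memory_limit), workers * threads


peak_ram, peak_cpu = peak_resources("4GB", threads=2, workers=2)
print(f"peak RAM ~ {peak_ram / 10**9:.0f} GB, peak threads ~ {peak_cpu}")
```

With `memory_limit: "4GB"`, `threads: 2`, and `workers: 2`, the rule gives roughly 8 GB of RAM and 4 threads at peak.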

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ dependencies:
 - coverage==7.13.4
 - cryptography==46.0.5
 - dill==0.4.1
-- duckdb==1.5.0
+- duckdb==1.4.1
 - et-xmlfile==2.0.0
 - filelock==3.20.0
 - flake8==7.3.0

scripts/overture/download.py

Lines changed: 32 additions & 13 deletions
@@ -2,10 +2,12 @@
 Download the current US+PR Overture Maps Places snapshot as a GeoParquet file.
 
 Queries Overture Maps GeoParquet files on public S3 using DuckDB's httpfs and
-spatial extensions, filtering with a two-stage spatial filter (coarse bbox
-prefilter in the DuckDB query, then exact within-polygon filter in Python
-against the US+PR Census boundary) and by L0 taxonomy category. No
-authentication required — Overture Maps data is publicly accessible.
+spatial extensions. Iterates the release's ``part-*.parquet`` files, writing a
+bounded-memory DuckDB COPY per part into a ``.parts/<release>/`` directory.
+Once every part is present, a single DuckDB COPY applies the exact US+PR
+polygon filter and writes the final GeoParquet without materializing rows in
+Python. Interrupted runs resume by skipping parts whose intermediates already
+exist. No authentication required — Overture Maps data is publicly accessible.
 
 Auto-detects the latest available Overture release from S3 unless a specific
 release_date is pinned in config.yaml.
@@ -15,6 +17,9 @@
 download.overture.s3_bucket — Overture Maps S3 bucket name
 download.overture.s3_region — AWS region of the Overture bucket
 download.overture.taxonomy_allowlist — list of [L0, L1] pairs; L1 null = any
+download.overture.duckdb.memory_limit — per-connection DuckDB memory cap
+download.overture.duckdb.threads — per-connection DuckDB thread count
+download.overture.duckdb.workers — parallel part downloads (must be 1)
 download.general.boundary.source_url — Census state-boundary zip URL
 download.general.boundary.coastline_buffer_m — outward coastline buffer (m)
 directories.boundary — cache directory for boundary file
@@ -25,6 +30,7 @@
 Columns: overture_id, overture_name, taxonomy_l0, taxonomy_l1,
 taxonomy_l2, brand_name, confidence, geometry, source
 """
+import pyarrow.parquet as pq
 from config_versioned import Config
 from openpois.io.boundary import get_us_pr_boundary
 from openpois.io.overture import download_overture_snapshot
@@ -40,6 +46,15 @@
 S3_BUCKET = config.get("download", "overture", "s3_bucket")
 S3_REGION = config.get("download", "overture", "s3_region")
 TAXONOMY_ALLOWLIST = config.get("download", "overture", "taxonomy_allowlist")
+DUCKDB_MEMORY_LIMIT = config.get(
+    "download", "overture", "duckdb", "memory_limit", fail_if_none=False
+) or "4GB"
+DUCKDB_THREADS = config.get(
+    "download", "overture", "duckdb", "threads", fail_if_none=False
+) or 2
+DUCKDB_WORKERS = config.get(
+    "download", "overture", "duckdb", "workers", fail_if_none=False
+) or 2
 BOUNDARY_URL = config.get("download", "general", "boundary", "source_url")
 COASTLINE_BUFFER_M = config.get(
     "download", "general", "boundary", "coastline_buffer_m"
@@ -62,13 +77,17 @@
     cache_dir = BOUNDARY_DIR,
     coastline_buffer_m = COASTLINE_BUFFER_M,
 )
-gdf = download_overture_snapshot(
-    output_path=OUTPUT_PATH,
-    taxonomy_allowlist=TAXONOMY_ALLOWLIST,
-    boundary_gdf=boundary_gdf,
-    coarse_bboxes=coarse_bboxes,
-    bucket=S3_BUCKET,
-    s3_region=S3_REGION,
-    release_date=RELEASE_DATE,
+output_path = download_overture_snapshot(
+    output_path = OUTPUT_PATH,
+    taxonomy_allowlist = TAXONOMY_ALLOWLIST,
+    boundary_gdf = boundary_gdf,
+    coarse_bboxes = coarse_bboxes,
+    bucket = S3_BUCKET,
+    s3_region = S3_REGION,
+    release_date = RELEASE_DATE,
+    duckdb_memory_limit = DUCKDB_MEMORY_LIMIT,
+    duckdb_threads = DUCKDB_THREADS,
+    workers = DUCKDB_WORKERS,
 )
-print(f"Saved {len(gdf):,} Overture POIs to {OUTPUT_PATH}")
+n_rows = pq.read_metadata(output_path).num_rows
+print(f"Saved {n_rows:,} Overture POIs to {output_path}")

site/src/constants.js

Lines changed: 14 additions & 1 deletion
@@ -66,19 +66,22 @@ export const CONFLATED_LABELS = [
   'Alternative Medicine',
   'Arcade',
   'Arts Venue',
+  'ATM',
   'Bakery',
   'Bank',
   'Bar',
   'Bike shop',
   'Bookstore',
   'Bowling Alley',
   'Cafe',
+  'Campground',
   'Car Dealer',
   'Car Rental',
   'Car Repair',
   'Car Wash',
   'Casino',
   'Cell Phone Store',
+  'Cemetery',
   'Charging Station',
   'Clinic',
   'Clothing Store',
@@ -99,10 +102,15 @@ export const CONFLATED_LABELS = [
   'Garden Store',
   'Gas Station',
   'Golf Course',
+  'Government Office',
+  'Hair and Beauty',
   'Hardware',
+  'Home Service',
+  'Hotel',
   'Jewelry Store',
   'Kindergarten',
   'Laundromat',
+  'Legal Service',
   'Library',
   'Liquor Store',
   'Marina',
@@ -119,11 +127,14 @@ export const CONFLATED_LABELS = [
   'Pet Store',
   'Pharmacy',
   'Physical Therapy',
+  'Place of Worship',
   'Playground',
   'Post Office',
+  'Public Restroom',
+  'Public Safety',
+  'Real Estate',
   'Recreation',
   'Restaurant',
-  'Salon and Hair',
   'School',
   'Shoe Store',
   'Shopping Center',
@@ -141,7 +152,9 @@ export const CONFLATED_LABELS = [
   'Wholesale Store',
   // "Other" categories last, unchecked by default
   'Other Amenity',
+  'Other Financial',
   'Other Healthcare',
+  'Other Professional',
   'Other Shop',
 ]
 