Commit 76da170

Add version 1 of a conflation script, which matches on (a) proximity, (b) name, (c) data type, and (d) unique identifiers.
1 parent 5674d7a commit 76da170

17 files changed

Lines changed: 3078 additions & 0 deletions

config.yaml

Lines changed: 31 additions & 0 deletions
@@ -6,6 +6,7 @@ versions:
66
snapshot_overture: "20260313"
77
snapshot_foursquare: "20260313"
88
aws: "20260313"
9+
conflation: "20260316"
910

1011
# Settings for downloading data
1112
download:
@@ -111,6 +112,7 @@ directories:
111112
path: ~/data/openpois/snapshots/foursquare
112113
files:
113114
snapshot: foursquare_snapshot.parquet
115+
partitioned: foursquare_snapshot_partitioned
114116
snapshot_osm:
115117
versioned: true
116118
path: ~/data/openpois/snapshots/osm
@@ -120,11 +122,20 @@ directories:
120122
snapshot: osm_snapshot.parquet
121123
rated_snapshot: osm_snapshot_rated.parquet
122124
partitioned: osm_snapshot_partitioned
125+
pmtiles: osm_snapshot.pmtiles
123126
snapshot_overture:
124127
versioned: true
125128
path: ~/data/openpois/snapshots/overture
126129
files:
127130
snapshot: overture_snapshot.parquet
131+
conflation:
132+
versioned: true
133+
path: ~/data/openpois/conflation
134+
files:
135+
conflated: conflated.parquet
136+
match_diagnostics: match_diagnostics.parquet
137+
partitioned: conflated_partitioned
138+
summary_by_label: summary_by_label.csv
128139
testing:
129140
versioned: false
130141
path: ~/data/openpois/testing
@@ -133,6 +144,23 @@ directories:
133144
overture_snippet: overture_snippet.csv
134145
foursquare_snippet: foursquare_snippet.csv
135146

147+
# Settings for POI conflation
148+
conflation:
149+
overture_confidence_weight: 0.7
150+
min_match_score: 0.50
151+
max_radius_m: 200
152+
default_radius_m: 100
153+
distance_weight: 0.0
154+
name_weight: 0.50
155+
type_weight: 0.30
156+
identifier_weight: 0.20
157+
chunk_size: 500_000
158+
test_bbox:
159+
xmin: -122.45
160+
ymin: 47.50
161+
xmax: -122.25
162+
ymax: 47.70
163+
136164
# Settings for uploading snapshots to S3
137165
upload:
138166
s3_region: "us-west-2"
@@ -141,3 +169,6 @@ upload:
141169
full_url: "https://openpois-public.s3.us-west-2.amazonaws.com/snapshots/osm/20260313/osm_snapshot_partitioned/"
142170
geohash_precision_partition: 4 # ~39 km x 20 km cells; ~1,000–3,000 cells over CONUS
143171
geohash_precision_sort: 6 # ~0.6 km x 1.2 km; fine-grained sort within each partition
172+
pmtiles:
173+
min_zoom: 14
174+
max_zoom: 16

environment.yml

Lines changed: 1 addition & 0 deletions
@@ -201,6 +201,7 @@ dependencies:
201201
- filelock==3.20.0
202202
- flake8==7.3.0
203203
- flake8-pyproject==1.2.4
204+
- freestiler==0.1.5
204205
- fsspec==2025.12.0
205206
- furo==2025.12.19
206207
- fuzzywuzzy==0.18.0

exploratory/conflation/README.md

Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
1+
# POI Conflation: OSM + Overture Maps
2+
3+
This pipeline conflates rated OpenStreetMap POIs with Overture Maps POIs into a
4+
unified superset with blended confidence scores. The output includes matched
5+
pairs, unmatched OSM POIs, and unmatched Overture POIs.
6+
7+
## Usage
8+
9+
```bash
10+
# Full run (~15M POIs, ~16 GB RAM)
11+
python exploratory/conflation/conflate.py
12+
13+
# Test mode (Seattle bbox, ~30k + ~19k POIs)
14+
python exploratory/conflation/conflate.py --test
15+
```
16+
17+
Output: `~/data/openpois/conflation/{version}/conflated.parquet`
18+
19+
## Algorithm Overview
20+
21+
### 1. Taxonomy Crosswalk
22+
23+
A CSV file (`src/openpois/conflation/data/taxonomy_crosswalk.csv`) maps between
24+
OSM tag key/value pairs and Overture taxonomy categories. Each row defines:
25+
26+
| Column | Description |
27+
|--------|-------------|
28+
| `osm_key` | OSM tag key (amenity, shop, healthcare, leisure) |
29+
| `osm_value` | OSM tag value (e.g. restaurant, supermarket) |
30+
| `overture_l0` | Overture L0 category (e.g. food_and_drink) |
31+
| `overture_l1` | Overture L1 category (e.g. restaurant) |
32+
| `poi_category` | Unified category label |
33+
| `match_radius_m` | Maximum spatial match distance (meters) |
34+
35+
OSM POIs are assigned categories using the filter key priority order (shop >
36+
healthcare > leisure > amenity). If the specific tag value is not in the
37+
crosswalk, a wildcard (`*`) fallback for that key is used (default 50m radius).
38+
39+
Overture POIs are matched by (taxonomy_l0, taxonomy_l1), falling back to
40+
l0-only match when l1 is missing.
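
The OSM lookup order described above can be sketched as follows. This is a minimal illustration, not the code in `taxonomy.py`: the `CROSSWALK` rows, the `assign_osm_category` name, and the choice to return `None` when neither an exact row nor a wildcard row exists are all assumptions made for the example.

```python
from typing import Optional

# Key priority order stated in the README: shop > healthcare > leisure > amenity.
KEY_PRIORITY = ("shop", "healthcare", "leisure", "amenity")

# Hypothetical crosswalk rows: (osm_key, osm_value) -> (poi_category, match_radius_m).
CROSSWALK = {
    ("shop", "supermarket"): ("supermarket", 50),
    ("shop", "*"): ("shopping_other", 50),
    ("amenity", "restaurant"): ("restaurant", 50),
    ("amenity", "*"): ("amenity_other", 50),
}


def assign_osm_category(tags: dict) -> Optional[tuple]:
    """Return (poi_category, match_radius_m) for the highest-priority key present."""
    for key in KEY_PRIORITY:
        value = tags.get(key)
        if value is None:
            continue
        # Try the exact (key, value) row first, then the wildcard row for that key.
        # Assumption: if neither exists, the POI is left unmapped (None).
        return CROSSWALK.get((key, value)) or CROSSWALK.get((key, "*"))
    return None
```

For example, a POI tagged `shop=florist` and `amenity=restaurant` resolves through the `shop` wildcard row, because `shop` outranks `amenity`.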
41+
42+
### 2. Spatial Candidate Search
43+
44+
A scikit-learn `BallTree` with haversine metric is built on all Overture POI
45+
centroids. OSM POIs are queried in chunks of 500k to control memory. For each
46+
OSM POI, all Overture POIs within that POI's category-specific match radius are
47+
returned as candidates.
48+
49+
Match radii vary by POI type:
50+
- Private businesses (restaurants, shops): ~50m
51+
- Mid-size facilities (clinics, sports centres): ~75-100m
52+
- Areal features (parks, hospitals, stadiums): ~150-200m
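
To make the per-POI radius semantics concrete, here is a brute-force stand-in written with the standard library only. It is not the `BallTree` implementation the pipeline uses and would be far too slow at full scale; the function names and tuple layouts are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidates_within_radius(osm_pois, overture_pois):
    """Return (osm_id, overture_id, distance_m) for every pair within the OSM POI's radius.

    osm_pois: iterable of (osm_id, lat, lon, match_radius_m)
    overture_pois: iterable of (overture_id, lat, lon)
    """
    pairs = []
    for osm_id, lat, lon, radius_m in osm_pois:
        for ov_id, ov_lat, ov_lon in overture_pois:
            d = haversine_m(lat, lon, ov_lat, ov_lon)
            if d <= radius_m:
                pairs.append((osm_id, ov_id, d))
    return pairs


# One OSM POI with a 50 m radius; one Overture POI nearby, one about 1.1 km north.
osm = [(1, 47.6000, -122.3300, 50)]
overture = [("a", 47.6000, -122.3305), ("b", 47.6100, -122.3300)]
pairs = candidates_within_radius(osm, overture)
```

Only the nearby Overture POI `"a"` survives the radius filter here. A tree-based index replaces the inner loop in practice, but the output contract (candidate pairs plus distances) is the same idea.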
53+
54+
### 3. Match Scoring
55+
56+
Each candidate pair receives a composite score from four weighted components:
57+
58+
**Distance score (weight: 0.25)**
59+
Linear decay from 1.0 (0 meters) to 0.0 (at the match radius threshold).
60+
61+
```
62+
distance_score = 1.0 - (distance_m / match_radius_m)
63+
```
64+
65+
**Name score (weight: 0.30)**
66+
The maximum `rapidfuzz.fuzz.token_set_ratio` across up to four comparisons:
67+
- OSM `name` vs Overture `overture_name`
68+
- OSM `brand` vs Overture `brand_name`
69+
- OSM `name` vs Overture `brand_name` (cross-compare)
70+
- OSM `brand` vs Overture `overture_name` (cross-compare)
71+
72+
Token set ratio handles brand-as-subset patterns well (e.g. "Starbucks" vs
73+
"Starbucks Coffee" = 100%). When all names and brands are null on both sides,
74+
the score is set to 0.5 (neutral) rather than 0 to avoid penalizing unnamed
75+
POIs.
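
The "maximum across up to four comparisons" logic can be sketched as below. The scorer is pluggable: the README names `rapidfuzz.fuzz.token_set_ratio`, while this example defaults to a `difflib`-based stand-in that is not equivalent to it and returns values in [0, 1] rather than 0 to 100. Returning 0.0 when some names exist but no OSM/Overture pair can be compared is an assumption; the README only specifies the all-null case.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT equivalent to token_set_ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def name_score(osm_name, osm_brand, ov_name, ov_brand, scorer=simple_ratio) -> float:
    """Best similarity across name/brand comparisons, or 0.5 if everything is null."""
    if all(v is None for v in (osm_name, osm_brand, ov_name, ov_brand)):
        return 0.5  # neutral score from the README: do not penalize unnamed POIs
    comparisons = [
        (osm_name, ov_name),    # name vs name
        (osm_brand, ov_brand),  # brand vs brand
        (osm_name, ov_brand),   # cross-compare
        (osm_brand, ov_name),   # cross-compare
    ]
    scores = [scorer(a, b) for a, b in comparisons if a is not None and b is not None]
    # Assumption: no comparable pair (one side entirely null) scores 0.0.
    return max(scores, default=0.0)
```

The cross-comparisons matter when one source stores a chain name in `brand` and the other stores it in `name`.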
76+
77+
**Type taxonomy score (weight: 0.25)**
78+
Compares the unified `poi_category` assigned to each POI:
79+
- Exact category match: 1.0
80+
- Same broad group (both food_and_drink, both shopping, etc.): 0.5
81+
- Different broad groups: 0.0
82+
- One or both unmapped: 0.5 (neutral)
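
The four cases above translate directly into a small function. This is a sketch under stated assumptions: the function name and the idea of passing a separate "broad group" label alongside each category are illustrative, not taken from `match.py`.

```python
from typing import Optional


def type_score(
    cat_a: Optional[str],
    group_a: Optional[str],
    cat_b: Optional[str],
    group_b: Optional[str],
) -> float:
    """Score category agreement between an OSM POI (a) and an Overture POI (b)."""
    if cat_a is None or cat_b is None:
        return 0.5  # one or both unmapped: neutral
    if cat_a == cat_b:
        return 1.0  # exact category match
    if group_a is not None and group_a == group_b:
        return 0.5  # same broad group, different category
    return 0.0      # different broad groups
```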
83+
84+
**Identifier score (weight: 0.20)**
85+
Reserved for exact matches on `website`, `phone`, and `brand:wikidata`. Since
86+
Overture's current schema does not expose these fields, this component returns
87+
a neutral 0.5 for all pairs. It will become active when Overture adds these
88+
identifiers.
89+
90+
**Composite score:**
91+
```
92+
composite = 0.25 * distance + 0.30 * name + 0.25 * type + 0.20 * identifier
93+
```
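
As a worked example of the formulas above, the sketch below computes the distance score and the composite for one candidate pair. The weights are passed in explicitly because they are configurable; the values shown are the ones quoted in this README, and the function names are hypothetical. Clamping the distance score at 0.0 is a defensive assumption, since candidates are already limited to the match radius.

```python
def distance_score(distance_m: float, match_radius_m: float) -> float:
    """Linear decay from 1.0 at 0 m to 0.0 at the match radius."""
    return max(0.0, 1.0 - distance_m / match_radius_m)


def composite_score(distance: float, name: float, type_: float,
                    identifier: float, weights: dict) -> float:
    """Weighted sum of the four component scores."""
    return (
        weights["distance"] * distance
        + weights["name"] * name
        + weights["type"] * type_
        + weights["identifier"] * identifier
    )


# Weights as quoted in this README; the effective values come from config.yaml.
README_WEIGHTS = {"distance": 0.25, "name": 0.30, "type": 0.25, "identifier": 0.20}

# A pair 10 m apart (50 m radius), identical names, same category,
# and the neutral identifier score of 0.5.
score = composite_score(
    distance_score(10, 50), 1.0, 1.0, 0.5, README_WEIGHTS
)
```

Here the components contribute 0.20 + 0.30 + 0.25 + 0.10, giving a composite of 0.85.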
94+
95+
### 4. One-to-One Match Selection
96+
97+
Candidate pairs with composite score >= 0.67 (configurable) are eligible. A
98+
greedy algorithm assigns matches:
99+
100+
1. Sort all eligible pairs by composite score (descending).
101+
2. Iterate: assign the pair if neither the OSM POI nor the Overture POI has
102+
been assigned yet.
103+
3. Skip pairs where either side is already taken.
104+
105+
This produces a strict one-to-one mapping. The greedy approach is O(n log n) in
106+
the number of eligible pairs (the sort dominates) and produces near-optimal results, since most POIs have a clearly dominant match.
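
The three steps above can be sketched in a few lines. The function name and the `(osm_id, overture_id, score)` tuple layout are assumptions made for this example; `min_score` is left to the caller because the threshold is configurable.

```python
def select_matches(pairs, min_score):
    """Greedy one-to-one selection over (osm_id, overture_id, score) tuples."""
    # Keep eligible pairs and sort best-first.
    eligible = sorted(
        (p for p in pairs if p[2] >= min_score),
        key=lambda p: p[2],
        reverse=True,
    )
    used_osm, used_overture, matches = set(), set(), []
    for osm_id, ov_id, score in eligible:
        # Skip pairs where either side has already been assigned.
        if osm_id in used_osm or ov_id in used_overture:
            continue
        used_osm.add(osm_id)
        used_overture.add(ov_id)
        matches.append((osm_id, ov_id, score))
    return matches


candidates = [
    (1, "a", 0.90),
    (1, "b", 0.80),
    (2, "a", 0.85),
    (2, "b", 0.70),
    (3, "c", 0.50),  # below the threshold used here
]
matches = select_matches(candidates, min_score=0.67)
```

In this example OSM POI 2 loses its best candidate `"a"` to POI 1 and falls back to `"b"`, which illustrates where a greedy pass can differ from a globally optimal assignment.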
107+
108+
### 5. Confidence Merging
109+
110+
Let `w` = `overture_confidence_weight` (default 0.7, configurable). This
111+
represents our trust in Overture relative to OSM.
112+
113+
**Matched pairs:**
114+
```
115+
osm_weight = 1 / (1 + w) # ~0.59 with w=0.7
116+
overture_weight = w / (1 + w) # ~0.41 with w=0.7
117+
118+
conf_mean = osm_conf_mean * osm_weight + overture_confidence * overture_weight
119+
conf_lower = osm_conf_lower * osm_weight + overture_confidence * overture_weight
120+
conf_upper = osm_conf_upper * osm_weight + overture_confidence * overture_weight
121+
```
122+
123+
**Unmatched OSM:** Confidence scores are carried through as-is.
124+
125+
**Unmatched Overture:** `conf_mean = overture_confidence * w`. No confidence
126+
interval bounds (set to null).
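
A short worked example of the blending rules, using the default `w = 0.7`. The function names are hypothetical and the input confidences are made-up illustrative values.

```python
def blend_matched(osm_mean, osm_lower, osm_upper, overture_confidence, w=0.7):
    """Blend OSM and Overture confidence for a matched pair."""
    osm_weight = 1 / (1 + w)
    overture_weight = w / (1 + w)
    shift = overture_confidence * overture_weight
    return (
        osm_mean * osm_weight + shift,   # conf_mean
        osm_lower * osm_weight + shift,  # conf_lower
        osm_upper * osm_weight + shift,  # conf_upper
    )


def blend_unmatched_overture(overture_confidence, w=0.7):
    """Unmatched Overture POIs get a down-weighted mean and no interval bounds."""
    return (overture_confidence * w, None, None)


# Illustrative inputs: OSM confidence 0.80 (0.70 to 0.90), Overture confidence 0.90.
conf_mean, conf_lower, conf_upper = blend_matched(0.80, 0.70, 0.90, 0.90)
```

Because the two weights sum to 1, the blended mean always lies between the two source confidences, and the interval keeps its ordering while shrinking by a factor of `osm_weight`.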
127+
128+
### 6. Geometry Selection
129+
130+
When a matched pair has different geometry types (e.g. OSM Polygon vs Overture
131+
Point), the higher-level geometry is preferred:
132+
133+
```
134+
MultiPolygon > Polygon > LineString > Point
135+
```
136+
137+
On ties (both Points, both Polygons), the OSM geometry is used.
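
The preference order and tie rule can be expressed as a rank lookup. This sketch works on geometry type names only and returns which source to take the geometry from; the names used are hypothetical.

```python
# Higher rank wins; order taken from the README.
GEOMETRY_RANK = {"Point": 0, "LineString": 1, "Polygon": 2, "MultiPolygon": 3}


def preferred_geometry_source(osm_geom_type: str, overture_geom_type: str) -> str:
    """Return "osm" or "overture" depending on which geometry should be kept."""
    if GEOMETRY_RANK[overture_geom_type] > GEOMETRY_RANK[osm_geom_type]:
        return "overture"
    return "osm"  # OSM wins when it ranks higher and also on ties
```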
138+
139+
### 7. Tag Conflict Resolution
140+
141+
For name and brand, the source with higher confidence for that specific POI
142+
determines the value. Both original values are preserved with source-prefixed
143+
columns (`osm_name`, `overture_name`, `osm_brand`, `overture_brand`).
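
A minimal sketch of the resolution rule for a single tag. Two behaviors here are assumptions that the text above does not specify: ties go to OSM (mirroring the geometry tie rule), and if the higher-confidence source has no value, the other source's value is used instead.

```python
def resolve_tag(osm_value, osm_conf, overture_value, overture_conf):
    """Pick the name or brand from the higher-confidence source."""
    if overture_conf > osm_conf:
        preferred, fallback = overture_value, osm_value
    else:
        # Assumption: OSM wins ties.
        preferred, fallback = osm_value, overture_value
    # Assumption: fall back to the other source when the preferred value is null.
    return preferred if preferred is not None else fallback
```

The unresolved originals would still be written to the source-prefixed columns, so no information is lost by this choice.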
144+
145+
## Output Schema
146+
147+
| Column | Type | Description |
148+
|--------|------|-------------|
149+
| `unified_id` | str | `"matched:{osm_id}_{overture_id}"`, `"osm:{osm_id}"`, or `"overture:{overture_id}"` |
150+
| `source` | str | `"matched"`, `"osm"`, or `"overture"` |
151+
| `osm_id` | int64, nullable | OpenStreetMap element ID |
152+
| `overture_id` | str, nullable | Overture place ID |
153+
| `name` | str, nullable | Best name (from higher-confidence source) |
154+
| `brand` | str, nullable | Best brand (from higher-confidence source) |
155+
| `poi_category` | str | Unified category from crosswalk |
156+
| `conf_mean` | float64 | Blended confidence score |
157+
| `conf_lower` | float64, nullable | Lower confidence bound |
158+
| `conf_upper` | float64, nullable | Upper confidence bound |
159+
| `match_score` | float64, nullable | Composite match score (matched pairs only) |
160+
| `match_distance_m` | float64, nullable | Distance between matched geometries |
161+
| `osm_name` | str, nullable | Original OSM name |
162+
| `overture_name` | str, nullable | Original Overture name |
163+
| `osm_brand` | str, nullable | Original OSM brand |
164+
| `overture_brand` | str, nullable | Original Overture brand name |
165+
| `osm_conf_mean` | float64, nullable | Original OSM confidence |
166+
| `overture_confidence` | float64, nullable | Original Overture confidence |
167+
| `geometry` | Point/Polygon/MultiPolygon | EPSG:4326 |
168+
169+
## Configuration
170+
171+
All parameters are configurable in `config.yaml` under the `conflation` section:
172+
173+
```yaml
174+
conflation:
175+
overture_confidence_weight: 0.7 # Trust in Overture relative to OSM
176+
min_match_score: 0.67 # Minimum composite score for a match
177+
max_radius_m: 200 # Global upper bound on search radius
178+
default_radius_m: 50 # Default radius for unmapped categories
179+
distance_weight: 0.25 # Weight for distance score component
180+
name_weight: 0.30 # Weight for name score component
181+
type_weight: 0.25 # Weight for type taxonomy score component
182+
identifier_weight: 0.20 # Weight for identifier score component
183+
chunk_size: 500_000 # OSM rows per BallTree query batch
184+
test_bbox: # Geographic filter for --test mode
185+
xmin: -122.45
186+
ymin: 47.50
187+
xmax: -122.25
188+
ymax: 47.70
189+
```
190+
191+
## Taxonomy Migration Notice
192+
193+
Overture Maps is transitioning from the current hierarchical `categories` taxonomy
194+
(L0/L1) to a flat `basic_category` system. The old taxonomy is scheduled for
195+
deprecation around **June 2026**. When that migration completes, the crosswalk CSV
196+
and the `assign_overture_poi_category()` function will need to be updated to use the
197+
new field names and category values. Track the migration status in the Overture Maps
198+
changelog.
199+
200+
## Memory and Performance
201+
202+
The pipeline is designed to run within ~16 GB RAM on ~15M total POIs:
203+
204+
- Only columns needed for matching are loaded from parquet files
205+
- BallTree is built once on Overture centroids (~0.4 GB for 7M points)
206+
- OSM queries are chunked (500k rows per batch)
207+
- Name scoring uses rapidfuzz (C++ backend, ~100x faster than difflib)
208+
- Output is Hilbert-sorted for efficient cloud-native range reads
209+
210+
## Code Structure
211+
212+
```
213+
src/openpois/conflation/
214+
__init__.py
215+
data/taxonomy_crosswalk.csv # Category mapping between OSM and Overture
216+
taxonomy.py # Crosswalk loading and category assignment
217+
match.py # Spatial search, scoring, match selection
218+
merge.py # Confidence blending and output assembly
219+
220+
exploratory/conflation/
221+
conflate.py # Driver script (loads config, calls library)
222+
README.md # This file
223+
224+
tests/
225+
test_taxonomy.py # 12 tests
226+
test_match.py # 19 tests
227+
test_merge.py # 7 tests
228+
```
