| 1 | +# POI Conflation: OSM + Overture Maps |
| 2 | + |
| 3 | +This pipeline conflates rated OpenStreetMap POIs with Overture Maps POIs into a |
| 4 | +unified superset with blended confidence scores. The output includes matched |
| 5 | +pairs, unmatched OSM POIs, and unmatched Overture POIs. |
| 6 | + |
| 7 | +## Usage |
| 8 | + |
| 9 | +```bash |
| 10 | +# Full run (~15M POIs, ~16 GB RAM) |
| 11 | +python exploratory/conflation/conflate.py |
| 12 | + |
| 13 | +# Test mode (Seattle bbox, ~30k + ~19k POIs) |
| 14 | +python exploratory/conflation/conflate.py --test |
| 15 | +``` |
| 16 | + |
| 17 | +Output: `~/data/openpois/conflation/{version}/conflated.parquet` |
| 18 | + |
| 19 | +## Algorithm Overview |
| 20 | + |
| 21 | +### 1. Taxonomy Crosswalk |
| 22 | + |
| 23 | +A CSV file (`src/openpois/conflation/data/taxonomy_crosswalk.csv`) maps between |
| 24 | +OSM tag key/value pairs and Overture taxonomy categories. Each row defines: |
| 25 | + |
| 26 | +| Column | Description | |
| 27 | +|--------|-------------| |
| 28 | +| `osm_key` | OSM tag key (amenity, shop, healthcare, leisure) | |
| 29 | +| `osm_value` | OSM tag value (e.g. restaurant, supermarket) | |
| 30 | +| `overture_l0` | Overture L0 category (e.g. food_and_drink) | |
| 31 | +| `overture_l1` | Overture L1 category (e.g. restaurant) | |
| 32 | +| `poi_category` | Unified category label | |
| 33 | +| `match_radius_m` | Maximum spatial match distance (meters) | |
| 34 | + |
| 35 | +OSM POIs are assigned categories using the filter key priority order (shop > |
| 36 | +healthcare > leisure > amenity). If the specific tag value is not in the |
| 37 | +crosswalk, a wildcard (`*`) fallback for that key is used (default 50m radius). |
| 38 | + |
| 39 | +Overture POIs are matched by (`taxonomy_l0`, `taxonomy_l1`), falling back to an |
| 40 | +L0-only match when L1 is missing. |
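The OSM side of the lookup can be sketched as below. This is a minimal illustration, not the code in `taxonomy.py`: the `CROSSWALK` dictionary, its rows, and the function name `assign_osm_category` are hypothetical, and it assumes the priority order means "first key that yields a crosswalk row wins".

```python
from typing import Optional

# Priority order stated above: shop > healthcare > leisure > amenity.
KEY_PRIORITY = ["shop", "healthcare", "leisure", "amenity"]

# Hypothetical in-memory crosswalk: (osm_key, osm_value) -> (poi_category, match_radius_m).
# The real rows live in taxonomy_crosswalk.csv; these values are made up for illustration.
CROSSWALK = {
    ("amenity", "restaurant"): ("restaurant", 50.0),
    ("shop", "supermarket"): ("supermarket", 50.0),
    ("leisure", "park"): ("park", 200.0),
    ("shop", "*"): ("other_shop", 50.0),
    ("amenity", "*"): ("other_amenity", 50.0),
}


def assign_osm_category(tags: dict) -> Optional[tuple]:
    """Return (poi_category, match_radius_m) for the highest-priority mapped key."""
    for key in KEY_PRIORITY:
        value = tags.get(key)
        if value is None:
            continue
        # Try the exact (key, value) row first, then the wildcard row for that key.
        hit = CROSSWALK.get((key, value)) or CROSSWALK.get((key, "*"))
        if hit is not None:
            return hit
    return None  # no filter key present, or no row (not even a wildcard) for it


print(assign_osm_category({"shop": "bakery", "amenity": "cafe"}))
```

Here `shop=bakery` has no exact row, so the `shop` wildcard is used before `amenity` is even considered.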
| 41 | + |
| 42 | +### 2. Spatial Candidate Search |
| 43 | + |
| 44 | +A scikit-learn `BallTree` with haversine metric is built on all Overture POI |
| 45 | +centroids. OSM POIs are queried in chunks of 500k to control memory. For each |
| 46 | +OSM POI, all Overture POIs within that POI's category-specific match radius are |
| 47 | +returned as candidates. |
| 48 | + |
| 49 | +Match radii vary by POI type: |
| 50 | +- Private businesses (restaurants, shops): ~50m |
| 51 | +- Mid-size facilities (clinics, sports centres): ~75-100m |
| 52 | +- Areal features (parks, hospitals, stadiums): ~150-200m |
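As a reference for what the radius query returns, here is a brute-force stand-in written with the standard library only. It is not the BallTree code in `match.py`; the function names and the Earth-radius constant are assumptions. When using scikit-learn's haversine metric, coordinates are expected as `[lat, lon]` in radians and radii in radians too, so a radius in meters would be divided by the Earth radius before calling `query_radius`.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumed constant


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_candidates(osm_pois, overture_pois):
    """Brute-force radius search.

    osm_pois: list of (lat, lon, match_radius_m); overture_pois: list of (lat, lon).
    Returns a list of (osm_idx, overture_idx, distance_m) within each OSM POI's radius.
    """
    out = []
    for i, (lat, lon, radius_m) in enumerate(osm_pois):
        for j, (olat, olon) in enumerate(overture_pois):
            d = haversine_m(lat, lon, olat, olon)
            if d <= radius_m:
                out.append((i, j, d))
    return out


osm = [(47.6062, -122.3321, 50.0)]
overture = [(47.6062, -122.3321), (47.6063, -122.3321), (47.6162, -122.3321)]
candidates = find_candidates(osm, overture)
```

This is O(n·m) and only suitable for checking small samples; the tree-based query exists precisely to avoid that cost at full scale.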
| 53 | + |
| 54 | +### 3. Match Scoring |
| 55 | + |
| 56 | +Each candidate pair receives a composite score from four weighted components: |
| 57 | + |
| 58 | +**Distance score (weight: 0.25)** |
| 59 | +Linear decay from 1.0 (0 meters) to 0.0 (at the match radius threshold). |
| 60 | + |
| 61 | +``` |
| 62 | +distance_score = 1.0 - (distance_m / match_radius_m) |
| 63 | +``` |
| 64 | + |
| 65 | +**Name score (weight: 0.30)** |
| 66 | +The maximum `rapidfuzz.fuzz.token_set_ratio` across up to four comparisons: |
| 67 | +- OSM `name` vs Overture `overture_name` |
| 68 | +- OSM `brand` vs Overture `brand_name` |
| 69 | +- OSM `name` vs Overture `brand_name` (cross-compare) |
| 70 | +- OSM `brand` vs Overture `overture_name` (cross-compare) |
| 71 | + |
| 72 | +Token set ratio handles brand-as-subset patterns well (e.g. "Starbucks" vs |
| 73 | +"Starbucks Coffee" = 100%). When all names and brands are null on both sides, |
| 74 | +the score is set to 0.5 (neutral) rather than 0 to avoid penalizing unnamed |
| 75 | +POIs. |
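The max-over-comparisons logic might look like the sketch below. The function name and the injectable `ratio` argument are illustrative choices, not the signature used in `match.py`. It assumes the scorer returns values on a 0-100 scale (as `rapidfuzz.fuzz.token_set_ratio` does) and, as a labeled assumption, returns 0.0 when names exist on only one side so that no comparison is possible.

```python
from typing import Callable, Optional


def name_score(
    osm_name: Optional[str],
    osm_brand: Optional[str],
    overture_name: Optional[str],
    overture_brand: Optional[str],
    ratio: Callable[[str, str], float],
) -> float:
    """Best similarity in [0, 1] across up to four name/brand comparisons."""
    values = (osm_name, osm_brand, overture_name, overture_brand)
    if all(v is None for v in values):
        return 0.5  # neutral score for fully unnamed pairs, as described above
    pairs = [
        (osm_name, overture_name),
        (osm_brand, overture_brand),
        (osm_name, overture_brand),   # cross-compare
        (osm_brand, overture_name),   # cross-compare
    ]
    # Only compare pairs where both strings are present; rescale 0-100 to 0-1.
    scores = [ratio(a, b) / 100.0 for a, b in pairs if a is not None and b is not None]
    # Assumption (not stated in this README): no comparable pair scores 0.0.
    return max(scores, default=0.0)


def exact_ratio(a: str, b: str) -> float:
    """Toy stand-in scorer for demonstration: 100 on case-insensitive equality, else 0."""
    return 100.0 if a.lower() == b.lower() else 0.0


print(name_score("Starbucks", None, None, "starbucks", exact_ratio))
```

In the pipeline the scorer would be `rapidfuzz.fuzz.token_set_ratio` rather than the toy `exact_ratio`.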
| 76 | + |
| 77 | +**Type taxonomy score (weight: 0.25)** |
| 78 | +Compares the unified `poi_category` assigned to each POI: |
| 79 | +- Exact category match: 1.0 |
| 80 | +- Same broad group (both food_and_drink, both shopping, etc.): 0.5 |
| 81 | +- Different broad groups: 0.0 |
| 82 | +- One or both unmapped: 0.5 (neutral) |
| 83 | + |
| 84 | +**Identifier score (weight: 0.20)** |
| 85 | +Reserved for exact matches on `website`, `phone`, and `brand:wikidata`. Since |
| 86 | +Overture's current schema does not expose these fields, this component returns |
| 87 | +a neutral 0.5 for all pairs. It will become active when Overture adds these |
| 88 | +identifiers. |
| 89 | + |
| 90 | +**Composite score:** |
| 91 | +``` |
| 92 | +composite = 0.25 * distance + 0.30 * name + 0.25 * type + 0.20 * identifier |
| 93 | +``` |
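A small worked example of the formula, using the default weights. The helper names and the `BROAD_GROUP` mapping are hypothetical; clamping the distance score at 0.0 is an added safeguard rather than something stated above.

```python
# Hypothetical mapping from unified category to broad group, for illustration only.
BROAD_GROUP = {
    "restaurant": "food_and_drink",
    "cafe": "food_and_drink",
    "supermarket": "shopping",
}


def distance_score(distance_m: float, match_radius_m: float) -> float:
    """Linear decay from 1.0 at 0 m to 0.0 at the match radius (clamped at 0.0)."""
    return max(0.0, 1.0 - distance_m / match_radius_m)


def type_score(osm_cat, overture_cat, broad_group=BROAD_GROUP) -> float:
    """1.0 exact match, 0.5 same broad group or unmapped, 0.0 otherwise."""
    if osm_cat is None or overture_cat is None:
        return 0.5  # one or both unmapped: neutral
    if osm_cat == overture_cat:
        return 1.0
    group = broad_group.get(osm_cat)
    if group is not None and group == broad_group.get(overture_cat):
        return 0.5
    return 0.0


def composite_score(distance, name, type_, identifier=0.5,
                    weights=(0.25, 0.30, 0.25, 0.20)) -> float:
    """Weighted sum of the four component scores."""
    w_d, w_n, w_t, w_i = weights
    return w_d * distance + w_n * name + w_t * type_ + w_i * identifier


# 10 m apart with a 50 m radius, identical names, identical categories:
score = composite_score(distance_score(10, 50), 1.0, type_score("cafe", "cafe"))
print(round(score, 4))  # 0.25*0.8 + 0.30*1.0 + 0.25*1.0 + 0.20*0.5 = 0.85
```

Because the identifier component is fixed at 0.5 for now, the highest reachable composite score is 0.90.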
| 94 | + |
| 95 | +### 4. One-to-One Match Selection |
| 96 | + |
| 97 | +Candidate pairs with composite score >= 0.67 (configurable) are eligible. A |
| 98 | +greedy algorithm assigns matches: |
| 99 | + |
| 100 | +1. Sort all eligible pairs by composite score (descending). |
| 101 | +2. Iterate: assign the pair if neither the OSM POI nor the Overture POI has |
| 102 | + been assigned yet. |
| 103 | +3. Skip pairs where either side is already taken. |
| 104 | + |
| 105 | +This produces a strict one-to-one mapping. The sort dominates the cost, so the |
| 106 | +greedy pass is O(n log n) in the number of eligible pairs, and results are near-optimal since most POIs have a clearly dominant match. |
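The three steps above can be sketched as follows. The function name and the tuple layout `(osm_id, overture_id, score)` are assumptions made for this example, not the interface of `match.py`.

```python
def select_matches(pairs, min_score=0.67):
    """Greedy one-to-one assignment over (osm_id, overture_id, score) tuples."""
    # Keep only eligible pairs, best score first.
    eligible = sorted(
        (p for p in pairs if p[2] >= min_score),
        key=lambda p: p[2],
        reverse=True,
    )
    used_osm, used_overture = set(), set()
    matches = []
    for osm_id, overture_id, score in eligible:
        # Skip pairs where either side has already been assigned.
        if osm_id in used_osm or overture_id in used_overture:
            continue
        used_osm.add(osm_id)
        used_overture.add(overture_id)
        matches.append((osm_id, overture_id, score))
    return matches


pairs = [
    (1, "a", 0.90),
    (1, "b", 0.80),
    (2, "a", 0.85),
    (2, "b", 0.70),
    (3, "c", 0.50),  # below the threshold, never eligible
]
print(select_matches(pairs))
```

In this example OSM POI 2 loses its preferred partner `"a"` to POI 1 and falls back to `"b"`, which illustrates how greedy selection can differ from a globally optimal assignment when candidates overlap.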
| 107 | + |
| 108 | +### 5. Confidence Merging |
| 109 | + |
| 110 | +Let `w` = `overture_confidence_weight` (default 0.7, configurable). This |
| 111 | +represents our trust in Overture relative to OSM. |
| 112 | + |
| 113 | +**Matched pairs:** |
| 114 | +``` |
| 115 | +osm_weight = 1 / (1 + w)       # ~0.59 with w=0.7 |
| 116 | +overture_weight = w / (1 + w)  # ~0.41 with w=0.7 |
| 117 | + |
| 118 | +conf_mean = osm_conf_mean * osm_weight + overture_confidence * overture_weight |
| 119 | +conf_lower = osm_conf_lower * osm_weight + overture_confidence * overture_weight |
| 120 | +conf_upper = osm_conf_upper * osm_weight + overture_confidence * overture_weight |
| 121 | +``` |
| 122 | + |
| 123 | +**Unmatched OSM:** Confidence scores are carried through as-is. |
| 124 | + |
| 125 | +**Unmatched Overture:** `conf_mean = overture_confidence * w`. No confidence |
| 126 | +interval bounds (set to null). |
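A short numeric check of the blending rules, using the stated default `w = 0.7`. The function names are hypothetical and the input confidences are made-up values.

```python
def blend_confidence(osm_conf: float, overture_conf: float, w: float = 0.7) -> float:
    """Blend one OSM confidence value (mean, lower, or upper) with Overture's."""
    osm_weight = 1 / (1 + w)
    overture_weight = w / (1 + w)
    return osm_conf * osm_weight + overture_conf * overture_weight


def unmatched_overture_confidence(overture_conf: float, w: float = 0.7) -> float:
    """Down-weight an Overture-only POI by the trust factor w."""
    return overture_conf * w


# Matched pair: OSM mean 0.80, Overture 0.90 -> (0.80 + 0.7 * 0.90) / 1.7
matched = blend_confidence(0.80, 0.90)
# Overture-only POI with confidence 0.90 -> 0.63
overture_only = unmatched_overture_confidence(0.90)
print(round(matched, 4), round(overture_only, 4))
```

The two weights always sum to 1, so a blended value stays between the OSM and Overture inputs.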
| 127 | + |
| 128 | +### 6. Geometry Selection |
| 129 | + |
| 130 | +When a matched pair has different geometry types (e.g. OSM Polygon vs Overture |
| 131 | +Point), the higher-level geometry is preferred: |
| 132 | + |
| 133 | +``` |
| 134 | +MultiPolygon > Polygon > LineString > Point |
| 135 | +``` |
| 136 | + |
| 137 | +On ties (both Points, both Polygons), the OSM geometry is used. |
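The preference rule reduces to a rank comparison. A minimal sketch, assuming geometry types are available as strings (as `shapely`'s `geom_type` provides); the names `GEOMETRY_RANK` and `pick_geometry_source` are illustrative.

```python
# Higher rank wins: MultiPolygon > Polygon > LineString > Point.
GEOMETRY_RANK = {"Point": 0, "LineString": 1, "Polygon": 2, "MultiPolygon": 3}


def pick_geometry_source(osm_type: str, overture_type: str) -> str:
    """Return which source's geometry to keep for a matched pair."""
    if GEOMETRY_RANK[overture_type] > GEOMETRY_RANK[osm_type]:
        return "overture"
    return "osm"  # OSM wins when it ranks higher and also on ties


print(pick_geometry_source("Point", "Polygon"))
```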
| 138 | + |
| 139 | +### 7. Tag Conflict Resolution |
| 140 | + |
| 141 | +For name and brand, the source with higher confidence for that specific POI |
| 142 | +determines the value. Both original values are preserved in source-prefixed |
| 143 | +columns (`osm_name`, `overture_name`, `osm_brand`, `overture_brand`). |
| 144 | + |
| 145 | +## Output Schema |
| 146 | + |
| 147 | +| Column | Type | Description | |
| 148 | +|--------|------|-------------| |
| 149 | +| `unified_id` | str | `"matched:{osm_id}_{overture_id}"`, `"osm:{osm_id}"`, or `"overture:{overture_id}"` | |
| 150 | +| `source` | str | `"matched"`, `"osm"`, or `"overture"` | |
| 151 | +| `osm_id` | int64, nullable | OpenStreetMap element ID | |
| 152 | +| `overture_id` | str, nullable | Overture place ID | |
| 153 | +| `name` | str, nullable | Best name (from higher-confidence source) | |
| 154 | +| `brand` | str, nullable | Best brand (from higher-confidence source) | |
| 155 | +| `poi_category` | str | Unified category from crosswalk | |
| 156 | +| `conf_mean` | float64 | Blended confidence score | |
| 157 | +| `conf_lower` | float64, nullable | Lower confidence bound | |
| 158 | +| `conf_upper` | float64, nullable | Upper confidence bound | |
| 159 | +| `match_score` | float64, nullable | Composite match score (matched pairs only) | |
| 160 | +| `match_distance_m` | float64, nullable | Distance between matched geometries, in meters (matched pairs only) | |
| 161 | +| `osm_name` | str, nullable | Original OSM name | |
| 162 | +| `overture_name` | str, nullable | Original Overture name | |
| 163 | +| `osm_brand` | str, nullable | Original OSM brand | |
| 164 | +| `overture_brand` | str, nullable | Original Overture brand name | |
| 165 | +| `osm_conf_mean` | float64, nullable | Original OSM confidence | |
| 166 | +| `overture_confidence` | float64, nullable | Original Overture confidence | |
| 167 | +| `geometry` | Point/Polygon/MultiPolygon | EPSG:4326 | |
| 168 | + |
| 169 | +## Configuration |
| 170 | + |
| 171 | +All parameters are configurable in `config.yaml` under the `conflation` section: |
| 172 | + |
| 173 | +```yaml |
| 174 | +conflation: |
| 175 | + overture_confidence_weight: 0.7 # Trust in Overture relative to OSM |
| 176 | + min_match_score: 0.67 # Minimum composite score for a match |
| 177 | + max_radius_m: 200 # Global upper bound on search radius |
| 178 | + default_radius_m: 50 # Default radius for unmapped categories |
| 179 | + distance_weight: 0.25 # Weight for distance score component |
| 180 | + name_weight: 0.30 # Weight for name score component |
| 181 | + type_weight: 0.25 # Weight for type taxonomy score component |
| 182 | + identifier_weight: 0.20 # Weight for identifier score component |
| 183 | + chunk_size: 500_000 # OSM rows per BallTree query batch |
| 184 | + test_bbox: # Geographic filter for --test mode |
| 185 | + xmin: -122.45 |
| 186 | + ymin: 47.50 |
| 187 | + xmax: -122.25 |
| 188 | + ymax: 47.70 |
| 189 | +``` |
| 190 | + |
| 191 | +## Taxonomy Migration Notice |
| 192 | + |
| 193 | +Overture Maps is transitioning from the current hierarchical `categories` taxonomy |
| 194 | +(L0/L1) to a flat `basic_category` system. The old taxonomy is scheduled for |
| 195 | +deprecation around **June 2026**. When that migration completes, the crosswalk CSV |
| 196 | +and the `assign_overture_poi_category()` function will need to be updated to use the |
| 197 | +new field names and category values. Track the migration status in the Overture Maps |
| 198 | +changelog. |
| 199 | + |
| 200 | +## Memory and Performance |
| 201 | + |
| 202 | +The pipeline is designed to run within ~16 GB RAM on ~15M total POIs: |
| 203 | + |
| 204 | +- Only columns needed for matching are loaded from parquet files |
| 205 | +- BallTree is built once on Overture centroids (~0.4 GB for 7M points) |
| 206 | +- OSM queries are chunked (500k rows per batch) |
| 207 | +- Name scoring uses rapidfuzz (C++ backend, ~100x faster than difflib) |
| 208 | +- Output is Hilbert-sorted for efficient cloud-native range reads |
| 209 | + |
| 210 | +## Code Structure |
| 211 | + |
| 212 | +``` |
| 213 | +src/openpois/conflation/ |
| 214 | + __init__.py |
| 215 | + data/taxonomy_crosswalk.csv # Category mapping between OSM and Overture |
| 216 | + taxonomy.py # Crosswalk loading and category assignment |
| 217 | + match.py # Spatial search, scoring, match selection |
| 218 | + merge.py # Confidence blending and output assembly |
| 219 | + |
| 220 | +exploratory/conflation/ |
| 221 | + conflate.py # Driver script (loads config, calls library) |
| 222 | + README.md # This file |
| 223 | + |
| 224 | +tests/ |
| 225 | + test_taxonomy.py # 12 tests |
| 226 | + test_match.py # 19 tests |
| 227 | + test_merge.py # 7 tests |
| 228 | +``` |