Skip to content

Latest commit

 

History

History
436 lines (349 loc) · 15.3 KB

File metadata and controls

436 lines (349 loc) · 15.3 KB

DESIGN.md - Geo-Bucket Architecture

Overview

This system solves location search inconsistency by grouping nearby properties into geo-buckets and normalizing location names.

Problem: "Sangotedo" returns 0 results, but "Sangotedo" (different spelling) returns 47 results.

Solution: Grid-based buckets + fuzzy name matching = consistent results.


1. Geo-Bucket Strategy

How It Works

Grid-Based Bucketing:

Split the map into fixed-size grid cells (0.01° × 0.01° ≈ 1.1km squares)

Each cell = One Bucket

Example:
┌─────────┬─────────┬─────────┐
│ Lekki   │ VI      │ Ikoyi   │
│ Cell 1  │ Cell 2  │ Cell 3  │
├─────────┼─────────┼─────────┤
│Sangotedo│ Ajah    │ Badore  │
│ Cell 4  │ Cell 5  │ Cell 6  │  ← All Sangotedo properties in Cell 4
└─────────┴─────────┴─────────┘

Bucket Definition

Grid Size: 0.01 degrees (~1.1 km)
Coordinate System: WGS84 (standard GPS)

Calculate Grid Cell:
grid_x = floor(longitude / 0.01)
grid_y = floor(latitude / 0.01)

Example - Sangotedo Coordinates:
(6.4698, 3.6285) → Grid (362, 646)
(6.4720, 3.6301) → Grid (362, 646)  ← Same cell!
(6.4705, 3.6290) → Grid (362, 646)  ← Same cell!

Why Grid-Based?

Fast: O(1) lookup - just calculate grid coordinates
Simple: Easy to implement and debug
Predictable: Fixed boundaries, consistent behavior

Trade-off: Properties on cell boundaries might be in different buckets

  • Mitigation: Fuzzy name matching catches these cases

2. Database Schema

Tables

┌─────────────────────────────────────────────────────────────┐
│                        geo_buckets                          │
├─────────────────────────────────────────────────────────────┤
│ id               SERIAL PRIMARY KEY                         │
│ grid_x           INTEGER (indexed)                          │
│ grid_y           INTEGER (indexed)                          │
│ canonical_name   VARCHAR(255)                               │
│ aliases          JSONB (e.g., ["sangotedo", "sangotedo ajah"])│
│ created_at       TIMESTAMP                                  │
│                                                             │
│ UNIQUE(grid_x, grid_y)  ← Only one bucket per grid cell   │
└─────────────────────────────────────────────────────────────┘
                              │
                              │ 1:N
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                        properties                           │
├─────────────────────────────────────────────────────────────┤
│ id               SERIAL PRIMARY KEY                         │
│ title            VARCHAR(255)                               │
│ location_name    VARCHAR(255)                               │
│ latitude         FLOAT                                      │
│ longitude        FLOAT                                      │
│ geo_bucket_id    INTEGERreferences geo_buckets(id)       │
│ price            DECIMAL(12,2)                              │
│ bedrooms         INTEGER                                    │
│ bathrooms        INTEGER                                    │
│ created_at       TIMESTAMP                                  │
└─────────────────────────────────────────────────────────────┘

Key Indexes

-- Fast bucket lookup
CREATE INDEX idx_buckets_grid ON geo_buckets(grid_x, grid_y);

-- Fast property → bucket join
CREATE INDEX idx_properties_bucket ON properties(geo_bucket_id);

Relationship

One Bucket ←→ Many Properties

Example:
Bucket "sangotedo" (grid 362, 646)
  ├─ Property 1: "Sangotedo"
  ├─ Property 2: "Sangotedo, Ajah"
  └─ Property 3: "sangotedo lagos"

3. Location Matching Logic

Three-Layer Approach

Layer 1: Normalize Text

Remove differences in spelling, case, punctuation:

InputNormalized
────────────────────────────────────
"Sangotedo""sangotedo"
"Sangotedo, Ajah""sangotedo ajah"
"SANGOTEDO  LAGOS!!""sangotedo lagos"

Steps:
1. Convert to lowercase
2. Remove punctuation (, . ! ?)
3. Collapse whitespace

Layer 2: Grid Grouping

Properties at nearby coordinates automatically share a bucket:

Coordinate Check:
─────────────────
Sangotedo        (6.4698, 3.6285) → Grid (362, 646)
Sangotedo, Ajah  (6.4720, 3.6301) → Grid (362, 646)  ✓ Same!
sangotedo lagos  (6.4705, 3.6290) → Grid (362, 646)  ✓ Same!

Result: All 3 properties in ONE bucket

Layer 3: Fuzzy Matching

Handle typos and variations:

Search Query: "sangotedo"
─────────────────────────────────────────────
Match Type         Example              Distance
─────────────────────────────────────────────
Exact             "sangotedo"           0 ✓
Substring         "sangotedo ajah"      Contains ✓
Small typo        "sangotdo" (missing e) 1 ✓
Small typo        "sangotedo" (extra o)  1 ✓
Too different     "sngtd"                6 ✗

Threshold: Distance ≤ 3 characters

Complete Matching Process

User searches "sangotedo"
        ↓
1. Normalize: "sangotedo"
        ↓
2. Check ALL buckets for matches:
   - Bucket "sangotedo": Distance 0 ✓ MATCH
   - Bucket "lekki": Distance 7 ✗ No match
   - Bucket "vi": Distance 8 ✗ No match
        ↓
3. Get properties from matched bucket(s)
        ↓
4. Return ALL properties (3 found!)

4. System Flow Diagram

Property Creation Flow

┌─────────┐
│  USER   │ POST /api/properties/
└────┬────┘ {title, location_name, lat, lng, price, ...}
     │
     ▼
┌──────────────────────────────────────────────┐
│  API: Receive property data                  │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────┐
│  Calculate Grid Cell                         │
│  grid_x = floor(lng / 0.01) = 362           │
│  grid_y = floor(lat / 0.01) = 646           │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────┐
│  Check: Does bucket (362, 646) exist?        │
│  ├─ YES: Get existing bucket                 │
│  │       Add "sangotedo" to aliases          │
│  └─ NO:  Create new bucket                   │
│          canonical_name = "sangotedo"        │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────┐
│  Save Property with geo_bucket_id            │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌─────────┐
│ RETURN  │ 201 Created {id, bucket_id, ...}
└─────────┘

Search Flow

┌─────────┐
│  USER   │ GET /api/properties/search?location=sangotedo
└────┬────┘
     │
     ▼
┌──────────────────────────────────────────────┐
│  Normalize Query                             │
│  "sangotedo" → "sangotedo"                   │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────┐
│  Find Matching Buckets                       │
│                                              │
│  For each bucket in database:                │
│    Check canonical_name: "sangotedo"         │
│    ├─ Distance = 0 ✓ MATCH                  │
│    │                                         │
│    Check aliases: ["sangotedo", ...]         │
│    ├─ "sangotedo ajah" contains query ✓     │
│    │                                         │
│  Matched: [Bucket #1]                        │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────────┐
│  Query Database                              │
│  SELECT * FROM properties                    │
│  WHERE geo_bucket_id IN (1)                  │
│                                              │
│  Results: 3 properties found                 │
└────┬─────────────────────────────────────────┘
     │
     ▼
┌─────────┐
│ RETURN  │ 200 OK [{property1}, {property2}, {property3}]
└─────────┘

Visual Flow Summary

User Input          System Processing           Output
──────────          ─────────────────           ──────

"sangotedo"    →    Normalize                  
                    ↓
               →    Find Buckets               
                    ↓
                    Bucket #1 (sangotedo)      
                    - grid: (362, 646)
                    - aliases: [sangotedo, 
                               sangotedo ajah,
                               sangotedo lagos]
                    ↓
               →    Get Properties             →   3 Properties
                    ↓                               - Property 1
                    3 properties in bucket          - Property 2
                                                    - Property 3

Performance Considerations

Query Speed

Operation            Time       Why
─────────────────────────────────────────────
Create Property      ~5ms       Simple grid calculation + INSERT
Search (exact)       ~10ms      Index lookup on bucket + JOIN
Search (fuzzy)       ~30ms      Scan buckets (fast with few buckets)
Bucket Stats         ~50ms      Aggregation query

Scalability

Current:
- 500 buckets (Lagos area)
- 500,000 properties
- Database: ~1GB
- Performance: ✓ Good

Future (1M+ properties):
- Add trigram index for faster fuzzy search
- Cache frequent searches (Redis)
- Consider adjacent bucket search for boundaries

Trade-offs & Design Decisions

Why Grid vs Alternatives?

Approach Pros Cons Decision
Grid Simple, O(1) lookup, predictable Fixed boundaries ✓ Chosen
H3 Hexagonal Better coverage, no edges Complex, overkill
Radius clustering Natural circles O(n) search, slow
Geohash Hierarchical Less intuitive

Key Assumptions

  1. Grid size 0.01°: Good balance for Lagos neighborhoods
  2. Levenshtein distance ≤ 3: Catches most typos without false positives
  3. In-memory bucket scan: Fast enough for <1000 buckets
  4. Substring matching: Handles "Sangotedo" in "Sangotedo, Ajah"

Accepted Limitations

  • Properties exactly on grid boundaries might be in different buckets
    • Mitigation: Fuzzy name matching finds them anyway
  • One fixed grid size for all areas
    • Future: Could adjust per region (urban vs rural)
  • Simple typo handling (Levenshtein)
    • Future: Could add phonetic matching (Soundex)

Example: The Sangotedo Test Case

Initial State

Database: Empty

Step 1: Create Property 1

Input: {location_name: "Sangotedo", lat: 6.4698, lng: 3.6285}

Calculate: grid_x=362, grid_y=646
Create: Bucket #1 {canonical_name: "sangotedo", aliases: ["sangotedo"]}
Create: Property #1 {geo_bucket_id: 1}

Database:
  Bucket #1: "sangotedo" (1 property)

Step 2: Create Property 2

Input: {location_name: "Sangotedo, Ajah", lat: 6.4720, lng: 3.6301}

Calculate: grid_x=362, grid_y=646  ← Same grid!
Find: Bucket #1 already exists
Update: Bucket #1 aliases += "sangotedo ajah"
Create: Property #2 {geo_bucket_id: 1}

Database:
  Bucket #1: "sangotedo" 
             aliases: ["sangotedo", "sangotedo ajah"]
             (2 properties)

Step 3: Create Property 3

Input: {location_name: "sangotedo lagos", lat: 6.4705, lng: 3.6290}

Calculate: grid_x=362, grid_y=646  ← Same grid!
Find: Bucket #1 already exists
Update: Bucket #1 aliases += "sangotedo lagos"
Create: Property #3 {geo_bucket_id: 1}

Database:
  Bucket #1: "sangotedo"
             aliases: ["sangotedo", "sangotedo ajah", "sangotedo lagos"]
             (3 properties)

Step 4: Search

Query: "sangotedo"
Normalize: "sangotedo"
Match: Bucket #1 (exact match on canonical_name)
Result: Return all 3 properties ✓

Database Query:
SELECT * FROM properties WHERE geo_bucket_id = 1
→ Returns: Property #1, #2, #3

Summary

Problem Solved:

  • ✓ "Sangotedo" = "Sangotedo, Ajah" = "sangotedo lagos"
  • ✓ Case-insensitive
  • ✓ Typo-tolerant
  • ✓ Fast (<50ms)

How:

  1. Grid cells group nearby properties
  2. Name normalization removes formatting differences
  3. Fuzzy matching handles typos
  4. Aliases track all name variations

Result: Consistent, reliable property search regardless of spelling or formatting.