| 1 | +# COLLATE Support Phase 5 Planning - Runtime Query Optimization |
| 2 | + |
| 3 | +**Date:** 2025-01-28 |
| 4 | +**Status:** 🚀 PLANNED |
| 5 | +**Target Completion:** Phase 5 completion |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Executive Summary |
| 10 | + |
| 11 | +Phase 5 extends collation support from infrastructure (Phases 1-4) to **runtime query execution optimization**. This phase ensures that: |
| 12 | + |
| 13 | +- ✅ WHERE clause filtering respects column collations (case-insensitive queries) |
| 14 | +- ✅ DISTINCT operations use collation-aware equality |
| 15 | +- ✅ GROUP BY and aggregates respect collation |
| 16 | +- ✅ ORDER BY respects collation for correct sorting |
| 17 | +- ✅ Performance: No regression for binary comparisons, <5% overhead for NOCASE |
| 18 | + |
| 19 | +--- |
| 20 | + |
| 21 | +## What's Been Completed (Phases 1-4) |
| 22 | + |
| 23 | +### Phase 1: Schema Support |
| 24 | +- ✅ `CollationType` enum (Binary, NoCase, RTrim, UnicodeCaseInsensitive) |
| 25 | +- ✅ `ColumnCollations` list on Table |
| 26 | +- ✅ SQL DDL parsing and generation (`CREATE TABLE ... COLLATE NOCASE`) |
| 27 | + |
| 28 | +### Phase 2: Parser Integration |
| 29 | +- ✅ SQL parser supports `COLLATE` clause in CREATE/ALTER TABLE |
| 30 | +- ✅ SqlParser.DDL generates correct AST |
| 31 | + |
| 32 | +### Phase 3: Storage Engine Integration |
| 33 | +- ✅ Collation persisted with schema to disk |
| 34 | +- ✅ Schema loading restores collation metadata |
| 35 | +- ✅ B-Tree and Hash Index infrastructure prepared |
| 36 | + |
| 37 | +### Phase 4: Index Integration |
| 38 | +- ✅ B-Tree comparison uses collation (`BTree<string, long>` with `CollationType`) |
| 39 | +- ✅ Hash Index uses collation-aware key normalization |
| 40 | +- ✅ Primary key lookups respect collation |
| 41 | + |
| 42 | +### EF Core Integration |
| 43 | +- ✅ Migrations emit `COLLATE` clause in DDL |
| 44 | +- ✅ `EF.Functions.Collate()` translator |
| 45 | +- ✅ `StringComparison` translator |
| 46 | +- ✅ Query SQL generation supports collation |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +## Phase 5 Scope: Runtime Query Optimization |
| 51 | + |
| 52 | +### 5.1 WHERE Clause Filtering (Collation-Aware Comparison) |
| 53 | + |
| 54 | +**Current Status:** Partial |
| 55 | +**What Needs Implementation:** |
| 56 | + |
| 57 | +1. **Modify Table.CRUD.cs Select method:** |
| 58 | + - Enhance `EvaluateCondition()` to use `CollationExtensions.AreEqual()` |
| 59 | + - Support case-insensitive filtering: `WHERE Name = 'alice'` with NOCASE collation |
| 60 | + - Support collation-aware LIKE: `WHERE Email LIKE '%@EXAMPLE.COM%'` → match regardless of case |
| 61 | + |
| 62 | +2. **String Comparison Operations:** |
| 63 | + - `=` (equality) → use `AreEqual()` with column collation |
| 64 | + - `<>` (inequality) → use `!AreEqual()` |
| 65 | + - `>`, `<`, `>=`, `<=` → use `CollationComparator.Compare()` (to be created in Task 5.1) |
| 66 | + |
| 67 | +3. **Example Behavior:** |
| 68 | + ```sql |
| 69 | + -- Column: name TEXT COLLATE NOCASE |
| 70 | + WHERE name = 'alice' -- Matches: 'alice', 'ALICE', 'Alice' |
| 71 | + WHERE name LIKE '%ice' -- Matches values ending in 'ice' in any case: 'Alice', 'ALICE', 'alice' |
| 72 | + WHERE name > 'alice' -- Uses collation-aware ordering |
| 73 | + ``` |
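The operator mapping in 5.1 could be sketched roughly as follows. This is a minimal illustration, not the project's `EvaluateCondition()`: the `WhereSketch` class, the `Evaluate` signature, and the mapping from collation to `StringComparison` are all assumptions, and the `CollationType` enum is a local stand-in that only reuses the member names listed under Phase 1.

```csharp
#nullable enable
using System;

// Local stand-in for the project's enum (member names taken from Phase 1).
public enum CollationType { Binary, NoCase, RTrim, UnicodeCaseInsensitive }

public static class WhereSketch
{
    // Assumed mapping from collation to a .NET comparison mode.
    private static StringComparison ToComparison(CollationType collation) => collation switch
    {
        CollationType.NoCase => StringComparison.OrdinalIgnoreCase,
        CollationType.UnicodeCaseInsensitive => StringComparison.InvariantCultureIgnoreCase,
        _ => StringComparison.Ordinal,
    };

    // RTRIM ignores trailing spaces; other collations compare the value as-is.
    private static string Normalize(string s, CollationType collation) =>
        collation == CollationType.RTrim ? s.TrimEnd(' ') : s;

    // Evaluates "columnValue <op> literal" under the column's collation.
    public static bool Evaluate(string? columnValue, string op, string? literal, CollationType collation)
    {
        // In this sketch a comparison involving NULL never matches.
        if (columnValue is null || literal is null) return false;

        int cmp = string.Compare(
            Normalize(columnValue, collation),
            Normalize(literal, collation),
            ToComparison(collation));

        return op switch
        {
            "=" => cmp == 0,
            "<>" => cmp != 0,
            ">" => cmp > 0,
            "<" => cmp < 0,
            ">=" => cmp >= 0,
            "<=" => cmp <= 0,
            _ => throw new NotSupportedException($"Operator '{op}' is not handled in this sketch."),
        };
    }
}
```

Routing all six operators through one three-way comparison keeps `=` and `<>` consistent with the ordering operators for a given collation.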
| 74 | + |
| 75 | +### 5.2 DISTINCT Operation (Collation-Aware Deduplication) |
| 76 | + |
| 77 | +**Current Status:** Not implemented |
| 78 | +**What Needs Implementation:** |
| 79 | + |
| 80 | +1. **Collation-aware HashSet for DISTINCT:** |
| 81 | + - Create `CollationAwareEqualityComparer<string>` (if it does not already exist) |
| 82 | + - Use in DISTINCT result deduplication |
| 83 | + - Example: `SELECT DISTINCT email FROM users` where `email` has NOCASE |
| 84 | + - 'alice@example.com' and 'ALICE@EXAMPLE.COM' → treated as same |
| 85 | + |
| 86 | +2. **Method:** `Table.Select()` enhancement |
| 87 | + - Add parameter `bool distinct = false` |
| 88 | + - When DISTINCT, use collation-aware deduplication |
| 89 | + - Query parsing: Parse "SELECT DISTINCT" syntax |
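One possible shape for the collation-aware deduplication described in 5.2 is shown below. It is a sketch under assumptions: the comparer here is a non-generic class rather than the `CollationAwareEqualityComparer<string>` named above, `DistinctSketch` is a hypothetical helper, and `CollationType` is a local stand-in for the project's enum.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Local stand-in for the project's enum (member names taken from Phase 1).
public enum CollationType { Binary, NoCase, RTrim, UnicodeCaseInsensitive }

// Equality and hashing must agree: values that compare equal under the
// collation must also produce the same hash code.
public sealed class CollationAwareEqualityComparer : IEqualityComparer<string?>
{
    private readonly CollationType _collation;

    public CollationAwareEqualityComparer(CollationType collation) => _collation = collation;

    // Assumed mapping from collation to a built-in StringComparer.
    private StringComparer Inner => _collation switch
    {
        CollationType.NoCase => StringComparer.OrdinalIgnoreCase,
        CollationType.UnicodeCaseInsensitive => StringComparer.InvariantCultureIgnoreCase,
        _ => StringComparer.Ordinal,
    };

    // RTRIM ignores trailing spaces before comparing or hashing.
    private string? Normalize(string? s) =>
        _collation == CollationType.RTrim ? s?.TrimEnd(' ') : s;

    public bool Equals(string? x, string? y) => Inner.Equals(Normalize(x), Normalize(y));

    public int GetHashCode(string? s)
    {
        string? normalized = Normalize(s);
        return normalized is null ? 0 : Inner.GetHashCode(normalized);
    }
}

public static class DistinctSketch
{
    // Keeps the first-seen spelling of each distinct value, in input order.
    public static List<string?> Distinct(IEnumerable<string?> values, CollationType collation)
    {
        var seen = new HashSet<string?>(new CollationAwareEqualityComparer(collation));
        var result = new List<string?>();
        foreach (string? value in values)
        {
            if (seen.Add(value)) result.Add(value);
        }
        return result;
    }
}
```

Which spelling survives deduplication (first seen, here) is a design choice the plan does not specify.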
| 90 | + |
| 91 | +### 5.3 GROUP BY Support (Collation-Aware Grouping) |
| 92 | + |
| 93 | +**Current Status:** Partial (infrastructure ready) |
| 94 | +**What Needs Implementation:** |
| 95 | + |
| 96 | +1. **Collation-aware grouping:** |
| 97 | + - Group rows by collation-sensitive columns |
| 98 | + - Example: `GROUP BY status` where status is NOCASE |
| 99 | + - 'pending', 'PENDING', 'Pending' → one group |
| 100 | + |
| 101 | +2. **Aggregates with collation:** |
| 102 | + - COUNT, SUM, AVG, MIN, MAX should be computed per collation-aware group |
| 103 | + - Ensure hash-based grouping uses collation |
| 104 | + |
| 105 | +3. **SQL: `SELECT status, COUNT(*) FROM orders GROUP BY status`** |
| 106 | + - If `status` is NOCASE: 'pending' and 'Pending' → one group with combined count |
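The `GROUP BY status` example above could be approximated with a dictionary keyed by a collation-aware comparer. This is only a sketch of hash-based grouping for `COUNT(*)`: `GroupBySketch` and `CountByKey` are hypothetical names, the comparer mapping is an assumption, and `CollationType` is a local stand-in for the project's enum.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in for the project's enum (member names taken from Phase 1).
public enum CollationType { Binary, NoCase, RTrim, UnicodeCaseInsensitive }

public static class GroupBySketch
{
    // Assumed mapping; RTrim is treated like Binary here to keep the sketch short.
    private static StringComparer ToComparer(CollationType collation) => collation switch
    {
        CollationType.NoCase => StringComparer.OrdinalIgnoreCase,
        CollationType.UnicodeCaseInsensitive => StringComparer.InvariantCultureIgnoreCase,
        _ => StringComparer.Ordinal,
    };

    // Rough equivalent of: SELECT key, COUNT(*) ... GROUP BY key
    public static Dictionary<string, int> CountByKey(IEnumerable<string> keys, CollationType collation)
    {
        // The dictionary's comparer decides which keys land in the same group.
        var counts = new Dictionary<string, int>(ToComparer(collation));
        foreach (string key in keys)
        {
            counts[key] = counts.TryGetValue(key, out int current) ? current + 1 : 1;
        }
        return counts;
    }
}
```

Other aggregates (SUM, AVG, MIN, MAX) would follow the same pattern, with a per-group accumulator in place of the integer count.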
| 107 | + |
| 108 | +### 5.4 ORDER BY with Collation (Correct Sorting) |
| 109 | + |
| 110 | +**Current Status:** Partial (indexes support it) |
| 111 | +**What Needs Implementation:** |
| 112 | + |
| 113 | +1. **Enhance Table.Select() ORDER BY:** |
| 114 | + - Use collation when sorting string columns |
| 115 | + - Example: `ORDER BY name` with NOCASE collation |
| 116 | + - Binary: ['Alice', 'alice', 'ALICE'] → sorted by ordinal (ASCII) value: ['ALICE', 'Alice', 'alice'] |
| 117 | + - NOCASE: All three compare equal, so ties keep their original order (stable sort) or fall back to a secondary sort key |
| 118 | + |
| 119 | +2. **Collation-aware Comparator:** |
| 120 | + - Use `BTree.CompareKeys()` logic (already implemented!) |
| 121 | + - Sort results using column collation |
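The difference between binary and NOCASE ordering can be illustrated with LINQ's `OrderBy`, which performs a stable sort, so values that compare equal keep their original relative order. This sketch does not use `BTree.CompareKeys()`; `OrderBySketch` is a hypothetical helper, the comparer mapping is an assumption, and `CollationType` is a local stand-in for the project's enum.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the project's enum (member names taken from Phase 1).
public enum CollationType { Binary, NoCase, RTrim, UnicodeCaseInsensitive }

public static class OrderBySketch
{
    // Assumed mapping; RTrim is treated like Binary here to keep the sketch short.
    private static StringComparer ToComparer(CollationType collation) => collation switch
    {
        CollationType.NoCase => StringComparer.OrdinalIgnoreCase,
        CollationType.UnicodeCaseInsensitive => StringComparer.InvariantCultureIgnoreCase,
        _ => StringComparer.Ordinal,
    };

    // Rough equivalent of: ORDER BY column ASC under the column's collation.
    public static List<string> Sort(IEnumerable<string> values, CollationType collation) =>
        values.OrderBy(v => v, ToComparer(collation)).ToList();
}
```

For the input `alice, Bob, ALICE, Alice`, binary ordering yields `ALICE, Alice, Bob, alice`, while NOCASE yields `alice, ALICE, Alice, Bob`. `List<T>.Sort` is not stable, so it would not guarantee the NOCASE tie order.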
| 122 | + |
| 123 | +### 5.5 Performance & Edge Cases |
| 124 | + |
| 125 | +**Considerations:** |
| 126 | +- Binary collation: Zero overhead (use default comparison) |
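| 127 | +- NOCASE: Compare ordinal case-insensitive comparison (`StringComparison.OrdinalIgnoreCase`) against culture-aware `String.Compare`, using `String.CompareOrdinal` as the binary baseline (measure impact) |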
| 128 | +- Composite keys: Each column uses its collation |
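| 129 | +- NULL handling: Collation never applies to NULL. DISTINCT and GROUP BY place all NULLs in one group; in WHERE comparisons, standard SQL evaluates `NULL = NULL` as unknown, so such rows are not matched |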
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## Implementation Tasks |
| 134 | + |
| 135 | +### Task 5.1: Create CollationComparator Utility |
| 136 | +**File:** `src/SharpCoreDB/CollationComparator.cs` |
| 137 | +**Purpose:** Centralized collation-aware comparison for runtime operations |
| 138 | + |
| 139 | +```csharp |
| 140 | +public static class CollationComparator |
| 141 | +{ |
| 142 | + /// <summary> |
| 143 | + /// Collation-aware string comparison for ORDER BY and filtering. |
| 144 | + /// Returns: -1 (left < right), 0 (equal), 1 (left > right) |
| 145 | + /// </summary> |
| 146 | + public static int Compare(string? left, string? right, CollationType collation) => throw new System.NotImplementedException(); // Body to be written in Task 5.1 |
| 147 | + |
| 148 | + /// <summary> |
| 149 | + /// Collation-aware LIKE pattern matching. |
| 150 | + /// Returns true if value matches pattern under given collation. |
| 151 | + /// </summary> |
| 152 | + public static bool Like(string value, string pattern, CollationType collation) => throw new System.NotImplementedException(); // Body to be written in Task 5.1 |
| 153 | +} |
| 154 | +``` |
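A possible implementation of the two signatures above is sketched here. It is illustrative only: the class is named `CollationComparatorSketch` to make clear it is not the planned file, the collation-to-`StringComparison` mapping and the NULLs-sort-first rule are assumptions, LIKE supports only `%` and `_` with no escape character, and `CollationType` is a local stand-in for the project's enum.

```csharp
#nullable enable
using System;

// Local stand-in for the project's enum (member names taken from Phase 1).
public enum CollationType { Binary, NoCase, RTrim, UnicodeCaseInsensitive }

public static class CollationComparatorSketch
{
    private static StringComparison ToComparison(CollationType collation) => collation switch
    {
        CollationType.NoCase => StringComparison.OrdinalIgnoreCase,
        CollationType.UnicodeCaseInsensitive => StringComparison.InvariantCultureIgnoreCase,
        _ => StringComparison.Ordinal,
    };

    // Returns -1, 0, or 1. NULL sorts before any non-NULL value (an assumption).
    public static int Compare(string? left, string? right, CollationType collation)
    {
        if (left is null || right is null)
            return left is null ? (right is null ? 0 : -1) : 1;

        if (collation == CollationType.RTrim)
        {
            left = left.TrimEnd(' ');
            right = right.TrimEnd(' ');
        }
        // string.Compare may return any negative or positive value; normalize it.
        return Math.Sign(string.Compare(left, right, ToComparison(collation)));
    }

    // LIKE with '%' (any run of characters) and '_' (exactly one character).
    // Uses backtracking to the most recent '%'; RTrim is not applied here.
    public static bool Like(string value, string pattern, CollationType collation)
    {
        bool ignoreCase = collation == CollationType.NoCase
            || collation == CollationType.UnicodeCaseInsensitive;

        int v = 0, p = 0;           // cursors into value and pattern
        int starP = -1, starV = 0;  // position of the last '%' and the value index it covers

        while (v < value.Length)
        {
            if (p < pattern.Length && pattern[p] == '%')
            {
                starP = p++;        // remember the wildcard, try matching zero characters first
                starV = v;
            }
            else if (p < pattern.Length && (pattern[p] == '_' || CharEquals(value[v], pattern[p], ignoreCase)))
            {
                v++;
                p++;
            }
            else if (starP >= 0)
            {
                p = starP + 1;      // backtrack: let the last '%' absorb one more character
                v = ++starV;
            }
            else
            {
                return false;
            }
        }

        while (p < pattern.Length && pattern[p] == '%') p++; // trailing '%' matches empty
        return p == pattern.Length;
    }

    private static bool CharEquals(char a, char b, bool ignoreCase) =>
        ignoreCase ? char.ToUpperInvariant(a) == char.ToUpperInvariant(b) : a == b;
}
```

Under these assumptions, `Like("ALICE", "%ice", CollationType.NoCase)` returns `true`, while the same call with `CollationType.Binary` returns `false`.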
| 155 | + |
| 156 | +### Task 5.2: Enhance Table.CRUD.cs |
| 157 | +**File:** `src/SharpCoreDB/DataStructures/Table.CRUD.cs` |
| 158 | +**Changes:** |
| 159 | +- Update `EvaluateCondition()` to use `CollationComparator` |
| 160 | +- Add collation handling for `=`, `<>`, `>`, `<`, `>=`, `<=`, `LIKE` |
| 161 | +- Modify `Select()` to accept `distinct` parameter |
| 162 | +- Add `GROUP BY` support in `Select()` method |
| 163 | + |
| 164 | +### Task 5.3: Add Integration Tests |
| 165 | +**File:** `tests/SharpCoreDB.Tests/CollationPhase5Tests.cs` |
| 166 | +**Test Cases:** |
| 167 | +1. WHERE clause with NOCASE: Find rows case-insensitively |
| 168 | +2. DISTINCT with NOCASE: Deduplicate case-insensitively |
| 169 | +3. GROUP BY with NOCASE: Group case-insensitively |
| 170 | +4. ORDER BY with NOCASE: Sort with collation rules |
| 171 | +5. LIKE with NOCASE: Pattern match case-insensitively |
| 172 | +6. Mixed collations: Different columns, different collations |
| 173 | +7. Composite filters: WHERE + GROUP BY + ORDER BY together |
| 174 | + |
| 175 | +### Task 5.4: Benchmarks |
| 176 | +**File:** `tests/SharpCoreDB.Benchmarks/Phase5_CollationQueryPerformanceBenchmark.cs` |
| 177 | +**Scenarios:** |
| 178 | +- WHERE with Binary vs NOCASE (1K, 10K, 100K rows) |
| 179 | +- DISTINCT with Binary vs NOCASE |
| 180 | +- GROUP BY performance |
| 181 | +- ORDER BY performance |
| 182 | +- Combined query performance |
| 183 | + |
| 184 | +### Task 5.5: Documentation |
| 185 | +**File:** `docs/COLLATE_PHASE5_COMPLETE.md` |
| 186 | +**Content:** |
| 187 | +- Summary of runtime optimization implementation |
| 188 | +- Examples of Phase 5 features |
| 189 | +- Performance metrics from benchmarks |
| 190 | +- Migration guide for users |
| 191 | + |
| 192 | +--- |
| 193 | + |
| 194 | +## Success Criteria |
| 195 | + |
| 196 | +✅ **Functional:** |
| 197 | +- WHERE clauses respect column collations |
| 198 | +- DISTINCT deduplicates based on collation |
| 199 | +- GROUP BY groups based on collation |
| 200 | +- ORDER BY sorts correctly with collation |
| 201 | +- LIKE operator works with collation |
| 202 | + |
| 203 | +✅ **Performance:** |
| 204 | +- Binary collation: Zero overhead |
| 205 | +- NOCASE: <5% perf overhead vs binary (measured via benchmarks) |
| 206 | +- Large dataset: No memory leaks, constant allocation per row |
| 207 | + |
| 208 | +✅ **Testing:** |
| 209 | +- 7+ integration tests with >90% code coverage |
| 210 | +- Benchmarks demonstrate performance characteristics |
| 211 | +- All existing tests still pass (no regression) |
| 212 | + |
| 213 | +✅ **Documentation:** |
| 214 | +- Phase 5 completion document generated |
| 215 | +- Examples of collation-aware queries provided |
| 216 | +- Performance metrics documented |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## Timeline |
| 221 | + |
| 222 | +| Task | Estimated Time | Dependencies | |
| 223 | +|------|---|---| |
| 224 | +| 5.1: CollationComparator | 1 hour | None | |
| 225 | +| 5.2: Table.CRUD enhancements | 2 hours | 5.1 | |
| 226 | +| 5.3: Integration tests | 1.5 hours | 5.2 | |
| 227 | +| 5.4: Benchmarks | 1 hour | 5.2 | |
| 228 | +| 5.5: Documentation | 0.5 hours | 5.2, 5.3, 5.4 | |
| 229 | +| **Total** | **6 hours** | - | |
| 230 | + |
| 231 | +--- |
| 232 | + |
| 233 | +## Related Issues & PRs |
| 234 | + |
| 235 | +- **Phase 4 Completion:** [COLLATE_PHASE4_COMPLETE.md](COLLATE_PHASE4_COMPLETE.md) |
| 236 | +- **EF Core Integration:** [EFCORE_COLLATE_COMPLETE.md](EFCORE_COLLATE_COMPLETE.md) |
| 237 | +- **Collation Types:** `src/SharpCoreDB/CollationType.cs` |
| 238 | +- **Collation Extensions:** `src/SharpCoreDB/CollationExtensions.cs` |
| 239 | + |
| 240 | +--- |
| 241 | + |
| 242 | +## Next Phase (Phase 6+) |
| 243 | + |
| 244 | +After Phase 5: |
| 245 | +- **Phase 6:** Schema Migration & ALTER TABLE |
| 246 | +- **Phase 7:** Performance Optimization (vectorized comparisons, SIMD) |
| 247 | +- **Phase 8:** Documentation & Tutorial |
| 248 | + |