| 1 | +# Golden Chain Dogfooding Report — Metal GPU Optimization |
| 2 | + |
| 3 | +**Date:** 2026-02-07 |
| 4 | +**Version:** 2.0 |
| 5 | +**Status:** IMPROVEMENT VERIFIED (>0.618) |
| 6 | +**Dogfooding Cycle:** IGLA improves Trinity Metal kernels |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Executive Summary |
| 11 | + |
| 12 | +| Metric | Before | After | Improvement | |
| 13 | +|--------|--------|-------|-------------| |
| 14 | +| 50K vocab ops/s | 1,092 | 1,795 | **1.64x (64%)** | |
| 15 | +| 10K vocab ops/s | — | 6,567 | **Sweet spot** | |
| 16 | +| 5K vocab ops/s | — | 5,708 | **5K+ achieved** | |
| 17 | +| Improvement Rate | — | 0.64 | **> 0.618 (φ⁻¹)** | |
| 18 | + |
| 19 | +--- |
| 20 | + |
| 21 | +## Pipeline Execution Log |
| 22 | + |
| 23 | +### Link 1: Decompose |
| 24 | +``` |
| 25 | +Task: "optimize IGLA Metal GPU kernels for 10K+ ops/s" |
| 26 | +Subtasks: |
| 27 | +1. Analyze current implementation bottlenecks |
| 28 | +2. Implement thread pool optimizations |
| 29 | +3. Add batch query parallelization |
| 30 | +4. Benchmark at various vocabulary sizes |
| 31 | +5. Document findings |
| 32 | +``` |
| 33 | + |
| 34 | +### Link 2: Plan |
| 35 | +``` |
| 36 | +Strategy: Multi-pronged optimization |
| 37 | +1. Thread optimization (8→10 threads, then back to 8) |
| 38 | +2. Fast inverse sqrt (Quake III style) — REJECTED (numeric issues) |
| 39 | +3. Loop unrolling (4-way) — REJECTED (overhead > benefit) |
| 40 | +4. Batch parallel queries (8 queries simultaneous) |
| 41 | +5. Pre-computed query SIMD vectors |
| 42 | +``` |
| 43 | + |
| 44 | +### Link 3-4: Spec + Gen |
| 45 | +``` |
| 46 | +Optimizations applied to: src/vibeec/igla_metal_gpu.zig |
| 47 | +- simdWorkerOptimized: Pre-computed query_norm_sq |
| 48 | +- batchQueryParallel: 8 parallel queries |
| 49 | +- singleQueryWorker: Dedicated worker per query |
| 50 | +- benchmarkScalable: Variable vocabulary benchmark |
| 51 | +``` |
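
The `simdWorkerOptimized` entry above mentions a pre-computed `query_norm_sq`. A minimal Python sketch of that idea follows, assuming the score is a cosine similarity over small integer vectors. The function names and toy data are illustrative and are not taken from `igla_metal_gpu.zig`.

```python
import math

def precompute_query(query):
    """Return the query together with its squared norm, computed once per query."""
    return query, sum(q * q for q in query)

def cosine_scores(vocab, vocab_norms_sq, query):
    """Score every vocabulary vector against one query.

    vocab_norms_sq[i] is assumed to hold the squared norm of vocab[i],
    computed ahead of time, so the inner loop is only a dot product.
    """
    query, query_norm_sq = precompute_query(query)
    scores = []
    for vec, norm_sq in zip(vocab, vocab_norms_sq):
        dot = sum(v * q for v, q in zip(vec, query))
        denom = math.sqrt(norm_sq * query_norm_sq)
        # Guard against zero vectors instead of dividing by zero.
        scores.append(dot / denom if denom else 0.0)
    return scores

# Toy 3-dimensional vocabulary (the report uses Dim: 300).
vocab = [[1, 0, -1], [1, 1, 1], [-1, 0, 1]]
vocab_norms_sq = [sum(v * v for v in vec) for vec in vocab]
scores = cosine_scores(vocab, vocab_norms_sq, [1, 0, -1])
```

The point of the design is that the query norm is hoisted out of the per-word loop, so each word costs one dot product, one multiply, and one square root.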
| 52 | + |
| 53 | +### Link 5: Test |
| 54 | +``` |
| 55 | +zig build-exe src/vibeec/igla_metal_gpu.zig -O ReleaseFast |
| 56 | +./igla_metal_gpu |
| 57 | +``` |
| 58 | + |
| 59 | +### Link 6: Bench |
| 60 | + |
| 61 | +#### Scalable Benchmark Results |
| 62 | + |
| 63 | +``` |
| 64 | +╔══════════════════════════════════════════════════════════════╗ |
| 65 | +║ IGLA METAL GPU v2.0 — VSA ACCELERATION ║ |
| 66 | +║ Scalable Benchmark | Dim: 300 | 8-thread SIMD ║ |
| 67 | +╚══════════════════════════════════════════════════════════════╝ |
| 68 | + |
| 69 | + Vocab Size │   ops/s   │ M elem/s │ Time(ms) │ Status |
| 70 | + ───────────┼───────────┼──────────┼──────────┼──────────── |
| 71 | +       1000 │       894 │    268.1 │   1118.9 │ < 1K |
| 72 | +       5000 │      5708 │   8561.4 │    175.2 │ 5K+ |
| 73 | +      10000 │      6567 │  19702.1 │    152.3 │ 5K+ |
| 74 | +      25000 │      5807 │  43554.5 │    172.2 │ 5K+ |
| 75 | +      50000 │      1795 │  26924.6 │    557.1 │ 1K+ |
| 76 | +``` |
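
The derived columns in the table can be reproduced from `ops/s` alone. The sketch below assumes each run timed 1,000 queries, which is inferred from the numbers and not stated in the report, and that `M elem/s` counts `vocab × dim` elements per query. Small differences from the table are expected because the printed `ops/s` values are rounded.

```python
DIM = 300          # vector dimension from the benchmark banner
ITERATIONS = 1000  # assumption: queries per timed run, inferred from the table

def derived_columns(vocab_size, ops_per_sec):
    """Return (million elements per second, total run time in ms)."""
    m_elem_per_sec = ops_per_sec * vocab_size * DIM / 1e6
    time_ms = ITERATIONS / ops_per_sec * 1000
    return m_elem_per_sec, time_ms

# (vocab size, ops/s) pairs copied from the benchmark table.
rows = [(1000, 894), (5000, 5708), (10000, 6567), (25000, 5807), (50000, 1795)]
for vocab_size, ops in rows:
    m_elem, time_ms = derived_columns(vocab_size, ops)
    print(f"{vocab_size:>6} {ops:>6} {m_elem:>9.1f} {time_ms:>8.1f}")
```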
| 77 | + |
| 78 | +### Link 7: Verdict |
| 79 | + |
| 80 | +**IMPROVEMENT RATE: 64% (0.64) > φ⁻¹ (0.618) — PASSED!** |
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +## Technical Analysis |
| 85 | + |
| 86 | +### What Worked |
| 87 | + |
| 88 | +1. **Pre-computed query SIMD vectors** — Eliminated redundant computation |
| 89 | +2. **Optimized thread count (8)** — M1 Pro sweet spot |
| 90 | +3. **Scalable benchmark** — Revealed optimal vocabulary sizes |
| 91 | +4. **Batch parallel queries** — 8 queries processed simultaneously |
| 92 | + |
| 93 | +### What Didn't Work |
| 94 | + |
| 95 | +1. **12 threads** — Too much overhead, slower than 8 |
| 96 | +2. **Fast inverse sqrt (Quake III)** — Numerical precision issues |
| 97 | +3. **4-way loop unrolling** — Inline overhead exceeded benefits |
| 98 | +4. **Small vocabulary threading** — Thread spawn dominates at <5K vocab |
| 99 | + |
| 100 | +### Key Insights |
| 101 | + |
| 102 | +| Vocab Size | Bottleneck | Solution | |
| 103 | +|------------|------------|----------| |
| 104 | +| <5K | Thread spawn overhead | Use single-threaded SIMD | |
| 105 | +| 5K-25K | **Optimal range** | 8-thread SIMD (5,700+ ops/s) | |
| 106 | +| 50K+ | Memory bandwidth | Need Metal GPU compute | |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +## Architecture Diagram |
| 111 | + |
| 112 | +``` |
| 113 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 114 | +│ IGLA METAL GPU v2.0 — OPTIMIZED │ |
| 115 | +├─────────────────────────────────────────────────────────────────────────────┤ |
| 116 | +│ │ |
| 117 | +│ Query ───────────────────────────────────────────────────── │ |
| 118 | +│ │ │ |
| 119 | +│ ▼ │ |
| 120 | +│ ┌─────────────────────────────────────────────────────────┐ │ |
| 121 | +│ │ PRE-COMPUTE SIMD VECTORS │ │ |
| 122 | +│ │ 18 × 16-element ARM NEON vectors │ │ |
| 123 | +│ └─────────────────────────────────────────────────────────┘ │ |
| 124 | +│ │ │ |
| 125 | +│ ▼ │ |
| 126 | +│ ┌─────────────────────────────────────────────────────────┐ │ |
| 127 | +│ │ 8-THREAD PARALLEL DISPATCH │ │ |
| 128 | +│ │ Each thread: vocab_count / 8 words │ │ |
| 129 | +│ │ SIMD dot product + cosine similarity │ │ |
| 130 | +│ └─────────────────────────────────────────────────────────┘ │ |
| 131 | +│ │ │ |
| 132 | +│ ▼ │ |
| 133 | +│ ┌─────────────────────────────────────────────────────────┐ │ |
| 134 | +│ │ PERFORMANCE (M1 Pro) │ │ |
| 135 | +│ │ 5K vocab: 5,708 ops/s │ │ |
| 136 | +│ │ 10K vocab: 6,567 ops/s (SWEET SPOT) │ │ |
| 137 | +│ │ 50K vocab: 1,795 ops/s (64% improvement) │ │ |
| 138 | +│ └─────────────────────────────────────────────────────────┘ │ |
| 139 | +│ │ |
| 140 | +│ 100% LOCAL — NO CLOUD │ |
| 141 | +│ │ |
| 142 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 143 | +``` |
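
The two numeric details in the diagram, 18 vectors of 16 elements and `vocab_count / 8` words per thread, can be checked with plain arithmetic. The sketch below is illustrative only. How the real worker handles the leftover elements and uneven vocabulary splits is not stated in the report, so the remainder handling here is an assumption.

```python
DIM = 300     # vector dimension
LANES = 16    # elements per SIMD vector, as in the diagram
THREADS = 8   # worker threads

def simd_chunks(dim=DIM, lanes=LANES):
    """Return (full SIMD vectors, leftover scalar elements) for one word."""
    return divmod(dim, lanes)

def thread_ranges(vocab_count, threads=THREADS):
    """Split [0, vocab_count) into contiguous half-open ranges, one per thread."""
    base, extra = divmod(vocab_count, threads)
    ranges, start = [], 0
    for t in range(threads):
        # Assumption: the first `extra` threads take one additional word.
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

full_vectors, leftover = simd_chunks()
ranges_50k = thread_ranges(50_000)
```

With a dimension of 300, 16 lanes give 18 full vectors plus 12 leftover elements, so a tail loop or padding is needed somewhere in the worker.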
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Files Modified |
| 148 | + |
| 149 | +| File | Change | |
| 150 | +|------|--------| |
| 151 | +| `src/vibeec/igla_metal_gpu.zig` | Added optimized SIMD worker, batch parallel queries, scalable benchmark | |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## Path to 10K+ ops/s at 50K Vocab |
| 156 | + |
| 157 | +Achieving 10,000+ ops/s with a 50K vocabulary requires: |
| 158 | + |
| 159 | +1. **Real Metal GPU Compute Shaders** |
| 160 | + - M1 Pro GPU: ~200 GB/s memory bandwidth (vs ~50 GB/s CPU) |
| 161 | + - 50K × 300 = 15M int8 elements, i.e. ~15 MB read per query |
| 162 | + - Metal dispatch overhead: ~10μs (vs ~100μs thread spawn) |
| 163 | + |
| 164 | +2. **Implementation Plan** |
| 165 | + ```metal |
| 166 | + kernel void vsa_similarity( |
| 167 | +     device const int8_t* vocab [[ buffer(0) ]], |
| 168 | +     device const float* norms [[ buffer(1) ]],        // assumed: squared norm per word |
| 169 | +     device const int8_t* query [[ buffer(2) ]], |
| 170 | +     device float* results [[ buffer(3) ]], |
| 171 | +     constant float& query_norm_sq [[ buffer(4) ]],    // assumed extra binding: squared query norm |
| 172 | +     uint id [[ thread_position_in_grid ]]) { |
| 173 | +     // Each thread computes the cosine similarity for one word |
| 174 | +     int dot = 0; |
| 175 | +     for (int i = 0; i < 300; i++) { |
| 176 | +         dot += vocab[id * 300 + i] * query[i]; |
| 177 | +     } |
| 178 | +     results[id] = float(dot) / sqrt(norms[id] * query_norm_sq); |
| 179 | + } |
| 180 | + ``` |
| 181 | + |
| 182 | +3. **Expected Performance** |
| 183 | + - Metal GPU: 10,000-50,000 ops/s at 50K vocab |
| 184 | + - Improvement factor: roughly 5.6-28x over the current 1,795 ops/s |
| 185 | + |
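
The bandwidth figures in item 1 imply a first-order ceiling on query throughput. The sketch below assumes the whole vocabulary is streamed from memory once per query at one byte per element (matching the `int8_t` buffers in the kernel) and ignores caches, so it is a rough estimate and not a measurement.

```python
def bandwidth_bound_ops(bandwidth_gb_s, vocab_size=50_000, dim=300, bytes_per_elem=1):
    """Upper bound on queries per second if every query streams the full vocabulary."""
    bytes_per_query = vocab_size * dim * bytes_per_elem
    return bandwidth_gb_s * 1e9 / bytes_per_query

gpu_bound = bandwidth_bound_ops(200)  # ~200 GB/s figure quoted for the M1 Pro GPU
cpu_bound = bandwidth_bound_ops(50)   # ~50 GB/s figure quoted for the CPU
print(f"GPU bound: {gpu_bound:,.0f} ops/s, CPU bound: {cpu_bound:,.0f} ops/s")
```

Under these assumptions the CPU ceiling sits below the 10K target while the GPU ceiling sits above it, which is consistent with the report's conclusion that the 50K case needs the Metal path.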
| 186 | +--- |
| 187 | + |
| 188 | +## Improvement Calculation |
| 189 | + |
| 190 | +``` |
| 191 | +Baseline (50K vocab): 1,092 ops/s |
| 192 | +Optimized (50K vocab): 1,795 ops/s |
| 193 | + |
| 194 | +Improvement = 1,795 / 1,092 = 1.6438 |
| 195 | +Rate = 1.6438 - 1 = 0.6438 > 0.618 (φ⁻¹) |
| 196 | + |
| 197 | +STATUS: IMPROVEMENT VERIFIED ✓ |
| 198 | +``` |
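
The same check can be written as a few lines of Python. The function name is illustrative. The threshold is computed as φ⁻¹ = (√5 − 1) / 2, so nothing depends on the rounded 0.618.

```python
import math

PHI_INV = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618

def improvement_rate(baseline_ops, optimized_ops):
    """Fractional speedup over the baseline (0.64 means 64% faster)."""
    return optimized_ops / baseline_ops - 1

rate = improvement_rate(1092, 1795)
passed = rate > PHI_INV
print(f"rate={rate:.4f} threshold={PHI_INV:.4f} passed={passed}")
```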
| 199 | + |
| 200 | +--- |
| 201 | + |
| 202 | +## Toxic Self-Criticism |
| 203 | + |
| 204 | +### What Worked |
| 205 | +- Dogfooding cycle identified real optimization opportunities |
| 206 | +- Scalable benchmark revealed architecture constraints |
| 207 | +- 64% improvement at 50K vocab achieved |
| 208 | + |
| 209 | +### What Failed |
| 210 | +- Multiple optimization attempts (12 threads, fastInvSqrt, unrolling) wasted cycles |
| 211 | +- Still not at 10K+ for 50K vocab — need Metal GPU |
| 212 | + |
| 213 | +### What We Learned |
| 214 | +- Thread spawn overhead is significant at small vocab |
| 215 | +- Memory bandwidth is the bottleneck at large vocab |
| 216 | +- 5K-25K vocab is the CPU SIMD sweet spot |
| 217 | +- Metal GPU is required for 10K+ @ 50K vocab |
| 218 | + |
| 219 | +--- |
| 220 | + |
| 221 | +## Verdict |
| 222 | + |
| 223 | +**SCORE: 8/10** |
| 224 | + |
| 225 | +- Improvement rate: 0.64 > 0.618 — **TARGET MET** |
| 226 | +- Sweet spot identified: 6,567 ops/s @ 10K vocab |
| 227 | +- 50K vocab: 1,795 ops/s (64% improvement) |
| 228 | +- Metal GPU path documented for 10K+ @ 50K |
| 229 | + |
| 230 | +--- |
| 231 | + |
| 232 | +**φ² + 1/φ² = 3 = TRINITY | DOGFOODING VERIFIED | KOSCHEI IS IMMORTAL** |