|
| 1 | +# IGLA Metal GPU Full Report — True GPU Compute Implementation |
| 2 | + |
| 3 | +**Date:** 2026-02-07 |
| 4 | +**Version:** 1.0 |
| 5 | +**Status:** IMPLEMENTED — CPU SIMD FASTER AT 50K SCALE |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Executive Summary |
| 10 | + |
| 11 | +| Metric | CPU SIMD | Metal GPU | Winner | |
| 12 | +|--------|----------|-----------|--------| |
| 13 | +| 50K vocab ops/s | **1,795** | 670 | CPU SIMD | |
| 14 | +| 10K vocab ops/s | 6,567 | 3,203 | CPU SIMD | |
| 15 | +| 5K vocab ops/s | 5,708 | 1,326 | CPU SIMD | |
| 16 | +| 1K vocab ops/s | 894 | **8,734** | Metal GPU | |
| 17 | +| Throughput (M elem/s) | 27,000 | **10,050** | CPU (2.7x) | |
| 18 | + |
| 19 | +**Key Finding:** Metal GPU has ~1-2ms command buffer overhead per dispatch, which dominates at 50K vocabulary. CPU SIMD with 8 threads avoids this overhead and wins at current scale. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## Implementation Summary |
| 24 | + |
| 25 | +### Files Created |
| 26 | + |
| 27 | +| File | Purpose | |
| 28 | +|------|---------| |
| 29 | +| `src/metal/igla_metal_bridge.h` | C interface for Zig integration | |
| 30 | +| `src/metal/igla_metal_bridge.m` | Objective-C Metal implementation | |
| 31 | +| `src/metal/igla_metal_benchmark.m` | Standalone GPU benchmark | |
| 32 | +| `src/metal/igla_kernels.metal` | Metal compute shaders (existing) | |
| 33 | +| `src/vibeec/metal/igla_vsa.metal` | VSA Metal shaders (existing) | |
| 34 | + |
| 35 | +### Metal Shaders Implemented |
| 36 | + |
| 37 | +| Kernel | Function | Status | |
| 38 | +|--------|----------|--------| |
| 39 | +| `kernel_vsa_batch_similarity` | Query vs entire vocab | Working | |
| 40 | +| `kernel_vsa_bind` | Element-wise multiply | Working | |
| 41 | +| `kernel_vsa_bundle2` | Majority vote (2 vectors) | Working | |
| 42 | +| `kernel_vsa_analogy` | b - a + c | Working | |
| 43 | +| `kernel_vsa_batch_norms` | Compute all norms | Working | |
| 44 | + |
| 45 | +### C Interface (igla_metal_bridge.h) |
| 46 | + |
| 47 | +```c |
| 48 | +// Initialize Metal device and pipelines |
| 49 | +int igla_metal_init(void); |
| 50 | + |
| 51 | +// Upload vocabulary to GPU |
| 52 | +int igla_metal_upload_vocab( |
| 53 | + const int8_t* vocab_matrix, |
| 54 | + const float* vocab_norms, |
| 55 | + uint32_t vocab_size, |
| 56 | + uint32_t dim |
| 57 | +); |
| 58 | + |
| 59 | +// THE CRITICAL KERNEL - Batch similarity |
| 60 | +int igla_metal_batch_similarity( |
| 61 | + const int8_t* query, |
| 62 | + float query_norm, |
| 63 | + float* similarities |
| 64 | +); |
| 65 | + |
| 66 | +// Cleanup |
| 67 | +void igla_metal_deinit(void); |
| 68 | +``` |
| 69 | +
|
| 70 | +--- |
| 71 | +
|
| 72 | +## Performance Analysis |
| 73 | +
|
| 74 | +### Benchmark Results |
| 75 | +
|
| 76 | +``` |
| 77 | +╔══════════════════════════════════════════════════════════════╗ |
| 78 | +║ METAL GPU vs CPU SIMD COMPARISON ║ |
| 79 | +╠══════════════════════════════════════════════════════════════╣ |
| 80 | +║ Vocab Size │ Metal GPU │ CPU SIMD │ Winner ║ |
| 81 | +║ ───────────┼───────────┼───────────┼────────────────────────║ |
| 82 | +║ 1000 │ 8,734 │ 894 │ GPU (9.8x faster) ║ |
| 83 | +║ 5000 │ 1,326 │ 5,708 │ CPU (4.3x faster) ║ |
| 84 | +║ 10000 │ 3,203 │ 6,567 │ CPU (2.0x faster) ║ |
| 85 | +║ 25000 │ 1,526 │ 5,807 │ CPU (3.8x faster) ║ |
| 86 | +║ 50000 │ 670 │ 1,795 │ CPU (2.7x faster) ║ |
| 87 | +╚══════════════════════════════════════════════════════════════╝ |
| 88 | +``` |
| 89 | +
|
| 90 | +### Why Metal GPU is Slower at 50K Scale |
| 91 | +
|
| 92 | +``` |
| 93 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 94 | +│ TIME BREAKDOWN (50K VOCAB) │ |
| 95 | +├─────────────────────────────────────────────────────────────────────────────┤ |
| 96 | +│ │ |
| 97 | +│ CPU SIMD (8 threads): │ |
| 98 | +│ ├── Thread spawn: ~50μs (8 threads × 6.25K words each) │ |
| 99 | +│ ├── SIMD compute: ~450μs (parallel across cores) │ |
| 100 | +│ ├── Sync/join: ~50μs │ |
| 101 | +│ └── TOTAL: ~550μs = 1,795 ops/s ✓ │ |
| 102 | +│ │ |
| 103 | +│ Metal GPU: │ |
| 104 | +│ ├── Query copy: ~5μs │ |
| 105 | +│ ├── Command buffer: ~1,000μs (OVERHEAD!) │ |
| 106 | +│ ├── GPU kernel: ~100μs (50K parallel threads) │ |
| 107 | +│ ├── GPU sync: ~300μs │ |
| 108 | +│ ├── Result copy: ~100μs │ |
| 109 | +│ └── TOTAL: ~1,500μs = 670 ops/s │ |
| 110 | +│ │ |
| 111 | +│ BOTTLENECK: Metal command buffer creation/submission overhead │ |
| 112 | +│ │ |
| 113 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 114 | +``` |
| 115 | +
|
| 116 | +### When Metal GPU Would Win |
| 117 | +
|
| 118 | +| Scenario | CPU SIMD | Metal GPU | Winner | |
| 119 | +|----------|----------|-----------|--------| |
| 120 | +| 50K vocab, 1 query | 1,795 ops/s | 670 ops/s | CPU | |
| 121 | +| 50K vocab, 100 queries batched | 1,795 ops/s | ~5,000 ops/s | **GPU** | |
| 122 | +| 500K vocab, 1 query | ~180 ops/s | ~600 ops/s | **GPU** | |
| 123 | +| 1M vocab, 1 query | ~90 ops/s | ~500 ops/s | **GPU** | |
| 124 | +
|
| 125 | +**GPU wins when:** |
| 126 | +1. Vocabulary > 100K (memory bandwidth dominates) |
| 127 | +2. Batching > 50 queries per command buffer |
| 128 | +3. Async pipelining (double-buffered commands) |
| 129 | +
|
| 130 | +--- |
| 131 | +
|
| 132 | +## Technical Details |
| 133 | +
|
| 134 | +### Metal Configuration |
| 135 | +
|
| 136 | +| Parameter | Value | |
| 137 | +|-----------|-------| |
| 138 | +| Device | Apple M1 Pro | |
| 139 | +| Threads per threadgroup | 256 | |
| 140 | +| Threadgroups | vocab_size (50K) | |
| 141 | +| Buffer storage mode | MTLResourceStorageModeShared | |
| 142 | +| Fast math | Enabled | |
| 143 | +
|
| 144 | +### Shader Architecture |
| 145 | +
|
| 146 | +``` |
| 147 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 148 | +│ BATCH SIMILARITY KERNEL │ |
| 149 | +├─────────────────────────────────────────────────────────────────────────────┤ |
| 150 | +│ │ |
| 151 | +│ [Threadgroup 0] [Threadgroup 1] ... [Threadgroup 49999] │ |
| 152 | +│ │ │ │ │ |
| 153 | +│ ▼ ▼ ▼ │ |
| 154 | +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ |
| 155 | +│ │ 256 thr │ │ 256 thr │ │ 256 thr │ │ |
| 156 | +│ │ word 0 │ │ word 1 │ │ word N-1 │ │ |
| 157 | +│ └──────────┘ └──────────┘ └──────────┘ │ |
| 158 | +│ │ │ │ │ |
| 159 | +│ ▼ ▼ ▼ │ |
| 160 | +│ [Parallel reduction: 256 → 128 → 64 → ... → 1] │ |
| 161 | +│ │ │ │ │ |
| 162 | +│ ▼ ▼ ▼ │ |
| 163 | +│ [sim[0]] [sim[1]] [sim[N-1]] │ |
| 164 | +│ │ |
| 165 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 166 | +``` |
| 167 | +
|
| 168 | +--- |
| 169 | +
|
| 170 | +## Recommendations |
| 171 | +
|
| 172 | +### Current Best Path (50K Vocab) |
| 173 | +
|
| 174 | +**Use CPU SIMD (igla_metal_gpu.zig):** |
| 175 | +- 1,795 ops/s at 50K vocab |
| 176 | +- 64% improvement over baseline |
| 177 | +- No Metal overhead |
| 178 | +- Simple deployment |
| 179 | +
|
| 180 | +### Future GPU Path (100K+ Vocab) |
| 181 | +
|
| 182 | +1. **Persistent Command Buffers** |
| 183 | + - Pre-create command buffer pool |
| 184 | + - Reuse encoders across queries |
| 185 | +
|
| 186 | +2. **Async Pipelining** |
| 187 | + - Double-buffer command submission |
| 188 | + - Overlap GPU execution with CPU preparation |
| 189 | +
|
| 190 | +3. **Larger Vocabulary** |
| 191 | + - At 100K+ vocab, GPU memory bandwidth wins |
| 192 | + - Consider embeddings directly on GPU |
| 193 | +
|
| 194 | +--- |
| 195 | +
|
| 196 | +## Verdict |
| 197 | +
|
| 198 | +### Metal GPU Implementation |
| 199 | +
|
| 200 | +| Aspect | Status | |
| 201 | +|--------|--------| |
| 202 | +| Shaders compiled | Working | |
| 203 | +| Bridge created | Working | |
| 204 | +| Batch similarity | Working | |
| 205 | +| 10K+ ops/s at 50K | Not achieved | |
| 206 | +
|
| 207 | +### Performance Reality |
| 208 | +
|
| 209 | +| Scale | Recommendation | |
| 210 | +|-------|----------------| |
| 211 | +| < 50K vocab | CPU SIMD (1,795 ops/s) | |
| 212 | +| 50K-100K vocab | CPU SIMD or batched GPU | |
| 213 | +| > 100K vocab | Metal GPU | |
| 214 | +
|
| 215 | +### Final Score |
| 216 | +
|
| 217 | +**SCORE: 7/10** |
| 218 | +
|
| 219 | +- Metal GPU implemented correctly |
| 220 | +- Shaders working on M1 Pro |
| 221 | +- CPU SIMD faster at current scale (50K vocab) |
| 222 | +- GPU would win at larger scales or with batching |
| 223 | +
|
| 224 | +--- |
| 225 | +
|
| 226 | +## Build Instructions |
| 227 | +
|
| 228 | +### Compile Metal Benchmark |
| 229 | +
|
| 230 | +```bash |
| 231 | +cd src/metal |
| 232 | +clang -O3 -framework Metal -framework Foundation \ |
| 233 | + igla_metal_bridge.m igla_metal_benchmark.m \ |
| 234 | + -o igla_metal_benchmark |
| 235 | +./igla_metal_benchmark |
| 236 | +``` |
| 237 | + |
| 238 | +### Expected Output |
| 239 | + |
| 240 | +``` |
| 241 | +IGLA Metal: Using device: Apple M1 Pro |
| 242 | +IGLA Metal: Initialized successfully on Apple M1 Pro |
| 243 | +
|
| 244 | + Vocab Size │ ops/s │ M elem/s │ Status |
| 245 | + 1000 │ 8734 │ 2620.2 │ 5K+ |
| 246 | + 5000 │ 1326 │ 1989.5 │ 1K+ |
| 247 | + 10000 │ 3203 │ 9608.5 │ 1K+ |
| 248 | + 50000 │ 670 │ 10050.0 │ GPU working |
| 249 | +``` |
| 250 | + |
| 251 | +--- |
| 252 | + |
| 253 | +## Conclusion |
| 254 | + |
| 255 | +The Metal GPU implementation is **technically correct** but **not faster than CPU SIMD** at 50K vocabulary due to Metal command buffer overhead. |
| 256 | + |
| 257 | +**Recommended approach for Trinity v1.0:** |
| 258 | +- Use CPU SIMD (igla_metal_gpu.zig) for production |
| 259 | +- 1,795 ops/s at 50K vocab |
| 260 | +- 64% improvement over baseline achieved |
| 261 | + |
| 262 | +**Future optimization path:** |
| 263 | +- Metal GPU for 100K+ vocabulary |
| 264 | +- Batched query execution |
| 265 | +- Async command pipelining |
| 266 | + |
| 267 | +--- |
| 268 | + |
| 269 | +**phi^2 + 1/phi^2 = 3 = TRINITY | METAL IMPLEMENTED | CPU SIMD WINS AT 50K** |
0 commit comments