feat(gen): implement speculative decoding (OPT-S01)

gHashTag · ona-agent · gHashTag · commit 49123d77a68d · 2026-02-02T10:29:35.000Z
- Add SpeculativeDecoder with self-speculation (early exit)
- Add forwardDraft for fast draft generation using first N layers
- Implement acceptance/rejection sampling with adjusted distribution
- Expected speedup: 2-3x for generation throughput
- Output distribution is mathematically equivalent to sampling from the full model
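
A minimal sketch of the accept/reject rule and the adjusted distribution listed above, for illustration only. The names `acceptDraft` and `residual` are hypothetical and are not taken from `tri_inference.zig`. The uniform sample `r` is passed in by the caller so the sketch needs no RNG, and the fallback for identical distributions is an assumption.

```zig
const std = @import("std");

/// Accept a draft token with probability min(1, p_target / p_draft).
/// `r` is a uniform sample in [0, 1) supplied by the caller (assumption).
fn acceptDraft(p_target: f32, p_draft: f32, r: f32) bool {
    // Guard against division by zero; a sampled draft token should have p_draft > 0.
    if (p_draft <= 0.0) return false;
    return r < @min(1.0, p_target / p_draft);
}

/// On rejection, write norm(max(0, target - draft)) into `out`.
/// All three slices must have the same length (one entry per vocabulary item).
fn residual(target: []const f32, draft: []const f32, out: []f32) void {
    var total: f32 = 0.0;
    for (target, draft, out) |t, d, *o| {
        o.* = @max(0.0, t - d);
        total += o.*;
    }
    if (total > 0.0) {
        for (out) |*o| o.* /= total;
    } else {
        // Assumption: if both distributions are identical, fall back to the target.
        @memcpy(out, target);
    }
}

pub fn main() void {
    const target = [_]f32{ 0.5, 0.3, 0.2 };
    const draft = [_]f32{ 0.2, 0.5, 0.3 };
    var adjusted: [3]f32 = undefined;

    // The draft proposed token 1; r would normally come from a uniform RNG.
    if (!acceptDraft(target[1], draft[1], 0.75)) {
        residual(&target, &draft, &adjusted);
        std.debug.print("rejected; adjusted distribution: {any}\n", .{adjusted});
    }
}
```

Sampling the replacement token from the normalized positive part of (target - draft) is what keeps the combined accept/reject procedure distributed like the target model.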

Co-authored-by: Ona <no-reply@ona.com>
diff --git a/docs/DISCOVERIES.md b/docs/DISCOVERIES.md
@@ -81,6 +81,7 @@ Where:
 | OPT-T07 | Batch Ternary MatMul | N/A | 2.28x | ✅ Implemented |
 | OPT-M01 | Memory-Mapped Loading | N/A | 30x load | ✅ Implemented |
 | OPT-C01 | KV Cache Compression | 5-16x | 1x | ✅ Implemented |
+| OPT-S01 | Speculative Decoding | N/A | 2-3x gen | ✅ Implemented |
 
 ### Business Value
 
@@ -507,6 +508,67 @@ var cache = try RingKVCache.init(allocator, num_heads, head_dim, 2048, config);
 kv_cache.streamingAttention(output, query, &cache, head_idx, scores, scale);
 ```
 
+### Speculative Decoding (OPT-S01)
+
+**Status**: ✅ Implemented
+
+| Component | File | Description |
+|-----------|------|-------------|
+| SpeculativeConfig | `tri_inference.zig` | Configuration for speculation |
+| SpeculativeDecoder | `tri_inference.zig` | Main speculative decoder |
+| forwardDraft | `tri_inference.zig` | Early-exit forward for draft |
+| verifyAndAccept | `tri_inference.zig` | Token verification logic |
+
+**Algorithm:**
+```
+┌─────────────────────────────────────────────────────────────┐
+│              SPECULATIVE DECODING                           │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│  1. DRAFT: Generate K tokens with early-exit model          │
+│     draft_tokens = [t1, t2, t3, t4]  (fast, ~10ms)          │
+│                                                             │
+│  2. VERIFY: Run full model on each token                    │
+│     For each draft token:                                   │
+│       - Compute target probability                          │
+│       - Accept with prob min(1, p_target/p_draft)           │
+│       - On reject: sample from adjusted distribution        │
+│                                                             │
+│  3. BONUS: If all K accepted, sample K+1 from target        │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Self-Speculation (Early Exit):**
+- Uses first N layers as draft model (default: 4 layers)
+- No separate draft model needed
+- Draft is ~4-8x faster than full model
+
+**Expected Speedup:**
+```
+Speedup = K / (1 + (1-α)K)
+where α = acceptance rate, K = speculation length
+
+For α=0.8, K=4: Speedup = 4 / 1.8 = 2.2x
+For α=0.9, K=4: Speedup = 4 / 1.4 = 2.9x
+```
+
+**Usage:**
+```zig
+const config = SpeculativeConfig{
+    .speculation_length = 4,
+    .draft_layers = 4,
+    .temperature = 1.0,
+};
+
+var decoder = try SpeculativeDecoder.init(allocator, model, config);
+defer decoder.deinit();
+
+const result = try decoder.generate(start_token, 0, 100);
+std.debug.print("Generated {d} tokens, acceptance rate: {d:.1}%\n",
+    .{ result.tokens.len, result.acceptance_rate * 100 });
+```
+
 ### Batch Processing (INF-004)
 
 **Status**: ✅ Implemented
diff --git a/specs/tri/speculative_decoding.vibee b/specs/tri/speculative_decoding.vibee
@@ -0,0 +1,88 @@
+# speculative_decoding.vibee
+# Speculative Decoding for faster autoregressive generation
+# Generate multiple tokens per target model forward pass
+
+name: speculative_decoding
+version: "1.0.0"
+language: zig
+module: speculative_decoding
+
+types:
+  SpeculativeConfig:
+    description: "Configuration for speculative decoding"
+    fields:
+      speculation_length: Int    # K: number of tokens to speculate
+      temperature: Float         # Sampling temperature
+      use_tree_attention: Bool   # Enable tree-based speculation
+
+  DraftResult:
+    description: "Result from draft model speculation"
+    fields:
+      tokens: List<Int>          # K speculated tokens
+      probs: List<Float>         # Draft probabilities for each token
+
+  VerificationResult:
+    description: "Result from target model verification"
+    fields:
+      accepted_count: Int        # Number of accepted tokens
+      accepted_tokens: List<Int> # Accepted token sequence
+      next_token: Int            # Token sampled after rejection, or the bonus token if all K are accepted
+      acceptance_rate: Float     # Running acceptance rate
+
+behaviors:
+  - name: draft_speculate
+    given: draft model, input token, position, K
+    when: generating K candidate tokens
+    then: returns DraftResult with tokens and probabilities
+
+  - name: target_verify
+    given: target model, input sequence, draft tokens
+    when: verifying draft tokens in parallel
+    then: returns logits for all K+1 positions
+
+  - name: speculative_sample
+    given: draft probs, target probs, draft token
+    when: deciding to accept or reject
+    then: accepts with prob min(1, p_target/p_draft), else samples correction
+
+  - name: speculative_generate
+    given: target model, draft model, prompt, max_tokens
+    when: generating with speculation
+    then: returns generated tokens with speedup
+
+# Algorithm:
+#
+# ┌─────────────────────────────────────────────────────────────┐
+# │              SPECULATIVE DECODING                           │
+# ├─────────────────────────────────────────────────────────────┤
+# │                                                             │
+# │  1. DRAFT: Generate K tokens with small model               │
+# │     draft_tokens = [t1, t2, t3, t4]  (fast, ~10ms)          │
+# │     draft_probs  = [p1, p2, p3, p4]                         │
+# │                                                             │
+# │  2. VERIFY: Run target model on all K tokens (parallel)     │
+# │     target_logits = target.forward([t0, t1, t2, t3, t4])    │
+# │     (single forward pass, ~100ms)                           │
+# │                                                             │
+# │  3. ACCEPT/REJECT: For each position i:                     │
+# │     r = uniform(0, 1)                                       │
+# │     if r < min(1, target_prob[i] / draft_prob[i]):          │
+# │       ACCEPT token i                                        │
+# │     else:                                                   │
+# │       REJECT: sample from norm(max(0, target - draft))      │
+# │       STOP speculation                                      │
+# │                                                             │
+# │  4. BONUS: If all K accepted, sample K+1 from target        │
+# │                                                             │
+# └─────────────────────────────────────────────────────────────┘
+#
+# Speedup Analysis:
+#   Without speculation: 1 token per forward pass
+#   With speculation (K=4, α=0.8):
+#     Expected tokens = 1 + α + α² + α³ + α⁴ = 3.36
+#     Cost = 1 target + K draft = 1 + 4 × 0.1 = 1.4 target (if draft is 10x faster)
+#     Speedup = 3.36 / 1.4 ≈ 2.4x
+#
+# Self-Speculation (no draft model):
+#   Use early exit from target model as draft
+#   Or use same model with reduced layers
diff --git a/src/vibeec/tri_inference.zig b/src/vibeec/tri_inference.zig