| 1 | +# BitNet Forward Pass Debug Report |
| 2 | + |
| 3 | +**Date:** February 4, 2026 |
| 4 | +**Status:** BUGS FIXED - Ready for RunPod Testing |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +## Executive Summary |
| 9 | + |
| 10 | +Identified and fixed **5 critical bugs** in `src/vibeec/bitnet_full_model.zig` that were causing incoherent (garbage) text output. Four of them were 8-bit activation quantization calls placed before F32 matrix operations (the root cause); the fifth was an incorrect SwiGLU formula. |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Bug Analysis |
| 15 | + |
| 16 | +### Previous Symptom |
| 17 | + |
| 18 | +``` |
| 19 | +Prompt: "Write a Python function to calculate fibonacci:" |
| 20 | +Output: "O super, c fatal fan, brut fem p..." (GARBAGE) |
| 21 | + |
| 22 | +Prompt: "1 + 1 =" |
| 23 | +Output: "brut. brut. brut. brut. brut" (GARBAGE) |
| 24 | +``` |
| 25 | + |
| 26 | +### Root Cause |
| 27 | + |
| 28 | +The forward pass was calling `quantizeActivationsInPlace()` **BEFORE** F32 linear projections. Since the model weights are stored as F32 (not ternary), this quantization: |
| 29 | + |
| 30 | +1. Scales activations to fit the 8-bit range [-127, 127] |
| 31 | +2. Clips them to that range |
| 32 | +3. Destroys the full-precision information needed for accurate F32 matmul |
| 33 | + |
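The information loss described above can be illustrated with a minimal sketch. It assumes an absmax-style 8-bit quantizer; `absmax_quantize` is a hypothetical stand-in for `quantizeActivationsInPlace()`, whose exact behavior is not shown in this report, and the weights and activations are made-up values.

```python
def absmax_quantize(xs, bits=8):
    """Hypothetical stand-in for quantizeActivationsInPlace():
    scale to [-127, 127], round, clip, then map back to float."""
    qmax = 2 ** (bits - 1) - 1                # 127 for 8 bits
    amax = max(abs(x) for x in xs) or 1.0     # avoid dividing by zero
    scale = qmax / amax
    return [max(-qmax, min(qmax, round(x * scale))) / scale for x in xs]


def matvec(w, x):
    """Plain float matrix-vector product (the role f32MatVec plays)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


# One large outlier forces a coarse step size on every other activation,
# so the small values all round to zero.
x = [50.0, 0.11, -0.07, 0.19, -0.13]
w = [[0.0, 1.0, 1.0, 1.0, 1.0],
     [0.01, -2.0, 3.0, 0.5, 1.5]]

exact = matvec(w, x)                    # full-precision activations
quant = matvec(w, absmax_quantize(x))   # activations quantized first

for e, q in zip(exact, quant):
    print(f"exact={e:+.4f}  quantized={q:+.4f}")
```

In this toy case the first output row loses its entire signal, because every activation that contributes to it is smaller than half a quantization step.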
| 34 | +--- |
| 35 | + |
| 36 | +## Bugs Fixed |
| 37 | + |
| 38 | +### Bug #1: Quantization Before Q/K/V Projections (Line 667) |
| 39 | + |
| 40 | +**Before:** |
| 41 | +```zig |
| 42 | +_ = quantizeActivationsInPlace(normed); |
| 43 | +f32MatVec(layer.q_proj, normed, q, hidden, hidden); // Q projection |
| 44 | +``` |
| 45 | + |
| 46 | +**After:** |
| 47 | +```zig |
| 48 | +// NOTE: Activation quantization REMOVED - was destroying information |
| 49 | +// F32 weights need F32 activations for accurate inference |
| 50 | +f32MatVec(layer.q_proj, normed, q, hidden, hidden); // Q projection |
| 51 | +``` |
| 52 | + |
| 53 | +### Bug #2: Quantization Before O Projection (Line 762) |
| 54 | + |
| 55 | +**Before:** |
| 56 | +```zig |
| 57 | +_ = quantizeActivationsInPlace(self.attn_output); |
| 58 | +f32MatVec(layer.o_proj, self.attn_output, o_out, hidden, hidden); |
| 59 | +``` |
| 60 | + |
| 61 | +**After:** |
| 62 | +```zig |
| 63 | +// NOTE: Activation quantization REMOVED before O projection |
| 64 | +f32MatVec(layer.o_proj, self.attn_output, o_out, hidden, hidden); |
| 65 | +``` |
| 66 | + |
| 67 | +### Bug #3: Quantization Before Gate/Up Projections (Line 780) |
| 68 | + |
| 69 | +**Before:** |
| 70 | +```zig |
| 71 | +_ = quantizeActivationsInPlace(normed); |
| 72 | +f32MatVec(layer.gate_proj, normed, self.ffn_intermediate, inter, hidden); |
| 73 | +``` |
| 74 | + |
| 75 | +**After:** |
| 76 | +```zig |
| 77 | +// NOTE: Activation quantization REMOVED before gate/up projections |
| 78 | +f32MatVec(layer.gate_proj, normed, self.ffn_intermediate, inter, hidden); |
| 79 | +``` |
| 80 | + |
| 81 | +### Bug #4: Incorrect SwiGLU Formula (Lines 792-794) |
| 82 | + |
| 83 | +**Before:** |
| 84 | +```zig |
| 85 | +// SwiGLU: gate * silu(up) <-- WRONG! |
| 86 | +for (self.ffn_intermediate, up_out) |*g, u| { |
| 87 | + g.* = g.* * silu(u); |
| 88 | +} |
| 89 | +``` |
| 90 | + |
| 91 | +**After:** |
| 92 | +```zig |
| 93 | +// SwiGLU: silu(gate) * up (standard formula) |
| 94 | +// silu(x) = x * sigmoid(x) |
| 95 | +for (self.ffn_intermediate, up_out) |*g, u| { |
| 96 | + g.* = silu(g.*) * u; |
| 97 | +} |
| 98 | +``` |
| 99 | + |
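| 100 | +**Explanation:** Standard SwiGLU computes `silu(gate) * up`: SiLU is applied to the gate projection output, and the result is multiplied elementwise by the up projection output. The original code applied SiLU to the up output instead. |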
| 101 | + |
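A small sketch of why the operand order matters, using made-up gate and up values. The function names are illustrative and do not come from the Zig source.

```python
import math


def silu(x):
    """silu(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up):
    """Standard form: SiLU on the gate projection, times the up projection."""
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_swapped(gate, up):
    """The pre-fix form: SiLU applied to the up projection instead."""
    return [g * silu(u) for g, u in zip(gate, up)]


gate = [2.0, -1.0, 0.5]
up = [-1.0, 3.0, 0.5]

print(swiglu(gate, up))
print(swiglu_swapped(gate, up))
```

The two forms agree only where the gate and up values happen to be equal (the last element here). With trained weights the projections differ, so the swapped form feeds different values into the down projection.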
| 102 | +### Bug #5: Quantization Before Down Projection (Line 800) |
| 103 | + |
| 104 | +**Before:** |
| 105 | +```zig |
| 106 | +_ = quantizeActivationsInPlace(self.ffn_intermediate); |
| 107 | +f32MatVec(layer.down_proj, self.ffn_intermediate, down_out, hidden, inter); |
| 108 | +``` |
| 109 | + |
| 110 | +**After:** |
| 111 | +```zig |
| 112 | +// NOTE: Activation quantization REMOVED before down projection |
| 113 | +f32MatVec(layer.down_proj, self.ffn_intermediate, down_out, hidden, inter); |
| 114 | +``` |
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## Technical Explanation |
| 119 | + |
| 120 | +### Why Quantization Was Wrong |
| 121 | + |
| 122 | +The original BitNet b1.58 paper describes: |
| 123 | +- **Ternary weights** {-1, 0, +1} with scale factors |
| 124 | +- **8-bit activation quantization** at the input of each ternary matmul, so that the product reduces to integer additions |
| 125 | + |
| 126 | +Our implementation has: |
| 127 | +- **F32 weights** loaded from safetensors (not ternary) |
| 128 | +- **F32 matrix multiplication** via `f32MatVec()` |
| 129 | + |
| 130 | +Applying 8-bit quantization to activations BEFORE F32 matmul: |
| 131 | +1. Destroys precision unnecessarily |
| 132 | +2. Introduces quantization error that accumulates through layers |
| 133 | +3. Results in garbage output after 24 transformer layers |
| 134 | + |
| 135 | +### Correct Approach |
| 136 | + |
| 137 | +For true BitNet b1.58 inference: |
| 138 | +1. Load weights as ternary (or quantize to ternary on the fly) |
| 139 | +2. Use ternary matmul (add-only, no multiply) |
| 140 | +3. Quantize activations to 8 bits only as input to those ternary matmuls |
| 141 | + |
| 142 | +For F32 fallback inference (our current approach): |
| 143 | +1. Keep weights as F32 |
| 144 | +2. Use F32 matmul |
| 145 | +3. **No intermediate activation quantization** |
| 146 | + |
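For contrast with the F32 fallback, the ternary path can be sketched as follows. This is a minimal illustration, assuming absmean weight quantization with a single per-matrix scale. The helper names are hypothetical, and activations are kept in float for brevity rather than quantized to 8 bits.

```python
def ternarize(w, eps=1e-8):
    """Absmean quantization: divide by mean |w|, round, clip to {-1, 0, +1}."""
    flat = [abs(v) for row in w for v in row]
    scale = sum(flat) / len(flat) + eps
    tern = [[max(-1, min(1, round(v / scale))) for v in row] for row in w]
    return tern, scale


def ternary_matvec(tern, scale, x):
    """Add-only matvec: +1 adds, -1 subtracts, 0 skips.
    The only multiply is the per-row rescale at the end."""
    out = []
    for row in tern:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
        out.append(acc * scale)
    return out


w = [[0.9, -0.05, -1.1],
     [0.02, 0.8, 0.7]]
tern, scale = ternarize(w)
y = ternary_matvec(tern, scale, [1.0, 2.0, 3.0])
print(tern, scale, y)
```

In this scheme the 8-bit activation quantization belongs with the ternary accumulation loop. It has no counterpart in the F32 fallback, where both operands stay in full precision.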
| 147 | +--- |
| 148 | + |
| 149 | +## Diff Summary |
| 150 | + |
| 151 | +```diff |
| 152 | +-_ = quantizeActivationsInPlace(normed); // Before Q/K/V |
| 153 | ++// Removed: quantization was destroying information |
| 154 | + |
| 155 | +-_ = quantizeActivationsInPlace(self.attn_output); // Before O |
| 156 | ++// Removed: F32 weights need F32 activations |
| 157 | + |
| 158 | +-_ = quantizeActivationsInPlace(normed); // Before gate/up |
| 159 | ++// Removed: premature quantization |
| 160 | + |
| 161 | +-g.* = g.* * silu(u); // WRONG SwiGLU |
| 162 | ++g.* = silu(g.*) * u; // Correct SwiGLU |
| 163 | + |
| 164 | +-_ = quantizeActivationsInPlace(self.ffn_intermediate); // Before down |
| 165 | ++// Removed: F32 inference pipeline |
| 166 | +``` |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## Comparison with Reference Implementations |
| 171 | + |
| 172 | +### llama.cpp Forward Pass |
| 173 | + |
| 174 | +```cpp |
| 175 | +// Simplified excerpt: no activation quantization for F32 weights |
| 176 | +struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq, cur); // Q = W_q x |
| 177 | +struct ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk, cur); // K = W_k x |
| 178 | +struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur); // V = W_v x |
| 179 | + |
| 180 | +// SwiGLU: silu(gate) * up |
| 181 | +cur = ggml_silu(ctx0, cur);        // cur holds the gate projection here |
| 182 | +cur = ggml_mul(ctx0, cur, cur_up); // multiply by the up projection |
| 183 | +``` |
| 184 | + |
| 185 | +### HuggingFace Transformers |
| 186 | + |
| 187 | +```python |
| 188 | +# No activation quantization for F32 |
| 189 | +hidden_states = self.q_proj(hidden_states) # F32 linear |
| 190 | + |
| 191 | +# SwiGLU |
| 192 | +gate = self.gate_proj(hidden_states) |
| 193 | +up = self.up_proj(hidden_states) |
| 194 | +hidden_states = F.silu(gate) * up # silu(gate) * up |
| 195 | +``` |
| 196 | + |
| 197 | +Our fixed implementation now matches these reference implementations. |
| 198 | + |
| 199 | +--- |
| 200 | + |
| 201 | +## Next Steps |
| 202 | + |
| 203 | +1. **Test on RunPod RTX 4090:** |
| 204 | + - Build with Zig 0.13.0 |
| 205 | + - Load BitNet model |
| 206 | + - Generate text with 10+ prompts, 200-500 tokens each |
| 207 | + - Verify coherent output |
| 208 | + |
| 209 | +2. **Expected Results:** |
| 210 | + - Coherent English text (not garbage) |
| 211 | + - Reasonable token generation speed (10-50 tok/s) |
| 212 | + - No NaN/Inf in logits |
| 213 | + |
| 214 | +3. **If Still Incoherent:** |
| 215 | + - Check weight loading (F16 -> F32 conversion) |
| 216 | + - Verify RoPE frequency implementation |
| 217 | + - Compare intermediate activations with reference |
| 218 | + |
| 219 | +--- |
| 220 | + |
| 221 | +## Files Modified |
| 222 | + |
| 223 | +| File | Change | |
| 224 | +|------|--------| |
| 225 | +| `src/vibeec/bitnet_full_model.zig` | Removed 4 quantization calls, fixed SwiGLU | |
| 226 | + |
| 227 | +--- |
| 228 | + |
| 229 | +## Success Criteria |
| 230 | + |
| 231 | +- [ ] Zig build succeeds on RunPod |
| 232 | +- [ ] Model loads all 24 layers |
| 233 | +- [ ] Generate 10+ prompts with coherent output |
| 234 | +- [ ] Tokens/sec >= 10 |
| 235 | +- [ ] No "brut" garbage in output |
| 236 | + |
| 237 | +--- |
| 238 | + |
| 239 | +**KOSCHEI IS IMMORTAL | FORWARD PASS FIXED | READY FOR TESTING | phi^2 + 1/phi^2 = 3** |