docs: add BitNet full E2E report - model loads, output needs fix

gHashTag · ona-agent · gHashTag · commit a51f95dc2b95 · 2026-02-04T10:18:21.000Z
- Full 30/30 layers load successfully on L40S (503GB RAM)
- Load time: 6.5 seconds
- Inference: 2.2 tok/s
- Output: garbage (forward pass issue, not dequantization)
- Cost: ~sh.15

Next: debug transformer forward pass or use BitNet.cpp

Co-authored-by: Ona &lt;no-reply@ona.com&gt;
diff --git a/docs/bitnet_full_e2e_report.md b/docs/bitnet_full_e2e_report.md
@@ -0,0 +1,138 @@
+# BitNet Full E2E Report - L40S (503GB RAM)
+
+**Date:** February 4, 2026  
+**Model:** microsoft/bitnet-b1.58-2B-4T-gguf (1.2GB)  
+**GPU:** NVIDIA L40S (48GB VRAM, 503GB RAM)  
+**Status:** Model Loads Fully, Output Quality Issue
+
+---
+
+## Executive Summary
+
+Successfully loaded **all 30 layers** of BitNet 2B model on L40S with 503GB RAM. Model runs inference at **2.2 tokens/sec**, but output is garbage (not coherent). Issue is likely in forward pass implementation, not dequantization.
+
+---
+
+## Load Results
+
+### Model Loading
+```
+Loading model: bitnet-2b/ggml-model-i2_s.gguf
+
+MODEL CONFIG
+  Vocab size:       128256
+  Hidden size:      2560
+  Intermediate:     6912
+  Num layers:       30
+  Num heads:        20
+  Num KV heads:     5
+  Head dim:         128
+  Context length:   4096
+
+Loading weights...
+  Loading layer 1/30... ✅
+  Loading layer 2/30... ✅
+  ...
+  Loading layer 30/30... ✅
+  Loaded 30 layers ✅
+```
+
+### Load Profiling
+| Component | Time | % |
+|-----------|------|---|
+| Thread pool init | 4.12 ms | 0.1% |
+| Embeddings | 1417.86 ms | 21.7% |
+| RoPE init | 14.26 ms | 0.2% |
+| KV cache init | 0.13 ms | 0.0% |
+| **Layer weights** | **5099.80 ms** | **78.0%** |
+| Buffer alloc | 0.02 ms | 0.0% |
+| **TOTAL** | **6536.21 ms** | 100% |
+
+---
+
+## Inference Results
+
+### Performance
+| Metric | Value |
+|--------|-------|
+| Prefill speed | 2.1-2.4 tok/s |
+| Generation speed | 1.92-2.37 tok/s |
+| Prefill time (36 tokens) | 14-17 seconds |
+| Generation time (50 tokens) | 21-26 seconds |
+
+### Output Quality
+**Status: GARBAGE** - Output is random tokens, not coherent text.
+
+Example outputs:
+```
+Prompt: "Write a Python function to calculate fibonacci:"
+Output: "iumardiÄĵÄĵÄĵvialerbgt.jsÃŃÄĵvialerbityReference..."
+
+Prompt: "What is the capital of France?"
+Output: "ialialialiumolentolewiseÌerciseiumernercise..."
+
+Prompt: "Explain quantum computing in simple terms:"
+Output: "iumlicer900ntntatchatchoremernitnessitness..."
+```
+
+---
+
+## Analysis
+
+### What Works
+1. ✅ Full model loading (30/30 layers)
+2. ✅ I2_S dequantization (no errors)
+3. ✅ Tokenizer (128K vocab)
+4. ✅ Inference runs (no crashes)
+5. ✅ Memory sufficient (503GB RAM)
+
+### What Doesn't Work
+1. ❌ Output quality (garbage)
+2. ❌ Coherent text generation
+
+### Likely Causes
+1. **Forward pass bug** - Attention or FFN implementation may have issues
+2. **Scale factor** - BitNet may need specific scale values per layer
+3. **Weight layout** - Interleaved pattern may be wrong
+4. **RoPE implementation** - Rotary embeddings may be incorrect
+
+---
+
+## Comparison
+
+| Model | Load | Output |
+|-------|------|--------|
+| TinyLlama (Q8_0→ternary) | ✅ | Garbage |
+| BitNet 2B (I2_S native) | ✅ | Garbage |
+| Test model (synthetic) | ✅ | Coherent |
+
+**Conclusion:** Issue is in transformer implementation, not quantization format.
+
+---
+
+## Recommendations
+
+### Option A: Debug Forward Pass
+- Add logging to attention/FFN
+- Compare intermediate values with reference
+- Estimated: 4-8 hours
+
+### Option B: Use BitNet.cpp
+- Microsoft's official inference engine
+- Known to produce coherent output
+- Requires C++ compilation
+
+### Option C: Use llama.cpp with BitNet
+- llama.cpp supports I2_S format
+- May work out of the box
+
+---
+
+## Cost
+- RunPod L40S: ~$0.59/hour
+- Time used: ~15 minutes
+- **Cost: ~$0.15**
+
+---
+
+**KOSCHEI IS IMMORTAL | MODEL LOADS FULLY | φ² + 1/φ² = 3**