| 1 | +# BitNet Full E2E Report - L40S (503GB RAM) |
| 2 | + |
| 3 | +**Date:** February 4, 2026 |
| 4 | +**Model:** microsoft/bitnet-b1.58-2B-4T-gguf (1.2GB) |
| 5 | +**GPU:** NVIDIA L40S (48GB VRAM, 503GB RAM) |
| 6 | +**Status:** Model Loads Fully, Output Quality Issue |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Executive Summary |
| 11 | + |
| 12 | +Successfully loaded **all 30 layers** of the BitNet 2B model on an L40S host with 503GB RAM. Inference runs at about **2.2 tokens/sec**, but the output is incoherent. The bug is likely in the forward pass implementation rather than in dequantization. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Load Results |
| 17 | + |
| 18 | +### Model Loading |
| 19 | +``` |
| 20 | +Loading model: bitnet-2b/ggml-model-i2_s.gguf |
| 21 | + |
| 22 | +MODEL CONFIG |
| 23 | + Vocab size: 128256 |
| 24 | + Hidden size: 2560 |
| 25 | + Intermediate: 6912 |
| 26 | + Num layers: 30 |
| 27 | + Num heads: 20 |
| 28 | + Num KV heads: 5 |
| 29 | + Head dim: 128 |
| 30 | + Context length: 4096 |
| 31 | + |
| 32 | +Loading weights... |
| 33 | + Loading layer 1/30... ✅ |
| 34 | + Loading layer 2/30... ✅ |
| 35 | + ... |
| 36 | + Loading layer 30/30... ✅ |
| 37 | + Loaded 30 layers ✅ |
| 38 | +``` |
| 39 | + |
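
A quick consistency check on the logged config can rule out one class of loader bugs before debugging the forward pass. The sketch below uses only the values from the log; `check_config` and `kv_cache_bytes` are hypothetical helper names, and the 2-byte (fp16) cache element size is an assumption, not something the log states.

```python
# Values copied from the MODEL CONFIG log.
config = {
    "hidden_size": 2560,
    "num_layers": 30,
    "num_heads": 20,
    "num_kv_heads": 5,
    "head_dim": 128,
    "context_length": 4096,
}


def check_config(cfg):
    """Verify head geometry and return the GQA group size (query heads per KV head)."""
    assert cfg["num_heads"] * cfg["head_dim"] == cfg["hidden_size"]
    assert cfg["num_heads"] % cfg["num_kv_heads"] == 0
    return cfg["num_heads"] // cfg["num_kv_heads"]


def kv_cache_bytes(cfg, bytes_per_value=2):
    """Size of a full-context KV cache. bytes_per_value=2 assumes fp16 storage."""
    per_token = cfg["num_layers"] * cfg["num_kv_heads"] * cfg["head_dim"]
    return 2 * per_token * cfg["context_length"] * bytes_per_value  # 2 = K and V


group_size = check_config(config)
cache_mib = kv_cache_bytes(config) / 2**20
print(f"GQA group size: {group_size}, KV cache at full context: {cache_mib:.0f} MiB")
```

With 20 query heads and 5 KV heads, each KV head serves 4 query heads, so the attention code must map query head `h` to KV head `h // 4`. A wrong mapping there is one plausible source of incoherent output.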
| 40 | +### Load Profiling |
| 41 | +| Component | Time | % | |
| 42 | +|-----------|------|---| |
| 43 | +| Thread pool init | 4.12 ms | 0.1% | |
| 44 | +| Embeddings | 1417.86 ms | 21.7% | |
| 45 | +| RoPE init | 14.26 ms | 0.2% | |
| 46 | +| KV cache init | 0.13 ms | 0.0% | |
| 47 | +| **Layer weights** | **5099.80 ms** | **78.0%** | |
| 48 | +| Buffer alloc | 0.02 ms | 0.0% | |
| 49 | +| **TOTAL** | **6536.21 ms** | 100% | |
| 50 | + |
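
The percentage column can be re-derived from the timings. This is a small sketch using only the table values; the per-layer average assumes the 30 layers load in roughly equal time, which the table does not confirm.

```python
# Timings in milliseconds, copied from the Load Profiling table.
timings_ms = {
    "Thread pool init": 4.12,
    "Embeddings": 1417.86,
    "RoPE init": 14.26,
    "KV cache init": 0.13,
    "Layer weights": 5099.80,
    "Buffer alloc": 0.02,
}
reported_total_ms = 6536.21

component_sum_ms = sum(timings_ms.values())
shares = {name: 100 * ms / reported_total_ms for name, ms in timings_ms.items()}
per_layer_ms = timings_ms["Layer weights"] / 30  # average over the 30 layers

for name, share in shares.items():
    print(f"{name:<18} {timings_ms[name]:>9.2f} ms  {share:5.1f}%")
print(f"Sum of components: {component_sum_ms:.2f} ms (reported total: {reported_total_ms} ms)")
print(f"Average per layer: {per_layer_ms:.1f} ms")
```

The components sum to within 0.02 ms of the reported total, so the table is internally consistent and layer weights dominate load time.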
| 51 | +--- |
| 52 | + |
| 53 | +## Inference Results |
| 54 | + |
| 55 | +### Performance |
| 56 | +| Metric | Value | |
| 57 | +|--------|-------| |
| 58 | +| Prefill speed | 2.1-2.4 tok/s | |
| 59 | +| Generation speed | 1.92-2.37 tok/s | |
| 60 | +| Prefill time (36 tokens) | 14-17 seconds | |
| 61 | +| Generation time (50 tokens) | 21-26 seconds | |
| 62 | + |
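
The speed rows can be cross-checked against the time rows. The sketch below derives tokens per second from the token counts and the time ranges in the table; `tok_per_sec_range` is a hypothetical helper name. Because the times are given as whole seconds, the derived bounds only approximate the reported speeds.

```python
def tok_per_sec_range(tokens, fastest_s, slowest_s):
    """Return (min, max) throughput for a fixed token count and a time range."""
    return tokens / slowest_s, tokens / fastest_s


prefill = tok_per_sec_range(36, fastest_s=14, slowest_s=17)
generation = tok_per_sec_range(50, fastest_s=21, slowest_s=26)

print(f"Prefill:    {prefill[0]:.2f}-{prefill[1]:.2f} tok/s")
print(f"Generation: {generation[0]:.2f}-{generation[1]:.2f} tok/s")
```

The generation range matches the table closely. The prefill upper bound derived from 14 seconds comes out slightly above the reported 2.4 tok/s.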
| 63 | +### Output Quality |
| 64 | +**Status: GARBAGE** - Output is random tokens, not coherent text. |
| 65 | + |
| 66 | +Example outputs: |
| 67 | +``` |
| 68 | +Prompt: "Write a Python function to calculate fibonacci:" |
| 69 | +Output: "iumardiÄĵÄĵÄĵvialerbgt.jsÃŃÄĵvialerbityReference..." |
| 70 | + |
| 71 | +Prompt: "What is the capital of France?" |
| 72 | +Output: "ialialialiumolentolewiseÌerciseiumernercise..." |
| 73 | + |
| 74 | +Prompt: "Explain quantum computing in simple terms:" |
| 75 | +Output: "iumlicer900ntntatchatchoremernitnessitness..." |
| 76 | +``` |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Analysis |
| 81 | + |
| 82 | +### What Works |
| 83 | +1. ✅ Full model loading (30/30 layers) |
| 84 | +2. ✅ I2_S dequantization (no errors) |
| 85 | +3. ✅ Tokenizer (128K vocab) |
| 86 | +4. ✅ Inference runs (no crashes) |
| 87 | +5. ✅ Memory sufficient (503GB RAM) |
| 88 | + |
| 89 | +### What Doesn't Work |
| 90 | +1. ❌ Output quality (garbage) |
| 91 | +2. ❌ Coherent text generation |
| 92 | + |
| 93 | +### Likely Causes |
| 94 | +1. **Forward pass bug** - Attention or FFN implementation may have issues |
| 95 | +2. **Scale factor** - BitNet may need specific scale values per layer |
| 96 | +3. **Weight layout** - Interleaved pattern may be wrong |
| 97 | +4. **RoPE implementation** - Rotary embeddings may be incorrect |
| 98 | + |
| 99 | +--- |
| 100 | + |
| 101 | +## Comparison |
| 102 | + |
| 103 | +| Model | Load | Output | |
| 104 | +|-------|------|--------| |
| 105 | +| TinyLlama (Q8_0→ternary) | ✅ | Garbage | |
| 106 | +| BitNet 2B (I2_S native) | ✅ | Garbage | |
| 107 | +| Test model (synthetic) | ✅ | Coherent | |
| 108 | + |
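| 109 | +**Conclusion:** Both real models produce garbage regardless of quantization format, while the synthetic test model is coherent. The issue is therefore likely in the transformer implementation, not the quantization format. |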
| 110 | + |
| 111 | +--- |
| 112 | + |
| 113 | +## Recommendations |
| 114 | + |
| 115 | +### Option A: Debug Forward Pass |
| 116 | +- Add logging to attention/FFN |
| 117 | +- Compare intermediate values against a reference implementation, layer by layer |
| 118 | +- Estimated: 4-8 hours |
| 119 | + |
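
Option A could be sketched as follows: dump named intermediate tensors from both implementations and report the first one that diverges. This is a minimal sketch, not the project's tooling. The tensor names, the flattened-list representation, and the `min_cosine` threshold are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0 if na == nb else 0.0
    return dot / (na * nb)


def first_divergence(ours, reference, min_cosine=0.999):
    """Walk tensors in reference order; return (first diverging name, report).

    Both arguments map tensor names to flattened, non-empty lists of floats.
    """
    report = []
    first_bad = None
    for name, ref in reference.items():
        got = ours[name]
        if len(got) != len(ref):
            raise ValueError(f"{name}: size {len(got)} != reference size {len(ref)}")
        cos = cosine(got, ref)
        max_abs = max(abs(g - r) for g, r in zip(got, ref))
        report.append((name, cos, max_abs))
        if first_bad is None and cos < min_cosine:
            first_bad = name
    return first_bad, report


# Toy data: the second tensor is deliberately corrupted (sign flipped).
reference = {
    "layer0.attn_out": [0.1, -0.2, 0.3, 0.4],
    "layer0.ffn_out": [1.0, 0.5, -0.5, 0.25],
    "layer1.attn_out": [0.2, 0.1, 0.0, -0.3],
}
ours = {name: list(values) for name, values in reference.items()}
ours["layer0.ffn_out"] = [-v for v in reference["layer0.ffn_out"]]

first_bad, report = first_divergence(ours, reference)
print("first diverging tensor:", first_bad)
```

Cosine similarity is used instead of an absolute tolerance because a quantized implementation may differ from a floating-point reference in magnitude while still pointing in the same direction. In a real model every tensor after the first bad one will usually diverge too, so only the first hit is informative.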
| 120 | +### Option B: Use BitNet.cpp |
| 121 | +- Microsoft's official inference engine |
| 122 | +- Known to produce coherent output |
| 123 | +- Requires C++ compilation |
| 124 | + |
| 125 | +### Option C: Use llama.cpp with BitNet |
| 126 | +- llama.cpp may support the I2_S format, depending on the build; verify before relying on it |
| 127 | +- May work out of the box |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## Cost |
| 132 | +- RunPod L40S: ~$0.59/hour |
| 133 | +- Time used: ~15 minutes |
| 134 | +- **Cost: ~$0.15** |
| 135 | + |
| 136 | +--- |
| 137 | + |
| 138 | +**KOSCHEI IS IMMORTAL | MODEL LOADS FULLY | φ² + 1/φ² = 3** |