Commit a51f95d

gHashTag and ona-agent committed
docs: add BitNet full E2E report - model loads, output needs fix
- Full 30/30 layers load successfully on L40S (503GB RAM)
- Load time: 6.5 seconds
- Inference: 2.2 tok/s
- Output: garbage (forward pass issue, not dequantization)
- Cost: ~$0.15

Next: debug transformer forward pass or use BitNet.cpp

Co-authored-by: Ona <no-reply@ona.com>
1 parent 52fe797 commit a51f95d

1 file changed

Lines changed: 138 additions & 0 deletions

docs/bitnet_full_e2e_report.md

@@ -0,0 +1,138 @@
# BitNet Full E2E Report - L40S (503GB RAM)

**Date:** February 4, 2026
**Model:** microsoft/bitnet-b1.58-2B-4T-gguf (1.2GB)
**GPU:** NVIDIA L40S (48GB VRAM, 503GB RAM)
**Status:** Model Loads Fully, Output Quality Issue

---

## Executive Summary

**All 30 layers** of the BitNet 2B model loaded successfully on an L40S host with 503GB RAM. Inference runs at about **2.2 tokens/sec**, but the output is incoherent (random tokens). The issue is likely in the forward pass implementation, not in dequantization.

---

## Load Results

### Model Loading
```
Loading model: bitnet-2b/ggml-model-i2_s.gguf

MODEL CONFIG
Vocab size: 128256
Hidden size: 2560
Intermediate: 6912
Num layers: 30
Num heads: 20
Num KV heads: 5
Head dim: 128
Context length: 4096

Loading weights...
Loading layer 1/30... ✅
Loading layer 2/30... ✅
...
Loading layer 30/30... ✅
Loaded 30 layers ✅
```
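The configuration above can be sanity-checked with a little arithmetic. The sketch below recomputes the head dimension and the grouped-query ratio from the logged values and estimates the KV cache size at full context. The 2-byte (fp16) cache element is an assumption, since the report does not state the cache dtype.

```python
# Values copied from the MODEL CONFIG log above.
hidden_size = 2560
num_layers = 30
num_heads = 20
num_kv_heads = 5
head_dim = 128
context_length = 4096

# Assumption: fp16 cache entries (2 bytes). The report does not state the dtype.
bytes_per_element = 2

# Head dim should equal hidden size divided by the number of query heads.
assert hidden_size // num_heads == head_dim

# Grouped-query attention: each KV head is shared by this many query heads.
queries_per_kv_head = num_heads // num_kv_heads

# K and V each store num_kv_heads * head_dim values per token per layer.
kv_cache_bytes = (2 * num_layers * num_kv_heads * head_dim
                  * context_length * bytes_per_element)

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"KV cache at full context: {kv_cache_bytes / 2**20:.0f} MiB")
```

With 20 query heads and 5 KV heads, an attention implementation that indexes K/V as if there were one KV head per query head would read the wrong data, so this ratio is worth confirming in the forward pass.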

### Load Profiling
| Component | Time | % |
|-----------|------|---|
| Thread pool init | 4.12 ms | 0.1% |
| Embeddings | 1417.86 ms | 21.7% |
| RoPE init | 14.26 ms | 0.2% |
| KV cache init | 0.13 ms | 0.0% |
| **Layer weights** | **5099.80 ms** | **78.0%** |
| Buffer alloc | 0.02 ms | 0.0% |
| **TOTAL** | **6536.21 ms** | 100% |

---

## Inference Results

### Performance
| Metric | Value |
|--------|-------|
| Prefill speed | 2.1-2.4 tok/s |
| Generation speed | 1.92-2.37 tok/s |
| Prefill time (36 tokens) | 14-17 seconds |
| Generation time (50 tokens) | 21-26 seconds |
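The speed ranges follow from the token counts and wall-clock times in the table. A minimal sketch of that arithmetic, using only the numbers reported here (the helper name `tokens_per_second` is illustrative):

```python
def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Throughput for a run that processed num_tokens in the given time."""
    return num_tokens / seconds

# Generation: 50 tokens in 21-26 seconds.
gen_slow = tokens_per_second(50, 26)
gen_fast = tokens_per_second(50, 21)

# Prefill: 36 tokens in 14-17 seconds.
prefill_slow = tokens_per_second(36, 17)
prefill_fast = tokens_per_second(36, 14)

print(f"generation: {gen_slow:.2f}-{gen_fast:.2f} tok/s")
print(f"prefill:    {prefill_slow:.2f}-{prefill_fast:.2f} tok/s")
```

The generation bounds match the reported range to within rounding. The prefill upper bound comes out slightly above the reported 2.4 tok/s, presumably because the times in the table are rounded to whole seconds. Prefill running at roughly the same rate as generation may indicate that the prompt is processed one token at a time, although the report does not say so.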

### Output Quality
**Status: GARBAGE** - Output is random tokens, not coherent text.

Example outputs:
```
Prompt: "Write a Python function to calculate fibonacci:"
Output: "iumardiÄĵÄĵÄĵvialerbgt.jsÃŃÄĵvialerbityReference..."

Prompt: "What is the capital of France?"
Output: "ialialialiumolentolewiseÌerciseiumernercise..."

Prompt: "Explain quantum computing in simple terms:"
Output: "iumlicer900ntntatchatchoremernitnessitness..."
```

---

## Analysis

### What Works
1. ✅ Full model loading (30/30 layers)
2. ✅ I2_S dequantization (no errors)
3. ✅ Tokenizer (128K vocab)
4. ✅ Inference runs (no crashes)
5. ✅ Memory sufficient (503GB RAM)
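Dequantization completing without errors does not by itself show that the values are right. One cheap check, sketched below, is that each dequantized tensor contains only three distinct values: -scale, 0, and +scale. The helper name `ternary_report` and the single per-tensor scale are assumptions for illustration; if the loader applies scales per row or per block, the same check would be run per group instead.

```python
def ternary_report(weights, rel_tol=1e-6):
    """Check that a flat list of dequantized weights is ternary.

    Returns (is_ternary, scale, counts), where counts maps -1, 0, +1 to the
    number of weights seen at that level before any failure.
    """
    # Assumption: one scale per tensor, so the largest magnitude is the scale.
    scale = max(abs(w) for w in weights)
    counts = {-1: 0, 0: 0, 1: 0}
    if scale == 0.0:
        counts[0] = len(weights)
        return True, 0.0, counts
    for w in weights:
        q = round(w / scale)  # nearest of -1, 0, +1 because |w| <= scale
        if abs(w - q * scale) > rel_tol * scale:
            return False, scale, counts  # value lies off the ternary grid
        counts[q] += 1
    return True, scale, counts


ok, scale, counts = ternary_report([0.5, -0.5, 0.0, 0.5, 0.0])
print(ok, scale, counts)
```

A tensor that fails this check, or one whose counts are heavily skewed toward a single level, would point back at unpacking or scale handling rather than the forward pass.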

### What Doesn't Work
1. ❌ Output quality (garbage)
2. ❌ Coherent text generation

### Likely Causes
1. **Forward pass bug** - Attention or FFN implementation may have issues
2. **Scale factor** - BitNet may need specific scale values per layer
3. **Weight layout** - Interleaved pattern may be wrong
4. **RoPE implementation** - Rotary embeddings may be incorrect
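Causes 3 and 4 often come down to a pairing convention. Rotary embeddings rotate pairs of dimensions within each head, and two layouts are in common use: adjacent pairs (0,1), (2,3), ... and split halves (i, i + d/2). Applying one convention to weights laid out for the other yields finite but meaningless attention scores. The sketch below implements both in plain Python so a forward pass can be checked against each. The function names and the default base of 10000 are assumptions; the model's actual RoPE base should be read from the GGUF metadata.

```python
import math

def rope_angles(pos, head_dim, base=10000.0):
    """Rotation angle for each of the head_dim/2 dimension pairs at position pos."""
    return [pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs: (x[0], x[1]), (x[2], x[3]), ..."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_half_split(x, pos, base=10000.0):
    """Rotate split-half pairs: (x[i], x[i + d/2])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# Same input, same position: the two conventions give different vectors.
x = [float(i + 1) for i in range(8)]
print(rope_interleaved(x, pos=3))
print(rope_half_split(x, pos=3))
```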

---

## Comparison

| Model | Load | Output |
|-------|------|--------|
| TinyLlama (Q8_0→ternary) | | Garbage |
| BitNet 2B (I2_S native) | ✅ | Garbage |
| Test model (synthetic) | | Coherent |

**Conclusion:** Both real checkpoints produce garbage regardless of quantization format, which points to the transformer implementation rather than the quantization format.

---

## Recommendations

### Option A: Debug Forward Pass
- Add logging to attention/FFN
- Compare intermediate values with reference
- Estimated: 4-8 hours
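Comparing intermediate values is easiest when both implementations dump the same checkpoints (for example, after the embedding, after each attention block, and after each FFN) for the same prompt. A minimal sketch of the comparison step follows. The checkpoint names, the `first_divergence` helper, and the 0.999 cosine threshold are all hypothetical, and the reference activations are assumed to come from a known-good implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0 if na == nb else 0.0
    return dot / (na * nb)

def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def first_divergence(ours, reference, min_cosine=0.999):
    """Walk checkpoints in forward-pass order and report the first mismatch.

    reference: ordered list of (name, values) from the known-good run.
    ours: dict mapping the same names to values from the implementation under test.
    Returns (name, cosine, max_abs_diff), or None if every checkpoint matches.
    """
    for name, ref_values in reference:
        cos = cosine_similarity(ours[name], ref_values)
        if cos < min_cosine:
            return name, cos, max_abs_diff(ours[name], ref_values)
    return None

# Toy data: the attention output has two components with flipped signs.
reference = [("embed", [1.0, 2.0, 3.0]),
             ("layer0.attn", [0.5, -1.0, 2.0]),
             ("layer0.ffn", [1.0, 1.0, 1.0])]
ours = {"embed": [1.0, 2.0, 3.0],
        "layer0.attn": [0.5, 1.0, -2.0],
        "layer0.ffn": [1.0, 1.0, 1.0]}
print(first_divergence(ours, reference))
```

The first checkpoint that falls below the threshold narrows the bug to the operation between it and the previous checkpoint.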

### Option B: Use BitNet.cpp
- Microsoft's official inference engine
- Known to produce coherent output
- Requires C++ compilation
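
### Option C: Use llama.cpp with BitNet
- llama.cpp may be able to read the I2_S format, but this should be verified first: I2_S is the format used by BitNet.cpp, which is itself built on llama.cpp
- May work out of the box if the build recognizes the tensor type

---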

## Cost
- RunPod L40S: ~$0.59/hour
- Time used: ~15 minutes
- **Cost: ~$0.15**

---

**KOSCHEI IS IMMORTAL | MODEL LOADS FULLY | φ² + 1/φ² = 3**
