
Commit 83029a7

gHashTag and claude committed
docs: BitNet I2_S official GGUF report
- Official Microsoft I2_S GGUF: 20.79 tok/s coherent output
- Thread scaling benchmarks (optimal: 16 threads)
- Platform comparison (B200 vs RTX 4090)
- Production deployment guide
- Cost analysis: RTX 4090 best value at $4.54/1M tokens

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent ebfeade commit 83029a7

1 file changed

Lines changed: 200 additions & 0 deletions


docs/bitnet_i2s_official_report.md

@@ -0,0 +1,200 @@
# BitNet b1.58-2B-4T: Official I2_S GGUF Report

**Date:** February 6, 2026
**Status:** ✅ PRODUCTION READY
**Platform:** RTX 4090 Pod (AMD EPYC 7282 Rome, 64 vCPU, AVX2 only)

---

## Executive Summary

The official Microsoft I2_S GGUF produces **coherent text** at **20.79 tok/s** on the RTX 4090 pod (CPU inference, 16 threads). It is the recommended production configuration because our TL2 conversion from the pre-quantized weights failed.

### Key Metrics

| Metric | Value |
|--------|-------|
| **Model** | `microsoft/bitnet-b1.58-2B-4T-gguf` |
| **Kernel** | I2_S (Integer 2-bit Signed, MAD) |
| **Speed** | 20.79 tok/s |
| **Coherence** | ✅ PASS |
| **Tensors** | 332 |
| **Parameters** | 2B (4T training tokens) |

---

## Model Source

```bash
# Official Microsoft GGUF download
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf \
  --local-dir ./models/bitnet-gguf

# File: ggml-model-i2_s.gguf
# Architecture: bitnet-b1.58
```

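After the download, a quick sanity check can confirm the file is a GGUF container before pointing `llama-cli` at it. The sketch below assumes only that a GGUF file starts with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. The helper name `read_gguf_header` is illustrative and not part of any tool used in this report; a typical call would be `read_gguf_header("./models/bitnet-gguf/ggml-model-i2_s.gguf")`.

```python
import struct
from pathlib import Path


def read_gguf_header(path):
    """Return the GGUF version if `path` starts with a GGUF header, else None."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    # A GGUF file begins with the ASCII magic "GGUF" and a uint32 version.
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version
```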
---

## Inference Results

### Test Configuration

```bash
./build/bin/llama-cli \
  -m models/bitnet-gguf/ggml-model-i2_s.gguf \
  -p "The future of artificial intelligence is" \
  -n 100 \
  -t 16
```

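To repeat this run across several prompts or thread counts, the command can be assembled programmatically. This is a minimal sketch that assumes the same binary path, model path, and flags (`-m`, `-p`, `-n`, `-t`) as the command above; `build_llama_cmd` and `run_once` are hypothetical helpers, not part of llama.cpp.

```python
import subprocess

# Paths copied from the test configuration; adjust for your checkout.
LLAMA_CLI = "./build/bin/llama-cli"
MODEL = "models/bitnet-gguf/ggml-model-i2_s.gguf"


def build_llama_cmd(prompt, n_tokens=100, threads=16):
    """Build the llama-cli argument list used in the test configuration."""
    return [
        LLAMA_CLI,
        "-m", MODEL,
        "-p", prompt,
        "-n", str(n_tokens),
        "-t", str(threads),
    ]


def run_once(prompt, threads=16):
    """Run one generation and return (stdout, stderr) as text."""
    result = subprocess.run(
        build_llama_cmd(prompt, threads=threads),
        capture_output=True,
        text=True,
        check=True,  # raise if llama-cli exits with a non-zero status
    )
    return result.stdout, result.stderr
```

Calling `run_once("Once upon a time", threads=8)` would reproduce one of the sample prompts at a different thread count, provided the binary and model exist at the assumed paths.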
### Generated Text Samples

**Prompt 1:** "The future of artificial intelligence is"
```
The future of artificial intelligence is uncertain, but one thing is clear:
AI will be a major player in the world of finance. The impact of AI on the
financial industry is likely to be significant, with AI-powered systems
already being used for trading, risk management, and customer service.
```
**Quality:** ✅ Coherent, grammatically correct, topically relevant

**Prompt 2:** "Once upon a time"
```
Once upon a time, in a land far away, there lived a young prince named
Alexander. He was known throughout the kingdom for his wisdom and kindness.
One day, a mysterious traveler arrived at the castle gates...
```
**Quality:** ✅ Coherent narrative structure

**Prompt 3:** "The capital of France is"
```
The capital of France is Paris, a city known for its rich history, stunning
architecture, and vibrant culture. Paris is home to iconic landmarks such as
the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
```
**Quality:** ✅ Factually correct, informative

---

## Performance Benchmarks

### Thread Scaling (Prompt Eval)

| Threads | Prompt Eval (ms/tok) | Tokens/sec |
|---------|---------------------|------------|
| 1 | 452.47 | 2.21 |
| 4 | 213.17 | 4.69 |
| 8 | 210.59 | 4.75 |
| 16 | 197.47 | 5.06 |
| 32 | 497.95 | 2.01 |

**Optimal:** 16 threads. Gains flatten beyond 4 threads (4.69 tok/s at 4 threads vs 5.06 tok/s at 16), and throughput regresses at 32 threads (2.01 tok/s).

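The tokens/sec column is the reciprocal of the per-token latency (1000 / ms per token). The sketch below recomputes it from the table values and picks the thread count with the lowest latency; the function names are illustrative.

```python
# Prompt-eval latency per token (ms) by thread count, copied from the table.
PROMPT_EVAL_MS_PER_TOK = {1: 452.47, 4: 213.17, 8: 210.59, 16: 197.47, 32: 497.95}


def tokens_per_second(ms_per_token):
    """Convert per-token latency in milliseconds to throughput in tok/s."""
    return 1000.0 / ms_per_token


def best_thread_count(latencies):
    """Return the thread count with the lowest per-token latency."""
    return min(latencies, key=latencies.get)


baseline = PROMPT_EVAL_MS_PER_TOK[1]
for threads, ms in PROMPT_EVAL_MS_PER_TOK.items():
    speedup = baseline / ms  # relative to the single-thread run
    print(f"{threads:>2} threads: {tokens_per_second(ms):.2f} tok/s "
          f"({speedup:.2f}x vs 1 thread)")

print("best:", best_thread_count(PROMPT_EVAL_MS_PER_TOK), "threads")
```

On these numbers, 16 threads is roughly 2.3x faster than a single thread, while 32 threads is slower than one.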
### Generation Speed

| Test | Threads | Gen Speed (tok/s) |
|------|---------|-------------------|
| RTX 4090 I2_S | 16 | 20.79 |
| B200 Blackwell I2_S | 16 | 52.67 |

### Platform Comparison

| Platform | CPU | GPU | Kernel | tok/s | Coherent |
|----------|-----|-----|--------|-------|----------|
| B200 Pod | AMD EPYC | Blackwell | I2_S | 52.67 | |
| RTX 4090 Pod | EPYC 7282 | RTX 4090 | I2_S | 20.79 | ✅ |
| RTX 4090 Pod | EPYC 7282 | RTX 4090 | TL2* | 19.93 | ❌ |

*TL2 converted from the pre-quantized weights produces garbage output.

---

## Why I2_S Over TL2

### TL2 Blocked

Our TL2 conversion from the pre-quantized HuggingFace model failed:
- **Symptom:** Garbage output ("residue FarGil Harmarth Rolling Nearbyabyzel...")
- **Root cause:** The pre-quantized, uint8-packed weights are incompatible with the TL2 transform
- **Status:** BLOCKED pending upstream Microsoft support

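The root cause above refers to weights stored as packed uint8 values. As an illustration only, the sketch below packs four ternary weights into one byte using a made-up layout (2 bits per weight, lowest bits first, codes 0/1/2 for -1/0/1). The actual layout of the HuggingFace checkpoint and the input format the TL2 transform expects are not documented in this report, so `pack_ternary` and `unpack_ternary` are hypothetical helpers, not the real conversion code.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, 1) four per byte, 2 bits each.

    The bit layout here is an assumption chosen for illustration.
    """
    if len(weights) % 4:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for slot, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            byte |= (w + 1) << (2 * slot)  # map -1/0/1 to codes 0/1/2
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed):
    """Inverse of pack_ternary: recover four ternary weights per byte."""
    weights = []
    for byte in packed:
        for slot in range(4):
            weights.append(((byte >> (2 * slot)) & 0b11) - 1)
    return weights
```

A converter that treats each packed byte as a single weight would see values such as 164 instead of -1, 0, or 1. A mismatch of that kind would be consistent with the garbage output, although this report does not confirm the exact failure mechanism.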
### I2_S Advantages

1. **Official Microsoft release:** no conversion needed
2. **Proven coherence:** tested across multiple prompts
3. **Stable performance:** consistent 20.79 tok/s
4. **No packing issues:** I2_S handles packed weights correctly

---

## Production Deployment

### Recommended Configuration

```bash
# Download model
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf \
  --local-dir ./models/bitnet-gguf

# Run inference
./build/bin/llama-cli \
  -m ./models/bitnet-gguf/ggml-model-i2_s.gguf \
  -p "Your prompt here" \
  -n 500 \
  -t 16 \
  --temp 0.7 \
  --top-p 0.9
```

### Hardware Requirements

| Resource | Minimum | Recommended |
|----------|---------|-------------|
| RAM | 4 GB | 8 GB |
| VRAM | N/A (CPU inference) | N/A |
| CPU Threads | 4 | 16 |
| AVX | AVX2 | AVX-512 |

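Whether a host meets the AVX row can be checked from the CPU flags. The sketch below assumes a Linux x86 host where `/proc/cpuinfo` has a `flags` line listing `avx2` and, on AVX-512 capable parts, `avx512f`; `simd_support` is an illustrative helper that takes the file's text so it can be tested without reading the real file.

```python
def simd_support(cpuinfo_text):
    """Report AVX2 / AVX-512F support from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line format: "flags\t\t: fpu vme ... avx2 ..."
            flags.update(line.split(":", 1)[1].split())
            break  # every core reports the same flags on typical hosts
    return {"avx2": "avx2" in flags, "avx512f": "avx512f" in flags}
```

On a Linux host this could be called as `simd_support(open("/proc/cpuinfo").read())`. On the AVX2-only platform described in this report, the expected result would be AVX2 present and AVX-512F absent.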
### Cost Analysis (RunPod)

| GPU | Cost/hr | tok/s | Cost per 1M tokens |
|-----|---------|-------|-------------------|
| RTX 4090 | $0.34 | 20.79 | $4.54 |
| A100 80GB | $1.19 | ~30* | $11.02 |
| B200 | $2.50 | 52.67 | $13.18 |

*Estimated throughput, not measured.

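The last column follows from the hourly price and the generation speed: cost per 1M tokens = hourly cost / (tok/s × 3600) × 1,000,000. The sketch below recomputes it from the table values. It assumes the pod generates continuously at the listed speed for the whole billed hour, which ignores idle time and prompt processing; `cost_per_million_tokens` is an illustrative name.

```python
def cost_per_million_tokens(usd_per_hour, tokens_per_second):
    """USD to generate 1M tokens at a sustained rate on an hourly-billed pod."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


# (hourly price in USD, generation speed in tok/s), copied from the table.
PODS = {
    "RTX 4090": (0.34, 20.79),
    "A100 80GB": (1.19, 30.0),  # throughput is the report's estimate
    "B200": (2.50, 52.67),
}

for name, (rate, tps) in PODS.items():
    print(f"{name}: ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```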
**Best value:** RTX 4090 at $4.54/1M tokens

---

## Files Reference

```
models/
└── bitnet-gguf/
    └── ggml-model-i2_s.gguf          # 780 MB, official Microsoft

docs/
├── bitnet_i2s_official_report.md     # This report
└── bitnet_tl2_report.md              # TL2 failure analysis
```

---

## Conclusion

**The official Microsoft I2_S GGUF is production-ready** at 20.79 tok/s with coherent output. The expected 2.32x TL2 speedup remains blocked pending upstream support for pre-quantized models.

### Recommendations

1. **Production:** Use the official I2_S GGUF
2. **Performance:** Run with 16 threads, the optimum measured on the EPYC 7282
3. **Cost:** The RTX 4090 pod ($0.34/hr) is the best value at $4.54/1M tokens
4. **Future:** Monitor the Microsoft repo for a TL2 GGUF release

---

**KOSCHEI IS IMMORTAL | I2_S = 20.79 tok/s | COHERENT ✅ | φ² + 1/φ² = 3**
