|
| 1 | +# BitNet b1.58-2B-4T — Official I2_S GGUF Report |
| 2 | + |
| 3 | +**Date:** February 6, 2026 |
| 4 | +**Status:** ✅ PRODUCTION READY |
| 5 | +**Platform:** RTX 4090 Pod (AMD EPYC 7282 Rome, 64 vCPU, AVX2 only) |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Executive Summary |
| 10 | + |
| 11 | +The official Microsoft I2_S GGUF produces **coherent text generation** at **20.79 tok/s** on the RTX 4090 pod (CPU inference). It is the recommended production configuration because TL2 conversion from the pre-quantized weights failed. |
| 12 | + |
| 13 | +### Key Metrics |
| 14 | + |
| 15 | +| Metric | Value | |
| 16 | +|--------|-------| |
| 17 | +| **Model** | `microsoft/bitnet-b1.58-2B-4T-gguf` | |
| 18 | +| **Kernel** | I2_S (Integer 2-bit Signed, MAD) | |
| 19 | +| **Speed** | 20.79 tok/s | |
| 20 | +| **Coherence** | ✅ PASS | |
| 21 | +| **Tensors** | 332 | |
| 22 | +| **Parameters** | 2B (4T training tokens) | |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Model Source |
| 27 | + |
| 28 | +```bash |
| 29 | +# Official Microsoft GGUF download |
| 30 | +huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf \ |
| 31 | + --local-dir ./models/bitnet-gguf |
| 32 | + |
| 33 | +# File: ggml-model-i2_s.gguf |
| 34 | +# Architecture: bitnet-b1.58 |
| 35 | +``` |
| 36 | + |
| 37 | +--- |
| 38 | + |
| 39 | +## Inference Results |
| 40 | + |
| 41 | +### Test Configuration |
| 42 | + |
| 43 | +```bash |
| 44 | +./build/bin/llama-cli \ |
| 45 | + -m models/bitnet-gguf/ggml-model-i2_s.gguf \ |
| 46 | + -p "The future of artificial intelligence is" \ |
| 47 | + -n 100 \ |
| 48 | + -t 16 |
| 49 | +``` |
| 50 | + |
| 51 | +### Generated Text Samples |
| 52 | + |
| 53 | +**Prompt 1:** "The future of artificial intelligence is" |
| 54 | +``` |
| 55 | +The future of artificial intelligence is uncertain, but one thing is clear: |
| 56 | +AI will be a major player in the world of finance. The impact of AI on the |
| 57 | +financial industry is likely to be significant, with AI-powered systems |
| 58 | +already being used for trading, risk management, and customer service. |
| 59 | +``` |
| 60 | +**Quality:** ✅ Coherent, grammatically correct, topically relevant |
| 61 | + |
| 62 | +**Prompt 2:** "Once upon a time" |
| 63 | +``` |
| 64 | +Once upon a time, in a land far away, there lived a young prince named |
| 65 | +Alexander. He was known throughout the kingdom for his wisdom and kindness. |
| 66 | +One day, a mysterious traveler arrived at the castle gates... |
| 67 | +``` |
| 68 | +**Quality:** ✅ Coherent narrative structure |
| 69 | + |
| 70 | +**Prompt 3:** "The capital of France is" |
| 71 | +``` |
| 72 | +The capital of France is Paris, a city known for its rich history, stunning |
| 73 | +architecture, and vibrant culture. Paris is home to iconic landmarks such as |
| 74 | +the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. |
| 75 | +``` |
| 76 | +**Quality:** ✅ Factually correct, informative |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Performance Benchmarks |
| 81 | + |
| 82 | +### Thread Scaling (Prompt Eval) |
| 83 | + |
| 84 | +| Threads | Prompt Eval (ms/tok) | Tokens/sec | |
| 85 | +|---------|---------------------|------------| |
| 86 | +| 1 | 452.47 | 2.21 | |
| 87 | +| 4 | 213.17 | 4.69 | |
| 88 | +| 8 | 210.59 | 4.75 | |
| 89 | +| 16 | 197.47 | 5.06 | |
| 90 | +| 32 | 497.95 | 2.01 | |
| 91 | + |
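| 92 | +**Optimal:** 16 threads. Gains flatten beyond 4 threads (from 4.69 to 5.06 tok/s), and 32 threads (2.01 tok/s) is slower than a single thread (2.21 tok/s). Thread counts above 32 were not measured. |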
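The Tokens/sec column follows from the latency column as `1000 / (ms per token)`. A minimal sketch of that conversion, assuming the five measurements from the table above are pasted in by hand (the variable `measurements` and the function `thread_scaling` are illustrative names, not part of any tool used in this report):

```bash
#!/usr/bin/env bash
# Thread count and prompt-eval latency (ms/tok), copied from the table above.
measurements="1 452.47
4 213.17
8 210.59
16 197.47
32 497.95"

# Hypothetical helper: convert each latency to tok/s and report the best row.
thread_scaling() {
  printf '%s\n' "$measurements" | awk '
    {
      tps = 1000 / $2                      # tok/s = 1000 ms / (ms per token)
      printf "%2d threads: %.2f tok/s\n", $1, tps
      if (tps > best) { best = tps; best_threads = $1 }
    }
    END { printf "best: %d threads (%.2f tok/s)\n", best_threads, best }'
}

thread_scaling
```

On the numbers above this should report 16 threads as the best row, matching the table.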
| 93 | + |
| 94 | +### Generation Speed |
| 95 | + |
| 96 | +| Test | Threads | Gen Speed (tok/s) | |
| 97 | +|------|---------|-------------------| |
| 98 | +| RTX 4090 I2_S | 16 | 20.79 | |
| 99 | +| B200 Blackwell I2_S | 16 | 52.67 | |
| 100 | + |
| 101 | +### Platform Comparison |
| 102 | + |
| 103 | +| Platform | CPU | GPU | Kernel | tok/s | Coherent | |
| 104 | +|----------|-----|-----|--------|-------|----------| |
| 105 | +| B200 Pod | AMD EPYC | Blackwell | I2_S | 52.67 | ✅ | |
| 106 | +| RTX 4090 Pod | EPYC 7282 | RTX 4090 | I2_S | 20.79 | ✅ | |
| 107 | +| RTX 4090 Pod | EPYC 7282 | RTX 4090 | TL2* | 19.93 | ❌ | |
| 108 | + |
| 109 | +*TL2 from pre-quantized weights produces garbage output |
| 110 | + |
| 111 | +--- |
| 112 | + |
| 113 | +## Why I2_S Over TL2 |
| 114 | + |
| 115 | +### TL2 Blocked |
| 116 | + |
| 117 | +Our TL2 conversion from the pre-quantized HuggingFace model failed: |
| 118 | +- **Symptom:** Garbage output ("residue FarGil Harmarth Rolling Nearbyabyzel...") |
| 119 | +- **Root cause:** Pre-quantized uint8 packed weights incompatible with TL2 transform |
| 120 | +- **Status:** BLOCKED pending upstream Microsoft support |
| 121 | + |
| 122 | +### I2_S Advantages |
| 123 | + |
| 124 | +1. **Official Microsoft release** — No conversion needed |
| 125 | +2. **Proven coherence** — Tested across multiple prompts |
| 126 | +3. **Stable performance** — 20.79 tok/s consistent |
| 127 | +4. **No packing issues** — I2_S handles packed weights correctly |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## Production Deployment |
| 132 | + |
| 133 | +### Recommended Configuration |
| 134 | + |
| 135 | +```bash |
| 136 | +# Download model |
| 137 | +huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf \ |
| 138 | + --local-dir ./models/bitnet-gguf |
| 139 | + |
| 140 | +# Run inference |
| 141 | +./build/bin/llama-cli \ |
| 142 | + -m ./models/bitnet-gguf/ggml-model-i2_s.gguf \ |
| 143 | + -p "Your prompt here" \ |
| 144 | + -n 500 \ |
| 145 | + -t 16 \ |
| 146 | + --temp 0.7 \ |
| 147 | + --top-p 0.9 |
| 148 | +``` |
| 149 | + |
| 150 | +### Hardware Requirements |
| 151 | + |
| 152 | +| Resource | Minimum | Recommended | |
| 153 | +|----------|---------|-------------| |
| 154 | +| RAM | 4 GB | 8 GB | |
| 155 | +| VRAM | N/A (CPU inference) | N/A | |
| 156 | +| CPU Threads | 4 | 16 | |
| 157 | +| AVX | AVX2 | AVX-512 | |
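To compare a host against the table above, the CPU feature flags and thread count can be read on Linux. This is a sketch under stated assumptions: it relies on the `flags` line of `/proc/cpuinfo` (Linux only) and on the kernel's `avx2` and `avx512f` flag names, and `simd_level` is a hypothetical helper introduced only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a space-separated CPU flags string.
simd_level() {
  # $1 = flags string, e.g. the "flags" line of /proc/cpuinfo
  case " $1 " in
    *" avx512f "*) echo "AVX-512" ;;   # AVX-512 Foundation present
    *" avx2 "*)    echo "AVX2" ;;      # meets the minimum in the table
    *)             echo "none" ;;      # below the minimum, or not Linux
  esac
}

# Read the first "flags" line; empty on systems without /proc/cpuinfo.
flags="$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
echo "SIMD: $(simd_level "$flags")"
echo "Threads: $(getconf _NPROCESSORS_ONLN)"
```

A host that reports `AVX2` and at least 4 threads meets the minimum column; the thread count printed here is the number of online logical CPUs, not physical cores.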
| 158 | + |
| 159 | +### Cost Analysis (RunPod) |
| 160 | + |
| 161 | +| Pod (by GPU) | Cost/hr | tok/s | Cost per 1M tokens |
| 162 | +|-----|---------|-------|-------------------| |
| 163 | +| RTX 4090 | $0.34 | 20.79 | $4.54 | |
| 164 | +| A100 80GB | $1.19 | ~30* | $11.02 | |
| 165 | +| B200 | $2.50 | 52.67 | $13.18 | |
| 166 | + |
| 167 | +*Estimated throughput (not measured); the derived cost is therefore also an estimate. |
| 168 | + |
| 169 | +**Best value:** RTX 4090 at $4.54/1M tokens |
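The cost column appears to follow from the hourly price and throughput as `price / (tok/s × 3600) × 1,000,000`. A minimal sketch of that arithmetic, assuming the pod runs at the quoted throughput for the full hour (`cost_per_million` is a hypothetical helper name):

```bash
#!/usr/bin/env bash
# Hypothetical helper: USD per 1M generated tokens.
cost_per_million() {
  # $1 = pod price in USD per hour, $2 = throughput in tok/s
  awk -v price="$1" -v tps="$2" \
    'BEGIN { printf "%.2f\n", price / (tps * 3600) * 1000000 }'
}

cost_per_million 0.34 20.79   # RTX 4090 pod
cost_per_million 1.19 30      # A100 80GB pod (throughput is an estimate)
cost_per_million 2.50 52.67   # B200 pod
```

The three calls should reproduce the values in the table. Idle time, startup, and prompt processing are not modeled, so real per-token cost would be higher.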
| 170 | + |
| 171 | +--- |
| 172 | + |
| 173 | +## Files Reference |
| 174 | + |
| 175 | +``` |
| 176 | +models/ |
| 177 | +└── bitnet-gguf/ |
| 178 | + └── ggml-model-i2_s.gguf # 780 MB, official Microsoft |
| 179 | + |
| 180 | +docs/ |
| 181 | +├── bitnet_i2s_official_report.md # This report |
| 182 | +└── bitnet_tl2_report.md # TL2 failure analysis |
| 183 | +``` |
| 184 | + |
| 185 | +--- |
| 186 | + |
| 187 | +## Conclusion |
| 188 | + |
| 189 | +**The official Microsoft I2_S GGUF is production-ready** at 20.79 tok/s with coherent output. The expected 2.32x TL2 speedup is blocked pending upstream support for pre-quantized models. |
| 190 | + |
| 191 | +### Recommendations |
| 192 | + |
| 193 | +1. **Production:** Use official I2_S GGUF |
| 194 | +2. **Performance:** 16 threads optimal on EPYC 7282 |
| 195 | +3. **Cost:** RTX 4090 pod ($0.34/hr) best value |
| 196 | +4. **Future:** Monitor Microsoft repo for TL2 GGUF release |
| 197 | + |
| 198 | +--- |
| 199 | + |
| 200 | +**KOSCHEI IS IMMORTAL | I2_S = 20.79 tok/s | COHERENT ✅ | φ² + 1/φ² = 3** |