|
1 | | -# BitNet b1.58-2B-4T — TL2 Kernel Conversion & Benchmark Report |
| 1 | +# BitNet b1.58-2B-4T — TL2 Kernel Conversion Report |
2 | 2 |
|
3 | 3 | **Date:** February 6, 2026 |
4 | | -**Status:** SCRIPT READY — Awaiting RTX 4090 pod deployment |
5 | | -**Target:** 100-200 tok/s with TL2 lookup-table kernels |
6 | | -**Script:** `scripts/runpod_tl2_bitnet.sh` |
| 4 | +**Status:** TL2 CONVERSION BLOCKED — I2_S Baseline Confirmed |
| 5 | +**Platform:** RTX 4090 Pod (AMD EPYC 7282 Rome, 64 vCPU, AVX2 only) |
7 | 6 |
|
8 | 7 | --- |
9 | 8 |
|
10 | 9 | ## Executive Summary |
11 | 10 |
|
12 | | -TL2 (Table Lookup Level 2) kernels promise **2.32x speedup** over the current I2_S MAD kernel. Based on the B200 benchmark (52.67 tok/s with I2_S), TL2 should achieve **~120 tok/s** on the same hardware. On RTX 4090 pod (35 tok/s I2_S baseline), TL2 targets **~80 tok/s**. |
| 11 | +TL2 conversion from the pre-quantized HuggingFace model **failed to produce coherent output**. The official Microsoft I2_S GGUF works correctly at **20.79 tok/s** with coherent text generation. |
13 | 12 |
|
14 | | -### Three Critical Patches |
| 13 | +### Key Findings |
15 | 14 |
|
16 | | -The upstream Microsoft BitNet repo has three bugs preventing TL2 from working with BitNet b1.58-2B-4T: |
| 15 | +| Metric | I2_S (Official) | TL2 (Our Conversion) | |
| 16 | +|--------|-----------------|---------------------| |
| 17 | +| **Coherence** | ✅ PASS | ❌ FAIL (garbage) | |
| 18 | +| **Speed** | 20.79 tok/s | 19.93 tok/s | |
| 19 | +| **Model loads** | ✅ 332 tensors | ✅ 332 tensors | |
| 20 | +| **Source** | Microsoft GGUF | HF → Unpack → TL2 | |
17 | 21 |
|
18 | | -| Patch | File | Bug | Fix | |
19 | | -|-------|------|-----|-----| |
20 | | -| **1** | `setup_env.py` | `BITNET_X86_TL2=OFF` hardcoded for x86_64 | Change to `=ON` | |
21 | | -| **2** | `convert-hf-to-gguf-bitnet.py` | Only registers `BitnetForCausalLM` (lowercase n) | Add `@Model.register("BitNetForCausalLM")` | |
22 | | -| **3** | `convert-hf-to-gguf-bitnet.py` | `set_vocab()` hardcodes `_set_vocab_sentencepiece()` | Try/except fallback: SP → LlamaHF → GPT2/BPE | |
| 22 | +### Root Cause |
23 | 23 |
|
24 | | ---- |
25 | | - |
26 | | -## Background |
27 | | - |
28 | | -### I2_S vs TL2 Kernel Comparison |
29 | | - |
30 | | -| Feature | I2_S (MAD) | TL2 (Table Lookup) | |
31 | | -|---------|-----------|---------------------| |
32 | | -| **Encoding** | 2-bit signed integer | 5-bit lookup table (3 ternary values) | |
33 | | -| **Bits/weight** | 2.0 | ~1.67 | |
34 | | -| **Kernel** | Multiply-Add-Dot | Table lookup + accumulate | |
35 | | -| **AVX-512 utilization** | Partial (VNNI underused) | Full (optimized LUT) | |
36 | | -| **Expected speed** | 35-56 tok/s | 80-200 tok/s | |
37 | | -| **Speedup factor** | 1x (baseline) | **2.32x** (published benchmarks) | |
| 24 | +The BitNet b1.58-2B-4T model on HuggingFace is **pre-quantized with packed uint8 weights** (4 ternary values per byte). Somewhere in our unpack + TL2 transformation pipeline the weight encoding is corrupted, which breaks coherent generation; the exact step has not been isolated.
38 | 25 |
|
39 | | -### Why TL2 Was Not Used Previously |
40 | | - |
41 | | -On the B200 pod (February 5, 2026), TL2 failed because: |
42 | | - |
43 | | -1. **Tokenizer bug:** `convert-hf-to-gguf-bitnet.py` hardcodes SentencePiece tokenizer, but BitNet b1.58-2B-4T uses BPE (`tokenizer.json`, LLaMA 3 style) |
44 | | -2. **Architecture name bug:** Model config has `BitNetForCausalLM` (capital N), converter only registers `BitnetForCausalLM` (lowercase n) |
45 | | -3. **CMake flag bug:** `setup_env.py` hardcodes `-DBITNET_X86_TL2=OFF` for x86_64, never enabling TL2 kernels even when `-q tl2` is passed |
| 26 | +--- |
46 | 27 |
|
47 | | -**Critical finding from B200:** Loading an I2_S model with TL2 kernels compiled drops inference from 50 tok/s to **1.55 tok/s** — the formats are incompatible. |
| 28 | +## Work Completed |
48 | 29 |
|
49 | | ---- |
| 30 | +### 1. Patches Applied (7 total) |
50 | 31 |
|
51 | | -## Patch Details |
| 32 | +| # | File | Issue | Fix | |
| 33 | +|---|------|-------|-----| |
| 34 | +| 1 | `setup_env.py` | `BITNET_X86_TL2=OFF` hardcoded | Set to `ON` | |
| 35 | +| 2 | `convert-hf-to-gguf-bitnet.py` | `BitnetForCausalLM` lowercase | Added `BitNetForCausalLM` | |
| 36 | +| 3 | `convert-hf-to-gguf-bitnet.py` | SentencePiece hardcoded | Added BPE fallback | |
| 37 | +| 4 | `codegen_tl2.py` | Missing 2B-4T shapes | Added `[[640, 2560], [2560, 2560], [2560, 6912], [6912, 2560]]` | |
| 38 | +| 5 | `setup_env.py` | Wrong model name in codegen | Fixed to `BitNet-b1.58-2B-4T` | |
| 39 | +| 6 | `convert-hf-to-gguf-bitnet.py` | `weight_scale` tensors not skipped | Added `if name.endswith("weight_scale"): continue` | |
| 40 | +| 7 | Block sizes | BK=64 not divisible by 3 | Changed to BK=96 | |
52 | 41 |
|
53 | | -### Patch 1: Enable TL2 in CMake |
| 42 | +### 2. Unpacking Script Created |
54 | 43 |
|
55 | | -**File:** `setup_env.py` |
| 44 | +`/tmp/unpack_bitnet.py` — unpacks uint8 packed weights to float32 ternary: |
56 | 45 |
|
57 | 46 | ```python |
58 | | -# BEFORE (line ~30): |
59 | | -COMPILER_EXTRA_ARGS = { |
60 | | - "arm64": ["-DBITNET_ARM_TL1=OFF"], |
61 | | - "x86_64": ["-DBITNET_X86_TL2=OFF"] # <-- BUG: Always OFF |
62 | | -} |
63 | | - |
64 | | -# AFTER: |
65 | | -COMPILER_EXTRA_ARGS = { |
66 | | - "arm64": ["-DBITNET_ARM_TL1=OFF"], |
67 | | - "x86_64": ["-DBITNET_X86_TL2=ON"] # <-- FIXED: Enable TL2 |
68 | | -} |
| 47 | +def unpack_ternary_blocked(packed, packed_shape, factor=4): |
| 48 | + """Unpack uint8 -> ternary float tensor. |
| 49 | + 2-bit encoding: 00->-1, 01->0, 10->+1 |
| 50 | + """ |
| 51 | + M_packed, K = packed_shape |
| 52 | + M_logical = M_packed * factor |
| 53 | + data = packed.numpy().astype(np.uint8) |
| 54 | + result = np.zeros((M_logical, K), dtype=np.float32) |
| 55 | + |
| 56 | + for i in range(factor): |
| 57 | + bits = (data >> (i * 2)) & 0x03 |
| 58 | + mapped = bits.astype(np.float32) - 1.0 # 0->-1, 1->0, 2->+1 |
| 59 | + result[i * M_packed:(i + 1) * M_packed] = mapped |
| 60 | + |
| 61 | + return torch.from_numpy(result) |
69 | 62 | ``` |
70 | 63 |
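A pack/unpack round trip is a quick sanity check for the unpacker above. The sketch below is NumPy-only and assumes the same blocked layout and 2-bit mapping as `unpack_ternary_blocked`; `pack_ternary_blocked` and `unpack_blocked_np` are hypothetical helpers written for illustration, not functions from the BitNet repo.

```python
import numpy as np

def pack_ternary_blocked(ternary, factor=4):
    """Hypothetical inverse of the blocked unpacker.

    Rows [i*M_packed, (i+1)*M_packed) are stored in bit pair i of each byte,
    using the mapping -1->00, 0->01, +1->10.
    """
    m_logical, k = ternary.shape
    assert m_logical % factor == 0, "row count must be a multiple of factor"
    m_packed = m_logical // factor
    codes = (ternary + 1).astype(np.uint8)          # -1->0, 0->1, +1->2
    packed = np.zeros((m_packed, k), dtype=np.uint8)
    for i in range(factor):
        block = codes[i * m_packed:(i + 1) * m_packed]
        packed |= block << np.uint8(2 * i)
    return packed

def unpack_blocked_np(packed, factor=4):
    """NumPy-only version of the blocked unpacking shown in the report."""
    m_packed, k = packed.shape
    out = np.empty((m_packed * factor, k), dtype=np.float32)
    for i in range(factor):
        bits = (packed >> np.uint8(2 * i)) & np.uint8(0x03)
        out[i * m_packed:(i + 1) * m_packed] = bits.astype(np.float32) - 1.0
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.integers(-1, 2, size=(8, 6)).astype(np.int8)
    restored = unpack_blocked_np(pack_ternary_blocked(weights))
    print("round trip ok:", np.array_equal(restored, weights.astype(np.float32)))
```

A passing round trip only shows that the two helpers agree with each other. It says nothing about whether the blocked layout matches the one used in the HuggingFace checkpoint.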
|
71 | | -**Analysis:** This is likely an upstream oversight. The `gen_code()` function in `setup_env.py` runs `codegen_tl2.py` to generate TL2 kernel source files, but the cmake flag that includes them in the build is hardcoded OFF. The `quant_type` parameter (`-q tl2`) only affects model conversion, not cmake flags. |
| 64 | +### 3. TL2 GGUF Generated |
72 | 65 |
|
73 | | -### Patch 2: Architecture Name Registration |
| 66 | +- **File:** `ggml-model-tl2.gguf` |
| 67 | +- **Size:** 1.1 GB |
| 68 | +- **Tensors:** 332 (210 TL2 + 121 F32 + 1 F16) |
| 69 | +- **Kernel config:** BM=128,256,256,128 BK=96,96,96,96 bm=32,32,32,32 |
74 | 70 |
|
75 | | -**File:** `utils/convert-hf-to-gguf-bitnet.py` |
| 71 | +### 4. Inference Test Results |
76 | 72 |
|
77 | | -```python |
78 | | -# BEFORE: |
79 | | -@Model.register("BitnetForCausalLM") |
80 | | -class BitnetModel(Model): |
81 | | - ... |
82 | | - |
83 | | -# AFTER: |
84 | | -@Model.register("BitNetForCausalLM") # Capital N (as in config.json) |
85 | | -@Model.register("BitnetForCausalLM") # Original lowercase n |
86 | | -class BitnetModel(Model): |
87 | | - ... |
| 73 | +**TL2 (Our Conversion):** |
88 | 74 | ``` |
| 75 | +The future of artificial intelligence is residue FarGil Harmarth Rolling |
| 76 | +Nearbyabyzel connected aster cooler Again developing Damkem locking... |
| 77 | +``` |
| 78 | +Speed: 19.93 tok/s, **OUTPUT: GARBAGE** |
89 | 79 |
|
90 | | -**Analysis:** BitNet b1.58-2B-4T's `config.json` lists architecture as `BitNetForCausalLM` (capital N), but the converter only registers lowercase `BitnetForCausalLM`. PR #213 on GitHub attempted this fix but was closed without merge. |
91 | | - |
92 | | -### Patch 3: BPE Tokenizer Support |
| 80 | +**I2_S (Official Microsoft):** |
| 81 | +``` |
| 82 | +The future of artificial intelligence is uncertain, but one thing is clear: |
| 83 | +AI will be a major player in the world of finance. The impact of AI on the |
| 84 | +financial industry is likely to be significant... |
| 85 | +``` |
| 86 | +Speed: 20.79 tok/s, **OUTPUT: COHERENT** ✅ |
93 | 87 |
|
94 | | -**File:** `utils/convert-hf-to-gguf-bitnet.py` |
| 88 | +--- |
95 | 89 |
|
96 | | -```python |
97 | | -# BEFORE: |
98 | | -def set_vocab(self): |
99 | | - self._set_vocab_sentencepiece() # Fails: no tokenizer.model file |
100 | | - |
101 | | -# AFTER (LlamaModel pattern): |
102 | | -def set_vocab(self): |
103 | | - try: |
104 | | - self._set_vocab_sentencepiece() |
105 | | - except FileNotFoundError: |
106 | | - try: |
107 | | - self._set_vocab_llama_hf() |
108 | | - except (FileNotFoundError, TypeError): |
109 | | - # BitNet b1.58-2B-4T uses BPE tokenizer (tokenizer.json) |
110 | | - self._set_vocab_gpt2() |
111 | | -``` |
| 90 | +## Technical Analysis |
112 | 91 |
|
113 | | -**Analysis:** BitNet b1.58-2B-4T uses a BPE tokenizer (`tokenizer.json`) derived from LLaMA 3, not SentencePiece (`tokenizer.model`). The `LlamaModel` class in the same file already has this exact try/except fallback pattern. The `_set_vocab_gpt2()` method is defined in the base `Model` class and handles BPE tokenizers correctly. |
| 92 | +### Why TL2 Produces Garbage |
114 | 93 |
|
115 | | ---- |
| 94 | +The pre-quantized model has these complexities: |
116 | 95 |
|
117 | | -## TL2 Build Flow |
| 96 | +1. **Packed uint8 weights:** 4 ternary values per byte (2 bits each) |
| 97 | +2. **Per-layer weight_scale:** 210 scalar scales (one per weight matrix) |
| 98 | +3. **Unknown packing layout:** "Blocked" vs "Interleaved" vs "Reversed" |
| 99 | +4. **TL2 transform:** Groups 3 ternary values into 5-bit LUT indices |
118 | 100 |
|
119 | | -The complete TL2 build pipeline after patches: |
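| 101 | +Our unpacking yields only the expected ternary values (-1, 0, +1) in plausible proportions, but that check is independent of element order, so the packing layout itself remains unverified. In addition, the TL2 `transform_to_tl2()` function may: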
| 102 | +- Expect weights in a different layout |
| 103 | +- Apply incorrect scale normalization |
| 104 | +- Have dimension ordering issues |
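To illustrate why three ternary values fit in a 5-bit index (item 4 above), here is a minimal base-3 sketch: 3³ = 27 combinations, which is at most 2⁵ = 32. This is not the bit layout used by `transform_to_tl2()`, which this report does not reproduce; `triples_to_index` and `index_to_triples` are hypothetical names.

```python
import numpy as np

def triples_to_index(ternary_row):
    """Map consecutive triples of {-1, 0, +1} to base-3 indices in [0, 26]."""
    t = np.asarray(ternary_row, dtype=np.int64)
    assert t.size % 3 == 0, "length must be a multiple of 3"
    digits = (t + 1).reshape(-1, 3)                 # values in {0, 1, 2}
    return digits[:, 0] * 9 + digits[:, 1] * 3 + digits[:, 2]

def index_to_triples(indices):
    """Inverse of triples_to_index: recover the flat ternary sequence."""
    idx = np.asarray(indices, dtype=np.int64)
    digits = np.stack([idx // 9, (idx // 3) % 3, idx % 3], axis=1)
    return (digits - 1).reshape(-1)

if __name__ == "__main__":
    row = [-1, -1, -1, 1, 1, 1, 0, 0, 0]
    idx = triples_to_index(row)
    print(idx, index_to_triples(idx))
```

Grouping by three is also consistent with patch 7: a K-block size that is not a multiple of 3 (such as BK=64) leaves an incomplete group, whereas BK=96 does not.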
120 | 105 |
|
| 106 | +### Weight Distribution (Verified Correct) |
121 | 107 | ``` |
122 | | -setup_env.py -hr microsoft/BitNet-b1.58-2B-4T -q tl2 |
123 | | - │ |
124 | | - ├── 1. setup_gguf() → pip install gguf |
125 | | - │ |
126 | | - ├── 2. gen_code() → codegen_tl2.py --model bitnet_b1_58-2B-4T |
127 | | - │ --BM "160,320,320" --BK "96,96,96" --bm "32,32,32" |
128 | | - │ (generates TL2 kernel C++ source files) |
129 | | - │ |
130 | | - ├── 3. compile() → cmake -B build -DBITNET_X86_TL2=ON [PATCHED] |
131 | | - │ cmake --build build |
132 | | - │ |
133 | | - └── 4. prepare_model() → convert-hf-to-gguf-bitnet.py [PATCHED] |
134 | | - --outtype tl2 --quant-embd |
135 | | - (downloads HF model → converts to TL2 GGUF) |
| 108 | +2-bit value distribution: |
| 109 | + 0 (->-1): 25.2% |
| 110 | + 1 (-> 0): 49.6% |
| 111 | + 2 (->+1): 25.2% |
| 112 | + 3 (unused): 0% |
136 | 113 | ``` |
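A distribution like the one above can be computed directly from the packed bytes. The following is a minimal sketch, assuming four 2-bit codes per `uint8`; `two_bit_histogram` is a hypothetical helper, not the script used on the pod.

```python
import numpy as np

def two_bit_histogram(packed):
    """Return the fraction of 2-bit codes 0..3 across all four bit pairs."""
    data = np.asarray(packed, dtype=np.uint8).ravel()
    counts = np.zeros(4, dtype=np.int64)
    for i in range(4):
        codes = (data >> np.uint8(2 * i)) & np.uint8(0x03)
        counts += np.bincount(codes, minlength=4)
    return counts / counts.sum()

if __name__ == "__main__":
    # One byte holding the codes 1, 0, 1, 2 (lowest bit pair first).
    sample = np.array([0b10010001], dtype=np.uint8)
    print(two_bit_histogram(sample))
```

Because the histogram pools all bytes and all bit pairs, shuffling bytes or swapping bit pairs leaves it unchanged. It validates the value mapping, not the layout.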
137 | 114 |
|
138 | | -### Codegen Parameters for 2B-4T |
139 | | - |
140 | | -The 2B-4T model shares codegen parameters with the 3B model: |
141 | | -- `--BM "160,320,320"` — block sizes for M dimension |
142 | | -- `--BK "96,96,96"` — block sizes for K dimension |
143 | | -- `--bm "32,32,32"` — micro-block sizes |
| 115 | +### Architecture Mismatch |
| 116 | +``` |
| 117 | +I2_S GGUF: general.architecture = "bitnet-b1.58" |
| 118 | +TL2 GGUF: general.architecture = "bitnet" |
| 119 | +``` |
144 | 120 |
|
145 | 121 | --- |
146 | 122 |
|
147 | | -## Expected Results |
148 | | - |
149 | | -### RTX 4090 Pod ($0.20/hr) |
| 123 | +## Recommendations |
150 | 124 |
|
151 | | -| Kernel | Threads | Expected tok/s | |
152 | | -|--------|---------|---------------| |
153 | | -| I2_S (current) | 4 | 35 (measured) | |
154 | | -| **TL2 (target)** | **4** | **~80** | |
155 | | -| **TL2 (target)** | **6** | **~100** | |
| 125 | +### Option A: Use Official I2_S GGUF (Recommended) |
| 126 | +- Download: `microsoft/bitnet-b1.58-2B-4T-gguf` |
| 127 | +- Speed: 20.79 tok/s on RTX 4090 pod |
| 128 | +- Quality: Coherent output |
| 129 | +- **No conversion needed** |
156 | 130 |
|
157 | | -### B200 Pod (reference) |
| 131 | +### Option B: TL2 via Upstream Fix |
| 132 | +Wait for Microsoft to: |
| 133 | +1. Publish TL2 GGUF for b1.58-2B-4T |
| 134 | +2. Fix `convert-hf-to-gguf-bitnet.py` for pre-quantized models |
| 135 | +3. Document the weight packing format |
158 | 136 |
|
159 | | -| Kernel | Threads | Expected tok/s | |
160 | | -|--------|---------|---------------| |
161 | | -| I2_S (measured) | 16 | 52.67 | |
162 | | -| **TL2 (projected)** | **16** | **~120** | |
| 137 | +### Option C: Debug TL2 Transform (High Effort) |
| 138 | +1. Compare I2_S tensor bytes with TL2 tensor bytes |
| 139 | +2. Reverse-engineer the correct unpacking order |
| 140 | +3. Validate against known working TL2 models (Llama3 variants) |
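Step 2 could start from a brute-force comparison of candidate layouts against a reference matrix. The sketch below assumes such a reference is available (for example, ternary weights recovered from a known-good file; obtaining it is outside this sketch). The layout names and the helpers `unpack_candidates` and `matching_layouts` are hypothetical.

```python
import numpy as np

def unpack_candidates(packed):
    """Unpack a (M_packed, K) uint8 matrix under four layout hypotheses."""
    m_packed, k = packed.shape
    candidates = {}
    for reverse in (False, True):
        shifts = [6, 4, 2, 0] if reverse else [0, 2, 4, 6]
        # codes[i] holds the ternary value taken from the i-th bit pair.
        codes = np.stack(
            [(packed >> np.uint8(s)) & np.uint8(0x03) for s in shifts]
        ).astype(np.int8) - 1                       # shape (4, M_packed, K)
        tag = "reversed" if reverse else "forward"
        # Blocked: bit pair i fills rows [i*M_packed, (i+1)*M_packed).
        candidates[f"blocked/{tag}"] = codes.reshape(4 * m_packed, k)
        # Interleaved: packed row r expands to rows 4r .. 4r+3.
        candidates[f"interleaved/{tag}"] = (
            codes.transpose(1, 0, 2).reshape(4 * m_packed, k)
        )
    return candidates

def matching_layouts(packed, reference):
    """Names of the hypotheses whose output equals the reference matrix."""
    ref = np.asarray(reference, dtype=np.int8)
    return [name for name, w in unpack_candidates(packed).items()
            if np.array_equal(w, ref)]
```

If no hypothesis matches, the real layout is something else (for example, packing along K rather than M), and the candidate set needs to be extended.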
163 | 141 |
|
164 | 142 | --- |
165 | 143 |
|
166 | | -## Comparison: All Benchmarks |
| 144 | +## Benchmark Summary |
167 | 145 |
|
168 | | -| Platform | CPU | Kernel | Threads | tok/s | Cost/hr | |
169 | | -|----------|-----|--------|---------|-------|---------| |
170 | | -| RTX 4090 pod | AMD EPYC 75F3 | I2_S | 4 | 35 | $0.20 | |
171 | | -| B200 pod | Intel Xeon 8568Y+ | I2_S | 16 | 52.67 | $4.24 | |
172 | | -| RTX 4090 pod | AMD EPYC 75F3 | TL2 | 4 | TBD | $0.20 | |
173 | | -| RTX 4090 pod | AMD EPYC 75F3 | TL2 | 6 | TBD | $0.20 | |
| 146 | +| Test | CPU | Kernel | Threads | tok/s | Coherent |
| 147 | +|------|-----|--------|---------|-------|----------|
| 148 | +| B200 (prev) | Xeon 8568Y+ | I2_S | 16 | 52.67 | ✅ |
| 149 | +| RTX 4090 | EPYC 7282 | I2_S | 16 | 20.79 | ✅ | |
| 150 | +| RTX 4090 | EPYC 7282 | TL2 | 4 | 19.93 | ❌ | |
174 | 151 |
|
175 | | ---- |
176 | | - |
177 | | -## Deployment |
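| 152 | +**Note:** The RTX 4090 pod's CPU is an AMD EPYC 7282 (Rome) with AVX2 only (no AVX-512), which may limit TL2 performance gains. The I2_S and TL2 rows also use different thread counts (16 vs 4), so their speeds are not a like-for-like comparison.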
178 | 153 |
|
179 | | -```bash |
180 | | -# 1. Launch RTX 4090 pod on RunPod ($0.20/hr Community Cloud) |
181 | | -# 2. SSH into pod |
182 | | -ssh root@<IP> -p <PORT> -i ~/.ssh/id_rsa |
183 | | - |
184 | | -# 3. Run TL2 script |
185 | | -cd /root |
186 | | -git clone https://github.com/gHashTag/trinity.git |
187 | | -bash trinity/scripts/runpod_tl2_bitnet.sh |
| 154 | +--- |
188 | 155 |
|
189 | | -# 4. Copy results |
190 | | -scp -P <PORT> root@<IP>:/root/bitnet_tl2_results.txt docs/ |
191 | | -scp -P <PORT> root@<IP>:/root/bitnet_tl2_metrics.json docs/ |
| 156 | +## Files Modified on Pod |
192 | 157 |
|
193 | | -# 5. STOP POD immediately |
| 158 | +``` |
| 159 | +/root/BitNet/ |
| 160 | +├── setup_env.py (patched x2) |
| 161 | +├── utils/ |
| 162 | +│ ├── codegen_tl2.py (patched) |
| 163 | +│ └── convert-hf-to-gguf-bitnet.py (patched x3) |
| 164 | +├── include/ |
| 165 | +│ ├── kernel_config.ini (generated) |
| 166 | +│ └── bitnet-lut-kernels.h (generated) |
| 167 | +├── models/ |
| 168 | +│ ├── BitNet-b1.58-2B-4T/ (HF download) |
| 169 | +│ ├── BitNet-b1.58-2B-4T-unpacked/ (8.4 GB float32) |
| 170 | +│ └── bitnet-gguf/ggml-model-i2_s.gguf (official) |
| 171 | +└── build/bin/llama-cli (compiled with TL2=ON) |
194 | 172 | ``` |
195 | 173 |
|
196 | 174 | --- |
197 | 175 |
|
198 | | -## Risk Assessment |
199 | | - |
200 | | -| Risk | Likelihood | Mitigation | |
201 | | -|------|-----------|------------| |
202 | | -| TL2 conversion still fails (unknown bug) | Medium | Fall back to manual conversion with `convert-ms-to-gguf-bitnet.py` | |
203 | | -| TL2 slower than expected | Low | I2_S benchmark already establishes baseline | |
204 | | -| Patches break I2_S path | None | Patches only affect TL2 code path | |
205 | | -| codegen_tl2.py fails | Low | Parameters verified from setup_env.py source | |
| 176 | +## Conclusion |
206 | 177 |
|
207 | | ---- |
| 178 | +**TL2 conversion from pre-quantized HuggingFace weights is not viable without additional reverse-engineering.** The official I2_S GGUF provides reliable inference at 20.79 tok/s. |
208 | 179 |
|
209 | | -## Status |
| 180 | +The 2.32x TL2 speedup would require: |
| 181 | +1. Microsoft publishing an official TL2 GGUF, or
| 182 | +2. Converting from float16 weights (not the pre-quantized release), or |
| 183 | +3. Understanding the exact packing format for proper unpacking |
210 | 184 |
|
211 | | -- [x] Research TL2 conversion mechanism |
212 | | -- [x] Identify three critical patches |
213 | | -- [x] Create patched build script (`scripts/runpod_tl2_bitnet.sh`) |
214 | | -- [x] Create preliminary report |
215 | | -- [ ] Deploy RTX 4090 pod |
216 | | -- [ ] Run TL2 benchmark |
217 | | -- [ ] Update report with real metrics |
| 185 | +**Recommendation:** Use official I2_S GGUF for production. TL2 effort is blocked pending upstream support. |
218 | 186 |
|
219 | 187 | --- |
220 | 188 |
|
221 | | -**KOSCHEI IS IMMORTAL | TL2 = 2.32x SPEEDUP | THREE PATCHES TO 100+ tok/s | phi^2 + 1/phi^2 = 3** |
| 189 | +**KOSCHEI IS IMMORTAL | I2_S = 20.79 tok/s | TL2 BLOCKED | φ² + 1/φ² = 3** |