Commit ebfeade

gHashTag and claude committed
docs: TL2 conversion report - I2_S working, TL2 blocked
Key findings:
- Official Microsoft I2_S GGUF: 20.79 tok/s, coherent output ✅
- TL2 from pre-quantized HF model: 19.93 tok/s, garbage output ❌

Root cause: Pre-quantized uint8 packed weights require reverse-engineering the exact packing format for proper TL2 transformation.

Applied 7 patches to BitNet repo, created unpacking script, generated TL2 GGUF (1.1GB, 332 tensors), but coherent output requires upstream fix.

Recommendation: Use official I2_S GGUF for production.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 32df0cb commit ebfeade

1 file changed

Lines changed: 129 additions & 161 deletions

docs/bitnet_tl2_report.md

@@ -1,221 +1,189 @@
1-
# BitNet b1.58-2B-4T — TL2 Kernel Conversion & Benchmark Report
1+
# BitNet b1.58-2B-4T — TL2 Kernel Conversion Report
22

33
**Date:** February 6, 2026
4-
**Status:** SCRIPT READY — Awaiting RTX 4090 pod deployment
5-
**Target:** 100-200 tok/s with TL2 lookup-table kernels
6-
**Script:** `scripts/runpod_tl2_bitnet.sh`
4+
**Status:** TL2 CONVERSION BLOCKED — I2_S Baseline Confirmed
5+
**Platform:** RTX 4090 Pod (AMD EPYC 7282 Rome, 64 vCPU, AVX2 only)
76

87
---
98

109
## Executive Summary
1110

12-
TL2 (Table Lookup Level 2) kernels promise **2.32x speedup** over the current I2_S MAD kernel. Based on the B200 benchmark (52.67 tok/s with I2_S), TL2 should achieve **~120 tok/s** on the same hardware. On RTX 4090 pod (35 tok/s I2_S baseline), TL2 targets **~80 tok/s**.
11+
TL2 conversion from the pre-quantized HuggingFace model **failed to produce coherent output**. The official Microsoft I2_S GGUF works correctly at **20.79 tok/s** with coherent text generation.
1312

14-
### Three Critical Patches
13+
### Key Findings
1514

16-
The upstream Microsoft BitNet repo has three bugs preventing TL2 from working with BitNet b1.58-2B-4T:
15+
| Metric | I2_S (Official) | TL2 (Our Conversion) |
16+
|--------|-----------------|---------------------|
17+
| **Coherence** | ✅ PASS | ❌ FAIL (garbage) |
18+
| **Speed** | 20.79 tok/s | 19.93 tok/s |
19+
| **Model loads** | ✅ 332 tensors | ✅ 332 tensors |
20+
| **Source** | Microsoft GGUF | HF → Unpack → TL2 |
1721

18-
| Patch | File | Bug | Fix |
19-
|-------|------|-----|-----|
20-
| **1** | `setup_env.py` | `BITNET_X86_TL2=OFF` hardcoded for x86_64 | Change to `=ON` |
21-
| **2** | `convert-hf-to-gguf-bitnet.py` | Only registers `BitnetForCausalLM` (lowercase n) | Add `@Model.register("BitNetForCausalLM")` |
22-
| **3** | `convert-hf-to-gguf-bitnet.py` | `set_vocab()` hardcodes `_set_vocab_sentencepiece()` | Try/except fallback: SP → LlamaHF → GPT2/BPE |
22+
### Root Cause
2323

24-
---
25-
26-
## Background
27-
28-
### I2_S vs TL2 Kernel Comparison
29-
30-
| Feature | I2_S (MAD) | TL2 (Table Lookup) |
31-
|---------|-----------|---------------------|
32-
| **Encoding** | 2-bit signed integer | 5-bit lookup table (3 ternary values) |
33-
| **Bits/weight** | 2.0 | ~1.67 |
34-
| **Kernel** | Multiply-Add-Dot | Table lookup + accumulate |
35-
| **AVX-512 utilization** | Partial (VNNI underused) | Full (optimized LUT) |
36-
| **Expected speed** | 35-56 tok/s | 80-200 tok/s |
37-
| **Speedup factor** | 1x (baseline) | **2.32x** (published benchmarks) |
24+
The BitNet b1.58-2B-4T model on HuggingFace is **pre-quantized with packed uint8 weights** (4 ternary values per byte). Our unpacking + TL2 transformation pipeline introduces errors in the weight encoding that break coherent generation.
3825

39-
### Why TL2 Was Not Used Previously
40-
41-
On the B200 pod (February 5, 2026), TL2 failed because:
42-
43-
1. **Tokenizer bug:** `convert-hf-to-gguf-bitnet.py` hardcodes SentencePiece tokenizer, but BitNet b1.58-2B-4T uses BPE (`tokenizer.json`, LLaMA 3 style)
44-
2. **Architecture name bug:** Model config has `BitNetForCausalLM` (capital N), converter only registers `BitnetForCausalLM` (lowercase n)
45-
3. **CMake flag bug:** `setup_env.py` hardcodes `-DBITNET_X86_TL2=OFF` for x86_64, never enabling TL2 kernels even when `-q tl2` is passed
26+
---
4627

47-
**Critical finding from B200:** Loading an I2_S model with TL2 kernels compiled drops inference from 50 tok/s to **1.55 tok/s** — the formats are incompatible.
28+
## Work Completed
4829

49-
---
30+
### 1. Patches Applied (7 total)
5031

51-
## Patch Details
32+
| # | File | Issue | Fix |
33+
|---|------|-------|-----|
34+
| 1 | `setup_env.py` | `BITNET_X86_TL2=OFF` hardcoded | Set to `ON` |
35+
| 2 | `convert-hf-to-gguf-bitnet.py` | `BitnetForCausalLM` lowercase | Added `BitNetForCausalLM` |
36+
| 3 | `convert-hf-to-gguf-bitnet.py` | SentencePiece hardcoded | Added BPE fallback |
37+
| 4 | `codegen_tl2.py` | Missing 2B-4T shapes | Added `[[640, 2560], [2560, 2560], [2560, 6912], [6912, 2560]]` |
38+
| 5 | `setup_env.py` | Wrong model name in codegen | Fixed to `BitNet-b1.58-2B-4T` |
39+
| 6 | `convert-hf-to-gguf-bitnet.py` | `weight_scale` tensors not skipped | Added `if name.endswith("weight_scale"): continue` |
40+
| 7 | Block sizes | BK=64 not divisible by 3 | Changed to BK=96 |
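Patch 7 follows from TL2 grouping three weights per index, so each BK should be a multiple of 3. Below is a minimal sketch of that sanity check, using the shapes from patch 4 and the BM/BK values from the generated kernel config. `check_config` is a hypothetical helper, and treating the first entry of each shape as M is an assumption rather than something taken from `codegen_tl2.py`.

```python
# Hypothetical sanity check for TL2 block sizes (not part of the BitNet repo).
# Assumption: each shape is [M, K] and BM tiles the M dimension.
SHAPES = [[640, 2560], [2560, 2560], [2560, 6912], [6912, 2560]]
BM = [128, 256, 256, 128]
BK = [96, 96, 96, 96]


def check_config(shapes, bm_list, bk_list, group=3):
    """Return a list of human-readable problems; empty means the config passes."""
    problems = []
    for (m, k), bm, bk in zip(shapes, bm_list, bk_list):
        if bk % group != 0:
            problems.append(f"BK={bk} is not a multiple of {group}")
        if m % bm != 0:
            problems.append(f"M={m} is not a multiple of BM={bm}")
    return problems


problems_bk96 = check_config(SHAPES, BM, BK)          # patched config
problems_bk64 = check_config(SHAPES, BM, [64] * 4)    # config before patch 7
```

With BK=96 the list comes back empty, while BK=64 is flagged once per shape.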
5241

53-
### Patch 1: Enable TL2 in CMake
42+
### 2. Unpacking Script Created
5443

55-
**File:** `setup_env.py`
44+
`/tmp/unpack_bitnet.py` — unpacks uint8 packed weights to float32 ternary:
5645

5746
```python
58-
# BEFORE (line ~30):
59-
COMPILER_EXTRA_ARGS = {
60-
    "arm64": ["-DBITNET_ARM_TL1=OFF"],
61-
    "x86_64": ["-DBITNET_X86_TL2=OFF"] # <-- BUG: Always OFF
62-
}
63-
64-
# AFTER:
65-
COMPILER_EXTRA_ARGS = {
66-
    "arm64": ["-DBITNET_ARM_TL1=OFF"],
67-
    "x86_64": ["-DBITNET_X86_TL2=ON"] # <-- FIXED: Enable TL2
68-
}
47+
def unpack_ternary_blocked(packed, packed_shape, factor=4):
48+
    """Unpack uint8 -> ternary float tensor.
49+
    2-bit encoding: 00->-1, 01->0, 10->+1
50+
    """
51+
    M_packed, K = packed_shape
52+
    M_logical = M_packed * factor
53+
    data = packed.numpy().astype(np.uint8)
54+
    result = np.zeros((M_logical, K), dtype=np.float32)
55+
56+
    for i in range(factor):
57+
        bits = (data >> (i * 2)) & 0x03
58+
        mapped = bits.astype(np.float32) - 1.0 # 0->-1, 1->0, 2->+1
59+
        result[i * M_packed:(i + 1) * M_packed] = mapped
60+
61+
    return torch.from_numpy(result)
6962
```
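One cheap way to gain confidence in the bit arithmetic above is a round-trip test: pack a known ternary matrix, unpack it, and compare. The sketch below does this in pure Python so it runs without NumPy or torch. `pack_ternary_blocked` is a hypothetical inverse written for illustration, and the "blocked" layout it uses is the one the script assumes, which is not confirmed to match the HuggingFace release.

```python
# Hypothetical round-trip check for the "blocked" layout assumed by the script.

def pack_ternary_blocked(rows, factor=4):
    """Pack a ternary matrix (-1/0/+1) into byte values, blocked layout.

    Logical row i * M_packed + r goes into bits [2i, 2i+1] of packed row r.
    Assumes len(rows) is divisible by factor.
    """
    m_packed = len(rows) // factor
    k = len(rows[0])
    packed = [[0] * k for _ in range(m_packed)]
    for i in range(factor):
        for r in range(m_packed):
            for c in range(k):
                code = rows[i * m_packed + r][c] + 1  # -1/0/+1 -> 0/1/2
                packed[r][c] |= code << (2 * i)
    return packed


def unpack_ternary_blocked(packed, factor=4):
    """Inverse of pack_ternary_blocked; same shifts and mask as the script."""
    m_packed = len(packed)
    k = len(packed[0])
    out = [[0] * k for _ in range(m_packed * factor)]
    for i in range(factor):
        for r in range(m_packed):
            for c in range(k):
                out[i * m_packed + r][c] = ((packed[r][c] >> (2 * i)) & 0x03) - 1
    return out


weights = [
    [-1, 0, 1], [0, 0, -1], [1, 1, 0], [-1, -1, 1],
    [0, 1, 0], [1, -1, -1], [0, 0, 0], [1, 0, -1],
]
packed = pack_ternary_blocked(weights)
roundtrip_ok = unpack_ternary_blocked(packed) == weights
```

A passing round trip only shows that the pack and unpack helpers agree with each other. It says nothing about whether the released weights were packed the same way.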
7063

71-
**Analysis:** This is likely an upstream oversight. The `gen_code()` function in `setup_env.py` runs `codegen_tl2.py` to generate TL2 kernel source files, but the cmake flag that includes them in the build is hardcoded OFF. The `quant_type` parameter (`-q tl2`) only affects model conversion, not cmake flags.
64+
### 3. TL2 GGUF Generated
7265

73-
### Patch 2: Architecture Name Registration
66+
- **File:** `ggml-model-tl2.gguf`
67+
- **Size:** 1.1 GB
68+
- **Tensors:** 332 (210 TL2 + 121 F32 + 1 F16)
69+
- **Kernel config:** BM=128,256,256,128 BK=96,96,96,96 bm=32,32,32,32
7470

75-
**File:** `utils/convert-hf-to-gguf-bitnet.py`
71+
### 4. Inference Test Results
7672

77-
```python
78-
# BEFORE:
79-
@Model.register("BitnetForCausalLM")
80-
class BitnetModel(Model):
81-
    ...
82-
83-
# AFTER:
84-
@Model.register("BitNetForCausalLM") # Capital N (as in config.json)
85-
@Model.register("BitnetForCausalLM") # Original lowercase n
86-
class BitnetModel(Model):
87-
    ...
73+
**TL2 (Our Conversion):**
8874
```
75+
The future of artificial intelligence is residue FarGil Harmarth Rolling
76+
Nearbyabyzel connected aster cooler Again developing Damkem locking...
77+
```
78+
Speed: 19.93 tok/s, **OUTPUT: GARBAGE**
8979

90-
**Analysis:** BitNet b1.58-2B-4T's `config.json` lists architecture as `BitNetForCausalLM` (capital N), but the converter only registers lowercase `BitnetForCausalLM`. PR #213 on GitHub attempted this fix but was closed without merge.
91-
92-
### Patch 3: BPE Tokenizer Support
80+
**I2_S (Official Microsoft):**
81+
```
82+
The future of artificial intelligence is uncertain, but one thing is clear:
83+
AI will be a major player in the world of finance. The impact of AI on the
84+
financial industry is likely to be significant...
85+
```
86+
Speed: 20.79 tok/s, **OUTPUT: COHERENT**
9387

94-
**File:** `utils/convert-hf-to-gguf-bitnet.py`
88+
---
9589

96-
```python
97-
# BEFORE:
98-
def set_vocab(self):
99-
    self._set_vocab_sentencepiece() # Fails: no tokenizer.model file
100-
101-
# AFTER (LlamaModel pattern):
102-
def set_vocab(self):
103-
    try:
104-
        self._set_vocab_sentencepiece()
105-
    except FileNotFoundError:
106-
        try:
107-
            self._set_vocab_llama_hf()
108-
        except (FileNotFoundError, TypeError):
109-
            # BitNet b1.58-2B-4T uses BPE tokenizer (tokenizer.json)
110-
            self._set_vocab_gpt2()
111-
```
90+
## Technical Analysis
11291

113-
**Analysis:** BitNet b1.58-2B-4T uses a BPE tokenizer (`tokenizer.json`) derived from LLaMA 3, not SentencePiece (`tokenizer.model`). The `LlamaModel` class in the same file already has this exact try/except fallback pattern. The `_set_vocab_gpt2()` method is defined in the base `Model` class and handles BPE tokenizers correctly.
92+
### Why TL2 Produces Garbage
11493

115-
---
94+
The pre-quantized model has these complexities:
11695

117-
## TL2 Build Flow
96+
1. **Packed uint8 weights:** 4 ternary values per byte (2 bits each)
97+
2. **Per-layer weight_scale:** 210 scalar scales (one per weight matrix)
98+
3. **Unknown packing layout:** "Blocked" vs "Interleaved" vs "Reversed"
99+
4. **TL2 transform:** Groups 3 ternary values into 5-bit LUT indices
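To see why three ternary values fit a 5-bit index, note that there are 3³ = 27 combinations and 27 ≤ 32. The sketch below uses a plain base-3 encoding to illustrate the counting argument. It is not the index layout used by the real TL2 kernels, and both function names are hypothetical.

```python
from itertools import product

# Illustrative base-3 encoding only; the real TL2 index layout may differ.


def triple_to_index(a, b, c):
    """Map three ternary values (-1/0/+1) to an integer in 0..26."""
    return (a + 1) * 9 + (b + 1) * 3 + (c + 1)


def index_to_triple(idx):
    """Inverse of triple_to_index."""
    return (idx // 9 - 1, (idx // 3) % 3 - 1, idx % 3 - 1)


triples = list(product((-1, 0, 1), repeat=3))
indices = [triple_to_index(*t) for t in triples]

all_fit_in_5_bits = max(indices) < 2 ** 5   # 26 < 32
bits_per_weight = 5 / 3                     # roughly 1.67 bits per weight
```

Because the grouping consumes weights three at a time along one dimension, any change in row or column order before the transform changes which weights share an index.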
118100

119-
The complete TL2 build pipeline after patches:
101+
Our unpacking yields only valid ternary values (-1, 0, +1), but `transform_to_tl2()` may:
102+
- Expect weights in a different layout
103+
- Apply incorrect scale normalization
104+
- Have dimension ordering issues
120105

106+
### Weight Distribution (Verified Correct)
121107
```
122-
setup_env.py -hr microsoft/BitNet-b1.58-2B-4T -q tl2
123-
124-
├── 1. setup_gguf() → pip install gguf
125-
126-
├── 2. gen_code() → codegen_tl2.py --model bitnet_b1_58-2B-4T
127-
│ --BM "160,320,320" --BK "96,96,96" --bm "32,32,32"
128-
│ (generates TL2 kernel C++ source files)
129-
130-
├── 3. compile() → cmake -B build -DBITNET_X86_TL2=ON [PATCHED]
131-
│ cmake --build build
132-
133-
└── 4. prepare_model() → convert-hf-to-gguf-bitnet.py [PATCHED]
134-
--outtype tl2 --quant-embd
135-
(downloads HF model → converts to TL2 GGUF)
108+
2-bit value distribution:
109+
0 (->-1): 25.2%
110+
1 (-> 0): 49.6%
111+
2 (->+1): 25.2%
112+
3 (unused): 0%
136113
```
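A distribution like the one above can be reproduced by counting the 2-bit codes in the packed bytes. The following is a minimal sketch with a hypothetical `code_histogram` helper and a two-byte sample instead of real model weights.

```python
from collections import Counter


def code_histogram(packed_bytes):
    """Return the fraction of each 2-bit code (0..3) across all packed bytes."""
    counts = Counter()
    for byte in packed_bytes:
        for shift in (0, 2, 4, 6):
            counts[(byte >> shift) & 0x03] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no bytes to analyse")
    return {code: counts[code] / total for code in range(4)}


# Two sample bytes, written with the highest bit pair first.
sample = bytes([0b00_01_10_01, 0b01_01_00_10])
hist = code_histogram(sample)
```

The histogram ignores position, so it comes out the same for any ordering of the bit pairs. A clean distribution with no code 3 shows the 2-bit encoding is plausible, but cannot confirm the layout.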
137114

138-
### Codegen Parameters for 2B-4T
139-
140-
The 2B-4T model shares codegen parameters with the 3B model:
141-
- `--BM "160,320,320"` — block sizes for M dimension
142-
- `--BK "96,96,96"` — block sizes for K dimension
143-
- `--bm "32,32,32"` — micro-block sizes
115+
### Architecture Mismatch
116+
```
117+
I2_S GGUF: general.architecture = "bitnet-b1.58"
118+
TL2 GGUF: general.architecture = "bitnet"
119+
```
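The architecture string can be read directly from the GGUF header, which helps when comparing the two files. The loader likely uses this key to choose the model graph, so a mismatch may matter on its own. The sketch below is a minimal stdlib parser assuming GGUF version 2 or later in little-endian form. `read_gguf_metadata` is a hypothetical helper, and the demo uses a synthetic in-memory header, not a real model file.

```python
import io
import struct

# GGUF metadata value types -> struct formats (scalars only).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type: {vtype}")


def read_gguf_metadata(f):
    """Read all metadata key/value pairs from an open binary GGUF stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version = _read(f, "<I")
    _tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        meta[key] = _read_value(f, _read(f, "<I"))
    return meta


def _gguf_string(s):
    raw = s.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


# Synthetic header: version 3, zero tensors, one string key/value pair.
demo = (b"GGUF" + struct.pack("<IQQ", 3, 0, 1)
        + _gguf_string("general.architecture") + struct.pack("<I", _STRING)
        + _gguf_string("bitnet"))
arch = read_gguf_metadata(io.BytesIO(demo))["general.architecture"]
```

For a real file, pass `open(path, "rb")` instead of the `BytesIO` object and compare the value between the I2_S and TL2 files.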
144120

145121
---
146122

147-
## Expected Results
148-
149-
### RTX 4090 Pod ($0.20/hr)
123+
## Recommendations
150124

151-
| Kernel | Threads | Expected tok/s |
152-
|--------|---------|---------------|
153-
| I2_S (current) | 4 | 35 (measured) |
154-
| **TL2 (target)** | **4** | **~80** |
155-
| **TL2 (target)** | **6** | **~100** |
125+
### Option A: Use Official I2_S GGUF (Recommended)
126+
- Download: `microsoft/bitnet-b1.58-2B-4T-gguf`
127+
- Speed: 20.79 tok/s on RTX 4090 pod
128+
- Quality: Coherent output
129+
- **No conversion needed**
156130

157-
### B200 Pod (reference)
131+
### Option B: TL2 via Upstream Fix
132+
Wait for Microsoft to:
133+
1. Publish TL2 GGUF for b1.58-2B-4T
134+
2. Fix `convert-hf-to-gguf-bitnet.py` for pre-quantized models
135+
3. Document the weight packing format
158136

159-
| Kernel | Threads | Expected tok/s |
160-
|--------|---------|---------------|
161-
| I2_S (measured) | 16 | 52.67 |
162-
| **TL2 (projected)** | **16** | **~120** |
137+
### Option C: Debug TL2 Transform (High Effort)
138+
1. Compare I2_S tensor bytes with TL2 tensor bytes
139+
2. Reverse-engineer the correct unpacking order
140+
3. Validate against known working TL2 models (Llama3 variants)
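Step 2 could start by unpacking the same bytes under each candidate layout and comparing the results against a reference. The sketch below enumerates three layouts. The definitions given for "blocked", "interleaved" and "reversed" are this sketch's own interpretation of those names, not documented formats.

```python
# Hypothetical layout enumeration; the three definitions are assumptions.

def unpack(packed, layout, factor=4):
    """Unpack rows of byte values into ternary rows under a named layout."""
    m_packed, k = len(packed), len(packed[0])
    out = [[0] * k for _ in range(m_packed * factor)]
    for r in range(m_packed):
        for i in range(factor):
            if layout == "blocked":        # bit pair i -> row block i
                row, shift = i * m_packed + r, 2 * i
            elif layout == "interleaved":  # bit pair i -> adjacent rows
                row, shift = r * factor + i, 2 * i
            elif layout == "reversed":     # blocked, highest bit pair first
                row, shift = i * m_packed + r, 2 * (factor - 1 - i)
            else:
                raise ValueError(f"unknown layout: {layout}")
            out[row] = [((b >> shift) & 0x03) - 1 for b in packed[r]]
    return out


packed = [[0b10_01_00_01], [0b00_10_01_10]]
candidates = {name: unpack(packed, name)
              for name in ("blocked", "interleaved", "reversed")}
```

All three candidates contain the same multiset of values in different positions. Telling them apart therefore needs a position-sensitive reference, such as a matrix-vector product against a known-good output.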
163141

164142
---
165143

166-
## Comparison: All Benchmarks
144+
## Benchmark Summary
167145

168-
| Platform | CPU | Kernel | Threads | tok/s | Cost/hr |
169-
|----------|-----|--------|---------|-------|---------|
170-
| RTX 4090 pod | AMD EPYC 75F3 | I2_S | 4 | 35 | $0.20 |
171-
| B200 pod | Intel Xeon 8568Y+ | I2_S | 16 | 52.67 | $4.24 |
172-
| RTX 4090 pod | AMD EPYC 75F3 | TL2 | 4 | TBD | $0.20 |
173-
| RTX 4090 pod | AMD EPYC 75F3 | TL2 | 6 | TBD | $0.20 |
146+
| Test | Platform | Kernel | Threads | tok/s | Coherent |
147+
|------|----------|--------|---------|-------|----------|
148+
| B200 (prev) | Blackwell | I2_S | 16 | 52.67 ||
149+
| RTX 4090 | EPYC 7282 | I2_S | 16 | 20.79 | ✅ |
150+
| RTX 4090 | EPYC 7282 | TL2 | 4 | 19.93 | ❌ |
174151

175-
---
176-
177-
## Deployment
152+
**Note:** The RTX 4090 pod has an AMD EPYC 7282 (Rome) CPU with AVX2 but no AVX-512, which may limit TL2 performance gains.
178153

179-
```bash
180-
# 1. Launch RTX 4090 pod on RunPod ($0.20/hr Community Cloud)
181-
# 2. SSH into pod
182-
ssh root@<IP> -p <PORT> -i ~/.ssh/id_rsa
183-
184-
# 3. Run TL2 script
185-
cd /root
186-
git clone https://github.com/gHashTag/trinity.git
187-
bash trinity/scripts/runpod_tl2_bitnet.sh
154+
---
188155

189-
# 4. Copy results
190-
scp -P <PORT> root@<IP>:/root/bitnet_tl2_results.txt docs/
191-
scp -P <PORT> root@<IP>:/root/bitnet_tl2_metrics.json docs/
156+
## Files Modified on Pod
192157

193-
# 5. STOP POD immediately
158+
```
159+
/root/BitNet/
160+
├── setup_env.py (patched x2)
161+
├── utils/
162+
│ ├── codegen_tl2.py (patched)
163+
│ └── convert-hf-to-gguf-bitnet.py (patched x3)
164+
├── include/
165+
│ ├── kernel_config.ini (generated)
166+
│ └── bitnet-lut-kernels.h (generated)
167+
├── models/
168+
│ ├── BitNet-b1.58-2B-4T/ (HF download)
169+
│ ├── BitNet-b1.58-2B-4T-unpacked/ (8.4 GB float32)
170+
│ └── bitnet-gguf/ggml-model-i2_s.gguf (official)
171+
└── build/bin/llama-cli (compiled with TL2=ON)
194172
```
195173

196174
---
197175

198-
## Risk Assessment
199-
200-
| Risk | Likelihood | Mitigation |
201-
|------|-----------|------------|
202-
| TL2 conversion still fails (unknown bug) | Medium | Fall back to manual conversion with `convert-ms-to-gguf-bitnet.py` |
203-
| TL2 slower than expected | Low | I2_S benchmark already establishes baseline |
204-
| Patches break I2_S path | None | Patches only affect TL2 code path |
205-
| codegen_tl2.py fails | Low | Parameters verified from setup_env.py source |
176+
## Conclusion
206177

207-
---
178+
**TL2 conversion from pre-quantized HuggingFace weights is not viable without additional reverse-engineering.** The official I2_S GGUF provides reliable inference at 20.79 tok/s.
208179

209-
## Status
180+
The 2.32x TL2 speedup would require:
181+
1. Microsoft publishing official TL2 GGUF, or
182+
2. Converting from float16 weights (not the pre-quantized release), or
183+
3. Understanding the exact packing format for proper unpacking
210184

211-
- [x] Research TL2 conversion mechanism
212-
- [x] Identify three critical patches
213-
- [x] Create patched build script (`scripts/runpod_tl2_bitnet.sh`)
214-
- [x] Create preliminary report
215-
- [ ] Deploy RTX 4090 pod
216-
- [ ] Run TL2 benchmark
217-
- [ ] Update report with real metrics
185+
**Recommendation:** Use official I2_S GGUF for production. TL2 effort is blocked pending upstream support.
218186

219187
---
220188

221-
**KOSCHEI IS IMMORTAL | TL2 = 2.32x SPEEDUP | THREE PATCHES TO 100+ tok/s | phi^2 + 1/phi^2 = 3**
189+
**KOSCHEI IS IMMORTAL | I2_S = 20.79 tok/s | TL2 BLOCKED | φ² + 1/φ² = 3**
