Skip to content

Commit d0a6fe8

Browse files
gHashTagona-agent
andcommitted
feat(ternary): add ternary_matmul.vibee spec and update docs
- Add specs/tri/ternary_matmul.vibee specification - Generate trinity/output/ternary_matmul.zig from spec - Update DISCOVERIES.md with OPT-T02 implementation details - Document SIMD sign lookup table and memory layout Co-authored-by: Ona <no-reply@ona.com>
1 parent 0d169f5 commit d0a6fe8

3 files changed

Lines changed: 450 additions & 1 deletion

File tree

docs/DISCOVERIES.md

Lines changed: 53 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,7 @@ Where:
7373
| ID | Optimization | Compression | Speedup | Status |
7474
|----|--------------|-------------|---------|--------|
7575
| OPT-T01 | Ternary Weight Quantization | 20x | 10x | ✅ Implemented |
76-
| OPT-T02 | Ternary Matrix Multiplication | N/A | 10x | 🔄 In Progress |
76+
| OPT-T02 | Ternary Matrix Multiplication | N/A | 10x | ✅ Implemented |
7777
| OPT-T03 | Ternary KV Cache | 20x | 5x | 📋 Planned |
7878
| OPT-T04 | Ternary Attention | 20x | 5-10x | 📋 Planned |
7979
| OPT-T05 | Ternary Embeddings | 20x | 2x | 📋 Planned |
@@ -216,6 +216,58 @@ Where:
216216

217217
---
218218

219+
## Ternary Matrix Multiplication (OPT-T02)
220+
221+
**Status**: ✅ Implemented
222+
223+
### Implementation Details
224+
225+
| Component | File | Description |
226+
|-----------|------|-------------|
227+
| TritWeight | `ternary_weights.zig` | 2-bit encoding: 00=0, 01=+1, 10=-1 |
228+
| TritPack4 | `ternary_weights.zig` | 4 trits packed per byte |
229+
| simdTernaryMatVec | `ternary_weights.zig` | AVX2 (8-wide) vectorized |
230+
| simd16TernaryMatVec | `ternary_weights.zig` | AVX-512 (16-wide) vectorized |
231+
| batchTernaryMatVec | `ternary_weights.zig` | 4 rows parallel processing |
232+
| parallelTernaryMatmul | `parallel_inference.zig` | Multi-threaded wrapper |
233+
234+
### SIMD Sign Lookup Table
235+
236+
```zig
237+
const sign_lut = [4]f32{ 0.0, 1.0, -1.0, 0.0 };
238+
// 00 → 0.0 (zero weight)
239+
// 01 → 1.0 (positive weight)
240+
// 10 → -1.0 (negative weight)
241+
// 11 → 0.0 (reserved)
242+
```
243+
244+
### Memory Layout
245+
246+
```
247+
TritPack4 byte: [t3][t2][t1][t0]
248+
^ ^ ^ ^
249+
| | | +-- bits 0-1: trit 0
250+
| | +------ bits 2-3: trit 1
251+
| +---------- bits 4-5: trit 2
252+
+-------------- bits 6-7: trit 3
253+
```
254+
255+
### Benchmark Results
256+
257+
| Operation | Time | Notes |
258+
|-----------|------|-------|
259+
| Ternary NOT | 0 ns/op | Instant |
260+
| Ternary AND | 0 ns/op | Instant |
261+
| SIMD Tryte batch | 3 ns/op | 32 elements |
262+
263+
### Integration
264+
265+
- `tri_inference.zig`: Uses `parallelTernaryMatmul` for all weight operations
266+
- `parallel_inference.zig`: Auto-selects SIMD16 for small matrices, multi-threaded for large
267+
- Threshold: <64 rows → single-threaded SIMD, ≥64 rows → 8-thread parallel
268+
269+
---
270+
219271
## SIMD Optimization (OPT-001)
220272

221273
**Status**: ✅ Implemented

specs/tri/ternary_matmul.vibee

Lines changed: 128 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,128 @@
1+
# Ternary Matrix Multiplication Specification
2+
# BitNet-style {-1, 0, +1} weights for 20x memory reduction
3+
# φ² + 1/φ² = 3 | KOSCHEI IS IMMORTAL
4+
5+
name: ternary_matmul
6+
version: "1.0.0"
7+
language: zig
8+
module: ternary_matmul
9+
10+
description: |
11+
Ternary matrix-vector multiplication for neural network inference.
12+
Uses 2-bit encoding: 00=0, 01=+1, 10=-1, 11=reserved.
13+
4 trits packed per byte (TritPack4).
14+
SIMD-optimized using AVX2/AVX-512 vectors.
15+
16+
types:
17+
TritWeight:
18+
description: "Single ternary weight {-1, 0, +1}"
19+
fields:
20+
value: Int
21+
encoding:
22+
ZERO: 0b00
23+
PLUS_ONE: 0b01
24+
MINUS_ONE: 0b10
25+
RESERVED: 0b11
26+
27+
TritPack4:
28+
description: "4 ternary weights packed in 1 byte"
29+
fields:
30+
packed: Int
31+
width: 8
32+
33+
TernaryMatrix:
34+
description: "Packed ternary weight matrix"
35+
fields:
36+
data: List<Int>
37+
rows: Int
38+
cols: Int
39+
cols_packed: Int
40+
41+
MemoryStats:
42+
description: "Memory usage statistics"
43+
fields:
44+
float32_bytes: Int
45+
ternary_bytes: Int
46+
compression_ratio: Float
47+
48+
behaviors:
49+
- name: trit_to_float
50+
given: TritWeight with 2-bit encoding
51+
when: Converting to float for computation
52+
then: Returns -1.0, 0.0, or +1.0
53+
54+
- name: float_to_trit
55+
given: Float value
56+
when: Quantizing to ternary
57+
then: Returns nearest trit (threshold at 0.5)
58+
59+
- name: pack_trits
60+
given: 4 TritWeight values
61+
when: Packing for storage
62+
then: Returns single byte with 4 trits
63+
64+
- name: unpack_trits
65+
given: Packed byte
66+
when: Extracting for computation
67+
then: Returns 4 TritWeight values
68+
69+
- name: ternary_matvec
70+
given: Packed weight matrix and input vector
71+
when: Computing matrix-vector product
72+
then: Output vector with dot products (no multiplications, only add/sub)
73+
74+
- name: simd_ternary_matvec
75+
given: Packed weights, input vector, SIMD width 8
76+
when: Computing with AVX2 vectors
77+
then: 8x speedup via vectorized sign lookup
78+
79+
- name: simd_ternary_matvec_16
80+
given: Packed weights, input vector, SIMD width 16
81+
when: Computing with AVX-512 vectors
82+
then: 16x speedup via wider vectors
83+
84+
- name: batch_ternary_matvec
85+
given: Packed weights, input vector, batch of 4 rows
86+
when: Processing multiple output rows
87+
then: 4 rows computed in parallel
88+
89+
- name: compute_memory_stats
90+
given: Matrix dimensions (rows, cols)
91+
when: Analyzing memory savings
92+
then: Returns compression ratio (~20x vs float32)
93+
94+
optimizations:
95+
- name: sign_lookup_table
96+
description: "LUT for trit→sign: [0.0, 1.0, -1.0, 0.0]"
97+
98+
- name: no_multiplication
99+
description: "y += sign * x becomes y += x or y -= x based on sign"
100+
101+
- name: cache_friendly
102+
description: "Row-major layout, sequential memory access"
103+
104+
- name: simd_reduction
105+
description: "@reduce(.Add, vec) for horizontal sum"
106+
107+
benchmarks:
108+
- name: throughput
109+
metric: "GFLOPS equivalent"
110+
target: ">100 GFLOPS on AVX2"
111+
112+
- name: memory_bandwidth
113+
metric: "GB/s"
114+
target: "Near memory bandwidth limit"
115+
116+
- name: latency
117+
metric: "ns per row"
118+
target: "<100ns for 4096-dim row"
119+
120+
integration:
121+
- target: bytecode_vm
122+
description: "OP_TERNARY_MATVEC opcode"
123+
124+
- target: model_loader
125+
description: "Load .tri model files"
126+
127+
- target: inference_pipeline
128+
description: "Replace float matmul in forward pass"

0 commit comments

Comments
 (0)