Commit e377c0f
README.md: comprehensive speed comparison + cosine emulation + stable Rust tricks
Rewritten in rustynum README style for a senior Rust developer audience.
Performance data:
- GEMM: 139 GFLOPS (10.5× over upstream, matches NumPy OpenBLAS)
- Codebook: per-tier breakdown, from 380K tok/s (AMX) down to 500 tok/s (Pi 4 NEON)
- SPO palette: 611M lookups/s, 1.8ns latency, 388KB working set
- f16 transcoding: 94M params/s, 7.3e-6 max error on 15M param model
- Cosine emulation: 611M/s via 256-step palette (0.4% error at 1/40σ)
Architecture sections:
- SIMD polyfill layer (F32x16 etc. on stable, LazyLock dispatch)
- Backend layer (Goto GEMM, MKL/OpenBLAS feature-gated)
- HPC module library (55 modules, 880 tests)
- Codec layer (Fingerprint, Base17, CAM-PQ, palette semiring)
- Burn integration (SIMD-augmented tensor ops)
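
The "LazyLock dispatch" mentioned for the SIMD polyfill layer can be sketched roughly as below. This is a minimal illustration on stable Rust, not the crate's code: `DotFn`, `dot_scalar`, `dot_avx2`, `DOT`, and `dot` are hypothetical names, and the AVX2 body reuses the scalar loop where a real backend would presumably use explicit intrinsics. Note that a `LazyLock` deref still checks its initialization state, so this sketch makes no claim about matching the frozen-dispatch latency quoted in this message.

```rust
use std::sync::LazyLock;

/// Signature shared by every backend of the hypothetical `dot` kernel.
type DotFn = fn(&[f32], &[f32]) -> f32;

/// Portable fallback, valid on every target.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Same loop compiled with AVX2 enabled. A real backend would use explicit
/// intrinsics here. Calling this on a CPU without AVX2 is undefined behavior.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2_impl(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Safe-looking wrapper so the backend fits `DotFn`. It must only be
/// reached through `DOT`, which checks for AVX2 before selecting it.
#[cfg(target_arch = "x86_64")]
fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    unsafe { dot_avx2_impl(a, b) }
}

/// Backend chosen once, on first use, from runtime CPU feature detection.
static DOT: LazyLock<DotFn> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return dot_avx2 as DotFn;
        }
    }
    dot_scalar as DotFn
});

/// Public entry point: one indirect call through the selected pointer.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    (*DOT)(a, b)
}
```

Callers only ever see `dot`; the choice of backend is made once per process, and every later call reuses the stored function pointer.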
Seven "What We Build That Nobody Else Does" items:
1. Complete std::simd polyfill on stable
2. f16 types without nightly (u16 carrier + F16C/FCVTL)
3. AMX on stable via asm!(".byte") encoding
4. Tiered ARM NEON (A53/A72/A76 with microarch awareness)
5. Frozen dispatch (0.3ns function pointer, no branch)
6. BF16 RNE bit-exact with hardware VCVTNEPS2BF16
7. Cognitive codec stack (Fingerprint→Base17→CAM-PQ→Palette→bgz7)
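
Items 2 and 6 above both rest on carrying 16-bit floats in a plain `u16`. The sketch below shows one way that can look in portable Rust: an f32 to BF16 conversion with round-to-nearest-even, and a software f16 to f32 decode. The function names are hypothetical and this is not taken from the crate. In particular, the BF16 routine does not try to reproduce the denormal or NaN-payload behavior of VCVTNEPS2BF16, so it should not be read as the bit-exact version the list refers to.

```rust
/// f32 -> BF16 bits with round-to-nearest-even (ties go to the even mantissa).
/// Assumption: NaNs are simply quieted; hardware payload rules are not modeled.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep sign and top mantissa bits, force the quiet bit.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit: exact ties round toward even.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}

/// BF16 bits -> f32. Exact, since BF16 is the top half of an f32.
pub fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// IEEE 754 binary16 bits -> f32, in software. Exact for every input.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1F) as u32;
    let man = (h & 0x03FF) as u32;
    let bits = match (exp, man) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal: value is man * 2^-24, which f32 represents exactly.
        (0, m) => {
            let v = (m as f32) * (1.0 / 16_777_216.0);
            return f32::from_bits(sign | v.to_bits());
        }
        // Infinity.
        (0x1F, 0) => sign | 0x7F80_0000,
        // NaN: keep the payload, force the quiet bit.
        (0x1F, m) => sign | 0x7FC0_0000 | (m << 13),
        // Normal: rebias the exponent from 15 to 127 and widen the mantissa.
        (e, m) => sign | ((e + 112) << 23) | (m << 13),
    };
    f32::from_bits(bits)
}
```

On hardware with F16C or FCVTL the decode would normally be a single instruction; the software path is the kind of fallback a `u16` carrier type makes possible without nightly features.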
Cosine emulation section explaining palette distance tables:
- 256×256 u8 table = 64KB (fits L1 cache)
- Foveal (1/40σ): 0.4% error, 611M/s
- Good (1/4σ): 2% error, 611M/s
- Near (1σ): 8% error, 2.4B/s (64-step)
- 12× faster than a SIMD f32 dot product (table lookup only, no FP multiply or divide)
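
The palette distance table described above can be pictured with the following sketch. It assumes each vector has already been reduced to a single palette index (a `u8`) and that the table stores the cosine between the two palette centroids, quantized linearly from [-1, 1] to 0..=255. Those choices, along with `PaletteTable`, `circle_centroids`, and the other names, are illustrative assumptions and not the README's actual layout; the sketch does not attempt to reproduce the error or throughput figures listed.

```rust
/// 256 x 256 table of quantized cosines between palette centroids (64 KiB).
pub struct PaletteTable {
    table: Vec<u8>, // row-major, index = a * 256 + b
}

/// Cosine similarity of two vectors. Assumes both have non-zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Map a cosine in [-1, 1] linearly onto 0..=255.
fn quantize(c: f32) -> u8 {
    ((c.clamp(-1.0, 1.0) + 1.0) * 0.5 * 255.0).round() as u8
}

/// Inverse of `quantize`, up to quantization error.
pub fn dequantize(q: u8) -> f32 {
    q as f32 / 255.0 * 2.0 - 1.0
}

impl PaletteTable {
    /// Precompute every pairwise cosine once, ahead of any lookups.
    pub fn build(centroids: &[Vec<f32>]) -> Self {
        assert_eq!(centroids.len(), 256, "palette must have 256 entries");
        let mut table = vec![0u8; 256 * 256];
        for i in 0..256 {
            for j in 0..256 {
                table[i * 256 + j] = quantize(cosine(&centroids[i], &centroids[j]));
            }
        }
        Self { table }
    }

    /// Emulated cosine: a single byte load, no floating-point arithmetic.
    #[inline]
    pub fn lookup(&self, a: u8, b: u8) -> u8 {
        self.table[((a as usize) << 8) | (b as usize)]
    }

    /// Size of the table in bytes.
    pub fn size_bytes(&self) -> usize {
        self.table.len()
    }
}

/// Hypothetical demo palette: 256 unit vectors evenly spaced on a circle.
pub fn circle_centroids() -> Vec<Vec<f32>> {
    (0..256)
        .map(|k| {
            let theta = (k as f32) * std::f32::consts::TAU / 256.0;
            vec![theta.cos(), theta.sin()]
        })
        .collect()
}
```

With the demo palette, `PaletteTable::build(&circle_centroids())` gives a table where identical indices look up the maximum code and opposite indices the minimum; `dequantize` turns a code back into an approximate cosine when a float is needed.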
https://claude.ai/code/session_017ZN5PNEf8boFBgorUZVrFU1
Parent: 81f644a, commit: e377c0f
1 file changed
Lines changed: 174 additions & 127 deletions