Commit e377c0f

README.md: comprehensive speed comparison + cosine emulation + stable Rust tricks
Rewritten in rustynum README style for a senior Rust developer audience.

Performance data:
- GEMM: 139 GFLOPS (10.5× over upstream, matches NumPy OpenBLAS)
- Codebook: 380K tok/s (AMX) → 500 tok/s (Pi 4 NEON) per-tier breakdown
- SPO palette: 611M lookups/s, 1.8ns latency, 388KB working set
- f16 transcoding: 94M params/s, 7.3e-6 max error on 15M param model
- Cosine emulation: 611M/s via 256-step palette (0.4% error at 1/40σ)

Architecture sections:
- SIMD polyfill layer (F32x16 etc. on stable, LazyLock dispatch)
- Backend layer (Goto GEMM, MKL/OpenBLAS feature-gated)
- HPC module library (55 modules, 880 tests)
- Codec layer (Fingerprint, Base17, CAM-PQ, palette semiring)
- Burn integration (SIMD-augmented tensor ops)

7 "What We Build That Nobody Else Does":
1. Complete std::simd polyfill on stable
2. f16 types without nightly (u16 carrier + F16C/FCVTL)
3. AMX on stable via asm!(".byte") encoding
4. Tiered ARM NEON (A53/A72/A76 with microarch awareness)
5. Frozen dispatch (0.3ns function pointer, no branch)
6. BF16 RNE bit-exact with hardware VCVTNEPS2BF16
7. Cognitive codec stack (Fingerprint→Base17→CAM-PQ→Palette→bgz7)

Cosine emulation section explaining palette distance tables:
- 256×256 u8 table = 64KB (fits L1 cache)
- Foveal (1/40σ): 0.4% error, 611M/s
- Good (1/4σ): 2% error, 611M/s
- Near (1σ): 8% error, 2.4B/s (64-step)
- 12× faster than SIMD f32 dot product (no FP division/multiply)

https://claude.ai/code/session_017ZN5PNEf8boFBgorUZVrFU
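
For readers unfamiliar with the "BF16 RNE" item in the message above, the following is a minimal sketch of a scalar f32 to bf16 conversion with round-to-nearest-even, storing the result in a `u16` carrier. The function names `f32_to_bf16_rne` and `bf16_to_f32` are hypothetical and are not taken from this commit; the sketch shows only the generic rounding trick and does not attempt to reproduce the hardware instruction's handling of denormals or NaN payloads.

```rust
/// Hypothetical helper: convert an f32 to bf16 bits (u16 carrier)
/// using round-to-nearest, ties-to-even.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the sign and exponent, force a quiet-NaN mantissa bit so
        // truncation cannot turn a NaN into an infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Lowest bit that survives truncation; adding it makes exact ties
    // round toward the even result.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}

/// Hypothetical helper: widen bf16 bits back to f32 (exact, no rounding).
pub fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}
```

Widening is a plain 16-bit shift because bf16 shares the f32 sign and exponent layout, so only the narrowing direction needs a rounding rule. Values that round past the largest finite bf16 overflow to infinity, which is the expected round-to-nearest behaviour.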
1 parent 81f644a commit e377c0f

1 file changed

Lines changed: 174 additions & 127 deletions
