Commit a9102cd

docs(simd): per-CPU dispatch matrix + Codex review fixes
Follow-up to merged PR #180. Two changes:

1. NEW: .claude/knowledge/td-simd-cpu-dispatch-matrix.md

   Per-CPU feature table, every cell sourced from official spec (Intel ARK, AMD data sheets, WikiChip, Wikipedia AVX-512 article cross-referenced against primary sources).

   Rows: SkylakeX, CascadeLake, CooperLake, IceLakeSp, TigerLakeU, SapphireRapids, EmeraldRapids, GraniteRapids, Zen4, Zen5, ArrowLake, HaswellAvx2 (x86_64) plus Cortex-A53, A72, A76+ (aarch64).

   Columns: F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VNNI, BF16, FP16, VPOPCNTDQ, BITALG, GFNI, VAES, VPCLMULQDQ, VP2INTERSECT, AMX-TILE, AMX-INT8, AMX-BF16, AMX-FP16, AVX-VNNI, AVX-VNNI-INT8, AVX-IFMA.

   Status legend: DOC (from official spec, dispatch safe but not yet hardware-verified by this project) vs TEST (verified on real silicon). Every cell is currently DOC; promotion to TEST happens as we acquire hardware for each profile.

   Includes:
   - Per-CPU microarchitecture notes with citations
   - CPUID leaf/register/bit table for hand-coded detection
   - OS XSAVE state requirements (XCR0 bits, arch_prctl for AMX)
   - SimdProfile::detect() pseudocode with the GraniteRapids-before-SapphireRapids ordering invariant
   - Out-of-scope CPUs listed (Knights Mill, Cannon Lake, Alder Lake firmware-disabled AVX-512, Sierra Forest, etc.)

   Critical detection invariants:
   - GraniteRapids checked BEFORE SapphireRapids (GNR has SPR bits + AMX-FP16; if SPR first, AMX-FP16 stays unused).
   - Zen4 vs SPR distinguished by amx_tile present/absent.
   - CooperLake vs IceLakeSp: mutually exclusive bit patterns (CPL has BF16 no VBMI; ICX has VBMI no BF16).
   - TigerLakeU vs IceLakeSp: discriminated by VP2INTERSECT.

2. Fixes to td-simd-integration-plan.md per Codex bot review on PR #180:
   - AMX MAC count comments corrected: TDPBF16PS = 8192 mul-adds per instruction (16×16 output × K=32), TDPBUSD = 16384 mul-adds per instruction (16×16 output × K=64). Previous "256 mul-adds/instr" understated by 32× and 64× respectively and would have skewed Phase 1 prioritization. Numbers now align with src/hpc/amx_matmul.rs:15 and bf16_tile_gemm.rs:155-157.
   - A72 vs A53 detection: replaced the unreliable `neon && aes` heuristic (both A53 with crypto and A72 have identical HWCAP flags) with explicit `Armv8Neon` fallback and a doc comment stating that /proc/cpuinfo `CPU part` reading is required to split them. SimdProfile enum, dispatch table, and quick-reference tables collapsed to single Armv8Neon variant. Future improvement: split into A72Fast/A53Baseline when /proc/cpuinfo lookup is wired.

No code changes. Documentation only.
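
Illustration only, not part of this commit: the four detection invariants listed above can be read as an ordered chain of checks. The sketch below is a minimal Rust rendering under stated assumptions. `Features`, `Profile`, and `classify` are hypothetical names invented here and are not the project's `SimdProfile::detect()`; only the profiles named in the invariants are covered, so other rows of the matrix (EmeraldRapids, Zen5, SkylakeX, and so on) are deliberately not told apart and would need their own checks.

```rust
/// Hypothetical feature snapshot. Field names are illustrative only and
/// do not mirror the project's real detection code.
#[derive(Clone, Copy, Debug, Default)]
pub struct Features {
    pub avx512f: bool,
    pub avx512_vbmi: bool,
    pub avx512_bf16: bool,
    pub avx512_vp2intersect: bool,
    pub amx_tile: bool,
    pub amx_fp16: bool,
}

/// Hypothetical subset of profiles: only those named in the invariants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Profile {
    GraniteRapids,
    SapphireRapids,
    Zen4,
    CooperLake,
    TigerLakeU,
    IceLakeSp,
    Other,
}

/// Classify a feature snapshot, most specific profile first.
pub fn classify(f: Features) -> Profile {
    if !f.avx512f {
        return Profile::Other;
    }
    // Invariant 1: GraniteRapids before SapphireRapids. GNR carries every
    // SPR bit plus AMX-FP16, so testing SPR first would shadow it.
    if f.amx_tile && f.amx_fp16 {
        return Profile::GraniteRapids;
    }
    if f.amx_tile {
        return Profile::SapphireRapids;
    }
    // Invariant 2: no AMX tile state from here on. BF16 together with
    // VBMI is treated as Zen4 in this simplified sketch.
    if f.avx512_bf16 && f.avx512_vbmi {
        return Profile::Zen4;
    }
    // Invariant 3: CooperLake has BF16 but no VBMI; IceLakeSp has VBMI
    // but no BF16, so the two patterns cannot both match.
    if f.avx512_bf16 {
        return Profile::CooperLake;
    }
    if f.avx512_vbmi {
        // Invariant 4: VP2INTERSECT splits TigerLakeU from IceLakeSp.
        return if f.avx512_vp2intersect {
            Profile::TigerLakeU
        } else {
            Profile::IceLakeSp
        };
    }
    Profile::Other
}
```

Keeping the classifier a pure function over a plain struct means the ordering can be unit-tested without the matching hardware, which fits the DOC vs TEST distinction: the table logic is checkable now, and real-silicon verification can follow later.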
1 parent 18451bc commit a9102cd

2 files changed

Lines changed: 368 additions & 14 deletions
