Commit a9102cd
docs(simd): per-CPU dispatch matrix + Codex review fixes
Follow-up to merged PR #180. Two changes:
1. NEW: .claude/knowledge/td-simd-cpu-dispatch-matrix.md
Per-CPU feature table, every cell sourced from official spec
(Intel ARK, AMD data sheets, WikiChip, Wikipedia AVX-512
article cross-referenced against primary sources).
Rows: SkylakeX, CascadeLake, CooperLake, IceLakeSp,
TigerLakeU, SapphireRapids, EmeraldRapids, GraniteRapids,
Zen4, Zen5, ArrowLake, HaswellAvx2 (x86_64) plus
Cortex-A53, A72, A76+ (aarch64).
Columns: F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VNNI, BF16,
FP16, VPOPCNTDQ, BITALG, GFNI, VAES, VPCLMULQDQ,
VP2INTERSECT, AMX-TILE, AMX-INT8, AMX-BF16, AMX-FP16,
AVX-VNNI, AVX-VNNI-INT8, AVX-IFMA.
Status legend: DOC (from official spec, dispatch safe but
not yet hardware-verified by this project) vs TEST (verified
on real silicon). Every cell is currently DOC — promotion to
TEST happens as we acquire hardware for each profile.
Includes:
- Per-CPU microarchitecture notes with citations
- CPUID leaf/register/bit table for hand-coded detection
- OS XSAVE state requirements (XCR0 bits, arch_prctl for AMX)
- SimdProfile::detect() pseudocode with the GraniteRapids-
before-SapphireRapids ordering invariant
- Out-of-scope CPUs listed (Knights Mill, Cannon Lake,
Alder Lake firmware-disabled AVX-512, Sierra Forest, etc.)
Critical detection invariants:
- GraniteRapids checked BEFORE SapphireRapids (GNR has SPR
bits + AMX-FP16; if SPR first, AMX-FP16 stays unused).
- Zen4 vs SPR distinguished by amx_tile present/absent.
- CooperLake vs IceLakeSp: mutually exclusive bit patterns
(CPL has BF16 no VBMI; ICX has VBMI no BF16).
- TigerLakeU vs IceLakeSp: discriminated by VP2INTERSECT.
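The ordering invariants above can be sketched as a pure classification function. This is a minimal illustration, not the project's actual `SimdProfile::detect()`: the `Features` struct, its field names, the `Scalar` variant, and the `classify` function are hypothetical, and only a subset of the matrix rows is covered (CascadeLake, EmeraldRapids, Zen5, and ArrowLake are omitted). The Zen4 test (BF16 and VBMI both present, no AMX) is an assumption derived from the invariants listed here.

```rust
/// Hypothetical feature flags. In real code these would be filled from CPUID
/// and gated on the OS XSAVE state checks.
#[derive(Clone, Copy, Debug, Default)]
pub struct Features {
    pub avx2: bool,
    pub avx512f: bool,
    pub avx512_vbmi: bool,
    pub avx512_bf16: bool,
    pub avx512_vp2intersect: bool,
    pub amx_tile: bool,
    pub amx_fp16: bool,
}

/// Reduced subset of the profiles named in the matrix (`Scalar` is hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdProfile {
    GraniteRapids,
    SapphireRapids,
    Zen4,
    CooperLake,
    TigerLakeU,
    IceLakeSp,
    SkylakeX,
    HaswellAvx2,
    Scalar,
}

/// Classify from most specific to least specific profile.
pub fn classify(f: Features) -> SimdProfile {
    use SimdProfile::*;
    if !f.avx512f {
        return if f.avx2 { HaswellAvx2 } else { Scalar };
    }
    // GraniteRapids must be tested before SapphireRapids: it has every SPR
    // bit plus AMX-FP16, so an SPR-first check would shadow it.
    if f.amx_tile && f.amx_fp16 {
        GraniteRapids
    } else if f.amx_tile {
        SapphireRapids
    } else if f.avx512_bf16 && f.avx512_vbmi {
        // No AMX tile: this is what separates Zen4 from SapphireRapids.
        Zen4
    } else if f.avx512_bf16 {
        // BF16 without VBMI.
        CooperLake
    } else if f.avx512_vbmi && f.avx512_vp2intersect {
        // VP2INTERSECT separates TigerLakeU from IceLakeSp.
        TigerLakeU
    } else if f.avx512_vbmi {
        // VBMI without BF16.
        IceLakeSp
    } else {
        SkylakeX
    }
}
```

Swapping the first two AVX-512 branches would classify a GraniteRapids part as SapphireRapids, which is exactly the failure the invariant guards against.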
2. Fixes to td-simd-integration-plan.md per Codex bot review on
PR #180:
- AMX MAC count comments corrected: TDPBF16PS = 8192 mul-adds
per instruction (16×16 output × K=32), TDPBUSD = 16384
mul-adds per instruction (16×16 output × K=64). Previous
"256 mul-adds/instr" understated by 32× and 64× respectively
and would have skewed Phase 1 prioritization. Numbers now
align with src/hpc/amx_matmul.rs:15 and
bf16_tile_gemm.rs:155-157.
- A72 vs A53 detection: replaced the unreliable
`neon && aes` heuristic (both A53 with crypto and A72 have
identical HWCAP flags) with explicit `Armv8Neon` fallback
and a doc comment stating that /proc/cpuinfo `CPU part`
reading is required to split them. SimdProfile enum,
dispatch table, and quick-reference tables collapsed to
single Armv8Neon variant. Future improvement: split into
A72Fast/A53Baseline when /proc/cpuinfo lookup is wired.
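Both fixes can be checked with a short sketch. The first part reproduces the corrected MAC arithmetic; the second shows one way the future `CPU part` split might look. All function, constant, and enum names below are hypothetical, the part numbers 0xd03 (Cortex-A53) and 0xd08 (Cortex-A72) are the standard Arm MIDR part IDs, and the choice to fall back to `Armv8Neon` on mixed or unknown cores is an assumption, not the project's stated policy.

```rust
/// Multiply-adds for one AMX tile dot-product: rows x cols outputs, each
/// accumulating k products.
pub const fn tile_macs(rows: u32, cols: u32, k: u32) -> u32 {
    rows * cols * k
}

/// TDPBF16PS: 16x16 output tile, K = 32 BF16 elements per row.
pub const TDPBF16PS_MACS: u32 = tile_macs(16, 16, 32);
/// TDPBUSD: 16x16 output tile, K = 64 INT8 elements per row.
pub const TDPBUSD_MACS: u32 = tile_macs(16, 16, 64);

/// Hypothetical split of the single `Armv8Neon` variant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ArmProfile {
    A53Baseline,
    A72Fast,
    Armv8Neon,
}

/// Collect every `CPU part` value from /proc/cpuinfo-style text.
pub fn cpu_parts(cpuinfo: &str) -> Vec<u32> {
    cpuinfo
        .lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(':')?;
            if key.trim() != "CPU part" {
                return None;
            }
            let hex = value.trim().trim_start_matches("0x");
            u32::from_str_radix(hex, 16).ok()
        })
        .collect()
}

/// Pick a profile only when all cores agree; otherwise keep the safe fallback.
pub fn split_arm(cpuinfo: &str) -> ArmProfile {
    let parts = cpu_parts(cpuinfo);
    if parts.is_empty() {
        ArmProfile::Armv8Neon
    } else if parts.iter().all(|&p| p == 0xd08) {
        ArmProfile::A72Fast
    } else if parts.iter().all(|&p| p == 0xd03) {
        ArmProfile::A53Baseline
    } else {
        // Mixed clusters or other cores.
        ArmProfile::Armv8Neon
    }
}
```

Dividing the two constants by the old figure of 256 gives the 32× and 64× factors quoted above. On Linux, `split_arm` would be fed the contents of `/proc/cpuinfo`, for example via `std::fs::read_to_string`.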
No code changes. Documentation only.
1 parent 18451bc commit a9102cd
2 files changed
Lines changed: 368 additions & 14 deletions