docs(simd): align with existing ArmProfile::arm_profile() heuristic

claude · claude · commit 47d1ba28b9dc · 2026-05-20T19:53:20.000Z
Self-correction. The previous commit collapsed A72Fast + A53Baseline
into a new `Armv8Neon` variant, claiming the two could not be
distinguished by HWCAP. That duplicated a decision the codebase
already makes.

`src/hpc/simd_caps.rs:317-336` has had `ArmProfile::arm_profile()`
in tree since the SBC support landed. Its decision tree:

  asimd_dotprod present    → A76DotProd  (Pi 5 / A76+)
  aes present (no dotprod) → A72Fast     (Pi 4 / Pi 3 / Pi Zero 2W)
  no aes                   → A53Baseline (QEMU / minimal aarch64)

The comment at line 329 states outright that the A72Fast branch
catches both A72 silicon (Pi 4) and A53-with-crypto silicon (Pi 3,
Pi Zero 2W): "we report A72-tier since most deployments target
Pi 4." Both are ARMv8.0+crypto with no dotprod, so their dispatch
tables are identical at the ISA level and the shared branch is
intentional.

The `A53Baseline` variant catches the rare case of NEON-without-
crypto (QEMU, minimal aarch64 builds), which my `Armv8Neon` collapse
lost.
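
For reviewers without the file open, the decision tree above can be
sketched roughly as follows. This is an illustration only, not the
in-tree code: the `Caps` struct and the free function `arm_profile`
are hypothetical stand-ins for the real capability struct and method
in `src/hpc/simd_caps.rs`, and the `NotArm` variant is left out for
brevity.

```rust
/// Hypothetical stand-in for the capability flags the real helper reads.
/// Only the two flags that drive the decision tree are modelled here.
#[derive(Clone, Copy, Debug)]
pub struct Caps {
    pub asimd_dotprod: bool,
    pub aes: bool,
}

/// Mirrors the three aarch64 tiers named in this commit.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ArmProfile {
    A76DotProd,
    A72Fast,
    A53Baseline,
}

/// Sketch of the heuristic: dotprod wins, then crypto, then baseline.
/// Order matters, because A76-class parts also report `aes`.
pub fn arm_profile(caps: Caps) -> ArmProfile {
    if caps.asimd_dotprod {
        ArmProfile::A76DotProd
    } else if caps.aes {
        // Catches A72 and A53-with-crypto alike; the two are not
        // separable from these flags.
        ArmProfile::A72Fast
    } else {
        ArmProfile::A53Baseline
    }
}
```

Checking dotprod first is what keeps A76-class parts out of the
A72Fast branch, since they also report aes.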

Changes:
- Reverted SimdProfile enum to A76DotProd / A72Fast / A53Baseline.
- detect() pseudocode now delegates to existing arm_profile() helper.
- GemmDispatch table restored to 3 aarch64 entries.
- Quick-reference tables list both A72Fast and A53Baseline rows with
  a note that they share the same kernel.
- Dispatch matrix split into 4 rows: A53+crypto (→A72Fast),
  A53-no-crypto (→A53Baseline), A72 (→A72Fast), A76+ (→A76DotProd).

This is more honest than the Armv8Neon collapse: it preserves the
existing in-tree pattern, names it correctly, and documents the
A72Fast-as-ARMv8.0+crypto-catch-all semantic that the codebase
already chose.
diff --git a/.claude/knowledge/td-simd-cpu-dispatch-matrix.md b/.claude/knowledge/td-simd-cpu-dispatch-matrix.md
@@ -169,12 +169,13 @@ ARK equivalents: Ryzen 9 9950X, EPYC 9755 (Turin), Threadripper 9980X.
 
 ## Master matrix — aarch64
 
-Rows ordered by SoC tier (Pi family naming as canonical). A53 and A72 are listed as separate documented silicon (they have distinct microarchitecture — single vs dual NEON pipeline), but **the runtime `SimdProfile` collapses both into one variant `Armv8Neon`** because HWCAP/CPUID alone cannot distinguish them. Splitting them requires reading `/proc/cpuinfo` `CPU part` field (0xd03 = A53, 0xd08 = A72) — deferred until benchmarks demand it.
+Rows ordered by SoC tier (Pi family naming as canonical). **The existing detection helper `ArmProfile::arm_profile()` at `src/hpc/simd_caps.rs:317-336` already implements this dispatch and is the canonical reference.** It admits in its own comments that A72 silicon and A53-with-crypto silicon cannot be distinguished by HWCAP alone, and pragmatically maps both to `A72Fast` since the dispatch tables would be identical at the ISA level (both are ARMv8.0+crypto with no dotprod). The `A53Baseline` variant catches the rare case of NEON-without-crypto (QEMU, minimal aarch64 builds).
 
 | CPU silicon | Runtime profile | NEON | dotprod | fp16 | bf16+ (BFMMLA/BFDOT) | i8mm (SMMLA/UMMLA) | crypto (aes+sha2) | crc32 | sve | sve2 |
 |---|---|---|---|---|---|---|---|---|---|---|
-| **Cortex-A53** (Pi Zero 2W, Pi 3) | `Armv8Neon` | DOC | ✗ | ✗ | ✗ | ✗ | DOC | DOC | ✗ | ✗ |
-| **Cortex-A72** (Pi 4, Orange Pi 4) | `Armv8Neon` | DOC | ✗ | ✗ | ✗ | ✗ | DOC | DOC | ✗ | ✗ |
+| **Cortex-A53 + crypto** (Pi 3, Pi Zero 2W) | `A72Fast` (heuristic) | DOC | ✗ | ✗ | ✗ | ✗ | DOC | DOC | ✗ | ✗ |
+| **Cortex-A53 no crypto** (QEMU, minimal) | `A53Baseline` | DOC | ✗ | ✗ | ✗ | ✗ | ✗ | DOC | ✗ | ✗ |
+| **Cortex-A72** (Pi 4, Orange Pi 4) | `A72Fast` | DOC | ✗ | ✗ | ✗ | ✗ | DOC | DOC | ✗ | ✗ |
 | **Cortex-A76+** (Pi 5, Orange Pi 5, Apple M1+) | `A76DotProd` | DOC | DOC | DOC | DOC | DOC | DOC | DOC | ✗ | ✗ |
 
 Apple M-series add SVE/SVE2 from M4 onwards; not yet in scope for this matrix.
diff --git a/.claude/knowledge/td-simd-integration-plan.md b/.claude/knowledge/td-simd-integration-plan.md
@@ -65,15 +65,20 @@ pub enum SimdProfile {
     /// ARMv8.2-A: A76 (Pi 5), Apple M-series, Snapdragon 8 Gen 2+.
     /// NEON + dotprod + fp16 + bf16+ (BFMMLA/BFDOT).
     A76DotProd,
-    /// ARMv8.0 fallback: A72 (Pi 4), A53 (Pi 3 / Pi Zero 2 W), any other
-    /// ARMv8.0 part. The two cannot be distinguished by HWCAP/CPUID alone
-    /// — they expose identical ISA flags (NEON + AES + SHA2 + CRC32, no
-    /// dotprod). The difference is microarchitectural (dual vs single
-    /// NEON pipeline width). Distinguishing requires reading
-    /// /proc/cpuinfo `CPU part` (0xd03 = A53, 0xd08 = A72). Future
-    /// improvement: split into A72Fast/A53Baseline once that lookup is
-    /// wired. For now, dispatch table is identical for both.
-    Armv8Neon,
+    /// ARMv8.0 with crypto extension: Pi 4 (A72), Pi 3 (A53-with-crypto),
+    /// Pi Zero 2 W (A53-with-crypto), Orange Pi 4. Cannot distinguish
+    /// A53-with-crypto from A72 by HWCAP — both expose neon + aes + sha2 +
+    /// crc32 with no dotprod. Dispatch table is identical at the ISA level
+    /// (same NEON instructions). Existing `ArmProfile::arm_profile()` in
+    /// `src/hpc/simd_caps.rs:317-336` calls this `A72Fast` and admits the
+    /// heuristic ("we report A72-tier since most deployments target Pi 4")
+    /// — adopt that naming for consistency.
+    A72Fast,
+    /// ARMv8.0 without crypto: rare in the wild (QEMU, minimal aarch64
+    /// builds without `+aes`). Existing `ArmProfile::A53Baseline` catches
+    /// this case; preserved for that purpose. Real A53 silicon (Pi 3, Pi
+    /// Zero 2 W) usually has crypto and resolves as `A72Fast` above.
+    A53Baseline,
 
     // ── Fallback ──
     /// Anything else: wasm32, riscv, x86 baseline, unknown aarch64.
@@ -122,21 +127,22 @@ impl SimdProfile {
         }
         #[cfg(target_arch = "aarch64")]
         {
-            let caps = simd_caps();
-            if caps.asimd_dotprod && caps.fp16 {
-                return SimdProfile::A76DotProd;
-            }
-            // A72 and A53 expose the same ISA flags (NEON + AES + SHA2 + CRC32,
-            // no dotprod/fp16/bf16). They cannot be distinguished by
-            // `is_aarch64_feature_detected!` alone — the difference is dual vs
-            // single NEON pipeline width, a microarchitectural property not
-            // reported by CPUID/HWCAP. Distinguishing requires reading
-            // /proc/cpuinfo `CPU part` (0xd03 = A53, 0xd08 = A72) or doing a
-            // microbenchmark probe. Until then, collapse both into a single
-            // ARMv8.0 profile — the dispatch tables would be identical at the
-            // ISA level (same NEON instructions); the only difference would
-            // be tile/block sizes tuned to dual vs single pipeline.
-            return SimdProfile::Armv8Neon;
+            // Reuse the existing `ArmProfile::arm_profile()` heuristic from
+            // `src/hpc/simd_caps.rs:317-336`. It already encodes the right
+            // decisions and has been in tree since the SBC support landed:
+            //   asimd_dotprod present  → A76DotProd (Pi 5 / A76+)
+            //   aes present (no dotprod) → A72Fast   (Pi 4 / Pi 3 / Pi Zero 2W)
+            //   no aes                 → A53Baseline (QEMU / minimal aarch64)
+            // The A72Fast branch catches A53-with-crypto silicon (Pi 3) and
+            // A72 silicon (Pi 4) alike — they share the ARMv8.0+crypto ISA
+            // and the dispatch tables would be identical. See arm_profile
+            // doc comments for the deployment-pragmatic reasoning.
+            return match simd_caps().arm_profile() {
+                ArmProfile::A76DotProd  => SimdProfile::A76DotProd,
+                ArmProfile::A72Fast     => SimdProfile::A72Fast,
+                ArmProfile::A53Baseline => SimdProfile::A53Baseline,
+                ArmProfile::NotArm      => SimdProfile::Scalar,
+            };
         }
         SimdProfile::Scalar
     }
@@ -240,7 +246,8 @@ pub fn gemm_dispatch() -> &'static GemmDispatch {
             SimdProfile::ArrowLake => &ARROW_GEMM,
             SimdProfile::HaswellAvx2 => &HSW_GEMM,
             SimdProfile::A76DotProd => &A76_GEMM,
-            SimdProfile::Armv8Neon => &ARMV8_NEON_GEMM,
+            SimdProfile::A72Fast => &A72_GEMM,
+            SimdProfile::A53Baseline => &A53_GEMM,
             SimdProfile::Scalar => &SCALAR_GEMM,
         }
     });
@@ -417,7 +424,7 @@ For each named primitive, the silicon-by-silicon route after all 4 phases land:
 | IceLakeSp, CascadeLake, SkylakeX | F32x16 mul_add over decoded BF16 rows (`hpc/bf16_tile_gemm.rs::fallback_path`) |
 | ArrowLake, HaswellAvx2 | F32x8 mul_add over decoded BF16 rows (new) |
 | A76DotProd | NEON BFMMLA via asm-byte (new in Phase 2 TD-T10) |
-| Armv8Neon | NEON F32x4 mul_add over decoded BF16 (new) |
+| A72Fast, A53Baseline | NEON F32x4 mul_add over decoded BF16 (new) — same kernel, separate table entries for symmetry with `ArmProfile` |
 | Scalar | Scalar triple loop (current `quantized.rs:444`) — kept as the reference |
 
 ### `int8_gemm_i32` (u8 × i8 → i32 matmul)
@@ -430,7 +437,7 @@ For each named primitive, the silicon-by-silicon route after all 4 phases land:
 | ArrowLake | `_mm256_dpbusd_epi32` (existing `vnni2_dot_u8_i8` at `simd_amx.rs:203`) |
 | HaswellAvx2 | Scalar i32 accumulate (no VNNI pre-Cascade Lake) |
 | A76DotProd | NEON SDOT (`vdotq_s32`, existing in `simd_neon.rs`) |
-| Armv8Neon | NEON int16x8 widen + multiply-accumulate |
+| A72Fast, A53Baseline | NEON int16x8 widen + multiply-accumulate — same kernel for both ARMv8.0 tiers |
 
 ### `gemv_f32` (BLAS-2 matrix-vector)