
Commit 47d1ba2

docs(simd): align with existing ArmProfile::arm_profile() heuristic
Self-correction. The previous commit collapsed A72Fast + A53Baseline into a new `Armv8Neon` variant, claiming the two could not be distinguished by HWCAP. That reinvented something the codebase already solves.

`src/hpc/simd_caps.rs:317-336` has had `ArmProfile::arm_profile()` in tree since the SBC support landed. Its decision tree:

    asimd_dotprod present      → A76DotProd   (Pi 5 / A76+)
    aes present (no dotprod)   → A72Fast      (Pi 4 / Pi 3 / Pi Zero 2W)
    no aes                     → A53Baseline  (QEMU / minimal aarch64)

The line 329 comment explicitly admits the A72Fast branch catches both A72 silicon (Pi 4) and A53-with-crypto silicon (Pi 3, Pi Zero 2W): "we report A72-tier since most deployments target Pi 4." The dispatch tables would be identical at the ISA level (both are ARMv8.0+crypto, no dotprod), so this is intentional.

The `A53Baseline` variant catches the rare case of NEON-without-crypto (QEMU, minimal aarch64 builds), which my `Armv8Neon` collapse lost.

Changes:
- Reverted SimdProfile enum to A76DotProd / A72Fast / A53Baseline.
- detect() pseudocode now delegates to existing arm_profile() helper.
- GemmDispatch table restored to 3 aarch64 entries.
- Quick-reference tables list both A72Fast and A53Baseline rows with a note that they share the same kernel.
- Dispatch matrix split into 4 rows: A53+crypto (→A72Fast), A53-no-crypto (→A53Baseline), A72 (→A72Fast), A76+ (→A76DotProd).

This is more honest than the Armv8Neon collapse: it preserves the existing in-tree pattern, names it correctly, and documents the A72Fast-as-ARMv8.0+crypto-catch-all semantic that the codebase already chose.
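
The decision tree quoted in this message can be sketched as a pure function over capability flags. This is only an illustration of the three-way branch, not the in-tree code: the `ArmCaps` struct and its field names are hypothetical stand-ins for whatever `simd_caps()` actually returns, and the real `arm_profile()` in `src/hpc/simd_caps.rs` may differ in detail (for example, it also has a `NotArm` variant).

```rust
/// Hypothetical stand-in for the detected capability flags.
/// The real struct in `simd_caps.rs` is assumed to carry more fields.
#[derive(Clone, Copy, Debug, Default)]
pub struct ArmCaps {
    pub asimd_dotprod: bool,
    pub aes: bool,
}

/// Mirrors the three aarch64 tiers named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ArmProfile {
    A76DotProd,
    A72Fast,
    A53Baseline,
}

/// Sketch of the heuristic: dotprod wins, then aes, then the baseline.
pub fn arm_profile(caps: ArmCaps) -> ArmProfile {
    if caps.asimd_dotprod {
        // Pi 5 / A76+ class parts.
        ArmProfile::A76DotProd
    } else if caps.aes {
        // ARMv8.0 + crypto: A72 (Pi 4) and A53-with-crypto (Pi 3,
        // Pi Zero 2W) both land here because HWCAP cannot separate them.
        ArmProfile::A72Fast
    } else {
        // NEON without crypto: QEMU or minimal aarch64 builds.
        ArmProfile::A53Baseline
    }
}
```

Keeping the branch free of any OS or intrinsic calls makes the heuristic testable on any host: a Pi 3 style flag set (`aes` without `asimd_dotprod`) resolves to `A72Fast`, which is exactly the catch-all behaviour the message documents.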
1 parent a9102cd commit 47d1ba2

2 files changed

Lines changed: 38 additions & 30 deletions


.claude/knowledge/td-simd-cpu-dispatch-matrix.md

Lines changed: 4 additions & 3 deletions
@@ -169,12 +169,13 @@ ARK equivalents: Ryzen 9 9950X, EPYC 9755 (Turin), Threadripper 9980X.
 
 ## Master matrix — aarch64
 
-Rows ordered by SoC tier (Pi family naming as canonical). A53 and A72 are listed as separate documented silicon (they have distinct microarchitecture — single vs dual NEON pipeline), but **the runtime `SimdProfile` collapses both into one variant `Armv8Neon`** because HWCAP/CPUID alone cannot distinguish them. Splitting them requires reading `/proc/cpuinfo` `CPU part` field (0xd03 = A53, 0xd08 = A72) — deferred until benchmarks demand it.
+Rows ordered by SoC tier (Pi family naming as canonical). **The existing detection helper `ArmProfile::arm_profile()` at `src/hpc/simd_caps.rs:317-336` already implements this dispatch and is the canonical reference.** It admits in its own comments that A72 silicon and A53-with-crypto silicon cannot be distinguished by HWCAP alone, and pragmatically maps both to `A72Fast` since the dispatch tables would be identical at the ISA level (both are ARMv8.0+crypto with no dotprod). The `A53Baseline` variant catches the rare case of NEON-without-crypto (QEMU, minimal aarch64 builds).
 
 | CPU silicon | Runtime profile | NEON | dotprod | fp16 | bf16+ (BFMMLA/BFDOT) | i8mm (SMMLA/UMMLA) | crypto (aes+sha2) | crc32 | sve | sve2 |
 |---|---|---|---|---|---|---|---|---|---|---|
-| **Cortex-A53** (Pi Zero 2W, Pi 3) | `Armv8Neon` | DOC ||||| DOC | DOC |||
-| **Cortex-A72** (Pi 4, Orange Pi 4) | `Armv8Neon` | DOC ||||| DOC | DOC |||
+| **Cortex-A53 + crypto** (Pi 3, Pi Zero 2W) | `A72Fast` (heuristic) | DOC ||||| DOC | DOC |||
+| **Cortex-A53 no crypto** (QEMU, minimal) | `A53Baseline` | DOC |||||| DOC |||
+| **Cortex-A72** (Pi 4, Orange Pi 4) | `A72Fast` | DOC ||||| DOC | DOC |||
 | **Cortex-A76+** (Pi 5, Orange Pi 5, Apple M1+) | `A76DotProd` | DOC | DOC | DOC | DOC | DOC | DOC | DOC |||
 
 Apple M-series add SVE/SVE2 from M4 onwards; not yet in scope for this matrix.

.claude/knowledge/td-simd-integration-plan.md

Lines changed: 34 additions & 27 deletions
@@ -65,15 +65,20 @@ pub enum SimdProfile {
     /// ARMv8.2-A: A76 (Pi 5), Apple M-series, Snapdragon 8 Gen 2+.
     /// NEON + dotprod + fp16 + bf16+ (BFMMLA/BFDOT).
     A76DotProd,
-    /// ARMv8.0 fallback: A72 (Pi 4), A53 (Pi 3 / Pi Zero 2 W), any other
-    /// ARMv8.0 part. The two cannot be distinguished by HWCAP/CPUID alone
-    /// — they expose identical ISA flags (NEON + AES + SHA2 + CRC32, no
-    /// dotprod). The difference is microarchitectural (dual vs single
-    /// NEON pipeline width). Distinguishing requires reading
-    /// /proc/cpuinfo `CPU part` (0xd03 = A53, 0xd08 = A72). Future
-    /// improvement: split into A72Fast/A53Baseline once that lookup is
-    /// wired. For now, dispatch table is identical for both.
-    Armv8Neon,
+    /// ARMv8.0 with crypto extension: Pi 4 (A72), Pi 3 (A53-with-crypto),
+    /// Pi Zero 2 W (A53-with-crypto), Orange Pi 4. Cannot distinguish
+    /// A53-with-crypto from A72 by HWCAP — both expose neon + aes + sha2 +
+    /// crc32 with no dotprod. Dispatch table is identical at the ISA level
+    /// (same NEON instructions). Existing `ArmProfile::arm_profile()` in
+    /// `src/hpc/simd_caps.rs:317-336` calls this `A72Fast` and admits the
+    /// heuristic ("we report A72-tier since most deployments target Pi 4")
+    /// — adopt that naming for consistency.
+    A72Fast,
+    /// ARMv8.0 without crypto: rare in the wild (QEMU, minimal aarch64
+    /// builds without `+aes`). Existing `ArmProfile::A53Baseline` catches
+    /// this case; preserved for that purpose. Real A53 silicon (Pi 3, Pi
+    /// Zero 2 W) usually has crypto and resolves as `A72Fast` above.
+    A53Baseline,
 
     // ── Fallback ──
     /// Anything else: wasm32, riscv, x86 baseline, unknown aarch64.
@@ -122,21 +127,22 @@ impl SimdProfile {
         }
         #[cfg(target_arch = "aarch64")]
         {
-            let caps = simd_caps();
-            if caps.asimd_dotprod && caps.fp16 {
-                return SimdProfile::A76DotProd;
-            }
-            // A72 and A53 expose the same ISA flags (NEON + AES + SHA2 + CRC32,
-            // no dotprod/fp16/bf16). They cannot be distinguished by
-            // `is_aarch64_feature_detected!` alone — the difference is dual vs
-            // single NEON pipeline width, a microarchitectural property not
-            // reported by CPUID/HWCAP. Distinguishing requires reading
-            // /proc/cpuinfo `CPU part` (0xd03 = A53, 0xd08 = A72) or doing a
-            // microbenchmark probe. Until then, collapse both into a single
-            // ARMv8.0 profile — the dispatch tables would be identical at the
-            // ISA level (same NEON instructions); the only difference would
-            // be tile/block sizes tuned to dual vs single pipeline.
-            return SimdProfile::Armv8Neon;
+            // Reuse the existing `ArmProfile::arm_profile()` heuristic from
+            // `src/hpc/simd_caps.rs:317-336`. It already encodes the right
+            // decisions and has been in tree since the SBC support landed:
+            //   asimd_dotprod present      → A76DotProd  (Pi 5 / A76+)
+            //   aes present (no dotprod)   → A72Fast     (Pi 4 / Pi 3 / Pi Zero 2W)
+            //   no aes                     → A53Baseline (QEMU / minimal aarch64)
+            // The A72Fast branch catches A53-with-crypto silicon (Pi 3) and
+            // A72 silicon (Pi 4) alike — they share the ARMv8.0+crypto ISA
+            // and the dispatch tables would be identical. See arm_profile
+            // doc comments for the deployment-pragmatic reasoning.
+            return match simd_caps().arm_profile() {
+                ArmProfile::A76DotProd => SimdProfile::A76DotProd,
+                ArmProfile::A72Fast => SimdProfile::A72Fast,
+                ArmProfile::A53Baseline => SimdProfile::A53Baseline,
+                ArmProfile::NotArm => SimdProfile::Scalar,
+            };
         }
         SimdProfile::Scalar
     }
@@ -240,7 +246,8 @@ pub fn gemm_dispatch() -> &'static GemmDispatch {
             SimdProfile::ArrowLake => &ARROW_GEMM,
             SimdProfile::HaswellAvx2 => &HSW_GEMM,
             SimdProfile::A76DotProd => &A76_GEMM,
-            SimdProfile::Armv8Neon => &ARMV8_NEON_GEMM,
+            SimdProfile::A72Fast => &A72_GEMM,
+            SimdProfile::A53Baseline => &A53_GEMM,
             SimdProfile::Scalar => &SCALAR_GEMM,
         }
     });
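
The `gemm_dispatch()` hunk above gives `A72Fast` and `A53Baseline` separate table entries even though the quick-reference tables say they run the same kernel. A minimal sketch of that arrangement follows. Everything here is illustrative: the `GemmDispatch` fields, the `dot_portable` kernel, the `table_for` helper, and the placeholder `detect()` are assumptions made for the example, and the use of `std::sync::OnceLock` is a guess at the caching mechanism, which the diff does not show.

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdProfile {
    A76DotProd,
    A72Fast,
    A53Baseline,
    Scalar,
}

/// Hypothetical dispatch table: a label plus one kernel entry point.
pub struct GemmDispatch {
    pub kernel: &'static str,
    pub dot_f32: fn(&[f32], &[f32]) -> f32,
}

/// Portable stand-in kernel; a real table would point at NEON code.
fn dot_portable(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

static A76_GEMM: GemmDispatch = GemmDispatch { kernel: "neon-dotprod", dot_f32: dot_portable };
// Two distinct table entries that share one kernel, mirroring the
// "same kernel, separate table entries" note in the reference tables.
static A72_GEMM: GemmDispatch = GemmDispatch { kernel: "neon-f32x4", dot_f32: dot_portable };
static A53_GEMM: GemmDispatch = GemmDispatch { kernel: "neon-f32x4", dot_f32: dot_portable };
static SCALAR_GEMM: GemmDispatch = GemmDispatch { kernel: "scalar", dot_f32: dot_portable };

/// Pure profile-to-table mapping, kept separate so it can be tested.
pub fn table_for(profile: SimdProfile) -> &'static GemmDispatch {
    match profile {
        SimdProfile::A76DotProd => &A76_GEMM,
        SimdProfile::A72Fast => &A72_GEMM,
        SimdProfile::A53Baseline => &A53_GEMM,
        SimdProfile::Scalar => &SCALAR_GEMM,
    }
}

/// Placeholder detection; the real `SimdProfile::detect()` probes the CPU.
fn detect() -> SimdProfile {
    SimdProfile::Scalar
}

/// Resolves the table once and caches the choice for later calls.
pub fn gemm_dispatch() -> &'static GemmDispatch {
    static TABLE: OnceLock<&'static GemmDispatch> = OnceLock::new();
    *TABLE.get_or_init(|| table_for(detect()))
}
```

Keeping two entries costs one extra static but lets either tier be retuned later (for example, different tile sizes) without touching the enum or the match.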
@@ -417,7 +424,7 @@ For each named primitive, the silicon-by-silicon route after all 4 phases land:
 | IceLakeSp, CascadeLake, SkylakeX | F32x16 mul_add over decoded BF16 rows (`hpc/bf16_tile_gemm.rs::fallback_path`) |
 | ArrowLake, HaswellAvx2 | F32x8 mul_add over decoded BF16 rows (new) |
 | A76DotProd | NEON BFMMLA via asm-byte (new in Phase 2 TD-T10) |
-| Armv8Neon | NEON F32x4 mul_add over decoded BF16 (new) |
+| A72Fast, A53Baseline | NEON F32x4 mul_add over decoded BF16 (new) — same kernel, separate table entries for symmetry with `ArmProfile` |
 | Scalar | Scalar triple loop (current `quantized.rs:444`) — kept as the reference |
 
 ### `int8_gemm_i32` (u8 × i8 → i32 matmul)
@@ -430,7 +437,7 @@ For each named primitive, the silicon-by-silicon route after all 4 phases land:
 | ArrowLake | `_mm256_dpbusd_epi32` (existing `vnni2_dot_u8_i8` at `simd_amx.rs:203`) |
 | HaswellAvx2 | Scalar i32 accumulate (no VNNI pre-Cascade Lake) |
 | A76DotProd | NEON SDOT (`vdotq_s32`, existing in `simd_neon.rs`) |
-| Armv8Neon | NEON int16x8 widen + multiply-accumulate |
+| A72Fast, A53Baseline | NEON int16x8 widen + multiply-accumulate — same kernel for both ARMv8.0 tiers |
 
 ### `gemv_f32` (BLAS-2 matrix-vector)
 
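
For the `int8_gemm_i32` row in the last hunk, the "int16x8 widen + multiply-accumulate" route for `A72Fast` and `A53Baseline` can be modelled in scalar Rust. This is a reference sketch of the arithmetic only, not the NEON kernel: the function names, the row-major layout, and the argument order are assumptions made for illustration. The point it shows is that a u8 × i8 product always fits in i16 (the extremes are 255 × -128 = -32640 and 255 × 127 = 32385), so widening to 16 bits before multiplying loses nothing, and only the accumulator needs 32 bits.

```rust
/// Scalar model of one u8 x i8 dot product: widen both operands to i16,
/// multiply in i16, then accumulate in i32.
pub fn dot_u8_i8_widen(a: &[u8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "operand lengths must match");
    a.iter().zip(b).fold(0i32, |acc, (&x, &y)| {
        // Product range is -32640..=32385, so i16 cannot overflow here.
        let prod: i16 = i16::from(x) * i16::from(y);
        acc + i32::from(prod)
    })
}

/// Reference u8 x i8 -> i32 matmul. `a` is m x k and `b` is k x n,
/// both row-major (an assumed layout); the result is m x n, row-major.
pub fn int8_gemm_i32(a: &[u8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    assert_eq!(a.len(), m * k, "a must hold m * k elements");
    assert_eq!(b.len(), k * n, "b must hold k * n elements");
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                // Same widen-then-multiply step as the dot product above.
                let prod = i16::from(a[i * k + p]) * i16::from(b[p * n + j]);
                acc += i32::from(prod);
            }
            c[i * n + j] = acc;
        }
    }
    c
}
```

A model like this is useful as the oracle when testing a vectorised kernel: both ARMv8.0 tiers should agree with it bit for bit, since integer accumulation has no rounding.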