Commit bab9f2e

Merge pull request #83 from AdaWorldAPI/claude/setup-embedding-pipeline-Fa65C
docs: clarify VNNI dispatch tiers — F32x16 is the floor, no scalar on x86

avx512vnni (64 MACs) and avxvnniint8 (32 MACs) are mutually exclusive by hardware generation. The scalar i32 path in matvec_dispatch exists only for correctness on non-x86 targets. On x86, the thinking engine dispatches to F32x16 FMA (16 MACs) when no VNNI is detected, so it never reaches the scalar path.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
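The MAC counts quoted in the commit message follow from register width and lane layout. The sketch below works through that arithmetic under two assumptions that the commit does not state explicitly: the VNNI tiers use a `vpdpbusd`-style dot product (four 8-bit products accumulated into each 32-bit lane), and the F32x16 path issues one fused multiply-add per f32 lane. The function names are hypothetical and do not come from the repository.

```rust
/// MACs per instruction for a VNNI-style 8-bit dot product.
/// Assumption: each 32-bit accumulator lane sums four u8 x i8 products.
const fn vnni_macs_per_instr(register_bits: u32) -> u32 {
    let lanes_i32 = register_bits / 32;
    lanes_i32 * 4
}

/// MACs per instruction for an f32 FMA: one multiply-accumulate per 32-bit lane.
const fn fma_f32_macs_per_instr(register_bits: u32) -> u32 {
    register_bits / 32
}

/// Hypothetical summary of the three x86 tiers named in the commit message:
/// (zmm VNNI, ymm VNNI, zmm f32 FMA).
const fn x86_tier_macs() -> (u32, u32, u32) {
    (
        vnni_macs_per_instr(512),    // avx512vnni on zmm
        vnni_macs_per_instr(256),    // avxvnniint8 on ymm
        fma_f32_macs_per_instr(512), // F32x16 FMA on zmm
    )
}
```

Under these assumptions a 512-bit register yields 64 MACs, a 256-bit register yields 32, and a 512-bit f32 FMA yields 16, which matches the figures in the message.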
2 parents 09815cb + 6901753 commit bab9f2e

1 file changed

Lines changed: 14 additions & 4 deletions

src/simd_amx.rs

@@ -201,11 +201,19 @@ pub fn vnni_matvec_scalar(
     }
 }

-/// Runtime-dispatched MatVec: avx512vnni → avxvnniint8 (VNNI2) → scalar.
+/// Runtime-dispatched VNNI MatVec: avx512vnni → avxvnniint8 → scalar i32.
 ///
-/// Tier 2: avx512vnni — 64 MACs/instr (zmm, Cascade Lake+, Zen 4+)
-/// Tier 1: avxvnniint8 — 32 MACs/instr (ymm, Arrow Lake, NUC 14 i9-185H)
-/// Tier 0: scalar
+/// Three tiers, mutually exclusive by hardware generation:
+///   avx512vnni — 64 MACs/instr (zmm, Cascade Lake+, Zen 4+)
+///   avxvnniint8 — 32 MACs/instr (ymm, Arrow Lake, NUC 14 i9-185H)
+///   scalar i32 — only for non-x86 or testing (caller should prefer F32x16 FMA)
+///
+/// NOTE: The scalar path here does i32 multiply-accumulate, NOT f32.
+/// For the thinking engine, F32x16 FMA (16 MACs/instr) is the true floor.
+/// This scalar path exists only for correctness on non-x86 targets.
+/// The thinking engine's cycle_auto() dispatches:
+///   VNNI detected → cycle_vnni() → this function
+///   No VNNI → cycle() → F32x16 (never reaches here)
 pub fn matvec_dispatch(
     table: &[u8],
     energy_i8: &[i8],
@@ -223,6 +231,8 @@ pub fn matvec_dispatch(
             return;
         }
     }
+    // Non-x86 or no VNNI: i32 scalar accumulate.
+    // On x86, the thinking engine uses F32x16 FMA instead of reaching here.
     vnni_matvec_scalar(table, energy_i8, result, n);
 }

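The tiering described in the new doc comment can be sketched as a pure selection function plus a reference scalar kernel. This is only an illustration under stated assumptions, not the repository's implementation: `Tier`, `select_tier`, and `matvec_scalar_i32` are hypothetical names, the capability flags are passed in as booleans rather than detected at runtime, and the table is assumed to be a row-major `n x n` matrix with an `i32` result buffer (the diff does not show the real `result` type or layout).

```rust
/// Hypothetical compute tiers, mirroring the doc comment in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Avx512Vnni,  // 64 MACs/instr on zmm
    AvxVnniInt8, // 32 MACs/instr on ymm
    F32x16Fma,   // 16 MACs/instr, the x86 floor per the commit message
    ScalarI32,   // non-x86 (or testing) only
}

/// Pick a tier from capability flags. On x86 the scalar tier is never chosen:
/// without VNNI the caller falls back to F32x16 FMA instead.
fn select_tier(is_x86: bool, has_avx512vnni: bool, has_avxvnniint8: bool) -> Tier {
    if !is_x86 {
        return Tier::ScalarI32;
    }
    if has_avx512vnni {
        Tier::Avx512Vnni
    } else if has_avxvnniint8 {
        Tier::AvxVnniInt8
    } else {
        Tier::F32x16Fma
    }
}

/// Reference i32 multiply-accumulate MatVec.
/// Assumption: `table` is row-major n x n (u8), `energy_i8` has n entries.
fn matvec_scalar_i32(table: &[u8], energy_i8: &[i8], result: &mut [i32], n: usize) {
    assert!(table.len() >= n * n && energy_i8.len() >= n && result.len() >= n);
    for row in 0..n {
        let mut acc: i32 = 0;
        for col in 0..n {
            // Widen both operands to i32 before multiplying, as VNNI does.
            acc += table[row * n + col] as i32 * energy_i8[col] as i32;
        }
        result[row] = acc;
    }
}
```

Keeping the selection logic free of feature-detection macros makes the "no scalar on x86" rule easy to test on any host; a real dispatcher would presumably feed it the results of runtime CPU feature detection.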