Commit b1979d7
feat(hpc): TD-T2 — AMX TDPBUSD tile kernel + matmul_i8_to_i32 wiring
Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the
integer operand family. Builds the missing int8 tile kernel from
scratch (the BF16 equivalent shipped in PR #104; the int8 one had
never been built despite the primitives existing in simd_amx since
day one) and wires matmul_i8_to_i32's AMX arm through it.
New module `hpc::int8_tile_gemm`:
* `int8_tile_gemm_16x16(a_u8, b_i8, c, k)`: public tile kernel;
K must be a multiple of 64. Mirrors the shape of
`bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand
family that TDPBUSD natively supports. **One TDPBUSD = 16 384
multiply-accumulates per instruction** (16×16 output tile × 64
K-elements per output element, consumed as 16 groups of 4
bytes). That's 256× the 64 multiply-accumulates of one
VPDPBUSD-zmm instruction.
* Internal `amx_path()` uses the existing primitives in
`amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig →
tile_zero → K/64 iterations of (tile_load A, tile_load B,
tile_dpbusd) → tile_store → tile_release.
* `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32
triple-loop reference.
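
As a rough illustration of what the fallback computes and where the 16 384 figure comes from, here is a minimal scalar sketch. The function name `tile_gemm_16x16_ref`, the constants, and the operand layouts (A as 16 × K row-major, B as K × 16 row-major and unpacked, C as 16 × 16 row-major) are assumptions made for this example, not the signatures or layouts used in the commit.

```rust
/// Tile edge: the i32 output tile is 16 x 16.
const T: usize = 16;

/// Multiply-accumulates in one TDPBUSD: 16 x 16 outputs, 64 products each.
pub const MACS_PER_TDPBUSD: usize = T * T * 64;
/// Multiply-accumulates in one 512-bit VPDPBUSD: 16 i32 lanes, 4 products each.
pub const MACS_PER_VPDPBUSD_ZMM: usize = 16 * 4;

/// Scalar reference for one 16 x 16 output tile of a u8 x i8 -> i32 GEMM.
/// Layouts are assumptions for this sketch: `a` is 16 x k row-major,
/// `b` is k x 16 row-major (not VNNI-packed), `c` is 16 x 16 row-major.
pub fn tile_gemm_16x16_ref(a: &[u8], b: &[i8], c: &mut [i32], k: usize) {
    assert!(k % 64 == 0, "k must be a multiple of 64");
    assert_eq!(a.len(), T * k);
    assert_eq!(b.len(), k * T);
    assert_eq!(c.len(), T * T);
    for i in 0..T {
        for j in 0..T {
            // Accumulate without saturation, matching i32 wraparound.
            let mut acc: i32 = 0;
            for p in 0..k {
                let prod = i32::from(a[i * k + p]) * i32::from(b[p * T + j]);
                acc = acc.wrapping_add(prod);
            }
            // The tile is overwritten, mirroring a zeroed accumulator tile.
            c[i * T + j] = acc;
        }
    }
}
```

A sketch like this is only useful as a correctness oracle for the AMX path; it makes no attempt at performance.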
New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`:
* Packs a K × N row-major i8 matrix into the K/4 outer rows ×
(N*4) VNNI quad layout that TDPBUSD requires for its B operand
(tile 2).
* `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]`
* Sibling of `vnni_pack_bf16`, which uses the K/2 × (N*2) pair
layout for TDPBF16PS. Both kernels reach the same 64-byte tile
row width via element-width × pack-factor symmetry: BF16 is
2B × 2, INT8 is 1B × 4.
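
The index formula above can be read as a three-level loop. The following is a minimal sketch of that formula only; the name `vnni_pack_i8_sketch` and the length checks are assumptions for illustration and may differ from the real `vnni_pack_i8`.

```rust
/// Packs a k x n row-major i8 matrix into the VNNI quad layout:
/// k/4 outer rows, each holding n groups of 4 consecutive-K bytes.
/// Hypothetical stand-in for illustration; requires k % 4 == 0.
pub fn vnni_pack_i8_sketch(src: &[i8], dst: &mut [i8], k: usize, n: usize) {
    assert!(k % 4 == 0, "k must be a multiple of 4");
    assert_eq!(src.len(), k * n);
    assert_eq!(dst.len(), k * n);
    for kb in 0..k / 4 {
        for j in 0..n {
            for p in 0..4 {
                // Same mapping as the formula in the commit message.
                dst[kb * n * 4 + j * 4 + p] = src[(4 * kb + p) * n + j];
            }
        }
    }
}
```

For an 8 × 4 input whose elements are numbered 0..32 in row-major order, the first packed quad is column 0 of rows 0..4, that is `[0, 4, 8, 12]`.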
Wiring `matmul_i8_to_i32`'s AMX arm (was placebo):
Before this commit, the AMX branch shifted i8 → u8, then called
the SCALAR `int8_gemm_i32` reference and subtracted the bias;
TDPBUSD itself was never reached, even on real AMX silicon. Now:
1. Shift A: i8 → u8 via (+128).
2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling
int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block
extracted into K × 16 scratch once per j_tile, reused across
i_tile iterations.
3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]).
The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0;
misaligned shapes fall back to the scalar reference. Phase-4 work
will land mixed AMX-tile + per-axis scalar tail handling for
arbitrary shapes (the same Phase-4 work that TD-T1 deferred).
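
The three steps rely on the identity sum_p (a + 128)·b = sum_p a·b + 128·sum_p b, so subtracting 128 × colsum(B) recovers the signed product. Below is a minimal sketch of the shift, tile loop, and bias correction. `tile_kernel` is a scalar stand-in for the real tile kernel, and all names, layouts (row-major A, B, C), and the use of wrapping arithmetic are assumptions for this example.

```rust
const T: usize = 16;

/// Scalar stand-in for the tile kernel: c_tile = a_rows (16 x k, u8) * b_blk (k x 16, i8).
fn tile_kernel(a_rows: &[u8], b_blk: &[i8], c_tile: &mut [i32; T * T], k: usize) {
    for i in 0..T {
        for j in 0..T {
            let mut acc = 0i32;
            for p in 0..k {
                let prod = i32::from(a_rows[i * k + p]) * i32::from(b_blk[p * T + j]);
                acc = acc.wrapping_add(prod);
            }
            c_tile[i * T + j] = acc;
        }
    }
}

/// i8 x i8 -> i32 GEMM computed through a u8 x i8 kernel plus bias correction.
pub fn matmul_i8_shifted(a: &[i8], b: &[i8], m: usize, n: usize, k: usize) -> Vec<i32> {
    assert!(m % T == 0 && n % T == 0 && k % 64 == 0, "misaligned shape");
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    // Step 1: shift A from i8 to u8; flipping the sign bit adds 128.
    let a_u8: Vec<u8> = a.iter().map(|&x| (x as u8) ^ 0x80).collect();
    let mut c = vec![0i32; m * n];
    let mut b_blk = vec![0i8; k * T];
    let mut c_tile = [0i32; T * T];
    // Step 2: tile loop; the B sub-block is extracted once per j_tile.
    for jt in 0..n / T {
        for p in 0..k {
            let start = p * n + jt * T;
            b_blk[p * T..(p + 1) * T].copy_from_slice(&b[start..start + T]);
        }
        for it in 0..m / T {
            let a_rows = &a_u8[it * T * k..(it + 1) * T * k];
            tile_kernel(a_rows, &b_blk, &mut c_tile, k);
            for i in 0..T {
                let row = (it * T + i) * n + jt * T;
                c[row..row + T].copy_from_slice(&c_tile[i * T..(i + 1) * T]);
            }
        }
    }
    // Step 3: subtract the bias introduced by the +128 shift.
    for j in 0..n {
        let colsum: i32 = (0..k).map(|p| i32::from(b[p * n + j])).sum();
        for i in 0..m {
            c[i * n + j] = c[i * n + j].wrapping_sub(128 * colsum);
        }
    }
    c
}
```

Extracting the K × 16 block of B in the outer loop keeps that copy out of the per-tile work, which is the reuse described in step 2.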
Verification:
* Default v3 build: 2092 lib tests pass (was 2087). The 5 new
tests include 4 in int8_tile_gemm. In addition, the existing
matmul_i8_to_i32 test now exercises the actual TDPBUSD path
because this host has amx_int8 + amx_tile in /proc/cpuinfo; it
continues to pass with bit-identical results to the scalar
reference.
* `vnni_pack_i8_roundtrip` test verifies the pack layout matches
the spec exactly for an 8 × 4 sample.
* `fallback_matches_scalar_reference_k64` test verifies the
non-AMX path produces the same i32 output as a hand-written
reference for a pseudo-random u8/i8 matrix pair with K = 64.
* `public_api_diagonal_k128` test asserts a structured pattern
(A = identity-like, B = constant 2) gives the expected
accumulation through the full dispatch chain.
* `cargo clippy --lib -- -D warnings` clean.
* `cargo fmt --all --check` clean.
Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX
arm no longer falls back to it (the scalar else-branch uses an
inline triple-loop directly).
After this commit, the per-CPU dispatch table from PR #180 has the
AMX tier wired for BOTH operand families on Sapphire Rapids+:
BF16 GEMM: SPR+ → TDPBF16PS (TD-T1 / TD-T1b in PR #182)
INT8 GEMM: SPR+ → TDPBUSD (this commit)
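
To make the tier choice concrete, here is a minimal sketch that picks the INT8 tier from the `flags` line of /proc/cpuinfo, using the `amx_tile` and `amx_int8` flag names mentioned under Verification. The enum, the function name, and the idea of parsing cpuinfo text are assumptions for illustration; the dispatch table from PR #180 may detect features differently, and real AMX use on Linux also depends on the OS granting tile-state permission.

```rust
/// Tiers covered by this commit for INT8 GEMM (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Int8Tier {
    AmxTdpbusd,
    Scalar,
}

/// Chooses a tier from the text of /proc/cpuinfo.
/// Only lines starting with "flags" are inspected.
pub fn int8_tier_from_cpuinfo(cpuinfo: &str) -> Int8Tier {
    let has = |flag: &str| {
        cpuinfo
            .lines()
            .filter(|line| line.starts_with("flags"))
            .any(|line| line.split_whitespace().any(|f| f == flag))
    };
    if has("amx_tile") && has("amx_int8") {
        Int8Tier::AmxTdpbusd
    } else {
        Int8Tier::Scalar
    }
}
```

Taking the cpuinfo text as a parameter, rather than reading the file inside the function, keeps the selection logic testable on hosts without AMX.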
Out of scope (separate PRs):
* VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade
Lake / Zen 4+ (avx512vnni without AMX). The kernel functions
`vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; they
just need to be assembled into an m×n×k GEMM and wired in as
the middle dispatch tier (analogous to the VDPBF16PS arm in
PR #182's bf16_gemm_dispatch).
* AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level
surface from PR #182) — it's u8 × i8 natively so no sign-shift
needed, simpler to wire than matmul_i8_to_i32.
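
As a rough picture of the first out-of-scope item, the sketch below assembles an m×n×k GEMM from a dot-product primitive. `dot_u8_i8` here is a scalar stand-in written for this example; the real `vnni_dot_u8_i8` signature is not shown in this commit and may differ, as may the row-major layouts assumed for A, B, and C.

```rust
/// Scalar stand-in for a u8 x i8 dot-product primitive (hypothetical signature).
fn dot_u8_i8(a: &[u8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| i32::from(x) * i32::from(y))
        .sum()
}

/// GEMM built from the dot primitive: c (m x n) = a (m x k) * b (k x n).
/// All matrices are assumed row-major for this sketch.
pub fn gemm_u8_i8_from_dot(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    // Gather each column of B into a contiguous buffer once,
    // then reuse it for every row of A.
    let mut col = vec![0i8; k];
    for j in 0..n {
        for p in 0..k {
            col[p] = b[p * n + j];
        }
        for i in 0..m {
            c[i * n + j] = dot_u8_i8(&a[i * k..(i + 1) * k], &col);
        }
    }
}
```

Gathering the column first means the dot primitive always sees two contiguous slices, which is the access pattern a vectorized dot product would want.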
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
1 parent 098c5aa commit b1979d7
3 files changed
Lines changed: 288 additions & 14 deletions