[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)#7264
Draft
DarylHawkinsAMD wants to merge 17 commits into
Draft
[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)#7264DarylHawkinsAMD wants to merge 17 commits into
DarylHawkinsAMD wants to merge 17 commits into
Conversation
Foundation for SDPA backward multi-kernel dispatch (ALMIOPEN-1824 / I8.1 Stream A). Adds CSV-driven config-header generation for the backward pipeline so SdpaBwdPlanBuilder can resolve kernels at runtime instead of the POC's hard-coded selection. * Snapshot fmha_v3_bwd CSVs and .co binaries from AITER (commit 9522048dc10de20ba9dcda1c0a3f640867e7a586) for gfx942 and gfx950, using a flat layout that mirrors AITER's source. Backward has no MI300/MI308 split; the previous MI300/ subfolder is removed. Three CSVs are imported per arch (odo, dqdkdv, dq_convert); the empty fmha_bwd_dq_shuffle.csv is omitted. * Extend asm_sdpa_engine/CMakeLists.txt with a second add_custom_command invoking codegen.py -m fmha_v3_bwd, producing asm_fmha_v3_bwd_configs.hpp alongside the existing fwd header. * Append a frozen KernelTiles skeleton plus odoTiles / dqdkdvTiles / dqConvertTiles fields to SdpaBwdParams so Streams B and C can build against a stable contract. Skeleton is default-initialised; population is Stream B's responsibility. * Add SOURCE.md per arch documenting the AITER source URL, commit, and subdirectory for future binary-versioning automation (I8.7). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hard-coded K_TS_ODO/K_TS_KV/K_TS_DQ constants in SdpaBwdPlan with per-stage tile sizes plumbed through SdpaBwdParams (ALMIOPEN-1824 / I8.1 Stream C). The plan no longer knows specific kernel tile dimensions; it consumes whatever the builder resolved from the AITER CSV. * SdpaBwdParams: add a constexpr KernelTiles::gridDim(extent) ceil-div helper. Returns 0 when ts is 0 so unit tests can exercise the unpopulated case; runtime callers get a no-op grid in that scenario, which the launch path can detect. * SdpaBwdPlan: delete the three K_TS_* constants, route grid-x at each stage through _params.<stage>Tiles.gridDim(...), and plumb the dqdkdv KV-tile through MhaBwdArgs::ts_dqdkdv for the dqdkdv.Ts byte-stride that previously used K_TS_KV. * TestSdpaBwdPlanGridMath: regression tests pinning the POC config (hd128/bf16/seqlen=2048) grid dims to the values the deleted constants would have produced, plus ceil-div edge cases for the KernelTiles helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hard-coded SDPA backward POC selection with a registry walk
over the AITER-derived configs (ALMIOPEN-1824 / I8.1 Stream B). The
plan builder now resolves a kernel for each pipeline stage (odo,
dqdkdv, dq_convert) from the on-disk .co binaries, and routes the
matched tile sizes into SdpaBwdParams for Plan.cpp to consume.
* SdpaBwdPlanBuilder: introduce backward-local enums (MaskType,
RoundingMode, BatchMode, AccumulatorMode) and helpers ported from
the forward path (getMaskType, getRoundingMode, getBatchMode,
getDataTypeIdentifier, computePssk, computePddv, lookupKernelNameKey,
getKernelCoPath flat layout).
* isApplicable narrows day-one acceptance to the POC kernarg layout:
gfx942, batch mode, atomic32+pssk+pddv all 1, bf16 (RTNE bf16_cvt
in {0,1,2}) or fp16 (bf16_cvt = 3), square hdim in {64, 128, 192}.
Variants outside this set are rejected with a clear log so I8.2 /
I8.4.2 / I8.5 / I8.6 follow-ups can lift restrictions one axis at
a time. AITER does not ship hd64 with pddv=1 (already aligned),
so the effective accepted matrix is hd in {128, 192}; the registry
miss surfaces this cleanly.
* buildPlan resolves all three stage keys, populates
params.{odo,dqdkdv,dqConvert}Tiles, and replaces the three
hard-coded co paths and mangled kernel names with the registry
values. Forward-style MI300/MI308 sub-folder split is intentionally
absent for backward — AITER's bwd source has a flat layout.
* TestSdpaBwdPlanBuilder: add registry-lookup, isApplicable rejection,
and pssk/pddv parameterised tests; flip FP16 hd128 in the existing
IsApplicableSdpaBwdVariations from rejected to accepted; document
why hd64 stays rejected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ers/dahawkin/sdpa-bwd-kernel-selection
The merged Stream B/C work was missed by the prior build because the ninja graph was stale until ninja_check forced a reconfigure. Once the new test source linked, clang-tidy surfaced three classes of warnings-as-errors: * SdpaBwdPlanBuilder.cpp: brace single-line `if (...) continue;` bodies (readability-braces-around-statements) and add an explicit default arm to the PipelineStage switch (clang-diagnostic-switch-default). * TestSdpaBwdPlanBuilder.cpp / TestSdpaBwdPlanGridMath.cpp: rename constexpr variables from camelBack (`kFooBar`) to UPPER_CASE (`FOO_BAR`) per .clang-tidy's readability-identifier-naming.ConstexprVariableCase. * TestSdpaBwdPlanBuilder.cpp registry-lookup test for hd192/bf16 causal-tl/batch: the dqdkdv row in AITER's CSV is pssk=1, pddv=1 for that combination, not pddv=0 as the test assumed; the test now matches the CSV. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the SDPA backward integration smoke suite to cover the remainder of the day-one accepted dispatch matrix: bf16/fp16 across hd128/hd192 on a single small problem each. The pre-existing hd128/bf16 case stays as the regression baseline. FP16 uses a slightly looser tolerance to accommodate its narrower dynamic range under softmax recomputation. Also drop residual task-tracking references in SdpaBwdPlanBuilder.cpp comments per the project convention that comments describe current state, not change history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verification on gfx942 surfaced two issues that this change resolves: 1. The AITER snapshot of bwd_hd128_odo_bf16.co (9344 bytes) produced D-aux values that diverged from the in-tree CPU backward reference enough to fail the existing hd128/bf16 SDPA backward correctness test. Restore the develop POC binary (10280 bytes) for that one file. SOURCE.md now records this as an explicit local override so future AITER refreshes know to either keep the override or coordinate a tolerance / reference update. 2. The hd192 and FP16 dispatch paths exist in the registry and exercise correctly, but the in-tree CPU reference is not calibrated against those kernel variants — gradient outputs land far outside the existing tolerance. Calibrating that reference is owned by the I8.4.x tasks. Until those land, isApplicable rejects anything other than hd128 / BF16. The PlanBuilder gate now mirrors what we can actually verify end-to-end. The dispatch infrastructure for hd192 and FP16 stays in place (registry walk, tile plumbing, kernel name resolution) and is reachable as soon as the gate is widened. * SdpaBwdPlanBuilder: tighten isApplicable to hd128 + bf16 with rationale comments pointing at the I8.4.x dependency. * TestSdpaBwdPlanBuilder: IsApplicableSdpaBwdVariations now expects hd64, hd192, and FP16 to all return false; comments explain why. * IntegrationGpuSdpaBackward: revert the hd192 and FP16 cases that were added in Phase 3 — they were over-reach into I8.4.x territory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness: - Zero-init dq_acc workspace before A32 DQDKDV launch (AITER's asm frontend uses torch::zeros for dq_accum; without this, dQ silently corrupts whenever seq_kv > ts*2). - Reject uint32 byte-stride overflow on every Q/K/V/O/dO/dQ/dK/dV/lse/ dq_acc field instead of logging-and-continuing. - Reject fractional GQA ratio (nhead_q % nhead_k != 0) in isApplicable. - Add hipPeekAtLastError + gridX>0 precheck per kernel launch. - gfx950 BF16 bf16_cvt special case in registry lookup (matches forward). - Wrap tensorMap.at lookups in isApplicable with find()+return-false. - Fix latent fwd buildPlan bug: silently fell through after a failed registry lookup. - static_assert(offsetof) on load-bearing kernarg fields. Architecture / dispatch: - Extract BwdDispatchTuple computed once and consumed by both isApplicable and buildPlan so the per-stage tuples cannot drift. - resolveStage returns std::optional<ResolvedKernel> instead of three out-references. - Model A16 path as dormant-but-correct: skip dq_convert launch and dq_acc allocation when the resolved row has atomic32==0; gate flip remains pinned to A32 in computeDispatchTuples with a TODO. Maintainability: - Cache getDeviceString on the builder; eliminate duplicated try/catch. - Hoist BF16_CVT_FP16_SENTINEL to SdpaBwdPlanBuilder.hpp. - Drop dead computePssk/computePddv helpers (no compatible registry row in the current matrix). - Drop unused KernelTiles::tsQO field; drop unused archId param from backward getKernelCoPath. - initializeExecutionSettings throws std::logic_error instead of emitting a spurious LOG_ERROR on every engine instantiation. - Comment hygiene; group-mode rejection names the actual graph attribute; getRoundingMode TODO uses ALMIOPEN-1824 tag. Provenance: - SHA256 manifest for gfx942/gfx950 fmha_v3_bwd .co directories (incl. the local-override bwd_hd128_odo_bf16.co). - Reconcile dual-AITER-commit state in asm_kernels/README.md and AsmKernelPath.hpp comment block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per PR review: MhaBwdArgs is documented as a line-for-line mirror of AITER's mha_bwd_args, but I8.1 added a `ts_dqdkdv` field to plumb the KV-tile size in from the dispatch tuple. AITER takes the tile size from the kernel-traits template parameter at the call site, not from the args struct, so adding the field broke the mirroring contract. Pass `tsKv` as an explicit parameter to `buildDqdkdvArgs` and `validateMhaBwdByteStrides` instead. `MhaBwdArgs` is now field-for-field identical to AITER again, and the build helpers stay self-contained (no back-channel into `_params`). Verified on gfx942: 49/50 unit tests pass (1 gfx950 reject test skips on this arch); SdpaFwd/0, SdpaFwd/1, SdpaBwd/0 GPU integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fwd builder refactor of getKernelNameKey (string key -> config*) and the silent-fall-through fix in SdpaFwdPlanBuilder::buildPlan are out of scope for this bwd-dispatch PR. Revert SdpaFwdPlanBuilder.cpp to its develop state; the silent-fall-through bug will be addressed in its own PR with appropriate test coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… TODOs
Bwd dispatch refinements from the second review pass:
- Replace getDataTypeIdentifier (returns "" sentinel on unsupported
dtype) with tryGetDataTypeIdentifier returning std::optional<std::string>.
Matches the tryGetDeviceString convention already in the same file
and the broader std::optional idiom across hip-kernel-provider engines.
buildPlan now bails on std::nullopt instead of silently propagating
an empty key into the registry walk.
- Re-tag I8.x TODOs to their Jira tickets under epic ALMIOPEN-1823
so reviewers can navigate directly to the work item:
I8.2 -> ALMIOPEN-1825 (A16 vs A32 workspace sizing)
I8.4.x -> ALMIOPEN-1832 (CPU reference calibration for hd192/FP16)
I8.9 (workspace alignment) left as I8.9 because no child Jira ticket
exists yet for that scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
clang-tidy misc-const-correctness violations on lines this PR introduced or already touched: - SdpaBwdPlan.cpp: drop two unused using decls (alignUp, K_WORKSPACE_ALIGNMENT_BYTES); const for err, mhaArgs, gdxOdo, gdxDqdkdv, memsetErr, gdxPost. - SdpaKernelUtils.hpp: const for the three hipError_t err locals in launchKernel and HipModuleGuard's destructor / move-assign. - SdpaBwdPlanBuilder.cpp: const for leftAndRightBoundsSet, the eight *Uid locals in isApplicable, the nine *Uid locals in buildPlan, and bf16CvtValue (both call sites). - TestSdpaBwdPlanBuilder.cpp: const for graphWrapper, settings, workspaceSize, applicabilityTests, and the qDims/kDims/vDims/oDims vectors in IsApplicable_RejectsFractionalGqaRatio and IsApplicable_RejectsAsymmetricHdim. Out-of-scope tidy violations in IntegrationGpuSdpaBackward.cpp, HipDeviceUtils.hpp, batchnorm/rmsnorm tests, etc. are not addressed here — they pre-date this PR and belong in their own cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous signature returned a const fmha_v3_bwdConfig* into the codegen-emitted `static CFG cfg_fmha_bwd_*` registries. The pointer was safe in practice (static storage duration, registries never mutated), but the API leaked a lifetime contract the language did not enforce: any future maintainer mutating one of the maps would silently invalidate the cached pointer. Returning std::optional<fmha_v3_bwdConfig> by value decouples the caller from the registry's storage entirely. The struct is small (four std::strings + 11 ints, ~250 bytes), and findConfig is called ~3 times per dispatch resolution — not in any hot path. Matches the optional-based fallible-return idiom already used by tryGetDataTypeIdentifier and tryGetDeviceString in the same file. Call sites updated: - lookupKernelNameKey: const fmha_v3_bwdConfig* cfg = nullptr -> std::optional<fmha_v3_bwdConfig> cfg - buildPlan's resolveStage lambda: parameter type and dereference Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IPlanBuilder::buildPlan documents 'throws HipdnnPluginException if the plan cannot be built'. SdpaBwdPlanBuilder::buildPlan was instead silently early-returning at every error site, leaving the caller's ExecutionContext without a plan; the next call to ExecutionContextBase::getPlan() then threw a generic HIPDNN_PLUGIN_STATUS_INTERNAL_ERROR with no context, hiding the actual root cause from the API caller. Eight silent-return sites converted: - tryGetDeviceString failure - unsupported tensor dtype combination (isApplicable should have caught) - odo / dqdkdv / dq_convert registry lookup miss (resolveStage lambda now returns ResolvedKernel by value and throws directly; callers no longer need to check the return) - odo / dqdkdv / dq_convert kernel module load failure Most use HIPDNN_PLUGIN_STATUS_INTERNAL_ERROR because isApplicable is the public-facing precondition check; if buildPlan reaches one of these, something upstream has already gone wrong. Each throw carries a message naming the failing operation and the relevant context (arch, dtype, hdim, .co path) so the API caller's exception is actionable. Matches the throw-on-failure pattern used throughout BatchnormPlanBuilder, ApplicabilityChecks, and the rest of the hip-kernel-provider plan builders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… execute By the time SdpaBwdPlan::execute runs, every parameter must already be known good — anything checkable from graph data should have been gated in isApplicable, and anything that does fail at execute time is a genuinely exceptional condition that should throw, not silently return. Changes: - Move the byte-stride uint32 overflow check into SdpaBwdPlanBuilder::isApplicable. A new wouldBwdByteStridesFitUint32 helper extracts strides directly from the graph tensors and computes the derived dq_acc strides from B/H/S/D dims. Tile size for the Ts product comes from a findConfig() call against cfg_fmha_bwd_dqdkdv. Reject the graph (return false) on any overflow. - Add o_tensor_uid lookup in isApplicable (was previously only resolved in buildPlan; needed for the o-tensor stride checks). - Drop the runtime validateMhaBwdByteStrides call from execute — the same condition is now caught at applicability time. Delete the helper and fitsUint32AfterScale + K_U32_MAX_AS_I64 from SdpaBwdPlan.cpp (no remaining callers). - Drop the three checkLaunchPreconditions(gridX==0) calls and the helper itself — gridX==0 was a sentinel for unresolved CSV rows, but buildPlan now throws on resolveStage failure (commit 99292b3), so reaching execute with ts==0 is impossible. - Convert remaining execute early-returns to throws: * workspace==nullptr: BAD_PARAM (caller contract violation — getMaxWorkspaceSize always returns > 0 for bwd) * hipMemsetAsync failure: INTERNAL_ERROR with the HIP error string * hipModuleLaunchKernel failure (3 stages): INTERNAL_ERROR * Async post-launch error (3 stages): INTERNAL_ERROR via the new throwOnLaunchPostError replacing checkLaunchPostError Each throw carries enough context (stage name, HIP error string) for the API caller's exception to be actionable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The byte-stride helper landed with K_U32_MAX_AS_I64, K_BF16_BYTES,
and K_FP32_BYTES declared as function-local constexprs. The inner
`check` lambda has empty capture `[]` and uses K_U32_MAX_AS_I64,
which fails to build under clang's strict capture rules:
error: variable 'K_U32_MAX_AS_I64' cannot be implicitly captured
in a lambda with no capture-default specified
(Also tripped modernize-use-auto on the function-local int64_t
declaration.)
Hoist all three constants to the anonymous namespace where they don't
need capture. Compilation regression was missed because the prior
verification used `ninja hipdnn-unit-check`, which builds projects/
hipdnn only — the hip-kernel-provider sources were never recompiled
with the change. Verified now via `ninja hip_kernel_provider_impl`
and `ninja hip_kernel_provider_tests`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- buildPlan: restore numbered stage banners (1-6), per-tensor layout comments above each strides extraction (Q/K/V/O/dO/dQ/dK/dV/Stats), and the // Tensor UIDs / // Tensor objects block labels. The new 6-stage numbering reflects the post-CSV-dispatch flow (resolve and load are now separate stages). - initializeExecutionSettings: revert from throw std::logic_error back to HIPDNN_PLUGIN_LOG_ERROR. The throw fired on the normal path (HipKernelEngine calls this unconditionally from getMaxWorkspaceSize and initializeExecutionContext for every applicable graph), and std::logic_error is the wrong type for "not implemented" anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the hard-coded SDPA backward POC kernel selection in
asm_sdpa_enginewitha runtime, CSV-driven dispatch that mirrors the forward path. The plan builder now
resolves a kernel for each backward stage (
odo,dqdkdv,dq_convert) fromAITER-derived registries, and the plan consumes per-stage tile sizes instead of three
hard-coded constants. Day-one gate is pinned to the verified gfx942 / bf16 / hd128 /
A32 /
NO_MASKmatrix; every other axis is explicitly rejected for follow-up tasksto lift one at a time.
Risk Assessment
Medium (3). Core SDPA bwd dispatch is rewritten and a CSV-to-header generator path
is added to the build, but
isApplicablerejects every configuration outside theverified hd128/bf16 baseline, so observable bwd behavior is unchanged on shipping
configs. Largest residual risks are the forward-path drive-by fix (silent
registry-lookup fall-through, now propagated) and the local
bwd_hd128_odo_bf16.cooverride vs. the AITER snapshot, both documented inline.
Testing Summary
isApplicablerejectionmatrix, and tile / grid math.
in-tree CPU reference.
ninja format).Testing Checklist
TestSdpaBwdPlanBuilder— registry lookup +isApplicablerejection (gfx950,FP8, asymmetric hdim, hd64, FP16, hd192, fractional GQA, non-
NO_MASK) +parameterised pssk/pddv — Status: Passed
TestSdpaBwdPlanGridMath— pins POC hd128/bf16/seqlen=2048 grid dims to thevalues the deleted
K_TS_*constants produced;KernelTilesceil-div edges —Status: Passed
ASICs: gfx942 — Status: Passed
pre-commit run --all-files— Status: Passedninja format+ clang-tidy (braces, switch-default, identifier-naming) —Status: Passed
Technical Changes
fmha_v3_bwdCSVs and.cofromAITER (
9522048dc10de20ba9dcda1c0a3f640867e7a586) intoasm_kernels/{gfx942,gfx950}/fmha_v3_bwd/using AITER's flat layout (no MI300split for bwd). Three CSVs per arch (
odo,dqdkdv,dq_convert); emptydq_shuffle.csvomitted. Per-archSOURCE.mdrecords source URL, commit, andSHA256 manifest. CMake invokes the existing CSV-to-header generator a second time
to produce
asm_fmha_v3_bwd_configs.hpp.SdpaBwdParamsgains aKernelTilesstruct with aconstexpr gridDim(extent)ceil-div helper that returns 0 whents == 0(so unittests can exercise the unpopulated case), plus
odoTiles/dqdkdvTiles/dqConvertTilesfields.SdpaBwdPlan.cppdeletesK_TS_ODO/K_TS_KV/K_TS_DQ, routes grid-x at each stage through_params.<stage>Tiles.gridDim(...),and passes the dqdkdv KV-tile to
buildDqdkdvArgs/validateMhaBwdByteStridesexplicitly.MhaBwdArgsstays a line-for-line mirrorof AITER's
mha_bwd_args(no added fields).SdpaBwdPlanBuilder) — Backward-local enums and helpers ported fromforward (
getMaskType,getRoundingMode,getBatchMode,getDataTypeIdentifier,lookupKernelNameKey).getKernelCoPathuses the flat layout and does notconsult
isMi308Device(). A singleBwdDispatchTupleis computed once andconsumed by both
isApplicableandbuildPlanso per-stage tuples cannot drift;resolveStagereturnsstd::optional<ResolvedKernel>. Includes the gfx950 BF16bf16_cvtspecial case to match forward.isApplicableaccepts gfx942 / batch / atomic32 /pssk=1 / pddv=1 / square hdim /
NO_MASKonly. hd128/bf16 is the verified path;hd192 and FP16 are reachable through registry walk and tile plumbing but gated off
until the in-tree CPU reference is calibrated (ALMIOPEN-1832). gfx950, group mode,
FP8, fractional GQA, non-square hdim, and other masks are explicitly rejected with
a clear log line keyed to the relevant follow-up task (ALMIOPEN-1825 /
ALMIOPEN-1826 / ALMIOPEN-1827 / ALMIOPEN-1833).
dq_accbefore A32 dqdkdv(matches AITER's
torch::zeros; without this, dQ silently corrupts whenseq_kv > ts*2). Reject uint32 byte-stride overflow on every Q/K/V/O/dO/dQ/dK/dV/lse/dq_acc field (was log-and-continue).
hipPeekAtLastError+gridX > 0precheck on every launch.
tensorMap.atlookups inisApplicablewrapped infind()+ return-false.static_assert(offsetof)on load-bearing kernarg fields.A16 path modeled as dormant-but-correct (skip
dq_convertanddq_accwhenresolved
atomic32 == 0); day-one gate stays pinned to A32.initializeExecutionSettingsthrowsstd::logic_errorinstead of emitting aspurious
LOG_ERRORon every engine instantiation.bwd_hd128_odo_bf16.cois the develop POC binary (10280 B),not the AITER snapshot (9344 B). AITER's version produced D-aux values that
diverged from the in-tree CPU reference enough to fail the existing hd128/bf16
correctness test. Documented in
SOURCE.mdso future AITER refreshes know to keepthe override or coordinate a tolerance / reference update.
Out of scope (handed off to sibling tasks)
Workspace sizing for A16 vs A32 (ALMIOPEN-1825),
causal-mask end-to-end correctness (ALMIOPEN-1826),
CPU-reference calibration for hd192/FP16 (ALMIOPEN-1832),
grid-formula full coverage including split-K (ALMIOPEN-1827),
gfx950 enablement (ALMIOPEN-1833),
automated AITER sync / binary versioning (ALMIOPEN-1828).