[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824) by DarylHawkinsAMD · Pull Request #7264 · ROCm/rocm-libraries

DarylHawkinsAMD · 2026-05-11T05:13:35Z

Summary

Replaces the hard-coded SDPA backward POC kernel selection in asm_sdpa_engine with
a runtime, CSV-driven dispatch that mirrors the forward path. The plan builder now
resolves a kernel for each backward stage (odo, dqdkdv, dq_convert) from
AITER-derived registries, and the plan consumes per-stage tile sizes instead of three
hard-coded constants. Day-one gate is pinned to the verified gfx942 / bf16 / hd128 /
A32 / NO_MASK matrix; every other axis is explicitly rejected for follow-up tasks
to lift one at a time.

Reviewer note: filter out .csv and .co files in the GitHub "Files changed" view
to see only relevant code, build, and documentation changes.

Risk Assessment

Medium (3). Core SDPA bwd dispatch is rewritten and a CSV-to-header generator path
is added to the build, but isApplicable rejects every configuration outside the
verified hd128/bf16 baseline, so observable bwd behavior is unchanged on shipping
configs. Largest residual risks are the forward-path drive-by fix (silent
registry-lookup fall-through, now propagated) and the local bwd_hd128_odo_bf16.co
override vs. the AITER snapshot, both documented inline.

Testing Summary

Unit coverage for the new dispatch: registry lookup, isApplicable rejection
matrix, and tile / grid math.
End-to-end correctness on the verified hd128/bf16 backward path against the
in-tree CPU reference.
Lint/format gates clean (clang-tidy, pre-commit, ninja format).

Testing Checklist

TestSdpaBwdPlanBuilder — registry lookup + isApplicable rejection (gfx950,
FP8, asymmetric hdim, hd64, FP16, hd192, fractional GQA, non-NO_MASK) +
parameterised pssk/pddv — Status: Passed
TestSdpaBwdPlanGridMath — pins POC hd128/bf16/seqlen=2048 grid dims to the
values the deleted K_TS_* constants produced; KernelTiles ceil-div edges —
Status: Passed
hd128/bf16 SDPA backward correctness test — ASICs: gfx942 — Status: Passed
hd128/bf16 backward integration smoke (regression baseline) —
ASICs: gfx942 — Status: Passed
pre-commit run --all-files — Status: Passed
ninja format + clang-tidy (braces, switch-default, identifier-naming) —
Status: Passed
PR CI — GitHub PR checks — Status: Pending

Technical Changes

Kernel snapshot & build wiring — Snapshot fmha_v3_bwd CSVs and .co from
AITER (9522048dc10de20ba9dcda1c0a3f640867e7a586) into
asm_kernels/{gfx942,gfx950}/fmha_v3_bwd/ using AITER's flat layout (no MI300
split for bwd). Three CSVs per arch (odo, dqdkdv, dq_convert); empty
dq_shuffle.csv omitted. Per-arch SOURCE.md records source URL, commit, and
SHA256 manifest. CMake invokes the existing CSV-to-header generator a second time
to produce asm_fmha_v3_bwd_configs.hpp.
Tile-size plumbing — SdpaBwdParams gains a KernelTiles struct with a
constexpr gridDim(extent) ceil-div helper that returns 0 when ts == 0 (so unit
tests can exercise the unpopulated case), plus odoTiles / dqdkdvTiles /
dqConvertTiles fields. SdpaBwdPlan.cpp deletes K_TS_ODO / K_TS_KV /
K_TS_DQ, routes grid-x at each stage through _params.<stage>Tiles.gridDim(...),
and passes the dqdkdv KV-tile to buildDqdkdvArgs /
validateMhaBwdByteStrides explicitly. MhaBwdArgs stays a line-for-line mirror
of AITER's mha_bwd_args (no added fields).
Dispatch (SdpaBwdPlanBuilder) — Backward-local enums and helpers ported from
forward (getMaskType, getRoundingMode, getBatchMode, getDataTypeIdentifier,
lookupKernelNameKey). getKernelCoPath uses the flat layout and does not
consult isMi308Device(). A single BwdDispatchTuple is computed once and
consumed by both isApplicable and buildPlan so per-stage tuples cannot drift;
resolveStage returns std::optional<ResolvedKernel>. Includes the gfx950 BF16
bf16_cvt special case to match forward.
Day-one accepted matrix — isApplicable accepts gfx942 / batch / atomic32 /
pssk=1 / pddv=1 / square hdim / NO_MASK only. hd128/bf16 is the verified path;
hd192 and FP16 are reachable through registry walk and tile plumbing but gated off
until the in-tree CPU reference is calibrated (ALMIOPEN-1832). gfx950, group mode,
FP8, fractional GQA, non-square hdim, and other masks are explicitly rejected with
a clear log line keyed to the relevant follow-up task (ALMIOPEN-1825 /
ALMIOPEN-1826 / ALMIOPEN-1827 / ALMIOPEN-1833).
Correctness hardening (PR-review pass) — Zero-init dq_acc before A32 dqdkdv
(matches AITER's torch::zeros; without this, dQ silently corrupts when
seq_kv > ts*2). Reject uint32 byte-stride overflow on every Q/K/V/O/dO/dQ/dK/dV/
lse/dq_acc field (was log-and-continue). hipPeekAtLastError + gridX > 0
precheck on every launch. tensorMap.at lookups in isApplicable wrapped in
find() + return-false. static_assert(offsetof) on load-bearing kernarg fields.
A16 path modeled as dormant-but-correct (skip dq_convert and dq_acc when
resolved atomic32 == 0); day-one gate stays pinned to A32.
initializeExecutionSettings throws std::logic_error instead of emitting a
spurious LOG_ERROR on every engine instantiation.
Local override — bwd_hd128_odo_bf16.co is the develop POC binary (10280 B),
not the AITER snapshot (9344 B). AITER's version produced D-aux values that
diverged from the in-tree CPU reference enough to fail the existing hd128/bf16
correctness test. Documented in SOURCE.md so future AITER refreshes know to keep
the override or coordinate a tolerance / reference update.

Out of scope (handed off to sibling tasks)

Workspace sizing for A16 vs A32 (ALMIOPEN-1825),
causal-mask end-to-end correctness (ALMIOPEN-1826),
CPU-reference calibration for hd192/FP16 (ALMIOPEN-1832),
grid-formula full coverage including split-K (ALMIOPEN-1827),
gfx950 enablement (ALMIOPEN-1833),
automated AITER sync / binary versioning (ALMIOPEN-1828).

Foundation for SDPA backward multi-kernel dispatch (ALMIOPEN-1824 / I8.1 Stream A). Adds CSV-driven config-header generation for the backward pipeline so SdpaBwdPlanBuilder can resolve kernels at runtime instead of the POC's hard-coded selection. * Snapshot fmha_v3_bwd CSVs and .co binaries from AITER (commit 9522048dc10de20ba9dcda1c0a3f640867e7a586) for gfx942 and gfx950, using a flat layout that mirrors AITER's source. Backward has no MI300/MI308 split; the previous MI300/ subfolder is removed. Three CSVs are imported per arch (odo, dqdkdv, dq_convert); the empty fmha_bwd_dq_shuffle.csv is omitted. * Extend asm_sdpa_engine/CMakeLists.txt with a second add_custom_command invoking codegen.py -m fmha_v3_bwd, producing asm_fmha_v3_bwd_configs.hpp alongside the existing fwd header. * Append a frozen KernelTiles skeleton plus odoTiles / dqdkdvTiles / dqConvertTiles fields to SdpaBwdParams so Streams B and C can build against a stable contract. Skeleton is default-initialised; population is Stream B's responsibility. * Add SOURCE.md per arch documenting the AITER source URL, commit, and subdirectory for future binary-versioning automation (I8.7). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the hard-coded K_TS_ODO/K_TS_KV/K_TS_DQ constants in SdpaBwdPlan with per-stage tile sizes plumbed through SdpaBwdParams (ALMIOPEN-1824 / I8.1 Stream C). The plan no longer knows specific kernel tile dimensions; it consumes whatever the builder resolved from the AITER CSV. * SdpaBwdParams: add a constexpr KernelTiles::gridDim(extent) ceil-div helper. Returns 0 when ts is 0 so unit tests can exercise the unpopulated case; runtime callers get a no-op grid in that scenario, which the launch path can detect. * SdpaBwdPlan: delete the three K_TS_* constants, route grid-x at each stage through _params.<stage>Tiles.gridDim(...), and plumb the dqdkdv KV-tile through MhaBwdArgs::ts_dqdkdv for the dqdkdv.Ts byte-stride that previously used K_TS_KV. * TestSdpaBwdPlanGridMath: regression tests pinning the POC config (hd128/bf16/seqlen=2048) grid dims to the values the deleted constants would have produced, plus ceil-div edge cases for the KernelTiles helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the hard-coded SDPA backward POC selection with a registry walk over the AITER-derived configs (ALMIOPEN-1824 / I8.1 Stream B). The plan builder now resolves a kernel for each pipeline stage (odo, dqdkdv, dq_convert) from the on-disk .co binaries, and routes the matched tile sizes into SdpaBwdParams for Plan.cpp to consume. * SdpaBwdPlanBuilder: introduce backward-local enums (MaskType, RoundingMode, BatchMode, AccumulatorMode) and helpers ported from the forward path (getMaskType, getRoundingMode, getBatchMode, getDataTypeIdentifier, computePssk, computePddv, lookupKernelNameKey, getKernelCoPath flat layout). * isApplicable narrows day-one acceptance to the POC kernarg layout: gfx942, batch mode, atomic32+pssk+pddv all 1, bf16 (RTNE bf16_cvt in {0,1,2}) or fp16 (bf16_cvt = 3), square hdim in {64, 128, 192}. Variants outside this set are rejected with a clear log so I8.2 / I8.4.2 / I8.5 / I8.6 follow-ups can lift restrictions one axis at a time. AITER does not ship hd64 with pddv=1 (already aligned), so the effective accepted matrix is hd in {128, 192}; the registry miss surfaces this cleanly. * buildPlan resolves all three stage keys, populates params.{odo,dqdkdv,dqConvert}Tiles, and replaces the three hard-coded co paths and mangled kernel names with the registry values. Forward-style MI300/MI308 sub-folder split is intentionally absent for backward — AITER's bwd source has a flat layout. * TestSdpaBwdPlanBuilder: add registry-lookup, isApplicable rejection, and pssk/pddv parameterised tests; flip FP16 hd128 in the existing IsApplicableSdpaBwdVariations from rejected to accepted; document why hd64 stays rejected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ers/dahawkin/sdpa-bwd-kernel-selection

The merged Stream B/C work was missed by the prior build because the ninja graph was stale until ninja_check forced a reconfigure. Once the new test source linked, clang-tidy surfaced three classes of warnings-as-errors: * SdpaBwdPlanBuilder.cpp: brace single-line `if (...) continue;` bodies (readability-braces-around-statements) and add an explicit default arm to the PipelineStage switch (clang-diagnostic-switch-default). * TestSdpaBwdPlanBuilder.cpp / TestSdpaBwdPlanGridMath.cpp: rename constexpr variables from camelBack (`kFooBar`) to UPPER_CASE (`FOO_BAR`) per .clang-tidy's readability-identifier-naming.ConstexprVariableCase. * TestSdpaBwdPlanBuilder.cpp registry-lookup test for hd192/bf16 causal-tl/batch: the dqdkdv row in AITER's CSV is pssk=1, pddv=1 for that combination, not pddv=0 as the test assumed; the test now matches the CSV. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extend the SDPA backward integration smoke suite to cover the remainder of the day-one accepted dispatch matrix: bf16/fp16 across hd128/hd192 on a single small problem each. The pre-existing hd128/bf16 case stays as the regression baseline. FP16 uses a slightly looser tolerance to accommodate its narrower dynamic range under softmax recomputation. Also drop residual task-tracking references in SdpaBwdPlanBuilder.cpp comments per the project convention that comments describe current state, not change history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verification on gfx942 surfaced two issues that this change resolves: 1. The AITER snapshot of bwd_hd128_odo_bf16.co (9344 bytes) produced D-aux values that diverged from the in-tree CPU backward reference enough to fail the existing hd128/bf16 SDPA backward correctness test. Restore the develop POC binary (10280 bytes) for that one file. SOURCE.md now records this as an explicit local override so future AITER refreshes know to either keep the override or coordinate a tolerance / reference update. 2. The hd192 and FP16 dispatch paths exist in the registry and exercise correctly, but the in-tree CPU reference is not calibrated against those kernel variants — gradient outputs land far outside the existing tolerance. Calibrating that reference is owned by the I8.4.x tasks. Until those land, isApplicable rejects anything other than hd128 / BF16. The PlanBuilder gate now mirrors what we can actually verify end-to-end. The dispatch infrastructure for hd192 and FP16 stays in place (registry walk, tile plumbing, kernel name resolution) and is reachable as soon as the gate is widened. * SdpaBwdPlanBuilder: tighten isApplicable to hd128 + bf16 with rationale comments pointing at the I8.4.x dependency. * TestSdpaBwdPlanBuilder: IsApplicableSdpaBwdVariations now expects hd64, hd192, and FP16 to all return false; comments explain why. * IntegrationGpuSdpaBackward: revert the hd192 and FP16 cases that were added in Phase 3 — they were over-reach into I8.4.x territory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Correctness: - Zero-init dq_acc workspace before A32 DQDKDV launch (AITER's asm frontend uses torch::zeros for dq_accum; without this, dQ silently corrupts whenever seq_kv > ts*2). - Reject uint32 byte-stride overflow on every Q/K/V/O/dO/dQ/dK/dV/lse/ dq_acc field instead of logging-and-continuing. - Reject fractional GQA ratio (nhead_q % nhead_k != 0) in isApplicable. - Add hipPeekAtLastError + gridX>0 precheck per kernel launch. - gfx950 BF16 bf16_cvt special case in registry lookup (matches forward). - Wrap tensorMap.at lookups in isApplicable with find()+return-false. - Fix latent fwd buildPlan bug: silently fell through after a failed registry lookup. - static_assert(offsetof) on load-bearing kernarg fields. Architecture / dispatch: - Extract BwdDispatchTuple computed once and consumed by both isApplicable and buildPlan so the per-stage tuples cannot drift. - resolveStage returns std::optional<ResolvedKernel> instead of three out-references. - Model A16 path as dormant-but-correct: skip dq_convert launch and dq_acc allocation when the resolved row has atomic32==0; gate flip remains pinned to A32 in computeDispatchTuples with a TODO. Maintainability: - Cache getDeviceString on the builder; eliminate duplicated try/catch. - Hoist BF16_CVT_FP16_SENTINEL to SdpaBwdPlanBuilder.hpp. - Drop dead computePssk/computePddv helpers (no compatible registry row in the current matrix). - Drop unused KernelTiles::tsQO field; drop unused archId param from backward getKernelCoPath. - initializeExecutionSettings throws std::logic_error instead of emitting a spurious LOG_ERROR on every engine instantiation. - Comment hygiene; group-mode rejection names the actual graph attribute; getRoundingMode TODO uses ALMIOPEN-1824 tag. Provenance: - SHA256 manifest for gfx942/gfx950 fmha_v3_bwd .co directories (incl. the local-override bwd_hd128_odo_bf16.co). - Reconcile dual-AITER-commit state in asm_kernels/README.md and AsmKernelPath.hpp comment block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per PR review: MhaBwdArgs is documented as a line-for-line mirror of AITER's mha_bwd_args, but I8.1 added a `ts_dqdkdv` field to plumb the KV-tile size in from the dispatch tuple. AITER takes the tile size from the kernel-traits template parameter at the call site, not from the args struct, so adding the field broke the mirroring contract. Pass `tsKv` as an explicit parameter to `buildDqdkdvArgs` and `validateMhaBwdByteStrides` instead. `MhaBwdArgs` is now field-for-field identical to AITER again, and the build helpers stay self-contained (no back-channel into `_params`). Verified on gfx942: 49/50 unit tests pass (1 gfx950 reject test skips on this arch); SdpaFwd/0, SdpaFwd/1, SdpaBwd/0 GPU integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The fwd builder refactor of getKernelNameKey (string key -> config*) and the silent-fall-through fix in SdpaFwdPlanBuilder::buildPlan are out of scope for this bwd-dispatch PR. Revert SdpaFwdPlanBuilder.cpp to its develop state; the silent-fall-through bug will be addressed in its own PR with appropriate test coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… TODOs Bwd dispatch refinements from the second review pass: - Replace getDataTypeIdentifier (returns "" sentinel on unsupported dtype) with tryGetDataTypeIdentifier returning std::optional<std::string>. Matches the tryGetDeviceString convention already in the same file and the broader std::optional idiom across hip-kernel-provider engines. buildPlan now bails on std::nullopt instead of silently propagating an empty key into the registry walk. - Re-tag I8.x TODOs to their Jira tickets under epic ALMIOPEN-1823 so reviewers can navigate directly to the work item: I8.2 -> ALMIOPEN-1825 (A16 vs A32 workspace sizing) I8.4.x -> ALMIOPEN-1832 (CPU reference calibration for hd192/FP16) I8.9 (workspace alignment) left as I8.9 because no child Jira ticket exists yet for that scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clang-tidy misc-const-correctness violations on lines this PR introduced or already touched: - SdpaBwdPlan.cpp: drop two unused using decls (alignUp, K_WORKSPACE_ALIGNMENT_BYTES); const for err, mhaArgs, gdxOdo, gdxDqdkdv, memsetErr, gdxPost. - SdpaKernelUtils.hpp: const for the three hipError_t err locals in launchKernel and HipModuleGuard's destructor / move-assign. - SdpaBwdPlanBuilder.cpp: const for leftAndRightBoundsSet, the eight *Uid locals in isApplicable, the nine *Uid locals in buildPlan, and bf16CvtValue (both call sites). - TestSdpaBwdPlanBuilder.cpp: const for graphWrapper, settings, workspaceSize, applicabilityTests, and the qDims/kDims/vDims/oDims vectors in IsApplicable_RejectsFractionalGqaRatio and IsApplicable_RejectsAsymmetricHdim. Out-of-scope tidy violations in IntegrationGpuSdpaBackward.cpp, HipDeviceUtils.hpp, batchnorm/rmsnorm tests, etc. are not addressed here — they pre-date this PR and belong in their own cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous signature returned a const fmha_v3_bwdConfig* into the codegen-emitted `static CFG cfg_fmha_bwd_*` registries. The pointer was safe in practice (static storage duration, registries never mutated), but the API leaked a lifetime contract the language did not enforce: any future maintainer mutating one of the maps would silently invalidate the cached pointer. Returning std::optional<fmha_v3_bwdConfig> by value decouples the caller from the registry's storage entirely. The struct is small (four std::strings + 11 ints, ~250 bytes), and findConfig is called ~3 times per dispatch resolution — not in any hot path. Matches the optional-based fallible-return idiom already used by tryGetDataTypeIdentifier and tryGetDeviceString in the same file. Call sites updated: - lookupKernelNameKey: const fmha_v3_bwdConfig* cfg = nullptr -> std::optional<fmha_v3_bwdConfig> cfg - buildPlan's resolveStage lambda: parameter type and dereference Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

IPlanBuilder::buildPlan documents 'throws HipdnnPluginException if the plan cannot be built'. SdpaBwdPlanBuilder::buildPlan was instead silently early-returning at every error site, leaving the caller's ExecutionContext without a plan; the next call to ExecutionContextBase::getPlan() then threw a generic HIPDNN_PLUGIN_STATUS_INTERNAL_ERROR with no context, hiding the actual root cause from the API caller. Eight silent-return sites converted: - tryGetDeviceString failure - unsupported tensor dtype combination (isApplicable should have caught) - odo / dqdkdv / dq_convert registry lookup miss (resolveStage lambda now returns ResolvedKernel by value and throws directly; callers no longer need to check the return) - odo / dqdkdv / dq_convert kernel module load failure Most use HIPDNN_PLUGIN_STATUS_INTERNAL_ERROR because isApplicable is the public-facing precondition check; if buildPlan reaches one of these, something upstream has already gone wrong. Each throw carries a message naming the failing operation and the relevant context (arch, dtype, hdim, .co path) so the API caller's exception is actionable. Matches the throw-on-failure pattern used throughout BatchnormPlanBuilder, ApplicabilityChecks, and the rest of the hip-kernel-provider plan builders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… execute By the time SdpaBwdPlan::execute runs, every parameter must already be known good — anything checkable from graph data should have been gated in isApplicable, and anything that does fail at execute time is a genuinely exceptional condition that should throw, not silently return. Changes: - Move the byte-stride uint32 overflow check into SdpaBwdPlanBuilder::isApplicable. A new wouldBwdByteStridesFitUint32 helper extracts strides directly from the graph tensors and computes the derived dq_acc strides from B/H/S/D dims. Tile size for the Ts product comes from a findConfig() call against cfg_fmha_bwd_dqdkdv. Reject the graph (return false) on any overflow. - Add o_tensor_uid lookup in isApplicable (was previously only resolved in buildPlan; needed for the o-tensor stride checks). - Drop the runtime validateMhaBwdByteStrides call from execute — the same condition is now caught at applicability time. Delete the helper and fitsUint32AfterScale + K_U32_MAX_AS_I64 from SdpaBwdPlan.cpp (no remaining callers). - Drop the three checkLaunchPreconditions(gridX==0) calls and the helper itself — gridX==0 was a sentinel for unresolved CSV rows, but buildPlan now throws on resolveStage failure (commit 99292b3), so reaching execute with ts==0 is impossible. - Convert remaining execute early-returns to throws: * workspace==nullptr: BAD_PARAM (caller contract violation — getMaxWorkspaceSize always returns > 0 for bwd) * hipMemsetAsync failure: INTERNAL_ERROR with the HIP error string * hipModuleLaunchKernel failure (3 stages): INTERNAL_ERROR * Async post-launch error (3 stages): INTERNAL_ERROR via the new throwOnLaunchPostError replacing checkLaunchPostError Each throw carries enough context (stage name, HIP error string) for the API caller's exception to be actionable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The byte-stride helper landed with K_U32_MAX_AS_I64, K_BF16_BYTES, and K_FP32_BYTES declared as function-local constexprs. The inner `check` lambda has empty capture `[]` and uses K_U32_MAX_AS_I64, which fails to build under clang's strict capture rules: error: variable 'K_U32_MAX_AS_I64' cannot be implicitly captured in a lambda with no capture-default specified (Also tripped modernize-use-auto on the function-local int64_t declaration.) Hoist all three constants to the anonymous namespace where they don't need capture. Compilation regression was missed because the prior verification used `ninja hipdnn-unit-check`, which builds projects/ hipdnn only — the hip-kernel-provider sources were never recompiled with the change. Verified now via `ninja hip_kernel_provider_impl` and `ninja hip_kernel_provider_tests`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- buildPlan: restore numbered stage banners (1-6), per-tensor layout comments above each strides extraction (Q/K/V/O/dO/dQ/dK/dV/Stats), and the // Tensor UIDs / // Tensor objects block labels. The new 6-stage numbering reflects the post-CSV-dispatch flow (resolve and load are now separate stages). - initializeExecutionSettings: revert from throw std::logic_error back to HIPDNN_PLUGIN_LOG_ERROR. The throw fired on the normal path (HipKernelEngine calls this unconditionally from getMaxWorkspaceSize and initializeExecutionContext for every applicable graph), and std::logic_error is the wrong type for "not implemented" anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DarylHawkinsAMD and others added 8 commits May 10, 2026 12:41

Merge branch 'users/dahawkin/sdpa-bwd-kernel-selection-WIP-C' into us…

9834b94

…ers/dahawkin/sdpa-bwd-kernel-selection

github-actions Bot added documentation project: hip-kernel-provider labels May 11, 2026

assistant-librarian Bot added the organization: ROCm label May 11, 2026

DarylHawkinsAMD changed the title ~~Users/dahawkin/sdpa bwd kernel selection~~ # [hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824) May 11, 2026

DarylHawkinsAMD changed the title ~~# [hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)~~ [hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824) May 11, 2026

DarylHawkinsAMD and others added 9 commits May 11, 2026 16:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)#7264

[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)#7264
DarylHawkinsAMD wants to merge 17 commits into
developfrom
users/dahawkin/sdpa-bwd-kernel-selection

DarylHawkinsAMD commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

DarylHawkinsAMD commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Risk Assessment

Testing Summary

Testing Checklist

Technical Changes

Out of scope (handed off to sibling tasks)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DarylHawkinsAMD commented May 11, 2026 •

edited

Loading