Skip to content

[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)#7264

Draft
DarylHawkinsAMD wants to merge 17 commits into
developfrom
users/dahawkin/sdpa-bwd-kernel-selection
Draft

[hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824)#7264
DarylHawkinsAMD wants to merge 17 commits into
developfrom
users/dahawkin/sdpa-bwd-kernel-selection

Conversation

@DarylHawkinsAMD
Copy link
Copy Markdown
Contributor

@DarylHawkinsAMD DarylHawkinsAMD commented May 11, 2026

Summary

Replaces the hard-coded SDPA backward POC kernel selection in asm_sdpa_engine with
a runtime, CSV-driven dispatch that mirrors the forward path. The plan builder now
resolves a kernel for each backward stage (odo, dqdkdv, dq_convert) from
AITER-derived registries, and the plan consumes per-stage tile sizes instead of three
hard-coded constants. Day-one gate is pinned to the verified gfx942 / bf16 / hd128 /
A32 / NO_MASK matrix; every other axis is explicitly rejected for follow-up tasks
to lift one at a time.

Reviewer note: filter out .csv and .co files in the GitHub "Files changed" view
to see only relevant code, build, and documentation changes.

Risk Assessment

Medium (3). Core SDPA bwd dispatch is rewritten and a CSV-to-header generator path
is added to the build, but isApplicable rejects every configuration outside the
verified hd128/bf16 baseline, so observable bwd behavior is unchanged on shipping
configs. Largest residual risks are the forward-path drive-by fix (silent
registry-lookup fall-through, now propagated) and the local bwd_hd128_odo_bf16.co
override vs. the AITER snapshot, both documented inline.

Testing Summary

  • Unit coverage for the new dispatch: registry lookup, isApplicable rejection
    matrix, and tile / grid math.
  • End-to-end correctness on the verified hd128/bf16 backward path against the
    in-tree CPU reference.
  • Lint/format gates clean (clang-tidy, pre-commit, ninja format).

Testing Checklist

  • TestSdpaBwdPlanBuilder — registry lookup + isApplicable rejection (gfx950,
    FP8, asymmetric hdim, hd64, FP16, hd192, fractional GQA, non-NO_MASK) +
    parameterised pssk/pddv — Status: Passed
  • TestSdpaBwdPlanGridMath — pins POC hd128/bf16/seqlen=2048 grid dims to the
    values the deleted K_TS_* constants produced; KernelTiles ceil-div edges —
    Status: Passed
  • hd128/bf16 SDPA backward correctness test — ASICs: gfx942 — Status: Passed
  • hd128/bf16 backward integration smoke (regression baseline) —
    ASICs: gfx942 — Status: Passed
  • pre-commit run --all-files — Status: Passed
  • ninja format + clang-tidy (braces, switch-default, identifier-naming) —
    Status: Passed
  • PR CI — GitHub PR checks — Status: Pending

Technical Changes

  • Kernel snapshot & build wiring — Snapshot fmha_v3_bwd CSVs and .co from
    AITER (9522048dc10de20ba9dcda1c0a3f640867e7a586) into
    asm_kernels/{gfx942,gfx950}/fmha_v3_bwd/ using AITER's flat layout (no MI300
    split for bwd). Three CSVs per arch (odo, dqdkdv, dq_convert); empty
    dq_shuffle.csv omitted. Per-arch SOURCE.md records source URL, commit, and
    SHA256 manifest. CMake invokes the existing CSV-to-header generator a second time
    to produce asm_fmha_v3_bwd_configs.hpp.
  • Tile-size plumbingSdpaBwdParams gains a KernelTiles struct with a
    constexpr gridDim(extent) ceil-div helper that returns 0 when ts == 0 (so unit
    tests can exercise the unpopulated case), plus odoTiles / dqdkdvTiles /
    dqConvertTiles fields. SdpaBwdPlan.cpp deletes K_TS_ODO / K_TS_KV /
    K_TS_DQ, routes grid-x at each stage through _params.<stage>Tiles.gridDim(...),
    and passes the dqdkdv KV-tile to buildDqdkdvArgs /
    validateMhaBwdByteStrides explicitly. MhaBwdArgs stays a line-for-line mirror
    of AITER's mha_bwd_args (no added fields).
  • Dispatch (SdpaBwdPlanBuilder) — Backward-local enums and helpers ported from
    forward (getMaskType, getRoundingMode, getBatchMode, getDataTypeIdentifier,
    lookupKernelNameKey). getKernelCoPath uses the flat layout and does not
    consult isMi308Device(). A single BwdDispatchTuple is computed once and
    consumed by both isApplicable and buildPlan so per-stage tuples cannot drift;
    resolveStage returns std::optional<ResolvedKernel>. Includes the gfx950 BF16
    bf16_cvt special case to match forward.
  • Day-one accepted matrixisApplicable accepts gfx942 / batch / atomic32 /
    pssk=1 / pddv=1 / square hdim / NO_MASK only. hd128/bf16 is the verified path;
    hd192 and FP16 are reachable through registry walk and tile plumbing but gated off
    until the in-tree CPU reference is calibrated (ALMIOPEN-1832). gfx950, group mode,
    FP8, fractional GQA, non-square hdim, and other masks are explicitly rejected with
    a clear log line keyed to the relevant follow-up task (ALMIOPEN-1825 /
    ALMIOPEN-1826 / ALMIOPEN-1827 / ALMIOPEN-1833).
  • Correctness hardening (PR-review pass) — Zero-init dq_acc before A32 dqdkdv
    (matches AITER's torch::zeros; without this, dQ silently corrupts when
    seq_kv > ts*2). Reject uint32 byte-stride overflow on every Q/K/V/O/dO/dQ/dK/dV/
    lse/dq_acc field (was log-and-continue). hipPeekAtLastError + gridX > 0
    precheck on every launch. tensorMap.at lookups in isApplicable wrapped in
    find() + return-false. static_assert(offsetof) on load-bearing kernarg fields.
    A16 path modeled as dormant-but-correct (skip dq_convert and dq_acc when
    resolved atomic32 == 0); day-one gate stays pinned to A32.
    initializeExecutionSettings throws std::logic_error instead of emitting a
    spurious LOG_ERROR on every engine instantiation.
  • Local overridebwd_hd128_odo_bf16.co is the develop POC binary (10280 B),
    not the AITER snapshot (9344 B). AITER's version produced D-aux values that
    diverged from the in-tree CPU reference enough to fail the existing hd128/bf16
    correctness test. Documented in SOURCE.md so future AITER refreshes know to keep
    the override or coordinate a tolerance / reference update.

Out of scope (handed off to sibling tasks)

Workspace sizing for A16 vs A32 (ALMIOPEN-1825),
causal-mask end-to-end correctness (ALMIOPEN-1826),
CPU-reference calibration for hd192/FP16 (ALMIOPEN-1832),
grid-formula full coverage including split-K (ALMIOPEN-1827),
gfx950 enablement (ALMIOPEN-1833),
automated AITER sync / binary versioning (ALMIOPEN-1828).

DarylHawkinsAMD and others added 8 commits May 10, 2026 12:41
Foundation for SDPA backward multi-kernel dispatch (ALMIOPEN-1824 / I8.1
Stream A). Adds CSV-driven config-header generation for the backward
pipeline so SdpaBwdPlanBuilder can resolve kernels at runtime instead of
the POC's hard-coded selection.

* Snapshot fmha_v3_bwd CSVs and .co binaries from AITER
  (commit 9522048dc10de20ba9dcda1c0a3f640867e7a586) for gfx942 and
  gfx950, using a flat layout that mirrors AITER's source. Backward has
  no MI300/MI308 split; the previous MI300/ subfolder is removed. Three
  CSVs are imported per arch (odo, dqdkdv, dq_convert); the empty
  fmha_bwd_dq_shuffle.csv is omitted.
* Extend asm_sdpa_engine/CMakeLists.txt with a second
  add_custom_command invoking codegen.py -m fmha_v3_bwd, producing
  asm_fmha_v3_bwd_configs.hpp alongside the existing fwd header.
* Append a frozen KernelTiles skeleton plus odoTiles / dqdkdvTiles /
  dqConvertTiles fields to SdpaBwdParams so Streams B and C can build
  against a stable contract. Skeleton is default-initialised; population
  is Stream B's responsibility.
* Add SOURCE.md per arch documenting the AITER source URL, commit, and
  subdirectory for future binary-versioning automation (I8.7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hard-coded K_TS_ODO/K_TS_KV/K_TS_DQ constants in
SdpaBwdPlan with per-stage tile sizes plumbed through SdpaBwdParams
(ALMIOPEN-1824 / I8.1 Stream C). The plan no longer knows specific
kernel tile dimensions; it consumes whatever the builder resolved from
the AITER CSV.

* SdpaBwdParams: add a constexpr KernelTiles::gridDim(extent) ceil-div
  helper. Returns 0 when ts is 0 so unit tests can exercise the
  unpopulated case; runtime callers get a no-op grid in that scenario,
  which the launch path can detect.
* SdpaBwdPlan: delete the three K_TS_* constants, route grid-x at each
  stage through _params.<stage>Tiles.gridDim(...), and plumb the
  dqdkdv KV-tile through MhaBwdArgs::ts_dqdkdv for the dqdkdv.Ts
  byte-stride that previously used K_TS_KV.
* TestSdpaBwdPlanGridMath: regression tests pinning the POC config
  (hd128/bf16/seqlen=2048) grid dims to the values the deleted
  constants would have produced, plus ceil-div edge cases for the
  KernelTiles helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hard-coded SDPA backward POC selection with a registry walk
over the AITER-derived configs (ALMIOPEN-1824 / I8.1 Stream B). The
plan builder now resolves a kernel for each pipeline stage (odo,
dqdkdv, dq_convert) from the on-disk .co binaries, and routes the
matched tile sizes into SdpaBwdParams for Plan.cpp to consume.

* SdpaBwdPlanBuilder: introduce backward-local enums (MaskType,
  RoundingMode, BatchMode, AccumulatorMode) and helpers ported from
  the forward path (getMaskType, getRoundingMode, getBatchMode,
  getDataTypeIdentifier, computePssk, computePddv, lookupKernelNameKey,
  getKernelCoPath flat layout).
* isApplicable narrows day-one acceptance to the POC kernarg layout:
  gfx942, batch mode, atomic32+pssk+pddv all 1, bf16 (RTNE bf16_cvt
  in {0,1,2}) or fp16 (bf16_cvt = 3), square hdim in {64, 128, 192}.
  Variants outside this set are rejected with a clear log so I8.2 /
  I8.4.2 / I8.5 / I8.6 follow-ups can lift restrictions one axis at
  a time. AITER does not ship hd64 with pddv=1 (already aligned),
  so the effective accepted matrix is hd in {128, 192}; the registry
  miss surfaces this cleanly.
* buildPlan resolves all three stage keys, populates
  params.{odo,dqdkdv,dqConvert}Tiles, and replaces the three
  hard-coded co paths and mangled kernel names with the registry
  values. Forward-style MI300/MI308 sub-folder split is intentionally
  absent for backward — AITER's bwd source has a flat layout.
* TestSdpaBwdPlanBuilder: add registry-lookup, isApplicable rejection,
  and pssk/pddv parameterised tests; flip FP16 hd128 in the existing
  IsApplicableSdpaBwdVariations from rejected to accepted; document
  why hd64 stays rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The merged Stream B/C work was missed by the prior build because the
ninja graph was stale until ninja_check forced a reconfigure. Once
the new test source linked, clang-tidy surfaced three classes of
warnings-as-errors:

* SdpaBwdPlanBuilder.cpp: brace single-line `if (...) continue;`
  bodies (readability-braces-around-statements) and add an explicit
  default arm to the PipelineStage switch
  (clang-diagnostic-switch-default).
* TestSdpaBwdPlanBuilder.cpp / TestSdpaBwdPlanGridMath.cpp: rename
  constexpr variables from camelBack (`kFooBar`) to UPPER_CASE
  (`FOO_BAR`) per .clang-tidy's
  readability-identifier-naming.ConstexprVariableCase.
* TestSdpaBwdPlanBuilder.cpp registry-lookup test for hd192/bf16
  causal-tl/batch: the dqdkdv row in AITER's CSV is pssk=1, pddv=1
  for that combination, not pddv=0 as the test assumed; the test now
  matches the CSV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the SDPA backward integration smoke suite to cover the remainder
of the day-one accepted dispatch matrix: bf16/fp16 across hd128/hd192
on a single small problem each. The pre-existing hd128/bf16 case stays
as the regression baseline. FP16 uses a slightly looser tolerance to
accommodate its narrower dynamic range under softmax recomputation.

Also drop residual task-tracking references in
SdpaBwdPlanBuilder.cpp comments per the project convention that
comments describe current state, not change history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verification on gfx942 surfaced two issues that this change resolves:

1. The AITER snapshot of bwd_hd128_odo_bf16.co (9344 bytes) produced D-aux
   values that diverged from the in-tree CPU backward reference enough to
   fail the existing hd128/bf16 SDPA backward correctness test. Restore the
   develop POC binary (10280 bytes) for that one file. SOURCE.md now
   records this as an explicit local override so future AITER refreshes
   know to either keep the override or coordinate a tolerance / reference
   update.
2. The hd192 and FP16 dispatch paths exist in the registry and exercise
   correctly, but the in-tree CPU reference is not calibrated against
   those kernel variants — gradient outputs land far outside the existing
   tolerance. Calibrating that reference is owned by the I8.4.x tasks.
   Until those land, isApplicable rejects anything other than hd128 / BF16.

The PlanBuilder gate now mirrors what we can actually verify end-to-end.
The dispatch infrastructure for hd192 and FP16 stays in place (registry
walk, tile plumbing, kernel name resolution) and is reachable as soon as
the gate is widened.

* SdpaBwdPlanBuilder: tighten isApplicable to hd128 + bf16 with rationale
  comments pointing at the I8.4.x dependency.
* TestSdpaBwdPlanBuilder: IsApplicableSdpaBwdVariations now expects hd64,
  hd192, and FP16 to all return false; comments explain why.
* IntegrationGpuSdpaBackward: revert the hd192 and FP16 cases that were
  added in Phase 3 — they were over-reach into I8.4.x territory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness:
- Zero-init dq_acc workspace before A32 DQDKDV launch (AITER's asm
  frontend uses torch::zeros for dq_accum; without this, dQ silently
  corrupts whenever seq_kv > ts*2).
- Reject uint32 byte-stride overflow on every Q/K/V/O/dO/dQ/dK/dV/lse/
  dq_acc field instead of logging-and-continuing.
- Reject fractional GQA ratio (nhead_q % nhead_k != 0) in isApplicable.
- Add hipPeekAtLastError + gridX>0 precheck per kernel launch.
- gfx950 BF16 bf16_cvt special case in registry lookup (matches forward).
- Wrap tensorMap.at lookups in isApplicable with find()+return-false.
- Fix latent fwd buildPlan bug: silently fell through after a failed
  registry lookup.
- static_assert(offsetof) on load-bearing kernarg fields.

Architecture / dispatch:
- Extract BwdDispatchTuple computed once and consumed by both
  isApplicable and buildPlan so the per-stage tuples cannot drift.
- resolveStage returns std::optional<ResolvedKernel> instead of three
  out-references.
- Model A16 path as dormant-but-correct: skip dq_convert launch and
  dq_acc allocation when the resolved row has atomic32==0; gate flip
  remains pinned to A32 in computeDispatchTuples with a TODO.

Maintainability:
- Cache getDeviceString on the builder; eliminate duplicated try/catch.
- Hoist BF16_CVT_FP16_SENTINEL to SdpaBwdPlanBuilder.hpp.
- Drop dead computePssk/computePddv helpers (no compatible registry
  row in the current matrix).
- Drop unused KernelTiles::tsQO field; drop unused archId param from
  backward getKernelCoPath.
- initializeExecutionSettings throws std::logic_error instead of
  emitting a spurious LOG_ERROR on every engine instantiation.
- Comment hygiene; group-mode rejection names the actual graph
  attribute; getRoundingMode TODO uses ALMIOPEN-1824 tag.

Provenance:
- SHA256 manifest for gfx942/gfx950 fmha_v3_bwd .co directories
  (incl. the local-override bwd_hd128_odo_bf16.co).
- Reconcile dual-AITER-commit state in asm_kernels/README.md and
  AsmKernelPath.hpp comment block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DarylHawkinsAMD DarylHawkinsAMD changed the title Users/dahawkin/sdpa bwd kernel selection # [hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824) May 11, 2026
@DarylHawkinsAMD DarylHawkinsAMD changed the title # [hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824) [hipDNN][SDPA] CSV-driven multi-kernel dispatch for SDPA backward (ALMIOPEN-1824) May 11, 2026
DarylHawkinsAMD and others added 9 commits May 11, 2026 16:13
Per PR review: MhaBwdArgs is documented as a line-for-line mirror of
AITER's mha_bwd_args, but I8.1 added a `ts_dqdkdv` field to plumb the
KV-tile size in from the dispatch tuple.  AITER takes the tile size from
the kernel-traits template parameter at the call site, not from the args
struct, so adding the field broke the mirroring contract.

Pass `tsKv` as an explicit parameter to `buildDqdkdvArgs` and
`validateMhaBwdByteStrides` instead.  `MhaBwdArgs` is now field-for-field
identical to AITER again, and the build helpers stay self-contained
(no back-channel into `_params`).

Verified on gfx942: 49/50 unit tests pass (1 gfx950 reject test skips on
this arch); SdpaFwd/0, SdpaFwd/1, SdpaBwd/0 GPU integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fwd builder refactor of getKernelNameKey (string key -> config*)
and the silent-fall-through fix in SdpaFwdPlanBuilder::buildPlan are
out of scope for this bwd-dispatch PR. Revert SdpaFwdPlanBuilder.cpp
to its develop state; the silent-fall-through bug will be addressed
in its own PR with appropriate test coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… TODOs

Bwd dispatch refinements from the second review pass:

- Replace getDataTypeIdentifier (returns "" sentinel on unsupported
  dtype) with tryGetDataTypeIdentifier returning std::optional<std::string>.
  Matches the tryGetDeviceString convention already in the same file
  and the broader std::optional idiom across hip-kernel-provider engines.
  buildPlan now bails on std::nullopt instead of silently propagating
  an empty key into the registry walk.

- Re-tag I8.x TODOs to their Jira tickets under epic ALMIOPEN-1823
  so reviewers can navigate directly to the work item:
    I8.2  -> ALMIOPEN-1825 (A16 vs A32 workspace sizing)
    I8.4.x -> ALMIOPEN-1832 (CPU reference calibration for hd192/FP16)
  I8.9 (workspace alignment) left as I8.9 because no child Jira ticket
  exists yet for that scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
clang-tidy misc-const-correctness violations on lines this PR introduced
or already touched:

- SdpaBwdPlan.cpp: drop two unused using decls (alignUp,
  K_WORKSPACE_ALIGNMENT_BYTES); const for err, mhaArgs, gdxOdo,
  gdxDqdkdv, memsetErr, gdxPost.
- SdpaKernelUtils.hpp: const for the three hipError_t err locals in
  launchKernel and HipModuleGuard's destructor / move-assign.
- SdpaBwdPlanBuilder.cpp: const for leftAndRightBoundsSet, the eight
  *Uid locals in isApplicable, the nine *Uid locals in buildPlan, and
  bf16CvtValue (both call sites).
- TestSdpaBwdPlanBuilder.cpp: const for graphWrapper, settings,
  workspaceSize, applicabilityTests, and the qDims/kDims/vDims/oDims
  vectors in IsApplicable_RejectsFractionalGqaRatio and
  IsApplicable_RejectsAsymmetricHdim.

Out-of-scope tidy violations in IntegrationGpuSdpaBackward.cpp,
HipDeviceUtils.hpp, batchnorm/rmsnorm tests, etc. are not addressed
here — they pre-date this PR and belong in their own cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous signature returned a const fmha_v3_bwdConfig* into the
codegen-emitted `static CFG cfg_fmha_bwd_*` registries.  The pointer
was safe in practice (static storage duration, registries never
mutated), but the API leaked a lifetime contract the language did not
enforce: any future maintainer mutating one of the maps would silently
invalidate the cached pointer.

Returning std::optional<fmha_v3_bwdConfig> by value decouples the
caller from the registry's storage entirely.  The struct is small
(four std::strings + 11 ints, ~250 bytes), and findConfig is called
~3 times per dispatch resolution — not in any hot path.  Matches the
optional-based fallible-return idiom already used by
tryGetDataTypeIdentifier and tryGetDeviceString in the same file.

Call sites updated:
- lookupKernelNameKey: const fmha_v3_bwdConfig* cfg = nullptr -> std::optional<fmha_v3_bwdConfig> cfg
- buildPlan's resolveStage lambda: parameter type and dereference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IPlanBuilder::buildPlan documents 'throws HipdnnPluginException if the
plan cannot be built'.  SdpaBwdPlanBuilder::buildPlan was instead
silently early-returning at every error site, leaving the caller's
ExecutionContext without a plan; the next call to
ExecutionContextBase::getPlan() then threw a generic
HIPDNN_PLUGIN_STATUS_INTERNAL_ERROR with no context, hiding the actual
root cause from the API caller.

Eight silent-return sites converted:
- tryGetDeviceString failure
- unsupported tensor dtype combination (isApplicable should have caught)
- odo / dqdkdv / dq_convert registry lookup miss (resolveStage lambda
  now returns ResolvedKernel by value and throws directly; callers no
  longer need to check the return)
- odo / dqdkdv / dq_convert kernel module load failure

Most use HIPDNN_PLUGIN_STATUS_INTERNAL_ERROR because isApplicable is
the public-facing precondition check; if buildPlan reaches one of
these, something upstream has already gone wrong.  Each throw carries
a message naming the failing operation and the relevant context (arch,
dtype, hdim, .co path) so the API caller's exception is actionable.

Matches the throw-on-failure pattern used throughout
BatchnormPlanBuilder, ApplicabilityChecks, and the rest of the
hip-kernel-provider plan builders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… execute

By the time SdpaBwdPlan::execute runs, every parameter must already be
known good — anything checkable from graph data should have been gated
in isApplicable, and anything that does fail at execute time is a
genuinely exceptional condition that should throw, not silently return.

Changes:

- Move the byte-stride uint32 overflow check into
  SdpaBwdPlanBuilder::isApplicable.  A new wouldBwdByteStridesFitUint32
  helper extracts strides directly from the graph tensors and computes
  the derived dq_acc strides from B/H/S/D dims.  Tile size for the
  Ts product comes from a findConfig() call against cfg_fmha_bwd_dqdkdv.
  Reject the graph (return false) on any overflow.
- Add o_tensor_uid lookup in isApplicable (was previously only resolved
  in buildPlan; needed for the o-tensor stride checks).
- Drop the runtime validateMhaBwdByteStrides call from execute — the
  same condition is now caught at applicability time.  Delete the
  helper and fitsUint32AfterScale + K_U32_MAX_AS_I64 from SdpaBwdPlan.cpp
  (no remaining callers).
- Drop the three checkLaunchPreconditions(gridX==0) calls and the
  helper itself — gridX==0 was a sentinel for unresolved CSV rows, but
  buildPlan now throws on resolveStage failure (commit 99292b3), so
  reaching execute with ts==0 is impossible.
- Convert remaining execute early-returns to throws:
  * workspace==nullptr: BAD_PARAM (caller contract violation —
    getMaxWorkspaceSize always returns > 0 for bwd)
  * hipMemsetAsync failure: INTERNAL_ERROR with the HIP error string
  * hipModuleLaunchKernel failure (3 stages): INTERNAL_ERROR
  * Async post-launch error (3 stages): INTERNAL_ERROR via the new
    throwOnLaunchPostError replacing checkLaunchPostError

Each throw carries enough context (stage name, HIP error string) for
the API caller's exception to be actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The byte-stride helper landed with K_U32_MAX_AS_I64, K_BF16_BYTES,
and K_FP32_BYTES declared as function-local constexprs.  The inner
`check` lambda has empty capture `[]` and uses K_U32_MAX_AS_I64,
which fails to build under clang's strict capture rules:

    error: variable 'K_U32_MAX_AS_I64' cannot be implicitly captured
    in a lambda with no capture-default specified

(Also tripped modernize-use-auto on the function-local int64_t
declaration.)

Hoist all three constants to the anonymous namespace where they don't
need capture.  Compilation regression was missed because the prior
verification used `ninja hipdnn-unit-check`, which builds projects/
hipdnn only — the hip-kernel-provider sources were never recompiled
with the change.  Verified now via `ninja hip_kernel_provider_impl`
and `ninja hip_kernel_provider_tests`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- buildPlan: restore numbered stage banners (1-6), per-tensor layout
  comments above each strides extraction (Q/K/V/O/dO/dQ/dK/dV/Stats),
  and the // Tensor UIDs / // Tensor objects block labels. The new
  6-stage numbering reflects the post-CSV-dispatch flow (resolve and
  load are now separate stages).
- initializeExecutionSettings: revert from throw std::logic_error back
  to HIPDNN_PLUGIN_LOG_ERROR. The throw fired on the normal path
  (HipKernelEngine calls this unconditionally from getMaxWorkspaceSize
  and initializeExecutionContext for every applicable graph), and
  std::logic_error is the wrong type for "not implemented" anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant