[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM #8136

Closed
ozturkosu wants to merge 31 commits into muozturk/dispatcher-gemm-bridge from muozturk/dispatcher-streamk-gemm-bridge

Conversation

ozturkosu commented Jun 5, 2026

Summary

Routes the stream_k GEMM variant through the same Tile Engine (TE) →
Dispatcher bridge already landed for regular GEMM (Phase 1, #8123) and grouped
GEMM (Phase 3, #8130). Goal of the overall effort: the Dispatcher is the single
source of truth for codegen/build/runtime, and TE only produces configs +
benchmarks.

This PR is stacked on muozturk/dispatcher-gemm-bridge (#8123) — please merge
that first. Its own diff is just two commits:

  1. [CK_TILE] Add stream_k variant to GEMM Dispatcher codegen (cherry-picked)
  2. [CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM (this work)

What Stream-K needs that regular GEMM doesn't

Stream-K is a single-problem GEMM (one A/B/C, one M/N/K) with the same C ABI
as regular GEMM, so the Python side (GpuGemmRunner / GemmDispatcherLib /
GemmProblem) and the GPU worker are reused unchanged. The differences are
internal to the .so:

  • The generated launch has a Stream-K signature
    SelectedKernel::launch(const ck_tile::StreamKHostArgs&, const stream_config&),
    which allocates the reduction workspace internally (DeviceMem) and uses the
    Atomic reduction strategy.
  • The registry path (generated_tile_backend.hpp::run()) hard-codes the
    single-problem GemmHostArgs launch and won't compile against a Stream-K
    SelectedKernel. So the Stream-K ctypes lib bypasses the registry and calls
    SelectedKernel::launch(args, stream) directly, reporting the name from the
    KERNEL_NAME macro (same approach grouped uses).

Changes

New

  • dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp — same single-problem
    C ABI (dispatcher_run_gemm(A,B,C,M,N,K,time_ms)); hipMalloc + copy A/B,
    memset C=0 (Atomic accumulates into C), build StreamKHostArgs with rcr
    strides (stride_A=K, stride_B=K, stride_C=N, k_batch=1), launch, copy C back.
    Returns 0 / -1 (HIP or throw) / -2 (kernel reports args unsupported).
  • tile_engine/ops/gemm/streamk_gemm_full_benchmark.py — 3-phase driver
    (expand configs → setup_multiple_gemm_dispatchers build → subprocess-isolated
    benchmark), mirroring gemm_full_benchmark.py with variant="stream_k".
  • tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py — disposable GPU worker
    (identical to the regular worker since the ABI matches).
  • tile_engine/ops/gemm/gemm_streamk/configs/default_config.json — small sweep
    config (128x128x{32,64}, 2x2x1, 32x32x16, compv3/compv4, intrawave, cshuffle,
    pad true, persistent false) → 4 kernels.
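
To make the C ABI contract above concrete, here is a minimal Python sketch of the rcr stride choice and the return-code meanings. The helper names (`rcr_strides`, `describe_status`, `STATUS_MEANING`) are hypothetical and do not come from the PR; only the stride values and the 0 / -1 / -2 meanings are taken from the description.

```python
from typing import NamedTuple


class Strides(NamedTuple):
    stride_a: int
    stride_b: int
    stride_c: int


def rcr_strides(m: int, n: int, k: int) -> Strides:
    """Leading dimensions for rcr (A row-major, B column-major, C row-major)."""
    # A is M x K row-major      -> consecutive rows are K elements apart.
    # B is K x N column-major   -> consecutive columns are K elements apart.
    # C is M x N row-major      -> consecutive rows are N elements apart.
    return Strides(stride_a=k, stride_b=k, stride_c=n)


# Return codes as listed in the PR description (the table name is hypothetical).
STATUS_MEANING = {
    0: "ok",
    -1: "HIP error or exception",
    -2: "kernel reports arguments unsupported",
}


def describe_status(code: int) -> str:
    """Translate a dispatcher_run_gemm return code into a readable message."""
    return STATUS_MEANING.get(code, f"unknown status {code}")
```

A caller would typically treat -2 as "skip this kernel for this shape" rather than as a hard failure, which matches how the bridge surfaces unsupported shapes.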

Modified

  • dispatcher/python/gemm_utils.py: _ctypes_source_name() selects
    streamk_gemm_ctypes_lib.cpp for variant=="stream_k" (in both
    _build_compile_jobs and setup_multiple_gemm_dispatchers); .name appends
    _streamk; variant threaded into codegen_args and expand_sweep.
  • dispatcher/python/ctypes_utils.py — pass the requested variant to codegen
    --variants instead of hard-coding "standard".
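
The routing described in the two bullets above could look roughly like the sketch below. It is an illustrative restatement, not the PR's code: the function names are hypothetical, and the fallback to gemm_ctypes_lib.cpp for every non-Stream-K variant is an assumption (other variants may well have their own bridge sources).

```python
from typing import List


def ctypes_source_name_sketch(variant: str) -> str:
    """Pick the bridge source file for a GEMM variant (hypothetical helper)."""
    if variant == "stream_k":
        return "streamk_gemm_ctypes_lib.cpp"
    # Assumption: everything else uses the regular single-problem bridge.
    return "gemm_ctypes_lib.cpp"


def codegen_variant_args(variant: str) -> List[str]:
    """Forward the requested variant to codegen instead of a constant 'standard'."""
    return ["--variants", variant]
```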

Validation (gfx942 / MI300X, fp16 / rcr)

Numeric parity vs a numpy fp32 reference (A.f32 @ B.f32). Stream-K's Atomic
reduction does multiple fp16 atomic-adds (one per K-split partial) vs
regular/grouped's single fp32→fp16 store, so it is inherently noisier; tolerance
is widened to max_rel ≤ 2.5e-3, frob_rel ≤ 1.5e-3 (regular/grouped use 5e-4).

| Shape (M,N,K) | status | TFLOPS | max_rel | frob_rel | result |
| --- | --- | --- | --- | --- | --- |
| 1024 x 1024 x 1024 | 0 | 23.1 | 1.10e-3 | 5.47e-4 | PASS |
| 2048 x 2048 x 2048 | 0 | 129.5 | 7.02e-4 | 3.72e-4 | PASS |
| 512 x 768 x 4096 | 0 | 24.6 | 2.01e-3 | 8.19e-4 | PASS |
| 256 x 256 x 256 | 0 | 2.1 | 3.44e-4 | 2.08e-4 | PASS |
| 4096 x 4096 x 1024 | 0 | 149.8 | 5.89e-4 | 2.65e-4 | PASS |
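
For readers who want to reproduce the two error metrics, the following is a minimal standard-library sketch. It assumes max_rel is the global max|out-ref| / max|ref| and frob_rel is the ratio of Frobenius norms; the second definition is an assumption, since the PR does not spell it out. The function names are hypothetical, and inputs are flattened sequences rather than numpy arrays.

```python
import math
from typing import Sequence


def max_rel(out: Sequence[float], ref: Sequence[float]) -> float:
    """Global max|out - ref| divided by max|ref| (ref must not be all zeros)."""
    max_err = max(abs(o - r) for o, r in zip(out, ref))
    max_ref = max(abs(r) for r in ref)
    return max_err / max_ref


def frob_rel(out: Sequence[float], ref: Sequence[float]) -> float:
    """||out - ref||_F / ||ref||_F (assumed definition of frob_rel)."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


def passes(out: Sequence[float], ref: Sequence[float],
           max_rel_tol: float = 2.5e-3, frob_rel_tol: float = 1.5e-3) -> bool:
    """Apply the widened Stream-K tolerances quoted in the description."""
    return max_rel(out, ref) <= max_rel_tol and frob_rel(out, ref) <= frob_rel_tol
```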

Full TE driver run: 4 kernels x 4 problems = 16/16 OK, 0 failures. The
default problem set targets Stream-K's sweet spot (squares plus a large-K skinny shape):

  • 1024 x 1024 x 1024
  • 2048 x 2048 x 2048
  • 4096 x 4096 x 4096
  • 512 x 512 x 8192 (large-K skinny)

All status 0, positive TFLOPS, nonzero output. Name parity holds end-to-end:
the runtime name reported by each .so equals
GemmKernelConfig(variant="stream_k").name, ending in _streamk.

Unsupported-shape handling: a tiny 257^3 problem is correctly reported as
unsupported by the kernel (status -2, too few tiles to partition across CUs)
and surfaced gracefully by the bridge — not a crash.

Test plan

  • unified_gemm_codegen.py ... --variants stream_k emits a *_streamk.hpp
    whose stem == GemmKernelConfig(variant="stream_k").name
  • setup_multiple_gemm_dispatchers builds the Stream-K config set → .so
    compiles & links against streamk_gemm_ctypes_lib.cpp
  • Numeric parity (table above) passes under the fp16 Atomic tolerance
  • Full driver run 16/16 OK; name parity verified
  • Unsupported tiny shape returns status -2 gracefully

Next

Land #8123, then this; afterwards delete the legacy tile_engine/ops/gemm_streamk/ machinery (Phase 4).


Update 2026-06-12 — brought current with #8123 + Copilot fixes

This branch had forked at #8123's first commit, so it lacked every later
regular-GEMM bridge improvement. Merged the current #8123 HEAD and ported the
Stream-K-specific analogues (the Stream-K bridge keeps its own driver, worker and
ctypes lib, so those fixes do not arrive via the merge):

  • Benchmark-param parity / perf-methodology fix (streamk_gemm_ctypes_lib.cpp):
    benchmark knobs defaulted to warmup=3/repeat=10 — a cold, un-ramped clock, the
    root cause of the regular bridge's spurious "perf gap." Now default to old-TE's
    warmup=50/repeat=100, env-overridable via
    CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. rotating_count stays 1 for
    Stream-K: the Atomic preprocess re-zeros only the original C buffer, so rotating
    C would leave rotated copies un-zeroed and corrupt the accumulation.
  • --verify correctness gate (driver + worker): opt-in fp32 numpy reference
    check (global max|out-ref|/max|ref|, verified/max_rel in the CSV); a
    mismatch counts as a failure.
  • Multi-GPU benchmarking (driver): fans the (kernel × problem) work across all
    visible GPUs via device-pinned HIP_VISIBLE_DEVICES workers (--devices, device
    CSV column); also fixes a latent proc-unbound error in the batch handler.
  • --dtype/--layout guards (driver): constrained to the supported fp16/rcr
    surface so a mismatch fails fast.
  • Copilot codegen nits: std::stoi → std::stoll for M/N/K in
    03_streamk_gemm_driver.cpp; stride-aware C zeroing via hipMemset2DAsync
    (CLayout-aware, checked HIP status) in _launch_function_streamk.
  • README: new Stream-K bridge subsection.

Validation status: DONE on gfx942/MI300X (ctr-cx64-mi300x-4, enroot container). Bridge build+benchmark+--verify on gemm_streamk/configs/default_config.json = 16/16 OK, all verified (max_rel ≤ 3.0e-3, fp16 atomic tol), name parity holds. Bridge-vs-Old-TE parity (perf + correctness, byte-identical device kernel) posted as a comment with the full table + streamk_bridge_oldTE.csv.

Muhammed Ozturk and others added 3 commits June 5, 2026 02:48
Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the
dispatcher way: a single-GEMM launch(args, stream) that allocates the
reduction workspace internally via DeviceMem (GetWorkSpaceSize /
SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an
atomic-reduction preprocess that resets C between timed iterations. No
external workspace pointer (not the Tile Engine way).

- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants
  stream_k, naming, includes, _launch_function_streamk, variant->operator
  map, cshuffle-only config selection, and A/B/CLayout export in the
  CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel
  driver that calls SelectedKernel::launch and verifies vs reference_gemm.

Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic
128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50:
dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
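
As a sanity check on the parity figure quoted in the commit message above, the TFLOPS value can be recomputed from the shape and latency. The sketch assumes the conventional 2·M·N·K FLOP count for GEMM, which is consistent with the quoted number; the function name is hypothetical.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput in TFLOPS, counting one multiply and one add per MAC."""
    flops = 2.0 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


# 3840 x 4096 x 2048 at 0.242 ms, as reported for the dispatcher path.
print(f"{gemm_tflops(3840, 4096, 2048, 0.242):.0f} TFLOPS")  # prints "266 TFLOPS"
```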
Routes the stream_k GEMM variant through the same TE->Dispatcher bridge as
regular GEMM (Phase 1) and grouped GEMM (Phase 3). Stream-K is a single-problem
GEMM with the same C ABI as regular GEMM, so the Python runner side is reused
unchanged; only the .so internals and variant routing differ.

- streamk_gemm_ctypes_lib.cpp (new): same single-problem C ABI
  (dispatcher_run_gemm) but builds a ck_tile::StreamKHostArgs and calls
  SelectedKernel::launch(args, stream) directly, bypassing the registry (whose
  generated backend hard-codes the GemmHostArgs launch signature). The launch
  allocates the Atomic-reduction workspace internally; C is zeroed per run.
- gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for
  variant "stream_k"; .name appends _streamk; variant threaded through
  codegen_args and expand_sweep.
- ctypes_utils.py: pass the requested variant to codegen --variants instead of
  hard-coding "standard".
- TE driver/worker/config (new): streamk_gemm_full_benchmark.py,
  run_one_streamk_gemm_kernel.py, gemm_streamk/configs/default_config.json.

Validated end-to-end on gfx942/MI300X: full driver run 16/16 OK (4 kernels x 4
problems), name parity holds (.so name == config .name, ends _streamk). Numeric
parity vs fp32 reference passes under an fp16 Atomic-reduction tolerance
(max_rel <= 2.5e-3) which is wider than regular/grouped because Atomic does
multiple fp16 atomic-adds per K-split. Tiny problems (e.g. 257^3) are correctly
reported unsupported (status -2) by the kernel and surfaced gracefully.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR extends the Composable Kernel Tile Engine (TE) → Dispatcher “single source of truth” bridge to the Stream-K GEMM variant by adding Stream-K-aware codegen, a dedicated ctypes bridge that launches Stream-K kernels directly, and TE-side benchmark/worker plumbing that reuses the existing single-problem GEMM Python ABI.

Changes:

  • Add Stream-K as a first-class GEMM codegen variant (--variants stream_k), including Stream-K kernel includes and a Stream-K launch wrapper that allocates/reuses internal workspace.
  • Introduce a Stream-K-specific ctypes bridge (streamk_gemm_ctypes_lib.cpp) that bypasses the registry and calls SelectedKernel::launch(StreamKHostArgs, stream_config) directly while keeping the same C ABI as standard GEMM.
  • Add TE benchmark driver/worker and a small default sweep config for Stream-K benchmarking via the dispatcher bridge.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| projects/composablekernel/tile_engine/ops/gemm/streamk_gemm_full_benchmark.py | New 3-phase TE benchmark driver for Stream-K using variant="stream_k" and subprocess isolation. |
| projects/composablekernel/tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py | New subprocess worker to run Stream-K kernels via the existing GpuGemmRunner ABI. |
| projects/composablekernel/tile_engine/ops/gemm/gemm_streamk/configs/default_config.json | New default Stream-K sweep config JSON (small config set). |
| projects/composablekernel/dispatcher/python/gemm_utils.py | Route Stream-K builds to the Stream-K ctypes source; thread variant into codegen args and sweep expansion. |
| projects/composablekernel/dispatcher/python/ctypes_utils.py | Pass requested variant into unified_gemm_codegen.py --variants (instead of hard-coding standard). |
| projects/composablekernel/dispatcher/examples/gemm/cpp/03_streamk_gemm_driver.cpp | New minimal standalone Stream-K driver demonstrating the “dispatcher way” launch path. |
| projects/composablekernel/dispatcher/codegen/unified_gemm_codegen.py | Add stream_k variant: naming, includes, launch wrapper, config filtering, CLI choices. |
| projects/composablekernel/dispatcher/codegen/arch_filter.py | Add operator type and constraints entry for Stream-K GEMM validation. |
| projects/composablekernel/dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp | New ctypes bridge for Stream-K that launches the force-included kernel directly and reports KERNEL_NAME. |

Comment on lines +527 to 533
+ # All configs in a sweep share one variant; route to the matching bridge lib.
  ctypes_source = (
-     _cu.get_dispatcher_root() / "bindings" / "ctypes" / "gemm_ctypes_lib.cpp"
+     _cu.get_dispatcher_root()
+     / "bindings"
+     / "ctypes"
+     / _ctypes_source_name(configs[0].variant)
  )

Comment on lines +797 to +802
    if constexpr (ReductionStrategy == ck_tile::StreamKReductionStrategy::Atomic) {{
        (void)hipMemsetAsync(args.e_ptr, 0,
                             args.M * args.N * sizeof(CDataType), stream.stream_id_);
    }} else {{
        workspace_dev.SetZero();
    }}

ozturkosu added a commit that referenced this pull request Jun 10, 2026
Adds a "Variant scope" section clarifying that the bridge is one shared,
variant-aware driver (not per-variant driver copies), that only gemm_universal
is wired and validated through the bridge on this PR, and that the
gemm_multi_d/gemm_preshuffle/grouped_gemm configs/ dirs are scaffolding
following the per-variant convention -- not yet working support. Notes that
grouped GEMM and stream-K are separate bridge efforts (#8136 stream-K).
ozturkosu and others added 3 commits June 12, 2026 02:48
The Stream-K bridge (#8136) was branched at #8123's first commit, so it
lacked all subsequent regular-GEMM bridge improvements (arch-validated
tile filtering, the develop merge + legacy gemm_universal retirement,
benchmark-param/--verify work on the shared driver, README). Merge the
current #8123 HEAD to pick those up; the Stream-K-specific analogues that
live in the duplicated driver/worker/ctypes lib are ported in follow-up
commits.

Sole conflict: dispatcher/python/gemm_utils.py variant threading. Kept
the Stream-K routing (_ctypes_source_name -> streamk_gemm_ctypes_lib.cpp,
.name _streamk suffix, variant through codegen_args/expand_sweep) and
adopted #8123's explanatory comment.
The Stream-K bridge keeps its own driver, worker and ctypes lib, so the
regular-GEMM bridge improvements that landed on #8123 after this branch
forked did not arrive via the merge. Port the Stream-K-specific analogues:

- streamk_gemm_ctypes_lib.cpp: benchmark knobs now default to old-TE's
  warmup=50/repeat=100 (was 3/10 -- a cold, un-ramped clock, the root of
  #8123's spurious "perf gap") and are env-overridable via
  CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Unlike the regular path,
  rotating_count defaults to 1: the Atomic preprocess re-zeros only the
  original C buffer, so rotating C would corrupt the accumulation.
- streamk_gemm_full_benchmark.py: fan the (kernel x problem) work across
  every visible GPU (device-pinned HIP_VISIBLE_DEVICES workers, --devices,
  device CSV column), add the --verify/--verify-tol fp32-reference gate, and
  constrain --dtype/--layout to the supported fp16/rcr surface. Also fixes a
  latent proc-unbound error in the batch handler.
- run_one_streamk_gemm_kernel.py: add the fp32 numpy reference check
  (global max|out-ref|/max|ref|, verified/max_rel) behind --verify.
- README: document the Stream-K bridge driver/worker, flags, _streamk name
  suffix, fp16 Atomic tolerance, and the rotating_count divergence.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
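
The multi-GPU fan-out described in the commit message above can be sketched as follows. The PR does not say how work is scheduled across devices, so the round-robin assignment here is an assumption, and both function names are hypothetical. The only detail taken from the source is that each worker is pinned through HIP_VISIBLE_DEVICES.

```python
import itertools
import os
from typing import Dict, List, Mapping, Optional, Sequence, Tuple

Problem = Tuple[int, int, int]


def assign_round_robin(kernels: Sequence[str], problems: Sequence[Problem],
                       devices: Sequence[str]) -> Dict[str, List[Tuple[str, Problem]]]:
    """Spread every (kernel, problem) pair across the given device ids."""
    if not devices:
        raise ValueError("at least one device is required")
    plan: Dict[str, List[Tuple[str, Problem]]] = {d: [] for d in devices}
    work = itertools.product(kernels, problems)
    for (kernel, problem), dev in zip(work, itertools.cycle(devices)):
        plan[dev].append((kernel, problem))
    return plan


def worker_env(device: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Environment for a worker subprocess that should see exactly one GPU."""
    env = dict(os.environ if base is None else base)
    env["HIP_VISIBLE_DEVICES"] = device
    return env
```

A driver would pass `worker_env(dev)` as the `env` argument of `subprocess.run` when launching each disposable worker, so a crash on one GPU cannot take the others down.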
Two Copilot findings on the Stream-K codegen carried in from #8094:

- 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi)
  before narrowing to ck_tile::index_t; stoi throws std::out_of_range past
  INT_MAX, needlessly rejecting large GEMM sizes.
- unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's
  per-iteration C reset zeroed args.M*args.N as a flat contiguous block,
  which skips elements when C has a padded leading dimension and corrupts
  the accumulation. Zero the used MxN region honoring stride_E via
  hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status
  instead of discarding it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ozturkosu


Stream-K Bridge vs Old TE — MI300X parity (perf + validation)

Ran the requested comparison on ctr-cx64-mi300x-4 (AMD Instinct MI300X / gfx942, inside the enroot container, ROCm/HIP 7.13), using tile_engine/ops/gemm/gemm_streamk/configs/default_config.json, which yields 4 kernels: {compv3,compv4} × 128x128x{32,64}, 2x2x1, 32x32x16, atomic reduction, pad-all, persistent=false, fp16 / rcr.

Shapes: 1024³, 2048³, 4096³, 512×512×8192 (the bridge's default Stream-K problem set).

Method

  • Bridge: streamk_gemm_full_benchmark.py default_config.json --verify (this PR; warmup=50/repeat=100 — the new streamk_gemm_ctypes_lib.cpp defaults).
  • Old TE: develop-branch standalone benchmark_gemm_streamk_* binaries (CMake trimmed to gemm+gemm_streamk only). Same warmup/repeat; perf with -verify=0, correctness with -verify=1 separately (the atomic kernel forces repeat=1 cold under verify). Verified Old TE's -flush_cache/-rotating_count change results <1% here, so the numbers are apples-to-apples.
  • The generated device kernel is byte-identical between the two paths (same StreamKKernel codegen); only host-side workspace ownership / harness differs.

Validation

Both paths correct on all 16 measurements. Bridge: 16/16 verified=True, max_rel ≤ 3.0e-3 (within the fp16 atomic tolerance; Atomic does multiple fp16 atomic-adds so it is inherently noisier than a single fp32→fp16 store). Old TE: 16/16 correct.

Performance (TFLOPS, Bridge vs Old TE)

| kernel | shape | Bridge | Old TE | Δ |
| --- | --- | --- | --- | --- |
| compv3 128x128x32 | 1024³ | 24.08 | 23.77 | +1.3% |
| compv3 128x128x32 | 2048³ | 136.93 | 137.02 | −0.1% |
| compv3 128x128x32 | 4096³ | 260.39 | 257.44 | +1.1% |
| compv3 128x128x32 | 512×512×8192 | 38.42 | 38.20 | +0.6% |
| compv4 128x128x32 | 1024³ | 30.40 | 32.11 | −5.3% |
| compv4 128x128x32 | 2048³ | 149.63 | 162.88 | −8.1% |
| compv4 128x128x32 | 4096³ | 285.70 | 288.01 | −0.8% |
| compv4 128x128x32 | 512×512×8192 | 55.04 | 46.90 | +17.4% |
| compv3 128x128x64 | 1024³ | 34.09 | 34.26 | −0.5% |
| compv3 128x128x64 | 2048³ | 168.62 | 168.10 | +0.3% |
| compv3 128x128x64 | 4096³ | 324.34 | 323.57 | +0.2% |
| compv3 128x128x64 | 512×512×8192 | 49.18 | 49.79 | −1.2% |
| compv4 128x128x64 | 1024³ | 52.78 | 54.47 | −3.1% |
| compv4 128x128x64 | 2048³ | 195.18 | 206.78 | −5.6% |
| compv4 128x128x64 | 4096³ | 240.81 | 241.91 | −0.5% |
| compv4 128x128x64 | 512×512×8192 | 81.41 | 66.66 | +22.1% |

Takeaway

  • Functional parity: confirmed — identical correctness on a byte-identical device kernel.
  • Performance: at parity within measurement noise. All 8 compv3 points are within ±1.3%. The compv4 points scatter wider (−8%…+22%) but center near parity; since the device kernel is byte-identical, these are harness / GPU-clock-state artifacts (standalone Old-TE benchmark process vs in-harness bridge), the same effect root-caused for regular GEMM in [CK_TILE] Add Tile Engine -> Dispatcher bridge for GEMM #8123 — not a real kernel speed difference. The largest spread is on the small/skinny + compv4 cases, which are the most clock/launch-overhead sensitive.

Raw data: streamk_bridge_oldTE.csv (16 rows; columns: kernel, M, N, K, bridge_tflops, oldte_tflops, Δ%, bridge_latency_ms, oldte_latency_ms, bridge_max_rel, bridge_verify, oldte_verify).
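
The Δ column appears to be the bridge throughput relative to Old TE, that is (bridge − old TE) / old TE. That reading is an inference from the numbers, not something the comment states; the short sketch below recomputes three rows under that assumption (the function name is hypothetical).

```python
def delta_percent(bridge_tflops: float, oldte_tflops: float) -> float:
    """Relative difference of the bridge vs Old TE, in percent."""
    return (bridge_tflops - oldte_tflops) / oldte_tflops * 100.0


rows = [
    ("compv3 128x128x32 1024^3", 24.08, 23.77),
    ("compv4 128x128x32 2048^3", 149.63, 162.88),
    ("compv4 128x128x64 512x512x8192", 81.41, 66.66),
]
for label, bridge, oldte in rows:
    print(f"{label}: {delta_percent(bridge, oldte):+.1f}%")
# prints +1.3%, -8.1% and +22.1% for the three rows, matching the table
```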

Note: validated on gfx942/MI300X (this comparison). The PR's own gfx942 numbers in the description above were re-confirmed: 16/16 OK, name parity, all verified.

…uction

Previously the stream-K codegen hard-coded the Atomic reduction strategy. This
makes the reduction strategy a first-class config axis so linear and tree
reductions can be generated, named, and selected alongside atomic:

- unified_gemm_codegen.py: add reduction_strategy to KernelConfig; encode it in
  key_name (redux_*) and KernelNaming.generate (atomic keeps the bare "_streamk"
  suffix for name parity, linear/tree are disambiguated); _launch_function_streamk
  now emits the config's StreamKReductionStrategy (the existing reset lambda
  already zeroes C for atomic vs the workspace for linear/tree);
  _get_configs_for_variant iterates strategies from a new streamk_config section,
  which is added to the default config (atomic, linear, tree).
- gemm_utils.py: GemmKernelConfig gains reduction_strategy, threaded into .name,
  to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly
  the requested strategy) and expand_sweep (reduction-strategy sweep axis).
- ctypes_utils.py: reduction_strategy field on KernelConfig for end-to-end parity.

arch_specs.json intentionally unchanged: stream-K reuses the standard warp-tile
combos and arch_filter reads no stream-K-specific keys, so adding them would be
dead data.

Validated on gfx942 (MI300X): atomic/linear/tree each codegen + compile + run and
pass fp32 verification (max_rel 3.1e-4) at 256x256x4096.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ozturkosu


Stream-K reduction strategy is now a codegen axis (d326bdc283)

Previously the Stream-K codegen hard-coded the Atomic reduction strategy. This change makes reduction_strategy a first-class config axis so atomic / linear / tree reductions can each be generated, named, and selected.

Codegen (unified_gemm_codegen.py)

  • reduction_strategy field on KernelConfig
  • encoded in key_name() (redux_*) and KernelNaming.generate() — atomic keeps the bare _streamk suffix (name parity with the original single-strategy bridge); linear/tree are disambiguated (_streamk_linear, _streamk_tree)
  • _launch_function_streamk() now emits the config's StreamKReductionStrategy (the existing reset hook already zeroes C for atomic vs the device workspace for linear/tree)
  • _get_configs_for_variant(STREAM_K) iterates a new streamk_config.reduction_strategy, added to the default config as [atomic, linear, tree]

Bridge (gemm_utils.py): GemmKernelConfig.reduction_strategy is threaded into .name, to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly the requested strategy), and expand_sweep (sweep axis). ctypes_utils.py: field on its KernelConfig for parity.

arch_specs.json intentionally unchanged — Stream-K reuses the standard warp-tile combos and arch_filter reads no Stream-K-specific keys, so adding them would be dead data.
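
The naming rule described above (atomic keeps the bare suffix, linear and tree are disambiguated) can be summarized in a few lines. This is an illustrative sketch with hypothetical helper names, not the code from KernelNaming.generate(); only the three suffix strings come from the comment.

```python
_STREAMK_SUFFIX = {
    "atomic": "_streamk",          # bare suffix, for parity with the original bridge
    "linear": "_streamk_linear",
    "tree": "_streamk_tree",
}


def streamk_suffix(reduction_strategy: str) -> str:
    """Return the kernel-name suffix for a Stream-K reduction strategy."""
    try:
        return _STREAMK_SUFFIX[reduction_strategy]
    except KeyError:
        raise ValueError(f"unknown reduction strategy: {reduction_strategy!r}") from None


def streamk_kernel_name(base_name: str, reduction_strategy: str = "atomic") -> str:
    """Append the strategy suffix to an already-built base kernel name."""
    return base_name + streamk_suffix(reduction_strategy)
```

Keeping atomic on the bare suffix means names produced before the reduction axis existed stay valid.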

Validation (gfx942 / MI300X)

Built + ran one kernel per strategy through the isolated worker with fp32 reference verification at 256x256x4096:

| strategy | kernel | verified | max_rel |
| --- | --- | --- | --- |
| atomic | ..._streamk | yes | 3.1e-4 |
| linear | ..._streamk_linear | yes | 3.1e-4 |
| tree | ..._streamk_tree | yes | 3.1e-4 |

Scope

This stays within the bridge architecture. It does not add the deep dispatcher-core integration (KernelKey / KernelInstance / Problem / Dispatcher + a generic Stream-K backend + ML selectability); that remains a separate effort.

…rategy fields

First slice of moving Stream-K into the dispatcher core (registry-addressable),
per the deep-core checklist. Additive and inert by default:

- KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm
  gains streamk / reduction_strategy / workspace. tie() includes them so the
  three strategies are distinct keys. encode_identifier() appends the Stream-K
  suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with
  unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk
  so non-Stream-K identifiers are unchanged.
- Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k().

Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical;
atomic/linear/tree suffixes correct; tie() distinguishes strategies.
Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile
unchanged:
- get_workspace_size(Problem) -> bytes (default 0)
- run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default
  forwards to the existing no-workspace run().

The Dispatcher invokes these through a base KernelInstance* pointer (so the new
overload is visible despite derived 6-arg run() overrides). The Stream-K backend
(PR-C) overrides both to size and bind the reduction workspace.

Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the
pre-existing pure virtuals compiles; default get_workspace_size==0 and the
workspace-run forwards correctly via base pointer.
…tree codegen

Adds the C++ backend that lets Stream-K ride the registry, plus the reduction
strategy codegen needed to generate the three variants on this branch.

- generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps
  a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the
  ABI-incompatible args the GemmHostArgs path could not). supports() gates on
  Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the
  registry and the Dispatcher's first-fit selection picks the requested one.
  create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>.
- codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field,
  key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree"
  (matches KernelKey::encode_identifier from PR-A), per-strategy
  StreamKReductionStrategy in the generated launch, and a streamk_config sweep
  axis. (Ported from the bridge branch reduction-strategy work.)

PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those
to Dispatcher::run() via get_workspace_size()/the workspace-aware run().

Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584
tree headers with correct names; the backend device-compiles (22s) against a
generated header and supports() accepts the matching strategy while rejecting
the others and non-Stream-K problems.
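
The supports() gate and first-fit selection described above can be modeled with a small Python sketch. The real implementation is C++ (GeneratedStreamKKernelInstance and the Dispatcher); the classes and function below are hypothetical stand-ins. The sketch also assumes that regular kernels ignore the Stream-K request fields, which the commit message does not state.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProblemSketch:
    streamk: bool = False
    reduction_strategy: Optional[str] = None


@dataclass(frozen=True)
class KernelEntry:
    name: str
    streamk: bool = False
    reduction_strategy: Optional[str] = None

    def supports(self, problem: ProblemSketch) -> bool:
        if not self.streamk:
            # Assumption: regular kernels do not inspect Stream-K fields.
            return True
        # Stream-K kernels accept only Stream-K problems with a matching strategy.
        return problem.streamk and problem.reduction_strategy == self.reduction_strategy


def first_fit(registry: Sequence[KernelEntry],
              problem: ProblemSketch) -> Optional[KernelEntry]:
    """Return the first registered kernel whose supports() accepts the problem."""
    for kernel in registry:
        if kernel.supports(problem):
            return kernel
    return None
```

Because selection is first-fit, the three strategies can coexist only if each one rejects the other two, which is what the strategy comparison in supports() provides.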
…pace

Relocate the Stream-K reduction-workspace buffer from the per-call generated
launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived
dispatcher stops paying a hipMalloc/hipFree on every invocation.

- codegen: hoist the StreamKGemmKernel type to struct scope and add
  GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace)
  overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the
  bridge ctypes lib and the standalone 03 driver keep working.
- backend: override get_workspace_size() and the workspace-aware run(); the
  no-workspace run() delegates with a null buffer. The per-iteration reset stays
  in the backend (it needs CDataType + the reduction strategy).
- dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out
  of the public header), size it via get_workspace_size(), and pass it through
  run_fused()/run_explicit(); free it in the destructor. Atomic needs none
  (size 0 -> null -> internal path); linear/tree consume the owned buffer.

Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at
unchanged perf, with linear/tree now running on the dispatcher-owned workspace.
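
The grow-on-demand policy can be illustrated with a host-side Python analogue. A bytearray stands in for the device allocation (the real code uses HIP allocations behind a raw void*/size_t pair), and the class name and allocation counter are hypothetical additions for illustration.

```python
from typing import Optional


class GrowOnDemandWorkspace:
    """Reuse one buffer across calls; reallocate only when a larger one is needed."""

    def __init__(self) -> None:
        self._buf: Optional[bytearray] = None
        self.allocations = 0  # counts (re)allocations, for illustration only

    def acquire(self, size_bytes: int) -> Optional[memoryview]:
        if size_bytes == 0:
            # Size 0 maps to a null workspace (the atomic case in the commit message).
            return None
        if self._buf is None or len(self._buf) < size_bytes:
            # Stand-in for freeing the old device buffer and allocating a larger one.
            self._buf = bytearray(size_bytes)
            self.allocations += 1
        return memoryview(self._buf)[:size_bytes]
```

With this policy a long-lived dispatcher pays for an allocation only when a problem needs more workspace than any earlier one did.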
…river

Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path
(Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on
reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch ->
verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls
SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the
registry selection and the Dispatcher-owned workspace.

Selectable strategy via --strategy {atomic,linear,tree}. Validated on
MI210/gfx90a for all three (distinct registry identifiers, each PASS).
…K backend

The dispatcher-wrapper generator emitted ONE template for every variant:
backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy
on the key. For Stream-K that is wrong twice over -- the regular backend calls
launch(GemmHostArgs,...) which the SK kernel struct does not have (so the
aggregate register_all_kernels.hpp would not compile against SK), and the key
omits the SK fields so encode_identifier() emits no _streamk suffix and
atomic/linear/tree collide in the registry.

Make the wrapper variant-aware: for STREAM_K configs include
generated_tile_backend_streamk.hpp, set key.algorithm.streamk +
reduction_strategy + workspace (and pad flags for identifier parity), and return
create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>.
All other variants are unchanged.

Validated on MI210/gfx90a: a registry populated via the generated wrappers holds
atomic+linear+tree side by side; Dispatcher::run() selects each by
Problem::reduction_strategy and all three verify vs reference_gemm.
…are atomic reset

P2: GeneratedStreamKKernelInstance::supports() now ends with
SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs
MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs
is rejected during selection, so first-fit falls back to a non-Stream-K kernel
instead of throwing std::runtime_error at launch.

P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync
(pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a
flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage
for the contiguous rcr case. Applied to both the internal and external-workspace
launch overloads.
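
The difference between the flat and the stride-aware reset can be seen in a host-side simulation. The sketch below models a row-major C with a padded leading dimension as a bytearray and uses the pitch / width / height values given above; it does not call HIP, and all function names are hypothetical.

```python
def memset2d_params(m: int, n: int, stride_e: int, elem_size: int) -> dict:
    """Pitch/width/height in the form described for the 2D reset (row-major C)."""
    return {"pitch": stride_e * elem_size, "width": n * elem_size, "height": m}


def zero_flat(buf: bytearray, m: int, n: int, elem_size: int) -> None:
    """Old behavior: zero M*N elements as one contiguous block."""
    nbytes = m * n * elem_size
    buf[:nbytes] = bytes(nbytes)


def zero_2d(buf: bytearray, pitch: int, width: int, height: int) -> None:
    """New behavior: zero `width` bytes at the start of each of `height` rows."""
    for row in range(height):
        start = row * pitch
        buf[start:start + width] = bytes(width)


def used_region_is_zero(buf: bytearray, pitch: int, width: int, height: int) -> bool:
    """True when every byte of the used M x N region is zero."""
    return all(not any(buf[r * pitch:r * pitch + width]) for r in range(height))
```

With M=4, N=3, stride_E=5 and 2-byte elements, the flat reset clears only the first 24 bytes, so the tail of row 2 and all of row 3 keep stale values, while the 2D reset clears every used element and leaves the padding untouched.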

Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a
multi-kernel registry; valid small problems are accepted (no false-negatives).
The bridge dispatcher's tile-divisibility gate rejected any problem where
M % TileM != 0 for every layout, returning status -2 ("No suitable kernel")
at runtime even though the .so built fine. This wrongly excluded bf16 rcr/rrr
kernels with a non-power-of-two TileM (e.g. 192) on standard shapes like
1024^3 -- cases Old-TE compiles, runs, and verifies as correct.

Root cause: supports() was layout-blind, while the underlying
ck_tile::GemmKernel::IsSupportedArgument only constrains a dimension when an
operand whose inner axis is that dimension participates without padding:

  RowMajor A -> K, ColMajor A -> M
  RowMajor B -> N, ColMajor B -> K
  RowMajor C -> N, ColMajor C -> M

So for rcr (RowMajor A & C) M is never gated, which is why Old-TE runs M=192
tiles on M-indivisible problems.

Make supports() compute require_m/n/k from the kernel key's A/B/C layouts so
it mirrors IsSupportedArgument exactly (also honoring k_batch in the K grain).
Anything it now lets through is still validated by the kernel's own
IsSupportedArgument inside launch(), so the bridge stays a strict functional
equivalent of Old-TE. Applied to both generated_tile_backend.hpp (the GEMM
.so path) and the sibling tile_backend.hpp.
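
The gating rule above can be sketched in a few lines of Python. This is an illustration of the layout table only, not the C++ in generated_tile_backend.hpp: it assumes no operand is padded, uses the one-letter layout tags from kernel names (r = RowMajor, c = ColMajor), and treats the K grain as TileK * k_batch, which is an assumption of this sketch.

```python
# Inner (contiguous) axis per operand and layout, as listed in the table above.
INNER_AXIS = {
    ("A", "r"): "K", ("A", "c"): "M",
    ("B", "r"): "N", ("B", "c"): "K",
    ("C", "r"): "N", ("C", "c"): "M",
}

def required_dims(layout):
    """Dimensions that must divide evenly for a layout string such as 'rcr'."""
    return {INNER_AXIS[(operand, tag)] for operand, tag in zip("ABC", layout)}

def supports(layout, tile, problem, k_batch=1):
    """Layout-aware divisibility gate (hypothetical helper, unpadded case only)."""
    tile_m, tile_n, tile_k = tile
    m, n, k = problem
    grain = {"M": tile_m, "N": tile_n, "K": tile_k * k_batch}
    size = {"M": m, "N": n, "K": k}
    return all(size[d] % grain[d] == 0 for d in required_dims(layout))
```

For example, `supports("rcr", (192, 128, 64), (1024, 1024, 1024))` passes because rcr never gates M, while a tile with N=192 is still rejected at N=1024.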

Validated on gfx942 (MI300X): 85 rcr/rrr bf16 192-tile .so files that previously
returned status -2 now run at 1024^3 (Old-TE runs the same, verification
correct); the 8 remaining rejects are tile N=192 cases that Old-TE also reports
as "Arguments not supported" at N=1024 -- parity preserved in both directions.
…oding rcr

dispatcher_initialize() in gemm_ctypes_lib.cpp hardcoded the KernelKey layout to
rcr (RowMajor/ColMajor/RowMajor) for every kernel. Now that supports() is
layout-aware, that wrong key layout makes the dispatcher reject valid problems:
a crr kernel does not gate K (neither A=ColMajor nor B=RowMajor has K as its
inner axis), but with a hardcoded rcr key supports() applies rcr's K-gate and
returns status -2 for TileK=192 problems (e.g. crr 64x64x192 at 1024^3) that
Old-TE compiles, runs, and verifies (~87 TFLOPS).

Derive signature.layout_a/b/c from the force-included kernel's own
ALayout/BLayout/CLayout types via std::is_same_v with tensor_layout::gemm::RowMajor.
The key now matches the kernel, so the layout-aware gate is correct for all four
layouts. Execution was already layout-correct (the kernel uses its own compile-time
layouts); only the host-side selection metadata was wrong.

Validated on gfx942 (MI300X): crr 64x64x192 now runs on the bridge (93 TFLOPS),
restoring parity with Old-TE.
ozturkosu added 11 commits June 17, 2026 22:22
For a full ~2000-stem sweep on a single GPU: batch all shapes into one worker
call per side (5x fewer process startups), cache the compiled old-TE .so, and
add a parallel --build-only pre-pass so hipcc compilation uses all CPU cores
while GPU measurement stays serial.
…eductionStrategy)

Close two review nits on the Stream-K drivers:
- Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so
  large GEMM dimensions no longer exceed the int range and throw (Copilot nit).
- Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04
  driver through it, removing the driver-local strategy_name() duplicate so
  callers share one spelling that matches the codegen suffix scheme.
Adds dispatcher_test_streamk_registry, a GPU test that generates the three
reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the
04 registry driver once per strategy (each force-including its own header, since
SkReductionStrategy is a compile-time constexpr), and asserts for each that the
encode_identifier() suffix matches, the Dispatcher selects it by
Problem::reduction_strategy, and the result verifies against the reference.

This converts the previously manual deep-core validation into a regression-
guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only
CI is unaffected.
…IBRARY_PATH

meas()/meas_all() built the worker env without /opt/rocm/lib on
LD_LIBRARY_PATH, so run_one_gemm_kernel.py failed to load every .so
("libamdhip64.so.7: cannot open shared object file") and every cell
came back nan. Set it the same way ab_efficient_sweep.py does.
… guard)

The bridge-vs-old-TE A/B reported phantom regressions from two MEASUREMENT
bugs, not real codegen gaps:

- ab_same_harness.py built the old-TE side WITHOUT the TE codegen flags the
  bridge (and real old-TE's own CMake) use, so -enable-post-misched defaulted
  back on and old-TE ran ~10-40% faster -> the bridge looked regressed when it
  is at parity. Now both sides build with identical flags.

- ab_efficient_sweep.py measured whatever libgemm_<stem>.so existed with no
  freshness check, so 3-day-old binaries built from an obsolete codegen showed
  up as -78%/+703% gaps. Added a guard: skip any .so older than its generated
  header (treated as missing) instead of reporting a phantom gap.
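
A freshness guard of this kind can be sketched with file modification times. This is not the code in ab_efficient_sweep.py; `is_fresh` is a hypothetical helper, and treating a missing header as "not fresh" is an assumption of the sketch.

```python
import os

def is_fresh(so_path, header_path):
    """True only if the .so exists and is at least as new as its generated header."""
    if not (os.path.exists(so_path) and os.path.exists(header_path)):
        return False  # a missing file is treated like a stale build
    return os.path.getmtime(so_path) >= os.path.getmtime(header_path)
```

A sweep would then skip (rather than measure) any stem for which `is_fresh` returns False, so a binary built from obsolete codegen never contributes a data point.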

With both fixes the 41 former >15% outlier stems measure within +/-10%
(median +0.01%); no bridge codegen regression exists.

Note: a separate, deliberately UNCOMMITTED perf change in gemm_utils.py (gate
-enable-post-misched=0 on persistent) gives non-persistent large tiles ~9-40%;
held back pending a broader persistent-kernel no-regression sweep.
… driver

The standalone stream-K driver verified atomic results with the single-pass
GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split
partials directly into low-precision C (workspace size 0), incurring rounding
error that grows with the split factor -- correct results were flagged FAIL on
small-M/N, large-K shapes (e.g. 512x512x8192) where tiles < CUs.

Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from
the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with
the split-K CDataType accumulation term, and take the max with the per-split
tolerance. The driver and tile_engine now verify identically; the kernel is
unchanged.
…gine

The standalone stream-K driver built its stream_config as {stream, true, 0,
warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults
(flush_cache=false, rotating_count=1). The tile_engine benchmark instead times
with flush_cache=true and rotating_count=1000, so the driver measured a
warm-cache best case while tile_engine measured cold-cache -- the entire source
of the reported dispatcher-vs-TE "performance gap" at low tile counts.

Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine
values) and pass them through stream_config so both sides use identical timing
methodology. A validating run still times a single cold shot, mirroring
tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0
pass.
The 04 registry driver hardcoded the KernelKey signature to DataType::FP16
and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the
wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and
layout tags from the generated kernel's actual A/B/C types via compile-time
dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate
in fp32 and write fp16 C, matching Tile Engine).

Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent
core objects built once; per-dtype codegen + build + verify with per-dtype
identifier assertions). All four datatypes register, dispatch, and verify
across atomic/linear/tree on gfx942 (MI300X).
Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten
deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace
virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry +
validation driver). 3-way merge over the shared stream_k ancestor; only the
streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp
conflicted -- both resolved to the deep-core side:
- codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize +
  IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib
  calls, and adds the 3-arg dispatcher-owned-workspace launch.
- driver takes deep-core's stoll parse + apples-to-apples timing + validate cold shot.

Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly,
so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree
+ regular gemm all generate cleanly (0 failed).
Build the Stream-K bridge .so without the dispatcher static lib and with
TE-streamk-matching flags:
- compile flags per-variant: Stream-K matches tile_engine/ops/gemm_streamk
  CMake (only -Wno-* + --offload-compress, NO -mllvm codegen flags and NOT
  -enable-noalias-to-md-conversion=0 which is a gemm_universal-bridge concern).
  This keeps the A/B fair; the regular path is unchanged.
- link skips libck_tile_dispatcher.a for Stream-K (the ctypes lib launches the
  force-included kernel directly, no registry/dispatcher symbols), and the
  build guard no longer requires the static lib for Stream-K.
- ensure build/examples exists before hipcc writes there (the cmake build that
  normally creates it is skipped on the Stream-K path).

Validated on MI300X (gfx942): atomic/linear/tree fp16/rcr all build, run, and
verify PASS (max_rel_err <8e-4) through the bridge GpuGemmRunner path.
Correct the Stream-K compile flags to match Tile Engine's gemm_streamk build
verbatim (ground truth: a TE streamk build's compile_commands.json). The
-mllvm codegen flags come from the composablekernel project-root
add_compile_options applied globally to the TE benchmark -- they are NOT in the
per-target options, so the earlier "minimal flags" assumption was wrong and
would have produced a phantom A/B gap (different occupancy). Flags now:
  -std=c++20 -fno-offload-uniform-block
  -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0
  -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false
  --offload-compress
-enable-post-misched=0 is applied unconditionally (TE does so for streamk),
and -enable-noalias-to-md-conversion=0 is not used (TE streamk omits it).
@ozturkosu

Contributor Author

Update: ported the Stream-K bridge onto the deep-core #8094 engine + TE parity

Reworked this PR so the Stream-K TE→Dispatcher bridge runs on the rewritten #8094 "deep-core" Stream-K engine instead of the original codegen.

What changed

  • Merged the deep-core Stream-K codegen from #8094 (feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen) into the bridge (3-way merge over the shared 36eb9bdc0e ancestor; KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, dispatcher-owned reduction workspace, registry + validation driver). The codegen now emits the struct-scope kernel type with GetWorkSpaceSize/IsSupported, keeps the 2-arg internal-workspace launch() the bridge ctypes lib calls, and adds the 3-arg dispatcher launch.
  • The bridge streamk_gemm_ctypes_lib.cpp keeps launching the force-included kernel directly (no registry/dispatcher symbols), so the .so is self-contained — link no longer needs libck_tile_dispatcher.a, and the build creates build/examples itself.
  • Fair-parity flags: the Stream-K .so now matches Tile Engine's gemm_streamk compile flags exactly (ground truth from a TE build's compile_commands.json): -std=c++20 -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --offload-compress. (-enable-post-misched=0 applied unconditionally, matching TE for streamk; -enable-noalias-to-md-conversion=0 is not used.)

Validation (fresh builds both sides, same node compiler, warmup50/repeat100, flush off, rotating 1, gpu timer, median-of-3 interleaved; verify in a separate pass):

  • Codegen + on-GPU run + verify PASS for atomic / linear / tree.
  • MI350 (gfx950): full parity — 12/12 cases (atomic/linear/tree × 1024³/2048³/4096³/512×512×8192) |gap%| ≤ 0.59%, all correct, no gfx950 flag changes needed.
  • MI300X (gfx942): representative sweep at parity (worst real case ≈ tree −3.8%); a full sweep over the entire TE Stream-K config file (1080 fp16/rcr configs) is running, with automatic standalone re-measurement of any |gap|>15% to remove multi-GPU concurrency artifacts. Aggregate table to follow.

Add a bf16 codec to the bridge runner so bf16 Stream-K kernels can run through
the same ctypes path (the ABI is void*+sizeof, so 2-byte bf16 shares the fp16
path; only the bit pattern differs). Dtype is inferred from the kernel name.
ENCODE is round-to-nearest-even to bf16 bits; DECODE is bit-exact to device
bf16_t so the numpy reference multiplies the same values the GPU does.
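
For reference, a scalar sketch of such a codec using only the standard library. It is not the runner's implementation (which operates on arrays); the function names are hypothetical, and the NaN handling is a choice made for this sketch.

```python
import struct

def bf16_encode(x):
    """float -> bf16 bit pattern (0..0xFFFF), round-to-nearest-even on the dropped 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF  # keep NaN a NaN after truncation
    # Add 0x7FFF plus the lowest kept bit so exact ties round to the even pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_decode(h):
    """bf16 bit pattern -> float; exact, since bf16 is the top half of an fp32."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]
```

Building the host reference from `bf16_decode(bf16_encode(x))` rather than from `x` means the reference multiplies the same rounded values that the kernel reads.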
@ozturkosu ozturkosu marked this pull request as ready for review June 28, 2026 00:25
@ozturkosu ozturkosu requested review from a team as code owners June 28, 2026 00:25
…ing)

Make the Stream-K bridge layout-generic instead of rcr-hardcoded, so all 4 A/B/C
layouts (rcr/rrr/ccr/crr) work end to end:
- streamk_gemm_ctypes_lib.cpp: derive stride_A/B/C at compile time from the
  kernel's ALayout/BLayout/CLayout (RowMajor RxC -> ld=C, ColumnMajor -> ld=R)
  instead of the hardcoded K/K/N.
- generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
- GpuGemmRunner: read dtype AND layout off the kernel name; arrange each operand
  per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is
  now memory-order-preserving so column-major operands stay column-major.
- run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
- streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS now rcr/rrr/ccr/crr,
  SUPPORTED_DTYPES fp16+bf16 (fp8/bf8/int8 still need runner codecs).
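
The stride rule in the first bullet can be written out as a small Python sketch. The C++ in streamk_gemm_ctypes_lib.cpp derives the same values at compile time from the layout types; the helper names here are hypothetical, and the layout string uses the kernel-name tags (r = RowMajor, c = ColumnMajor).

```python
def leading_dim(rows, cols, tag):
    """Leading dimension of a rows x cols operand: cols if RowMajor, rows if ColumnMajor."""
    return cols if tag == "r" else rows

def gemm_strides(layout, m, n, k):
    """(stride_A, stride_B, stride_C) for A[M x K], B[K x N], C[M x N]."""
    a_tag, b_tag, c_tag = layout
    return (
        leading_dim(m, k, a_tag),
        leading_dim(k, n, b_tag),
        leading_dim(m, n, c_tag),
    )
```

For rcr this reduces to (K, K, N), which is exactly the triple that used to be hardcoded; the other three layouts differ in at least one entry.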
@ozturkosu

Contributor Author

Update: removed rcr hardcoding — bridge is now layout-generic (rcr/rrr/ccr/crr) + bf16

Following review, nothing layout- or dtype-specific is hardcoded anymore; strides are derived from the kernel's actual layouts everywhere.

  • streamk_gemm_ctypes_lib.cpp: stride_A/B/C derived at compile time from the force-included kernel's ALayout/BLayout/CLayout (RowMajor RxC → ld=C, ColumnMajor → ld=R) instead of the old hardcoded K/K/N.
  • generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
  • GpuGemmRunner: reads dtype and layout off the kernel name; arranges each operand per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is memory-order-preserving so column-major operands stay column-major.
  • run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
  • streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS = rcr/rrr/ccr/crr, SUPPORTED_DTYPES = fp16+bf16.

Validated on GPU (verify on):

  • fp16, all 4 layouts rcr/rrr/ccr/crr — all PASS, identical max_rel_err = 8.2e-4.
  • bf16 + ccr (column-major A/B) — PASS, max_rel_err = 8.0e-3.
  • bf16 atomic/linear/tree on MI350 — PASS.

Remaining (tracked): fp8/bf8/int8 need runner codecs (codegen + Old-TE already support them); the ctypes bridge still runs one kernel per .so (registry multi-kernel path exists in the engine via the 04 driver).

Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond
fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.

GpuGemmRunner (dispatcher/python/gemm_utils.py):
- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887):
  bit-exact decode tables + nearest-representable/saturating encode, carried
  as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand
  C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store,
  int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350
  uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).
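
To make the FNUZ bit layout concrete, here is a scalar sketch of a decode table and a nearest-representable, saturating encode. It is not the codec ported from PR #8887; the names are hypothetical, the exponent bias of 2^(exp_bits-1) and the single NaN code 0x80 follow the FNUZ format definition, and the tie-breaking rule is an arbitrary choice of this sketch.

```python
import math

def fnuz_decode(byte, exp_bits, man_bits):
    """Decode one FNUZ byte: no infinities, no negative zero, 0x80 is the only NaN."""
    if byte == 0x80:
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = 1 << (exp_bits - 1)  # 8 for E4M3 FNUZ, 16 for E5M2 FNUZ
    if exp == 0:  # subnormal: no implicit leading one
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

FP8_TABLE = [fnuz_decode(b, 4, 3) for b in range(256)]  # E4M3 FNUZ
BF8_TABLE = [fnuz_decode(b, 5, 2) for b in range(256)]  # E5M2 FNUZ

def fnuz_encode(x, table):
    """Nearest representable byte; out-of-range inputs saturate to the largest finite value."""
    if math.isnan(x):
        return 0x80
    finite_codes = [b for b in range(256) if b != 0x80]
    # min() keeps the first minimum, so exact ties go to the lower byte code (sketch choice).
    return min(finite_codes, key=lambda b: abs(table[b] - x))
```

Because these tables are specific to FNUZ, decoding OCP fp8 bytes with them would give wrong values, which is the reason for the arch guard above.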

Reference + surface:
- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware
  (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.

int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the
ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile
for every reduction strategy -- warp_gemm_dispatcher has no
Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3
path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the
PR #8094 decision to leave int8 out.

GPU-validated on gfx942 (MI300X), 2048^3, both reduction + layout variants:
  fp8 atomic/linear/tree rcr: PASS  (192/180/183 TFLOPS, max_rel <= 9.4e-4)
  bf8 atomic/linear/tree rcr: PASS  (192/181/181 TFLOPS, max_rel <= 7.8e-4)
  fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)
@ozturkosu

Contributor Author

Update: full Old-TE dtype coverage — fp8/bf8 added (commit b6bea8a)

The Stream-K bridge runner now covers the exact runnable dtype set Old-TE Stream-K supports.

Added fp8 (E4M3 FNUZ) and bf8 (E5M2 FNUZ) to GpuGemmRunner (codec ported from the plain-GEMM fp8 bridge #8887): bit-exact decode to the device fp8_t/bf8_t, nearest-representable+saturate encode, output dtype fp8/bf8 → fp16. The fp16/bf16 paths and the layout-generic (rcr/rrr/ccr/crr) logic are unchanged. SUPPORTED_DTYPES is now fp16, bf16, fp8, bf8.

GPU-validated (gfx942/MI300X, 2048³, reference = decode(encode(A)) @ decode(encode(B))):

| dtype | atomic | linear | tree | col-major |
|---|---|---|---|---|
| fp8 | PASS 192 TF, rel 9.4e-4 | PASS 180 TF | PASS 183 TF | ccr PASS |
| bf8 | PASS 192 TF, rel 7.8e-4 | PASS 181 TF | PASS 181 TF | crr PASS |

Exact-equivalence check vs Old-TE:

  • Old-TE Stream-K's instance builder accepts fp16, fp8, bf16, bf8, fp32, fp64. fp32/fp64 have no MFMA warp tiles (not runnable on either engine). So Old-TE's runnable set = {fp16, bf16, fp8, bf8} = what the bridge now supports.
  • int8 is in the warp-tile table but Old-TE's Stream-K builder rejects --datatype int8 (invalid choice), and the ck_tile engine has no warp_gemm_dispatcher specialization for int8 streamk CompV3 (compile fails). So neither TE nor the bridge generates int8 streamk. The runner keeps an int8 codec + int32-output path in place in case the engine later adds that instantiation.

Arch guard: the fp8/bf8 codec is FNUZ (gfx942). gfx950/MI350 uses OCP fp8 (different bit layout); the runner detects arch and raises a clear error for fp8/bf8 on non-gfx942 rather than silently producing wrong results.

@ozturkosu

Contributor Author

Status summary — Stream-K bridge ready for review

Implementation complete and on this branch:

  • Ported onto the deep-core #8094 engine (feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen), which brings KernelKey reduction fields, workspace virtuals, StreamK backend, dispatcher-owned reduction workspace, and the registry/validation driver, via a clean 3-way merge.
  • Layout-generic — rcr/rrr/ccr/crr; strides derived from the kernel's ALayout/BLayout/CLayout (no rcr hardcoding) in both the ctypes lib and the registry backend.
  • Full Old-TE dtype coverage: fp16, bf16, fp8 (E4M3 FNUZ), bf8 (E5M2 FNUZ). This is the exact runnable set Old-TE Stream-K supports: TE's instance builder also has only {fp16, fp8, bf16, bf8} runnable (fp32/fp64 have no warp tiles; int8 is rejected by TE's builder and has no ck_tile warp_gemm streamk specialization). The int8 codec stays in place in case the engine later adds it.
  • Fair flags — the .so matches TE's gemm_streamk compile flags exactly (verified against a real compile_commands.json).
  • Reductions — atomic / linear / tree, all verified.

Verification (verify on): fp16 all 4 layouts PASS (rel 8.2e-4); bf16 + col-major PASS; fp8/bf8 × atomic/linear/tree × row+col-major PASS (rel <1e-3, gfx942).

TE↔bridge parity (fair, fresh both sides, median-of-3 interleaved):

  • MI350 (gfx950): full pass — worst |gap| 0.59%.
  • MI300X (gfx942): full canonical-config sweep in progress — ~89% within ±15%, median |gap| ≈ 2.75%; the >15% outliers are small-shape multi-GPU concurrency artifacts, auto re-measured standalone.

Caveat: fp8/bf8 codec is gfx942 FNUZ; the runner raises a clear error on gfx950 (OCP) pending separate handling.

@ozturkosu

Contributor Author

Clarification: fp8/bf8 Stream-K on gfx950 (MI350) — blocked by a shared upstream ck_tile bug, not the bridge

For completeness on why MI350 dtype coverage is scoped to fp16+bf16:

The cause is a pre-existing, shared ck_tile engine bug on gfx950: fp8/bf8 Stream-K on gfx950 produces output ≈ 4× the reference (e.g. 2792 vs 698). Verified facts:

  • fp16/bf16 Stream-K verify correct on gfx950; only fp8/bf8 fail.
  • Old-TE fails with the identical 4× error — both engines instantiate the same StreamKKernel template, so this is a pre-existing shared ck_tile kernel bug, not introduced by this bridge/PR.
  • gfx942 (MI300X) fp8/bf8 verify fine on both engines → the bug is gfx950-specific.
  • The 4× factor == the K-split partial-accumulation factor → root cause is in the fp8/bf8 reduction/accumulation path on CDNA4. Needs an upstream ck_tile fix.

Bottom line: even with an OCP fp8 codec (gfx950 uses OCP vs gfx942 FNUZ), the gfx950 fp8/bf8 Stream-K kernels themselves are broken — and Old-TE is equally broken — so there is nothing correct to be "at parity" with. fp8/bf8 parity is therefore only meaningful on gfx942/MI300X, where both engines work and the bridge is at parity. On MI350 the valid dtypes are fp16 + bf16 (both swept).

That's why MI350 is scoped to fp16/bf16, and gfx950 fp8/bf8 is flagged as a separate upstream ck_tile issue rather than a bridge gap. (The bridge runner guards this: fp8/bf8 raise a clear error on any non-gfx942 arch instead of silently producing wrong results.)

@ozturkosu

Contributor Author

Parity analysis (interim — full canonical-config sweeps)

Running the full TE Stream-K config matrix (every bridge-generatable tile/pipeline/persistent × atomic/linear/tree × 4 shapes) per arch/dtype, fair flags (warmup50/repeat100, flush off, rotating 1, gpu timer, median-of-3 interleaved per case, both sides built fresh on the same node compiler), with standalone auto-recheck of any |gap|>15%.

Aggregate so far (gap% = (bridge−TE)/TE; aggregates over measured cases):

| GPU | arch | dtype | measured | within ±15% | median \|gap\| | max \|gap\| |
|---|---|---|---|---|---|---|
| MI300X | gfx942 | fp16 | 2090 | 93% | 1.42% | 28.2% |
| MI350 | gfx950 | bf16 | 1464 | 100% | 0.63% | 7.7% |

Analysis:

  • bf16 on MI350 (single GPU) is 100% within ±15%, median 0.63%, zero outliers — i.e. with no GPU contention the bridge matches TE essentially exactly.
  • fp16 on MI300X is 93% within ±15%, median 1.42%. The ~7% over ±15% are small/sub-millisecond shapes (tiny 64×64 tiles) measured under 4-GPU concurrency. Standalone re-measure confirms these are the TE standalone-benchmark per-launch overhead on tiny kernels (the bridge runs faster there because the .so path has less harness overhead) — the device kernels are identical, so it's a measurement artifact, not a codegen regression. They collapse toward parity when re-measured single-GPU (the contention-free MI350 run shows 0 such outliers).
  • Reduction strategies atomic / linear / tree all covered and at parity.

Still running (final tables to follow): MI300X bf16/fp8/bf8 (gfx942) and MI350 fp16 (gfx950), in parallel across allocations. Raw per-row results (~5K) are kept in CSVs; only this analysis is posted here.

Bottom line so far: the Stream-K bridge is at parity with Tile-Engine across the swept configs; deviations are measurement artifacts on tiny kernels, not real gaps.

@ozturkosu

Contributor Author

Result CSV files (raw per-row data)

All raw parity results live on the build host under /home/AMD/muozturk/ (these will be consolidated into /home/AMD/muozturk/PR8136_parity_csvs/ once all sweeps finish). Per-arch/per-dtype sheets + one combined master:

Master (all archs+dtypes, with gpu + dtype label columns):

  • PR8136_streamk_te_bridge_parity_ALL.csv

Per arch / dtype:

  • streamk_bridge_te_parity_MI300X_gfx942_FULL.csv (MI300X, fp16)
  • streamk_bridge_te_parity_MI300X_gfx942_bf16_FULL.csv (MI300X, bf16)
  • streamk_bridge_te_parity_MI300X_gfx942_fp8_FULL.csv (MI300X, fp8)
  • streamk_bridge_te_parity_MI300X_gfx942_bf8_FULL.csv (MI300X, bf8)
  • streamk_bridge_te_parity_MI350_gfx950_bf16_FULL.csv (MI350, bf16)
  • streamk_bridge_te_parity_MI350_gfx950_fp16_FULL.csv (MI350, fp16)

Columns: gpu, dtype, gfx_arch, strategy, tile, wave, warp_tile, pipeline, persistent, M, N, K, te_tflops, bridge_tflops, gap_pct, bridge_max_rel_err, note (gap_pct = (bridge−TE)/TE·100; +ve = bridge faster; note flags standalone_recheck / hang / te_timeout).
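
For anyone post-processing the CSVs, the gap_pct column and the aggregate figures can be recomputed along these lines. This is a sketch, not the sweep script: the function names are hypothetical, and counting a case as "within ±15%" when |gap| <= 15 (inclusive) is an assumption.

```python
from statistics import median

def gap_pct(bridge_tflops, te_tflops):
    """(bridge - TE) / TE * 100; positive means the bridge is faster."""
    return (bridge_tflops - te_tflops) / te_tflops * 100.0

def summarize(rows, threshold=15.0):
    """Aggregate |gap| over (bridge_tflops, te_tflops) pairs."""
    gaps = [abs(gap_pct(bridge, te)) for bridge, te in rows]
    within = sum(g <= threshold for g in gaps) / len(gaps) * 100.0
    return {
        "within_pct": within,
        "median_abs_gap": median(gaps),
        "max_abs_gap": max(gaps),
    }
```

Rows flagged hang or te_timeout in the note column would need to be filtered out before calling `summarize`, since they carry no valid throughput pair.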

(The Confluence page carries the analysis/summary only; these CSVs hold the ~5K raw rows.)

@ozturkosu

Contributor Author

Superseded by #9028, same commit (b6bea8a) on a policy-compliant branch (users/muozturk/ck-tile/dispatcher-te-bridge-streamk-gemm) targeting develop. Please continue review there.

@ozturkosu ozturkosu closed this Jul 1, 2026