[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM by ozturkosu · Pull Request #8136 · ROCm/rocm-libraries

ozturkosu · 2026-06-05T22:01:39Z

Summary

Routes the stream_k GEMM variant through the same Tile Engine (TE) →
Dispatcher bridge already landed for regular GEMM (Phase 1, #8123) and grouped
GEMM (Phase 3, #8130). Goal of the overall effort: the Dispatcher is the single
source of truth for codegen/build/runtime, and TE only produces configs +
benchmarks.

This PR is stacked on muozturk/dispatcher-gemm-bridge (#8123) — please merge
that first. Its own diff is just two commits:

[CK_TILE] Add stream_k variant to GEMM Dispatcher codegen (cherry-picked)
[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM (this work)

What Stream-K needs that regular GEMM doesn't

Stream-K is a single-problem GEMM (one A/B/C, one M/N/K) with the same C
ABI as regular GEMM, so the Python side (GpuGemmRunner / GemmDispatcherLib
/ GemmProblem) and the GPU worker are reused unchanged. The differences are
internal to the .so:

The generated launch has a Stream-K signature
SelectedKernel::launch(const ck_tile::StreamKHostArgs&, const stream_config&),
which allocates the reduction workspace internally (DeviceMem) and uses the
Atomic reduction strategy.
The registry path (generated_tile_backend.hpp::run()) hard-codes the
single-problem GemmHostArgs launch and won't compile against a Stream-K
SelectedKernel. So the Stream-K ctypes lib bypasses the registry and calls
SelectedKernel::launch(args, stream) directly, reporting the name from the
KERNEL_NAME macro (same approach grouped uses).

Changes

New

dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp — same single-problem
C ABI (dispatcher_run_gemm(A,B,C,M,N,K,time_ms)); hipMalloc + copy A/B,
memset C=0 (Atomic accumulates into C), build StreamKHostArgs with rcr
strides (stride_A=K, stride_B=K, stride_C=N, k_batch=1), launch, copy C back.
Returns 0 / -1 (HIP or throw) / -2 (kernel reports args unsupported).
tile_engine/ops/gemm/streamk_gemm_full_benchmark.py — 3-phase driver
(expand configs → setup_multiple_gemm_dispatchers build → subprocess-isolated
benchmark), mirroring gemm_full_benchmark.py with variant="stream_k".
tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py — disposable GPU worker
(identical to the regular worker since the ABI matches).
tile_engine/ops/gemm/gemm_streamk/configs/default_config.json — small sweep
config (128x128x{32,64}, 2x2x1, 32x32x16, compv3/compv4, intrawave, cshuffle,
pad true, persistent false) → 4 kernels.
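The rcr stride choice and the three return codes listed for streamk_gemm_ctypes_lib.cpp can be sketched in a few lines. This is a minimal illustration assuming packed (unpadded) operands; the helper names `rcr_strides` and `describe_status` are hypothetical and are not part of the bridge.

```python
# Sketch only: hypothetical helpers, not code from the bridge.

def rcr_strides(m: int, n: int, k: int) -> dict:
    """Leading dimensions for packed rcr operands (A row-major, B column-major, C row-major)."""
    return {
        "stride_A": k,  # A is M x K row-major: consecutive rows are K elements apart
        "stride_B": k,  # B is K x N column-major: consecutive columns are K elements apart
        "stride_C": n,  # C is M x N row-major: consecutive rows are N elements apart
    }


# Return codes as described for dispatcher_run_gemm in the Stream-K ctypes lib.
STATUS_MEANING = {
    0: "ok",
    -1: "HIP error or exception",
    -2: "kernel reports args unsupported",
}


def describe_status(status: int) -> str:
    """Map a bridge status code to a short human-readable string."""
    return STATUS_MEANING.get(status, f"unknown status {status}")


if __name__ == "__main__":
    print(rcr_strides(512, 768, 4096))
    print(describe_status(-2))
```

M does not appear in any of the three strides because, for packed rcr operands, every leading dimension is either K or N.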

Modified

dispatcher/python/gemm_utils.py — _ctypes_source_name() selects
streamk_gemm_ctypes_lib.cpp for variant=="stream_k" (in both
_build_compile_jobs and setup_multiple_gemm_dispatchers); .name appends
_streamk; variant threaded into codegen_args and expand_sweep.
dispatcher/python/ctypes_utils.py — pass the requested variant to codegen
--variants instead of hard-coding "standard".
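The variant routing described for gemm_utils.py could look roughly like the sketch below. It models only the two source files named in this PR; the function names mirror the description but the bodies are assumptions, and the single-variant check is added for illustration rather than taken from the diff.

```python
from pathlib import PurePosixPath
from typing import Sequence


def ctypes_source_name(variant: str) -> str:
    """Pick the bridge source file for a GEMM variant (sketch, two variants only)."""
    if variant == "stream_k":
        return "streamk_gemm_ctypes_lib.cpp"
    return "gemm_ctypes_lib.cpp"


def bridge_source(dispatcher_root: PurePosixPath, variants: Sequence[str]) -> PurePosixPath:
    """Resolve the ctypes source for a sweep whose configs all share one variant."""
    # Assumption: a mixed-variant sweep is rejected instead of silently using configs[0].
    if len(set(variants)) != 1:
        raise ValueError("all configs in a sweep must share one variant")
    return dispatcher_root / "bindings" / "ctypes" / ctypes_source_name(variants[0])


if __name__ == "__main__":
    print(bridge_source(PurePosixPath("dispatcher"), ["stream_k", "stream_k"]))
```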

Validation (gfx942 / MI300X, fp16 / rcr)

Numeric parity vs a numpy fp32 reference (A.f32 @ B.f32). Stream-K's Atomic
reduction does multiple fp16 atomic-adds (one per K-split partial) vs
regular/grouped's single fp32→fp16 store, so it is inherently noisier; tolerance
is widened to max_rel ≤ 2.5e-3, frob_rel ≤ 1.5e-3 (regular/grouped use 5e-4).

Shape (M,N,K)	TFLOPS	max_rel	frob_rel	result
1024 x 1024 x 1024	23.1	1.10e-3	5.47e-4	PASS
2048 x 2048 x 2048	129.5	7.02e-4	3.72e-4	PASS
512 x 768 x 4096	24.6	2.01e-3	8.19e-4	PASS
256 x 256 x 256	2.1	3.44e-4	2.08e-4	PASS
4096 x 4096 x 1024	149.8	5.89e-4	2.65e-4	PASS
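For readers reproducing the table, the two error metrics can be sketched as follows. This assumes max_rel is the global max|out-ref|/max|ref| (the definition given later for --verify) and that frob_rel is the Frobenius norm of the error divided by the Frobenius norm of the reference; the second definition is an assumption, since the PR does not spell it out. Inputs are flattened matrices as plain Python lists.

```python
import math
from typing import Sequence


def max_rel(out: Sequence[float], ref: Sequence[float]) -> float:
    """Global max absolute error, normalized by the largest reference magnitude."""
    return max(abs(o - r) for o, r in zip(out, ref)) / max(abs(r) for r in ref)


def frob_rel(out: Sequence[float], ref: Sequence[float]) -> float:
    """Frobenius norm of the error relative to the Frobenius norm of the reference (assumed)."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


def passes(out, ref, max_tol: float = 2.5e-3, frob_tol: float = 1.5e-3) -> bool:
    """Apply the widened Stream-K tolerances quoted in the PR description."""
    return max_rel(out, ref) <= max_tol and frob_rel(out, ref) <= frob_tol


if __name__ == "__main__":
    ref = [1.0, 2.0, 4.0]
    out = [1.0, 2.0, 4.004]
    print(max_rel(out, ref), frob_rel(out, ref), passes(out, ref))
```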

Full TE driver run: 4 kernels x 4 problems = 16/16 OK, 0 failures. The
default problem set targets Stream-K's sweet spot (squares plus a large-K skinny shape):

1024 x 1024 x 1024
2048 x 2048 x 2048
4096 x 4096 x 4096
512 x 512 x 8192 (large-K skinny)

All status 0, positive TFLOPS, nonzero output. Name parity holds end-to-end:
the runtime name reported by each .so equals
GemmKernelConfig(variant="stream_k").name, ending in _streamk.

Unsupported-shape handling: a tiny 257^3 problem is correctly reported as
unsupported by the kernel (status -2, too few tiles to partition across CUs)
and surfaced gracefully by the bridge — not a crash.

Test plan

unified_gemm_codegen.py ... --variants stream_k emits a *_streamk.hpp
whose stem == GemmKernelConfig(variant="stream_k").name
setup_multiple_gemm_dispatchers builds the Stream-K config set → .so
compiles & links against streamk_gemm_ctypes_lib.cpp
Numeric parity (table above) passes under the fp16 Atomic tolerance
Full driver run 16/16 OK; name parity verified
Unsupported tiny shape returns status -2 gracefully

Update 2026-06-12 — brought current with #8123 + Copilot fixes

This branch had forked at #8123's first commit, so it lacked every later
regular-GEMM bridge improvement. Merged the current #8123 HEAD and ported the
Stream-K-specific analogues (the Stream-K bridge keeps its own driver, worker and
ctypes lib, so those fixes do not arrive via the merge):

Benchmark-param parity / perf-methodology fix (streamk_gemm_ctypes_lib.cpp):
benchmark knobs defaulted to warmup=3/repeat=10 — a cold, un-ramped clock, the
root cause of the regular bridge's spurious "perf gap." Now default to old-TE's
warmup=50/repeat=100, env-overridable via
CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. rotating_count stays 1 for
Stream-K: the Atomic preprocess re-zeros only the original C buffer, so rotating
C would leave rotated copies un-zeroed and corrupt the accumulation.
--verify correctness gate (driver + worker): opt-in fp32 numpy reference
check (global max|out-ref|/max|ref|, verified/max_rel in the CSV); a
mismatch counts as a failure.
Multi-GPU benchmarking (driver): fans the (kernel × problem) work across all
visible GPUs via device-pinned HIP_VISIBLE_DEVICES workers (--devices, device
CSV column); also fixes a latent proc-unbound error in the batch handler.
--dtype/--layout guards (driver): constrained to the supported fp16/rcr
surface so a mismatch fails fast.
Copilot codegen nits: std::stoi → std::stoll for M/N/K in
03_streamk_gemm_driver.cpp; stride-aware C zeroing via hipMemset2DAsync
(CLayout-aware, checked HIP status) in _launch_function_streamk.
README: new Stream-K bridge subsection.
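The env-overridable benchmark knobs can be illustrated with a small Python analogue of what the C++ ctypes lib is described as doing. The sketch assumes the shorthand CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING expands to variables named CK_TILE_BENCH_REPEAT and CK_TILE_BENCH_ROTATING; the FLUSH knob is left out because its default is not stated here. The helper names are hypothetical.

```python
import os
from typing import Mapping, Optional


def bench_knob(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer knob from the environment, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    return default if raw is None else int(raw)


def streamk_bench_params(env: Optional[Mapping[str, str]] = None) -> dict:
    """Defaults quoted in the PR: warmup=50, repeat=100, rotating_count=1 for Stream-K."""
    return {
        "warmup": bench_knob("CK_TILE_BENCH_WARMUP", 50, env),
        "repeat": bench_knob("CK_TILE_BENCH_REPEAT", 100, env),  # assumed variable name
        # Stays 1 by default: only the original C buffer is re-zeroed between iterations.
        "rotating_count": bench_knob("CK_TILE_BENCH_ROTATING", 1, env),  # assumed variable name
    }


if __name__ == "__main__":
    print(streamk_bench_params({}))
    print(streamk_bench_params({"CK_TILE_BENCH_WARMUP": "3", "CK_TILE_BENCH_REPEAT": "10"}))
```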

Validation status: DONE on gfx942/MI300X (ctr-cx64-mi300x-4, enroot container). Bridge build+benchmark+--verify on gemm_streamk/configs/default_config.json = 16/16 OK, all verified (max_rel ≤ 3.0e-3, fp16 atomic tol), name parity holds. Bridge-vs-Old-TE parity (perf + correctness, byte-identical device kernel) posted as a comment with the full table + streamk_bridge_oldTE.csv.


Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the dispatcher way: a single-GEMM launch(args, stream) that allocates the reduction workspace internally via DeviceMem (GetWorkSpaceSize / SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an atomic-reduction preprocess that resets C between timed iterations. No external workspace pointer (not the Tile Engine way).

- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants stream_k, naming, includes, _launch_function_streamk, variant->operator map, cshuffle-only config selection, and A/B/CLayout export in the CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel driver that calls SelectedKernel::launch and verifies vs reference_gemm.

Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic 128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50: dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
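The latency and TFLOPS figures in the parity line above are consistent with the conventional 2·M·N·K FLOP count for a GEMM. A minimal sketch of that conversion, assuming this convention is what the harness uses (the function name is hypothetical):

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Convert a GEMM latency to TFLOPS, counting one multiply and one add per MAC."""
    flops = 2.0 * m * n * k
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Dispatcher measurement quoted above: 3840x4096x2048 in 0.242 ms.
    print(f"{gemm_tflops(3840, 4096, 2048, 0.242):.1f} TFLOPS")
```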

Routes the stream_k GEMM variant through the same TE->Dispatcher bridge as regular GEMM (Phase 1) and grouped GEMM (Phase 3). Stream-K is a single-problem GEMM with the same C ABI as regular GEMM, so the Python runner side is reused unchanged; only the .so internals and variant routing differ.

- streamk_gemm_ctypes_lib.cpp (new): same single-problem C ABI (dispatcher_run_gemm) but builds a ck_tile::StreamKHostArgs and calls SelectedKernel::launch(args, stream) directly, bypassing the registry (whose generated backend hard-codes the GemmHostArgs launch signature). The launch allocates the Atomic-reduction workspace internally; C is zeroed per run.
- gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for variant "stream_k"; .name appends _streamk; variant threaded through codegen_args and expand_sweep.
- ctypes_utils.py: pass the requested variant to codegen --variants instead of hard-coding "standard".
- TE driver/worker/config (new): streamk_gemm_full_benchmark.py, run_one_streamk_gemm_kernel.py, gemm_streamk/configs/default_config.json.

Validated end-to-end on gfx942/MI300X: full driver run 16/16 OK (4 kernels x 4 problems), name parity holds (.so name == config .name, ends _streamk). Numeric parity vs fp32 reference passes under an fp16 Atomic-reduction tolerance (max_rel <= 2.5e-3) which is wider than regular/grouped because Atomic does multiple fp16 atomic-adds per K-split. Tiny problems (e.g. 257^3) are correctly reported unsupported (status -2) by the kernel and surfaced gracefully.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR extends the Composable Kernel Tile Engine (TE) → Dispatcher “single source of truth” bridge to the Stream-K GEMM variant by adding Stream-K-aware codegen, a dedicated ctypes bridge that launches Stream-K kernels directly, and TE-side benchmark/worker plumbing that reuses the existing single-problem GEMM Python ABI.

Changes:

Add Stream-K as a first-class GEMM codegen variant (--variants stream_k), including Stream-K kernel includes and a Stream-K launch wrapper that allocates/reuses internal workspace.
Introduce a Stream-K-specific ctypes bridge (streamk_gemm_ctypes_lib.cpp) that bypasses the registry and calls SelectedKernel::launch(StreamKHostArgs, stream_config) directly while keeping the same C ABI as standard GEMM.
Add TE benchmark driver/worker and a small default sweep config for Stream-K benchmarking via the dispatcher bridge.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


File	Description
projects/composablekernel/tile_engine/ops/gemm/streamk_gemm_full_benchmark.py	New 3-phase TE benchmark driver for Stream-K using `variant="stream_k"` and subprocess isolation.
projects/composablekernel/tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py	New subprocess worker to run Stream-K kernels via the existing `GpuGemmRunner` ABI.
projects/composablekernel/tile_engine/ops/gemm/gemm_streamk/configs/default_config.json	New default Stream-K sweep config JSON (small config set).
projects/composablekernel/dispatcher/python/gemm_utils.py	Route Stream-K builds to the Stream-K ctypes source; thread `variant` into codegen args and sweep expansion.
projects/composablekernel/dispatcher/python/ctypes_utils.py	Pass requested variant into `unified_gemm_codegen.py --variants` (instead of hard-coding `standard`).
projects/composablekernel/dispatcher/examples/gemm/cpp/03_streamk_gemm_driver.cpp	New minimal standalone Stream-K driver demonstrating the “dispatcher way” launch path.
projects/composablekernel/dispatcher/codegen/unified_gemm_codegen.py	Add `stream_k` variant: naming, includes, launch wrapper, config filtering, CLI choices.
projects/composablekernel/dispatcher/codegen/arch_filter.py	Add operator type and constraints entry for Stream-K GEMM validation.
projects/composablekernel/dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp	New ctypes bridge for Stream-K that launches the force-included kernel directly and reports `KERNEL_NAME`.


+    # All configs in a sweep share one variant; route to the matching bridge lib.
    ctypes_source = (
-        _cu.get_dispatcher_root() / "bindings" / "ctypes" / "gemm_ctypes_lib.cpp"
+        _cu.get_dispatcher_root()
+        / "bindings"
+        / "ctypes"
+        / _ctypes_source_name(configs[0].variant)
    )


+            if constexpr (ReductionStrategy == ck_tile::StreamKReductionStrategy::Atomic) {{
+                (void)hipMemsetAsync(args.e_ptr, 0,
+                    args.M * args.N * sizeof(CDataType), stream.stream_id_);
+            }} else {{
+                workspace_dev.SetZero();
+            }}


Adds a "Variant scope" section clarifying that the bridge is one shared, variant-aware driver (not per-variant driver copies), that only gemm_universal is wired and validated through the bridge on this PR, and that the gemm_multi_d/gemm_preshuffle/grouped_gemm configs/ dirs are scaffolding following the per-variant convention -- not yet working support. Notes that grouped GEMM and stream-K are separate bridge efforts (#8136 stream-K).

The Stream-K bridge (#8136) was branched at #8123's first commit, so it lacked all subsequent regular-GEMM bridge improvements (arch-validated tile filtering, the develop merge + legacy gemm_universal retirement, benchmark-param/--verify work on the shared driver, README). Merge the current #8123 HEAD to pick those up; the Stream-K-specific analogues that live in the duplicated driver/worker/ctypes lib are ported in follow-up commits. Sole conflict: dispatcher/python/gemm_utils.py variant threading. Kept the Stream-K routing (_ctypes_source_name -> streamk_gemm_ctypes_lib.cpp, .name _streamk suffix, variant through codegen_args/expand_sweep) and adopted #8123's explanatory comment.

The Stream-K bridge keeps its own driver, worker and ctypes lib, so the regular-GEMM bridge improvements that landed on #8123 after this branch forked did not arrive via the merge. Port the Stream-K-specific analogues:

- streamk_gemm_ctypes_lib.cpp: benchmark knobs now default to old-TE's warmup=50/repeat=100 (was 3/10 -- a cold, un-ramped clock, the root of #8123's spurious "perf gap") and are env-overridable via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Unlike the regular path, rotating_count defaults to 1: the Atomic preprocess re-zeros only the original C buffer, so rotating C would corrupt the accumulation.
- streamk_gemm_full_benchmark.py: fan the (kernel x problem) work across every visible GPU (device-pinned HIP_VISIBLE_DEVICES workers, --devices, device CSV column), add the --verify/--verify-tol fp32-reference gate, and constrain --dtype/--layout to the supported fp16/rcr surface. Also fixes a latent proc-unbound error in the batch handler.
- run_one_streamk_gemm_kernel.py: add the fp32 numpy reference check (global max|out-ref|/max|ref|, verified/max_rel) behind --verify.
- README: document the Stream-K bridge driver/worker, flags, _streamk name suffix, fp16 Atomic tolerance, and the rotating_count divergence.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
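One way to fan (kernel x problem) work across device-pinned workers is sketched below. The round-robin assignment and the job dictionary layout are assumptions made for illustration; the commit message only states that each worker is pinned through HIP_VISIBLE_DEVICES, not how work is distributed.

```python
import itertools
from typing import List, Sequence


def assign_work(kernels: Sequence[str], problems: Sequence[str], devices: Sequence[int]) -> List[dict]:
    """Pair every kernel with every problem and pin each job to one device (round-robin)."""
    jobs = []
    for index, (kernel, problem) in enumerate(itertools.product(kernels, problems)):
        device = devices[index % len(devices)]
        jobs.append({
            "kernel": kernel,
            "problem": problem,
            "device": device,
            # The worker subprocess would see exactly one GPU through this variable.
            "env": {"HIP_VISIBLE_DEVICES": str(device)},
        })
    return jobs


if __name__ == "__main__":
    for job in assign_work(["k0", "k1"], ["1024^3", "2048^3"], [0, 1]):
        print(job)
```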

Two Copilot findings on the Stream-K codegen carried in from #8094:

- 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi) before narrowing to ck_tile::index_t; stoi throws std::out_of_range past INT_MAX, needlessly rejecting large GEMM sizes.
- unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's per-iteration C reset zeroed args.M*args.N as a flat contiguous block, which skips elements when C has a padded leading dimension and corrupts the accumulation. Zero the used MxN region honoring stride_E via hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status instead of discarding it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
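The flat-versus-strided zeroing bug can be reproduced on the host with a plain list standing in for the device buffer. This is a simulation of the indexing only, assuming a row-major C whose leading dimension (stride) is larger than N; it does not call HIP, and the function names are hypothetical.

```python
from typing import List


def zero_flat(buf: List[int], m: int, n: int) -> None:
    """Mimic a flat memset over the first m*n elements, ignoring the row stride."""
    for i in range(m * n):
        buf[i] = 0


def zero_2d(buf: List[int], m: int, n: int, stride: int) -> None:
    """Mimic a 2D memset: n elements per row, rows spaced `stride` elements apart."""
    for row in range(m):
        start = row * stride
        buf[start:start + n] = [0] * n


def stale_elements(buf: List[int], m: int, n: int, stride: int) -> int:
    """Count used C elements (inside the m x n region) that were left non-zero."""
    return sum(1 for r in range(m) for c in range(n) if buf[r * stride + c] != 0)


if __name__ == "__main__":
    m, n, stride = 3, 4, 6  # two padding elements per row
    flat = [1] * (m * stride)
    zero_flat(flat, m, n)
    print("stale after flat memset:", stale_elements(flat, m, n, stride))
    strided = [1] * (m * stride)
    zero_2d(strided, m, n, stride)
    print("stale after 2D memset:", stale_elements(strided, m, n, stride))
```

With a 3 x 4 region and a stride of 6, the flat reset covers only the first 12 slots, so the last row keeps its old partial sums, while the strided reset clears exactly the used region and leaves the padding alone.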

ozturkosu · 2026-06-12T16:44:13Z

Stream-K Bridge vs Old TE — MI300X parity (perf + validation)

Ran the requested comparison on ctr-cx64-mi300x-4 (AMD Instinct MI300X / gfx942, inside the enroot container, ROCm/HIP 7.13), using tile_engine/ops/gemm/gemm_streamk/configs/default_config.json → 4 kernels: {compv3,compv4} × 128x128x{32,64}, 2x2x1, 32x32x16, atomic reduction, pad-all, persistent=false, fp16 / rcr.

Shapes: 1024³, 2048³, 4096³, 512×512×8192 (the bridge's default Stream-K problem set).

Method

Bridge: streamk_gemm_full_benchmark.py default_config.json --verify (this PR; warmup=50/repeat=100 — the new streamk_gemm_ctypes_lib.cpp defaults).
Old TE: develop-branch standalone benchmark_gemm_streamk_* binaries (CMake trimmed to gemm+gemm_streamk only). Same warmup/repeat; perf with -verify=0, correctness with -verify=1 separately (the atomic kernel forces repeat=1 cold under verify). Verified Old TE's -flush_cache/-rotating_count change results <1% here, so the numbers are apples-to-apples.
The generated device kernel is byte-identical between the two paths (same StreamKKernel codegen); only host-side workspace ownership / harness differs.

Validation

Both paths correct on all 16 measurements. Bridge: 16/16 verified=True, max_rel ≤ 3.0e-3 (within the fp16 atomic tolerance; Atomic does multiple fp16 atomic-adds so it is inherently noisier than a single fp32→fp16 store). Old TE: 16/16 correct.

Performance (TFLOPS, Bridge vs Old TE)

kernel	shape	Bridge	Old TE	Δ
compv3 128x128x32	1024³	24.08	23.77	+1.3%
compv3 128x128x32	2048³	136.93	137.02	−0.1%
compv3 128x128x32	4096³	260.39	257.44	+1.1%
compv3 128x128x32	512×512×8192	38.42	38.20	+0.6%
compv4 128x128x32	1024³	30.40	32.11	−5.3%
compv4 128x128x32	2048³	149.63	162.88	−8.1%
compv4 128x128x32	4096³	285.70	288.01	−0.8%
compv4 128x128x32	512×512×8192	55.04	46.90	+17.4%
compv3 128x128x64	1024³	34.09	34.26	−0.5%
compv3 128x128x64	2048³	168.62	168.10	+0.3%
compv3 128x128x64	4096³	324.34	323.57	+0.2%
compv3 128x128x64	512×512×8192	49.18	49.79	−1.2%
compv4 128x128x64	1024³	52.78	54.47	−3.1%
compv4 128x128x64	2048³	195.18	206.78	−5.6%
compv4 128x128x64	4096³	240.81	241.91	−0.5%
compv4 128x128x64	512×512×8192	81.41	66.66	+22.1%
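The Δ column is consistent with a relative difference taken against Old TE, (Bridge - Old TE) / Old TE. A short sketch that recomputes a few rows from the table under that assumption:

```python
def delta_percent(bridge_tflops: float, old_te_tflops: float) -> float:
    """Relative difference of the bridge result against Old TE, in percent."""
    return (bridge_tflops - old_te_tflops) / old_te_tflops * 100.0


if __name__ == "__main__":
    rows = [
        ("compv3 128x128x32", "1024^3", 24.08, 23.77),
        ("compv4 128x128x32", "2048^3", 149.63, 162.88),
        ("compv4 128x128x64", "512x512x8192", 81.41, 66.66),
    ]
    for kernel, shape, bridge, old_te in rows:
        print(f"{kernel:18s} {shape:14s} {delta_percent(bridge, old_te):+.1f}%")
```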

Takeaway

Functional parity: confirmed — identical correctness on a byte-identical device kernel.
Performance: at parity within measurement noise. All 8 compv3 points are within ±1.3%. The compv4 points scatter wider (−8% to +22%) but center near parity. Since the device kernel is byte-identical, these are harness / GPU-clock-state artifacts (standalone Old-TE benchmark process vs in-harness bridge), the same effect root-caused for regular GEMM in #8123 ([CK_TILE] Add Tile Engine -> Dispatcher bridge for GEMM), not a real kernel speed difference. The largest spread is on the small/skinny + compv4 cases, which are the most clock/launch-overhead sensitive.

Raw data: streamk_bridge_oldTE.csv (16 rows; columns: kernel, M, N, K, bridge_tflops, oldte_tflops, Δ%, bridge_latency_ms, oldte_latency_ms, bridge_max_rel, bridge_verify, oldte_verify).

Note: validated on gfx942/MI300X (this comparison). The PR's own gfx942 numbers in the description above were re-confirmed: 16/16 OK, name parity, all verified.

…uction Previously the stream-K codegen hard-coded the Atomic reduction strategy. This makes the reduction strategy a first-class config axis so linear and tree reductions can be generated, named, and selected alongside atomic: - unified_gemm_codegen.py: add reduction_strategy to KernelConfig; encode it in key_name (redux_*) and KernelNaming.generate (atomic keeps the bare "_streamk" suffix for name parity, linear/tree are disambiguated); _launch_function_streamk now emits the config's StreamKReductionStrategy (the existing reset lambda already zeroes C for atomic vs the workspace for linear/tree); _get_configs_for_variant iterates strategies from a new streamk_config section, which is added to the default config (atomic, linear, tree). - gemm_utils.py: GemmKernelConfig gains reduction_strategy, threaded into .name, to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly the requested strategy) and expand_sweep (reduction-strategy sweep axis). - ctypes_utils.py: reduction_strategy field on KernelConfig for end-to-end parity. arch_specs.json intentionally unchanged: stream-K reuses the standard warp-tile combos and arch_filter reads no stream-K-specific keys, so adding them would be dead data. Validated on gfx942 (MI300X): atomic/linear/tree each codegen + compile + run and pass fp32 verification (max_rel 3.1e-4) at 256x256x4096. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ozturkosu · 2026-06-12T17:16:10Z

Stream-K reduction strategy is now a codegen axis (`d326bdc283`)

Previously the Stream-K codegen hard-coded the Atomic reduction strategy. This change makes reduction_strategy a first-class config axis so atomic / linear / tree reductions can each be generated, named, and selected.

Codegen (unified_gemm_codegen.py)

reduction_strategy field on KernelConfig
encoded in key_name() (redux_*) and KernelNaming.generate() — atomic keeps the bare _streamk suffix (name parity with the original single-strategy bridge); linear/tree are disambiguated (_streamk_linear, _streamk_tree)
_launch_function_streamk() now emits the config's StreamKReductionStrategy (the existing reset hook already zeroes C for atomic vs the device workspace for linear/tree)
_get_configs_for_variant(STREAM_K) iterates a new streamk_config.reduction_strategy, added to the default config as [atomic, linear, tree]

Bridge (gemm_utils.py) — GemmKernelConfig.reduction_strategy threaded into .name, to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly the requested strategy), and expand_sweep (sweep axis). ctypes_utils.py — field on its KernelConfig for parity.

arch_specs.json intentionally unchanged — Stream-K reuses the standard warp-tile combos and arch_filter reads no Stream-K-specific keys, so adding them would be dead data.
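The suffix scheme described above (atomic keeps the bare suffix, linear and tree are disambiguated) can be sketched as a small lookup. The function names and the base-name argument are hypothetical; only the three suffix strings come from this comment.

```python
STREAMK_SUFFIX = {
    "atomic": "_streamk",         # bare suffix, for name parity with the single-strategy bridge
    "linear": "_streamk_linear",
    "tree": "_streamk_tree",
}


def streamk_suffix(reduction_strategy: str) -> str:
    """Return the kernel-name suffix for a Stream-K reduction strategy."""
    try:
        return STREAMK_SUFFIX[reduction_strategy]
    except KeyError:
        raise ValueError(f"unknown reduction strategy: {reduction_strategy!r}") from None


def streamk_kernel_name(base_name: str, reduction_strategy: str) -> str:
    """Append the strategy suffix to a (hypothetical) base kernel name."""
    return base_name + streamk_suffix(reduction_strategy)


if __name__ == "__main__":
    for strategy in ("atomic", "linear", "tree"):
        print(streamk_kernel_name("gemm_example", strategy))
```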

Validation (gfx942 / MI300X)

Built + ran one kernel per strategy through the isolated worker with fp32 reference verification at 256x256x4096:

strategy	kernel	verified	max_rel
atomic	`..._streamk`	yes	3.1e-4
linear	`..._streamk_linear`	yes	3.1e-4
tree	`..._streamk_tree`	yes	3.1e-4

Scope

This stays within the bridge architecture. It does not add the deep dispatcher-core integration (KernelKey / KernelInstance / Problem / Dispatcher + a generic Stream-K backend + ML selectability); that remains a separate effort.

…rategy fields First slice of moving Stream-K into the dispatcher core (registry-addressable), per the deep-core checklist. Additive and inert by default: - KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm gains streamk / reduction_strategy / workspace. tie() includes them so the three strategies are distinct keys. encode_identifier() appends the Stream-K suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk so non-Stream-K identifiers are unchanged. - Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k(). Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical; atomic/linear/tree suffixes correct; tie() distinguishes strategies.

Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile unchanged: - get_workspace_size(Problem) -> bytes (default 0) - run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default forwards to the existing no-workspace run(). The Dispatcher invokes these through a base KernelInstance* pointer (so the new overload is visible despite derived 6-arg run() overrides). The Stream-K backend (PR-C) overrides both to size and bind the reduction workspace. Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the pre-existing pure virtuals compiles; default get_workspace_size==0 and the workspace-run forwards correctly via base pointer.
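The "non-pure virtual with a forwarding default" pattern can be shown with a Python analogue. Python has no overloads, so the workspace-aware overload of run() is renamed run_with_workspace here; the class names other than KernelInstance, and the way the workspace size is looked up, are hypothetical.

```python
class KernelInstance:
    """Base interface: run() must be overridden, the two new hooks have defaults."""

    def run(self, a, b, c, d_ptrs, problem, stream):
        raise NotImplementedError

    def get_workspace_size(self, problem) -> int:
        # Default: kernels that need no reduction workspace report 0 bytes.
        return 0

    def run_with_workspace(self, a, b, c, d_ptrs, workspace, problem, stream):
        # Default: ignore the workspace and forward to the existing run().
        return self.run(a, b, c, d_ptrs, problem, stream)


class LegacyGemm(KernelInstance):
    """Overrides only the pre-existing hook, so it keeps working unchanged."""

    def run(self, a, b, c, d_ptrs, problem, stream):
        return "legacy-run"


class StreamKBackend(KernelInstance):
    """Overrides both new hooks to size and use the reduction workspace."""

    def get_workspace_size(self, problem) -> int:
        return problem["workspace_bytes"]  # hypothetical lookup

    def run(self, a, b, c, d_ptrs, problem, stream):
        # The no-workspace entry point delegates with a null buffer.
        return self.run_with_workspace(a, b, c, d_ptrs, None, problem, stream)

    def run_with_workspace(self, a, b, c, d_ptrs, workspace, problem, stream):
        return ("streamk-run", workspace is not None)


if __name__ == "__main__":
    base: KernelInstance = LegacyGemm()
    print(base.get_workspace_size({}), base.run_with_workspace(None, None, None, [], b"ws", {}, None))
```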

…tree codegen Adds the C++ backend that lets Stream-K ride the registry, plus the reduction strategy codegen needed to generate the three variants on this branch. - generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the ABI-incompatible args the GemmHostArgs path could not). supports() gates on Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the registry and the Dispatcher's first-fit selection picks the requested one. create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>. - codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field, key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree" (matches KernelKey::encode_identifier from PR-A), per-strategy StreamKReductionStrategy in the generated launch, and a streamk_config sweep axis. (Ported from the bridge branch reduction-strategy work.) PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those to Dispatcher::run() via get_workspace_size()/the workspace-aware run(). Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584 tree headers with correct names; the backend device-compiles (22s) against a generated header and supports() accepts the matching strategy while rejecting the others and non-Stream-K problems.

…pace Relocate the Stream-K reduction-workspace buffer from the per-call generated launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived dispatcher stops paying a hipMalloc/hipFree on every invocation. - codegen: hoist the StreamKGemmKernel type to struct scope and add GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace) overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the bridge ctypes lib and the standalone 03 driver keep working. - backend: override get_workspace_size() and the workspace-aware run(); the no-workspace run() delegates with a null buffer. The per-iteration reset stays in the backend (it needs CDataType + the reduction strategy). - dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out of the public header), size it via get_workspace_size(), and pass it through run_fused()/run_explicit(); free it in the destructor. Atomic needs none (size 0 -> null -> internal path); linear/tree consume the owned buffer. Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at unchanged perf, with linear/tree now running on the dispatcher-owned workspace.
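A grow-on-demand buffer of the kind described above reallocates only when a request exceeds the current capacity. The sketch below uses a host bytearray in place of device memory and a hypothetical class name; it illustrates the reuse policy, not the Dispatcher's actual implementation.

```python
from typing import Optional


class GrowOnDemandWorkspace:
    """Owns one buffer and reallocates only when a larger size is requested."""

    def __init__(self) -> None:
        self._buffer: Optional[bytearray] = None
        self._capacity = 0
        self.allocations = 0  # counts (re)allocations, to show the reuse

    def acquire(self, nbytes: int) -> Optional[bytearray]:
        if nbytes == 0:
            # A size of 0 maps to a null workspace (e.g. the atomic strategy).
            return None
        if nbytes > self._capacity:
            self._buffer = bytearray(nbytes)  # stands in for a device allocation
            self._capacity = nbytes
            self.allocations += 1
        return self._buffer


if __name__ == "__main__":
    ws = GrowOnDemandWorkspace()
    ws.acquire(1024)
    ws.acquire(512)   # reuses the 1024-byte buffer
    ws.acquire(4096)  # grows
    print("allocations:", ws.allocations)
```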

…river Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path (Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch -> verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the registry selection and the Dispatcher-owned workspace. Selectable strategy via --strategy {atomic,linear,tree}. Validated on MI210/gfx90a for all three (distinct registry identifiers, each PASS).

…K backend The dispatcher-wrapper generator emitted ONE template for every variant: backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy on the key. For Stream-K that is wrong twice over -- the regular backend calls launch(GemmHostArgs,...) which the SK kernel struct does not have (so the aggregate register_all_kernels.hpp would not compile against SK), and the key omits the SK fields so encode_identifier() emits no _streamk suffix and atomic/linear/tree collide in the registry. Make the wrapper variant-aware: for STREAM_K configs include generated_tile_backend_streamk.hpp, set key.algorithm.streamk + reduction_strategy + workspace (and pad flags for identifier parity), and return create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>. All other variants are unchanged. Validated on MI210/gfx90a: a registry populated via the generated wrappers holds atomic+linear+tree side by side; Dispatcher::run() selects each by Problem::reduction_strategy and all three verify vs reference_gemm.

…are atomic reset P2: GeneratedStreamKKernelInstance::supports() now ends with SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs is rejected during selection, so first-fit falls back to a non-Stream-K kernel instead of throwing std::runtime_error at launch. P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync (pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage for the contiguous rcr case. Applied to both the internal and external-workspace launch overloads. Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a multi-kernel registry; valid small problems are accepted (no false-negatives).
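The effect of ending supports() with an argument check is that an unsupported problem is filtered out during first-fit selection instead of failing at launch. A minimal sketch follows, assuming a regular kernel accepts a problem even when Stream-K was requested (which is what makes the fallback possible); `enough_tiles` is a hypothetical stand-in for the real IsSupportedArgument check, and the class layout is illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Problem:
    m: int
    n: int
    k: int
    streamk: bool = False
    reduction_strategy: Optional[str] = None


@dataclass(frozen=True)
class Kernel:
    name: str
    streamk: bool
    reduction_strategy: Optional[str]
    args_supported: Callable[[Problem], bool]

    def supports(self, problem: Problem) -> bool:
        if self.streamk:
            # Stream-K kernels only match a Stream-K request with the same strategy.
            if not problem.streamk or problem.reduction_strategy != self.reduction_strategy:
                return False
        # Final gate: the kernel's own argument check.
        return self.args_supported(problem)


def first_fit(registry: Sequence[Kernel], problem: Problem) -> Optional[Kernel]:
    """Return the first registered kernel whose supports() accepts the problem."""
    for kernel in registry:
        if kernel.supports(problem):
            return kernel
    return None


def enough_tiles(problem: Problem) -> bool:
    # Hypothetical threshold standing in for "too few tiles to partition across CUs".
    return problem.m * problem.n >= 512 * 512


demo_registry = [
    Kernel("sk_atomic", True, "atomic", enough_tiles),
    Kernel("sk_linear", True, "linear", enough_tiles),
    Kernel("regular", False, None, lambda p: True),
]

if __name__ == "__main__":
    print(first_fit(demo_registry, Problem(1024, 1024, 1024, True, "linear")).name)
    print(first_fit(demo_registry, Problem(64, 64, 64, True, "linear")).name)
```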

The bridge dispatcher's tile-divisibility gate rejected any problem where M % TileM != 0 for every layout, returning status -2 ("No suitable kernel") at runtime even though the .so built fine. This wrongly excluded bf16 rcr/rrr kernels with a non-power-of-two TileM (e.g. 192) on standard shapes like 1024^3 -- cases Old-TE compiles, runs, and verifies as correct. Root cause: supports() was layout-blind, while the underlying ck_tile::GemmKernel::IsSupportedArgument only constrains a dimension when an operand whose inner axis is that dimension participates without padding: RowMajor A -> K, ColMajor A -> M RowMajor B -> N, ColMajor B -> K RowMajor C -> N, ColMajor C -> M So for rcr (RowMajor A & C) M is never gated, which is why Old-TE runs M=192 tiles on M-indivisible problems. Make supports() compute require_m/n/k from the kernel key's A/B/C layouts so it mirrors IsSupportedArgument exactly (also honoring k_batch in the K grain). Anything it now lets through is still validated by the kernel's own IsSupportedArgument inside launch(), so the bridge stays a strict functional equivalent of Old-TE. Applied to both generated_tile_backend.hpp (the GEMM .so path) and the sibling tile_backend.hpp. Validated on gfx942 (MI300X): 85 previously status-2 rcr/rrr bf16 192-tile .so now run at 1024^3 (Old-TE runs the same, verification correct); the 8 remaining rejects are tile N=192 cases that Old-TE also reports "Arguments not supported" at N=1024 -- parity preserved in both directions.
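The layout-to-dimension mapping listed above can be turned into a small gate. This sketch assumes no padding on any operand and takes the K grain to be TileK * k_batch; both are simplifying assumptions, and the function names are hypothetical. Layout strings use one letter per operand (A, B, C), with r for RowMajor and c for ColMajor.

```python
from typing import Set, Tuple


def required_dims(layout: str) -> Set[str]:
    """Dimensions that must be tile-divisible for a layout such as 'rcr' or 'crr'."""
    a, b, c = layout
    required = set()
    required.add("K" if a == "r" else "M")  # RowMajor A -> K, ColMajor A -> M
    required.add("N" if b == "r" else "K")  # RowMajor B -> N, ColMajor B -> K
    required.add("N" if c == "r" else "M")  # RowMajor C -> N, ColMajor C -> M
    return required


def supports(layout: str, tile: Tuple[int, int, int], problem: Tuple[int, int, int],
             k_batch: int = 1) -> bool:
    """Layout-aware divisibility gate (sketch, unpadded operands only)."""
    tile_m, tile_n, tile_k = tile
    m, n, k = problem
    required = required_dims(layout)
    if "M" in required and m % tile_m != 0:
        return False
    if "N" in required and n % tile_n != 0:
        return False
    if "K" in required and k % (tile_k * k_batch) != 0:  # assumed K grain
        return False
    return True


if __name__ == "__main__":
    cube = (1024, 1024, 1024)
    print(sorted(required_dims("rcr")))          # M is not gated for rcr
    print(supports("rcr", (192, 128, 64), cube))  # TileM=192 is accepted
    print(supports("rcr", (128, 192, 64), cube))  # TileN=192 is still rejected
```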

…oding rcr dispatcher_initialize() in gemm_ctypes_lib.cpp hardcoded the KernelKey layout to rcr (RowMajor/ColMajor/RowMajor) for every kernel. Now that supports() is layout-aware, that wrong key layout makes the dispatcher reject valid problems: a crr kernel does not gate K (neither A=ColMajor nor B=RowMajor has K as its inner axis), but with a hardcoded rcr key supports() applies rcr's K-gate and returns status -2 for TileK=192 problems (e.g. crr 64x64x192 at 1024^3) that Old-TE compiles, runs, and verifies (~87 TFLOPS). Derive signature.layout_a/b/c from the force-included kernel's own ALayout/BLayout/CLayout types via std::is_same_v with tensor_layout::gemm::RowMajor. The key now matches the kernel, so the layout-aware gate is correct for all four layouts. Execution was already layout-correct (the kernel uses its own compile-time layouts); only the host-side selection metadata was wrong. Validated on gfx942 (MI300X): crr 64x64x192 now runs on the bridge (93 TFLOPS), restoring parity with Old-TE.

For a full ~2000-stem sweep on a single GPU: batch all shapes into one worker call per side (5x fewer process startups), cache the compiled old-TE .so, and add a parallel --build-only pre-pass so hipcc compilation uses all CPU cores while GPU measurement stays serial.
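The build/measure split described here can be sketched as follows. This is a minimal illustration, assuming the caller supplies hypothetical `build` and `measure` callables; it is not the sweep script from the PR.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def build_then_measure(stems, build, measure, jobs=None):
    """Build every stem in parallel, then measure the results one at a time.

    `build(stem)` is assumed to run the compiler (for example through
    subprocess) and return the path of the resulting library; `measure(lib)`
    is assumed to run one library on the GPU and return its timing.
    """
    jobs = jobs or os.cpu_count() or 1
    # Pre-pass: compilation is CPU-bound in child processes, so a thread
    # pool is enough to keep all cores busy. map() preserves input order.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        built = list(pool.map(build, stems))
    # Measurement stays serial so kernels never contend for the GPU.
    return [measure(lib) for lib in built]
```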

…eductionStrategy)

Close two review nits on the Stream-K drivers:
- Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so large GEMM dimensions no longer overflow the int range or throw (Copilot nit).
- Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04 driver through it, removing the driver-local strategy_name() duplicate so callers share one spelling that matches the codegen suffix scheme.

Adds dispatcher_test_streamk_registry, a GPU test that generates the three reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the 04 registry driver once per strategy (each force-including its own header, since SkReductionStrategy is a compile-time constexpr), and asserts for each that the encode_identifier() suffix matches, the Dispatcher selects it by Problem::reduction_strategy, and the result verifies against the reference. This converts the previously manual deep-core validation into a regression-guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only CI is unaffected.
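The SKIP behaviour can be sketched as a preflight check. This is an assumed shape, not the test's actual code: `preflight` and its `have_gpu` argument are hypothetical, and CTest only treats 77 as a skip when the test's SKIP_RETURN_CODE property is set to that value.

```python
import shutil

CTEST_SKIP = 77  # exit code reported to CTest to mark the test as skipped


def preflight(have_gpu, find_tool=shutil.which):
    """Return 77 when the test cannot run here, 0 when it can proceed.

    `have_gpu` is supplied by the caller; how the real test detects a GPU
    is not shown in the source, so it is left as an input here.
    """
    if find_tool("hipcc") is None or not have_gpu:
        return CTEST_SKIP
    return 0
```

A test script would typically call `sys.exit(preflight(...))` before doing any codegen or build work when the result is non-zero.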

…IBRARY_PATH meas()/meas_all() built the worker env without /opt/rocm/lib on LD_LIBRARY_PATH, so run_one_gemm_kernel.py failed to load every .so ("libamdhip64.so.7: cannot open shared object file") and every cell came back nan. Set it the same way ab_efficient_sweep.py does.
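A minimal sketch of building a worker environment with the ROCm library directory on LD_LIBRARY_PATH. The helper name is hypothetical and the exact approach used in ab_efficient_sweep.py is not shown in the source, so treat this as one plausible way to do it.

```python
import os


def worker_env(rocm_lib="/opt/rocm/lib", base=None):
    """Copy the environment and make sure rocm_lib is on LD_LIBRARY_PATH."""
    env = dict(os.environ if base is None else base)
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if rocm_lib not in parts:
        # Prepend so the ROCm runtime is found before any other copy.
        parts.insert(0, rocm_lib)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` when launching the worker.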

… guard)

The bridge-vs-old-TE A/B reported phantom regressions from two MEASUREMENT bugs, not real codegen gaps:
- ab_same_harness.py built the old-TE side WITHOUT the TE codegen flags the bridge (and real old-TE's own CMake) use, so -enable-post-misched defaulted back on and old-TE ran ~10-40% faster -> the bridge looked regressed when it is at parity. Now both sides build with identical flags.
- ab_efficient_sweep.py measured whatever libgemm_<stem>.so existed with no freshness check, so 3-day-old binaries built from an obsolete codegen showed up as -78%/+703% gaps. Added a guard: skip any .so older than its generated header (treated as missing) instead of reporting a phantom gap.

With both fixes the 41 former >15% outlier stems measure within +/-10% (median +0.01%); no bridge codegen regression exists.

Note: a separate, deliberately UNCOMMITTED perf change in gemm_utils.py (gate -enable-post-misched=0 on persistent) gives non-persistent large tiles ~9-40%; held back pending a broader persistent-kernel no-regression sweep.
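The freshness guard can be sketched with file modification times. The function name and the handling of a missing header are assumptions for illustration; the source only states that a .so older than its generated header is treated as missing.

```python
import os


def is_fresh(so_path, header_path):
    """True only if the .so exists and is at least as new as its header."""
    if not os.path.exists(so_path):
        return False
    if not os.path.exists(header_path):
        # Assumption: with no header to compare against, accept the binary.
        return True
    return os.path.getmtime(so_path) >= os.path.getmtime(header_path)
```

A sweep would skip any stem for which `is_fresh` returns False rather than report a gap from a stale binary.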

… driver The standalone stream-K driver verified atomic results with the single-pass GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split partials directly into low-precision C (workspace size 0), incurring rounding error that grows with the split factor -- correct results were flagged FAIL on small-M/N, large-K shapes (e.g. 512x512x8192) where tiles < CUs. Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with the split-K CDataType accumulation term, and take the max with the per-split tolerance. The driver and tile_engine now verify identically; the kernel is unchanged.

…gine The standalone stream-K driver built its stream_config as {stream, true, 0, warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults (flush_cache=false, rotating_count=1). The tile_engine benchmark instead times with flush_cache=true and rotating_count=1000, so the driver measured a warm-cache best case while tile_engine measured cold-cache -- the entire source of the reported dispatcher-vs-TE "performance gap" at low tile counts. Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine values) and pass them through stream_config so both sides use identical timing methodology. A validating run still times a single cold shot, mirroring tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0 pass.

The 04 registry driver hardcoded the KernelKey signature to DataType::FP16 and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and layout tags from the generated kernel's actual A/B/C types via compile-time dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate in fp32 and write fp16 C, matching Tile Engine). Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent core objects built once; per-dtype codegen + build + verify with per-dtype identifier assertions). All four datatypes register, dispatch, and verify across atomic/linear/tree on gfx942 (MI300X).

Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry + validation driver).

3-way merge over the shared stream_k ancestor; only the streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp conflicted -- both resolved to the deep-core side:
- codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize + IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib calls, and adds the 3-arg dispatcher-owned-workspace launch.
- driver takes deep-core's stoll parse + apple-to-apple timing + validate cold shot.

Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly, so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree + regular gemm all generate cleanly (0 failed).

Build the Stream-K bridge .so without the dispatcher static lib and with TE-streamk-matching flags:
- compile flags per-variant: Stream-K matches tile_engine/ops/gemm_streamk CMake (only -Wno-* + --offload-compress, NO -mllvm codegen flags and NOT -enable-noalias-to-md-conversion=0 which is a gemm_universal-bridge concern). This keeps the A/B fair; the regular path is unchanged.
- link skips libck_tile_dispatcher.a for Stream-K (the ctypes lib launches the force-included kernel directly, no registry/dispatcher symbols), and the build guard no longer requires the static lib for Stream-K.
- ensure build/examples exists before hipcc writes there (the cmake build that normally creates it is skipped on the Stream-K path).

Validated on MI300X (gfx942): atomic/linear/tree fp16/rcr all build, run, and verify PASS (max_rel_err <8e-4) through the bridge GpuGemmRunner path.

Correct the Stream-K compile flags to match Tile Engine's gemm_streamk build verbatim (ground truth: a TE streamk build's compile_commands.json). The -mllvm codegen flags come from the composablekernel project-root add_compile_options applied globally to the TE benchmark -- they are NOT in the per-target options, so the earlier "minimal flags" assumption was wrong and would have produced a phantom A/B gap (different occupancy).

Flags now:
  -std=c++20 -fno-offload-uniform-block
  -mllvm --lsr-drop-solution=1
  -mllvm -enable-post-misched=0
  -mllvm -amdgpu-early-inline-all=true
  -mllvm -amdgpu-function-calls=false
  --offload-compress

-enable-post-misched=0 is applied unconditionally (TE does so for streamk), and -enable-noalias-to-md-conversion=0 is not used (TE streamk omits it).

ozturkosu · 2026-06-27T23:59:21Z

Update: ported the Stream-K bridge onto the deep-core #8094 engine + TE parity

Reworked this PR so the Stream-K TE→Dispatcher bridge runs on the rewritten #8094 "deep-core" Stream-K engine instead of the original codegen.

What changed

Merged the deep-core Stream-K codegen from #8094 (feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen) into the bridge (3-way merge over the shared 36eb9bdc0e ancestor; KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, dispatcher-owned reduction workspace, registry + validation driver). The codegen now emits the struct-scope kernel type with GetWorkSpaceSize/IsSupported, keeps the 2-arg internal-workspace launch() the bridge ctypes lib calls, and adds the 3-arg dispatcher launch.
The bridge streamk_gemm_ctypes_lib.cpp keeps launching the force-included kernel directly (no registry/dispatcher symbols), so the .so is self-contained — link no longer needs libck_tile_dispatcher.a, and the build creates build/examples itself.
Fair-parity flags: the Stream-K .so now matches Tile Engine's gemm_streamk compile flags exactly (ground truth from a TE build's compile_commands.json): -std=c++20 -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --offload-compress. (-enable-post-misched=0 applied unconditionally, matching TE for streamk; -enable-noalias-to-md-conversion=0 is not used.)

Validation (fresh builds both sides, same node compiler, warmup50/repeat100, flush off, rotating 1, gpu timer, median-of-3 interleaved; verify in a separate pass):

Codegen + on-GPU run + verify PASS for atomic / linear / tree.
MI350 (gfx950): full parity — 12/12 cases (atomic/linear/tree × 1024³/2048³/4096³/512×512×8192) |gap%| ≤ 0.59%, all correct, no gfx950 flag changes needed.
MI300X (gfx942): representative sweep at parity (worst real case ≈ tree −3.8%); a full sweep over the entire TE Stream-K config file (1080 fp16/rcr configs) is running, with automatic standalone re-measurement of any |gap|>15% to remove multi-GPU concurrency artifacts. Aggregate table to follow.

Add a bf16 codec to the bridge runner so bf16 Stream-K kernels can run through the same ctypes path (the ABI is void*+sizeof, so 2-byte bf16 shares the fp16 path; only the bit pattern differs). Dtype is inferred from the kernel name. ENCODE is round-to-nearest-even to bf16 bits; DECODE is bit-exact to device bf16_t so the numpy reference multiplies the same values the GPU does.
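The encode/decode pair described here can be sketched with the standard library alone. This is an illustration of round-to-nearest-even truncation to bf16 bits, not the runner's codec; the function names are hypothetical, and inputs are assumed to fit in float32 (struct raises OverflowError for finite values outside that range).

```python
import struct


def bf16_encode(x):
    """float -> 16-bit bf16 pattern, rounding the dropped bits to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        # NaN: keep it a NaN by forcing the quiet bit in the kept mantissa.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the current low bit of the kept half: ties go to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_decode(b):
    """16-bit bf16 pattern -> float, exact (bf16 is the top half of float32)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]
```

Building the reference from `bf16_decode(bf16_encode(x))` keeps the host-side multiply on the same values the device sees.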

…ing)

Make the Stream-K bridge layout-generic instead of rcr-hardcoded, so all 4 A/B/C layouts (rcr/rrr/ccr/crr) work end to end:
- streamk_gemm_ctypes_lib.cpp: derive stride_A/B/C at compile time from the kernel's ALayout/BLayout/CLayout (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) instead of the hardcoded K/K/N.
- generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
- GpuGemmRunner: read dtype AND layout off the kernel name; arrange each operand per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is now memory-order-preserving so column-major operands stay column-major.
- run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
- streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS now rcr/rrr/ccr/crr, SUPPORTED_DTYPES fp16+bf16 (fp8/bf8/int8 still need runner codecs).

ozturkosu · 2026-06-28T00:41:06Z

Update: removed rcr hardcoding — bridge is now layout-generic (rcr/rrr/ccr/crr) + bf16

Following review, nothing layout- or dtype-specific is hardcoded anymore; strides are derived from the kernel's actual layouts everywhere.

streamk_gemm_ctypes_lib.cpp: stride_A/B/C derived at compile time from the force-included kernel's ALayout/BLayout/CLayout (RowMajor RxC → ld=C, ColumnMajor → ld=R) instead of the old hardcoded K/K/N.
generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
GpuGemmRunner: reads dtype and layout off the kernel name; arranges each operand per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is memory-order-preserving so column-major operands stay column-major.
run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS = rcr/rrr/ccr/crr, SUPPORTED_DTYPES = fp16+bf16.
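The stride rule in the list above (RowMajor RxC → ld=C, ColumnMajor → ld=R) can be sketched in a few lines. The PR derives these at compile time in C++; this Python version with hypothetical helper names only illustrates the arithmetic, taking A as M×K, B as K×N and C as M×N.

```python
def leading_dim(tag, rows, cols):
    """Leading dimension of a rows x cols operand: 'r' RowMajor, 'c' ColumnMajor."""
    tag = tag.lower()
    if tag not in ("r", "c"):
        raise ValueError(f"unknown layout tag: {tag!r}")
    return cols if tag == "r" else rows


def strides(layout, m, n, k):
    """Return (stride_A, stride_B, stride_C) for a layout string such as 'rcr'."""
    a, b, c = layout
    return (leading_dim(a, m, k), leading_dim(b, k, n), leading_dim(c, m, n))
```

For rcr this reproduces the previously hardcoded K/K/N, which is why the old code happened to be correct for that one layout.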

Validated on GPU (verify on):

fp16, all 4 layouts rcr/rrr/ccr/crr — all PASS, identical max_rel_err = 8.2e-4.
bf16 + ccr (column-major A/B) — PASS, max_rel_err = 8.0e-3.
bf16 atomic/linear/tree on MI350 — PASS.

Remaining (tracked): fp8/bf8/int8 need runner codecs (codegen + Old-TE already support them); the ctypes bridge still runs one kernel per .so (registry multi-kernel path exists in the engine via the 04 driver).

Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.

GpuGemmRunner (dispatcher/python/gemm_utils.py):
- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887): bit-exact decode tables + nearest-representable/saturating encode, carried as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store, int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350 uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).

Reference + surface:
- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.

int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile for every reduction strategy -- warp_gemm_dispatcher has no Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3 path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the PR #8094 decision to leave int8 out.

GPU-validated on gfx942 (MI300X), 2048^3, both reduction + layout variants:
  fp8 atomic/linear/tree rcr: PASS (192/180/183 TFLOPS, max_rel <= 9.4e-4)
  bf8 atomic/linear/tree rcr: PASS (192/181/181 TFLOPS, max_rel <= 7.8e-4)
  fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)

ozturkosu · 2026-06-28T01:12:40Z

Update: full Old-TE dtype coverage — fp8/bf8 added (commit `b6bea8a`)

The Stream-K bridge runner now covers the exact runnable dtype set Old-TE Stream-K supports.

Added fp8 (E4M3 FNUZ) and bf8 (E5M2 FNUZ) to GpuGemmRunner (codec ported from the plain-GEMM fp8 bridge #8887): bit-exact decode to the device fp8_t/bf8_t, nearest-representable+saturate encode, output dtype fp8/bf8 → fp16. The fp16/bf16 paths and the layout-generic (rcr/rrr/ccr/crr) logic are unchanged. SUPPORTED_DTYPES is now fp16, bf16, fp8, bf8.
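A decode-table codec of this kind can be sketched as below. This is an independent illustration of the FNUZ formats (exponent bias 8 for E4M3 and 16 for E5M2, byte 0x80 is NaN, no infinities, no negative zero), not the code ported from #8887. The function names are hypothetical, and the tie-breaking of this brute-force encode is not guaranteed to match the device conversion.

```python
import math


def fnuz_decode(byte, exp_bits, man_bits):
    """Decode one FNUZ fp8 byte (E4M3: 4,3; E5M2: 5,2) to a Python float."""
    if byte == 0x80:
        return math.nan  # the single NaN encoding; there is no -0 or inf
    bias = 1 << (exp_bits - 1)
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading one
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


E4M3_TABLE = [fnuz_decode(b, 4, 3) for b in range(256)]
E5M2_TABLE = [fnuz_decode(b, 5, 2) for b in range(256)]


def fnuz_encode(x, table):
    """Nearest representable byte, saturating at the largest finite value."""
    if math.isnan(x):
        return 0x80
    limit = max(v for v in table if not math.isnan(v))
    x = max(-limit, min(limit, x))  # saturate, which also handles +/-inf
    candidates = (b for b in range(256) if b != 0x80)
    return min(candidates, key=lambda b: abs(table[b] - x))
```

A verification reference in the style the PR describes would then compute `table[fnuz_encode(a, table)]` for each operand element before the matmul.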

GPU-validated (gfx942/MI300X, 2048³, reference = decode(encode(A)) @ decode(encode(B))):

dtype	atomic	linear	tree	col-major
fp8	PASS 192 TF, rel 9.4e-4	PASS 180	PASS 183	ccr PASS
bf8	PASS 192 TF, rel 7.8e-4	PASS 181	PASS 181	crr PASS

Exact-equivalence check vs Old-TE:

Old-TE Stream-K's instance builder accepts fp16, fp8, bf16, bf8, fp32, fp64. fp32/fp64 have no MFMA warp tiles (not runnable on either engine). So Old-TE's runnable set = {fp16, bf16, fp8, bf8} = what the bridge now supports.
int8 is in the warp-tile table but Old-TE's Stream-K builder rejects --datatype int8 (invalid choice), and the ck_tile engine has no warp_gemm_dispatcher specialization for int8 streamk CompV3 (compile fails). So neither TE nor the bridge generates int8 streamk. The runner keeps an int8 codec + int32-output path ready for if/when the engine adds that instantiation.

Arch guard: the fp8/bf8 codec is FNUZ (gfx942). gfx950/MI350 uses OCP fp8 (different bit layout); the runner detects arch and raises a clear error for fp8/bf8 on non-gfx942 rather than silently producing wrong results.
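The guard amounts to a check before any encode happens. A minimal sketch, assuming the arch string has already been detected by the caller (the detection mechanism and the names below are hypothetical):

```python
FNUZ_FP8_ARCHS = {"gfx942"}  # archs whose fp8/bf8 use the FNUZ bit layout


def check_dtype_arch(dtype, arch):
    """Raise instead of encoding fp8/bf8 with the wrong bit layout."""
    if dtype in ("fp8", "bf8") and arch not in FNUZ_FP8_ARCHS:
        raise RuntimeError(
            f"{dtype} codec is FNUZ (gfx942); {arch} is not supported by this codec"
        )
```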

ozturkosu · 2026-06-28T01:14:08Z

Status summary — Stream-K bridge ready for review

Implementation complete and on this branch:

Ported onto the deep-core engine from #8094 (feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen): KernelKey reduction fields, workspace virtuals, StreamK backend, dispatcher-owned reduction workspace, and registry/validation driver, via a clean 3-way merge.
Layout-generic — rcr/rrr/ccr/crr; strides derived from the kernel's ALayout/BLayout/CLayout (no rcr hardcoding) in both the ctypes lib and the registry backend.
Full Old-TE dtype coverage — fp16, bf16, fp8 (E4M3 FNUZ), bf8 (E5M2 FNUZ). This is the exact runnable set Old-TE Stream-K supports: TE's instance builder also has only {fp16, fp8, bf16, bf8} runnable (fp32/fp64 have no warp tiles; int8 is rejected by TE's builder and has no ck_tile warp_gemm streamk specialization). int8 codec is kept ready for if/when the engine adds it.
Fair flags — the .so matches TE's gemm_streamk compile flags exactly (verified against a real compile_commands.json).
Reductions — atomic / linear / tree, all verified.

Verification (verify on): fp16 all 4 layouts PASS (rel 8.2e-4); bf16 + col-major PASS; fp8/bf8 × atomic/linear/tree × row+col-major PASS (rel <1e-3, gfx942).

TE↔bridge parity (fair, fresh both sides, median-of-3 interleaved):

MI350 (gfx950): full pass — worst |gap| 0.59%.
MI300X (gfx942): full canonical-config sweep in progress — ~89% within ±15%, median |gap| ≈ 2.75%; the >15% outliers are small-shape multi-GPU concurrency artifacts, auto re-measured standalone.

Caveat: fp8/bf8 codec is gfx942 FNUZ; the runner raises a clear error on gfx950 (OCP) pending separate handling.

ozturkosu · 2026-06-28T02:40:19Z

Clarification: fp8/bf8 Stream-K on gfx950 (MI350) — blocked by a shared upstream ck_tile bug, not the bridge

For completeness on why MI350 dtype coverage is scoped to fp16+bf16:

The cause is a pre-existing, shared ck_tile engine bug on gfx950: fp8/bf8 Stream-K there produces output ≈ 4× the reference (e.g. 2792 vs 698). Verified facts:

fp16/bf16 Stream-K verify correct on gfx950; only fp8/bf8 fail.
Old-TE fails with the identical 4× error — both engines instantiate the same StreamKKernel template, so this is a pre-existing shared ck_tile kernel bug, not introduced by this bridge/PR.
gfx942 (MI300X) fp8/bf8 verify fine on both engines → the bug is gfx950-specific.
The 4× factor == the K-split partial-accumulation factor → root cause is in the fp8/bf8 reduction/accumulation path on CDNA4. Needs an upstream ck_tile fix.

Bottom line: even with an OCP fp8 codec (gfx950 uses OCP vs gfx942 FNUZ), the gfx950 fp8/bf8 Stream-K kernels themselves are broken — and Old-TE is equally broken — so there is nothing correct to be "at parity" with. fp8/bf8 parity is therefore only meaningful on gfx942/MI300X, where both engines work and the bridge is at parity. On MI350 the valid dtypes are fp16 + bf16 (both swept).

That's why MI350 is scoped to fp16/bf16, and gfx950 fp8/bf8 is flagged as a separate upstream ck_tile issue rather than a bridge gap. (The bridge runner guards this: fp8/bf8 raise a clear error on any non-gfx942 arch instead of silently producing wrong results.)

ozturkosu · 2026-06-28T03:06:12Z

Parity analysis (interim — full canonical-config sweeps)

Running the full TE Stream-K config matrix (every bridge-generatable tile/pipeline/persistent × atomic/linear/tree × 4 shapes) per arch/dtype, fair flags (warmup50/repeat100, flush off, rotating 1, gpu timer, median-of-3 interleaved per case, both sides built fresh on the same node compiler), with standalone auto-recheck of any |gap|>15%.

Aggregate so far (gap% = (bridge−TE)/TE; aggregates over measured cases):

GPU	arch	dtype	measured	within ±15%	median |gap|	max |gap|
MI300X	gfx942	fp16	2090	93%	1.42%	28.2%
MI350	gfx950	bf16	1464	100%	0.63%	7.7%

Analysis:

bf16 on MI350 (single GPU) is 100% within ±15%, median 0.63%, zero outliers — i.e. with no GPU contention the bridge matches TE essentially exactly.
fp16 on MI300X is 93% within ±15%, median 1.42%. The ~7% over ±15% are small/sub-millisecond shapes (tiny 64×64 tiles) measured under 4-GPU concurrency. Standalone re-measure confirms these are the TE standalone-benchmark per-launch overhead on tiny kernels (the bridge runs faster there because the .so path has less harness overhead) — the device kernels are identical, so it's a measurement artifact, not a codegen regression. They collapse toward parity when re-measured single-GPU (the contention-free MI350 run shows 0 such outliers).
Reduction strategies atomic / linear / tree all covered and at parity.

Still running (final tables to follow): MI300X bf16/fp8/bf8 (gfx942) and MI350 fp16 (gfx950), in parallel across allocations. Raw per-row results (~5K) are kept in CSVs; only this analysis is posted here.

Bottom line so far: the Stream-K bridge is at parity with Tile-Engine across the swept configs; deviations are measurement artifacts on tiny kernels, not real gaps.

ozturkosu · 2026-06-28T03:06:44Z

Result CSV files (raw per-row data)

All raw parity results live on the build host under /home/AMD/muozturk/ (these will be consolidated into /home/AMD/muozturk/PR8136_parity_csvs/ once all sweeps finish). Per-arch/per-dtype sheets + one combined master:

Master (all archs+dtypes, with gpu + dtype label columns):

PR8136_streamk_te_bridge_parity_ALL.csv

Per arch / dtype:

streamk_bridge_te_parity_MI300X_gfx942_FULL.csv (MI300X, fp16)
streamk_bridge_te_parity_MI300X_gfx942_bf16_FULL.csv (MI300X, bf16)
streamk_bridge_te_parity_MI300X_gfx942_fp8_FULL.csv (MI300X, fp8)
streamk_bridge_te_parity_MI300X_gfx942_bf8_FULL.csv (MI300X, bf8)
streamk_bridge_te_parity_MI350_gfx950_bf16_FULL.csv (MI350, bf16)
streamk_bridge_te_parity_MI350_gfx950_fp16_FULL.csv (MI350, fp16)

Columns: gpu, dtype, gfx_arch, strategy, tile, wave, warp_tile, pipeline, persistent, M, N, K, te_tflops, bridge_tflops, gap_pct, bridge_max_rel_err, note (gap_pct = (bridge−TE)/TE·100; +ve = bridge faster; note flags standalone_recheck / hang / te_timeout).
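The gap_pct definition and the aggregate columns used in the parity tables can be sketched as follows. The function names are hypothetical and the rows in the usage note are made-up values for illustration, not measurements from the CSVs.

```python
from statistics import median


def gap_pct(te_tflops, bridge_tflops):
    """(bridge - TE) / TE * 100; positive means the bridge is faster."""
    return (bridge_tflops - te_tflops) / te_tflops * 100.0


def summarize(rows, limit=15.0):
    """Aggregate (te_tflops, bridge_tflops) pairs into the table's columns."""
    gaps = [abs(gap_pct(te, bridge)) for te, bridge in rows]
    return {
        "measured": len(gaps),
        "within_pct": 100.0 * sum(g <= limit for g in gaps) / len(gaps),
        "median_abs_gap": median(gaps),
        "max_abs_gap": max(gaps),
    }
```

For example, `summarize([(100.0, 100.0), (100.0, 110.0), (100.0, 80.0)])` covers three illustrative cases with gaps of 0%, +10% and -20%, so two of the three fall within ±15%.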

(The Confluence page carries the analysis/summary only; these CSVs hold the ~5K raw rows.)

ozturkosu · 2026-07-01T17:54:53Z

Superseded by #9028, same commit (b6bea8a) on a policy-compliant branch (users/muozturk/ck-tile/dispatcher-te-bridge-streamk-gemm) targeting develop. Please continue review there.

Muhammed Ozturk and others added 3 commits June 5, 2026 02:48

github-actions Bot added the project: composablekernel label Jun 5, 2026

ozturkosu self-assigned this Jun 5, 2026

ozturkosu requested a review from Copilot June 5, 2026 22:02

Copilot started reviewing on behalf of ozturkosu June 5, 2026 22:02

Copilot AI reviewed Jun 5, 2026

assistant-librarian Bot added the organization: ROCm label Jun 5, 2026

This was referenced Jun 10, 2026

[CK_TILE] Add Tile Engine -> Dispatcher bridge for GEMM #8123

Closed

[CK_TILE] TE->Dispatcher GEMM bridge (consolidated: key + bf16 + layouts + parity + example) #8261

Closed

ozturkosu and others added 3 commits June 12, 2026 02:48

github-actions Bot added the documentation label Jun 12, 2026

This was referenced Jun 15, 2026

feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen #8094

Closed

feat(ck_tile): TE -> Dispatcher GEMM bridge (all layouts, fp16/bf16) #8479

Closed

ozturkosu added 9 commits June 17, 2026 01:02

ozturkosu added 11 commits June 17, 2026 22:22

ozturkosu marked this pull request as ready for review June 28, 2026 00:25

ozturkosu requested review from a team as code owners June 28, 2026 00:25

ozturkosu closed this Jul 1, 2026

ozturkosu wants to merge 31 commits into muozturk/dispatcher-gemm-bridge from muozturk/dispatcher-streamk-gemm-bridge
