feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen by ozturkosu · Pull Request #8094 · ROCm/rocm-libraries

ozturkosu · 2026-06-05T06:48:49Z

Add stream-K variant to the GEMM Dispatcher codegen (the dispatcher way)

This is the next slice of the Tile Engine → Dispatcher consolidation, following the same pattern as the grouped_gemm PR (#8075). It adds the stream-K GEMM variant to the unified GEMM codegen, implemented the dispatcher way (workspace owned internally via DeviceMem, clean launch(args, stream) signature), and proves numeric + performance parity against Tile Engine.

Branch is based on develop and contains only the stream-K work (no grouped_gemm commits).

What I did

codegen/arch_filter.py — added OperatorType.GEMM_STREAMK and its tile constraints.
codegen/unified_gemm_codegen.py:
- Added GemmVariant.STREAM_K, made it reachable from the CLI (--variants stream_k), wired naming (_streamk suffix), includes, and the variant→operator map.
- New _launch_function_streamk: builds a single StreamKHostArgs, MakeKernelArgs → GetWorkSpaceSize → allocate DeviceMem workspace internally + SetZero → SetWorkSpacePointer → IsSupportedArgument check → make_kernel via launch_kernel_time_mask with an Atomic-reduction preprocess that zeros C between timed iterations. No external kargs_ptr (not the Tile Engine way).
- Exported A/B/CLayout in the CK_TILE_SINGLE_KERNEL_INCLUDE block so a single-kernel driver is layout-generic.
- Restricted stream_k configs to the cshuffle epilogue (only one the kernel supports).
examples/gemm/cpp/03_streamk_gemm_driver.cpp (NEW): minimal standalone driver that force-includes (-include) one generated stream-K header, builds a single set of A/B/C tensors, calls SelectedKernel::launch(args, stream), verifies against ck_tile::reference_gemm, and prints TFLOPS/GB/s.

Problem tried (config + shape)

Config: fp16_rcr_compv3_cshuffle_intrawave_..._128x128x64_2x2x1_32x32x16 (atomic reduction; exists identically in TE and the dispatcher).
Shape: M=3840, N=4096, K=2048, warmup=10, repeat=50, MI300X (gfx942), ROCm 7.1.1.

Performance + numerical verification (Dispatcher vs Tile Engine)

path	latency (ms)	TFLOPS	GB/s	verify
Tile Engine (warmup=10, repeat=50)	0.24	266.7	264.8	correct
Dispatcher (warmup=10, repeat=50)	0.242	266.1	264.2	PASS
Δ	~0%	~0%	~0%	identical
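
As a sanity check on the table, the throughput columns can be re-derived from the latency column. The sketch below assumes the conventional GEMM accounting (2·M·N·K floating-point operations, with A, B, and C each moved once at 2 bytes per fp16 element); the PR does not state the exact formula the driver uses, and `gemm_throughput` is an illustrative name, not a function from the codebase.

```python
def gemm_throughput(m, n, k, latency_ms, bytes_per_elem=2):
    """Return (TFLOPS, GB/s) for one GEMM.

    Assumes 2*M*N*K FLOPs and a single pass over A (MxK), B (KxN), and C (MxN).
    """
    seconds = latency_ms * 1e-3
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


# Dispatcher row from the table: M=3840, N=4096, K=2048 at 0.242 ms.
tflops, gbps = gemm_throughput(3840, 4096, 2048, latency_ms=0.242)
print(f"{tflops:.1f} TFLOPS, {gbps:.1f} GB/s")
```

Under these assumptions the dispatcher row comes out close to the reported 266.1 TFLOPS and 264.2 GB/s. The TE latency is shown rounded to two digits (0.24 ms), so the same calculation on that row is only approximate.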

Methodology note: TE's benchmark forces repeat=1, warmup=0 whenever verify=1 (the atomic kernel accumulates into C, so it can only verify a single run). A verify=1 invocation therefore reports a single cold iteration (~0.30 ms), which is not a representative perf number. The table above uses TE verify=0 (so warmup/repeat are honored) for the perf row and a separate TE verify=1 run for correctness. The dispatcher driver times (warmup=10/repeat=50) and verifies in the same run because it re-zeros C between timed iterations via the masked preprocess.
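
The effect described in the methodology note can be reproduced with a toy model. This is a minimal pure-Python sketch, not the generated launch code: `atomic_streamk_like` only imitates the "add partial sums onto whatever C already holds" behavior of an atomic reduction, and the `reset_between` flag stands in for the masked preprocess. All names are illustrative.

```python
def reference_gemm(a, b):
    """Plain triple-loop GEMM on nested lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def atomic_streamk_like(a, b, c, k_splits=2):
    """Accumulate K-split partial products into c in place."""
    k = len(b)
    step = (k + k_splits - 1) // k_splits
    for start in range(0, k, step):
        stop = min(start + step, k)
        for i in range(len(a)):
            for j in range(len(b[0])):
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(start, stop))


def timed_runs(a, b, repeat, reset_between):
    """Run the accumulating kernel `repeat` times on one C buffer."""
    c = [[0] * len(b[0]) for _ in a]
    for _ in range(repeat):
        if reset_between:  # stand-in for the masked preprocess that zeros C
            for row in c:
                row[:] = [0] * len(row)
        atomic_streamk_like(a, b, c)
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
ref = reference_gemm(a, b)
with_reset = timed_runs(a, b, repeat=3, reset_between=True)
without_reset = timed_runs(a, b, repeat=3, reset_between=False)
```

With the reset, the final C equals the reference after any number of repeats. Without it, C ends up as `repeat` times the reference, which is why a benchmark that cannot reset C has to fall back to a single iteration when verifying.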

The generated GPU kernel (StreamKKernel<StreamKTilePartitioner, GemmPipeline, GemmEpilogue>) is identical to TE's; only host-side workspace ownership differs (internal DeviceMem vs TE's external pointer). Numerics match.

Deep-core integration (PR-A…E) — accepted design deviations

The deep-core commits make Stream-K a first-class registry citizen (selectable through Dispatcher::run() by Problem::reduction_strategy). Two deliberate deviations from the literal deep-core spec are worth calling out for reviewers:

Per-iteration reset lives in the backend, not in Dispatcher::run(). The Dispatcher owns/sizes/frees the reduction workspace (ensure_workspace, grow-on-demand, freed in dtor), but the strategy-aware reset stays inside the generated launch (generated_tile_backend_streamk / _launch_function_streamk). Reason: the reset is per-repeat (it runs inside launch_kernel_time_mask's preprocess) and dtype-dependent (atomic C-reset needs sizeof(CDataType)), which the dtype-erased Dispatcher does not have. Net: workspace owned by Dispatcher, reset owned by backend.
Hardware grid is delegated to the ck_tile partitioner. Grid sizing uses StreamKGemmKernel::GridSize(tile_partitioner) rather than a dispatcher-side NumCU/Occupancy/get_num_xccs calculation — matching the bridge and keeping ck_tile as the single source of truth for Stream-K work partitioning.

Both are sound and additive: non-Stream-K kernels are byte-identical (the encode_identifier() Stream-K suffix is guarded by algorithm.streamk), and the 2-arg internal launch is preserved for the bridge / 03 driver.

See the PR comments for the per-commit detail and the gfx942/MI300X validation table.

Update (2026-06-27) — multi-datatype support: fp16 / bf16 / fp8 / bf8

The Stream-K dispatcher path now supports every floating-point datatype Tile Engine builds for Stream-K: fp16, bf16, fp8, and bf8. fp8/bf8 inputs accumulate in fp32 and write an fp16 C tensor (get_output_dtype), exactly matching Tile Engine.

What was fixed: the codegen (unified_gemm_codegen.py, arch_filter.py) was already datatype-generic; only fp16 had been proven. The one real lock-in was in 04_streamk_registry_driver.cpp, which hardcoded the KernelKey signature to DataType::FP16 + rcr layout — so fp8/bf8/bf16 kernels registered under the wrong key and failed dispatch. It now derives dtype_a/b/c/acc and layout tags from the generated kernel's actual A/B/C types via compile-time dtype_enum_of<T>() / layout_tag_of<Layout>() helpers. test_streamk_registry.py is parametrized over all four datatypes (dtype-independent core objects built once; per-dtype codegen + build + verify + identifier assertions).

Validation (gfx942 / MI300X, M=3840 N=4096 K=2048, all verify against ck_tile::reference_gemm):

dtype	atomic (TFLOPS)	linear (TFLOPS)	tree (TFLOPS)	verify
fp16	275.8	304.7	299.0	PASS
bf16	284.0	298.9	307.6	PASS
fp8	316.2	371.1	373.3	PASS
bf8	353.0	394.6	370.9	PASS

All four register with the correct identifier (fp8_rcr…, bf8_rcr…, etc.), are selected by Problem::reduction_strategy, and verify. (int8 deliberately out of scope — atomic integer reduction is unproven in TE as well.)

Update (2026-06-27, #2) — full Old-TE functional equivalence: + all layouts

Following the multi-datatype work above, Stream-K is now equivalent to legacy Tile Engine across every axis TE builds for: datatypes {fp16, bf16, fp8, bf8} × layouts {rcr, rrr, ccr, crr} × reduction strategies {atomic, linear, tree}.

Bug found + fixed (caught by an independent review pass): the Stream-K backend's make_args (generated_tile_backend_streamk.hpp) hardcoded rcr leading dims (stride_a=K, stride_b=K, stride_c=N) for every layout. The 04 driver fills host tensors with correct per-layout strides via get_default_stride and calls Dispatcher::run, but make_args then overrode them — so rrr/ccr/crr ran with wrong strides and would fail verification. Fixed by deriving leading dims from the kernel key's layouts: A is MxK (row→K, col→M), B is KxN (row→N, col→K), C is MxN (row→N, col→M). rcr is unchanged; all four TE layouts keep C row-major, so the atomic C-reset assumption still holds.
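
The stride rule in the fix can be written down directly. The following is a small Python sketch of that mapping only; the real change lives in the C++ `make_args`, and `leading_dims` is a hypothetical helper name used here for illustration.

```python
def leading_dims(layout, m, n, k):
    """Map a three-letter layout tag to (stride_a, stride_b, stride_c).

    The tag gives the layouts of A, B, and C in order:
    'r' = row-major, 'c' = column-major.
    """
    if len(layout) != 3 or set(layout) - {"r", "c"}:
        raise ValueError(f"unexpected layout tag: {layout!r}")
    la, lb, lc = layout
    stride_a = k if la == "r" else m  # A is MxK: row -> K, col -> M
    stride_b = n if lb == "r" else k  # B is KxN: row -> N, col -> K
    stride_c = n if lc == "r" else m  # C is MxN: row -> N, col -> M
    return stride_a, stride_b, stride_c


for tag in ("rcr", "rrr", "ccr", "crr"):
    print(tag, leading_dims(tag, m=3840, n=4096, k=2048))
```

For rcr this returns (K, K, N), the values that were previously hardcoded, which is why the rcr path is unaffected by the fix.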

Validation (full 48-combo matrix, gfx942/MI300X, M=3840 N=4096 K=2048, all Verification: PASS):

dtype	layout	atomic (TFLOPS)	linear (TFLOPS)	tree (TFLOPS)
fp16	rcr	282.6	302.2	308.1
fp16	rrr	235.1	288.9	298.9
fp16	ccr	250.8	272.0	278.6
fp16	crr	208.8	272.5	236.7
bf16	rcr	281.2	306.8	309.1
bf16	rrr	276.3	298.3	296.0
bf16	ccr	266.0	285.4	288.3
bf16	crr	234.7	282.6	244.7
fp8	rcr	322.3	372.3	372.7
fp8	rrr	244.4	264.2	264.6
fp8	ccr	267.5	298.4	295.1
fp8	crr	253.0	271.7	266.9
bf8	rcr	311.4	378.5	376.1
bf8	rrr	240.3	269.2	265.3
bf8	ccr	271.5	305.9	295.3
bf8	crr	240.2	264.9	272.0

(TFLOPS; all 48 verify against ck_tile::reference_gemm.) The registry test test_streamk_registry.py now covers this full matrix (with --datatypes/--layouts flags to trim for faster CI). int8/fp32/fp64 remain out of scope (TE builds no Stream-K configs for them).

Commits: 9d033fde99 (dtypes), 7190a19edd (test layout coverage), 8595984d5c (backend layout-stride fix).

Update (2026-06-30, #3) — scope correction + review blockers addressed

Scope of "equivalence" (correcting the earlier "full functional equivalence" / "~0% perf delta" framing). Numeric correctness (verify vs ck_tile::reference_gemm) is validated across the full {fp16,bf16,fp8,bf8} × {rcr,rrr,ccr,crr} × {atomic,linear,tree} matrix on matched tile configs. Performance parity vs Tile Engine is demonstrated on one config (fp16 rcr atomic, 3840×4096×2048) — the 48-combo tables above report dispatcher-only TFLOPS, not a TE comparison. Tile coverage is narrower than TE (e.g. fp16 rcr: TE=180 vs DISP=73 tiles), so "functional equivalence" should be read as per matched tile config, not over TE's whole tile surface.

Review blockers fixed (correctness):

Reduction workspace is now zeroed by the Dispatcher before every linear/tree dispatch (ensure_workspace), so correctness no longer depends on the backend's per-iteration preprocess running (the non-benchmarking nrepeat=1 path could otherwise hand the kernel a garbage buffer).
Workspace access is serialized with a mutex spanning size→zero→launch, so concurrent linear/tree dispatches on different streams cannot corrupt each other's reduction (the per-call DeviceMem that PR-D replaced was concurrency-safe; the owned buffer needed the guard back).
Atomic launch now static_asserts a row-major C — the hipMemset2D C-reset assumes row-major and would silently miszero a column-major C under atomic accumulation.
HIP memset return codes are checked in the reset preprocess (both launch overloads); a failed reset now throws instead of being (void)-discarded.
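
The first two fixes (zero before every dispatch, and one lock spanning size→zero→launch) follow a common ownership pattern. The sketch below models it in plain Python with a `bytearray` instead of device memory; `WorkspaceOwner` and `fake_reduction` are hypothetical names, and the real Dispatcher does this in C++ through ensure_workspace and HIP calls.

```python
import threading


class WorkspaceOwner:
    """Hypothetical stand-in for a dispatcher-owned reduction workspace."""

    def __init__(self):
        self._buf = bytearray()
        self._lock = threading.Lock()
        self.allocations = 0

    def run(self, required_bytes, launch):
        # One lock spans size -> zero -> launch, so a concurrent caller
        # cannot regrow or overwrite the buffer in the middle of a reduction.
        with self._lock:
            if required_bytes > len(self._buf):
                self._buf = bytearray(required_bytes)  # grow on demand
                self.allocations += 1
            view = memoryview(self._buf)[:required_bytes]
            view[:] = bytes(required_bytes)  # zero before every dispatch
            return launch(view)


def fake_reduction(ws):
    """Pretend kernel: reports whether it saw a clean workspace, then dirties it."""
    clean = not any(ws)
    for i in range(len(ws)):
        ws[i] = 0xFF
    return clean


owner = WorkspaceOwner()
results = [
    owner.run(64, fake_reduction),   # first call allocates
    owner.run(32, fake_reduction),   # fits, reuses the dirty buffer after zeroing
    owner.run(128, fake_reduction),  # too small, grows
]
```

Every launch sees a zeroed workspace even though the previous launch left it dirty, and the buffer is reallocated only when a request exceeds its current size.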

Commits: 111e1f48ce8 (workspace zero + mutex + row-major assert), aa0181a8906 (memset return checks). Re-verified on MI300X (gfx942): fp16 rcr atomic/linear/tree register, dispatch, and verify PASS across 3840×4096×2048 and 128×128×16384.

The GEMM bridge for Stream-K is being developed in a separate PR; this PR is scoped to the Stream-K deep-core dispatcher only.
Land a true TE-vs-dispatcher perf-parity sweep across the matched tile set (not just the single fp16 rcr atomic config) so the perf claim matches the breadth of the correctness claim.
Close the tile-coverage gap by feeding the missing TE tiles into the codegen tile list (fp16/bf16 rcr 124 TE-only, ccr similar; fp8/bf8 closer but still short).
Once signed off, delete tile_engine/ops/gemm_streamk/.
Continue toward a first-class dispatcher GEMM interface folder (roadmap step 5).
Optional follow-ups from the review (non-blocking): give validate() a numeric path or an explicit "no-reference" return instead of reusing supports(); de-duplicate the reset lambda across the two generated launch overloads.

Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the dispatcher way: a single-GEMM launch(args, stream) that allocates the reduction workspace internally via DeviceMem (GetWorkSpaceSize / SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an atomic-reduction preprocess that resets C between timed iterations. No external workspace pointer (not the Tile Engine way).

- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants stream_k, naming, includes, _launch_function_streamk, variant->operator map, cshuffle-only config selection, and A/B/CLayout export in the CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel driver that calls SelectedKernel::launch and verifies vs reference_gemm.

Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic 128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50: dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.

Copilot

Pull request overview

This PR extends the Composable Kernel dispatcher’s unified GEMM codegen to generate a Stream-K GEMM variant (workspace owned internally via DeviceMem, launch(args, stream) API), and adds a minimal standalone C++ driver that builds/runs a single generated Stream-K kernel header.

Changes:

Added GEMM_STREAMK operator type and tile constraints to the architecture filter.
Added stream_k as a selectable variant in unified_gemm_codegen.py, including naming, includes, variant→operator mapping, config selection restrictions (cshuffle-only), and a Stream-K launch-path implementation.
Added 03_streamk_gemm_driver.cpp example for single-kernel include builds, benchmarking, and reference verification.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
projects/composablekernel/dispatcher/examples/gemm/cpp/03_streamk_gemm_driver.cpp	New minimal driver that includes a generated Stream-K header and runs/validates a single Stream-K GEMM kernel.
projects/composablekernel/dispatcher/codegen/unified_gemm_codegen.py	Adds Stream-K variant plumbing and a generated launcher that allocates/zeros workspace internally and launches via `launch_kernel_time_mask`.
projects/composablekernel/dispatcher/codegen/arch_filter.py	Introduces `OperatorType.GEMM_STREAMK` and associated tile constraints for arch filtering.


ozturkosu · 2026-06-18T03:47:06Z

+    const ck_tile::index_t M = std::stoi(get_opt(argc, argv, "--m", "3840"));
+    const ck_tile::index_t N = std::stoi(get_opt(argc, argv, "--n", "4096"));
+    const ck_tile::index_t K = std::stoi(get_opt(argc, argv, "--k", "2048"));
+    const int warmup         = std::stoi(get_opt(argc, argv, "--warmup", "10"));
+    const int repeat         = std::stoi(get_opt(argc, argv, "--repeat", "50"));


Fixed in 23afd2c: M/N/K are now parsed with std::stoll (then stored as ck_tile::index_t) in both 03_streamk_gemm_driver.cpp and 04_streamk_registry_driver.cpp, so large GEMM dimensions no longer make std::stoi throw std::out_of_range once they exceed the int range. (--warmup/--repeat stay std::stoi as they are genuinely int.)

ozturkosu · 2026-06-18T03:47:14Z

+        // Atomic reduction accumulates into C, so reset buffers before each run.
+        auto reset_data_buffers = [&]() {{
+            if constexpr (ReductionStrategy == ck_tile::StreamKReductionStrategy::Atomic) {{
+                (void)hipMemsetAsync(args.e_ptr, 0,
+                    args.M * args.N * sizeof(CDataType), stream.stream_id_);
+            }} else {{
+                workspace_dev.SetZero();
+            }}
+        }};


Resolved in e8120f5. The atomic C-reset in _launch_function_streamk (both the 2-arg internal and 3-arg external launch overloads) is now stride-aware: it uses hipMemset2DAsync with pitch stride_E * sizeof(CDataType), width N * sizeof(CDataType), height M instead of a flat M*N memset. A padded / non-contiguous row-major C is now zeroed correctly, and contiguous rcr is covered identically. Verified on MI300X/gfx942 (atomic verifies vs reference for the 128x128x64_2x2x1_32x32x16 kernel).
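
The difference between the flat and the stride-aware reset is easy to see on a small padded buffer. The sketch below simulates both access patterns on a Python `bytearray`; it does not call HIP, and the helper names and the tiny 4x6 shape with stride 8 are made up for illustration.

```python
def flat_memset(buf, m, n, elem_size):
    """Zero the first M*N elements as one contiguous block."""
    count = m * n * elem_size
    buf[:count] = bytes(count)


def memset_2d(buf, pitch_bytes, width_bytes, height):
    """Zero width_bytes at the start of each of `height` rows spaced pitch_bytes apart."""
    for row in range(height):
        start = row * pitch_bytes
        buf[start:start + width_bytes] = bytes(width_bytes)


def dirty_used_bytes(buf, m, n, stride, elem_size):
    """Count nonzero bytes inside the logical MxN region of a row-major C."""
    total = 0
    for row in range(m):
        start = row * stride * elem_size
        total += sum(1 for b in buf[start:start + n * elem_size] if b)
    return total


M, N, STRIDE_E, ELEM = 4, 6, 8, 2  # padded row-major C: stride_E > N
size = M * STRIDE_E * ELEM

flat = bytearray(b"\xff" * size)
flat_memset(flat, M, N, ELEM)
left_by_flat = dirty_used_bytes(flat, M, N, STRIDE_E, ELEM)

pitched = bytearray(b"\xff" * size)
memset_2d(pitched, STRIDE_E * ELEM, N * ELEM, M)
left_by_2d = dirty_used_bytes(pitched, M, N, STRIDE_E, ELEM)
```

In this example the flat reset spends part of its byte budget on padding and leaves the whole last row of the used region dirty, while the pitched reset clears exactly the M×N region and leaves the padding untouched. When stride_E equals N, both patterns cover the same bytes.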

Two Copilot findings on the Stream-K codegen carried in from #8094:

- 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi) before narrowing to ck_tile::index_t; stoi throws std::out_of_range past INT_MAX, needlessly rejecting large GEMM sizes.
- unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's per-iteration C reset zeroed args.M*args.N as a flat contiguous block, which skips elements when C has a padded leading dimension and corrupts the accumulation. Zero the used MxN region honoring stride_E via hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status instead of discarding it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ozturkosu · 2026-06-15T22:00:02Z

Superseded by #8136 — keeping open until that lands

This PR's stream-K codegen (commit d0c884c7ff) is carried byte-identical into #8136 ([CK_TILE] Add Tile Engine → Dispatcher bridge for Stream-K GEMM) — verified file-by-file on arch_filter.py, unified_gemm_codegen.py, and 03_streamk_gemm_driver.cpp.

#8136 is a superset: it contains this codegen and adds the full Python bridge that this PR doesn't have:

extern "C" .so ABI (streamk_gemm_ctypes_lib.cpp)
Python ctypes routing (gemm_utils.py / ctypes_utils.py variant threading)
the 3-phase TE driver + subprocess worker (streamk_gemm_full_benchmark.py, run_one_streamk_gemm_kernel.py)

It then evolves the codegen further (selectable atomic/linear/tree reduction strategy + Copilot fixes; +102/−18 on unified_gemm_codegen.py), plus --verify, multi-GPU, and ramped benchmarking.

Why keep this open for now: #8136 is stacked on #8123, which isn't in develop yet, so #8136 can't merge until that stack lands. This PR is the independent (off-develop) path for the bare codegen. I'll close this once #8136 merges — at that point it's pure redundancy.

…rategy fields

First slice of moving Stream-K into the dispatcher core (registry-addressable), per the deep-core checklist. Additive and inert by default:

- KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm gains streamk / reduction_strategy / workspace. tie() includes them so the three strategies are distinct keys. encode_identifier() appends the Stream-K suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk so non-Stream-K identifiers are unchanged.
- Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k().

Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical; atomic/linear/tree suffixes correct; tie() distinguishes strategies.
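
The suffix scheme in this commit can be illustrated with a few lines of Python. This is a sketch of the naming rule only, not the actual KernelNaming.generate() or encode_identifier() code; `with_streamk_suffix` and the base name used below are hypothetical.

```python
from enum import Enum


class ReductionStrategy(Enum):
    NONE = "none"
    ATOMIC = "atomic"
    LINEAR = "linear"
    TREE = "tree"


# Suffixes as listed in the commit message.
_SUFFIX = {
    ReductionStrategy.ATOMIC: "_streamk",
    ReductionStrategy.LINEAR: "_streamk_linear",
    ReductionStrategy.TREE: "_streamk_tree",
}


def with_streamk_suffix(base_name, streamk, strategy):
    """Append the Stream-K suffix only when `streamk` is set.

    Non-Stream-K identifiers are returned unchanged. A Stream-K request
    with ReductionStrategy.NONE raises KeyError in this sketch.
    """
    if not streamk:
        return base_name
    return base_name + _SUFFIX[strategy]


base = "gemm_example_kernel"  # hypothetical base identifier
names = {s: with_streamk_suffix(base, True, s) for s in _SUFFIX}
```

Because the three strategies map to three different strings, they can sit side by side in a name-keyed registry, and the guard on `streamk` keeps every existing identifier identical.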

ozturkosu · 2026-06-17T05:03:28Z

Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile unchanged: - get_workspace_size(Problem) -> bytes (default 0) - run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default forwards to the existing no-workspace run(). The Dispatcher invokes these through a base KernelInstance* pointer (so the new overload is visible despite derived 6-arg run() overrides). The Stream-K backend (PR-C) overrides both to size and bind the reduction workspace. Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the pre-existing pure virtuals compiles; default get_workspace_size==0 and the workspace-run forwards correctly via base pointer.

ozturkosu · 2026-06-17T05:05:20Z

Deep-core Stream-K integration — progress (PR-B) ✅

Commit 188033c9a5 — PR-B: KernelInstance workspace virtuals (additive, default-forwarding; existing kernels unchanged).

Deep-core checklist

C++ Dispatcher Core

KernelKey SK fields + tie() + guarded encode_identifier() — 3af08bbf65
Problem SK fields + ProblemBuilder::stream_k() — 3af08bbf65
KernelInstance::get_workspace_size() (default 0) + workspace-aware run() overload (default forwards) — 188033c9a5
Dispatcher::run() workspace alloc + pre-launch reset hook (PR-D)

C++ Backend

generated_tile_backend_streamk.hpp building StreamKHostArgs (PR-C)
occupancy grid via StreamKTilePartitioner; get_workspace_size via kernel GetWorkSpaceSize (PR-C)

Codegen / registration

register SK instances keyed by encode_identifier()==KERNEL_NAME; pull in atomic/linear/tree reduction_strategy codegen (PR-C)

Functional (re-homed from bridge into the core)

dispatcher-owned workspace (PR-C/D) · runtime grid sizing (PR-C) · pre-rep reset (PR-D)

Validation: gfx90a (hipcc 7.12) — a concrete KernelInstance overriding only the pre-existing pure virtuals compiles; get_workspace_size()==0 by default and the workspace run() forwards via base pointer (the Dispatcher's call path).

Next: PR-C — the heavy one: new generated_tile_backend_streamk.hpp (StreamKHostArgs), codegen registration, and importing the reduction_strategy (atomic/linear/tree) codegen from #8136 so linear/tree are generatable here.

…tree codegen Adds the C++ backend that lets Stream-K ride the registry, plus the reduction strategy codegen needed to generate the three variants on this branch. - generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the ABI-incompatible args the GemmHostArgs path could not). supports() gates on Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the registry and the Dispatcher's first-fit selection picks the requested one. create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>. - codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field, key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree" (matches KernelKey::encode_identifier from PR-A), per-strategy StreamKReductionStrategy in the generated launch, and a streamk_config sweep axis. (Ported from the bridge branch reduction-strategy work.) PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those to Dispatcher::run() via get_workspace_size()/the workspace-aware run(). Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584 tree headers with correct names; the backend device-compiles (22s) against a generated header and supports() accepts the matching strategy while rejecting the others and non-Stream-K problems.

ozturkosu · 2026-06-17T05:28:18Z

…pace

Relocate the Stream-K reduction-workspace buffer from the per-call generated launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived dispatcher stops paying a hipMalloc/hipFree on every invocation.

- codegen: hoist the StreamKGemmKernel type to struct scope and add GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace) overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the bridge ctypes lib and the standalone 03 driver keep working.
- backend: override get_workspace_size() and the workspace-aware run(); the no-workspace run() delegates with a null buffer. The per-iteration reset stays in the backend (it needs CDataType + the reduction strategy).
- dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out of the public header), size it via get_workspace_size(), and pass it through run_fused()/run_explicit(); free it in the destructor. Atomic needs none (size 0 -> null -> internal path); linear/tree consume the owned buffer.

Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at unchanged perf, with linear/tree now running on the dispatcher-owned workspace.

…river Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path (Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch -> verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the registry selection and the Dispatcher-owned workspace. Selectable strategy via --strategy {atomic,linear,tree}. Validated on MI210/gfx90a for all three (distinct registry identifiers, each PASS).

ozturkosu · 2026-06-17T19:06:29Z

Pushed PR-D + PR-E of the Stream-K deep-core integration (head cd6ef22af9).

PR-D e36b4740d6 — Dispatcher-owned reduction workspace. Codegen hoists StreamKGemmKernel to struct scope and adds GetWorkSpaceSize() + a 3-arg launch(args, cfg, workspace); the 2-arg launch is unchanged. The SK backend overrides get_workspace_size() and the workspace-aware run(); the Dispatcher owns a grow-on-demand buffer (freed in its destructor) and threads it through run_fused/run_explicit. Atomic needs none (size 0 → internal path); linear/tree consume the owned buffer.
PR-E cd6ef22af9 — 04_streamk_registry_driver.cpp, a runnable proof of the full path: registry → Dispatcher::run() → first-fit supports() gate → SK backend → verify vs reference_gemm.

Validated on MI210/gfx90a @ 3840×4096×2048, all PASS: atomic 79.2, linear 86.4, tree 86.6 TFLOPS — linear/tree now run on the dispatcher-owned workspace at unchanged perf. gfx942/MI300X re-validation still pending.

…K backend The dispatcher-wrapper generator emitted ONE template for every variant: backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy on the key. For Stream-K that is wrong twice over -- the regular backend calls launch(GemmHostArgs,...) which the SK kernel struct does not have (so the aggregate register_all_kernels.hpp would not compile against SK), and the key omits the SK fields so encode_identifier() emits no _streamk suffix and atomic/linear/tree collide in the registry. Make the wrapper variant-aware: for STREAM_K configs include generated_tile_backend_streamk.hpp, set key.algorithm.streamk + reduction_strategy + workspace (and pad flags for identifier parity), and return create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>. All other variants are unchanged. Validated on MI210/gfx90a: a registry populated via the generated wrappers holds atomic+linear+tree side by side; Dispatcher::run() selects each by Problem::reduction_strategy and all three verify vs reference_gemm.

…are atomic reset P2: GeneratedStreamKKernelInstance::supports() now ends with SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs is rejected during selection, so first-fit falls back to a non-Stream-K kernel instead of throwing std::runtime_error at launch. P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync (pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage for the contiguous rcr case. Applied to both the internal and external-workspace launch overloads. Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a multi-kernel registry; valid small problems are accepted (no false-negatives).

…eductionStrategy) Close two review nits on the Stream-K drivers: - Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so large GEMM dimensions no longer overflow/throw int range (Copilot nit). - Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04 driver through it, removing the driver-local strategy_name() duplicate so callers share one spelling that matches the codegen suffix scheme.

Adds dispatcher_test_streamk_registry, a GPU test that generates the three reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the 04 registry driver once per strategy (each force-including its own header, since SkReductionStrategy is a compile-time constexpr), and asserts for each that the encode_identifier() suffix matches, the Dispatcher selects it by Problem::reduction_strategy, and the result verifies against the reference. This converts the previously manual deep-core validation into a regression- guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only CI is unaffected.

ozturkosu · 2026-06-18T03:47:23Z

Commit `23afd2cf86` — review-nit cleanup (stoll + shared `to_string`)

Closes the open Copilot review nit and a noted gotcha:

std::stoi → std::stoll for M/N/K parsing in 03_streamk_gemm_driver.cpp and 04_streamk_registry_driver.cpp, so large GEMM dimensions no longer throw once they exceed the int range. --warmup/--repeat stay std::stoi (genuinely int).
to_string(ReductionStrategy) added inline in kernel_key.hpp, and the 04 driver now routes through it — removing the driver-local strategy_name() duplicate so callers share one spelling that matches the codegen suffix scheme.

No functional change to the deep-core path; non-Stream-K kernels are untouched.

ozturkosu · 2026-06-18T03:47:38Z

Commit `dc2de14983` — automated CTest for the Stream-K registry path + gfx942/MI300X validation

New test: `dispatcher_test_streamk_registry`

Converts the previously manual deep-core validation into a regression-guarded CTest (tests/test_streamk_registry.py, wired into tests/CMakeLists.txt). It:

generates the three reduction-strategy kernels (atomic/linear/tree) from one tile config via --tile-config-json + streamk_config.reduction_strategy;
builds the 04 registry driver once per strategy, each force-including its own header — necessary because SkReductionStrategy is a compile-time constexpr, so each strategy is a distinct compiled kernel;
for each strategy asserts the encode_identifier() suffix matches (..._streamk[_linear|_tree]), the Dispatcher selects it by Problem::reduction_strategy, and the result verifies against the reference GEMM.

It SKIPs (return 77, SKIP_RETURN_CODE) when no GPU/hipcc is present, so CPU-only CI is unaffected.
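
The skip behavior relies on CTest's SKIP_RETURN_CODE test property: a test process that exits with the configured code is reported as skipped rather than failed. A minimal sketch of the Python side might look like the following. It only probes for hipcc on PATH (the GPU probe is omitted), and `preflight` and `main` are illustrative names rather than the functions in test_streamk_registry.py.

```python
import shutil
import sys

SKIP_RETURN_CODE = 77  # must match the SKIP_RETURN_CODE set for the test in CMake


def preflight(which=shutil.which):
    """Return 77 when hipcc is not on PATH, else None to continue."""
    if which("hipcc") is None:
        print("SKIP: hipcc not found", file=sys.stderr)
        return SKIP_RETURN_CODE
    return None


def main():
    code = preflight()
    if code is not None:
        return code
    # ... generate kernels, build the driver, run and check it here ...
    return 0
```

A script built this way would end with `sys.exit(main())`, so a CPU-only machine reports the test as skipped. The `which` parameter exists only so the check can be exercised without a ROCm install.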

Validation on MI300X / gfx942 (ROCm 7.12.0) — closes the gfx942 gap

End-to-end deep-core path (registry → Dispatcher::run → SK backend → verify), fp16/rcr 128x128x64_2x2x1_32x32x16 @ 3840×4096×2048, each strategy in its own compiled kernel:

strategy	identifier	TFLOPS	verify
atomic	`…_streamk`	265.6	PASS
linear	`…_streamk_linear`	306.6	PASS
tree	`…_streamk_tree`	306.4	PASS

Too-small problem (64³) is rejected at selection (supports() → false via IsSupportedArgument), so first-fit falls back gracefully instead of throwing at launch — confirmed on gfx942.
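
The selection behavior described here (reject at supports(), then fall through to the next candidate) can be modeled in a few lines. The sketch below is a toy, not the dispatcher's registry: the `min_dim` threshold is a made-up stand-in for IsSupportedArgument, and it assumes regular kernels accept a problem regardless of its Stream-K request fields, which the PR does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    m: int
    n: int
    k: int
    streamk: bool = False
    reduction_strategy: str = "none"


@dataclass(frozen=True)
class Kernel:
    name: str
    streamk: bool = False
    reduction_strategy: str = "none"
    min_dim: int = 0  # hypothetical stand-in for IsSupportedArgument

    def supports(self, p):
        if not self.streamk:
            return True  # assumption: regular kernels ignore Stream-K fields
        return (p.streamk
                and p.reduction_strategy == self.reduction_strategy
                and min(p.m, p.n, p.k) >= self.min_dim)


def select(registry, problem):
    """First-fit: return the first kernel whose supports() accepts the problem."""
    for kernel in registry:
        if kernel.supports(problem):
            return kernel
    return None


registry = [
    Kernel("sk_atomic", True, "atomic", min_dim=128),
    Kernel("sk_linear", True, "linear", min_dim=128),
    Kernel("sk_tree", True, "tree", min_dim=128),
    Kernel("regular"),
]

big = select(registry, Problem(3840, 4096, 2048, True, "tree"))
small = select(registry, Problem(64, 64, 64, True, "tree"))
plain = select(registry, Problem(3840, 4096, 2048))
```

Rejecting during selection keeps the failure out of the launch path: the caller gets a working kernel (or an explicit "none found") instead of an exception after a kernel has already been chosen.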

Note: the earlier MI210 numbers and any run that force-includes a single header and passes --strategy linear/tree only relabel the registry key; the compiled kernel stays whatever the included header baked in. The new test (and the table above) build each strategy from its own header, so this is a genuine per-strategy validation.

… driver The standalone stream-K driver verified atomic results with the single-pass GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split partials directly into low-precision C (workspace size 0), incurring rounding error that grows with the split factor -- correct results were flagged FAIL on small-M/N, large-K shapes (e.g. 512x512x8192) where tiles < CUs. Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with the split-K CDataType accumulation term, and take the max with the per-split tolerance. The driver and tile_engine now verify identically; the kernel is unchanged.

…gine The standalone stream-K driver built its stream_config as {stream, true, 0, warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults (flush_cache=false, rotating_count=1). The tile_engine benchmark instead times with flush_cache=true and rotating_count=1000, so the driver measured a warm-cache best case while tile_engine measured cold-cache -- the entire source of the reported dispatcher-vs-TE "performance gap" at low tile counts. Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine values) and pass them through stream_config so both sides use identical timing methodology. A validating run still times a single cold shot, mirroring tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0 pass.

The 04 registry driver hardcoded the KernelKey signature to DataType::FP16 and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and layout tags from the generated kernel's actual A/B/C types via compile-time dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate in fp32 and write fp16 C, matching Tile Engine). Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent core objects built once; per-dtype codegen + build + verify with per-dtype identifier assertions). All four datatypes register, dispatch, and verify across atomic/linear/tree on gfx942 (MI300X).

Parametrize the registry test over all four layouts Tile Engine builds Stream-K for (rcr/rrr/ccr/crr) in addition to datatypes and reduction strategies. Full coverage is now {fp16,bf16,fp8,bf8} x {rcr,rrr,ccr,crr} x {atomic,linear,tree}; all four layouts keep C row-major, which the atomic C-reset relies on. The encode_identifier assertion is generalized to {dtype}_{layout}, and --datatypes/--layouts flags allow trimming the matrix for faster CI runs.

make_args() hardcoded rcr leading dims (stride_a=K, stride_b=K, stride_c=N) for every layout, so non-rcr Stream-K kernels (rrr/ccr/crr) ran with wrong strides and failed verification. Derive the leading dims from the kernel key's layouts instead: A is MxK (row->K, col->M), B is KxN (row->N, col->K), C is MxN (row->N, col->M). rcr is unchanged; rrr/ccr/crr now match the host tensor strides the driver builds via get_default_stride. Verified on gfx942/MI300X: full {fp16,bf16,fp8,bf8} x {rcr,rrr,ccr,crr} x {atomic,linear,tree} matrix (48/48) registers, dispatches, and verifies.
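The layout-to-leading-dimension rule stated above can be written out as a small Python sketch. The real fix lives in the C++ backend's make_args(); the function name and the three-letter layout string used here are illustrative assumptions.

```python
def leading_dims(layout: str, m: int, n: int, k: int) -> tuple[int, int, int]:
    """Return (stride_a, stride_b, stride_c) for a layout string such as "rcr".

    A is MxK (row -> K, col -> M), B is KxN (row -> N, col -> K),
    C is MxN (row -> N, col -> M).
    """
    if len(layout) != 3 or any(ch not in "rc" for ch in layout):
        raise ValueError(f"unsupported layout: {layout!r}")
    a, b, c = layout
    stride_a = k if a == "r" else m
    stride_b = n if b == "r" else k
    stride_c = n if c == "r" else m
    return stride_a, stride_b, stride_c


# rcr reproduces the previously hardcoded values (K, K, N).
rcr = leading_dims("rcr", 3840, 4096, 2048)
```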

Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry + validation driver). 3-way merge over the shared stream_k ancestor; only the streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp conflicted -- both resolved to the deep-core side:

- codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize + IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib calls, and adds the 3-arg dispatcher-owned-workspace launch.
- driver takes deep-core's stoll parse + apple-to-apple timing + validate cold shot.

Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly, so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree + regular gemm all generate cleanly (0 failed).

Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.

GpuGemmRunner (dispatcher/python/gemm_utils.py):

- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887): bit-exact decode tables + nearest-representable/saturating encode, carried as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store, int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350 uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).

Reference + surface:

- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.

int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile for every reduction strategy -- warp_gemm_dispatcher has no Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3 path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the PR #8094 decision to leave int8 out.

GPU-validated on gfx942 (MI300X), 2048^3, both reduction + layout variants:

- fp8 atomic/linear/tree rcr: PASS (192/180/183 TFLOPS, max_rel <= 9.4e-4)
- bf8 atomic/linear/tree rcr: PASS (192/181/181 TFLOPS, max_rel <= 7.8e-4)
- fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)
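For readers unfamiliar with the FNUZ formats, here is a minimal Python sketch of a decode table and a saturating nearest-value encode. It follows the commonly documented FNUZ conventions (bias 8 for E4M3 and 16 for E5M2, no infinities, bit pattern 0x80 reserved for NaN). It is not the codec in gemm_utils.py; the function names are hypothetical and the tie-breaking in `encode_nearest` is an assumption (first match wins, not round-to-nearest-even).

```python
import math


def decode_fnuz(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one 8-bit FNUZ pattern (sign + exp_bits + man_bits) to a float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in [0, 255]")
    if byte == 0x80:
        return math.nan  # FNUZ has no negative zero; 0x80 encodes NaN
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = 1 << (exp_bits - 1)  # 8 for E4M3, 16 for E5M2
    if exp == 0:  # subnormal: no implicit leading one
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


FP8_E4M3_FNUZ = [decode_fnuz(b, 4, 3) for b in range(256)]
BF8_E5M2_FNUZ = [decode_fnuz(b, 5, 2) for b in range(256)]


def encode_nearest(x: float, table: list) -> int:
    """Saturating encode: clamp to the finite range, then pick the closest entry."""
    finite = [(b, v) for b, v in enumerate(table) if not math.isnan(v)]
    max_value = max(v for _, v in finite)
    x = max(-max_value, min(max_value, x))
    return min(finite, key=lambda bv: abs(bv[1] - x))[0]
```

A dtype-aware reference in the spirit of `decode(encode(x))` would then be `FP8_E4M3_FNUZ[encode_nearest(x, FP8_E4M3_FNUZ)]` for each element.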

ozturkosu · 2026-06-28T01:28:37Z

Stream-K dispatcher ⇄ Old-TE equivalence — validation summary (MI300X + MI350)

Validated the Stream-K dispatcher against legacy Tile Engine (Old-TE) across datatypes {fp16, bf16, fp8, bf8} × layouts {rcr, rrr, ccr, crr} × reduction {atomic, linear, tree} on MI300X (gfx942) and MI350 (gfx950).

Equivalence verdict: ✅ 100% equivalent to Old-TE

There is no case where Old-TE works but the dispatcher fails.

| arch · dtype | Old-TE | Dispatcher | equivalent |
|---|---|---|---|
| gfx942 · fp16/bf16/fp8/bf8 | works | works | ✅ both pass |
| gfx950 · fp16/bf16 | works | works | ✅ both pass |
| gfx950 · fp8/bf8 | fails (4× ref) | fails (4× ref) | ✅ both fail identically |

MI300X (gfx942) — fully passing

48/48 registry combos (dtype×layout×strategy) register, dispatch, and verify against ck_tile::reference_gemm.
TE-vs-dispatcher A/B (matched config per dtype, warmup=50, repeat=100, interleaved, M=3840, N=4096, K=2048): parity within ±3% across the 48 combos, with one exception: the dispatcher is consistently ~9% faster on fp16/crr/atomic (re-measured 3×). The others collapse to parity.

MI350 (gfx950) — fp16/bf16 pass; fp8/bf8 fail (pre-existing shared bug)

fp16 + bf16: all verify correct on both engines.
fp8 + bf8: both Old-TE and the dispatcher fail with output = exactly 4× reference (Old-TE fp8 out 2792 vs ref 698; dispatcher identical). This is a shared, pre-existing ck_tile kernel bug, not introduced here — both engines instantiate the same StreamKKernel.
Root cause isolation (K-sweep): the fp8 error reproduces even at K=64 (single k-tile, no stream-K split) while the fp16 control passes at every K → the defect is in the fp8/bf8 compute/epilogue path on CDNA4, not the stream-K K-split reduction. Warp-tile-independent (16/32/64 all 4×).
Action: this gfx950 fp8/bf8 4× defect should be filed as a separate upstream ck_tile issue. It does not affect dispatcher⇄TE equivalence (Old-TE is equally affected).

Code in this PR

9d033fde99 — dtype support fp16/bf16/fp8/bf8 (derive KernelKey dtype/layout from generated types).
7190a19edd — registry test parametrized over all four layouts.
8595984d5c — backend make_args derives strides from layout (fixes rrr/ccr/crr; found by independent review).

Full report + CSV: MLSE Confluence (PR #8094 Stream-K equivalence page).

…wing, timing, coverage)

- 04 registry driver: port the split-K-aware verification tolerance from the 03 driver (a0ff521) so the deep-core path uses kbatch-derived rtol/atol instead of the plain single-pass threshold that spuriously FAILs correct atomic results on small-M/N, large-K shapes.
- Stream-K backend make_args: guard the int64->int32 (index_t) narrowing of M/N/K and derived leading dims; throw on overflow instead of silently wrapping (the parser was widened to std::stoll for exactly this reason).
- Stream-K backend run(): flush the L2 between timed iterations so the measurement is cold like tile_engine/the 03 driver (warm cache over-reported TFlops); document that this path is cold-but-non-rotated and not the calibrated apple-to-apple perf surface.
- codegen: document the Stream-K tile-coverage limitation (dispatcher emits a narrower tile surface than TE; equivalence is per matched tile config).

Verified on MI300X (gfx942): fp16 rcr atomic/linear/tree register, dispatch, and verify PASS.

ozturkosu · 2026-06-30T07:41:02Z

Addressed review majors (commit `ae7b571`)

Pushed fixes for the four "major" findings from the deep-core review.

M1 — 04_streamk_registry_driver.cpp used the non-split-K tolerance.
Ported the kbatch-aware calculate_rtol_atol from the 03 driver (a0ff521) so the deep-core registry path uses max(per-split, split-K) rtol/atol derived from the kernel's own tile partitioner, instead of the plain get_relative/absolute_threshold(K) that can spuriously FAIL correct atomic results on small-M/N + large-K.

M4 — silent int64→int32 narrowing in make_args.
Problem::M/N/K are int64_t; ck_tile::index_t is int32_t. Added a checked to_idx() that throws on overflow for M/N/K and the derived leading dims, instead of silently wrapping (the parser was widened to std::stoll for exactly this reason).

M2 — registry run() baked in warm-cache timing.
Set flush_cache_ = bench so timed iterations are cold like tile_engine / the 03 driver (warm cache was what over-reported TFlops). Kept rotating_count_ = 1 (atomic accumulates into C and this path also feeds caller verification) and documented that it is therefore cold-but-non-rotated — not the calibrated apple-to-apple surface; use the 03 driver / --validate 0 for TE-comparable numbers.
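The checked narrowing from M4 can be sketched in a few lines of Python. The PR's `to_idx()` is a C++ helper in the Stream-K backend; this version only mirrors the idea (reject values that do not fit a signed 32-bit index instead of wrapping), and its signature and error type are assumptions.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def to_idx(value: int, name: str) -> int:
    """Return `value` unchanged if it fits in a signed 32-bit index, else raise."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{name}={value} does not fit in a 32-bit index")
    return value


# Typical dimensions pass through untouched.
m = to_idx(3840, "M")
```

Applying the same check to the derived leading dimensions catches cases where M, N, and K each fit but a stride computed from them does not.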

Validation on MI300X (gfx942)

Run via test_streamk_registry.py inside the CK ROCm container on an allocated gfx942 node:

fp16 rcr atomic/linear/tree @ 3840×4096×2048 — PASS (270 / 299 / 284 TFlops)
fp16 rcr atomic/linear/tree @ 128×128×16384 (small-M / large-K, stresses the M1 split-K tolerance) — PASS
bf16 × {rcr,rrr,ccr,crr} × {atomic,linear,tree} (12 combos) — PASS

Still open from the review (not in this commit): rebase onto develop (B1), gate/track gfx950 fp8/bf8 (B2), and the redundancy decision vs #8136 (B3).

…tadata, arch-filter, test)

- Backend validate(): stop returning a blind "true". Without a host reference it cannot do a numeric check, but it now validates what it can -- non-null operands, a well-formed problem, and that this instance supports() it -- so an unrunnable config is not mis-reported as valid.
- 04 registry driver make_streamk_key(): derive wave_shape/block_size/transpose_c/double_buffer/persistent/preshuffle/num_wave_groups from the generated kernel's own static traits instead of hardcoding (the wave_shape was wrong, e.g. {2,2,1} vs the actual {1,4,1}); the registry identifier now describes the kernel that was built. pipeline/scheduler stay fixed (baked into the kernel type, not on the SK selection axis) -- now documented.
- arch_filter GEMM_STREAMK: document that the min_tile constraints are copied from plain GEMM and the real feasibility gate (enough tiles to partition K across CUs) is runtime IsSupportedArgument/supports(), not these numbers.
- test_streamk_registry.py: replace the fragile rocminfo line-splitting in detect_arch with a regex; verify each built kernel against the CLI shape AND a small-M/large-K (128x128x16384) shape that stresses the split-K tolerance.

Compile-validated for gfx942; small-M/large-K shape verified PASS on MI300X.

ozturkosu · 2026-06-30T16:30:48Z

Addressed review minors (commit `9470ae9`)

m1 — backend validate() no longer returns a blind true; it now checks non-null operands, problem.is_valid(), and supports(problem) (a host-reference numeric check stays the harness's job).
m2 — make_streamk_key() now derives wave_shape/block_size/transpose_c/double_buffer/persistent/preshuffle/num_wave_groups from the generated kernel's static traits instead of hardcoding (the wave_shape was wrong — {2,2,1} vs the actual {1,4,1}). pipeline/scheduler are baked into the kernel type and not on the SK selection axis — now documented.
m3 — documented that the arch_filter GEMM_STREAMK min_tile numbers are copied from plain GEMM and the real feasibility gate is runtime IsSupportedArgument/supports().
m4 — replaced the fragile rocminfo line-splitting in detect_arch with a regex.
m5 — the registry CTest now verifies each built kernel against the CLI shape and a small-M/large-K (128×128×16384) shape that stresses the split-K tolerance.

Compile-validated for gfx942; the small-M/large-K shape verifies PASS on MI300X.
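As an illustration of m4, a regex-based architecture lookup might look like the sketch below. The pattern, the function signature, and the sample text are assumptions; the actual `detect_arch` in test_streamk_registry.py may differ, and real `rocminfo` output contains many more fields.

```python
import re
from typing import Optional

# Match a whole gfx target token such as "gfx942" or "gfx90a".
_GFX_RE = re.compile(r"\bgfx[0-9a-f]{3,}\b")


def detect_arch(rocminfo_output: str) -> Optional[str]:
    """Return the first gfx target found in the text, or None if absent."""
    match = _GFX_RE.search(rocminfo_output)
    return match.group(0) if match else None


# Hypothetical excerpt, shaped loosely like rocminfo agent output.
sample = """
  Name:                    AMD EPYC (CPU agent, no gfx target)
  Name:                    gfx942
  Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
"""
arch = detect_arch(sample)
```

Searching the whole text avoids depending on column positions or on which line the target name appears.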

Document the Stream-K deep-core path (the feature this branch adds): generate kernels via unified_gemm_codegen --variants stream_k, build/run the 03 standalone driver (perf surface) and the 04 registry driver (Registry -> Dispatcher -> verify), run the CTest, plus the reduction strategies, supported dtypes/layouts, the cold-cache perf methodology, the split-K-aware tolerance, and known limitations (gfx950 fp8/bf8, tile-coverage gap). Linked from the dispatcher README.

…, thread-safety, row-major C assert)

Three correctness fixes from a tough review pass on the deep-core path:

1. Zero the reduction workspace in Dispatcher::ensure_workspace() before every Linear/Tree dispatch. The dispatcher previously only hipMalloc'd the buffer and relied on the backend's per-iteration preprocess reset to zero it, so on the non-benchmarking (nrepeat=1) path -- or with set_benchmarking(false) -- the buffer could be garbage and corrupt the reduction. Now correctness is independent of whether the preprocess runs, mirroring the internal DeviceMem::SetZero() the standalone launch already does.
2. Serialize the shared workspace_ buffer with a mutex. workspace_ is mutable state reused across const run() calls; two concurrent linear/tree dispatches on different streams could share one buffer and corrupt each other. The lock spans size->zero->launch (the buffer is in use for the whole run). Atomic / non-Stream-K paths use no workspace and take no lock.
3. Emit a static_assert(is_row_major<CLayout>) in the generated Atomic launch. The atomic C-reset uses a row-major hipMemset2D; a column-major C would be zeroed incorrectly and silently corrupt atomic accumulation. The assert is atomic-only (Linear/Tree zero the workspace, not C).

Verified on MI300X (gfx942): fp16 rcr atomic/linear/tree register, dispatch, and verify PASS across 3840x4096x2048 and 128x128x16384.
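The first two fixes (zero before every dispatch, and one lock spanning size, zero, and launch) can be modelled on the host with a short Python sketch. The class and method names are hypothetical, a `bytearray` stands in for device memory, and the real Dispatcher is C++ using hipMalloc and a mutex.

```python
import threading


class WorkspaceOwner:
    """Toy model of a dispatcher that owns one reusable reduction workspace."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._buf = bytearray()

    def run(self, workspace_size: int, kernel):
        """Call `kernel(workspace)`; the workspace is zeroed and held exclusively."""
        if workspace_size == 0:
            # Paths that need no workspace take no lock.
            return kernel(None)
        with self._lock:  # lock spans size -> zero -> launch
            if len(self._buf) < workspace_size:
                self._buf = bytearray(workspace_size)  # new buffers start zeroed
            else:
                self._buf[:workspace_size] = bytes(workspace_size)  # re-zero reused buffer
            return kernel(memoryview(self._buf)[:workspace_size])


owner = WorkspaceOwner()


def dirty_kernel(ws):
    ws[:] = b"\xff" * len(ws)  # leave garbage behind, as a real reduction would
    return True


owner.run(16, dirty_kernel)
clean_on_reuse = owner.run(16, lambda ws: all(b == 0 for b in ws))
```

Because zeroing happens inside `run()`, correctness does not depend on any separate per-iteration reset having executed first.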

…t preprocess The atomic C-reset (hipMemset2DAsync) and the linear/tree workspace reset (hipMemsetAsync) in the generated launch() were (void)-cast, so a failed reset silently corrupted atomic accumulation / the reduction buffer with no signal. Both now throw std::runtime_error on non-hipSuccess, consistent with the IsSupportedArgument failure path in the same function. Applied to both the internal-workspace and external-workspace launch overloads. Verified on MI300X (gfx942): fp16 rcr atomic/linear/tree compile and verify PASS.

therock-pr-bot · 2026-06-30T20:57:41Z

❌ PR Check — Action Required

| Check | Status | Details |
|---|---|---|
| 🌿 Branch Name | ❌ Fail | Branch name does not match allowed patterns. Branch: `muozturk/dispatcher-streamk-gemm`. Allowed patterns: `^users\/[A-Za-z0-9][A-Za-z0-9\-]\/.+`; `^shared\/.+`; `^[A-Za-z0-9][A-Za-z0-9\-_]$`; `^dependabot\/.+`; `^revert-[0-9]+-.+` |
| 📝 PR Title/Description | ❌ Fail | Error: Title does not follow Conventional Commits style. Expected: start with a valid type (feat, fix, docs, …). Desired format: `type(optional-scope): short description` |
| ⛔ Forbidden Files | ✅ Pass | — |
| 🧪 Unit Test | ✅ Pass | — |
| 🔎 pre-commit | ✅ Pass | — |
| 🚫 Draft PR | 🔜 To Be Enabled | — |
| 🚩 Feature Flag | 🔜 To Be Enabled | — |
| 📊 Code Coverage | 🔜 To Be Enabled | — |

⚠️ 2 policy check(s) failed. Please address the issues above before this PR can be Reviewed.

🚫 Please fix the failed policies

❌ Branch Name

❌ PR Title/Description

The Not ready to Review label was added to this PR. Once all policies pass, the label is removed automatically.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

therock-pr-bot · 2026-06-30T20:57:42Z

🚫 Please fix the failed policies before requesting reviews.

The following policy checks failed:

❌ Branch Name
❌ PR Title/Description

The Not ready to Review label has been added to this PR.
Once all policies pass, the label will be removed automatically.

ozturkosu · 2026-06-30T22:09:00Z

Superseded by #8985. This PR auto-closed when its branch was renamed to the policy-compliant path users/muozturk/ck-tile/dispatcher-streamk-gemm; the new PR carries the same commits (head ce8ba43e45).

github-actions Bot added the project: composablekernel label Jun 5, 2026

ozturkosu self-assigned this Jun 5, 2026

ozturkosu requested a review from Copilot June 5, 2026 06:50

Copilot started reviewing on behalf of ozturkosu June 5, 2026 06:50

Copilot AI reviewed Jun 5, 2026

assistant-librarian Bot added the organization: ROCm label Jun 5, 2026

ozturkosu added 2 commits June 17, 2026 13:46

ozturkosu added 2 commits June 17, 2026 16:12

ozturkosu force-pushed the muozturk/dispatcher-streamk-gemm branch from b74ada4 to e8120f5 Compare June 17, 2026 20:26

ozturkosu added 2 commits June 17, 2026 23:38

ozturkosu requested a review from arai713 June 18, 2026 17:31

ozturkosu marked this pull request as ready for review June 18, 2026 21:39

ozturkosu requested a review from a team as a code owner June 18, 2026 21:39

ozturkosu added 2 commits June 21, 2026 06:32

ozturkosu requested review from ecamartins and yraparti June 24, 2026 15:48

ozturkosu added 3 commits June 27, 2026 18:25

ozturkosu mentioned this pull request Jun 27, 2026

[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM #8136

Closed

5 tasks

ozturkosu requested a review from a team as a code owner June 30, 2026 18:11

github-actions Bot added the documentation label Jun 30, 2026

ozturkosu and others added 3 commits June 30, 2026 15:13

Merge branch 'develop' into muozturk/dispatcher-streamk-gemm

ce8ba43

ozturkosu changed the title ~~[CK_TILE] Add stream_k variant to GEMM Dispatcher codegen~~ feat(ck_tile): add stream_k variant to GEMM Dispatcher codegen Jun 30, 2026

ozturkosu closed this Jun 30, 2026

ozturkosu deleted the muozturk/dispatcher-streamk-gemm branch June 30, 2026 21:58

ozturkosu mentioned this pull request Jun 30, 2026

feat(ck-tile): add stream_k variant to GEMM Dispatcher codegen #8985

Open
