[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM #8136
ozturkosu wants to merge 31 commits into
Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the dispatcher way: a single-GEMM launch(args, stream) that allocates the reduction workspace internally via DeviceMem (GetWorkSpaceSize / SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an atomic-reduction preprocess that resets C between timed iterations. No external workspace pointer (not the Tile Engine way).
- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants stream_k, naming, includes, _launch_function_streamk, variant->operator map, cshuffle-only config selection, and A/B/CLayout export in the CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel driver that calls SelectedKernel::launch and verifies vs reference_gemm.
Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic 128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50: dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Routes the stream_k GEMM variant through the same TE->Dispatcher bridge as regular GEMM (Phase 1) and grouped GEMM (Phase 3). Stream-K is a single-problem GEMM with the same C ABI as regular GEMM, so the Python runner side is reused unchanged; only the .so internals and variant routing differ.
- streamk_gemm_ctypes_lib.cpp (new): same single-problem C ABI (dispatcher_run_gemm) but builds a ck_tile::StreamKHostArgs and calls SelectedKernel::launch(args, stream) directly, bypassing the registry (whose generated backend hard-codes the GemmHostArgs launch signature). The launch allocates the Atomic-reduction workspace internally; C is zeroed per run.
- gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for variant "stream_k"; .name appends _streamk; variant threaded through codegen_args and expand_sweep.
- ctypes_utils.py: pass the requested variant to codegen --variants instead of hard-coding "standard".
- TE driver/worker/config (new): streamk_gemm_full_benchmark.py, run_one_streamk_gemm_kernel.py, gemm_streamk/configs/default_config.json.
Validated end-to-end on gfx942/MI300X: full driver run 16/16 OK (4 kernels x 4 problems), name parity holds (.so name == config .name, ends _streamk). Numeric parity vs fp32 reference passes under an fp16 Atomic-reduction tolerance (max_rel <= 2.5e-3), which is wider than regular/grouped because Atomic does multiple fp16 atomic-adds per K-split. Tiny problems (e.g. 257^3) are correctly reported unsupported (status -2) by the kernel and surfaced gracefully.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
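The variant routing described for gemm_utils.py can be pictured with a short sketch. The two source file names come from the commit message; the lookup table, the fallback for unknown variants, and the helper names `ctypes_source_name` and `ctypes_source_path` are assumptions made for illustration, not the PR's actual code.

```python
from pathlib import Path

# File names are quoted from the commit message; the mapping itself is a
# hypothetical reconstruction of how the selection might be written.
_CTYPES_SOURCES = {
    "standard": "gemm_ctypes_lib.cpp",
    "stream_k": "streamk_gemm_ctypes_lib.cpp",
}


def ctypes_source_name(variant: str) -> str:
    """Return the bridge source for a variant (assumed: unknown variants use the regular lib)."""
    return _CTYPES_SOURCES.get(variant, "gemm_ctypes_lib.cpp")


def ctypes_source_path(dispatcher_root: Path, variant: str) -> Path:
    """Build <root>/bindings/ctypes/<source>, the directory layout shown in the PR diff."""
    return dispatcher_root / "bindings" / "ctypes" / ctypes_source_name(variant)
```

Because all configs in a sweep share one variant, a caller would only need to inspect the first config's variant to pick the source file.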
Pull request overview
This PR extends the Composable Kernel Tile Engine (TE) → Dispatcher “single source of truth” bridge to the Stream-K GEMM variant by adding Stream-K-aware codegen, a dedicated ctypes bridge that launches Stream-K kernels directly, and TE-side benchmark/worker plumbing that reuses the existing single-problem GEMM Python ABI.
Changes:
- Add Stream-K as a first-class GEMM codegen variant (--variants stream_k), including Stream-K kernel includes and a Stream-K launch wrapper that allocates/reuses internal workspace.
- Introduce a Stream-K-specific ctypes bridge (streamk_gemm_ctypes_lib.cpp) that bypasses the registry and calls SelectedKernel::launch(StreamKHostArgs, stream_config) directly while keeping the same C ABI as standard GEMM.
- Add TE benchmark driver/worker and a small default sweep config for Stream-K benchmarking via the dispatcher bridge.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| projects/composablekernel/tile_engine/ops/gemm/streamk_gemm_full_benchmark.py | New 3-phase TE benchmark driver for Stream-K using variant="stream_k" and subprocess isolation. |
| projects/composablekernel/tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py | New subprocess worker to run Stream-K kernels via the existing GpuGemmRunner ABI. |
| projects/composablekernel/tile_engine/ops/gemm/gemm_streamk/configs/default_config.json | New default Stream-K sweep config JSON (small config set). |
| projects/composablekernel/dispatcher/python/gemm_utils.py | Route Stream-K builds to the Stream-K ctypes source; thread variant into codegen args and sweep expansion. |
| projects/composablekernel/dispatcher/python/ctypes_utils.py | Pass requested variant into unified_gemm_codegen.py --variants (instead of hard-coding standard). |
| projects/composablekernel/dispatcher/examples/gemm/cpp/03_streamk_gemm_driver.cpp | New minimal standalone Stream-K driver demonstrating the “dispatcher way” launch path. |
| projects/composablekernel/dispatcher/codegen/unified_gemm_codegen.py | Add stream_k variant: naming, includes, launch wrapper, config filtering, CLI choices. |
| projects/composablekernel/dispatcher/codegen/arch_filter.py | Add operator type and constraints entry for Stream-K GEMM validation. |
| projects/composablekernel/dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp | New ctypes bridge for Stream-K that launches the force-included kernel directly and reports KERNEL_NAME. |
| # All configs in a sweep share one variant; route to the matching bridge lib. | ||
| ctypes_source = ( | ||
| _cu.get_dispatcher_root() / "bindings" / "ctypes" / "gemm_ctypes_lib.cpp" | ||
| _cu.get_dispatcher_root() | ||
| / "bindings" | ||
| / "ctypes" | ||
| / _ctypes_source_name(configs[0].variant) | ||
| ) |
| if constexpr (ReductionStrategy == ck_tile::StreamKReductionStrategy::Atomic) {{ | ||
| (void)hipMemsetAsync(args.e_ptr, 0, | ||
| args.M * args.N * sizeof(CDataType), stream.stream_id_); | ||
| }} else {{ | ||
| workspace_dev.SetZero(); | ||
| }} |
Adds a "Variant scope" section clarifying that the bridge is one shared, variant-aware driver (not per-variant driver copies), that only gemm_universal is wired and validated through the bridge on this PR, and that the gemm_multi_d/gemm_preshuffle/grouped_gemm configs/ dirs are scaffolding following the per-variant convention -- not yet working support. Notes that grouped GEMM and stream-K are separate bridge efforts (#8136 stream-K).
The Stream-K bridge (#8136) was branched at #8123's first commit, so it lacked all subsequent regular-GEMM bridge improvements (arch-validated tile filtering, the develop merge + legacy gemm_universal retirement, benchmark-param/--verify work on the shared driver, README). Merge the current #8123 HEAD to pick those up; the Stream-K-specific analogues that live in the duplicated driver/worker/ctypes lib are ported in follow-up commits. Sole conflict: dispatcher/python/gemm_utils.py variant threading. Kept the Stream-K routing (_ctypes_source_name -> streamk_gemm_ctypes_lib.cpp, .name _streamk suffix, variant through codegen_args/expand_sweep) and adopted #8123's explanatory comment.
The Stream-K bridge keeps its own driver, worker and ctypes lib, so the regular-GEMM bridge improvements that landed on #8123 after this branch forked did not arrive via the merge. Port the Stream-K-specific analogues:
- streamk_gemm_ctypes_lib.cpp: benchmark knobs now default to old-TE's warmup=50/repeat=100 (was 3/10 -- a cold, un-ramped clock, the root of #8123's spurious "perf gap") and are env-overridable via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Unlike the regular path, rotating_count defaults to 1: the Atomic preprocess re-zeros only the original C buffer, so rotating C would corrupt the accumulation.
- streamk_gemm_full_benchmark.py: fan the (kernel x problem) work across every visible GPU (device-pinned HIP_VISIBLE_DEVICES workers, --devices, device CSV column), add the --verify/--verify-tol fp32-reference gate, and constrain --dtype/--layout to the supported fp16/rcr surface. Also fixes a latent proc-unbound error in the batch handler.
- run_one_streamk_gemm_kernel.py: add the fp32 numpy reference check (global max|out-ref|/max|ref|, verified/max_rel) behind --verify.
- README: document the Stream-K bridge driver/worker, flags, _streamk name suffix, fp16 Atomic tolerance, and the rotating_count divergence.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
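The verification metric named above, global max|out-ref|/max|ref|, is easy to misread as a per-element relative error, so a small sketch may help. This is plain Python over flat sequences rather than the numpy code the worker uses; the function names and the zero-reference handling are assumptions, and the default tolerance is the 2.5e-3 fp16 Atomic figure quoted earlier in this PR.

```python
def max_rel_error(out, ref):
    """Global relative error: max|out - ref| divided by max|ref|."""
    if len(out) != len(ref):
        raise ValueError("out and ref must have the same length")
    max_abs_diff = max(abs(o - r) for o, r in zip(out, ref))
    max_abs_ref = max(abs(r) for r in ref)
    if max_abs_ref == 0.0:
        # Assumption: an all-zero reference only passes on an exact match.
        return 0.0 if max_abs_diff == 0.0 else float("inf")
    return max_abs_diff / max_abs_ref


def verify(out, ref, tol=2.5e-3):
    """Return (verified, max_rel) in the spirit of the worker's --verify output."""
    err = max_rel_error(out, ref)
    return err <= tol, err
```

Normalizing by the largest reference magnitude keeps near-zero output elements from dominating the result, which a per-element ratio would not.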
Two Copilot findings on the Stream-K codegen carried in from #8094:
- 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi) before narrowing to ck_tile::index_t; stoi throws std::out_of_range past INT_MAX, needlessly rejecting large GEMM sizes.
- unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's per-iteration C reset zeroed args.M*args.N as a flat contiguous block, which skips elements when C has a padded leading dimension and corrupts the accumulation. Zero the used MxN region honoring stride_E via hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status instead of discarding it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
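The padded-leading-dimension bug is easiest to see on a host-side model. The sketch below mimics a flat memset and a pitched 2D memset on a `bytearray` standing in for a row-major C buffer; it does not call HIP, and all function names are illustrative. For a column-major C the roles of M and N would swap.

```python
def zero_flat(buf: bytearray, m: int, n: int, elem_size: int) -> None:
    """Mimic a flat memset over m*n elements from the start of the buffer."""
    nbytes = m * n * elem_size
    buf[:nbytes] = bytes(nbytes)


def zero_2d(buf: bytearray, pitch: int, width: int, height: int) -> None:
    """Mimic a pitched 2D memset: zero `width` bytes in each of `height` rows."""
    for row in range(height):
        start = row * pitch
        buf[start:start + width] = bytes(width)


def used_region_is_zero(buf, m, n, stride, elem_size) -> bool:
    """True when every byte of the row-major M x N region is zero."""
    return all(
        buf[(i * stride + j) * elem_size + b] == 0
        for i in range(m)
        for j in range(n)
        for b in range(elem_size)
    )
```

With M=4, N=3, a row stride of 5 elements and 2-byte elements, the flat version clears only the first 24 bytes, so the later rows keep stale data. The pitched version with pitch = stride * elem_size, width = N * elem_size and height = M clears exactly the used region and leaves the padding untouched.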
Stream-K Bridge vs Old TE: MI300X parity (perf + validation)
Ran the requested comparison on ctr-cx64-mi300x-4 (AMD Instinct MI300X / gfx942, inside the enroot container, ROCm/HIP 7.13), using
Shapes: 1024³, 2048³, 4096³, 512×512×8192 (the bridge's default Stream-K problem set).
Method
Validation
Both paths correct on all 16 measurements. Bridge: 16/16
Performance (TFLOPS, Bridge vs Old TE)
Takeaway
Raw data:
…uction
Previously the stream-K codegen hard-coded the Atomic reduction strategy. This makes the reduction strategy a first-class config axis so linear and tree reductions can be generated, named, and selected alongside atomic:
- unified_gemm_codegen.py: add reduction_strategy to KernelConfig; encode it in key_name (redux_*) and KernelNaming.generate (atomic keeps the bare "_streamk" suffix for name parity, linear/tree are disambiguated); _launch_function_streamk now emits the config's StreamKReductionStrategy (the existing reset lambda already zeroes C for atomic vs the workspace for linear/tree); _get_configs_for_variant iterates strategies from a new streamk_config section, which is added to the default config (atomic, linear, tree).
- gemm_utils.py: GemmKernelConfig gains reduction_strategy, threaded into .name, to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly the requested strategy) and expand_sweep (reduction-strategy sweep axis).
- ctypes_utils.py: reduction_strategy field on KernelConfig for end-to-end parity.
arch_specs.json intentionally unchanged: stream-K reuses the standard warp-tile combos and arch_filter reads no stream-K-specific keys, so adding them would be dead data.
Validated on gfx942 (MI300X): atomic/linear/tree each codegen + compile + run and pass fp32 verification (max_rel 3.1e-4) at 256x256x4096.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stream-K reduction strategy is now a codegen axis (
| strategy | kernel | verified | max_rel |
|---|---|---|---|
| atomic | ..._streamk | yes | 3.1e-4 |
| linear | ..._streamk_linear | yes | 3.1e-4 |
| tree | ..._streamk_tree | yes | 3.1e-4 |
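The suffix scheme in the table can be written down as a one-line mapping. The three suffixes are taken from the PR; the helper name `streamk_kernel_name` and the error handling are hypothetical and are not the body of KernelNaming.generate.

```python
# Suffixes quoted from the PR: atomic keeps the bare "_streamk" for name parity.
_STREAMK_SUFFIX = {
    "atomic": "_streamk",
    "linear": "_streamk_linear",
    "tree": "_streamk_tree",
}


def streamk_kernel_name(base_name: str, reduction_strategy: str) -> str:
    """Append the Stream-K suffix for a strategy to an already-built kernel name."""
    try:
        return base_name + _STREAMK_SUFFIX[reduction_strategy]
    except KeyError:
        raise ValueError(f"unknown reduction strategy: {reduction_strategy!r}") from None
```

Keeping atomic on the bare suffix means names generated before the strategy axis existed stay valid, while linear and tree get distinct identifiers.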
Scope
This stays within the bridge architecture. It does not add the deep dispatcher-core integration (KernelKey / KernelInstance / Problem / Dispatcher + a generic Stream-K backend + ML selectability); that remains a separate effort.
…rategy fields
First slice of moving Stream-K into the dispatcher core (registry-addressable),
per the deep-core checklist. Additive and inert by default:
- KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm
gains streamk / reduction_strategy / workspace. tie() includes them so the
three strategies are distinct keys. encode_identifier() appends the Stream-K
suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with
unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk
so non-Stream-K identifiers are unchanged.
- Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k().
Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical;
atomic/linear/tree suffixes correct; tie() distinguishes strategies.
Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile unchanged: - get_workspace_size(Problem) -> bytes (default 0) - run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default forwards to the existing no-workspace run(). The Dispatcher invokes these through a base KernelInstance* pointer (so the new overload is visible despite derived 6-arg run() overrides). The Stream-K backend (PR-C) overrides both to size and bind the reduction workspace. Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the pre-existing pure virtuals compiles; default get_workspace_size==0 and the workspace-run forwards correctly via base pointer.
…tree codegen
Adds the C++ backend that lets Stream-K ride the registry, plus the reduction strategy codegen needed to generate the three variants on this branch.
- generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the ABI-incompatible args the GemmHostArgs path could not). supports() gates on Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the registry and the Dispatcher's first-fit selection picks the requested one. create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>.
- codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field, key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree" (matches KernelKey::encode_identifier from PR-A), per-strategy StreamKReductionStrategy in the generated launch, and a streamk_config sweep axis. (Ported from the bridge branch reduction-strategy work.)
PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those to Dispatcher::run() via get_workspace_size()/the workspace-aware run().
Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584 tree headers with correct names; the backend device-compiles (22s) against a generated header and supports() accepts the matching strategy while rejecting the others and non-Stream-K problems.
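A toy model of the first-fit selection may make the supports() gate concrete. The classes below are Python stand-ins invented for this sketch; only the gating rule (match on the Stream-K flag and the reduction strategy, reject non-Stream-K problems) is taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Problem:
    streamk: bool = False
    reduction_strategy: Optional[str] = None


@dataclass(frozen=True)
class StreamKKernel:
    name: str
    reduction_strategy: str

    def supports(self, problem: Problem) -> bool:
        # Gate on both fields so the three strategies can coexist in one registry.
        return problem.streamk and problem.reduction_strategy == self.reduction_strategy


def select_first_fit(registry, problem):
    """Return the first registered kernel whose supports() accepts the problem."""
    for kernel in registry:
        if kernel.supports(problem):
            return kernel
    return None
```

With atomic, linear and tree kernels registered in that order, a problem asking for "tree" skips the first two and selects the third, independent of registration order.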
…pace
Relocate the Stream-K reduction-workspace buffer from the per-call generated launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived dispatcher stops paying a hipMalloc/hipFree on every invocation.
- codegen: hoist the StreamKGemmKernel type to struct scope and add GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace) overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the bridge ctypes lib and the standalone 03 driver keep working.
- backend: override get_workspace_size() and the workspace-aware run(); the no-workspace run() delegates with a null buffer. The per-iteration reset stays in the backend (it needs CDataType + the reduction strategy).
- dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out of the public header), size it via get_workspace_size(), and pass it through run_fused()/run_explicit(); free it in the destructor. Atomic needs none (size 0 -> null -> internal path); linear/tree consume the owned buffer.
Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at unchanged perf, with linear/tree now running on the dispatcher-owned workspace.
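The grow-on-demand policy can be sketched independently of HIP. In this model a `bytearray` stands in for device memory and the class name and `allocations` counter are hypothetical; the behavior mirrored from the commit message is that size 0 yields no buffer and that the buffer is only reallocated when a request exceeds the current capacity.

```python
class GrowOnDemandWorkspace:
    """Host-side model of a dispatcher-owned workspace that only ever grows."""

    def __init__(self, allocate=bytearray):
        self._allocate = allocate  # stand-in for a device allocation call
        self._buffer = None
        self._capacity = 0
        self.allocations = 0  # illustration only: counts real allocations

    def acquire(self, size_bytes: int):
        """Return a buffer of at least size_bytes, or None when none is needed."""
        if size_bytes == 0:
            return None  # e.g. atomic reduction: no workspace required
        if size_bytes > self._capacity:
            self._buffer = self._allocate(size_bytes)
            self._capacity = size_bytes
            self.allocations += 1
        return self._buffer
```

Repeated calls with the same or smaller sizes reuse the existing buffer, which is where the saving over a per-call allocate/free pair comes from.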
…river
Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path
(Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on
reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch ->
verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls
SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the
registry selection and the Dispatcher-owned workspace.
Selectable strategy via --strategy {atomic,linear,tree}. Validated on
MI210/gfx90a for all three (distinct registry identifiers, each PASS).
…K backend The dispatcher-wrapper generator emitted ONE template for every variant: backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy on the key. For Stream-K that is wrong twice over -- the regular backend calls launch(GemmHostArgs,...) which the SK kernel struct does not have (so the aggregate register_all_kernels.hpp would not compile against SK), and the key omits the SK fields so encode_identifier() emits no _streamk suffix and atomic/linear/tree collide in the registry. Make the wrapper variant-aware: for STREAM_K configs include generated_tile_backend_streamk.hpp, set key.algorithm.streamk + reduction_strategy + workspace (and pad flags for identifier parity), and return create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>. All other variants are unchanged. Validated on MI210/gfx90a: a registry populated via the generated wrappers holds atomic+linear+tree side by side; Dispatcher::run() selects each by Problem::reduction_strategy and all three verify vs reference_gemm.
…are atomic reset P2: GeneratedStreamKKernelInstance::supports() now ends with SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs is rejected during selection, so first-fit falls back to a non-Stream-K kernel instead of throwing std::runtime_error at launch. P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync (pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage for the contiguous rcr case. Applied to both the internal and external-workspace launch overloads. Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a multi-kernel registry; valid small problems are accepted (no false-negatives).
The bridge dispatcher's tile-divisibility gate rejected any problem where
M % TileM != 0 for every layout, returning status -2 ("No suitable kernel")
at runtime even though the .so built fine. This wrongly excluded bf16 rcr/rrr
kernels with a non-power-of-two TileM (e.g. 192) on standard shapes like
1024^3 -- cases Old-TE compiles, runs, and verifies as correct.
Root cause: supports() was layout-blind, while the underlying
ck_tile::GemmKernel::IsSupportedArgument only constrains a dimension when an
operand whose inner axis is that dimension participates without padding:
RowMajor A -> K, ColMajor A -> M
RowMajor B -> N, ColMajor B -> K
RowMajor C -> N, ColMajor C -> M
So for rcr (RowMajor A & C) M is never gated, which is why Old-TE runs M=192
tiles on M-indivisible problems.
Make supports() compute require_m/n/k from the kernel key's A/B/C layouts so
it mirrors IsSupportedArgument exactly (also honoring k_batch in the K grain).
Anything it now lets through is still validated by the kernel's own
IsSupportedArgument inside launch(), so the bridge stays a strict functional
equivalent of Old-TE. Applied to both generated_tile_backend.hpp (the GEMM
.so path) and the sibling tile_backend.hpp.
Validated on gfx942 (MI300X): 85 previously status-2 rcr/rrr bf16 192-tile
.so now run at 1024^3 (Old-TE runs the same, verification correct); the 8
remaining rejects are tile N=192 cases that Old-TE also reports "Arguments
not supported" at N=1024 -- parity preserved in both directions.
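The layout table above translates directly into a small predicate. This is a sketch of the rule as described, not the generated supports(): it ignores padding flags (the gate only applies to unpadded operands), and folding k_batch into the K grain as TileK * k_batch is an assumption about what "honoring k_batch" means.

```python
# Inner axis per operand and layout, copied from the table in the commit message.
_INNER_AXIS = {
    ("a", "r"): "K", ("a", "c"): "M",
    ("b", "r"): "N", ("b", "c"): "K",
    ("c", "r"): "N", ("c", "c"): "M",
}


def required_dims(layout: str) -> set:
    """Dimensions that must be tile-divisible for a 3-letter layout such as 'rcr'."""
    a, b, c = layout.lower()
    return {_INNER_AXIS[("a", a)], _INNER_AXIS[("b", b)], _INNER_AXIS[("c", c)]}


def supports(layout, problem, tile, k_batch=1) -> bool:
    """Check only the dimensions the layout actually gates."""
    m, n, k = problem
    tile_m, tile_n, tile_k = tile
    req = required_dims(layout)
    if "M" in req and m % tile_m != 0:
        return False
    if "N" in req and n % tile_n != 0:
        return False
    if "K" in req and k % (tile_k * k_batch) != 0:  # assumed K grain
        return False
    return True
```

For rcr the required set is {K, N}, so a TileM of 192 passes at 1024^3 while a TileN of 192 is still rejected, matching the validation results quoted above.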
…oding rcr dispatcher_initialize() in gemm_ctypes_lib.cpp hardcoded the KernelKey layout to rcr (RowMajor/ColMajor/RowMajor) for every kernel. Now that supports() is layout-aware, that wrong key layout makes the dispatcher reject valid problems: a crr kernel does not gate K (neither A=ColMajor nor B=RowMajor has K as its inner axis), but with a hardcoded rcr key supports() applies rcr's K-gate and returns status -2 for TileK=192 problems (e.g. crr 64x64x192 at 1024^3) that Old-TE compiles, runs, and verifies (~87 TFLOPS). Derive signature.layout_a/b/c from the force-included kernel's own ALayout/BLayout/CLayout types via std::is_same_v with tensor_layout::gemm::RowMajor. The key now matches the kernel, so the layout-aware gate is correct for all four layouts. Execution was already layout-correct (the kernel uses its own compile-time layouts); only the host-side selection metadata was wrong. Validated on gfx942 (MI300X): crr 64x64x192 now runs on the bridge (93 TFLOPS), restoring parity with Old-TE.
For a full ~2000-stem sweep on a single GPU: batch all shapes into one worker call per side (5x fewer process startups), cache the compiled old-TE .so, and add a parallel --build-only pre-pass so hipcc compilation uses all CPU cores while GPU measurement stays serial.
…eductionStrategy) Close two review nits on the Stream-K drivers: - Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so large GEMM dimensions no longer overflow/throw int range (Copilot nit). - Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04 driver through it, removing the driver-local strategy_name() duplicate so callers share one spelling that matches the codegen suffix scheme.
Adds dispatcher_test_streamk_registry, a GPU test that generates the three reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the 04 registry driver once per strategy (each force-including its own header, since SkReductionStrategy is a compile-time constexpr), and asserts for each that the encode_identifier() suffix matches, the Dispatcher selects it by Problem::reduction_strategy, and the result verifies against the reference. This converts the previously manual deep-core validation into a regression- guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only CI is unaffected.
…IBRARY_PATH
meas()/meas_all() built the worker env without /opt/rocm/lib on
LD_LIBRARY_PATH, so run_one_gemm_kernel.py failed to load every .so
("libamdhip64.so.7: cannot open shared object file") and every cell
came back nan. Set it the same way ab_efficient_sweep.py does.
… guard) The bridge-vs-old-TE A/B reported phantom regressions from two MEASUREMENT bugs, not real codegen gaps: - ab_same_harness.py built the old-TE side WITHOUT the TE codegen flags the bridge (and real old-TE's own CMake) use, so -enable-post-misched defaulted back on and old-TE ran ~10-40% faster -> the bridge looked regressed when it is at parity. Now both sides build with identical flags. - ab_efficient_sweep.py measured whatever libgemm_<stem>.so existed with no freshness check, so 3-day-old binaries built from an obsolete codegen showed up as -78%/+703% gaps. Added a guard: skip any .so older than its generated header (treated as missing) instead of reporting a phantom gap. With both fixes the 41 former >15% outlier stems measure within +/-10% (median +0.01%); no bridge codegen regression exists. Note: a separate, deliberately UNCOMMITTED perf change in gemm_utils.py (gate -enable-post-misched=0 on persistent) gives non-persistent large tiles ~9-40%; held back pending a broader persistent-kernel no-regression sweep.
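The freshness guard amounts to a modification-time comparison. The sketch below is one way to express it; the function name and the choice to treat a missing header as "not fresh" are assumptions, since the commit message only specifies that a .so older than its generated header is skipped.

```python
import os


def is_fresh(so_path: str, header_path: str) -> bool:
    """True when the .so exists and is at least as new as its generated header."""
    if not os.path.exists(so_path) or not os.path.exists(header_path):
        # Assumption: without both files there is nothing trustworthy to measure.
        return False
    return os.path.getmtime(so_path) >= os.path.getmtime(header_path)
```

A sweep would call this before measuring each stem and treat a False result the same as a missing binary, rather than reporting a gap against stale code.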
… driver The standalone stream-K driver verified atomic results with the single-pass GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split partials directly into low-precision C (workspace size 0), incurring rounding error that grows with the split factor -- correct results were flagged FAIL on small-M/N, large-K shapes (e.g. 512x512x8192) where tiles < CUs. Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with the split-K CDataType accumulation term, and take the max with the per-split tolerance. The driver and tile_engine now verify identically; the kernel is unchanged.
…gine
The standalone stream-K driver built its stream_config as {stream, true, 0,
warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults
(flush_cache=false, rotating_count=1). The tile_engine benchmark instead times
with flush_cache=true and rotating_count=1000, so the driver measured a
warm-cache best case while tile_engine measured cold-cache -- the entire source
of the reported dispatcher-vs-TE "performance gap" at low tile counts.
Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine
values) and pass them through stream_config so both sides use identical timing
methodology. A validating run still times a single cold shot, mirroring
tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0
pass.
The 04 registry driver hardcoded the KernelKey signature to DataType::FP16 and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and layout tags from the generated kernel's actual A/B/C types via compile-time dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate in fp32 and write fp16 C, matching Tile Engine). Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent core objects built once; per-dtype codegen + build + verify with per-dtype identifier assertions). All four datatypes register, dispatch, and verify across atomic/linear/tree on gfx942 (MI300X).
Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry + validation driver). 3-way merge over the shared stream_k ancestor; only the streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp conflicted -- both resolved to the deep-core side: - codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize + IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib calls, and adds the 3-arg dispatcher-owned-workspace launch. - driver takes deep-core's stoll parse + apple-to-apple timing + validate cold shot. Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly, so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree + regular gemm all generate cleanly (0 failed).
Build the Stream-K bridge .so without the dispatcher static lib and with TE-streamk-matching flags: - compile flags per-variant: Stream-K matches tile_engine/ops/gemm_streamk CMake (only -Wno-* + --offload-compress, NO -mllvm codegen flags and NOT -enable-noalias-to-md-conversion=0 which is a gemm_universal-bridge concern). This keeps the A/B fair; the regular path is unchanged. - link skips libck_tile_dispatcher.a for Stream-K (the ctypes lib launches the force-included kernel directly, no registry/dispatcher symbols), and the build guard no longer requires the static lib for Stream-K. - ensure build/examples exists before hipcc writes there (the cmake build that normally creates it is skipped on the Stream-K path). Validated on MI300X (gfx942): atomic/linear/tree fp16/rcr all build, run, and verify PASS (max_rel_err <8e-4) through the bridge GpuGemmRunner path.
Correct the Stream-K compile flags to match Tile Engine's gemm_streamk build verbatim (ground truth: a TE streamk build's compile_commands.json). The -mllvm codegen flags come from the composablekernel project-root add_compile_options applied globally to the TE benchmark -- they are NOT in the per-target options, so the earlier "minimal flags" assumption was wrong and would have produced a phantom A/B gap (different occupancy). Flags now: -std=c++20 -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --offload-compress -enable-post-misched=0 is applied unconditionally (TE does so for streamk), and -enable-noalias-to-md-conversion=0 is not used (TE streamk omits it).
Update: ported the Stream-K bridge onto the deep-core #8094 engine + TE parity
Reworked this PR so the Stream-K TE→Dispatcher bridge runs on the rewritten #8094 "deep-core" Stream-K engine instead of the original codegen.
What changed
Validation (fresh builds both sides, same node compiler, warmup50/repeat100, flush off, rotating 1, gpu timer, median-of-3 interleaved; verify in a separate pass):
Add a bf16 codec to the bridge runner so bf16 Stream-K kernels can run through the same ctypes path (the ABI is void*+sizeof, so 2-byte bf16 shares the fp16 path; only the bit pattern differs). Dtype is inferred from the kernel name. ENCODE is round-to-nearest-even to bf16 bits; DECODE is bit-exact to device bf16_t so the numpy reference multiplies the same values the GPU does.
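A scalar version of such a codec fits in a few lines of standard-library Python. This is a sketch of the usual float32-to-bf16 round-to-nearest-even conversion, not the runner's implementation; the function names and the quiet-NaN handling are assumptions.

```python
import struct


def bf16_encode(x: float) -> int:
    """Round a float32-representable value to bf16 bits (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF  # NaN: keep it a quiet NaN
    # Add 0x7FFF plus the lowest kept bit so exact ties round to the even value.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_decode(bits16: int) -> float:
    """Widen bf16 bits to float32 exactly by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]
```

Building the reference from `bf16_decode(bf16_encode(x))` rather than from `x` is what lets a host reference multiply the same values the GPU sees.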
…ing)
Make the Stream-K bridge layout-generic instead of rcr-hardcoded, so all 4 A/B/C layouts (rcr/rrr/ccr/crr) work end to end:
- streamk_gemm_ctypes_lib.cpp: derive stride_A/B/C at compile time from the kernel's ALayout/BLayout/CLayout (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) instead of the hardcoded K/K/N.
- generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
- GpuGemmRunner: read dtype AND layout off the kernel name; arrange each operand per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is now memory-order-preserving so column-major operands stay column-major.
- run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
- streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS now rcr/rrr/ccr/crr, SUPPORTED_DTYPES fp16+bf16 (fp8/bf8/int8 still need runner codecs).
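The stride rule (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) can be checked with a short sketch. The PR derives this at compile time in C++; the Python below only illustrates the arithmetic, with A taken as MxK, B as KxN and C as MxN, and hypothetical helper names.

```python
def leading_dim(rows: int, cols: int, layout: str) -> int:
    """Leading dimension of a rows x cols matrix: cols if row-major, rows if column-major."""
    if layout == "r":
        return cols
    if layout == "c":
        return rows
    raise ValueError(f"unknown layout tag: {layout!r}")


def gemm_strides(m: int, n: int, k: int, layout: str):
    """Return (stride_A, stride_B, stride_C) for a 3-letter layout such as 'rcr'."""
    la, lb, lc = layout.lower()
    return (
        leading_dim(m, k, la),  # A is M x K
        leading_dim(k, n, lb),  # B is K x N
        leading_dim(m, n, lc),  # C is M x N
    )
```

For rcr this reproduces the previously hardcoded K/K/N, which is why the old code happened to work for that one layout.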
Update: removed rcr hardcoding, bridge is now layout-generic (rcr/rrr/ccr/crr) + bf16
Following review, nothing layout- or dtype-specific is hardcoded anymore; strides are derived from the kernel's actual layouts everywhere.
Validated on GPU (verify on):
Remaining (tracked): fp8/bf8/int8 need runner codecs (codegen + Old-TE already support them); the ctypes bridge still runs one kernel per
Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.
GpuGemmRunner (dispatcher/python/gemm_utils.py):
- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887): bit-exact decode tables + nearest-representable/saturating encode, carried as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store, int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350 uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).
Reference + surface:
- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.
int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile for every reduction strategy -- warp_gemm_dispatcher has no Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3 path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the PR #8094 decision to leave int8 out.
GPU-validated on gfx942 (MI300X), 2048^3, both reduction + layout variants:
fp8 atomic/linear/tree rcr: PASS (192/180/183 TFLOPS, max_rel <= 9.4e-4)
bf8 atomic/linear/tree rcr: PASS (192/181/181 TFLOPS, max_rel <= 7.8e-4)
fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)
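For readers unfamiliar with the FNUZ encodings, a generic decoder shows how the bit patterns map to values. This is a sketch of the format as commonly defined (exponent bias 8 for E4M3 and 16 for E5M2, no infinities, no negative zero, 0x80 reserved for NaN), not the ported codec from PR #8887; the function name is hypothetical.

```python
import math


def fnuz_decode(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FNUZ 8-bit float: (4, 3) for fp8 E4M3, (5, 2) for bf8 E5M2."""
    if byte == 0x80:
        return math.nan  # the would-be negative zero encodes NaN
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = 1 << (exp_bits - 1)  # 8 for E4M3, 16 for E5M2
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

A 256-entry decode table follows from `[fnuz_decode(b, 4, 3) for b in range(256)]`. The OCP fp8 formats use a different bias and reserve different special values, which is why decoding with the wrong table yields silently wrong numbers and the arch guard is needed.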
Update: full Old-TE dtype coverage, fp8/bf8 added (commit b6bea8a)
The Stream-K bridge runner now covers the exact runnable dtype set Old-TE Stream-K supports. Added fp8 (E4M3 FNUZ) and bf8 (E5M2 FNUZ) to GPU-validated (gfx942/MI300X, 2048³, reference = decode(encode(A)) @ decode(encode(B))):
Exact-equivalence check vs Old-TE:
Arch guard: the fp8/bf8 codec is FNUZ (gfx942). gfx950/MI350 uses OCP fp8 (different bit layout); the runner detects arch and raises a clear error for fp8/bf8 on non-gfx942 rather than silently producing wrong results.
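The guard described above could look roughly like the following. This is a sketch under assumptions: the function name check_fp8_arch is hypothetical, and the architecture string is assumed to arrive in the form HIP reports it, possibly with feature suffixes such as ":sramecc+:xnack-".

```python
FNUZ_ARCHS = {"gfx942"}      # architectures whose fp8/bf8 use the FNUZ bit layout
FP8_DTYPES = {"fp8", "bf8"}  # dtypes that depend on the FNUZ codec


def check_fp8_arch(dtype, arch):
    """Raise instead of silently mis-decoding fp8/bf8 on a non-FNUZ GPU."""
    base = arch.split(":")[0]  # drop feature suffixes after the first colon
    if dtype in FP8_DTYPES and base not in FNUZ_ARCHS:
        raise RuntimeError(
            f"{dtype} uses the FNUZ codec (gfx942 only); detected {base}"
        )


check_fp8_arch("fp16", "gfx950")                  # fine: fp16 is not guarded
check_fp8_arch("fp8", "gfx942:sramecc+:xnack-")   # fine: FNUZ architecture
```

Failing early keeps a wrong bit layout from showing up later as a numerically plausible but incorrect result.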
Status summary: Stream-K bridge ready for review
Implementation complete and on this branch:
Verification (verify on): fp16 all 4 layouts PASS (rel 8.2e-4); bf16 + col-major PASS; fp8/bf8 × atomic/linear/tree × row+col-major PASS (rel <1e-3, gfx942). TE↔bridge parity (fair, fresh both sides, median-of-3 interleaved):
Caveat: fp8/bf8 codec is gfx942 FNUZ; the runner raises a clear error on gfx950 (OCP) pending separate handling.
Clarification: fp8/bf8 Stream-K on gfx950 (MI350) is blocked by a shared upstream ck_tile bug, not the bridge
For completeness on why MI350 dtype coverage is scoped to fp16+bf16: the cause is a pre-existing, shared ck_tile engine bug on gfx950. fp8/bf8 Stream-K on gfx950 produces output ≈ 4× the reference (e.g. 2792 vs 698). Verified facts:
Bottom line: even with an OCP fp8 codec (gfx950 uses OCP vs gfx942 FNUZ), the gfx950 fp8/bf8 Stream-K kernels themselves are broken, and Old-TE is equally broken, so there is nothing correct to be "at parity" with. fp8/bf8 parity is therefore only meaningful on gfx942/MI300X, where both engines work and the bridge is at parity. On MI350 the valid dtypes are fp16 + bf16 (both swept). That's why MI350 is scoped to fp16/bf16, and gfx950 fp8/bf8 is flagged as a separate upstream ck_tile issue rather than a bridge gap. (The bridge runner guards this: fp8/bf8 raise a clear error on any non-gfx942 arch instead of silently producing wrong results.)
Parity analysis (interim, full canonical-config sweeps)
Running the full TE Stream-K config matrix (every bridge-generatable tile/pipeline/persistent × atomic/linear/tree × 4 shapes) per arch/dtype, fair flags (warmup50/repeat100, flush off, rotating 1, gpu timer, median-of-3 interleaved per case, both sides built fresh on the same node compiler), with standalone auto-recheck of any |gap|>15%. Aggregate so far (gap% = (bridge−TE)/TE; aggregates over measured cases):
Analysis:
Still running (final tables to follow): MI300X bf16/fp8/bf8 (gfx942) and MI350 fp16 (gfx950), in parallel across allocations. Raw per-row results (~5K) are kept in CSVs; only this analysis is posted here. Bottom line so far: the Stream-K bridge is at parity with Tile-Engine across the swept configs; deviations are measurement artifacts on tiny kernels, not real gaps.
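The gap metric and the recheck rule from this analysis can be written down in a few lines. The sketch assumes the compared quantity is per-case kernel time in milliseconds, which the comment does not state explicitly, and that each side contributes three interleaved runs reduced by their median. The function names are hypothetical.

```python
import statistics


def gap_percent(bridge, te):
    """gap% = (bridge - TE) / TE, expressed in percent."""
    return 100.0 * (bridge - te) / te


def case_gap(bridge_runs, te_runs, recheck_threshold=15.0):
    """Median-of-N per side, then the gap and whether it needs a recheck."""
    bridge = statistics.median(bridge_runs)
    te = statistics.median(te_runs)
    gap = gap_percent(bridge, te)
    return gap, abs(gap) > recheck_threshold


# Illustrative numbers only (assumed to be milliseconds).
gap, needs_recheck = case_gap([0.250, 0.242, 0.240], [0.240, 0.240, 0.241])
print(f"{gap:+.2f}% recheck={needs_recheck}")
```

Taking the median per side before computing the gap keeps a single outlier run from triggering a recheck on its own.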
Result CSV files (raw per-row data)
All raw parity results live on the build host under Master (all archs+dtypes, with
Per arch / dtype:
Columns: (The Confluence page carries the analysis/summary only; these CSVs hold the ~5K raw rows.)
Summary
Routes the stream_k GEMM variant through the same Tile Engine (TE) →
Dispatcher bridge already landed for regular GEMM (Phase 1, #8123) and grouped
GEMM (Phase 3, #8130). Goal of the overall effort: the Dispatcher is the single
source of truth for codegen/build/runtime, and TE only produces configs +
benchmarks.
This PR is stacked on muozturk/dispatcher-gemm-bridge (#8123); please merge that first. Its own diff is just two commits:
- [CK_TILE] Add stream_k variant to GEMM Dispatcher codegen (cherry-picked)
- [CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM (this work)
What Stream-K needs that regular GEMM doesn't
Stream-K is a single-problem GEMM (one A/B/C, one M/N/K) with the same C ABI as regular GEMM, so the Python side (GpuGemmRunner / GemmDispatcherLib / GemmProblem) and the GPU worker are reused unchanged. The differences are internal to the .so:
- The launch entry point is SelectedKernel::launch(const ck_tile::StreamKHostArgs&, const stream_config&), which allocates the reduction workspace internally (DeviceMem) and uses the Atomic reduction strategy.
- generated_tile_backend.hpp::run() hard-codes the single-problem GemmHostArgs launch and won't compile against a Stream-K SelectedKernel. So the Stream-K ctypes lib bypasses the registry and calls SelectedKernel::launch(args, stream) directly, reporting the name from the KERNEL_NAME macro (same approach grouped uses).
Changes
New
- dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp: same single-problem C ABI (dispatcher_run_gemm(A,B,C,M,N,K,time_ms)); hipMalloc + copy A/B, memset C=0 (Atomic accumulates into C), build StreamKHostArgs with rcr strides (stride_A=K, stride_B=K, stride_C=N, k_batch=1), launch, copy C back. Returns 0 / -1 (HIP or throw) / -2 (kernel reports args unsupported).
- tile_engine/ops/gemm/streamk_gemm_full_benchmark.py: 3-phase driver (expand configs → setup_multiple_gemm_dispatchers build → subprocess-isolated benchmark), mirroring gemm_full_benchmark.py with variant="stream_k".
- tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py: disposable GPU worker (identical to the regular worker since the ABI matches).
- tile_engine/ops/gemm/gemm_streamk/configs/default_config.json: small sweep config (128x128x{32,64}, 2x2x1, 32x32x16, compv3/compv4, intrawave, cshuffle, pad true, persistent false) → 4 kernels.
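The "→ 4 kernels" count of the default sweep follows from a plain Cartesian product over the swept fields. The sketch below shows that arithmetic only; the dictionary keys are hypothetical and do not reflect the real JSON schema of default_config.json or the behaviour of expand_sweep.

```python
from itertools import product

# Hypothetical field names; values are the ones listed for the default sweep.
sweep = {
    "tile": ["128x128x32", "128x128x64"],
    "warps": ["2x2x1"],
    "warp_tile": ["32x32x16"],
    "pipeline": ["compv3", "compv4"],
    "scheduler": ["intrawave"],
    "epilogue": ["cshuffle"],
    "pad": [True],
    "persistent": [False],
}


def expand(sweep):
    """One dict per combination of the swept values."""
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]


configs = expand(sweep)
print(len(configs))  # 2 tile-K choices x 2 pipelines = 4
```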
Modified
- dispatcher/python/gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for variant=="stream_k" (in both _build_compile_jobs and setup_multiple_gemm_dispatchers); .name appends _streamk; variant threaded into codegen_args and expand_sweep.
- dispatcher/python/ctypes_utils.py: pass the requested variant to codegen --variants instead of hard-coding "standard".
Validation (gfx942 / MI300X, fp16 / rcr)
Numeric parity vs a numpy fp32 reference (A.f32 @ B.f32). Stream-K's Atomic reduction does multiple fp16 atomic-adds (one per K-split partial) vs regular/grouped's single fp32→fp16 store, so it is inherently noisier; tolerance is widened to max_rel ≤ 2.5e-3, frob_rel ≤ 1.5e-3 (regular/grouped use 5e-4).
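The two tolerances can be checked with a short helper. This is a sketch under assumptions: max_rel is taken as the global max|out-ref| / max|ref| (the form used by the --verify gate described later in this PR), frob_rel is assumed to be the ratio of Frobenius norms ||out-ref|| / ||ref||, and the matrices are passed as flattened lists so the example needs no numpy. Function names are hypothetical.

```python
import math


def max_rel(out, ref):
    """Global max|out - ref| divided by max|ref|."""
    return max(abs(o - r) for o, r in zip(out, ref)) / max(abs(r) for r in ref)


def frob_rel(out, ref):
    """Assumed definition: ||out - ref||_F / ||ref||_F."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


def within_tolerance(out, ref, max_rel_tol=2.5e-3, frob_rel_tol=1.5e-3):
    """Stream-K Atomic tolerances quoted above; both must hold."""
    return max_rel(out, ref) <= max_rel_tol and frob_rel(out, ref) <= frob_rel_tol


ref = [1.0, -2.0, 4.0]
print(within_tolerance([1.0, -2.0, 4.004], ref))  # small error on one element
print(within_tolerance([1.0, -2.0, 4.1], ref))    # large error on one element
```

Normalising by the global maximum rather than per element avoids blowing up the metric on near-zero reference entries.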
Full TE driver run (4 kernels x 4 problems = 16/16 OK, 0 failures),
default problem set uses Stream-K's sweet spot (squares + a large-K skinny shape):
All status 0, positive TFLOPS, nonzero output. Name parity holds end-to-end:
the runtime name reported by each .so equals GemmKernelConfig(variant="stream_k").name, ending in _streamk.
Unsupported-shape handling: a tiny 257^3 problem is correctly reported as unsupported by the kernel (status -2, too few tiles to partition across CUs) and surfaced gracefully by the bridge, not a crash.
Test plan
- unified_gemm_codegen.py ... --variants stream_k emits a *_streamk.hpp whose stem == GemmKernelConfig(variant="stream_k").name
- setup_multiple_gemm_dispatchers builds the Stream-K config set → .so compiles & links against streamk_gemm_ctypes_lib.cpp
Next
Land #8123, then this; afterwards delete the legacy tile_engine/ops/gemm_streamk/ machinery (Phase 4).
Update 2026-06-12: brought current with #8123 + Copilot fixes
This branch had forked at #8123's first commit, so it lacked every later
regular-GEMM bridge improvement. Merged the current #8123 HEAD and ported the
Stream-K-specific analogues (the Stream-K bridge keeps its own driver, worker and
ctypes lib, so those fixes do not arrive via the merge):
- streamk_gemm_ctypes_lib.cpp: benchmark knobs defaulted to warmup=3/repeat=10, which measures on a cold, un-ramped clock and was the root cause of the regular bridge's spurious "perf gap." They now default to old-TE's warmup=50/repeat=100, env-overridable via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. rotating_count stays 1 for Stream-K: the Atomic preprocess re-zeros only the original C buffer, so rotating C would leave rotated copies un-zeroed and corrupt the accumulation.
- --verify correctness gate (driver + worker): opt-in fp32 numpy reference check (global max|out-ref|/max|ref|, verified/max_rel in the CSV); a mismatch counts as a failure.
- visible GPUs via device-pinned HIP_VISIBLE_DEVICES workers (--devices, device CSV column); also fixes a latent proc-unbound error in the batch handler.
- --dtype/--layout guards (driver): constrained to the supported fp16/rcr surface so a mismatch fails fast.
- std::stoi → std::stoll for M/N/K in 03_streamk_gemm_driver.cpp; stride-aware C zeroing via hipMemset2DAsync (CLayout-aware, checked HIP status) in _launch_function_streamk.
Validation status: DONE on gfx942/MI300X (ctr-cx64-mi300x-4, enroot container). Bridge build+benchmark+--verify on gemm_streamk/configs/default_config.json = 16/16 OK, all verified (max_rel ≤ 3.0e-3, fp16 atomic tol), name parity holds. Bridge-vs-Old-TE parity (perf + correctness, byte-identical device kernel) posted as a comment with the full table + streamk_bridge_oldTE.csv.
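For illustration, the benchmark-knob behaviour from the update list above (defaults of warmup=50/repeat=100, environment overrides, rotating count held at 1 for Stream-K) could be mirrored in Python as below. The actual logic lives in the C++ ctypes lib; this sketch assumes the shorthand CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING expands to four variables sharing the CK_TILE_BENCH_ prefix, assumes flush defaults to off, and assumes the rotating count is pinned regardless of the environment. The function names are hypothetical.

```python
import os


def read_bench_knobs(env=None):
    """Read benchmark knobs, falling back to the old-TE style defaults."""
    env = os.environ if env is None else env

    def knob(name, default):
        raw = env.get(f"CK_TILE_BENCH_{name}")  # assumed variable naming
        return default if raw in (None, "") else int(raw)

    return {
        "warmup": knob("WARMUP", 50),
        "repeat": knob("REPEAT", 100),
        "flush": bool(knob("FLUSH", 0)),  # assumed default: off
        "rotating": knob("ROTATING", 1),
    }


def streamk_knobs(env=None):
    """Stream-K variant: rotating stays 1 because only the original C is re-zeroed."""
    knobs = read_bench_knobs(env)
    knobs["rotating"] = 1  # assumption: pinned even if the env asks otherwise
    return knobs


print(streamk_knobs({"CK_TILE_BENCH_WARMUP": "5", "CK_TILE_BENCH_ROTATING": "4"}))
```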