[CK Tile] Prepare mixed batch-prefill FP8 KV contract #3745

Draft
ysmkone wants to merge 2 commits into ROCm:develop from ysmkone:issue-3744-batch-prefill-mixed-kv-diagnostic

Conversation

ysmkone commented Jun 8, 2026

Summary

  • Replaces the thin one-off diagnostic with a reusable FMHA dtype-contract classifier shared by the Python dispatcher runner and AOT codegen validation.
  • Classifies the requested serving contract, BF16/FP16 Q + FP8 K/V + BF16/FP16 O, as an explicit unsupported mixed_q_fp8_kv contract instead of letting it look like a normal BF16/FP16 batch-prefill request.
  • Preserves the existing all-FP8 Q/K/V fp8bf16 batch-prefill path and adjacent all-BF16/all-FP16 paths.
  • Expands tests around the GLM-style decode shape from #3744 (CK Tile FMHA/batch-prefill: support BF16/FP16 Q with FP8 KV for GQA decode) and around future explicit q_data_type / kv_data_type codegen configs.
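
As a rough illustration of the classification described in the summary, here is a minimal sketch of a dtype-contract classifier. Only the contract names `mixed_q_fp8_kv` and `fp8bf16` come from this PR; `DtypeRequest`, `classify_contract`, the dtype strings, and the `unsupported_other` fallback are hypothetical names chosen for the example and are not the code in this branch.

```python
from dataclasses import dataclass

# Hypothetical dtype strings; the real dispatcher may spell these differently.
HALF_DTYPES = {"bf16", "fp16"}
FP8 = "fp8"


@dataclass(frozen=True)
class DtypeRequest:
    """Per-operand dtypes requested by a caller (hypothetical structure)."""
    q: str
    k: str
    v: str
    o: str


def classify_contract(req: DtypeRequest) -> str:
    """Map a per-operand dtype request to a named contract."""
    if req.k != req.v:
        return "unsupported_other"
    # BF16/FP16 Q and O with FP8 K/V: recognized, but explicitly unsupported.
    if req.q in HALF_DTYPES and req.k == FP8 and req.o == req.q:
        return "mixed_q_fp8_kv"
    # Existing path: FP8 Q/K/V with BF16 output.
    if req.q == FP8 and req.k == FP8 and req.o == "bf16":
        return "fp8bf16"
    # Adjacent all-BF16 / all-FP16 paths.
    if req.q in HALF_DTYPES and req.q == req.k == req.o:
        return req.q
    return "unsupported_other"


def is_supported(contract: str) -> bool:
    """Only the pre-existing paths are treated as supported in this draft."""
    return contract in {"fp8bf16", "bf16", "fp16"}
```

With this shape, `classify_contract(DtypeRequest("bf16", "fp8", "fp8", "bf16"))` yields `mixed_q_fp8_kv`, which `is_supported` rejects, so the request is never mistaken for a plain BF16 batch-prefill call.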

Current status

This remains a draft preparatory PR. It does not implement full mixed-dtype CK Tile batch-prefill kernels for #3744.

The branch now makes the support boundary precise in dispatcher and codegen: CK Tile batch-prefill currently selects Q/K/V operand types from one data_type token, and fp8bf16 means FP8 Q/K/V with BF16 output. The requested contract needs separate Q activation dtype, KV storage dtype, output dtype, descale plumbing, and generated kernel instances before this PR can claim full support.
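
To make the single-token limitation concrete, the sketch below expands each data_type token into per-operand dtypes and then searches for a token matching a request. The `fp8bf16` expansion follows the description above; the `bf16` and `fp16` token names, the table layout, and `find_token` are assumptions for illustration only.

```python
from typing import Optional

# One token fixes all four operand dtypes at once (layout is hypothetical).
TOKEN_OPERANDS = {
    "fp16": {"q": "fp16", "k": "fp16", "v": "fp16", "o": "fp16"},
    "bf16": {"q": "bf16", "k": "bf16", "v": "bf16", "o": "bf16"},
    "fp8bf16": {"q": "fp8", "k": "fp8", "v": "fp8", "o": "bf16"},
}


def find_token(q: str, kv: str, o: str) -> Optional[str]:
    """Return the token whose expansion matches the request, or None."""
    wanted = {"q": q, "k": kv, "v": kv, "o": o}
    for token, operands in TOKEN_OPERANDS.items():
        if operands == wanted:
            return token
    return None
```

Under these assumptions `find_token("fp8", "fp8", "bf16")` returns `"fp8bf16"`, while `find_token("bf16", "fp8", "bf16")` returns `None`: no single token can express BF16 Q with FP8 K/V, which is why separate Q, KV, and output dtype fields are needed.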

Tests

  • python -m unittest dispatcher.tests.test_fmha_utils dispatcher.tests.test_fmha_rules dispatcher.tests.test_fmha_codegen

Local validation limits

  • fleet, hipcc, and rocminfo were not available on PATH in this Windows environment, so I could not run a gfx942 compile/probe smoke.
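
A check like the one described can be scripted so that a smoke run is skipped cleanly when the tooling is absent. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper, not part of this PR.

```python
import shutil

# Tool names taken from the note above.
REQUIRED_TOOLS = ("fleet", "hipcc", "rocminfo")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH, in the given order."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Skipping gfx942 smoke; not on PATH: " + ", ".join(missing))
    else:
        print("HIP tooling found; gfx942 smoke can run.")
```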

Remaining kernel work

  • Add explicit mixed dtype fields through the CK Tile batch-prefill traits/args and dispatcher signatures.
  • Add a mixed TypeConfig or equivalent kernel template path where Q/O are BF16 or FP16 while K/V cache storage is FP8 E4M3.
  • Thread K/V descales through the batch-prefill API and generated instances.
  • Generate and validate gfx942 instances for the GLM GQA decode case.
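
For reference, the sketch below shows what "FP8 E4M3 storage with descales" means numerically: each cache byte is decoded to a float and multiplied by a descale factor. The PR text does not say which E4M3 variant is intended, so the sketch supports both the OCP encoding (bias 7) and the FNUZ encoding (bias 8) behind a flag. It is a host-side Python reference with hypothetical function names, not the kernel path, and the per-tensor descale is an assumption.

```python
import math


def decode_e4m3(byte: int, fnuz: bool = False) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if fnuz:
        # FNUZ: bias 8, no infinities, 0x80 is the only NaN encoding.
        if byte == 0x80:
            return math.nan
        bias = 8
    else:
        # OCP E4M3: bias 7, no infinities, S.1111.111 encodes NaN.
        if exp == 0xF and mant == 0x7:
            return math.nan
        bias = 7
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (mant / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - bias)


def dequantize(raw: bytes, descale: float, fnuz: bool = False) -> list[float]:
    """Apply a per-tensor descale (assumed granularity) to decoded values."""
    return [decode_e4m3(b, fnuz) * descale for b in raw]
```

A mixed kernel would perform the equivalent conversion on K/V tiles while Q stays in BF16 or FP16, which is why the descales have to be threaded through the API.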

Next validation spec

Run on MI300X/gfx942 with HIP tooling available:

  • API: CK Tile mha_batch_prefill_func / batch-prefill dispatcher
  • Shape: batch=4, q_len=1, ctx_len=1024, num_q_heads=96, num_kv_heads=8, head_dim=128, GQA ratio 12
  • KV metadata: paged/batch-prefill metadata, page_size=1, SGLang-style lookup if available
  • Dtypes: Q/O bf16 and fp16 variants, K/V cache uint8 storage interpreted as FP8 E4M3 with descales
  • Target: gfx942
  • Expected for this draft: the dispatcher reports the explicit unsupported mixed activation/FP8-KV contract. Expected after kernel implementation: CK Tile compiles and runs without requiring Q quantization to FP8.
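
The shape in this spec can be sanity-checked with a little arithmetic. The sketch below derives the GQA ratio, the page count, and the K/V cache footprint for one layer; `DecodeShape` and its methods are hypothetical, and the contiguous `q_head // ratio` grouping is the conventional GQA mapping assumed here rather than something the spec states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeShape:
    """Shape values from the validation spec above."""
    batch: int = 4
    q_len: int = 1
    ctx_len: int = 1024
    num_q_heads: int = 96
    num_kv_heads: int = 8
    head_dim: int = 128
    page_size: int = 1

    @property
    def gqa_ratio(self) -> int:
        if self.num_q_heads % self.num_kv_heads != 0:
            raise ValueError("num_q_heads must be a multiple of num_kv_heads")
        return self.num_q_heads // self.num_kv_heads

    def kv_head_for(self, q_head: int) -> int:
        """Assumed mapping: contiguous groups of Q heads share one KV head."""
        return q_head // self.gqa_ratio

    @property
    def num_pages(self) -> int:
        """Pages needed across the batch (ceil division per sequence)."""
        pages_per_seq = -(-self.ctx_len // self.page_size)
        return self.batch * pages_per_seq

    def kv_cache_bytes(self, bytes_per_elem: int) -> int:
        """K plus V storage for one layer, ignoring paging overhead."""
        elems = self.batch * self.ctx_len * self.num_kv_heads * self.head_dim
        return 2 * elems * bytes_per_elem


shape = DecodeShape()
print(shape.gqa_ratio)          # 12
print(shape.num_pages)          # 4096
print(shape.kv_cache_bytes(1))  # 8388608 bytes with 1-byte FP8 storage
print(shape.kv_cache_bytes(2))  # 16777216 bytes with 2-byte BF16/FP16 storage
```

For this shape, FP8 K/V storage takes 8 MiB per layer versus 16 MiB for BF16 or FP16 storage.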

Made with Cursor

ysingh-amd and others added 2 commits June 8, 2026 16:38
Co-authored-by: Cursor <cursoragent@cursor.com>
Make the unsupported BF16/FP16 Q with FP8 KV path explicit across dispatcher and codegen validation so future kernel plumbing has a single contract to extend.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ysmkone ysmkone changed the title from "[CK Tile] Diagnose unsupported mixed batch-prefill FP8 KV" to "[CK Tile] Prepare mixed batch-prefill FP8 KV contract" Jun 9, 2026

ThomasNing commented Jun 9, 2026

Contributor

Thanks for the fast response with this PR! I believe we need to put it into rocm-libraries (https://github.com/ROCm/rocm-libraries/pulls). The CK repo is just a mirrored repo right now.

@ecamartins @CongMa13 Do you mind putting it into rocm-libraries to run the preliminary CI?

cc. @illsilin @cgmillette


3 participants