CK Tile Dispatcher Examples

Comprehensive examples for GEMM and Grouped Convolution operations with GPU execution.

Quick Start

Step 1: Build

cd /path/to/composable_kernel/dispatcher
mkdir -p build && cd build

cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON

# Build everything (C++ examples + Python libraries)
make -j$(nproc)

# Or build ONLY Python libraries (faster)
make python_libs -j$(nproc)

Step 2: Run C++ Examples

cd /path/to/composable_kernel/dispatcher/build/examples

# GEMM
./gemm_01_basic
./gemm_02_multi_size
./gemm_03_benchmark_validation
./gemm_04_heuristics
./gemm_05_json_export
./gemm_06_multi_registry

Step 3: Run Python Examples

cd /path/to/composable_kernel/dispatcher

# GEMM
python3 examples/gemm/python/01_basic_gemm.py
python3 examples/gemm/python/04_validation.py
python3 examples/gemm/python/07_stress_test.py
python3 examples/gemm/python/08_heuristics.py

Directory Structure

examples/
|---- gemm/
|   |---- cpp/           # 7 C++ GEMM examples
|   +---- python/        # 11 Python GEMM examples
|
|---- grouped_conv/
|   |---- cpp/           # 7 C++ Grouped Conv examples
|   +---- python/        # 6 Python Grouped Conv examples
|
|---- fmha/
|   |---- cpp/           # 35 C++ FMHA examples (all variants)
|   +---- python/        # 38 Python FMHA examples (JIT-compiled)
|
+---- README.md

GEMM Examples

C++ Examples

#	Example	Description
01	`gemm_01_basic`	Basic GEMM with declarative API, autofill, autocorrect
02	`gemm_02_multi_size`	Wildcard expansion for multiple configurations
03	`gemm_03_benchmark_validation`	Performance benchmarking with CPU/GPU validation
04	`gemm_04_heuristics`	Heuristic-based kernel selection
05	`gemm_05_json_export`	Registry JSON export for external tools
06	`gemm_06_multi_registry`	Multiple registries with named kernel sets

Details: gemm/cpp/README.md

Python Examples

#	Example	Description
01	`01_basic_gemm.py`	Basic GEMM with multi-kernel support
02	`02_batch_gemm.py`	Batched GEMM operations
03	`03_benchmark.py`	Performance benchmarking
04	`04_validation.py`	CPU reference validation
05	`05_numpy_integration.py`	NumPy array integration
06	`06_json_export.py`	Registry JSON export
07	`07_stress_test.py`	Multi-kernel stress testing (48 configs)
08	`08_heuristics.py`	Heuristic-based kernel selection (24 configs)
09	`09_multi_registry.py`	Multiple registries
10	`10_advanced_benchmark.py`	Advanced benchmark with full control
11	`11_json_import.py`	Import kernels from JSON

Details: gemm/python/README.md

Key Features

Declarative Kernel API

Both C++ and Python examples use a declarative approach:

C++ (DECL_KERNEL_SET macro):

DECL_KERNEL_SET(my_kernels,
    .add(
        Signature().dtype("fp16").layout("rcr"),
        Algorithm().tile(256, 256, 32).wave(2, 2, 1).warp(32, 32, 16)
                   .pipeline("compv4").scheduler("intrawave"),
        "gfx942"
    )
);

Python (KernelConfig):

config = KernelConfig(
    tile_m=256, tile_n=256, tile_k=32,
    wave_m=2, wave_n=2, wave_k=1,
    warp_tile_m=32, warp_tile_n=32, warp_tile_k=16,
    pipeline="compv4", scheduler="intrawave"
)
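
To make the numbers above concrete, the sketch below works out how a block tile is divided among warps. `TileShape` is a hypothetical stand-in, not the dispatcher's `KernelConfig`. It assumes that `wave_*` counts warps per workgroup along each dimension, that `warp_tile_*` is the tile covered by one warp-level step, and that a wavefront has 64 lanes (as on gfx942).

```python
from dataclasses import dataclass

WAVEFRONT_SIZE = 64  # assumption: wave64, as on CDNA parts such as gfx942


@dataclass(frozen=True)
class TileShape:
    """Hypothetical stand-in mirroring the fields passed to KernelConfig."""
    tile_m: int
    tile_n: int
    tile_k: int
    wave_m: int
    wave_n: int
    wave_k: int
    warp_tile_m: int
    warp_tile_n: int
    warp_tile_k: int

    def warps_per_block(self) -> int:
        # Warps are laid out on a wave_m x wave_n x wave_k grid.
        return self.wave_m * self.wave_n * self.wave_k

    def threads_per_block(self) -> int:
        return self.warps_per_block() * WAVEFRONT_SIZE

    def warp_iterations(self) -> tuple:
        """Warp tiles each warp covers along M, N, K of the block tile."""
        return (
            self.tile_m // (self.wave_m * self.warp_tile_m),
            self.tile_n // (self.wave_n * self.warp_tile_n),
            self.tile_k // (self.wave_k * self.warp_tile_k),
        )


shape = TileShape(256, 256, 32, 2, 2, 1, 32, 32, 16)
print(shape.warps_per_block(), shape.threads_per_block(), shape.warp_iterations())
```

Under these assumptions, the 256x256x32 configuration uses 4 warps (256 threads), and each warp steps over a 4x4x2 grid of warp tiles.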

Autofill and Autocorrect

The build system automatically:

Autofills missing parameters with sensible defaults
Autocorrects invalid parameters based on architecture constraints
Expands wildcards (*, -1, ANY_INT) to all valid configurations
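
The wildcard expansion described above can be pictured as a Cartesian product over the candidate values of each wildcard field. The following is a minimal sketch of that idea only: the `CANDIDATES` table, the `expand_wildcards` helper, and the use of `-1` as the value of `ANY_INT` are assumptions for illustration, not the dispatcher's implementation.

```python
from itertools import product

ANY_INT = -1  # assumption: wildcard sentinel, following the "-1, ANY_INT" wording above

# Hypothetical candidate values; the real build system derives these per architecture.
CANDIDATES = {
    "tile_m": [128, 256],
    "tile_n": [128, 256],
    "tile_k": [32, 64],
}


def expand_wildcards(spec: dict) -> list:
    """Replace every wildcard field with each candidate value (Cartesian product)."""
    keys = list(spec)
    choices = []
    for key in keys:
        value = spec[key]
        if value in ("*", ANY_INT):
            choices.append(CANDIDATES[key])  # wildcard: try every candidate
        else:
            choices.append([value])          # fixed value: keep as-is
    return [dict(zip(keys, combo)) for combo in product(*choices)]


configs = expand_wildcards({"tile_m": 256, "tile_n": "*", "tile_k": ANY_INT})
for cfg in configs:
    print(cfg)
```

With one fixed field and two wildcards of two candidates each, the sketch yields four concrete configurations.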

Architecture Filtering

Kernel configurations are validated against GPU architecture constraints:

Tile divisibility requirements
Warp tile constraints
Pipeline compatibility

Invalid configurations are automatically pruned during code generation.
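
As a rough illustration of the three checks listed above, the sketch below prunes a list of configurations against a per-architecture rule table. The `ARCH_RULES` contents, the `is_valid` and `prune` helpers, and the exact form of the divisibility rule are assumptions chosen for the example; the actual constraints live in the code generator.

```python
# Hypothetical constraint table; the values are illustrative, not authoritative.
ARCH_RULES = {
    "gfx942": {
        "warp_tiles": {(32, 32, 8), (32, 32, 16), (16, 16, 16)},
        "pipelines": {"compv3", "compv4"},
    },
}


def is_valid(cfg: dict, arch: str) -> bool:
    rules = ARCH_RULES[arch]
    # Tile divisibility: the block tile must split evenly across warps and warp tiles.
    for dim in ("m", "n", "k"):
        step = cfg[f"wave_{dim}"] * cfg[f"warp_tile_{dim}"]
        if cfg[f"tile_{dim}"] % step != 0:
            return False
    # Warp tile constraint: only shapes listed for this architecture are accepted.
    warp_tile = (cfg["warp_tile_m"], cfg["warp_tile_n"], cfg["warp_tile_k"])
    if warp_tile not in rules["warp_tiles"]:
        return False
    # Pipeline compatibility.
    return cfg["pipeline"] in rules["pipelines"]


def prune(configs: list, arch: str) -> list:
    """Keep only the configurations that pass every check."""
    return [cfg for cfg in configs if is_valid(cfg, arch)]


good = dict(tile_m=256, tile_n=256, tile_k=32, wave_m=2, wave_n=2, wave_k=1,
            warp_tile_m=32, warp_tile_n=32, warp_tile_k=16, pipeline="compv4")
bad_tile = dict(good, tile_m=100)        # 100 is not a multiple of 2 * 32
bad_pipeline = dict(good, pipeline="unknown")

print(len(prune([good, bad_tile, bad_pipeline], "gfx942")))
```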

Validation Examples

C++ Validation

./gemm_03_benchmark_validation --verify 1    # GEMM with CPU reference
./gemm_03_benchmark_validation --verify 2    # GEMM with GPU reference

Python Validation

python3 examples/gemm/python/04_validation.py
python3 examples/gemm/python/07_stress_test.py   # Multi-kernel validation
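
For readers unfamiliar with CPU reference validation, the sketch below shows the general idea in pure Python: compute the GEMM with a straightforward triple loop and compare it element-wise against the result under test, within a tolerance. It is not the code used by `04_validation.py`. The helper names and tolerances are hypothetical, and it assumes the `rcr` layout means row-major A, column-major B, and row-major C.

```python
import math


def reference_gemm_rcr(a, b_col_major, m, n, k):
    """C[i][j] = sum_p A[i][p] * B[p][j], with A row-major and B column-major."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # A(i, p) sits at i*k + p; column-major B(p, j) sits at j*k + p.
                acc += a[i * k + p] * b_col_major[j * k + p]
            c[i][j] = acc
    return c


def all_close(result, reference, rtol=1e-3, atol=1e-3):
    """Element-wise comparison; the tolerances here are illustrative only."""
    return all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol)
        for row_res, row_ref in zip(result, reference)
        for x, y in zip(row_res, row_ref)
    )


m, n, k = 2, 2, 3
a = [1, 2, 3, 4, 5, 6]   # A = [[1, 2, 3], [4, 5, 6]], row-major
b = [1, 3, 5, 2, 4, 6]   # B = [[1, 2], [3, 4], [5, 6]], stored column by column
ref = reference_gemm_rcr(a, b, m, n, k)
print(ref, all_close(ref, ref))
```

In a real check, the first argument to `all_close` would be the GPU output copied back to the host.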

Troubleshooting

Python: Library not found

# Run from dispatcher directory
cd /path/to/composable_kernel/dispatcher
python3 examples/gemm/python/01_basic_gemm.py

C++: Executables not found

# Build with examples enabled
cmake .. -DBUILD_DISPATCHER_EXAMPLES=ON
make -j$(nproc)

# Run from build/examples (the cmake command above is run inside build/)
cd examples
./gemm_01_basic

GPU not detected

rocminfo | grep "Name:"
# Should show: gfx942, gfx90a, etc.
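
The same check can be scripted, for example before launching the Python examples. This is a small sketch, assuming `rocminfo` prints agent names on lines of the form `Name: gfx942`; the helper names are hypothetical and not part of the dispatcher.

```python
import re
import subprocess


def parse_gfx_targets(rocminfo_output: str) -> list:
    """Return the unique gfx* agent names found in rocminfo-style output, in order."""
    names = re.findall(
        r"^[ \t]*Name:[ \t]*(gfx[0-9a-f]+)[ \t]*$",
        rocminfo_output,
        flags=re.MULTILINE,
    )
    return list(dict.fromkeys(names))  # de-duplicate while keeping order


def detect_gpus() -> list:
    """Run rocminfo and return detected gfx targets ([] if it cannot be run)."""
    try:
        proc = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gfx_targets(proc.stdout)


if __name__ == "__main__":
    print(detect_gpus() or "No gfx target found; check the ROCm installation")
```

An empty list usually points to a missing ROCm installation or a driver problem rather than an issue with the examples themselves.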

Grouped Convolution

Grouped convolution support has been re-introduced with a unified infrastructure shared with GEMM.

Infrastructure

The grouped convolution code generation, utilities, and build scripts live at the following locations:

Component	Location
C++ Headers	`include/ck_tile/dispatcher/grouped_conv_*.hpp`
Python Codegen	`codegen/unified_grouped_conv_codegen.py`
Python Utils	`python/grouped_conv_utils.py`
Build Script	`scripts/compile_grouped_conv_examples.py`

Building Grouped Conv Kernels

# Generate grouped conv kernels
python3 codegen/unified_grouped_conv_codegen.py \
    --output-dir build/generated_kernels \
    --datatype fp16 --variant forward --ndim-spatial 2

# Compile a grouped conv example
python3 scripts/compile_grouped_conv_examples.py my_grouped_conv_example.cpp

See the main README for more details.
