# rocm_ck

A C++20 constexpr API for configuring and distributing
[CK Tile](../include/ck_tile/) GPU kernels across multiple architectures.

> **Status**: Early development. The current code establishes the directory
> structure, build integration, and CI pipeline. A single unit test verifies
> that the build and test infrastructure works end-to-end in Jenkins.
> The schema types, device bridge, and kernel tests described below are
> under active development.

## Why rocm_ck exists

CK Tile kernels are C++ templates. A GEMM kernel's tile size, pipeline
strategy, data types, and epilogue are all template parameters, fixed at
compile time. This is excellent for performance (zero-overhead abstraction,
full inlining), but it creates a problem for multi-architecture distribution:
the host program must be compiled separately from device code, and the host
compiler must never see CK Tile headers.

rocm_ck solves this by introducing a **host-device boundary** built on
constexpr data rather than template parameters:

1. **On the host side**, kernel configurations are plain C++20 structs
   (`Signature`, `Algorithm`, `GemmSpec`). These are constexpr data:
   they describe *what* to compute and *how*, without instantiating any
   templates. Host code reasons about kernels using values, not types.

2. **On the device side**, a thin bridge layer lowers these constexpr
   descriptions into CK Tile template instantiations. Each `GemmSpec`
   maps to exactly one `ck_tile::GemmPipeline<...>` specialization.

3. **At the boundary**, pre-compiled kernels are packaged into
   [kpack archives](https://github.com/ROCm/TheRock/blob/main/docs/rfcs/RFC0008-Multi-Arch-Packaging.md):
   self-describing, compressed, multi-architecture bundles. The host
   loads kernels at runtime by matching a `GemmSpec` against the kpack
   table of contents. No recompilation, no template instantiation on
   the host.

This separation is what makes CK Tile viable in
[TheRock](https://github.com/ROCm/TheRock)'s multi-arch build system,
where a single host binary must work with device code compiled for
many GPU targets (e.g. gfx90a, gfx942, gfx1151).

## The constexpr schema model

Traditional GPU kernel libraries select kernels through template
parameters or runtime enums. rocm_ck uses a third approach: **constexpr
structs that are validated at compile time and lowered to templates on
the device side.**

A kernel configuration has two axes:

- **Signature** (*what* the kernel computes): a directed graph of
  operators (`GemmOp`, `AddOp`, `ReluOp`, ...) connecting named tensor
  slots. Data types, layouts, and batch dimensions are part of the
  signature.

- **Algorithm** (*how* the kernel computes it): tile geometry, pipeline
  strategy, warp layout, padding, and scheduling. These are tuning
  parameters that don't change the mathematical result.

`Signature` and `Algorithm` are plain aggregate structs initialized with
designated initializers: no constructors, no inheritance, no runtime
polymorphism. Validation happens in `consteval` functions, so invalid
configurations (unsupported tile size, incompatible data types, missing
tensor slots) fail at compile time with actionable error messages.

Here is a preview of the API direction (not yet implemented):

```cpp
// Host side: pure constexpr, any C++20 compiler, no CK headers
constexpr Signature sig = {
    .dtype = DataType::FP16,
    .ops = {
        GemmOp{.lhs = "A", .rhs = "B", .out = "C"},
        AddOp{.lhs = "C", .rhs = "bias", .out = "D"},
        ReluOp{.in = "D", .out = "E"},
    },
};

// Device side: make_kernel lowers to a CK Tile template instantiation.
// Compiled separately per architecture, packaged into .kpack archives.
```

## Directory layout

```text
rocm_ck/
├── CMakeLists.txt          # INTERFACE library, C++20, ck_tile_headers target
├── include/rocm_ck/        # Public headers: host-safe, no CK/HIP deps
├── src/                    # (planned) Device bridge, kpack loading
└── tests/
    ├── CMakeLists.txt      # Test tiers: ROCM_CK_SMOKE, ROCM_CK_KERNEL
    ├── unit/               # Fast host-only tests (< 1s, no GPU)
    └── kernel/             # (planned) GPU kernel tests
```

## Build

rocm_ck is a CK feature, gated by `CK_ENABLE_ROCM_CK`:

```bash
cd composablekernel
cmake -B build -S . -G Ninja \
    -DCK_ENABLE_ROCM_CK=ON \
    -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++

ninja -C build smoke-rocm-ck   # host-only smoke tests
ninja -C build check-rocm-ck   # all rocm_ck tests
ctest --test-dir build -L ROCM_CK_SMOKE --output-on-failure
```

Default CK builds (`CK_ENABLE_ROCM_CK=OFF`) are unaffected.