feat(kernel): add KernelProvider SPI for matmul dispatch (Scalar baseline) #554
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Closes #553.

Introduces a small SPI between high-level tensor ops (`TensorOps.matmul` et al.) and the actual numeric kernels that do the FLOPs. This is the groundwork that lets a SIMD-accelerated matmul be plugged in without re-implementing the rest of an op-level backend, and lets a hand-written kernel be tested against a scalar reference.

What lands:

* `sk.ainet.backend.api.kernel.Fp32MatmulKernel`
  - `C(m, n) = A(m, k) · B(k, n)` row-major
  - element-stride parameters for caller sub-blocks (no copy needed)
  - implementations must not mutate inputs / must overwrite the m×n block of `out`
* `sk.ainet.backend.api.kernel.KernelProvider`
  - `name` / `priority` / `isAvailable()` / per-kernel accessors
  - per-accessor `null` lets callers fall through to a lower-priority provider when the higher one doesn't ship the kernel
* `sk.ainet.backend.api.kernel.KernelRegistry`
  - process-wide manual registration
  - `register()` / `find(name)` / `bestAvailable()` / `availableNames()`
  - `clearForTesting()` for tests
  - JVM ServiceLoader auto-discovery deferred to a follow-up PR (only one provider ships today; the registry shape supports it without further interface changes)
* `sk.ainet.exec.kernel.ScalarMatmulKernel` + `ScalarKernelProvider` (in `skainet-backend-cpu`)
  - triple-nested-loop reference; honours stride parameters
  - priority = 0; always available
  - guaranteed correctness reference and runtime fallback
* Tests:
  - `ScalarMatmulKernelTest`: small / medium / strided sub-blocks on both A and out / zero-m / zero-k / rejects negatives
  - `KernelRegistryTest`: empty / scalar-only / priority ordering / skip-unavailable / case-insensitive name lookup / re-register no-op

Out of scope (separate issues / PRs):

* Panama Vector matmul (the actual perf win on JVM).
* Native FFM matmul.
* Wiring `DefaultCpuOps.matmul` to consult the registry (needs at least one accelerated provider to make the dispatch worth doing).
* SDPA kernel API.
* Quantized kernels (Q4_K, Q8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
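For readers who want a concrete picture of the kernel contract described in the commit message, here is a minimal sketch of a strided, triple-nested-loop FP32 matmul. The real `Fp32MatmulKernel` signature is not shown on this page, so the interface name `Fp32MatmulSketch`, the parameter names and order, and the choice of per-row strides (in elements, with unit column stride) are all assumptions made for illustration.

```kotlin
// Hypothetical stand-in for Fp32MatmulKernel; the real signature may differ.
// Strides are in elements per row; columns are assumed to be contiguous.
interface Fp32MatmulSketch {
    fun matmul(
        m: Int, n: Int, k: Int,
        a: FloatArray, aOffset: Int, aRowStride: Int,
        b: FloatArray, bOffset: Int, bRowStride: Int,
        out: FloatArray, outOffset: Int, outRowStride: Int,
    )
}

// Scalar reference: C(m, n) = A(m, k) · B(k, n), row-major.
object ScalarMatmulSketch : Fp32MatmulSketch {
    override fun matmul(
        m: Int, n: Int, k: Int,
        a: FloatArray, aOffset: Int, aRowStride: Int,
        b: FloatArray, bOffset: Int, bRowStride: Int,
        out: FloatArray, outOffset: Int, outRowStride: Int,
    ) {
        require(m >= 0 && n >= 0 && k >= 0) { "dimensions must be non-negative" }
        for (i in 0 until m) {
            for (j in 0 until n) {
                var acc = 0.0f
                for (p in 0 until k) {
                    acc += a[aOffset + i * aRowStride + p] * b[bOffset + p * bRowStride + j]
                }
                // Always written, so stale contents of the m×n block are overwritten
                // (with zeros when k == 0). Inputs a and b are only read.
                out[outOffset + i * outRowStride + j] = acc
            }
        }
    }
}

// Usage: A is the left 2x2 block of a 2x3 buffer, passed without copying.
fun subBlockExample(): FloatArray {
    val aBuf = floatArrayOf(1f, 2f, 9f, 3f, 4f, 9f) // row stride 3; the 9s are outside A
    val b = floatArrayOf(5f, 6f, 7f, 8f)            // dense 2x2
    val out = FloatArray(4) { Float.NaN }           // stale values to be overwritten
    ScalarMatmulSketch.matmul(2, 2, 2, aBuf, 0, 3, b, 0, 2, out, 0, 2)
    return out // [19, 22, 43, 50]
}
```

An accelerated kernel could then be checked by running it and this scalar sketch on the same inputs and comparing the output blocks within a tolerance.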
Closes #553. First step on the M5 track (KernelProvider + accelerated kernels for matmul/SDPA/quantized). This PR lands only the SPI plus the scalar baseline; Panama Vector and native FFM kernels follow in separate PRs once this lands.

## Summary

- `sk.ainet.backend.api.kernel.Fp32MatmulKernel`: `C(m,n) = A(m,k)·B(k,n)`, row-major. Stride parameters let callers pass sub-blocks of larger arrays without copying. Implementations must not mutate inputs and must fully overwrite the `m × n` block of `out`.
- `KernelProvider`: `name` / `priority` / `isAvailable()` plus per-kernel accessors (`matmulFp32(): Fp32MatmulKernel?`). A per-accessor `null` lets callers fall through to a lower-priority provider when the higher one doesn't ship that kernel.
- `KernelRegistry`: manual register / `find(name)` / `bestAvailable()` / `availableNames()`. JVM `ServiceLoader` auto-discovery is deferred to a follow-up PR (only one provider ships today; the shape supports it without further interface changes).
- `ScalarMatmulKernel` + `ScalarKernelProvider` in `skainet-backend-cpu`: triple-nested-loop reference, priority = 0, always available. Acts as the correctness reference that accelerated kernels must match, and as the runtime fallback.

## Test plan

- `:skainet-backends:skainet-backend-cpu:jvmTest`:
  - `ScalarMatmulKernelTest`: small/medium shapes; strided sub-blocks on both A and out; m=0; k=0; rejects negative dimensions
  - `KernelRegistryTest`: empty / scalar-only / priority-ordering / skip-unavailable / case-insensitive name lookup / re-register no-op
- Plus pre-existing `:skainet-lang:skainet-lang-core:jvmTest` and `:skainet-compile:skainet-compile-dag:jvmTest` still green.

## Out of scope (separate issues/PRs)

- Wiring `DefaultCpuOps.matmul` to consult the registry: needs at least one accelerated provider to make the dispatch worth doing. Until then `ScalarKernelProvider` is reachable but unused by the existing op layer.

🤖 Generated with Claude Code
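The priority ordering and per-accessor `null` fall-through described for `KernelProvider` and `KernelRegistry` could look roughly like the sketch below. Every name here (`KernelSketch`, `ProviderSketch`, `RegistrySketch`, `FixedProvider`, `resolveMatmulFp32`) is hypothetical and not taken from the PR. In particular, keying the re-register no-op on the provider name is an assumption, and `resolveMatmulFp32` is an illustrative helper that the real registry may not expose.

```kotlin
// Hypothetical stand-ins; none of these are the real sk.ainet types.
interface KernelSketch { val label: String }

interface ProviderSketch {
    val name: String
    val priority: Int
    fun isAvailable(): Boolean
    fun matmulFp32(): KernelSketch? // null: this provider does not ship the kernel
}

class RegistrySketch {
    private val providers = mutableListOf<ProviderSketch>()

    // Assumption: registering a name that is already present is a no-op.
    fun register(p: ProviderSketch) {
        if (find(p.name) == null) providers += p
    }

    // Case-insensitive lookup by provider name.
    fun find(name: String): ProviderSketch? =
        providers.firstOrNull { it.name.equals(name, ignoreCase = true) }

    // Highest-priority provider that reports itself available.
    fun bestAvailable(): ProviderSketch? =
        providers.filter { it.isAvailable() }.maxByOrNull { it.priority }

    // Illustrative helper: walk available providers from highest to lowest
    // priority and take the first one whose accessor returns a kernel.
    fun resolveMatmulFp32(): KernelSketch? =
        providers.filter { it.isAvailable() }
            .sortedByDescending { it.priority }
            .firstNotNullOfOrNull { it.matmulFp32() }
}

class FixedProvider(
    override val name: String,
    override val priority: Int,
    private val available: Boolean,
    private val kernel: KernelSketch?,
) : ProviderSketch {
    override fun isAvailable(): Boolean = available
    override fun matmulFp32(): KernelSketch? = kernel
}

class NamedKernel(override val label: String) : KernelSketch

fun resolveExample(): String? {
    val registry = RegistrySketch()
    registry.register(FixedProvider("scalar", 0, true, NamedKernel("scalar-matmul")))
    registry.register(FixedProvider("simd", 10, true, null)) // available, but no matmul
    registry.register(FixedProvider("native", 20, false, NamedKernel("native-matmul")))
    // "native" is skipped (unavailable), "simd" returns null, so "scalar" wins.
    return registry.resolveMatmulFp32()?.label
}
```

In this sketch `bestAvailable()` would return the `simd` provider even though it ships no matmul, which is why a caller needs the per-accessor fall-through instead of asking only the single best provider.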