Commit 8f4ac53

Merge pull request #715 from SKaiNET-developers/feature/708-phase3-panama-q5
Feature/708 phase3 panama q5
2 parents 0c17a6b + 2c38e31 commit 8f4ac53

7 files changed

Lines changed: 333 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -104,6 +104,7 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,
 |---|---|
 | Examples and sample projects | [SKaiNET-examples](https://github.com/SKaiNET-developers/SKaiNET-examples) |
 | Interactive notebooks | [SKaiNET-notebook](https://github.com/SKaiNET-developers/SKaiNET-notebook) |
+| Eager backends & kernels (what runs where) | [Backends & kernels mindmap](docs/eager-execution-backends-and-kernels.md) |

 ---

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
# Eager execution: backends & kernels

A map of SKaiNET's **eager** compute path (the `TensorOps` backend and its pluggable
matmul **kernel providers**), showing what exists today (✅), what's in progress (🚧), and
what's missing (❌). The eager path is `DirectCpuExecutionContext → DefaultCpuOps* →
KernelRegistry → KernelProvider`, distinct from the StableHLO/IREE export path.

Legend: ✅ available · 🚧 partial / works via a legacy path · ❌ missing.

```mermaid
mindmap
  root((SKaiNET eager execution))
    CPU backend
      Scalar floor ✅
        commonMain, all KMP targets
        FP32 ✅
        BF16 ✅
        Q8_0 ✅
        Q4_0 ✅
        Q4_K ✅ new
        Q6_K ✅ new
        Q5_1 ✅ new
        Q5_0 ✅ new
      Panama Vector ✅
        JVM SIMD via jdk.incubator.vector
        FP32 BF16 Q8_0 Q4_0 ✅
        Q4_K ✅
        Q5_1 Q5_0 ✅ new
        Q6_K 🚧 legacy SIMD path
      Native FFM ✅
        JVM only, C kernels via CMake
        FP32 BF16 Q8_0 Q4_0 Q4_K ✅
        Q4_K MemSeg zero-copy ✅
        Q5_1 Q5_0 Q6_K ❌
      Apple Accelerate ✅
        Native macOS iOS via cinterop
        dense FP32 matmul ✅
        elementwise reductions ✅
        packed quant via scalar
    Platforms
      JVM ✅ scalar + Panama + FFM
      Native linux ✅ scalar only
      Native apple ✅ scalar + Accelerate
      JS and WASM ✅ scalar only
    Gaps and roadmap
      Native FFM Q5 and Q6_K ❌ issue 708
      Native SIMD on linux ❌
      Panama SPI Q6_K kernel 🚧
      Q5_K Q2_K Q3_K IQ4 packed ❌ dequant only
      GPU backends IREE Metal ❌ future
```

## Kernel × provider (matmul, FP32 activations)

| Weight format | Scalar (all targets) | Panama Vector (JVM SIMD) | Native FFM (JVM) |
|---|:--:|:--:|:--:|
| FP32 | ✅ | ✅ | ✅ |
| BF16 | ✅ | ✅ | ✅ |
| Q8_0 | ✅ | ✅ | ✅ |
| Q4_0 | ✅ | ✅ | ✅ |
| Q4_K | ✅ | ✅ | ✅ |
| Q6_K | ✅ | 🚧 legacy `JvmQuantizedVectorKernels` (no SPI kernel) | ❌ |
| Q5_1 | ✅ | ✅ | ❌ |
| Q5_0 | ✅ | ✅ | ❌ |
| Q5_K / Q2_K / Q3_K / Q8_K / IQ4 | ❌ (dequant-to-FP32 only) | ❌ | ❌ |

Resolution is by priority: **Native FFM (100) → Panama (50) → Scalar (0)**. The best
*available* provider that carries the kernel wins; otherwise resolution cascades down.
70+
## Platform × what runs
71+
72+
| Target | Providers available | Notes |
73+
|---|---|---|
74+
| **JVM / Android(JVM)** | Scalar + Panama + Native-FFM | full SIMD/native acceleration |
75+
| **Kotlin/Native — linux x64/arm64** | Scalar | no SIMD yet (scalar floor) |
76+
| **Kotlin/Native — macOS/iOS** | Scalar + Apple Accelerate | Accelerate accelerates *dense* FP32; packed-quant via scalar |
77+
| **JS / WASM (Js, Wasi)** | Scalar | no SIMD |
78+
79+
**Packed-quant matmul now works on every target** (Q4_K/Q6_K/Q5_1/Q5_0 gained a commonMain
80+
scalar kernel, and `DefaultCpuOpsBase` dispatches packed weights via the registry). Before,
81+
those formats were JVM-only and broke on Native.

## In progress / missing (with trackers)

- 🚧 **Q6_K Panama SPI kernel**: Q6_K is SIMD-accelerated on JVM via the legacy `JvmQuantizedVectorKernels.matmulQ6_KVec`, but has no `PanamaVectorQ6KMatmulKernel`/`KernelProvider.matmulQ6K()` SPI entry yet.
- ❌ **Native FFM Q5_1/Q5_0/Q6_K**: the C kernel set covers FP32/BF16/Q8_0/Q4_0/Q4_K only. Tracked by **SKaiNET#708** (core kernel) and **SKaiNET-transformers#170** (converter wiring).
- ❌ **Native SIMD on linux**: Kotlin/Native linux targets run the scalar floor; there is no cinterop/OpenBLAS or SIMD path. (Apple has Accelerate for dense ops.)
- ❌ **Other GGML quant formats** (Q5_K, Q2_K, Q3_K, Q8_K, IQ4_NL/XS): loadable via dequant-to-FP32, but no packed matmul kernel.
- ❌ **Non-CPU eager backends** (IREE, Metal, GPU): the `KernelProvider` SPI anticipates them, but none are implemented for the eager path today.

> This is a hand-authored overview. A machine-generated kernel × platform matrix
> (derived from the registered `KernelProvider`s) is a planned follow-up so this stays in sync.

skainet-backends/skainet-backend-cpu/api/jvm/skainet-backend-cpu.api

Lines changed: 10 additions & 0 deletions
@@ -92,6 +92,16 @@ public final class sk/ainet/exec/kernel/PanamaVectorQ4_0MatmulKernel : sk/ainet/
     public fun matmul ([FI[BIII[FI)V
 }

+public final class sk/ainet/exec/kernel/PanamaVectorQ5_0MatmulKernel : sk/ainet/backend/api/kernel/Q5_0MatmulKernel {
+    public static final field INSTANCE Lsk/ainet/exec/kernel/PanamaVectorQ5_0MatmulKernel;
+    public fun matmul ([FI[BIII[FI)V
+}
+
+public final class sk/ainet/exec/kernel/PanamaVectorQ5_1MatmulKernel : sk/ainet/backend/api/kernel/Q5_1MatmulKernel {
+    public static final field INSTANCE Lsk/ainet/exec/kernel/PanamaVectorQ5_1MatmulKernel;
+    public fun matmul ([FI[BIII[FI)V
+}
+
 public final class sk/ainet/exec/kernel/PanamaVectorQ8_0MatmulKernel : sk/ainet/backend/api/kernel/Q8_0MatmulKernel {
     public static final field INSTANCE Lsk/ainet/exec/kernel/PanamaVectorQ8_0MatmulKernel;
     public fun matmul ([FI[BIII[FI)V

skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/kernel/PanamaVectorKernelProvider.kt

Lines changed: 8 additions & 0 deletions
@@ -5,6 +5,8 @@ import sk.ainet.backend.api.kernel.Fp32MatmulKernel
 import sk.ainet.backend.api.kernel.KernelProvider
 import sk.ainet.backend.api.kernel.Q4KMatmulKernel
 import sk.ainet.backend.api.kernel.Q4_0MatmulKernel
+import sk.ainet.backend.api.kernel.Q5_0MatmulKernel
+import sk.ainet.backend.api.kernel.Q5_1MatmulKernel
 import sk.ainet.backend.api.kernel.Q8_0MatmulKernel
 import sk.ainet.exec.tensor.ops.JvmCpuBackendConfig

@@ -53,6 +55,12 @@ public object PanamaVectorKernelProvider : KernelProvider {
     override fun matmulQ4_0(): Q4_0MatmulKernel? =
         if (isAvailable()) PanamaVectorQ4_0MatmulKernel else null

+    override fun matmulQ5_1(): Q5_1MatmulKernel? =
+        if (isAvailable()) PanamaVectorQ5_1MatmulKernel else null
+
+    override fun matmulQ5_0(): Q5_0MatmulKernel? =
+        if (isAvailable()) PanamaVectorQ5_0MatmulKernel else null
+
     private fun isVectorApiClassLoaded(): Boolean = runCatching {
         Class.forName("jdk.incubator.vector.FloatVector")
         Class.forName("jdk.incubator.vector.VectorSpecies")

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
package sk.ainet.exec.kernel

import jdk.incubator.vector.FloatVector
import jdk.incubator.vector.VectorOperators
import jdk.incubator.vector.VectorSpecies
import sk.ainet.backend.api.kernel.Q5_0MatmulKernel

/**
 * SIMD-vectorized FP32 × Q5_0 matmul on the JDK Vector API (scratch-dequant then FMA).
 * Dequant `d*(code + (highBit shl 4) - 16)` (symmetric, no per-block min). Numerically
 * equivalent to [ScalarQ5_0MatmulKernel] within FMA + reordered-reduction tolerance.
 * Block-major layout `(blockIdx*outputDim+o)*22`.
 */
public object PanamaVectorQ5_0MatmulKernel : Q5_0MatmulKernel {

    private const val BLOCK_SIZE = 32
    private const val BYTES_PER_BLOCK = 22
    private val floatSpecies: VectorSpecies<Float> = FloatVector.SPECIES_PREFERRED

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        require(inputDim % BLOCK_SIZE == 0) {
            "PanamaVectorQ5_0MatmulKernel: inputDim must be a multiple of $BLOCK_SIZE; got $inputDim"
        }
        if (outputDim == 0) return
        if (inputDim == 0) { for (o in 0 until outputDim) output[outputOffset + o] = 0f; return }
        val blocksPerInputDim = inputDim / BLOCK_SIZE
        val step = floatSpecies.length()
        val loopBound = floatSpecies.loopBound(BLOCK_SIZE)
        val codeBuf = FloatArray(BLOCK_SIZE) // reusable scratch for one dequantized block

        for (o in 0 until outputDim) {
            var acc = 0f
            for (blockIdx in 0 until blocksPerInputDim) {
                val base = weightByteOffset + (blockIdx * outputDim + o) * BYTES_PER_BLOCK
                // Block header: little-endian FP16 scale `d`, then the 4 `qh` bytes.
                val d = halfToFloat(((weight[base + 1].toInt() and 0xFF) shl 8) or (weight[base].toInt() and 0xFF))
                // qh0..qh3 form a little-endian 32-bit mask: bit j is the high bit of
                // element j, bit 16 + j is the high bit of element 16 + j.
                val qh0 = weight[base + 2].toInt() and 0xFF
                val qh1 = weight[base + 3].toInt() and 0xFF
                val qh2 = weight[base + 4].toInt() and 0xFF
                val qh3 = weight[base + 5].toInt() and 0xFF
                val qsBase = base + 6
                for (j in 0 until 16) {
                    val q = weight[qsBase + j].toInt() and 0xFF
                    val bitLo = ((if (j < 8) qh0 else qh1) ushr (j and 7)) and 1
                    val bitHi = ((if (j < 8) qh2 else qh3) ushr (j and 7)) and 1
                    codeBuf[j] = d * ((q and 0x0F) + (bitLo shl 4) - 16)
                    codeBuf[16 + j] = d * ((q ushr 4) + (bitHi shl 4) - 16)
                }
                // SIMD-FMA the dequantized block against the matching input window.
                val inputBase = inputOffset + blockIdx * BLOCK_SIZE
                var accVec = FloatVector.zero(floatSpecies)
                var k = 0
                while (k < loopBound) {
                    accVec = FloatVector.fromArray(floatSpecies, input, inputBase + k)
                        .fma(FloatVector.fromArray(floatSpecies, codeBuf, k), accVec)
                    k += step
                }
                acc += accVec.reduceLanes(VectorOperators.ADD)
                while (k < BLOCK_SIZE) { acc += input[inputBase + k] * codeBuf[k]; k++ }
            }
            output[outputOffset + o] = acc
        }
    }

    private fun halfToFloat(hbits: Int): Float {
        val sign = (hbits and 0x8000) shl 16
        val exp = (hbits and 0x7C00) shr 10
        val mant = hbits and 0x03FF
        return when (exp) {
            0 -> if (mant == 0) Float.fromBits(sign) else {
                var m = mant; var e = -14
                while ((m and 0x400) == 0) { m = m shl 1; e-- }
                Float.fromBits(sign or ((e + 127) shl 23) or ((m and 0x3FF) shl 13))
            }
            31 -> Float.fromBits(sign or (0xFF shl 23) or (mant shl 13))
            else -> Float.fromBits(sign or ((exp - 15 + 127) shl 23) or (mant shl 13))
        }
    }
}
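
To make the split-nibble layout read by the kernel above easier to follow, here is a small round-trip sketch of the 5-bit codes for one 32-element block. `packCodes` and `unpackCodes` are hypothetical helpers written for this illustration (they are not part of the commit), and the sketch deliberately omits the 2-byte FP16 scale, so it works on 20 bytes (4 `qh` + 16 `qs`) rather than the full 22-byte block.

```kotlin
/** Packs 32 five-bit codes into 4 high-bit bytes (qh) followed by 16 nibble bytes (qs). */
fun packCodes(codes: IntArray): ByteArray {
    require(codes.size == 32 && codes.all { it in 0..31 })
    val qh = ByteArray(4)
    val qs = ByteArray(16)
    for (j in 0 until 16) {
        val lo = codes[j]       // element j: low nibble of qs[j], high bit at qh bit j
        val hi = codes[16 + j]  // element 16 + j: high nibble of qs[j], high bit at qh bit 16 + j
        qs[j] = ((lo and 0x0F) or ((hi and 0x0F) shl 4)).toByte()
        val loByte = j / 8
        val hiByte = 2 + j / 8
        qh[loByte] = (qh[loByte].toInt() or ((lo ushr 4) shl (j and 7))).toByte()
        qh[hiByte] = (qh[hiByte].toInt() or ((hi ushr 4) shl (j and 7))).toByte()
    }
    return qh + qs
}

/** Inverse of [packCodes], using the same bit extraction as the kernel's inner loop. */
fun unpackCodes(block: ByteArray): IntArray {
    require(block.size == 20)
    val codes = IntArray(32)
    for (j in 0 until 16) {
        val q = block[4 + j].toInt() and 0xFF
        val bitLo = ((block[j / 8].toInt() and 0xFF) ushr (j and 7)) and 1
        val bitHi = ((block[2 + j / 8].toInt() and 0xFF) ushr (j and 7)) and 1
        codes[j] = (q and 0x0F) or (bitLo shl 4)
        codes[16 + j] = (q ushr 4) or (bitHi shl 4)
    }
    return codes
}

fun main() {
    val codes = IntArray(32) { (it * 7) % 32 } // a permutation of 0..31
    println(unpackCodes(packCodes(codes)).contentEquals(codes)) // true
    val d = 0.5f
    println(d * (0 - 16))  // Q5_0 value for code 0:  -8.0
    println(d * (31 - 16)) // Q5_0 value for code 31:  7.5
}
```

The last two lines show the symmetric Q5_0 mapping `d * (code - 16)`: with 5-bit codes, the representable range per block is `[-16d, 15d]`.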
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
package sk.ainet.exec.kernel

import jdk.incubator.vector.FloatVector
import jdk.incubator.vector.VectorOperators
import jdk.incubator.vector.VectorSpecies
import sk.ainet.backend.api.kernel.Q5_1MatmulKernel

/**
 * SIMD-vectorized FP32 × Q5_1 matmul on the JDK Vector API. Per 32-element block:
 * decode `d`/`m`/`qh`, dequant the 32 codes (`d*(code + (highBit shl 4)) + m`, split
 * nibble layout) into a reusable scratch buffer, then SIMD-FMA against the matching
 * input window. Numerically equivalent to [ScalarQ5_1MatmulKernel] within FMA +
 * reordered-reduction tolerance. Block-major weight layout `(blockIdx*outputDim+o)*24`.
 */
public object PanamaVectorQ5_1MatmulKernel : Q5_1MatmulKernel {

    private const val BLOCK_SIZE = 32
    private const val BYTES_PER_BLOCK = 24
    private val floatSpecies: VectorSpecies<Float> = FloatVector.SPECIES_PREFERRED

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        require(inputDim % BLOCK_SIZE == 0) {
            "PanamaVectorQ5_1MatmulKernel: inputDim must be a multiple of $BLOCK_SIZE; got $inputDim"
        }
        if (outputDim == 0) return
        if (inputDim == 0) { for (o in 0 until outputDim) output[outputOffset + o] = 0f; return }
        val blocksPerInputDim = inputDim / BLOCK_SIZE
        val step = floatSpecies.length()
        val loopBound = floatSpecies.loopBound(BLOCK_SIZE)
        val codeBuf = FloatArray(BLOCK_SIZE) // reusable scratch for one dequantized block

        for (o in 0 until outputDim) {
            var acc = 0f
            for (blockIdx in 0 until blocksPerInputDim) {
                val base = weightByteOffset + (blockIdx * outputDim + o) * BYTES_PER_BLOCK
                // Block header: little-endian FP16 scale `d` and min `m`, then the 4 `qh` bytes.
                val d = halfToFloat(((weight[base + 1].toInt() and 0xFF) shl 8) or (weight[base].toInt() and 0xFF))
                val m = halfToFloat(((weight[base + 3].toInt() and 0xFF) shl 8) or (weight[base + 2].toInt() and 0xFF))
                // qh0..qh3 form a little-endian 32-bit mask: bit j is the high bit of
                // element j, bit 16 + j is the high bit of element 16 + j.
                val qh0 = weight[base + 4].toInt() and 0xFF
                val qh1 = weight[base + 5].toInt() and 0xFF
                val qh2 = weight[base + 6].toInt() and 0xFF
                val qh3 = weight[base + 7].toInt() and 0xFF
                val qsBase = base + 8
                for (j in 0 until 16) {
                    val q = weight[qsBase + j].toInt() and 0xFF
                    val bitLo = ((if (j < 8) qh0 else qh1) ushr (j and 7)) and 1
                    val bitHi = ((if (j < 8) qh2 else qh3) ushr (j and 7)) and 1
                    codeBuf[j] = d * ((q and 0x0F) + (bitLo shl 4)) + m
                    codeBuf[16 + j] = d * ((q ushr 4) + (bitHi shl 4)) + m
                }
                // SIMD-FMA the dequantized block against the matching input window.
                val inputBase = inputOffset + blockIdx * BLOCK_SIZE
                var accVec = FloatVector.zero(floatSpecies)
                var k = 0
                while (k < loopBound) {
                    accVec = FloatVector.fromArray(floatSpecies, input, inputBase + k)
                        .fma(FloatVector.fromArray(floatSpecies, codeBuf, k), accVec)
                    k += step
                }
                acc += accVec.reduceLanes(VectorOperators.ADD)
                while (k < BLOCK_SIZE) { acc += input[inputBase + k] * codeBuf[k]; k++ }
            }
            output[outputOffset + o] = acc
        }
    }

    /** Same FP16 → FP32 conversion as [ScalarQ5_1MatmulKernel]. */
    private fun halfToFloat(hbits: Int): Float {
        val sign = (hbits and 0x8000) shl 16
        val exp = (hbits and 0x7C00) shr 10
        val mant = hbits and 0x03FF
        return when (exp) {
            0 -> if (mant == 0) Float.fromBits(sign) else {
                var m = mant; var e = -14
                while ((m and 0x400) == 0) { m = m shl 1; e-- }
                Float.fromBits(sign or ((e + 127) shl 23) or ((m and 0x3FF) shl 13))
            }
            31 -> Float.fromBits(sign or (0xFF shl 23) or (mant shl 13))
            else -> Float.fromBits(sign or ((exp - 15 + 127) shl 23) or (mant shl 13))
        }
    }
}
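
The FP16 → FP32 conversion used by both kernels can be checked in isolation. The sketch below is a standalone top-level copy of the private `halfToFloat` helper shown above, fed a few well-known IEEE 754 binary16 bit patterns; the sample values and the `main` function are additions for illustration only.

```kotlin
/** Standalone copy of the kernels' FP16 (binary16) to FP32 conversion. */
fun halfToFloat(hbits: Int): Float {
    val sign = (hbits and 0x8000) shl 16
    val exp = (hbits and 0x7C00) shr 10
    val mant = hbits and 0x03FF
    return when (exp) {
        // Zero or subnormal: renormalize the mantissa until the implicit bit appears.
        0 -> if (mant == 0) Float.fromBits(sign) else {
            var m = mant; var e = -14
            while ((m and 0x400) == 0) { m = m shl 1; e-- }
            Float.fromBits(sign or ((e + 127) shl 23) or ((m and 0x3FF) shl 13))
        }
        // Infinity or NaN: all-ones exponent carries over.
        31 -> Float.fromBits(sign or (0xFF shl 23) or (mant shl 13))
        // Normal number: rebias the exponent from 15 to 127.
        else -> Float.fromBits(sign or ((exp - 15 + 127) shl 23) or (mant shl 13))
    }
}

fun main() {
    println(halfToFloat(0x3C00)) // 1.0
    println(halfToFloat(0xC000)) // -2.0
    println(halfToFloat(0x7C00)) // Infinity
    println(halfToFloat(0x0001) == 1f / (1 shl 24)) // true: smallest subnormal is 2^-24

    // The kernels read the scale as two little-endian bytes: low byte first.
    val lo = 0x00.toByte()
    val hi = 0x3C.toByte()
    val bits = ((hi.toInt() and 0xFF) shl 8) or (lo.toInt() and 0xFF)
    println(halfToFloat(bits)) // 1.0
}
```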
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
package sk.ainet.exec.kernel

import kotlin.math.abs
import kotlin.random.Random
import kotlin.test.Test
import kotlin.test.assertTrue

/** Panama SIMD Q5_1/Q5_0 kernels must match the scalar reference within FMA tolerance. */
class PanamaVectorQ5ParityTest {

    private fun half(v: Float): Int {
        val b = v.toRawBits(); val s = (b ushr 16) and 0x8000
        val e = ((b ushr 23) and 0xFF) - 127 + 15; val m = b and 0x7FFFFF
        if (e <= 0) return s; if (e >= 31) return s or 0x7C00
        return s or (e shl 10) or (m ushr 13)
    }

    /** Block-major packed bytes with VALID (finite) f16 scales; random qh/qs codes. */
    private fun bytes(bpb: Int, inDim: Int, outDim: Int, rng: Random): ByteArray {
        val out = ByteArray(outDim * (inDim / 32) * bpb)
        var off = 0
        while (off < out.size) {
            val d = half(rng.nextFloat() * 0.05f + 0.01f)
            out[off] = (d and 0xFF).toByte(); out[off + 1] = ((d ushr 8) and 0xFF).toByte()
            var codeStart = off + 2
            if (bpb == 24) { // Q5_1 has a per-block min `m`
                val m = half(rng.nextFloat() - 0.5f)
                out[off + 2] = (m and 0xFF).toByte(); out[off + 3] = ((m ushr 8) and 0xFF).toByte()
                codeStart = off + 4
            }
            for (k in codeStart until off + bpb) out[k] = rng.nextInt(256).toByte()
            off += bpb
        }
        return out
    }

    private fun check(q5_1: Boolean, inDim: Int, outDim: Int, seed: Int) {
        val rng = Random(seed)
        val w = bytes(if (q5_1) 24 else 22, inDim, outDim, rng)
        val input = FloatArray(inDim) { rng.nextFloat() - 0.5f }
        val a = FloatArray(outDim); val b = FloatArray(outDim)
        if (q5_1) {
            ScalarQ5_1MatmulKernel.matmul(input, 0, w, 0, inDim, outDim, a, 0)
            PanamaVectorQ5_1MatmulKernel.matmul(input, 0, w, 0, inDim, outDim, b, 0)
        } else {
            ScalarQ5_0MatmulKernel.matmul(input, 0, w, 0, inDim, outDim, a, 0)
            PanamaVectorQ5_0MatmulKernel.matmul(input, 0, w, 0, inDim, outDim, b, 0)
        }
        var maxErr = 0f; var maxAbs = 1f
        for (o in 0 until outDim) { maxErr = maxOf(maxErr, abs(a[o] - b[o])); maxAbs = maxOf(maxAbs, abs(a[o])) }
        assertTrue(maxErr < 1e-4f * maxAbs + 1e-4f, "${if (q5_1) "Q5_1" else "Q5_0"} Panama≠Scalar: maxErr=$maxErr (maxAbs=$maxAbs)")
    }

    @Test fun q5_1_panama_matches_scalar() = check(true, 256, 64, 1)
    @Test fun q5_0_panama_matches_scalar() = check(false, 256, 48, 2)
}
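
The parity test compares with a tolerance rather than exact equality because a SIMD kernel sums the same products in a different order than a scalar loop. The sketch below illustrates only that reordering effect on a single 32-element block; it does not model FMA rounding, and `sequentialDot`, `lanewiseDot`, and `withinTolerance` are hypothetical helpers, not functions from the repository. The lane count of 8 is an arbitrary assumption.

```kotlin
import kotlin.math.abs
import kotlin.random.Random

/** Left-to-right accumulation, as a plain scalar loop would do. */
fun sequentialDot(a: FloatArray, b: FloatArray): Float {
    var acc = 0f
    for (i in a.indices) acc += a[i] * b[i]
    return acc
}

/** Accumulates into [lanes] partial sums and adds them at the end, like a SIMD reduction. */
fun lanewiseDot(a: FloatArray, b: FloatArray, lanes: Int): Float {
    val partial = FloatArray(lanes)
    for (i in a.indices) partial[i % lanes] += a[i] * b[i]
    return partial.sum()
}

/** Mixed relative + absolute tolerance, in the same shape as the test's assertion. */
fun withinTolerance(ref: Float, other: Float, rel: Float = 1e-4f, absTol: Float = 1e-4f): Boolean =
    abs(ref - other) < rel * maxOf(1f, abs(ref)) + absTol

fun main() {
    val rng = Random(1)
    val a = FloatArray(32) { rng.nextFloat() - 0.5f }
    val b = FloatArray(32) { rng.nextFloat() - 0.5f }
    val s = sequentialDot(a, b)
    val v = lanewiseDot(a, b, lanes = 8)
    // The two sums may differ in the last bits, yet stay inside the tolerance.
    println("sequential=$s lanewise=$v ok=${withinTolerance(s, v)}")
}
```

Clamping the reference magnitude to at least 1 keeps the relative term from collapsing when outputs are near zero, which is the same role `maxAbs = 1f` plays as the starting value in the test.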
