perf(q4_0): partial-vec dotQ4_0BlockMemSeg via scratch + SIMD FMA #565

Merged
michalharakal merged 1 commit into develop from feature/jvm-q4_0-simd-dot
Apr 28, 2026
Conversation

@michalharakal
Contributor

Summary

Replaces the fully-scalar inner loop in dotQ4_0BlockMemSeg with the two-stage pattern Q4_K used pre-#562:

  1. Scalar byte-pair unpack writes 32 sign-corrected floats into a caller-supplied scratch FloatArray (16 byte loads, two nibbles each — half the byte traffic of the prior scalar dot product, which loaded a byte per element).
  2. Vector FMA reduction dot-products the scratch against the matching input slice via FloatVector.fma + reduceLanes(ADD).
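A minimal Java sketch of the two-stage shape described above (hypothetical names, not the actual kernel code; stage 2 is written as a scalar loop standing in for the FloatVector.fma + reduceLanes(ADD) reduction, which requires the jdk.incubator.vector module and is omitted to keep the sketch self-contained):

```java
// Sketch of the two-stage Q4_0 block dot product (hypothetical names).
final class Q4_0DotSketch {
    static final int QK4_0 = 32;  // elements per Q4_0 block (16 packed bytes)

    // Stage 1: scalar byte-pair unpack into a caller-supplied scratch.
    // One byte load yields two elements: low nibble then high nibble,
    // each sign-corrected by -8 and scaled by the block scale d.
    static void unpackBlock(byte[] packed, int off, float d, float[] scratch) {
        for (int k = 0; k < QK4_0 / 2; k++) {
            int b = packed[off + k] & 0xFF;
            scratch[2 * k]     = ((b & 0x0F) - 8) * d;  // low nibble
            scratch[2 * k + 1] = ((b >>> 4)  - 8) * d;  // high nibble
        }
    }

    // Stage 2: dot the scratch against the matching input slice.
    // (Scalar stand-in; the PR uses FloatVector.fma + reduceLanes(ADD).)
    static float dotScratch(float[] scratch, float[] input, int inOff) {
        float acc = 0f;
        for (int i = 0; i < QK4_0; i++) acc += scratch[i] * input[inOff + i];
        return acc;
    }
}
```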

Per-block scratch is hoisted out of matmulF32Q4_0MemSeg's loops to amortize allocation. A backwards-compatible no-codeBuf overload preserves the prior signature for any external callers.
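The hoisting and the compatibility overload can be sketched as follows (hypothetical signatures, not the actual JvmQuantizedVectorKernels API):

```java
// Sketch: scratch hoisted out of the per-block loop, plus the
// backwards-compatible overload shape (hypothetical API).
final class ScratchHoistSketch {
    interface BlockDot { float apply(int blockIdx, float[] scratch); }

    // Per-row loop: one scratch allocated per call, reused for every block.
    static float dotRow(int nBlocks, BlockDot dot) {
        float[] scratch = new float[32];  // hoisted allocation
        float acc = 0f;
        for (int b = 0; b < nBlocks; b++) acc += dot.apply(b, scratch);
        return acc;
    }

    // Compatibility overload shape: prior signature preserved,
    // a fresh scratch is allocated on each invocation.
    static float dotBlockCompat(int blockIdx, BlockDot dot) {
        return dot.apply(blockIdx, new float[32]);
    }
}
```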

Why partial, not fully fused?

Q4_0's nibble layout has adjacent elements sharing a byte (code[2k] = byte[k] & 0x0F, code[2k+1] = byte[k] >>> 4), unlike Q4_K's strided layout, where lo/hi nibbles map to separate sub-blocks. Getting Q4_0 codes in natural element order from a ByteVector would need a lane-interleave shuffle or strided gather: significantly more code, and NEON has no native gather instruction, so gathers fall back to scalar there.
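The layout contrast can be made concrete with two hypothetical index helpers (the real Q4_K super-block layout is more involved; this only illustrates interleaved vs. strided):

```java
// Hypothetical index helpers contrasting the two nibble layouts.
final class NibbleLayoutSketch {
    // Q4_0 (interleaved): elements 2k and 2k+1 share packed byte k.
    static int q40ByteIndex(int elem)        { return elem / 2; }
    static boolean q40IsHighNibble(int elem) { return elem % 2 == 1; }

    // Q4_K-style (strided): low nibbles fill one 32-element sub-block,
    // high nibbles the next, so each nibble plane is byte-contiguous.
    static int stridedByteIndex(int elem)        { return elem % 32; }
    static boolean stridedIsHighNibble(int elem) { return elem >= 32; }
}
```

With the strided layout, a 32-byte ByteVector load yields 32 low nibbles already in lane order; the interleaved layout yields alternating low/high elements from each byte, which is why natural order needs a shuffle or gather.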

The two-stage pattern is the path Q4_K took before #562 promoted it to a fully-fused pipeline. If Q4_0 ever becomes a hot path again (it's rarely used — Q4_K_M / Q4_K_S cover Gemma 4, Llama, Qwen), the same upgrade can follow.

Why now

Q4_0 was the last fully-scalar kernel in JvmQuantizedVectorKernels. Closing it brings the JVM Vector path to parity across Q4_0, Q4_K, Q6_K, Q8_0 — every quantized format we support is now SIMD'd to some degree.

Test plan

  • ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 pass; QuantizedMemSegMatmulTest exercises the Q4_0 MemSeg matmul end-to-end through ctx.ops.matmul.

🤖 Generated with Claude Code

Replaces the fully-scalar inner loop in dotQ4_0BlockMemSeg with the
two-stage pattern Q4_K used pre-#562: a scalar byte-pair unpack
writes 32 sign-corrected floats into a caller-supplied scratch
FloatArray (16 byte loads, two nibbles each — half the byte traffic
of the prior scalar dot product), then a FloatVector FMA reduction
takes the dot product against the matching input slice.

Per-block scratch is hoisted out of matmulF32Q4_0MemSeg's loops to
amortize allocation. A backwards-compatible no-codeBuf overload
preserves the prior signature for any external callers (creates a
fresh scratch on each invocation).

Q4_0 isn't dominant in modern quantized weights (Q4_K_M / Q4_K_S
cover Gemma 4, Llama, Qwen) but it was the last fully-scalar kernel
in JvmQuantizedVectorKernels — closing it brings the JVM Vector path
to parity across Q4_0, Q4_K, Q6_K, Q8_0.

A fully-fused ByteVector pipeline (a la PanamaVectorQ4KMatmulKernel)
is awkward for Q4_0's interleaved nibble layout — adjacent elements
share a byte, so getting codes in natural order needs a strided
gather or lane-interleave shuffle. Out of scope; if Q4_0 becomes a
hot path again, that's the next move.

Tests: cpu jvmTest 218/218 pass, including QuantizedMemSegMatmulTest's
parity coverage of the Q4_0 MemSeg matmul.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal marked this pull request as ready for review April 28, 2026 21:04
@michalharakal michalharakal merged commit d48f172 into develop Apr 28, 2026
6 checks passed
@michalharakal michalharakal mentioned this pull request Apr 28, 2026
3 tasks
@michalharakal michalharakal deleted the feature/jvm-q4_0-simd-dot branch April 29, 2026 05:54