perf(q6_k): SIMD-fy dequantQ6_KBlock via ByteVector ql + qh extraction #564
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Replaces the scalar 32-iteration inner loop in dequantQ6_KBlock, the hot path under matmulQ6_KVec, with a fused ByteVector pipeline. Per floatStep-wide chunk of l: load slices of ql[qlBase+l], ql[qlBase+l+32], qh[qhBase+l]; assemble q1..q4 = ((qlNibble) | ((qhSlice) << 4)) − 32 via byte AND/LSHR/OR ops; widen to FloatVector; multiply by per-sub-block d·scale; store to four 32-element regions of the scratch FloatArray.

Inline replacement; doesn't change matmulQ6_KVec's outer structure (scratch FloatArray + SIMD dot product remain).

Future: full fused matmul (no scratch) is a fair follow-up but more involved because sub-block scales differ across l ∈ 0..15 vs 16..31, requiring 16 parallel accumulators per output cell per block.

Q6_K is the dominant format for Gemma 4 E2B Q4_K_M's embedding, lm_head, and FFN matrices, so the dequant cost is non-trivial in real LLM decode.

Tests: cpu jvmTest 218/218 pass, including Q6KMatmulTest's parity vs the canonical ggml dequant reference (DequantOps.dequantQ6KFromBytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
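For readers who want the bit layout spelled out, here is a minimal scalar sketch in Kotlin of the per-element arithmetic that the ByteVector pipeline performs lane-wise. It is not the code from this PR: unpackQ6 and dequantHalf are hypothetical names, and the scale indexing (sub-block l / 16, stride 2 between the four output regions) assumes the ggml Q6_K reference layout for one 128-element half of a block.

```kotlin
// Hypothetical helper: unpack the four 6-bit quants that share index l.
// ql0 = ql[qlBase + l], ql32 = ql[qlBase + l + 32], qh = qh[qhBase + l].
fun unpackQ6(ql0: Byte, ql32: Byte, qh: Byte): IntArray {
    // Mask to unsigned so that shifts do not drag sign bits in.
    val a = ql0.toInt() and 0xFF
    val b = ql32.toInt() and 0xFF
    val h = qh.toInt() and 0xFF
    // Low 4 bits come from a ql nibble, high 2 bits from a 2-bit field of qh.
    // The OR is fully assembled before the bias of 32 is subtracted.
    val q1 = ((a and 0x0F) or ((h and 0x03) shl 4)) - 32
    val q2 = ((b and 0x0F) or (((h ushr 2) and 0x03) shl 4)) - 32
    val q3 = ((a ushr 4) or (((h ushr 4) and 0x03) shl 4)) - 32
    val q4 = ((b ushr 4) or ((h ushr 6) shl 4)) - 32
    return intArrayOf(q1, q2, q3, q4)
}

// Hypothetical scalar reference for one 128-element half of a Q6_K block.
// Assumption: scales are signed bytes laid out as in the ggml reference,
// so output region k (0..3) of sub-block `sub` uses scales[scBase + sub + 2 * k].
fun dequantHalf(
    ql: ByteArray, qlBase: Int,
    qh: ByteArray, qhBase: Int,
    scales: ByteArray, scBase: Int,
    d: Float,
    out: FloatArray, outBase: Int,
) {
    for (l in 0 until 32) {
        val sub = l / 16 // scales differ for l in 0..15 vs 16..31
        val q = unpackQ6(ql[qlBase + l], ql[qlBase + l + 32], qh[qhBase + l])
        for (k in 0 until 4) {
            // Four stores, one per 32-element region of the scratch array.
            out[outBase + l + 32 * k] = d * scales[scBase + sub + 2 * k] * q[k]
        }
    }
}
```

As a worked example, ql0 = 0xA7, ql32 = 0x3C and qh = 0xE4 unpack to (-25, -4, 10, 19). A vectorized version applies the same AND/shift/OR steps to a whole chunk of l at once.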
Summary
Replaces the scalar 32-iteration inner loop in dequantQ6_KBlock, the dequant step under matmulQ6_KVec, with a fused ByteVector pipeline. Per floatStep-wide chunk of l:
- Load slices of ql[qlBase + l], ql[qlBase + l + 32], qh[qhBase + l].
- Assemble the four 6-bit values with byte AND/LSHR/OR ops:
  q1 = ((ql0 & 0x0F) | ((qh & 0x03) << 4)) − 32
  q2 = ((ql32 & 0x0F) | (((qh >>> 2) & 0x03) << 4)) − 32
  q3 = ((ql0 >>> 4) | (((qh >>> 4) & 0x03) << 4)) − 32
  q4 = ((ql32 >>> 4) | ((qh >>> 6) << 4)) − 32
- Widen to FloatVector, multiply by per-sub-block d·scale, store to four 32-element regions of the scratch FloatArray.

Inline replacement: matmulQ6_KVec's outer structure (scratch + SIMD dot product) is unchanged.
Why this matters
Q6_K is the dominant format for the embedding, lm_head, and FFN matrices of Gemma 4 E2B Q4_K_M. The scalar dequant was the hot path inside the per-cell, per-block loop under matmulQ6_KVec. Vectorizing it brings Q6_K up to the SIMD pipeline standard set by Q4_K (#562/#563) and Q8_0 (already SIMD).
Scope
- JvmQuantizedVectorKernels.dequantQ6_KBlock.
- No new SPI surface (a full Q6KMatmulKernel SPI is a fair follow-up if a native FFM provider needs to register here).
- matmulQ6_KVec outer loop / parallelism / scratch-allocation strategy unchanged.
Out of scope (next M5 follow-ups)
- Full fused matmul (no scratch): more involved because sub-block scales differ across l ∈ 0..15 vs 16..31, requiring 16 parallel accumulators per output cell per block.
- Q6KMatmulKernel sibling SPI (similar shape to Q4KMatmulKernel from feat(kernel): SIMD-fused Q4_K matmul kernel + Q4KMatmulKernel SPI #562).
- MemorySegment. Closes M5.
Test plan
- ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest passes 218/218 tests, including Q6KMatmulTest's parity vs the canonical ggml DequantOps.dequantQ6KFromBytes reference (covers single output row + multi-row × multi-block within 1e-4 relative tolerance).

🤖 Generated with Claude Code
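To make the "1e-4 relative tolerance" in the test plan concrete, a parity check of that kind could look roughly like the Kotlin sketch below. This is an illustration only, not the assertion used in Q6KMatmulTest: maxRelativeError and its floor parameter (which guards against division by values near zero) are hypothetical.

```kotlin
import kotlin.math.abs
import kotlin.math.max

// Hypothetical helper: largest element-wise relative error between a
// reference result and the result under test.
fun maxRelativeError(
    expected: FloatArray,
    actual: FloatArray,
    floor: Float = 1e-6f, // assumed lower bound for the denominator
): Float {
    require(expected.size == actual.size) { "size mismatch" }
    var worst = 0f
    for (i in expected.indices) {
        val denom = max(abs(expected[i]), floor)
        worst = max(worst, abs(expected[i] - actual[i]) / denom)
    }
    return worst
}
```

A test would then compare the SIMD output against the reference output and require maxRelativeError(reference, simd) <= 1e-4f.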