Perf/q6k neon kernel#787
Merged
…eads

Apply the same cache-locality fix as q4k_matmul (d998feb) to the Q5_K and Q8_0 kernels: iterate block-OUTER / output-row-INNER so the block-major weight (blockIdx*output_dim + o)*bytes is read sequentially (stride = one block) instead of striding output_dim*bytes per step. The strided pattern makes every weight read a cold miss on the in-order A55. out_base[o] accumulates across blocks; accumulation order is unchanged, so results are numerically identical.

Both validated on host against the Panama reference (NativeQ5KMatmulKernelParityTest, NativeQ8_0MatmulKernelParityTest green). Not exercised by TinyLlama Q4_K_M (Q4_K + Q6_K + F32 only), so no board delta for that model; this keeps the K-quant kernels consistent and benefits any model that uses Q5_K/Q8_0 weights.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
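The loop reorder described in this commit can be sketched in plain C. This is a minimal illustration, not the kernel from the PR: the 4-byte `TOY_BLOCK` format and the names `weight_offset`, `matmul_row_outer`, and `matmul_block_outer` are hypothetical, and real K-quant blocks carry scales and packed bits. Only the indexing `(blockIdx*output_dim + o)*bytes` and the two iteration orders are taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy block: 4 int8 codes, so block bytes == block elements. */
enum { TOY_BLOCK = 4 };

/* Byte offset of block b for output row o in the block-major layout. */
static size_t weight_offset(size_t b, size_t o, size_t output_dim) {
    return (b * output_dim + o) * TOY_BLOCK;
}

/* Dot product of one weight block with the matching slice of the input. */
static float block_dot(const int8_t *w, const float *x) {
    float s = 0.0f;
    for (int i = 0; i < TOY_BLOCK; ++i) s += (float)w[i] * x[i];
    return s;
}

/* Row-outer order: consecutive weight reads are output_dim*TOY_BLOCK bytes apart. */
void matmul_row_outer(const int8_t *w, const float *x, float *out,
                      size_t n_blocks, size_t output_dim) {
    assert(w && x && out);
    for (size_t o = 0; o < output_dim; ++o) {
        float acc = 0.0f;
        for (size_t b = 0; b < n_blocks; ++b)
            acc += block_dot(w + weight_offset(b, o, output_dim), x + b * TOY_BLOCK);
        out[o] = acc;
    }
}

/* Block-outer order: consecutive weight reads are exactly one block apart. */
void matmul_block_outer(const int8_t *w, const float *x, float *out,
                        size_t n_blocks, size_t output_dim) {
    assert(w && x && out);
    for (size_t o = 0; o < output_dim; ++o) out[o] = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b)
        for (size_t o = 0; o < output_dim; ++o)
            out[o] += block_dot(w + weight_offset(b, o, output_dim), x + b * TOY_BLOCK);
}
```

In both versions each `out[o]` receives its per-block partial sums in ascending block order, which is why the reorder changes the memory access pattern without changing the result.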
Apply the same cache-locality reorder as q4k/q5k/q8_0 to the Q6_K kernel: iterate block-OUTER / output-row-INNER so the block-major weight (blockIdx*output_dim + o)*210 is read sequentially. out_base[o] accumulates across blocks; numerically identical (NativeQ6KMatmulKernel parity green).

NOTE: unlike Q4_K (memory-stall-bound → reorder gave 2.07×), Q6_K showed NO board speedup (matmul 20133 → 20168 ms, within noise). Q6_K materializes a full 256-float scratch via scalar 6-bit unpack (skainet_q6k_dequant_block) before the dot, so it is dequant-COMPUTE-bound, not weight-read-bound; sequential reads don't help. The reorder is kept for consistency and because it cannot hurt; the real Q6_K lever is vectorizing/fusing the 6-bit dequant (NEON unpack or Q8 int-dot), a separate rewrite.

Q6_K is ~13% of tensors (10 ffn_down [5632,2048], 10 attn_v, output [2048,32000]).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
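To make the cost of the scalar 6-bit unpack concrete, here is a hedged sketch of dequantizing one 256-element block. It assumes the ggml-style Q6_K layout (128 bytes of low nibbles, 64 bytes of high 2-bit pairs, 16 int8 group scales, one fp16 super-block scale), which matches the 210-byte block size quoted above. The names `toy_q6k_block`, `toy_q6k_unpack_codes`, and `toy_q6k_dequant` are hypothetical; the body of `skainet_q6k_dequant_block` is not shown in this PR and may differ. The fp16 scale is passed in as an already-converted float to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>

enum { TOY_QK_K = 256 };

/* Assumed ggml-style Q6_K block: 128 + 64 + 16 + 2 = 210 bytes. */
typedef struct {
    uint8_t  ql[TOY_QK_K / 2];      /* low 4 bits of each 6-bit code  */
    uint8_t  qh[TOY_QK_K / 4];      /* high 2 bits, four codes a byte */
    int8_t   scales[TOY_QK_K / 16]; /* one scale per 16-element group */
    uint16_t d;                     /* fp16 super-block scale (raw bits) */
} toy_q6k_block;

/* Unpack the 6-bit codes and center them to [-32, 31]. */
void toy_q6k_unpack_codes(const toy_q6k_block *blk, int8_t *codes) {
    const uint8_t *ql = blk->ql;
    const uint8_t *qh = blk->qh;
    assert(blk && codes);
    for (int half = 0; half < 2; ++half) {   /* two halves of 128 codes */
        for (int l = 0; l < 32; ++l) {
            codes[l +  0] = (int8_t)(((ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32);
            codes[l + 32] = (int8_t)(((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32);
            codes[l + 64] = (int8_t)(((ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32);
            codes[l + 96] = (int8_t)(((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32);
        }
        codes += 128;
        ql += 64;
        qh += 32;
    }
}

/* Scalar dequant into a 256-float scratch: out[i] = d * scales[i/16] * codes[i]. */
void toy_q6k_dequant(const toy_q6k_block *blk, float d, float *out) {
    int8_t codes[TOY_QK_K];
    toy_q6k_unpack_codes(blk, codes);
    for (int i = 0; i < TOY_QK_K; ++i)
        out[i] = d * (float)blk->scales[i / 16] * (float)codes[i];
}
```

Every output element costs several shifts, masks, an integer-to-float conversion, and two multiplies before the dot product even starts, which is consistent with the kernel being compute-bound rather than limited by weight reads.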
Mirror the q4k fused-int8 kernel: pre-quantize the input row to symmetric
int8 (Q8) once per 256-block (reused across all output rows), unpack the
6-bit weight to centered int8 codes, and run each scale-group as an int8
dot (vdotq_s32 on dotprod targets, scalar fallback otherwise). Drops the
256-float scratch dequant + per-element float multiply.
acc = d · d_in · Σ_g sc[g]·Σ_{i∈g} q8[i]·codes[i].
This is deliberately lossy (ggml-style activation quant, ~1-3% on
worst-case uniform-random fixtures) so it is no longer bit-exact vs the
float/scalar reference. Both parity tests (jvmTest Panama, nativeTest
cinterop on linuxX64 + linuxArm64) switch from per-row relative error —
unbounded on near-zero rows of zero-mean fixtures — to the aggregate
error-energy gate RMS(error)/RMS(signal) < 0.03.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
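The fused path and the new parity gate from this commit can be sketched as follows. This is a scalar illustration under stated assumptions, not the PR's kernel: the names `toy_quantize_q8`, `toy_q6k_fused_dot`, and `toy_rms_gate` are hypothetical, the 16-element scale groups follow the ggml Q6_K convention, and the inner integer loop stands in for what a dotprod target would do with `vdotq_s32`. The gate compares squared energies, which is equivalent to RMS(error)/RMS(signal) < tol whenever the signal energy is nonzero.

```c
#include <assert.h>
#include <stdint.h>

enum { TOY_BLOCK_LEN = 256, TOY_GROUP = 16 };

/* Symmetric int8 quantization of one 256-float input block; returns d_in. */
float toy_quantize_q8(const float *x, int8_t *q8) {
    float amax = 0.0f;
    assert(x && q8);
    for (int i = 0; i < TOY_BLOCK_LEN; ++i) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    if (amax == 0.0f) {
        for (int i = 0; i < TOY_BLOCK_LEN; ++i) q8[i] = 0;
        return 0.0f;
    }
    float d_in = amax / 127.0f;
    float inv = 1.0f / d_in;
    for (int i = 0; i < TOY_BLOCK_LEN; ++i) {
        float v = x[i] * inv;
        int r = (int)(v < 0.0f ? v - 0.5f : v + 0.5f); /* round half away from zero */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q8[i] = (int8_t)r;
    }
    return d_in;
}

/* acc = d * d_in * sum_g sc[g] * sum_{i in g} q8[i] * codes[i] */
float toy_q6k_fused_dot(float d, const int8_t *sc, const int8_t *codes,
                        float d_in, const int8_t *q8) {
    int32_t total = 0;
    for (int g = 0; g < TOY_BLOCK_LEN / TOY_GROUP; ++g) {
        int32_t group = 0; /* int8 x int8 dot, the part vdotq_s32 would cover */
        for (int i = 0; i < TOY_GROUP; ++i)
            group += (int32_t)q8[g * TOY_GROUP + i] * (int32_t)codes[g * TOY_GROUP + i];
        total += (int32_t)sc[g] * group;
    }
    return d * d_in * (float)total;
}

/* Aggregate gate: passes when sum(err^2) < tol^2 * sum(ref^2).
   Returns 0 when the reference has zero energy. */
int toy_rms_gate(const float *got, const float *ref, int n, float tol) {
    double err = 0.0, sig = 0.0;
    for (int i = 0; i < n; ++i) {
        double e = (double)got[i] - (double)ref[i];
        err += e * e;
        sig += (double)ref[i] * (double)ref[i];
    }
    return err < (double)tol * (double)tol * sig;
}
```

Because the gate sums error and signal energy over all rows, a single near-zero reference row no longer dominates the result the way a per-row relative error would.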
michalharakal added a commit that referenced this pull request Jul 5, 2026
…nt 2.07×

Bumps VERSION_NAME 0.33.0 -> 0.34.0. Bundles the develop changes since 0.33.0: the new skainet-data-source module (URI-backed sources, HF auth, raw format parsers, suspend data pipeline DSL) + dataset operation views and richer batches (#784/#785), the bf16-native DSL -> StableHLO export path and the pluggable per-phase/per-target compile-optimization seam (#788/#791), NEON K-quant matmul perf (block-outer order + fused Q8 int8 dot, 2.07x Q4_K on Cortex-A55) with aarch64 board verification (#786/#787), LayerNorm f32 normalization + rank-0 tensor-type emission fixes, macOS host build fix (#789), Code of Conduct (#790), and the offline markup-antora docs image (#781).

Minor bump (not patch): new published module skainet-data-source; all data-api additions are default-bearing (no source-incompatible changes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
No description provided.