Perf/q6k neon kernel #787

Merged
michalharakal merged 3 commits into develop from perf/q6k-neon-kernel
Jul 2, 2026
@michalharakal

No description provided.

michalharakal and others added 3 commits July 2, 2026 22:19
…eads

Apply the same cache-locality fix as q4k_matmul (d998feb) to the Q5_K
and Q8_0 kernels: iterate block-OUTER / output-row-INNER so the
block-major weight (blockIdx*output_dim + o)*bytes is read sequentially
(stride = one block) instead of striding output_dim*bytes per step — the
strided pattern makes every weight read a cold miss on the in-order A55.
out_base[o] accumulates across blocks; accumulation order is unchanged so
results are numerically identical.
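
The reorder described above can be illustrated with a minimal Python sketch. It is not the kernel's code: the names `matmul_row_outer`, `matmul_block_outer`, and `dot` are hypothetical, and each block is modelled as an already-dequantized list of floats. It only shows that with the block-major index `blockIdx*output_dim + o` the block-outer loop touches the weights in sequential order, while each `out[o]` still accumulates blocks in the same order.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matmul_row_outer(weights, x_blocks, output_dim):
    """Output-row-OUTER / block-INNER: successive reads are output_dim blocks apart."""
    out = [0.0] * output_dim
    visited = []                      # block indices in the order they are read
    for o in range(output_dim):
        for b in range(len(x_blocks)):
            idx = b * output_dim + o  # block-major weight index
            visited.append(idx)
            out[o] += dot(weights[idx], x_blocks[b])
    return out, visited

def matmul_block_outer(weights, x_blocks, output_dim):
    """Block-OUTER / output-row-INNER: successive reads are adjacent blocks."""
    out = [0.0] * output_dim
    visited = []
    for b in range(len(x_blocks)):
        for o in range(output_dim):
            idx = b * output_dim + o
            visited.append(idx)
            out[o] += dot(weights[idx], x_blocks[b])
    return out, visited

# Small illustrative sizes (assumptions, not the real tensor shapes).
random.seed(0)
OUTPUT_DIM, NUM_BLOCKS, BLOCK = 4, 3, 8
weights = [[random.uniform(-1, 1) for _ in range(BLOCK)]
           for _ in range(NUM_BLOCKS * OUTPUT_DIM)]
x_blocks = [[random.uniform(-1, 1) for _ in range(BLOCK)]
            for _ in range(NUM_BLOCKS)]

out_old, order_old = matmul_row_outer(weights, x_blocks, OUTPUT_DIM)
out_new, order_new = matmul_block_outer(weights, x_blocks, OUTPUT_DIM)
```

For every output row the per-block partial sums are added in the order b = 0, 1, 2, ... in both variants, which is why the two results compare equal exactly rather than approximately.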

Both validated on host against the Panama reference
(NativeQ5KMatmulKernelParityTest, NativeQ8_0MatmulKernelParityTest green).
Not exercised by TinyLlama Q4_K_M (Q4_K + Q6_K + F32 only), so no board
delta for that model — this keeps the K-quant kernels consistent and
benefits any model that uses Q5_K/Q8_0 weights.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apply the same cache-locality reorder as q4k/q5k/q8_0 to the Q6_K kernel:
iterate block-OUTER / output-row-INNER so the block-major weight
(blockIdx*output_dim + o)*210 is read sequentially. out_base[o]
accumulates across blocks; numerically identical (NativeQ6KMatmulKernel
parity green).

NOTE: unlike Q4_K (memory-stall-bound → reorder gave 2.07×), Q6_K showed
NO board speedup (matmul 20133 → 20168 ms, within noise). Q6_K
materializes a full 256-float scratch via scalar 6-bit unpack
(skainet_q6k_dequant_block) before the dot, so it is dequant-COMPUTE-bound,
not weight-read-bound — sequential reads don't help. The reorder is kept
for consistency and because it cannot hurt; the real Q6_K lever is
vectorizing/fusing the 6-bit dequant (NEON unpack or Q8 int-dot), a
separate rewrite. Q6_K is ~13% of tensors (10 ffn_down [5632,2048], 10
attn_v, output [2048,32000]).
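
For readers unfamiliar with the format, here is a hedged Python sketch of the 210-byte block and the scalar 6-bit unpack that the note identifies as the bottleneck. It assumes the ggml reference Q6_K layout (128 bytes of low nibbles, 64 bytes of high 2-bit pairs, 16 int8 scales, one fp16 super-scale); the actual body of skainet_q6k_dequant_block is not shown in this PR, and the function names below are hypothetical.

```python
QK_K = 256                        # elements per super-block
BLOCK_BYTES = 128 + 64 + 16 + 2   # ql + qh + int8 scales + fp16 d

def q6k_unpack_codes(ql, qh):
    """Unpack 256 six-bit weights into centered integer codes in [-32, 31].

    ql: 128 bytes, two low nibbles per byte; qh: 64 bytes, four 2-bit highs per byte.
    Index arithmetic follows the ggml reference layout (an assumption here).
    """
    codes = [0] * QK_K
    for half in range(2):                      # two halves of 128 elements
        y0, l0, h0 = 128 * half, 64 * half, 32 * half
        for l in range(32):
            h = qh[h0 + l]
            lo_a, lo_b = ql[l0 + l], ql[l0 + l + 32]
            codes[y0 + l]      = ((lo_a & 0x0F) | (((h >> 0) & 3) << 4)) - 32
            codes[y0 + l + 32] = ((lo_b & 0x0F) | (((h >> 2) & 3) << 4)) - 32
            codes[y0 + l + 64] = ((lo_a >> 4)   | (((h >> 4) & 3) << 4)) - 32
            codes[y0 + l + 96] = ((lo_b >> 4)   | (((h >> 6) & 3) << 4)) - 32
    return codes

def q6k_dequant_block(d, scales, ql, qh):
    """Materialize the 256-float scratch: one scale per group of 16 elements."""
    codes = q6k_unpack_codes(ql, qh)
    return [d * scales[i // 16] * codes[i] for i in range(QK_K)]

# Extreme bit patterns as a sanity check on the code range.
codes_min = q6k_unpack_codes(bytes(128), bytes(64))
codes_max = q6k_unpack_codes(bytes([0xFF]) * 128, bytes([0xFF]) * 64)
scratch = q6k_dequant_block(0.5, [2] * 16, bytes(128), bytes(64))
```

Every element costs several shifts, masks, and a float multiply before the dot product even starts, which is consistent with the observation that sequential weight reads alone do not speed this kernel up.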

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror the q4k fused-int8 kernel: pre-quantize the input row to symmetric
int8 (Q8) once per 256-block (reused across all output rows), unpack the
6-bit weight to centered int8 codes, and run each scale-group as an int8
dot (vdotq_s32 on dotprod targets, scalar fallback otherwise). Drops the
256-float scratch dequant + per-element float multiply.

acc = d · d_in · Σ_g sc[g]·Σ_{i∈g} q8[i]·codes[i].
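
A minimal Python sketch of the formula above, under stated assumptions: 16 scale groups of 16 elements per 256-block, symmetric Q8 quantization with scale `d_in = max|x| / 127`, and plain integer sums standing in for vdotq_s32. The function names are hypothetical and this is not the kernel's code.

```python
import random

GROUP = 16  # assumed group size: 16 scale groups per 256-element block

def quantize_q8(x):
    """Symmetric int8 quantization of one input block: x[i] ~= d_in * q8[i]."""
    amax = max(abs(v) for v in x)
    d_in = amax / 127.0 if amax > 0 else 1.0
    q8 = [max(-127, min(127, round(v / d_in))) for v in x]
    return d_in, q8

def fused_block_dot(d, sc, codes, d_in, q8):
    """acc = d * d_in * sum_g sc[g] * sum_{i in g} q8[i] * codes[i]."""
    total = 0
    for g in range(len(sc)):
        lo = g * GROUP
        group_dot = sum(q8[i] * codes[i] for i in range(lo, lo + GROUP))  # integer dot
        total += sc[g] * group_dot
    return d * d_in * total

def float_block_dot(d, sc, codes, x):
    """Reference path: dequantize each weight to float, then multiply by the input."""
    return sum(d * sc[i // GROUP] * codes[i] * x[i] for i in range(len(x)))

random.seed(1)
d = 0.01
sc = [random.randint(-128, 127) for _ in range(16)]
codes = [random.randint(-32, 31) for _ in range(256)]
x = [random.uniform(-1.0, 1.0) for _ in range(256)]

d_in, q8 = quantize_q8(x)            # done once per block, reused for every output row
fused = fused_block_dot(d, sc, codes, d_in, q8)
ref = float_block_dot(d, sc, codes, x)

# Rounding each input to the nearest int8 step moves it by at most d_in / 2,
# so the block result can differ from the float path by at most this much.
bound = (d_in / 2) * sum(abs(d * sc[i // GROUP] * codes[i]) for i in range(256))
```

The gap between `fused` and `ref` comes only from rounding the activations, which is the lossy step the next paragraph refers to.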

This is deliberately lossy (ggml-style activation quant, ~1-3% on
worst-case uniform-random fixtures) so it is no longer bit-exact vs the
float/scalar reference. Both parity tests (jvmTest Panama, nativeTest
cinterop on linuxX64 + linuxArm64) switch from per-row relative error —
unbounded on near-zero rows of zero-mean fixtures — to the aggregate
error-energy gate RMS(error)/RMS(signal) < 0.03.
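
To see why the per-row check is unusable on near-zero rows while the aggregate gate stays meaningful, consider this small Python sketch. The numbers are made up for illustration and the helper names are hypothetical; only the 0.03 threshold comes from the commit message.

```python
import math

def per_row_relative_errors(ref, got):
    """|got - ref| / |ref| per row; blows up when a reference row is near zero."""
    return [abs(g - r) / abs(r) if r != 0 else math.inf
            for r, g in zip(ref, got)]

def error_energy_ratio(ref, got):
    """RMS(error) / RMS(signal) over all rows."""
    n = len(ref)
    rms_err = math.sqrt(sum((g - r) ** 2 for r, g in zip(ref, got)) / n)
    rms_sig = math.sqrt(sum(r * r for r in ref) / n)
    return rms_err / rms_sig

# Illustrative values: the third reference row is nearly zero.
ref = [10.0, -8.0, 0.001, 12.0]
got = [10.1, -8.1, 0.05, 11.9]

row_errors = per_row_relative_errors(ref, got)
ratio = error_energy_ratio(ref, got)
passes_gate = ratio < 0.03
```

Here the near-zero row reports a relative error of roughly 49 even though its absolute error is smaller than that of the other rows, while the aggregate ratio is about 0.01 and passes the gate.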

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 7c9f83d into develop Jul 2, 2026
7 checks passed
@michalharakal michalharakal deleted the perf/q6k-neon-kernel branch July 2, 2026 21:13
michalharakal added a commit that referenced this pull request Jul 5, 2026
…nt 2.07×

Bumps VERSION_NAME 0.33.0 -> 0.34.0. Bundles the develop changes since 0.33.0:
the new skainet-data-source module (URI-backed sources, HF auth, raw format
parsers, suspend data pipeline DSL) + dataset operation views and richer
batches (#784/#785), the bf16-native DSL -> StableHLO export path and the
pluggable per-phase/per-target compile-optimization seam (#788/#791), NEON
K-quant matmul perf (block-outer order + fused Q8 int8 dot, 2.07x Q4_K on
Cortex-A55) with aarch64 board verification (#786/#787), LayerNorm f32
normalization + rank-0 tensor-type emission fixes, macOS host build fix
(#789), Code of Conduct (#790), and the offline markup-antora docs image (#781).

Minor bump (not patch): new published module skainet-data-source; all data-api
additions are default-bearing (no source-incompatible changes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>