Perf/q6k neon kernel by michalharakal · Pull Request #787 · SKaiNET-developers/SKaiNET

michalharakal · 2026-07-02T21:11:38Z

No description provided.

…eads

Apply the same cache-locality fix as q4k_matmul (d998feb) to the Q5_K and Q8_0 kernels: iterate block-OUTER / output-row-INNER so the block-major weight (blockIdx*output_dim + o)*bytes is read sequentially (stride = one block) instead of striding output_dim*bytes per step. The strided pattern makes every weight read a cold miss on the in-order A55. out_base[o] accumulates across blocks; accumulation order is unchanged so results are numerically identical.

Both validated on host against the Panama reference (NativeQ5KMatmulKernelParityTest, NativeQ8_0MatmulKernelParityTest green). Not exercised by TinyLlama Q4_K_M (Q4_K + Q6_K + F32 only), so no board delta for that model; this keeps the K-quant kernels consistent and benefits any model that uses Q5_K/Q8_0 weights.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
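The loop reorder described in this commit can be sketched in C as below. This is a minimal illustration, not the SKaiNET kernel: it uses plain float weights and a toy block size of 4 in place of quantized 256-element blocks, and the names `BLOCK_ELEMS`, `block_dot`, `matmul_row_outer`, and `matmul_block_outer` are hypothetical. Only the block-major index `(b * output_dim + o) * block_size` is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_ELEMS 4 /* toy block size (assumption); the real blocks hold 256 weights */

/* Dot product of one weight block with the matching slice of the input row. */
static float block_dot(const float *w, const float *x) {
    float s = 0.0f;
    for (size_t i = 0; i < BLOCK_ELEMS; ++i) s += w[i] * x[i];
    return s;
}

/* Row-outer order: for a fixed output row o, consecutive weight reads are
 * output_dim blocks apart, so the inner loop strides through memory. */
void matmul_row_outer(const float *w, const float *x, float *out,
                      size_t n_blocks, size_t output_dim) {
    assert(n_blocks > 0 && output_dim > 0);
    for (size_t o = 0; o < output_dim; ++o) {
        float acc = 0.0f;
        for (size_t b = 0; b < n_blocks; ++b)
            acc += block_dot(w + (b * output_dim + o) * BLOCK_ELEMS,
                             x + b * BLOCK_ELEMS);
        out[o] = acc;
    }
}

/* Block-outer order: the inner loop advances one block at a time, so the
 * weight buffer is read front to back. Each out[o] still receives its
 * per-block contributions in increasing block order. */
void matmul_block_outer(const float *w, const float *x, float *out,
                        size_t n_blocks, size_t output_dim) {
    assert(n_blocks > 0 && output_dim > 0);
    for (size_t o = 0; o < output_dim; ++o) out[o] = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b) {
        const float *xb = x + b * BLOCK_ELEMS;
        for (size_t o = 0; o < output_dim; ++o)
            out[o] += block_dot(w + (b * output_dim + o) * BLOCK_ELEMS, xb);
    }
}
```

Both functions add the same per-block partial sums to each output in the same order, which is why the reorder can leave results unchanged while altering only the memory access pattern.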

Apply the same cache-locality reorder as q4k/q5k/q8_0 to the Q6_K kernel: iterate block-OUTER / output-row-INNER so the block-major weight (blockIdx*output_dim + o)*210 is read sequentially. out_base[o] accumulates across blocks; numerically identical (NativeQ6KMatmulKernel parity green).

NOTE: unlike Q4_K (memory-stall-bound → reorder gave 2.07×), Q6_K showed NO board speedup (matmul 20133 → 20168 ms, within noise). Q6_K materializes a full 256-float scratch via scalar 6-bit unpack (skainet_q6k_dequant_block) before the dot, so it is dequant-COMPUTE-bound, not weight-read-bound, and sequential reads don't help. The reorder is kept for consistency and because it cannot hurt; the real Q6_K lever is vectorizing/fusing the 6-bit dequant (NEON unpack or Q8 int-dot), a separate rewrite.

Q6_K is ~13% of tensors (10 ffn_down [5632,2048], 10 attn_v, output [2048,32000]).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
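To make the cost of the scalar 6-bit unpack concrete, here is a hedged sketch of what such an unpack can look like. It assumes the 210-byte block follows ggml's Q6_K layout (128 bytes of low nibbles, 64 bytes of packed high bit pairs, 16 int8 scales, one fp16 super-scale, which sums to 210 bytes); the PR does not state the layout, and `q6k_unpack_codes` is a hypothetical name, not `skainet_q6k_dequant_block`. The sketch stops at centered integer codes and leaves out the scale multiplication.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K 256 /* weights per block */

/* Unpack one block's 6-bit weights into centered int8 codes in [-32, 31].
 * ql: 128 bytes, each holding the low 4 bits of two weights.
 * qh: 64 bytes, each holding the high 2 bits of four weights.
 * Layout assumed to match ggml's Q6_K; not confirmed by the PR. */
void q6k_unpack_codes(const uint8_t *ql, const uint8_t *qh, int8_t *codes) {
    assert(ql && qh && codes);
    for (int half = 0; half < QK_K; half += 128) {
        for (int l = 0; l < 32; ++l) {
            /* Four weights share one qh byte; each takes a different bit pair. */
            codes[l +  0] = (int8_t)(((ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32);
            codes[l + 32] = (int8_t)(((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32);
            codes[l + 64] = (int8_t)(((ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32);
            codes[l + 96] = (int8_t)(((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32);
        }
        /* Advance to the second 128-weight half of the block. */
        codes += 128;
        ql += 64;
        qh += 32;
    }
}
```

Every weight needs a mask, a shift, an OR, and a subtract before any multiply can happen, which is consistent with the commit's observation that the kernel is bound by dequant compute rather than by weight reads.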

Mirror the q4k fused-int8 kernel: pre-quantize the input row to symmetric int8 (Q8) once per 256-block (reused across all output rows), unpack the 6-bit weight to centered int8 codes, and run each scale-group as an int8 dot (vdotq_s32 on dotprod targets, scalar fallback otherwise). Drops the 256-float scratch dequant + per-element float multiply.

acc = d · d_in · Σ_g sc[g]·Σ_{i∈g} q8[i]·codes[i]

This is deliberately lossy (ggml-style activation quant, ~1-3% on worst-case uniform-random fixtures) so it is no longer bit-exact vs the float/scalar reference. Both parity tests (jvmTest Panama, nativeTest cinterop on linuxX64 + linuxArm64) switch from per-row relative error (unbounded on near-zero rows of zero-mean fixtures) to the aggregate error-energy gate RMS(error)/RMS(signal) < 0.03.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
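The fused formula and the new parity gate can be sketched in scalar C as follows. This is an illustration under assumptions, not the kernel from the PR: the group size of 16 (one scale per 16 weights) is assumed, the function names are hypothetical, and the inner loop is a plain scalar stand-in for the `vdotq_s32` path. The gate compares squared energies, which is equivalent to RMS(error)/RMS(signal) < tol without needing a square root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 256    /* elements per block */
#define GROUP 16  /* elements per scale group (assumption) */

/* Symmetric int8 quantization of one input block: q8[i] ~ x[i] / d_in.
 * Returns the scale d_in = max|x| / 127 (0 for an all-zero block). */
float quantize_q8_block(const float *x, int8_t *q8) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    const float d_in = amax / 127.0f;
    const float inv = d_in > 0.0f ? 1.0f / d_in : 0.0f;
    for (int i = 0; i < QK; ++i) {
        float v = x[i] * inv;
        int r = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round to nearest */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q8[i] = (int8_t)r;
    }
    return d_in;
}

/* acc = d * d_in * sum_g sc[g] * sum_{i in g} q8[i] * codes[i],
 * with all inner arithmetic kept in int32. */
float fused_q8_block_dot(float d, float d_in, const int8_t *sc,
                         const int8_t *q8, const int8_t *codes) {
    int32_t total = 0;
    for (int g = 0; g < QK / GROUP; ++g) {
        int32_t group_sum = 0;
        for (int i = 0; i < GROUP; ++i)
            group_sum += (int32_t)q8[g * GROUP + i] * (int32_t)codes[g * GROUP + i];
        total += (int32_t)sc[g] * group_sum;
    }
    return d * d_in * (float)total;
}

/* Aggregate gate: true when RMS(got - ref) / RMS(ref) < tol. */
int passes_error_energy_gate(const float *ref, const float *got,
                             size_t n, double tol) {
    assert(n > 0 && tol > 0.0);
    double err2 = 0.0, sig2 = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double e = (double)got[i] - (double)ref[i];
        err2 += e * e;
        sig2 += (double)ref[i] * (double)ref[i];
    }
    return err2 < tol * tol * sig2;
}
```

Because the gate sums error and signal energy over all rows, a single output row whose reference value is near zero no longer dominates the result, which is the failure mode the commit attributes to per-row relative error.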

…nt 2.07×

Bumps VERSION_NAME 0.33.0 -> 0.34.0.

Bundles the develop changes since 0.33.0: the new skainet-data-source module (URI-backed sources, HF auth, raw format parsers, suspend data pipeline DSL) + dataset operation views and richer batches (#784/#785), the bf16-native DSL -> StableHLO export path and the pluggable per-phase/per-target compile-optimization seam (#788/#791), NEON K-quant matmul perf (block-outer order + fused Q8 int8 dot, 2.07x Q4_K on Cortex-A55) with aarch64 board verification (#786/#787), LayerNorm f32 normalization + rank-0 tensor-type emission fixes, macOS host build fix (#789), Code of Conduct (#790), and the offline markup-antora docs image (#781).

Minor bump (not patch): new published module skainet-data-source; all data-api additions are default-bearing (no source-incompatible changes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

michalharakal and others added 3 commits July 2, 2026 22:19

michalharakal merged commit 7c9f83d into develop Jul 2, 2026
7 checks passed

michalharakal deleted the perf/q6k-neon-kernel branch July 2, 2026 21:13

michalharakal mentioned this pull request Jul 5, 2026

release: 0.34.0 — URI data sources, bf16 StableHLO export, NEON K-qua… #792

Merged

michalharakal merged 3 commits into develop from perf/q6k-neon-kernel
