
ggml-cpu: optimize ggml_gemm_q4_K_8x8_q8_K interleaving/staging for AVX-512 (and AVX2) #22525

Open
HyeongiJeon wants to merge 4 commits into ggml-org:master from HyeongiJeon:optimization/ggml_gemm_q4_K_8x8_q8_K

Conversation


@HyeongiJeon HyeongiJeon commented Apr 29, 2026

Overview

This PR optimizes the CPU implementation of ggml_gemm_q4_K_8x8_q8_K, mainly for the AVX-512 path.

The change reduces RHS/LHS staging overhead. The mathematical computation is unchanged; the optimization only changes how the packed q4_K/q8_K data is prepared and accumulated.

The main performance gain is on AVX-512.
The AVX2 path was also checked and is roughly neutral to slightly positive in local testing.
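To illustrate the general idea, here is a simplified, scalar sketch of the staging pattern being optimized. This is not the PR's actual kernel code: `unpack4`, the buffer layout, and the plain loops stand in for the real q4_K/q8_K SIMD paths. The point is that each packed RHS column is unpacked once into a staging buffer and reused across all LHS rows, rather than being re-expanded inside the row loop.

```c
#include <stdint.h>

// Hypothetical helper: stands in for the q4_K nibble expansion.
static inline int8_t unpack4(uint8_t b, int hi) {
    return (int8_t)(((hi ? (b >> 4) : b) & 0x0F) - 8);
}

// Sketch only: unpack each packed RHS column once, then amortize that
// staging cost over all nr LHS rows.
void gemm_staged(int n, const uint8_t *rhs_packed, const int8_t *lhs,
                 int32_t *acc, int nr, int nc) {
    for (int x = 0; x < nc; x++) {
        int8_t staged[n];                          // one unpacked RHS column
        for (int i = 0; i < n / 2; i++) {
            staged[2 * i + 0] = unpack4(rhs_packed[x * (n / 2) + i], 0);
            staged[2 * i + 1] = unpack4(rhs_packed[x * (n / 2) + i], 1);
        }
        for (int y = 0; y < nr; y++) {             // staging paid once per column
            int32_t sum = 0;
            for (int i = 0; i < n; i++) sum += staged[i] * lhs[y * n + i];
            acc[y * nc + x] += sum;
        }
    }
}
```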

Additional information

Correctness

I compared the optimized implementation against the baseline/reference implementation with multiple q4_K GEMM shapes.

| Shape (n, nr, nc) | Max relative error | Result |
| --- | --- | --- |
| 256, 4, 8 | 7.54e-06 | PASS |
| 512, 4, 8 | 2.52e-06 | PASS |
| 4096, 4, 8 | 4.42e-06 | PASS |
| 4096, 16, 8 | 2.81e-05 | PASS |
| 4096, 16, 16 | 5.47e-05 | PASS |
| 4096, 16, 32 | 8.54e-05 | PASS |
| 4096, 64, 64 | 9.04e-05 | PASS |

All tested correctness cases passed.
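For reference, the pass criterion is the maximum relative error over the output matrix. A minimal sketch of the check (my test harness is local and not included in this PR, so details may differ):

```c
#include <math.h>

// Max relative error between reference and optimized GEMM outputs
// (sketch of the criterion used in the table above).
float max_rel_error(const float *ref, const float *opt, int count) {
    float max_err = 0.0f;
    for (int i = 0; i < count; i++) {
        float denom = fmaxf(fabsf(ref[i]), 1e-20f);  // avoid division by zero
        float err   = fabsf(opt[i] - ref[i]) / denom;
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```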

Benchmark notes

The micro-benchmark measures only ggml_gemm_q4_K_8x8_q8_K(). Quantization and packing are done before timing.
Tested shapes are based on common LLM projection GEMMs:

  • (4096, *, 4096) – 7B/8B-class attention projection
  • (4096, *, 11008) – 7B-class FFN gate/up projection
  • (8192, *, 8192) – larger hidden-size attention projection

Results should be compared before/after on the same CPU.
(Absolute GFLOPS should not be compared across CPUs because clock behavior and power policy differ significantly between systems.)
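For context, the GFLOPS figures use the standard GEMM FLOP count of 2·n·nr·nc per kernel call (one multiply-add = 2 FLOPs). A minimal sketch of the timing loop, where `kernel` stands in for a pre-packed call to ggml_gemm_q4_K_8x8_q8_K (the actual scaffolding script is local and not part of this PR):

```c
#include <time.h>

// Sketch: time only the GEMM kernel; quantization/packing happen beforehand.
double bench_gflops(void (*kernel)(void), int n, int nr, int nc, int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) kernel();       // only the kernel is timed
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 * n * nr * nc * (double)iters;  // multiply-add = 2 FLOPs
    return flops / sec / 1e9;
}
```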

Test systems

| CPU | Path |
| --- | --- |
| Intel Xeon 6443N (Sapphire Rapids) | AVX-512 |
| Intel Core i7-11700 (Rocket Lake) | AVX-512 |
| Intel Core i9-10900X (Cascade Lake) | AVX-512 |
| AMD Ryzen 7 5800X | AVX2 |

Micro-benchmark summary

| CPU | Path | Geomean speedup | Min | Max |
| --- | --- | --- | --- | --- |
| Intel Xeon 6443N | AVX-512 | 1.196x | 1.181x | 1.214x |
| Intel Core i7-11700 | AVX-512 | 1.105x | 1.065x | 1.154x |
| Intel Core i9-10900X | AVX-512 | 1.139x | 1.119x | 1.161x |
| AMD Ryzen 7 5800X | AVX2 | ~1.01x | 0.997x | 1.03x |

Detailed micro-benchmark results

Intel Xeon 6443N, AVX-512

| Shape (n, nr, nc) | Before GFLOPS | After GFLOPS | Speedup |
| --- | --- | --- | --- |
| 4096, 4, 4096 | 86.6 | 105.1 | 1.214x |
| 4096, 32, 4096 | 119.0 | 140.5 | 1.181x |
| 4096, 128, 4096 | 119.1 | 140.8 | 1.182x |
| 4096, 4, 11008 | 86.7 | 105.2 | 1.213x |
| 4096, 32, 11008 | 119.2 | 141.0 | 1.183x |
| 8192, 4, 8192 | 87.0 | 105.6 | 1.214x |
| 8192, 32, 8192 | 119.3 | 141.6 | 1.187x |

Intel Core i7-11700, AVX-512

| Shape (n, nr, nc) | Before GFLOPS | After GFLOPS | Speedup |
| --- | --- | --- | --- |
| 4096, 4, 4096 | 145.8 | 162.9 | 1.117x |
| 4096, 32, 4096 | 219.0 | 234.9 | 1.073x |
| 4096, 128, 4096 | 217.8 | 232.7 | 1.068x |
| 4096, 4, 11008 | 133.3 | 153.8 | 1.154x |
| 4096, 32, 11008 | 204.3 | 233.6 | 1.143x |
| 8192, 4, 8192 | 141.3 | 158.2 | 1.120x |
| 8192, 32, 8192 | 221.0 | 235.3 | 1.065x |

Intel Core i9-10900X, AVX-512

| Shape (n, nr, nc) | Before GFLOPS | After GFLOPS | Speedup |
| --- | --- | --- | --- |
| 4096, 4, 4096 | 130.1 | 151.0 | 1.161x |
| 4096, 32, 4096 | 177.9 | 202.1 | 1.136x |
| 4096, 128, 4096 | 178.4 | 201.8 | 1.131x |
| 4096, 4, 11008 | 118.7 | 136.4 | 1.149x |
| 4096, 32, 11008 | 175.3 | 196.1 | 1.119x |
| 8192, 4, 8192 | 115.9 | 133.3 | 1.150x |
| 8192, 32, 8192 | 174.8 | 197.9 | 1.132x |

AMD Ryzen 7 5800X, AVX2

| Shape (n, nr, nc) | Before GFLOPS | After GFLOPS | Speedup |
| --- | --- | --- | --- |
| 4096, 4, 4096 | 102.4 | 102.8 | 1.004x |
| 4096, 32, 4096 | 135.7 | 136.2 | 1.004x |
| 4096, 128, 4096 | 132.7 | 134.1 | 1.011x |
| 4096, 4, 11008 | 98.0 | 100.4 | 1.024x |
| 4096, 32, 11008 | 133.4 | 134.5 | 1.008x |
| 8192, 4, 8192 | 97.6 | 97.3 | 0.997x |
| 8192, 32, 8192 | 131.7 | 135.4 | 1.028x |

End-to-end sanity check

I also ran llama-bench on a Qwen3-14B-Q4_K_M.gguf model as an end-to-end sanity check (pp1024, tg128, 1 thread):

| CPU | Before pp1024 | After pp1024 | pp speedup | Before tg128 | After tg128 |
| --- | --- | --- | --- | --- | --- |
| Intel Xeon 6443N | 3.43 t/s | 3.80 t/s | 1.108x | 0.90 t/s | 0.90 t/s |
| Intel Core i7-11700 | 6.10 t/s | 6.35 t/s | 1.041x | 2.21 t/s | 2.20 t/s |
| Intel Core i9-10900X | 4.71 t/s | 4.96 t/s | 1.053x | 1.32 t/s | 1.31 t/s |

As expected, the end-to-end gain is smaller than the isolated micro-kernel benchmark because full inference includes other kernels and runtime overhead. The improvement is mainly visible in prompt processing; token generation is roughly neutral in this test.
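For reference, a llama-bench invocation matching these settings would look roughly like this (the model path is illustrative):

```
llama-bench -m Qwen3-14B-Q4_K_M.gguf -p 1024 -n 128 -t 1
```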

Tests

  • Custom q4_K GEMM correctness test: passed (max rel. error ≤ 9.04e-05)
  • ctest --test-dir build-test --output-on-failure: passed
  • llama-bench sanity check: passed

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES.
    AI assistance was used only for English translation and for writing a private/local micro-benchmark scaffolding script (not included in this PR). All submitted code was manually implemented, reviewed, and tested by me.

@HyeongiJeon HyeongiJeon requested a review from ggerganov as a code owner April 29, 2026 15:09
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Apr 29, 2026