ggml-cpu: optimize ggml_gemm_q4_K_8x8_q8_K interleaving/staging for AVX-512 (and AVX2) #22525
Open
HyeongiJeon wants to merge 4 commits into ggml-org:master
Conversation
Overview
This PR optimizes the CPU implementation of ggml_gemm_q4_K_8x8_q8_K, mainly for the AVX-512 path. The change reduces RHS/LHS staging overhead: the mathematical computation is unchanged, and the optimization only affects how the packed q4_K/q8_K data is prepared and accumulated. The main performance gain is on AVX-512; the AVX2 path was also checked and is roughly neutral to slightly positive in local testing.
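For readers unfamiliar with the staging idea, the sketch below illustrates the general pattern; it is not the PR's actual code, and `stage_q8_tile`, the buffer names, and the tile geometry are simplified assumptions. The point is that strided rows of quantized data are copied once into a contiguous, 64-byte-aligned scratch buffer, so the AVX-512 inner loop can consume a linear stream with aligned 512-bit loads instead of gathering from strided locations:

```c
// Hypothetical sketch of tile staging (not the PR's actual code).
#include <stdint.h>
#include <string.h>
#include <stdalign.h>

#define QK_K      256   // q4_K/q8_K super-block size in ggml
#define TILE_ROWS 8     // 8x8 interleaved tiles in ggml_gemm_q4_K_8x8_q8_K

// Stage TILE_ROWS strided rows of q8 quants into one contiguous buffer so
// the hot loop reads a linear, 64-byte-aligned stream (aligned 512-bit loads).
static void stage_q8_tile(int8_t *restrict dst,
                          const int8_t *restrict src,
                          size_t src_stride) {
    for (int r = 0; r < TILE_ROWS; ++r) {
        // One contiguous copy per row; the compiler emits wide moves for this.
        memcpy(dst + (size_t) r * QK_K, src + (size_t) r * src_stride, QK_K);
    }
}

int main(void) {
    alignas(64) int8_t staged[TILE_ROWS * QK_K];
    int8_t src[TILE_ROWS * 2 * QK_K] = {0};  // rows interleaved with other data
    stage_q8_tile(staged, src, 2 * QK_K);    // inner loop would read 'staged'
    return (int) staged[0];
}
```

Staging costs one extra pass over the tile, which generally pays off when the staged tile is reused across many accumulator updates in the inner loop.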
Additional information
Correctness
I compared the optimized implementation against the baseline/reference implementation with multiple q4_K GEMM shapes.
[Table: per-shape (n, nr, nc) maximum relative errors: 7.54e-06, 2.52e-06, 4.42e-06, 2.81e-05, 5.47e-05, 8.54e-05, 9.04e-05; the shape column did not survive extraction.]
All tested correctness cases passed.
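As a rough sketch of how such a comparison can be scored (a hypothetical harness, not the script used for this PR; `max_rel_err` and the sample values are illustrative):

```c
// Hypothetical correctness scoring: run the baseline and optimized kernels
// on identical quantized inputs, then report the worst relative error.
#include <math.h>
#include <stdio.h>

static float max_rel_err(const float *ref, const float *opt, int n) {
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float denom = fabsf(ref[i]) > 1e-8f ? fabsf(ref[i]) : 1e-8f;  // avoid /0
        float err   = fabsf(ref[i] - opt[i]) / denom;
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}

int main(void) {
    // In the real check, 'ref' comes from the baseline kernel and 'opt' from
    // the optimized kernel; these values are only a stand-in.
    float ref[4] = {1.0f, -2.0f, 0.5f, 3.0f};
    float opt[4] = {1.0f + 3e-6f, -2.0f, 0.5f, 3.0f - 9e-6f};
    printf("max rel err: %.2e\n", max_rel_err(ref, opt, 4));
    return 0;
}
```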
Benchmark notes
The micro-benchmark measures only ggml_gemm_q4_K_8x8_q8_K(); quantization and packing are done before timing (see the sketch after this list). Tested shapes are based on common LLM projection GEMMs:
- (4096, *, 4096): 7B/8B-class attention projection
- (4096, *, 11008): 7B-class FFN gate/up projection
- (8192, *, 8192): larger hidden-size attention projection
Results should be compared before/after on the same CPU.
(Absolute GFLOPS should not be compared across CPUs because clock behavior and power policy differ significantly between systems.)
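The timing loop has roughly the following shape. This is a hypothetical sketch, not the actual benchmark script (which is not included in this PR); `kernel_under_test`, the iteration count, and the nr value standing in for `*` are placeholders:

```c
// Hypothetical micro-benchmark shape: quantize/pack outside the timed region,
// time only repeated kernel calls, convert to GFLOPS as 2*n*nr*nc / seconds.
#include <stdio.h>
#include <time.h>

static volatile int sink;

// Stand-in for ggml_gemm_q4_K_8x8_q8_K; the real benchmark calls the actual
// kernel on buffers that were quantized and packed before timing starts.
static void kernel_under_test(void) { sink = sink + 1; }

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

int main(void) {
    const double n = 4096, nr = 16, nc = 4096;  // nr is a placeholder for '*'
    const int    iters = 100;

    // (quantize + pack LHS/RHS here, outside the timed region)

    double t0 = now_sec();
    for (int i = 0; i < iters; ++i) {
        kernel_under_test();
    }
    double dt = (now_sec() - t0) / iters;       // seconds per call

    // A GEMM of this shape performs ~2*n*nr*nc floating-point ops per call.
    printf("%.2f GFLOPS\n", 2.0 * n * nr * nc / dt * 1e-9);
    return 0;
}
```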
Test systems
[System details table omitted; the tested CPUs are listed under the detailed results below.]
Micro-benchmark summary
[Summary table omitted.]
Detailed micro-benchmark results
Intel Xeon 6443N, AVX-512: [per-shape (n, nr, nc) results table omitted]
Intel Core i7-11700, AVX-512: [per-shape (n, nr, nc) results table omitted]
Intel Core i9-10900X, AVX-512: [per-shape (n, nr, nc) results table omitted]
AMD Ryzen 7 5800X, AVX2: [per-shape (n, nr, nc) results table omitted]
End-to-end sanity check
I also ran llama-bench on a Qwen3-14B-Q4_K_M.gguf model as an end-to-end sanity check (pp1024, tg128, 1 thread). [Results table omitted.] As expected, the end-to-end gain is smaller than in the isolated micro-kernel benchmark because full inference includes other kernels and runtime overhead. The improvement is mainly visible in prompt processing; token generation is roughly neutral in this test.
Tests
- Baseline vs. optimized comparison: passed (maximum relative error 9.04e-05; see Correctness above)
- ctest --test-dir build-test --output-on-failure: passed
- llama-bench sanity check: passed
Requirements
AI assistance was used only for English translation and for writing a private/local micro-benchmark scaffolding script (not included in this PR). All submitted code was manually implemented, reviewed, and tested by me.