Commit 3b176b1
committed
ggml-ve graph compiler: compile N>1 prompt eval as one fused kernel (16x)
Prompt eval ran on the interpreter (the compiler refused N>1). ftrace showed
it pinned ~1.6 cores: the matvec was invoked ~82k times across the prompt
graph, each spawning its own 8-thread region, so per-op fork/join dominated.
This compiles the whole prompt graph into one #pragma omp parallel (fork once,
ops share via #pragma omp for) — the same mechanism that gave decode 2-3x.
Llama-3.2-3B prompt eval 3.56 -> 57.2 tok/s (16x); decode unchanged at 48.2.
Output is token-for-token identical to the interpreter on Llama-3.2-3B (BF16)
and Ternary-Bonsai-8B (VEBP).
Design (size-independent: one .so serves any prompt length):
- Kernel gains a runtime n_tok arg + an HMEM positions[] array (staged from the
ROPE inp_pos / SET_ROWS index leaf, i64->i32). Per-token element/row counts
are baked (full / n_tok_baked_) and re-scaled by n_tok at runtime.
- MUL_MAT loops n_tok activation columns (colmajor BF16 -> one cblas_sgemm for
N>1); MUL/ADD/GLU scale the flat range by n_tok; RMS_NORM goes row-parallel
for the prompt; ROPE loops (token,head) with positions[t]; SET_ROWS writes KV
cell positions[t]; GET_ROWS loops n_tok input ids.
- FLASH_ATTN: new strided _inner (attention_f32q_bf16kv_fused_gqa_inner_strided)
takes Q/out head strides in bytes, called once per query token with
seq_len = positions[t]+1 as the causal mask (no explicit mask needed for a
single-sequence prompt). Decode is the n_tok==1 / contiguous special case.
The graph is NOT uniformly N tokens: llama slices to the output tokens
(GET_ROWS inp_out_ids, n_out=1) before the final norm + lm_head. Per-op
`scales_n` keeps that tail on its own count, and the output-selection GET_ROWS
copies the trailing n_out rows instead of a contiguous prefix. (Both were the
N>1 correctness bug; see kb/llama-cpp-integration/prompt-graph-last-token-slice.)
GGML_VE_GC_NO_NGT1=1 forces decode-only routing (interpreter prompt eval).
F32-KV prompt eval (no strided F32 _inner) and ne[3]>1 batches stay refused.1 parent 2974f4b commit 3b176b1
3 files changed
Lines changed: 406 additions & 183 deletions
0 commit comments