Skip to content

Commit 3b176b1

Browse files
committed
ggml-ve graph compiler: compile N>1 prompt eval as one fused kernel (16x)
Prompt eval ran on the interpreter (the compiler refused N>1). ftrace showed it pinned ~1.6 cores: the matvec was invoked ~82k times across the prompt graph, each spawning its own 8-thread region, so per-op fork/join dominated. This compiles the whole prompt graph into one #pragma omp parallel (fork once, ops share via #pragma omp for) — the same mechanism that gave decode 2-3x. Llama-3.2-3B prompt eval 3.56 -> 57.2 tok/s (16x); decode unchanged at 48.2. Output is token-for-token identical to the interpreter on Llama-3.2-3B (BF16) and Ternary-Bonsai-8B (VEBP). Design (size-independent: one .so serves any prompt length): - Kernel gains a runtime n_tok arg + an HMEM positions[] array (staged from the ROPE inp_pos / SET_ROWS index leaf, i64->i32). Per-token element/row counts are baked (full / n_tok_baked_) and re-scaled by n_tok at runtime. - MUL_MAT loops n_tok activation columns (colmajor BF16 -> one cblas_sgemm for N>1); MUL/ADD/GLU scale the flat range by n_tok; RMS_NORM goes row-parallel for the prompt; ROPE loops (token,head) with positions[t]; SET_ROWS writes KV cell positions[t]; GET_ROWS loops n_tok input ids. - FLASH_ATTN: new strided _inner (attention_f32q_bf16kv_fused_gqa_inner_strided) takes Q/out head strides in bytes, called once per query token with seq_len = positions[t]+1 as the causal mask (no explicit mask needed for a single-sequence prompt). Decode is the n_tok==1 / contiguous special case. The graph is NOT uniformly N tokens: llama slices to the output tokens (GET_ROWS inp_out_ids, n_out=1) before the final norm + lm_head. Per-op `scales_n` keeps that tail on its own count, and the output-selection GET_ROWS copies the trailing n_out rows instead of a contiguous prefix. (Both were the N>1 correctness bug; see kb/llama-cpp-integration/prompt-graph-last-token-slice.) GGML_VE_GC_NO_NGT1=1 forces decode-only routing (interpreter prompt eval). F32-KV prompt eval (no strided F32 _inner) and ne[3]>1 batches stay refused.
1 parent 2974f4b commit 3b176b1

3 files changed

Lines changed: 406 additions & 183 deletions

File tree

0 commit comments

Comments
 (0)