Commit 23c5236

ggml-ve graph compiler: allow cross-fragment inputs by default (VEBP 3.2x)
The self-containment gate refused any cgraph that reads a computed intermediate produced by another subgraph. That was a defensive measure added while un-implemented YaRN rope made fragmented compiles produce garbled output. With YaRN and the chunked-codegen fixes in place, the staging is correct: execute() stages the cross-fragment input host<->HBM, and the producing subgraph (interpreted, or a prior compiled graph) writes the host tensor before this one runs.

This was the last thing keeping the VEBP ternary model out of the compiler: its token_embd is F16, so GET_ROWS runs on CPU (VE GET_ROWS supports only BF16/F32 src) and produces embd as a cross-fragment input. The gate refused the whole 1266-node decode graph over that one input, so the graph fell back to the interpreter.

Flip the gate to allow-by-default; GGML_VE_GC_STRICT=1 restores the refusal. Verified correct on Llama-3.2-3B, Bonsai-8B BF16, and Bonsai-8B VEBP.

Result (GGML_VE_HBM=1, -fa on, -ub 1, warm):
- Ternary-Bonsai-8B-VEBP: 10.6 (interp) -> 33.5 tok/s compiled (3.17x).

That completes ternary models running correctly AND fast through the graph compiler. (A cleaner alternative is F16 GET_ROWS on VE so VEBP is fully self-contained; left as a follow-up.)

Opt-in (GGML_VE_COMPILE_GRAPH=1). Not pushed.
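The gate described above can be sketched in isolation. This is a minimal illustration, not the code in graph_compiler.cpp: `Tensor`, `Op`, `fragment_compilable`, and `strict_from_env` are hypothetical names, and the real trace() works on ggml tensors and tracks `pre_produced` differently. Only the environment variable name and the presence-only check are taken from the diff.

```cpp
#include <cassert>
#include <cstdlib>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for ggml tensors and graph nodes.
struct Tensor {
    std::string name;
    bool is_leaf;  // plays the role of op == GGML_OP_NONE (weights, inputs)
};

struct Op {
    const Tensor * dst;
    std::vector<const Tensor *> srcs;
};

// Mirrors the check in the diff: strict mode is on whenever the variable
// is present in the environment, whatever its value.
bool strict_from_env() {
    return std::getenv("GGML_VE_GC_STRICT") != nullptr;
}

// Returns true when the fragment may be compiled.
// A source is acceptable if it is a leaf or is produced inside this
// fragment; anything else is a cross-fragment intermediate input, which
// is allowed unless strict mode is requested.
bool fragment_compilable(const std::vector<Op> & ops, bool strict) {
    std::set<const Tensor *> produced;
    for (const Op & op : ops) {
        produced.insert(op.dst);
    }
    for (const Op & op : ops) {
        for (const Tensor * t : op.srcs) {
            assert(t != nullptr);
            if (t->is_leaf) continue;          // leaf input
            if (produced.count(t)) continue;   // produced here
            if (!strict) continue;             // cross-fragment input: allowed by default
            return false;                      // strict: refuse the fragment
        }
    }
    return true;
}
```

Because the check is `getenv(...) != nullptr`, setting GGML_VE_GC_STRICT=0 would still enable strict mode; only leaving the variable unset gives the allow-by-default behavior.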
1 parent 3301c83 commit 23c5236

2 files changed

Lines changed: 14 additions & 5 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -163,3 +163,5 @@ AGENTS.local.md
 ftrace.out.*
 ve_sgemv_wrapper.L
 PR_DESCRIPTION.md
+graph_*.L
+graph_*.so.c

ggml/src/ggml-ve/graph_compiler.cpp

Lines changed: 12 additions & 5 deletions
@@ -616,12 +616,19 @@ bool GraphCompiler::trace(ggml_cgraph * cgraph) {
             if (s == 0) continue; // this op's own dst (output)
             if (t->op == GGML_OP_NONE) continue; // leaf input
             if (pre_produced.count(canonical(t))) continue; // produced here
-            // GGML_VE_GC_ALLOW_FRAGMENTS=1: bypass the gate to debug why
-            // chaining compiled middle fragments garbles (task #70).
-            static const bool allow_frags = (std::getenv("GGML_VE_GC_ALLOW_FRAGMENTS") != nullptr);
-            if (allow_frags) continue;
+            // Cross-fragment intermediate input: a computed tensor produced by
+            // a DIFFERENT subgraph and read here (e.g. 'embd' from a CPU-side
+            // GET_ROWS when token_embd is F16). execute() stages it host<->HBM;
+            // the producing subgraph (interpreted, or a prior compiled graph)
+            // writes the host tensor before this one runs, so it's correct.
+            // This was refused while un-implemented YaRN rope made fragmented
+            // compiles garble; with YaRN + chunking fixed it's safe and lets
+            // VEBP/Qwen3 models compile (Ternary-Bonsai-8B VEBP: 10.6 -> 33
+            // tok/s). GGML_VE_GC_STRICT=1 restores the refusal.
+            static const bool strict = (std::getenv("GGML_VE_GC_STRICT") != nullptr);
+            if (!strict) continue;
             if (debug_enabled()) {
-                fprintf(stderr, "[VE-GC] refuse: cross-fragment intermediate input '%s' (%s) at op #%d — interpreter will run this fragment\n",
+                fprintf(stderr, "[VE-GC] refuse (strict): cross-fragment intermediate input '%s' (%s) at op #%d\n",
                         t->name ? t->name : "?", bn ? bn : "<no-buffer>", i);
             }
             trace_valid_ = false;
