TL;DR
LlamaSampler::sample(ctx, idx) crashes the process with:
llama-grammar.cpp:940: GGML_ASSERT(!stacks.empty()) failed
…on the first sample call when the chain contains LlamaSampler::grammar(...), even though LlamaSampler::grammar returned Ok and the grammar is well-formed (e.g. root ::= "a"). Reproduces deterministically on 0.1.145 (latest published version), CUDA + CPU, single-threaded, real LlamaContext.
Versions
- llama-cpp-2 = "=0.1.145" (newest as of 2026-04-24)
- llama-cpp-sys-2 = "=0.1.145" with cuda feature
- Vendored llama.cpp (whatever ships in llama-cpp-sys-2-0.1.145/llama.cpp/)
- Linux x86_64, NVIDIA RTX PRO 6000 (Blackwell), CUDA 13.2, but reproduces with n_gpu_layers=0 too
- Tested with Qwen3.6-27B-UD-Q6_K_XL.gguf, but the choice of model doesn't change the failure
What works (rules out half the search space)
| path | result |
| --- | --- |
| LlamaSampler::grammar(...) constructor | ✅ returns Ok for any of the grammars below |
| grammar.apply(&mut arr) directly on a synthetic LlamaTokenDataArray | ✅ no crash |
| chain_simple([grammar, greedy]).apply(&mut arr) on a synthetic array | ✅ no crash |
| Same chain with a full-vocab (152 064 ids) synthetic array | ✅ no crash |
What crashes
LlamaSampler::sample(ctx, idx) on the same chain after ctx.decode(&mut prefill_batch). Crashes on the first sample call before any tokens are generated.
Minimal reproducer
This crashes:
use llama_cpp_2::{
    llama_backend::LlamaBackend,
    model::{params::LlamaModelParams, AddBos, LlamaModel},
    context::params::LlamaContextParams,
    llama_batch::LlamaBatch,
    sampling::LlamaSampler,
};
use std::num::NonZeroU32;
let backend = LlamaBackend::init().unwrap();
let model = LlamaModel::load_from_file(
    &backend,
    std::env::var("CRP_MODEL_PATH").unwrap(),
    &LlamaModelParams::default().with_n_gpu_layers(0), // CPU also crashes
).unwrap();
let mut ctx = model.new_context(
    &backend,
    LlamaContextParams::default().with_n_ctx(NonZeroU32::new(2048)),
).unwrap();
// Even the simplest grammar reproduces.
let grammar = LlamaSampler::grammar(&model, "root ::= \"a\"\n", "root").unwrap();
let mut sampler = LlamaSampler::chain_simple([grammar, LlamaSampler::greedy()]);
let prompt = "<|im_start|>user\nReply with a.<|im_end|>\n<|im_start|>assistant\n";
let tokens = model.str_to_token(prompt, AddBos::Always).unwrap();
let mut batch = LlamaBatch::new(2048, 1);
let last = tokens.len() as i32 - 1;
for (i, t) in tokens.iter().enumerate() {
    // Request logits only for the final prompt token.
    batch.add(*t, i as i32, &[0], i as i32 == last).unwrap();
}
ctx.decode(&mut batch).unwrap();
// CRASH here:
let _token = sampler.sample(&ctx, batch.n_tokens() - 1);
Replacing sampler.sample(&ctx, ...) with the manual path does not crash: build a LlamaTokenDataArray::from_iter from ctx.get_logits_ith(...), call arr.apply_sampler(&sampler), then read token = arr.data[arr.selected].id. So the bug is specifically in the llama_sampler_sample → chain_apply → grammar_apply_impl interaction with a real context, not in the grammar sampler itself.
Suspected mechanism
llama-grammar.cpp:940:
static llama_grammar_candidates llama_grammar_reject_candidates(
        const llama_grammar_rules      & rules,
        const llama_grammar_stacks     & stacks,
        const llama_grammar_candidates & candidates) {
    GGML_ASSERT(!stacks.empty()); // REVIEW
The upstream // REVIEW comment suggests maintainers already think this assert is fishy. Hypothesis: backend-pre-sampled tokens from decode() cause llama_grammar_accept_impl to advance the grammar to an empty-stack state before our explicit sample() runs apply(). Then apply_impl calls reject_candidates with empty stacks and the assert fires.
Either:
- The assert at line 940 should be a graceful fallback (e.g. return empty rejects); the // REVIEW comment hints this is on the maintainers' radar.
- chain_apply should re-init the grammar's stacks before calling apply when the grammar's state is invalid.
- Something in the llama_sampler_sample → backend-sample path is calling grammar_accept when it shouldn't.
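To make the hypothesis concrete, here is a toy model in plain Rust. It is not llama.cpp code: ToyGrammar, accept, reject_strict and reject_lenient are hypothetical names that only mimic the shape of the stacks / accept / reject_candidates interaction described above. It shows how a stray accept of a non-matching token could leave the stack list empty, how a strict assert would then fire on the next apply, and what the "return empty rejects" fallback from the first option might look like.

```rust
// Toy model only: names echo llama.cpp concepts, but none of this is llama.cpp code.

/// Each stack lists the characters still expected; the next one is last.
#[derive(Debug, Clone)]
struct ToyGrammar {
    stacks: Vec<Vec<char>>,
}

impl ToyGrammar {
    /// Toy equivalent of `root ::= "a"`: a single stack expecting 'a'.
    fn root_a() -> Self {
        ToyGrammar { stacks: vec![vec!['a']] }
    }

    /// Advance every stack whose next expected char matches; drop the rest.
    fn accept(&mut self, ch: char) {
        self.stacks = self
            .stacks
            .iter()
            .filter(|s| s.last() == Some(&ch))
            .map(|s| s[..s.len() - 1].to_vec())
            .collect();
    }

    /// Candidates that no live stack allows as the next character.
    fn reject(&self, candidates: &[char]) -> Vec<char> {
        candidates
            .iter()
            .copied()
            .filter(|c| !self.stacks.iter().any(|s| s.last() == Some(c)))
            .collect()
    }

    /// Strict variant: mimics the assert and panics when no stacks remain.
    fn reject_strict(&self, candidates: &[char]) -> Vec<char> {
        assert!(!self.stacks.is_empty(), "stacks must not be empty");
        self.reject(candidates)
    }

    /// Lenient variant: with no stacks left, report no rejects instead of panicking.
    fn reject_lenient(&self, candidates: &[char]) -> Vec<char> {
        if self.stacks.is_empty() {
            return Vec::new();
        }
        self.reject(candidates)
    }
}

fn main() {
    let mut g = ToyGrammar::root_a();
    // Fresh grammar: only 'b' is rejected.
    assert_eq!(g.reject_strict(&['a', 'b']), vec!['b']);

    // Hypothesised trigger: something accepts a token the grammar does not allow.
    g.accept('b');
    assert!(g.stacks.is_empty());

    // The lenient path survives; the strict path panics.
    assert!(g.reject_lenient(&['a', 'b']).is_empty());
    let crashed = std::panic::catch_unwind(|| g.reject_strict(&['a', 'b'])).is_err();
    println!("strict path panicked: {crashed}");
}
```

Note that the lenient variant silently stops constraining output once the stacks are gone, so in a real fix it would probably need to be paired with a warning or an error return rather than used as-is.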
Workaround (downstream)
For users who hit this in the meantime: build the candidate array yourself from ctx.get_logits_ith(idx) and call arr.apply_sampler(&chain) instead of sampler.sample(&ctx, idx). The grammar sampler itself works fine on synthetic arrays.
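The data flow of that manual path can be sketched with stand-in types. Nothing below is llama-cpp-2 API: Candidate, candidates_from_logits, apply_mask and greedy are hypothetical helpers that only illustrate the three steps (one candidate per vocab id from a logits row, a grammar-like stage that masks disallowed ids to negative infinity, then a greedy pick). In real code the equivalents would be the LlamaTokenDataArray built from ctx.get_logits_ith(idx) and arr.apply_sampler(&chain).

```rust
// Stand-in types for illustration; the real ones live in llama-cpp-2.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Candidate {
    id: u32,
    logit: f32,
}

/// One candidate per vocabulary id, built from a single logits row.
fn candidates_from_logits(logits: &[f32]) -> Vec<Candidate> {
    logits
        .iter()
        .enumerate()
        .map(|(id, &logit)| Candidate { id: id as u32, logit })
        .collect()
}

/// Stand-in for the grammar stage: mask every id the predicate disallows.
fn apply_mask(cands: &mut [Candidate], allowed: impl Fn(u32) -> bool) {
    for c in cands.iter_mut() {
        if !allowed(c.id) {
            c.logit = f32::NEG_INFINITY;
        }
    }
}

/// Stand-in for the greedy stage: the highest unmasked logit wins.
fn greedy(cands: &[Candidate]) -> Option<u32> {
    cands
        .iter()
        .filter(|c| c.logit.is_finite())
        .max_by(|a, b| a.logit.total_cmp(&b.logit))
        .map(|c| c.id)
}

fn main() {
    let logits = [0.1_f32, 2.5, 1.7, -0.3];
    let mut cands = candidates_from_logits(&logits);
    // Pretend the grammar only allows ids 2 and 3.
    apply_mask(&mut cands, |id| id == 2 || id == 3);
    assert_eq!(greedy(&cands), Some(2));
}
```

Because the candidate array is owned by the caller here, no state from the context's own sampling path is involved, which is consistent with the observation that synthetic arrays do not crash.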
We're shipping a different workaround (drop grammar + post-validate JSON shape, since our parser is strict) but that's specific to our use case.
Full investigation writeup
Full reproducer binaries + ruled-out hypothesis matrix + decision log:
[github.com/sethclawd-prog/... — happy to share if useful]
Filing here rather than ggml-org/llama.cpp because the user-visible API surface is LlamaSampler::sample(ctx, idx). Cross-link to upstream is fine if maintainers prefer that.