Grammar sampler crashes with GGML_ASSERT(!stacks.empty()) on first sample() call (0.1.145) #1007

@sethclawd-prog

TL;DR

LlamaSampler::sample(ctx, idx) crashes the process with:

llama-grammar.cpp:940: GGML_ASSERT(!stacks.empty()) failed

…on the first sample call when the chain contains LlamaSampler::grammar(...), even though LlamaSampler::grammar returned Ok and the grammar is well-formed (e.g. root ::= "a"). Reproduces deterministically on 0.1.145 (latest published version), CUDA + CPU, single-threaded, real LlamaContext.

Versions

  • llama-cpp-2 = "=0.1.145" (newest as of 2026-04-24)
  • llama-cpp-sys-2 = "=0.1.145" with cuda feature
  • Vendored llama.cpp (whatever ships in llama-cpp-sys-2-0.1.145/llama.cpp/)
  • Linux x86_64, NVIDIA RTX PRO 6000 (Blackwell), CUDA 13.2 — but reproduces with n_gpu_layers=0 too
  • Tested with Qwen3.6-27B-UD-Q6_K_XL.gguf — but content of the model doesn't change the failure

What works (rules out half the search space)

| Path | Result |
| --- | --- |
| LlamaSampler::grammar(...) constructor | ✅ returns Ok for any of the grammars below |
| grammar.apply(&mut arr) directly on a synthetic LlamaTokenDataArray | ✅ no crash |
| chain_simple([grammar, greedy]).apply(&mut arr) on a synthetic array | ✅ no crash |
| Same chain with a full-vocab (152 064 ids) synthetic array | ✅ no crash |

What crashes

LlamaSampler::sample(ctx, idx) on the same chain after ctx.decode(&mut prefill_batch). Crashes on the first sample call before any tokens are generated.

Minimal reproducer

This crashes:

use llama_cpp_2::{
    llama_backend::LlamaBackend,
    model::{params::LlamaModelParams, AddBos, LlamaModel},
    context::params::LlamaContextParams,
    llama_batch::LlamaBatch,
    sampling::LlamaSampler,
};
use std::num::NonZeroU32;

let backend = LlamaBackend::init().unwrap();
let model = LlamaModel::load_from_file(
    &backend,
    std::env::var("CRP_MODEL_PATH").unwrap(),
    &LlamaModelParams::default().with_n_gpu_layers(0),  // CPU also crashes
).unwrap();
let mut ctx = model.new_context(
    &backend,
    LlamaContextParams::default().with_n_ctx(NonZeroU32::new(2048)),
).unwrap();

// Even the simplest grammar reproduces.
let grammar = LlamaSampler::grammar(&model, "root ::= \"a\"\n", "root").unwrap();
let mut sampler = LlamaSampler::chain_simple([grammar, LlamaSampler::greedy()]);

let prompt = "<|im_start|>user\nReply with a.<|im_end|>\n<|im_start|>assistant\n";
let tokens = model.str_to_token(prompt, AddBos::Always).unwrap();
let mut batch = LlamaBatch::new(2048, 1);
let last = tokens.len() as i32 - 1;
for (i, t) in tokens.iter().enumerate() {
    batch.add(*t, i as i32, &[0], i as i32 == last).unwrap();
}
ctx.decode(&mut batch).unwrap();

// CRASH here:
let _token = sampler.sample(&ctx, batch.n_tokens() - 1);

Replacing sampler.sample(&ctx, ...) with a manual arr.apply_sampler(&sampler) followed by token = arr.data[arr.selected].id, over a LlamaTokenDataArray::from_iter built from ctx.get_logits_ith(...), does not crash. So the bug is specifically in the llama_sampler_sample → chain_apply → grammar_apply_impl interaction with a real context, not in the grammar sampler itself.

Suspected mechanism

llama-grammar.cpp:940:

static llama_grammar_candidates llama_grammar_reject_candidates(
        const llama_grammar_rules      & rules,
        const llama_grammar_stacks     & stacks,
        const llama_grammar_candidates & candidates) {
    GGML_ASSERT(!stacks.empty()); // REVIEW

The upstream // REVIEW comment suggests maintainers already think this assert is fishy. Hypothesis: backend-pre-sampled tokens from decode() cause llama_grammar_accept_impl to advance the grammar to an empty-stack state before our explicit sample() runs apply(). Then apply_impl calls reject_candidates with empty stacks and the assert fires.

Either:

  1. The assert at line 940 should be a graceful fallback (e.g. return empty rejects) — the // REVIEW comment hints this is on the maintainers' radar.
  2. chain_apply should re-init the grammar's stacks before calling apply when the grammar's state is invalid.
  3. Something in the llama_sampler_sample → backend-sample path is calling grammar_accept when it shouldn't.
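To make the hypothesis and option 1 concrete, here is a toy model in Rust. It is only a sketch under stated assumptions: ToyGrammar, ApplyError, and the single-character "tokens" are hypothetical stand-ins, and nothing here is llama.cpp's actual data layout or API. It shows how an accept of an unexpected token could leave the stack set empty, and how apply could report that state as an error instead of aborting the process.

```rust
/// Toy stand-in for a grammar sampler's parse state (hypothetical, not the
/// llama.cpp layout). Each stack lists the characters still expected, with
/// the next expected character stored last.
#[derive(Debug, Clone, PartialEq)]
struct ToyGrammar {
    stacks: Vec<Vec<char>>,
}

/// Error returned in place of a process-aborting assert.
#[derive(Debug, PartialEq)]
enum ApplyError {
    EmptyStacks,
}

impl ToyGrammar {
    /// State for a grammar equivalent to `root ::= "a"`: one stack expecting 'a'.
    fn root_a() -> Self {
        ToyGrammar { stacks: vec![vec!['a']] }
    }

    /// Advance the state by one character: keep only the stacks whose next
    /// expected character matches, with that character popped. If nothing
    /// matches, the stack set becomes empty.
    fn accept(&mut self, ch: char) {
        self.stacks = self
            .stacks
            .iter()
            .filter(|s| s.last() == Some(&ch))
            .map(|s| s[..s.len() - 1].to_vec())
            .collect();
    }

    /// Return the candidates the grammar allows next, or an error when the
    /// stack set is empty (the condition the assert guards against).
    fn apply(&self, candidates: &[char]) -> Result<Vec<char>, ApplyError> {
        if self.stacks.is_empty() {
            return Err(ApplyError::EmptyStacks);
        }
        Ok(candidates
            .iter()
            .copied()
            .filter(|c| self.stacks.iter().any(|s| s.last() == Some(c)))
            .collect())
    }
}
```

In this model, a fresh ToyGrammar::root_a() allows only 'a'. Calling accept('b') first, which plays the role of the suspected stray accept, empties the stack set, and the next apply returns Err(ApplyError::EmptyStacks) rather than crashing.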

Workaround (downstream)

For users who hit this in the meantime: build the candidate array yourself from ctx.get_logits_ith(idx) and call arr.apply_sampler(&chain) instead of sampler.sample(&ctx, idx). The grammar sampler itself works fine on synthetic arrays.
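The shape of that manual path can be sketched without the library. The block below is a dependency-free illustration, not llama-cpp-2 code: Candidate, candidates_from_logits, apply_constraint, select_greedy, and sample_manually are hypothetical names, and the allowed predicate stands in for the grammar sampler on the assumption that a constraint step masks rejected tokens by setting their logits to negative infinity. With llama-cpp-2, the corresponding pieces would be the ones named above: ctx.get_logits_ith(idx) for the logits, LlamaTokenDataArray::from_iter for the candidate array, and arr.apply_sampler(&chain) for the constraint and selection steps.

```rust
/// One candidate token: id plus raw logit. Hypothetical stand-in for an
/// entry of a token data array.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Candidate {
    id: u32,
    logit: f32,
}

/// Build one candidate per vocabulary id from a slice of logits.
fn candidates_from_logits(logits: &[f32]) -> Vec<Candidate> {
    logits
        .iter()
        .enumerate()
        .map(|(id, &logit)| Candidate { id: id as u32, logit })
        .collect()
}

/// Constraint step: mask every candidate the predicate rejects by setting
/// its logit to negative infinity (assumed masking behaviour).
fn apply_constraint(cands: &mut [Candidate], allowed: impl Fn(u32) -> bool) {
    for c in cands.iter_mut() {
        if !allowed(c.id) {
            c.logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy step: id of the highest finite logit, or None if every
/// candidate was masked.
fn select_greedy(cands: &[Candidate]) -> Option<u32> {
    cands
        .iter()
        .filter(|c| c.logit.is_finite())
        .max_by(|a, b| a.logit.total_cmp(&b.logit))
        .map(|c| c.id)
}

/// Manual sampling: logits -> candidate array -> constraint -> greedy pick.
fn sample_manually(logits: &[f32], allowed: impl Fn(u32) -> bool) -> Option<u32> {
    let mut cands = candidates_from_logits(logits);
    apply_constraint(&mut cands, allowed);
    select_greedy(&cands)
}
```

For example, sample_manually(&[0.1, 2.5, 1.0, 3.0], |id| id == 1 || id == 2) picks id 1: id 3 has the highest raw logit but is masked by the constraint. Returning None when everything is masked gives the caller a place to handle an unsatisfiable constraint explicitly.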

We're shipping a different workaround (drop grammar + post-validate JSON shape, since our parser is strict) but that's specific to our use case.

Full investigation writeup

Full reproducer binaries + ruled-out hypothesis matrix + decision log:
[github.com/sethclawd-prog/... — happy to share if useful]

Filing here rather than ggml-org/llama.cpp because the user-visible API surface is LlamaSampler::sample(ctx, idx). Cross-link to upstream is fine if maintainers prefer that.
