Commit c68818a

fix(llama-cpp): terminate tensor_buft_overrides with sentinel (#9919)
llama.cpp's model loader asserts back().pattern == nullptr on params.tensor_buft_overrides (and on params.kv_overrides.back().key[0] == 0) before binding them into llama_model_params. PR #8560 attempted to satisfy llama_params_fit's placeholder requirement by pre-filling params.tensor_buft_overrides up to llama_max_tensor_buft_overrides() *before* the option-parse loop. Any subsequent push_back from override_tensor / draft_cpu_moe / draft_n_cpu_moe / draft_override_tensor then appended real entries after the placeholders, leaving back() with a real pattern and tripping the assert. The draft override vector likewise had no terminator at all.

Mirror upstream common/arg.cpp:645-658 instead: real entries are pushed during option parsing, and after parsing we pad the main vector up to ntbo (placeholders land at the end, so back() is always nullptr) and append a single {nullptr, nullptr} to the draft vector when it is non-empty. The existing kv_overrides terminator block already matches upstream and stays.

Verified against ggml-org/llama.cpp@5cbaa5e: only tensor_buft_overrides (main + draft) and kv_overrides are sentinel-terminated common_params fields; everything else is size-driven std::vector.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
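The ordering problem described in the message can be reproduced in isolation. The sketch below is a minimal illustration, not the backend's code: `Override` is a hypothetical stand-in for the real override entry type, the pattern string is made up, and the slot count is passed in as a parameter instead of coming from `llama_max_tensor_buft_overrides()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the override entry; only the pointer
// fields matter for the sentinel check.
struct Override {
    const char* pattern;
    const void* buft;
};

// Pads the vector with {nullptr, nullptr} placeholders up to max_slots.
void pad_to(std::vector<Override>& v, std::size_t max_slots) {
    while (v.size() < max_slots) {
        v.push_back({nullptr, nullptr});
    }
}

// Old order: pad first, then let option parsing push a real entry.
// The real entry lands after the placeholders, so back() is not a sentinel.
std::vector<Override> pad_then_parse(std::size_t max_slots) {
    std::vector<Override> v;
    pad_to(v, max_slots);
    v.push_back({"example_pattern", nullptr});  // illustrative pattern only
    return v;
}

// New order: push real entries first, then pad.
// Placeholders land at the end, so back().pattern is nullptr.
std::vector<Override> parse_then_pad(std::size_t max_slots) {
    std::vector<Override> v;
    v.push_back({"example_pattern", nullptr});  // illustrative pattern only
    pad_to(v, max_slots);
    return v;
}
```

With a slot count of 4, `pad_then_parse` yields five entries ending in a real pattern, while `parse_then_pad` yields four entries ending in a placeholder, which is the shape the assert expects.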
1 parent 11d5bd0 commit c68818a

1 file changed

Lines changed: 14 additions & 6 deletions

backend/cpp/llama-cpp/grpc-server.cpp

@@ -522,12 +522,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
     // n_ctx_checkpoints: max context checkpoints per slot (default: 8)
     params.n_ctx_checkpoints = 8;
 
-    // llama memory fit fails if we don't provide a buffer for tensor overrides
-    const size_t ntbo = llama_max_tensor_buft_overrides();
-    while (params.tensor_buft_overrides.size() < ntbo) {
-        params.tensor_buft_overrides.push_back({nullptr, nullptr});
-    }
-
     // decode options. Options are in form optname:optvale, or if booleans only optname.
     for (int i = 0; i < request->options_size(); i++) {
         std::string opt = request->options(i);
@@ -1081,6 +1075,20 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
         params.kv_overrides.back().key[0] = 0;
     }
 
+    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
+    // Real entries are pushed during option parsing; here we pad/terminate so the
+    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
+    // and so llama_params_fit has the placeholder slots it requires.
+    {
+        const size_t ntbo = llama_max_tensor_buft_overrides();
+        while (params.tensor_buft_overrides.size() < ntbo) {
+            params.tensor_buft_overrides.push_back({nullptr, nullptr});
+        }
+    }
+    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
+        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
+    }
+
     // TODO: Add yarn
 
     if (!request->tensorsplit().empty()) {
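For the draft vector, the hunk above appends one terminator only when real entries exist. A rough sketch of why that is enough for a consumer that receives a bare pointer and walks it until it hits a null pattern is given below. All names here (`Override`, `terminate_if_used`, `as_c_list`, `count_until_sentinel`) are hypothetical and exist only for illustration; the actual loader code is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the override entry type.
struct Override {
    const char* pattern;
    const void* buft;
};

// Appends a single {nullptr, nullptr} terminator, but only when the
// vector already holds real entries (an empty vector stays empty).
void terminate_if_used(std::vector<Override>& v) {
    if (!v.empty()) {
        v.push_back({nullptr, nullptr});
    }
}

// Assumed convention: an empty vector is handed over as a null pointer.
const Override* as_c_list(const std::vector<Override>& v) {
    return v.empty() ? nullptr : v.data();
}

// Counts real entries the way a C-style consumer would: no size is
// available, so it walks until the first entry with a null pattern.
std::size_t count_until_sentinel(const Override* list) {
    if (list == nullptr) {
        return 0;
    }
    std::size_t n = 0;
    while (list[n].pattern != nullptr) {
        ++n;
    }
    return n;
}
```

Without the terminator, the walk in `count_until_sentinel` would read past the end of the vector's storage, which is the kind of failure the assert on `back().pattern` guards against.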
