Closed · Changes from all commits · 65 commits
98bb579
ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing l…
reeselevine Apr 28, 2026
f9f3365
vulkan: Coalesce Q4_K/Q5_K scale loads (#21751)
TheBlueMatt Apr 28, 2026
52e5f0a
common : re-arm reasoning budget after DONE on new <think> (#22323)
BruceJillis Apr 28, 2026
5d56eff
convert : add support for Nemotron Nano 3 Omni (#22481)
danbev Apr 28, 2026
7b8443a
ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (…
lnigam Apr 28, 2026
fc2b005
ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196)
michaelw9999 Apr 28, 2026
739393b
TP: fix delayed AllReduce + zero-sized slices (#22489)
JohannesGaessler Apr 29, 2026
bdc9c74
ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916)
hrushitfujitsu Apr 29, 2026
7b95ea5
common: Intentionally leak logger instance to fix hanging on Windows …
rillomas Apr 29, 2026
d6a5094
ggml-webgpu: Fix bug in FlashAttention support check (#22492)
reeselevine Apr 29, 2026
b5c4227
ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317)
qiurui144 Apr 29, 2026
3142f1d
ggml-cuda: refactor fusion code (#22468)
am17an Apr 29, 2026
1cbc846
ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault …
shalinib-ibm Apr 29, 2026
59237bf
webui: fix slow mic stop and WAV encode (#22480)
ServeurpersoCom Apr 29, 2026
4b221b7
ggml : bump version to 0.10.1 (ggml/1469)
ggerganov Apr 29, 2026
b1d5f5b
sync : ggml
ggerganov Apr 29, 2026
683c5ac
spec : discard last drafted token with low prob (#22506)
ggerganov Apr 29, 2026
098705a
CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478)
anavp-nvidia Apr 29, 2026
41a63be
hexagon: make vmem and buffer-size configurable (#22487)
max-krasnyansky Apr 29, 2026
d775992
common : do not pass prompt tokens to reasoning budget sampler (#22488)
aldehir Apr 29, 2026
b42c7fa
spec : fix vocab compat checks in spec example (#22426)
petersid2022 Apr 30, 2026
80afa33
spec : fix draft model checkpoints (#22521)
ggerganov Apr 30, 2026
4515559
add fast matmul iquants (#22504)
SharmaRithik Apr 30, 2026
27aef3d
scripts : add wc2wt.sh - create worktree from current HEAD (#22513)
ggerganov Apr 30, 2026
e82aaf2
CUDA: fix tile FA kernel on Pascal (#22541)
JohannesGaessler Apr 30, 2026
5f0ab72
vendor : update cpp-httplib to 0.43.2 (#22548)
angt Apr 30, 2026
6118c04
ci : bump ty to 0.0.33 (#22535)
CISC Apr 30, 2026
c20c445
spec: fix argument typo (#22552)
barnjamin Apr 30, 2026
660b1b4
vulkan: add get/set tensor 2d functions (#22514)
0cc4m Apr 30, 2026
beb42ff
common : check for null getpwuid in hf-cache (#22550)
angt Apr 30, 2026
5cbfb18
Update llama-mmap to use ftello/fseeko (#22497)
reeselevine Apr 30, 2026
a95a11e
ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_I…
yomaytk Apr 30, 2026
aab6821
ggml-webgpu: add the upscale shader (#22419)
Constannnnnt May 1, 2026
05e141a
vulkan: Support asymmetric FA in coopmat2 path (#21753)
jeffbolznv May 1, 2026
c3c1505
ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (#22578)
yomaytk May 1, 2026
ab6120c
webui: Spring Cleaning Refactor v1 (#22505)
allozaur May 1, 2026
2098fd6
hexagon: enable non-contiguous row tensor support for unary ops (#22574)
aparmp-quic May 1, 2026
b97ebdc
llama-quant : fix `--tensor-type` when default `qtype` is overriden (…
ddh0 May 1, 2026
1a03cf4
hexagon: hmx flash attention (#22347)
njsyw1997 May 2, 2026
e8ec7ab
ggml : try fix win32 build (whisper/0)
ggerganov May 1, 2026
457e228
sync : ggml
ggerganov May 1, 2026
ed23489
ggml : bump version to 0.10.2 (ggml/1474)
ggerganov May 2, 2026
228e836
sync : ggml
ggerganov May 2, 2026
9dbb372
Github: update issue templates (#22594)
JohannesGaessler May 2, 2026
c5a3bc3
opencl: Adreno optimization for MoE - MxFP4 (#22301)
shawngu-quic May 2, 2026
63d93d1
convert : disable uint types (#18908)
csabakecskemeti May 2, 2026
0929436
ggml-virtgpu: fix circular dependency in headers (#22557)
Juste-Leo2 May 2, 2026
0754b7b
server : avoid checkpoint data host copies (#22558)
ggerganov May 2, 2026
d05fe1d
fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus ent…
lucyknada May 2, 2026
db44417
convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611)
jmrobles May 3, 2026
048a490
convert : Mistral format yarn apply_scale support (#22612)
juliendenize May 3, 2026
e48034d
common : determine generation prompt using longest common prefix (#22…
aldehir May 3, 2026
d4b0c22
ggml-webgpu: add layer norm ops (#22406)
Constannnnnt May 4, 2026
6dcd824
vulkan: delete dead GGML_VK_MAX_NODES def (#22621)
Atomic-Germ May 4, 2026
846262d
docs : update speculative decoding parameters after refactor (#22397)…
ggerganov May 4, 2026
fa8feae
webui: restore missing settings (#22666)
ntowle May 4, 2026
c84e6d6
server: Add a simple get_datetime server tool (#22649)
eapache May 4, 2026
994118a
model: move `load_hparams` and `load_tensors` to per-model definition…
ngxson May 4, 2026
a4701c9
common/autoparser: fixes for newline handling / forced tool calls (#2…
pwilkin May 4, 2026
36a694c
webui : fix circular dependency between chat.service.ts and models.sv…
Juste-Leo2 May 4, 2026
1a4fe4e
llama: allow partial seq_rm for GDN models for speculative decoding
am17an Apr 25, 2026
589490f
add enum for part sequence removal to enable checkpoints
am17an Apr 28, 2026
c5e0227
review: rename rollback to rs_seq and remove public API
am17an Apr 30, 2026
10829db
llama + spec: MTP support
am17an Apr 30, 2026
f8c6b03
add qwen35moe_mtp
am17an May 4, 2026
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/010-bug-compilation.yml
@@ -12,6 +12,8 @@ body:
after recreating the CMake build directory and with `-DGGML_CCACHE=OFF`.
If the compilation succeeds with ccache disabled you should be able to permanently fix the issue
by clearing `~/.cache/ccache` (on Linux).
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: commit
attributes:
4 changes: 3 additions & 1 deletion .github/ISSUE_TEMPLATE/011-bug-results.yml
@@ -1,5 +1,5 @@
name: Bug (model use)
description: Something goes wrong when using a model (in general, not specific to a single llama.cpp module).
description: Something goes wrong when running a model (crashes, garbled outputs, etc.).
title: "Eval bug: "
labels: ["bug-unconfirmed", "model evaluation"]
body:
@@ -12,6 +12,8 @@ body:
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
The `llama-completion` binary can be used for simple and reproducible model inference.

Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: version
attributes:
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/019-bug-misc.yml
@@ -10,6 +10,8 @@ body:
This issue template is intended for miscellaneous bugs that don't fit into any other category.
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: version
attributes:
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/020-enhancement.yml
@@ -8,6 +8,8 @@ body:
value: |
[Please post your idea first in Discussion if there is not yet a consensus for this enhancement request. This will help to keep this issue tracker focused on enhancements that the community has agreed needs to be implemented.](https://github.com/ggml-org/llama.cpp/discussions/categories/ideas)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: checkboxes
id: prerequisites
attributes:
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/030-research.yml
@@ -8,6 +8,8 @@ body:
value: |
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: checkboxes
id: research-stage
attributes:
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/040-refactor.yml
@@ -9,6 +9,8 @@ body:
Don't forget to [check for existing refactor issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Arefactoring) in case it's already covered.
Also you may want to check [Pull request refactor label as well](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Aopen+is%3Apr+label%3Arefactoring) for duplicates too.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: background-description
attributes:
2 changes: 1 addition & 1 deletion .github/workflows/python-type-check.yml
@@ -31,7 +31,7 @@ jobs:
uses: actions/setup-python@v6
with:
python-version: "3.11"
pip-install: -r requirements/requirements-all.txt ty==0.0.26
pip-install: -r requirements/requirements-all.txt ty==0.0.33
# - name: Type-check with Pyright
# uses: jakebailey/pyright-action@v2
# with:
1 change: 1 addition & 0 deletions .pi/gg/SYSTEM.md
@@ -4,6 +4,7 @@ General:
- By very precise and concise when writing code, comments, explanations, etc.
- PR and commit titles format: `<module> : <title>`. Lookup recents for examples
- Don't try to build or run the code unless you are explicitly asked to do so
- Use the `gh` CLI tool when querying PRs, issues, or other GitHub resources

Coding:
- When in doubt, always refer to the CONTRIBUTING.md file of the project
10 changes: 6 additions & 4 deletions common/arg.cpp
@@ -2864,7 +2864,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--tools"}, "TOOL1,TOOL2,...",
"experimental: whether to enable built-in tools for AI agents - do not enable in untrusted environments (default: no tools)\n"
"specify \"all\" to enable all tools\n"
"available tools: read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff",
"available tools: read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff, get_datetime",
[](common_params & params, const std::string & value) {
params.server_tools = parse_csv_row(value);
}
@@ -3380,7 +3380,7 @@ ).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"--spec-draft-poll", "--poll-draft"}, "<0|1>",
"Use polling to wait for draft model work (default: same as --poll])",
"Use polling to wait for draft model work (default: same as --poll)",
[](common_params & params, int value) {
params.speculative.draft.cpuparams.poll = value;
}
@@ -3499,7 +3499,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_LOOKUP, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_SPEC_DRAFT_N_MIN"));

add_opt(common_arg(
{"--spec--draft-p-split", "--draft-p-split"}, "P",
{"--spec-draft-p-split", "--draft-p-split"}, "P",
string_format("speculative decoding split probability (default: %.2f)", (double)params.speculative.draft.p_split),
[](common_params & params, const std::string & value) {
params.speculative.draft.p_split = std::stof(value);
@@ -3562,12 +3562,14 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"--spec-type"}, "[none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]",
{"--spec-type"}, "[none|mtp|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]",
string_format("type of speculative decoding to use when no draft model is provided (default: %s)\n",
common_speculative_type_to_str(params.speculative.type).c_str()),
[](common_params & params, const std::string & value) {
if (value == "none") {
params.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
} else if (value == "mtp") {
params.speculative.type = COMMON_SPECULATIVE_TYPE_MTP;
} else if (value == "ngram-cache") {
params.speculative.type = COMMON_SPECULATIVE_TYPE_NGRAM_CACHE;
} else if (value == "ngram-simple") {
16 changes: 5 additions & 11 deletions common/chat-auto-parser-generator.cpp
@@ -136,10 +136,10 @@ common_peg_parser analyze_reasoning::build_parser(parser_build_context & ctx) co
if (!end.empty()) {
if (!start.empty()) {
// Standard tag-based: optional(<think>reasoning</think>)
return p.optional(start + p.reasoning(p.until(end)) + end + p.space());
return p.optional(p.optspace(start) + p.reasoning(p.until(trim_whitespace(end))) + p.optspace(end));
}
// Delimiter-style (empty start)
return p.optional(p.reasoning(p.until(end)) + end + p.space());
return p.optional(p.reasoning(p.until(trim_whitespace(end))) + p.optspace(end));
}
}

@@ -186,7 +185,6 @@ common_peg_parser analyze_tools::build_parser(parser_build_context & ctx) const
common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;

// Build effective field names with dot notation if function_field is set
std::string name_field = format.name_field;
@@ -225,8 +224,7 @@
tool_start = format.per_call_start;
}

return ctx.reasoning_parser + (force_tools ? p.eps() : p.optional(p.content(p.until(tool_start)))) + tools_parser +
p.end();
return ctx.reasoning_parser + p.optional(p.content(p.until(tool_start))) + tools_parser + p.end();
}

common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p, const std::string & name,
@@ -270,7 +268,6 @@ common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context
common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;

common_peg_parser tool_choice = p.choice();

@@ -336,14 +333,12 @@

std::string trigger_marker = !format.section_start.empty() ? format.section_start : format.per_call_start;
auto content_before_tools = trigger_marker.empty() ? p.eps() : p.until(trigger_marker);
return ctx.reasoning_parser + (force_tools ? p.eps() : p.optional(p.content(content_before_tools))) + tool_calls +
p.end();
return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
}

common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;

auto until_suffix = p.rule("until-suffix", p.until(arguments.value_suffix));

@@ -471,8 +466,7 @@

std::string trigger_marker = !format.section_start.empty() ? format.section_start : format.per_call_start;
auto content_before_tools = trigger_marker.empty() ? p.eps() : p.until(trigger_marker);
return ctx.reasoning_parser + (force_tools ? p.eps() : p.optional(p.content(content_before_tools))) + tool_calls +
p.end();
return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
}

} // namespace autoparser
8 changes: 4 additions & 4 deletions common/chat-diff-analyzer.cpp
@@ -342,7 +342,7 @@ void analyze_reasoning::compare_thinking_enabled() {
if (left_trimmed.empty() && !diff.right.empty()) {
if (!right_trimmed.empty() && string_ends_with(comparison->output_B, right_trimmed)) {
if (start.empty()) {
start = trim_leading_whitespace(diff.right);
start = diff.right;
mode = reasoning_mode::TAG_BASED;
}
}
@@ -353,7 +353,7 @@
if (seg.size() >= 2 && seg[seg.size() - 1].value == left_trimmed && seg[seg.size() - 2].type == segment_type::MARKER) {
start = seg[seg.size() - 2].value;
}
end = trim_trailing_whitespace(diff.left);
end = diff.left;
mode = reasoning_mode::TAG_BASED;
}
}
@@ -445,14 +445,14 @@ void analyze_reasoning::compare_reasoning_scope() {
auto result = parser_wrapped.parse_anywhere_and_extract(comparison->output_B);
if (result.result.success()) {
start = result.tags["pre"];
end = trim_trailing_whitespace(result.tags["post"]);
end = result.tags["post"];
} else {
auto parser_delimiter = build_tagged_peg_parser([&](common_peg_parser_builder &p) {
return p.literal(reasoning_content) + p.space() + p.optional(p.tag("post", (p.marker() + p.space())));
});
result = parser_delimiter.parse_anywhere_and_extract(comparison->output_B);
if (result.result.success()) {
end = trim_trailing_whitespace(result.tags["post"]);
end = result.tags["post"];
} else {
LOG_DBG(ANSI_ORANGE "%s: Unable to extract reasoning markers, falling back to reasoning = NONE\n" ANSI_RESET, __func__);
mode = reasoning_mode::NONE;
26 changes: 26 additions & 0 deletions common/chat-peg-parser.cpp
@@ -816,6 +816,32 @@ common_peg_parser common_chat_peg_builder::prefix(const std::string & s, const s
return literal(s.substr(0, s.rfind(delimiter)));
}

common_peg_parser common_chat_peg_builder::optspace(const std::string & tag) {
auto parser = eps();
size_t end_of_prefix_space = tag.size();
size_t start_of_suffix_space = tag.size();
for (size_t i = 0; i < tag.size(); i++) {
if (!std::isspace(tag[i])) {
end_of_prefix_space = i;
break;
}
}
for (size_t i = tag.size(); i > 0; i--) {
if (!std::isspace(tag[i - 1])) {
start_of_suffix_space = i;
break;
}
}
for (size_t i = 0; i < end_of_prefix_space; i++) {
parser += optional(literal(std::string(1, tag[i])));
}
parser += literal(tag.substr(end_of_prefix_space, start_of_suffix_space - end_of_prefix_space));
for (size_t i = start_of_suffix_space; i < tag.size(); i++) {
parser += optional(literal(std::string(1, tag[i])));
}
return parser;
}

common_peg_parser common_chat_peg_builder::standard_json_tools(
const std::string & section_start,
const std::string & section_end,
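The `optspace` helper added above makes the whitespace around a reasoning tag optional while keeping the tag itself mandatory, so both `<think>` and `\n<think>\n` match. A minimal standalone sketch of the same splitting logic (the `tag_parts`/`split_tag` names are illustrative, not part of the PR's API):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative split: "\n<think>\n" -> optional "\n", mandatory "<think>",
// optional "\n". Mirrors how optspace() scans the tag from both ends.
struct tag_parts {
    std::string lead;   // leading whitespace, matched as optional literals
    std::string core;   // the tag itself, matched as one mandatory literal
    std::string trail;  // trailing whitespace, matched as optional literals
};

static tag_parts split_tag(const std::string & tag) {
    size_t b = 0;
    while (b < tag.size() && std::isspace((unsigned char) tag[b])) {
        b++;
    }
    size_t e = tag.size();
    while (e > b && std::isspace((unsigned char) tag[e - 1])) {
        e--;
    }
    return { tag.substr(0, b), tag.substr(b, e - b), tag.substr(e) };
}
```

In the actual parser each character of `lead` and `trail` becomes an `optional(literal(...))` element, which is why the tag still matches when a model emits it without the surrounding newlines.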
3 changes: 3 additions & 0 deletions common/chat-peg-parser.h
@@ -96,6 +96,9 @@ class common_chat_peg_builder : public common_peg_parser_builder {
// Return a parser that parses the prefix of a string, up to a given delimiter.
common_peg_parser prefix(const std::string & s, const std::string & delimiter = {});

// Return a parser that parses all elements of tag, but leading and trailing spaces are optional
common_peg_parser optspace(const std::string & tag);

// Legacy-compatible helper for building standard JSON tool calls
// Used by tests and manual parsers
// name_key/args_key: JSON key names for function name and arguments
49 changes: 29 additions & 20 deletions common/chat.cpp
@@ -2116,22 +2116,38 @@ std::optional<common_chat_params> common_chat_try_specialized_template(
return std::nullopt;
}

static std::string common_chat_templates_generation_prompt(const common_chat_template & tmpl, const autoparser::generation_params & inputs) {
autoparser::generation_params params = inputs;
params.add_generation_prompt = false;
std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
params.add_generation_prompt = true;
std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);

size_t prefix_len = 0;
size_t min_size = std::min(no_gen_prompt.size(), gen_prompt.size());
while (prefix_len < min_size && no_gen_prompt[prefix_len] == gen_prompt[prefix_len]) {
prefix_len++;
}
return gen_prompt.substr(prefix_len);
}

static common_chat_params common_chat_templates_apply_jinja(const struct common_chat_templates * tmpls,
const struct common_chat_templates_inputs & inputs) {
autoparser::generation_params params;
params.tools = common_chat_tools_to_json_oaicompat(inputs.tools);
const auto & tmpl =
params.tools.is_array() && tmpls->template_tool_use ? *tmpls->template_tool_use : *tmpls->template_default;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = render_message_to_json(inputs.messages, tmpl.original_caps());
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.add_bos = tmpls->add_bos;
params.add_eos = tmpls->add_eos;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = render_message_to_json(inputs.messages, tmpl.original_caps());
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.add_generation_prompt = inputs.add_generation_prompt;
params.add_bos = tmpls->add_bos;
params.add_eos = tmpls->add_eos;

if (src.find("<|channel|>") == std::string::npos) {
// map developer to system for all models except for GPT-OSS
@@ -2153,14 +2169,7 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
workaround::func_args_not_string(params.messages);
}

params.add_generation_prompt = false;
std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
params.add_generation_prompt = true;
std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
auto diff = calculate_diff_split(no_gen_prompt, gen_prompt);
params.generation_prompt = diff.right + diff.suffix;

params.add_generation_prompt = inputs.add_generation_prompt;
params.generation_prompt = common_chat_templates_generation_prompt(tmpl, params);

params.extra_context = common_chat_extra_context();
for (auto el : inputs.chat_template_kwargs) {
@@ -2212,8 +2221,8 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
auto auto_params = autoparser::peg_generator::generate_parser(tmpl, params, autoparser);
auto_params.supports_thinking = autoparser.reasoning.mode != autoparser::reasoning_mode::NONE;
if (auto_params.supports_thinking) {
auto_params.thinking_start_tag = autoparser.reasoning.start;
auto_params.thinking_end_tag = autoparser.reasoning.end;
auto_params.thinking_start_tag = trim_whitespace(autoparser.reasoning.start);
auto_params.thinking_end_tag = trim_whitespace(autoparser.reasoning.end);
}
auto_params.generation_prompt = params.generation_prompt;
common_peg_arena arena;
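The new `common_chat_templates_generation_prompt` helper above isolates the generation prompt by rendering the template twice (with and without `add_generation_prompt`) and keeping only what the second render adds past the longest common prefix. A minimal sketch of that diffing step (`generation_suffix` is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Given the template rendered without and with the generation prompt,
// return only the suffix that add_generation_prompt contributed.
static std::string generation_suffix(const std::string & without_gen,
                                     const std::string & with_gen) {
    const size_t n = std::min(without_gen.size(), with_gen.size());
    size_t i = 0;
    while (i < n && without_gen[i] == with_gen[i]) {
        i++;
    }
    return with_gen.substr(i);
}
```

Factoring this into its own function lets the caller restore `add_generation_prompt` from the inputs instead of toggling and resetting it inline, which is what the removed hunk in `common_chat_templates_apply_jinja` used to do.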
10 changes: 10 additions & 0 deletions common/common.cpp
@@ -1420,6 +1420,11 @@ common_context_seq_rm_type common_context_can_seq_rm(llama_context * ctx) {
goto done;
}

if (llama_n_rs_seq(ctx) > 0) {
res = COMMON_CONTEXT_SEQ_RM_TYPE_PART_BOUNDED;
goto done;
}

// try to remove the last tokens
if (!llama_memory_seq_rm(mem, 0, 1, -1)) {
LOG_WRN("%s: the target context does not support partial sequence removal\n", __func__);
@@ -1490,6 +1495,11 @@ struct llama_context_params common_context_params_to_llama(const common_params &

cparams.n_ctx = params.n_ctx;
cparams.n_seq_max = params.n_parallel;
{
const bool has_spec = (params.speculative.type != COMMON_SPECULATIVE_TYPE_NONE)
|| params.speculative.has_dft();
cparams.n_rs_seq = has_spec ? (uint32_t) params.speculative.draft.n_max : 0u;
}
cparams.n_batch = params.n_batch;
cparams.n_ubatch = params.n_ubatch;
cparams.n_threads = params.cpuparams.n_threads;
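The `common_context_params_to_llama` hunk above reserves rollback-state sequences only when speculative decoding is actually in play, sizing them by `draft.n_max`. The gating rule can be sketched as a pure function (the `rs_seq_count` name and flattened boolean parameters are illustrative; the diff reads them from `params.speculative`):

```cpp
#include <cassert>
#include <cstdint>

// Reserve draft_n_max rollback-state sequences only when some form of
// speculative decoding is enabled: either a non-NONE spec type or a
// configured draft model. Otherwise keep the count at zero.
static uint32_t rs_seq_count(bool has_spec_type, bool has_draft_model, int32_t draft_n_max) {
    const bool has_spec = has_spec_type || has_draft_model;
    return has_spec ? (uint32_t) draft_n_max : 0u;
}
```

Keeping the count at zero in the non-speculative case avoids paying for rollback state that `common_context_can_seq_rm` would otherwise report as available.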