
Commit 469f6e6

s.anastasiou authored and Davide Cifarelli committed
feat(server): --chat-template-file flag for Jinja chat templates
Adds a `--chat-template-file PATH` CLI flag to dflash_server that loads a Jinja chat template from disk and uses it to render the prompt, overriding the hardcoded QWEN3 / LAGUNA renderer in chat_template.cpp.

Why
---
The existing hardcoded Qwen3.5 ChatML template + tool preamble is adequate for plain chat, but it ships with one specific way of telling the model how to emit tool calls (the `<tool_call><function=NAME>` XML format). Real-world Qwen3.6 deployments need template flexibility:

* Community-fine-tuned variants of Qwen3.6 (e.g. froggeric's "Qwen-Fixed-Chat-Templates") publish their own .jinja files. Without --chat-template-file the server can't use them.
* Agentic clients like claude-agent-sdk send tool definitions in Anthropic shape and expect the model to emit tool calls that the server's tool_parser can lift back into Anthropic tool_use blocks. Different templates give the model different XML-format instructions, which directly affects how reliably the model emits well-formed `<tool_call>...</tool_call>` blocks across long, tool-heavy contexts.
* llama.cpp ships ~50 reference templates in models/templates/*.jinja; most users will want to point at one of those rather than write their own hardcoded C++ renderer.

This mirrors llama-server's existing `--jinja --chat-template-file` flow but lives directly in dflash_server.

What
----
1. New `render_chat_template_jinja(template_src, messages, bos, eos, add_generation_prompt, enable_thinking, tools_json)` in chat_template.cpp. Mirrors llama.cpp's common_chat_template_direct_apply_impl: builds a JSON input matching the field names every Jinja chat template expects (messages, tools, bos_token, eos_token, add_generation_prompt, enable_thinking), parses + runs the template, returns the rendered prompt string.
2. Thread-local cache of the most-recently parsed jinja::program keyed on the literal template source. Steady-state cost is one runtime::execute() per request, with no re-lex/re-parse and without introducing global mutable state.
3. The 6 jinja sources from `deps/llama.cpp/common/jinja/` (lexer/parser/runtime/value/string/caps) plus `common/unicode.cpp` (used by jinja's tojson() helper) are pulled into the dflash_common static lib. `deps/llama.cpp/common` is added as a PRIVATE include path. nlohmann_json was already a PUBLIC link dep.
4. New ServerConfig::chat_template_src / chat_template_path fields. server_main.cpp parses `--chat-template-file PATH`, reads the file into memory once at startup, logs the load. http_server.cpp's chat handler routes to render_chat_template_jinja() when the template source is non-empty, falling back to the hardcoded QWEN3/LAGUNA render when it's empty.
5. BOS/EOS strings are pulled from `tokenizer_.raw_token(bos_id())` / `raw_token(eos_id())` rather than decoded: special tokens like `<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2 byte-level decode would otherwise produce mojibake.
6. Render failures (lex/parse/runtime/bad tools JSON) throw std::runtime_error, surfaced as a 500 response on the chat handler.

Verified by
-----------
7 new unit tests in test_server_unit.cpp covering:
- basic message render
- add_generation_prompt off
- tools array injected and accessible via {{ tools[0].name }}
- "[]" tools list correctly treated as empty (no `tools` key in ctx)
- bos_token / eos_token threaded through to template
- empty template_src throws
- malformed tools JSON throws

End-to-end smoke against /v1/messages with the froggeric Qwen3.6 template: a get_weather tool definition + a "what's the weather in Tokyo" prompt produced a proper Anthropic tool_use block (`{"type":"tool_use","name":"get_weather","input":{"city":"Tokyo"}}`).

Files
-----
dflash/CMakeLists.txt                 +16     (jinja sources + include path)
dflash/src/server/chat_template.h     +26     (new fn declaration)
dflash/src/server/chat_template.cpp   +109    (impl + thread-local cache)
dflash/src/server/http_server.h       +6      (ServerConfig fields)
dflash/src/server/http_server.cpp     +35 -3  (dispatch in chat handler)
dflash/src/server/server_main.cpp     +31     (CLI flag + file read)
dflash/test/test_server_unit.cpp      +105    (7 jinja unit tests)
1 parent 41a5bab commit 469f6e6

7 files changed

Lines changed: 328 additions & 3 deletions


dflash/CMakeLists.txt

Lines changed: 16 additions & 0 deletions
@@ -264,6 +264,18 @@ add_library(dflash_common STATIC
     src/server/sse_emitter.cpp
     src/server/prefix_cache.cpp
     src/server/disk_prefix_cache.cpp
+    # ── Jinja chat-template engine (from llama.cpp common/jinja/) ──
+    # Used by render_chat_template_jinja() to support --chat-template-file
+    # in dflash_server. Mirrors llama.cpp's common_chat_template plumbing.
+    # unicode.cpp supplies common_parse_utf8_codepoint() used by jinja's
+    # value.cpp tojson() and is otherwise self-contained.
+    deps/llama.cpp/common/jinja/lexer.cpp
+    deps/llama.cpp/common/jinja/parser.cpp
+    deps/llama.cpp/common/jinja/runtime.cpp
+    deps/llama.cpp/common/jinja/value.cpp
+    deps/llama.cpp/common/jinja/string.cpp
+    deps/llama.cpp/common/jinja/caps.cpp
+    deps/llama.cpp/common/unicode.cpp
 )
 # BSA (Block-Sparse Attention) backs the speculative-prefill drafter scoring
 # path. Default ON so prefill is fast out of the box. Turn OFF if you don't
@@ -452,6 +464,10 @@ target_include_directories(dflash_common
     PRIVATE
         ${DFLASH27B_SRC_INCLUDE_DIRS}
         ${CMAKE_CURRENT_SOURCE_DIR}/deps/llama.cpp/ggml/src
+        # Jinja chat-template engine (lexer/parser/runtime/value/string/caps)
+        # pulled from llama.cpp/common/jinja for --chat-template-file support.
+        # nlohmann_json is already linked PUBLIC (used by jinja/value.cpp).
+        ${CMAKE_CURRENT_SOURCE_DIR}/deps/llama.cpp/common
 )
 if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")
     target_include_directories(dflash_common PRIVATE ${CUDAToolkit_INCLUDE_DIRS})

dflash/src/server/chat_template.cpp

Lines changed: 109 additions & 0 deletions
@@ -2,6 +2,16 @@
 
 #include "chat_template.h"
 
+#include "jinja/lexer.h"
+#include "jinja/parser.h"
+#include "jinja/runtime.h"
+#include "jinja/value.h"
+
+#include <nlohmann/json.hpp>
+
+#include <memory>
+#include <stdexcept>
+
 namespace dflash::common {
 
 // Qwen3.5 tool preamble — matches the official Jinja template exactly.
@@ -155,4 +165,103 @@ std::string render_chat_template(
     return result;
 }
 
+// ─── Jinja path ─────────────────────────────────────────────────────────
+//
+// Render via a Jinja chat template (e.g. froggeric Qwen3.6 template). Each
+// thread caches the most-recently-parsed program for its template source,
+// so steady-state cost is just the runtime execute (parse happens once per
+// thread per template).
+
+namespace {
+
+struct JinjaCache {
+    std::string src;
+    std::shared_ptr<jinja::program> prog;
+};
+
+static thread_local JinjaCache tls_jinja_cache;
+
+static std::shared_ptr<jinja::program> get_or_parse(const std::string & template_src) {
+    if (tls_jinja_cache.prog && tls_jinja_cache.src == template_src) {
+        return tls_jinja_cache.prog;
+    }
+    jinja::lexer lex;
+    jinja::lexer_result lex_res;
+    try {
+        lex_res = lex.tokenize(template_src);
+    } catch (const std::exception & e) {
+        throw std::runtime_error(std::string("jinja lexer: ") + e.what());
+    }
+    auto prog = std::make_shared<jinja::program>(jinja::parse_from_tokens(lex_res));
+    tls_jinja_cache.src = template_src;
+    tls_jinja_cache.prog = prog;
+    return prog;
+}
+
+} // namespace
+
+std::string render_chat_template_jinja(
+    const std::string & template_src,
+    const std::vector<ChatMessage> & messages,
+    const std::string & bos_token,
+    const std::string & eos_token,
+    bool add_generation_prompt,
+    bool enable_thinking,
+    const std::string & tools_json)
+{
+    if (template_src.empty()) {
+        throw std::runtime_error("render_chat_template_jinja: template_src is empty");
+    }
+
+    auto prog = get_or_parse(template_src);
+
+    // Build the JSON input that mirrors llama.cpp's
+    // common_chat_template_direct_apply_impl. Field names must match the
+    // names the Jinja templates expect (messages, tools, bos_token,
+    // eos_token, add_generation_prompt, enable_thinking).
+    nlohmann::ordered_json messages_j = nlohmann::ordered_json::array();
+    for (const auto & m : messages) {
+        nlohmann::ordered_json mj;
+        mj["role"] = m.role;
+        mj["content"] = m.content;
+        if (!m.tool_call_id.empty()) {
+            mj["tool_call_id"] = m.tool_call_id;
+        }
+        messages_j.push_back(std::move(mj));
+    }
+
+    nlohmann::ordered_json inputs;
+    inputs["messages"] = std::move(messages_j);
+    inputs["bos_token"] = bos_token;
+    inputs["eos_token"] = eos_token;
+    inputs["add_generation_prompt"] = add_generation_prompt;
+    inputs["enable_thinking"] = enable_thinking;
+
+    bool has_tools = !tools_json.empty() && tools_json != "[]" && tools_json != "null";
+    if (has_tools) {
+        try {
+            inputs["tools"] = nlohmann::ordered_json::parse(tools_json);
+        } catch (const std::exception & e) {
+            throw std::runtime_error(
+                std::string("render_chat_template_jinja: failed to parse tools JSON: ") + e.what());
+        }
+    }
+
+    jinja::context ctx(template_src);
+    try {
+        jinja::global_from_json(ctx, inputs, /*mark_input=*/false);
+    } catch (const std::exception & e) {
+        throw std::runtime_error(std::string("jinja global_from_json: ") + e.what());
+    }
+
+    try {
+        jinja::runtime rt(ctx);
+        jinja::value results = rt.execute(*prog);
+        auto parts = jinja::runtime::gather_string_parts(results);
+        return parts->as_string().str();
+    } catch (const std::exception & e) {
+        throw std::runtime_error(std::string("jinja runtime: ") + e.what());
+    }
+}
+
 } // namespace dflash::common
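The per-thread, single-entry cache used by `get_or_parse()` can be illustrated in isolation. The following is a minimal sketch, not the code from this commit: `ParsedTemplate`, `parse_template`, `parse_count`, and `get_or_parse_sketch` are hypothetical stand-ins for `jinja::program` and the lexer/parser calls, so the example compiles without llama.cpp.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a parsed template (the commit caches jinja::program).
struct ParsedTemplate {
    std::string source;
};

// Per-thread parse counter, only here to make the caching effect observable.
static thread_local int parse_count = 0;

// Hypothetical "expensive" parse step standing in for lex + parse.
static std::shared_ptr<ParsedTemplate> parse_template(const std::string & src) {
    ++parse_count;
    return std::make_shared<ParsedTemplate>(ParsedTemplate{src});
}

// One cache slot per thread: the last source seen and its parsed form.
struct SingleEntryCache {
    std::string src;
    std::shared_ptr<ParsedTemplate> parsed;
};

static thread_local SingleEntryCache tls_cache;

// Returns the cached parse when the source is unchanged on this thread,
// otherwise parses again and overwrites the single slot.
std::shared_ptr<ParsedTemplate> get_or_parse_sketch(const std::string & src) {
    if (tls_cache.parsed && tls_cache.src == src) {
        return tls_cache.parsed;
    }
    auto parsed = parse_template(src);
    tls_cache.src = src;
    tls_cache.parsed = parsed;
    return parsed;
}
```

Because the slot is `thread_local`, no locking is needed, at the cost of one parse per worker thread. A server that loads exactly one template at startup always hits the cache after the first request on each thread.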

dflash/src/server/chat_template.h

Lines changed: 26 additions & 0 deletions
@@ -49,4 +49,30 @@ std::string render_chat_template(
 // Detect the appropriate chat format for an architecture.
 ChatFormat chat_format_for_arch(const std::string & arch);
 
+// Render chat messages via a Jinja chat template (e.g. froggeric Qwen3.6
+// template, or any of the llama.cpp models/templates/*.jinja files).
+//
+// Mirrors llama.cpp's common_chat_template_direct_apply: parses the template
+// once per thread, converts inputs to jinja values, runs the program, returns
+// the rendered prompt string.
+//
+//   `template_src`  literal Jinja source (read from --chat-template-file)
+//   `bos_token`,
+//   `eos_token`     passed through to the template (Qwen3.6 templates may use
+//                   {{bos_token}} / {{eos_token}}). Use empty strings if unknown.
+//   `tools_json`    optional JSON array of tool definitions; unless it is
+//                   empty, "[]" or "null", it is parsed and injected as
+//                   `tools` into the template context.
+//
+// Internally caches the most recently parsed program per thread (avoids
+// re-parsing the template on every request). Throws std::runtime_error on
+// lexer/parser/runtime failure (caller should surface a 500 response).
+std::string render_chat_template_jinja(
+    const std::string & template_src,
+    const std::vector<ChatMessage> & messages,
+    const std::string & bos_token,
+    const std::string & eos_token,
+    bool add_generation_prompt = true,
+    bool enable_thinking = false,
+    const std::string & tools_json = "");
+
 } // namespace dflash::common
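The rule for when `tools_json` ends up in the template context can be read off the implementation as a small predicate. The sketch below restates that check on its own; `has_tools_sketch` is a hypothetical name, and the real code then hands the string to the JSON parser.

```cpp
#include <cassert>
#include <string>

// Restates the string-level check used before parsing tools_json:
// an empty string, the literal "[]" and the literal "null" all mean
// "no tools", so no `tools` key is added to the template context.
bool has_tools_sketch(const std::string & tools_json) {
    return !tools_json.empty() && tools_json != "[]" && tools_json != "null";
}
```

The comparison is on the exact string, so a differently spelled empty array such as `[ ]` passes the check and would reach the JSON parser as an empty `tools` array. In the chat handler the string comes from `req.tools.dump()`, which should produce the compact `[]` spelling for an empty array.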

dflash/src/server/http_server.cpp

Lines changed: 35 additions & 3 deletions
@@ -439,9 +439,41 @@ bool HttpServer::route_request(int fd, const HttpRequest & hr) {
             tools_json = req.tools.dump();
         }
 
-        std::string rendered = render_chat_template(chat_msgs, chat_format_,
-                                                    true, enable_thinking,
-                                                    tools_json);
+        std::string rendered;
+        if (!config_.chat_template_src.empty()) {
+            // Jinja path: caller supplied a chat template file via
+            // --chat-template-file. Override the hardcoded QWEN3/LAGUNA
+            // renderer. Used for tool-using agents that need the Anthropic
+            // tool_use envelope (e.g. froggeric Qwen3.6 template).
+            //
+            // Special tokens like <|im_start|> / <|im_end|> are stored
+            // verbatim in the GGUF vocab — use raw_token() to skip the
+            // GPT-2 byte decode (otherwise <0xC4><0x91> nonsense appears).
+            const std::string & bos_str = (tokenizer_.bos_id() >= 0)
+                ? tokenizer_.raw_token(tokenizer_.bos_id())
+                : std::string();
+            const std::string & eos_str = (tokenizer_.eos_id() >= 0)
+                ? tokenizer_.raw_token(tokenizer_.eos_id())
+                : std::string();
+            try {
+                rendered = render_chat_template_jinja(
+                    config_.chat_template_src,
+                    chat_msgs,
+                    bos_str,
+                    eos_str,
+                    /*add_generation_prompt=*/true,
+                    enable_thinking,
+                    tools_json);
+            } catch (const std::exception & e) {
+                send_error(fd, 500,
+                           std::string("chat template (jinja) render failed: ") + e.what());
+                return true;
+            }
+        } else {
+            rendered = render_chat_template(chat_msgs, chat_format_,
+                                            true, enable_thinking,
+                                            tools_json);
+        }
         req.prompt_tokens = tokenizer_.encode(rendered);
 
         // Detect if prompt ends with <think> (model will start in reasoning mode).
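The dispatch in the chat handler reduces to two rules: an empty template source selects the built-in renderer, and any exception from the Jinja path becomes a 500. The following is a minimal sketch of that control flow under stated assumptions: `RenderOutcome`, `Renderer`, and `render_with_fallback` are hypothetical names, and the two renderers are passed in as callables so the example needs neither the tokenizer nor the Jinja engine.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical result type: an HTTP-like status plus either the rendered
// prompt (status 200) or an error message (status 500).
struct RenderOutcome {
    int status;
    std::string body;
};

using Renderer = std::function<std::string()>;

// Chooses the Jinja renderer when a template source was configured and
// converts any exception it throws into a 500 outcome. With no template
// configured, the built-in renderer is used and is not wrapped.
RenderOutcome render_with_fallback(const std::string & template_src,
                                   const Renderer & jinja_render,
                                   const Renderer & builtin_render) {
    if (template_src.empty()) {
        return {200, builtin_render()};
    }
    try {
        return {200, jinja_render()};
    } catch (const std::exception & e) {
        return {500, std::string("chat template (jinja) render failed: ") + e.what()};
    }
}
```

Note that a Jinja failure does not fall back to the built-in renderer. The request fails, which keeps a broken template visible instead of silently changing the prompt format.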

dflash/src/server/http_server.h

Lines changed: 6 additions & 0 deletions
@@ -64,6 +64,12 @@ struct ServerConfig {
     int disk_cache_min_tokens = 512;             // only persist >= this many tokens
     int disk_cache_continued_interval = 10240;   // continued checkpoint every N tokens
     int disk_cache_cold_max_tokens = 10240;      // cold prefix for prompts longer than this
+
+    // Optional Jinja chat template (overrides the hardcoded ChatFormat::QWEN3
+    // / LAGUNA renderer when non-empty). Used for tool-using agents that need
+    // the Anthropic tool_use envelope, e.g. froggeric Qwen3.6 template.
+    std::string chat_template_src;   // literal Jinja source (loaded from file)
+    std::string chat_template_path;  // path it was loaded from (logged at startup)
 };
 
 // ─── Parsed request ─────────────────────────────────────────────────────
// ─── Parsed request ─────────────────────────────────────────────────────

dflash/src/server/server_main.cpp

Lines changed: 31 additions & 0 deletions
@@ -76,6 +76,13 @@ static void print_usage(const char * prog) {
         "  --kv-cache-min-tokens <N>    Min tokens to persist (default: 512)\n"
         "  --kv-cache-interval <N>      Continued checkpoint every N tokens (default: 10240)\n"
         "  --kv-cache-cold-max <N>      Cold prefix for prompts longer than N tokens (default: 10240)\n"
+        "\n"
+        "Chat template (optional, e.g. froggeric Qwen3.6 template for tool-using\n"
+        "agents that need the Anthropic tool_use envelope):\n"
+        "  --chat-template-file <path>  Load a Jinja chat template file.\n"
+        "                               Overrides the hardcoded Qwen3/Laguna\n"
+        "                               renderer. Omit the flag to use the\n"
+        "                               hardcoded template.\n"
         "\n", prog);
 }
 
@@ -143,6 +150,30 @@ int main(int argc, char ** argv) {
             sconfig.pflash_skip_park = true;
         } else if (std::strcmp(argv[i], "--lazy-draft") == 0) {
             sconfig.lazy_draft = true;
+        } else if (std::strcmp(argv[i], "--chat-template-file") == 0 && i + 1 < argc) {
+            const char * path = argv[++i];
+            std::FILE * f = std::fopen(path, "rb");
+            if (!f) {
+                std::fprintf(stderr, "[server] --chat-template-file: cannot open '%s'\n", path);
+                return 1;
+            }
+            std::fseek(f, 0, SEEK_END);
+            long n = std::ftell(f);
+            std::fseek(f, 0, SEEK_SET);
+            if (n <= 0) {
+                std::fclose(f);
+                std::fprintf(stderr, "[server] --chat-template-file: empty file '%s'\n", path);
+                return 1;
+            }
+            sconfig.chat_template_src.resize((size_t)n);
+            size_t got = std::fread(sconfig.chat_template_src.data(), 1, (size_t)n, f);
+            std::fclose(f);
+            if (got != (size_t)n) {
+                std::fprintf(stderr, "[server] --chat-template-file: short read on '%s'\n", path);
+                return 1;
+            }
+            sconfig.chat_template_path = path;
+            std::fprintf(stderr, "[server] loaded chat template from %s (%ld bytes)\n", path, n);
         } else if (std::strcmp(argv[i], "--kv-cache-dir") == 0 && i + 1 < argc) {
             sconfig.disk_cache_dir = argv[++i];
         } else if (std::strcmp(argv[i], "--kv-cache-budget") == 0 && i + 1 < argc) {
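The startup read above uses `fopen`/`fseek`/`ftell`/`fread`. For comparison, here is a sketch of the same "read the whole file once, reject missing or empty files" behaviour with standard streams. It is an alternative formulation, not the commit's code: `read_template_file` is a hypothetical helper, and it reports failures by throwing rather than by printing and returning 1 from `main`.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: loads an entire template file into a string.
// Throws std::runtime_error if the file cannot be opened or is empty,
// mirroring the two error cases the CLI handler distinguishes.
std::string read_template_file(const std::string & path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open '" + path + "'");
    }
    std::ostringstream buf;
    buf << in.rdbuf();  // copy the whole stream into memory
    std::string src = buf.str();
    if (src.empty()) {
        throw std::runtime_error("empty file '" + path + "'");
    }
    return src;
}
```

Opening in binary mode matches the `"rb"` mode used by the flag handler, so the template source is kept byte-for-byte, including any `\r\n` line endings.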
