feat(server): --chat-template-file flag for Jinja chat templates by sanastasiou · Pull Request #248 · Luce-Org/lucebox-hub

sanastasiou · 2026-05-21T22:24:45Z

Why

The existing hardcoded Qwen3.5 ChatML template + tool preamble in `chat_template.cpp` is adequate for plain chat but ships with one specific way of telling the model how to emit tool calls (the `<tool_call><function=NAME>` XML format). Real Qwen3.6 deployments need template flexibility:

- Community fine-tuned variants of Qwen3.6 (e.g. froggeric's Qwen-Fixed-Chat-Templates) publish their own `.jinja` files with refined tool-use instructions. Without `--chat-template-file`, dflash_server can't use them.
- Agentic clients like claude-agent-sdk send tool definitions in Anthropic shape and expect the model to emit tool calls that the server's `tool_parser` can lift back into Anthropic `tool_use` blocks. Different templates give the model different XML-format instructions, which directly affects how reliably the model emits well-formed `<tool_call>...</tool_call>` blocks across long, tool-heavy contexts.
- `llama.cpp` ships ~50 reference templates in `models/templates/*.jinja`; most users will want to point at one of those rather than write a hardcoded C++ renderer.

This mirrors `llama-server`'s existing `--jinja --chat-template-file` flow but lives directly in `dflash_server` so users don't have to layer two binaries.

What

1. New `render_chat_template_jinja(template_src, messages, bos, eos, add_generation_prompt, enable_thinking, tools_json)` in `chat_template.cpp`. Mirrors `llama.cpp`'s `common_chat_template_direct_apply_impl`: builds a JSON input matching the field names every Jinja chat template expects (`messages`, `tools`, `bos_token`, `eos_token`, `add_generation_prompt`, `enable_thinking`), parses and runs the template, and returns the rendered prompt string.
2. Thread-local cache of the most-recently parsed `jinja::program`, keyed on the literal template source. Steady-state cost is one `runtime::execute()` per request, with no re-lex/re-parse and no global mutable state.
3. Build wiring: the 6 jinja sources from `deps/llama.cpp/common/jinja/` (`lexer/parser/runtime/value/string/caps`) plus `common/unicode.cpp` (`common_parse_utf8_codepoint`, used by jinja's `tojson()` helper) are added to the `dflash_common` static lib. `deps/llama.cpp/common` is added as a `PRIVATE` include path. `nlohmann_json` is already a `PUBLIC` link dep.
4. CLI + ServerConfig: `server_main.cpp` parses `--chat-template-file PATH`, reads the file into memory once at startup, stores it on `ServerConfig::chat_template_src`, and logs the load. `http_server.cpp`'s chat handler routes to `render_chat_template_jinja()` when the source is non-empty, falling back to the hardcoded QWEN3/LAGUNA render when it's empty.
5. BOS/EOS handling: the strings are pulled from `tokenizer_.raw_token(bos_id())` / `raw_token(eos_id())` rather than `token_text()`, because special tokens like `<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2 byte-level decode would otherwise produce mojibake.
6. Error handling: lex/parse/runtime/bad-tools-JSON failures throw `std::runtime_error`, surfaced as a 500 response on the chat handler with the underlying error message.
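For reviewers who want a picture of the thread-local cache, here is a minimal sketch of the idea, not the code in this PR. `ParsedTemplate` and `cached_parse` are hypothetical names, and the `parse` callback stands in for the real lex + parse step that produces a `jinja::program`.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for a parsed template (the PR caches a jinja::program).
struct ParsedTemplate {
    std::string source;
};

// Returns the parsed form of `src`, calling `parse` only when the source text
// differs from the one most recently parsed on the calling thread.
const ParsedTemplate& cached_parse(
    const std::string& src,
    const std::function<ParsedTemplate(const std::string&)>& parse) {
    // One cache entry per thread: no locking and no shared mutable state.
    thread_local std::string cached_src;
    thread_local ParsedTemplate cached_prog;
    thread_local bool has_entry = false;

    if (!has_entry || cached_src != src) {
        cached_prog = parse(src);  // if parse throws, the old entry is kept
        cached_src = src;
        has_entry = true;
    }
    return cached_prog;
}
```

With a single template loaded at startup, every request after the first on a given thread takes the cache-hit path and only pays for execution.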

Usage

```bash
./dflash_server /path/to/target.gguf \
--draft /path/to/draft.gguf \
--chat-template-file /path/to/qwen3.6-froggeric.jinja \
--port 18080 ...
```

If `--chat-template-file` is omitted, behavior is identical to today (hardcoded QWEN3/LAGUNA renderer).
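The fallback and error path can be pictured with the following sketch. It is an illustration under assumptions, not the handler in `http_server.cpp`: `ChatResponse`, `Renderer`, and `render_prompt` are hypothetical names, and the two callbacks stand in for `render_chat_template_jinja()` and the hardcoded renderer.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical response type: HTTP status plus a body.
struct ChatResponse {
    int status;
    std::string body;
};

using Renderer = std::function<std::string()>;

// Picks the Jinja path when a template source was loaded, otherwise the
// hardcoded path. A render failure becomes a 500 carrying the error message.
ChatResponse render_prompt(const std::string& chat_template_src,
                           const Renderer& render_jinja,
                           const Renderer& render_hardcoded) {
    try {
        const std::string prompt =
            chat_template_src.empty() ? render_hardcoded() : render_jinja();
        return {200, prompt};
    } catch (const std::runtime_error& e) {
        return {500, e.what()};
    }
}
```

In the real server the 200 path would go on to run generation; the sketch returns the prompt only to keep the dispatch visible.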

Test plan

7 new unit tests in `test_server_unit.cpp`:
- basic message render (system + user + assistant turn prefix)
- `add_generation_prompt=false` suppresses the trailing assistant turn
- tools array injected and accessible via `{{ tools[0].name }}`
- `"[]"` tools list correctly treated as empty (no `tools` key in ctx)
- `bos_token` / `eos_token` threaded through to template
- empty `template_src` throws
- malformed tools JSON throws
End-to-end smoke against `/v1/messages` with the froggeric Qwen3.6 template + a get_weather tool definition + a "what's the weather in Tokyo" prompt → response contains a proper Anthropic tool_use block (`{"type":"tool_use","name":"get_weather","input":{"city":"Tokyo"}}`).
All existing tests still pass.

Files

```
dflash/CMakeLists.txt +16 (jinja sources + include path)
dflash/src/server/chat_template.h +26 (new fn declaration)
dflash/src/server/chat_template.cpp +109 (impl + thread-local cache)
dflash/src/server/http_server.h +6 (ServerConfig fields)
dflash/src/server/http_server.cpp +37 (dispatch in chat handler)
dflash/src/server/server_main.cpp +31 (CLI flag + file read)
dflash/test/test_server_unit.cpp +105 (7 jinja unit tests)
```

Design notes / open questions

- Layering: the jinja sources are compiled directly into `dflash_common`. An alternative would be to add a separate `dflash_jinja` static lib that `dflash_common` links to (cleaner dependency graph, ~unchanged build time). Happy to refactor if maintainers prefer that shape.
- Thread-local cache: a single template entry per thread is correct for the common single-template case. If a future use case wants multiple templates per process (e.g. multiple models with different jinja), this can become a small LRU.
- Independent of PR #247 (fix(server): emit Anthropic tool_use content blocks (non-stream + stream)): this PR doesn't depend on #247, the Anthropic tool_use serialization fix, but the two together are what makes dflash_server a viable backend for Anthropic SDK clients. The two can be reviewed and merged independently.
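If the multi-template case ever comes up, the "small LRU" mentioned above could look roughly like this. It is a sketch only and nothing like it exists in this PR; `TemplateLru` and `ParsedTemplate` are hypothetical names, and one instance per thread is assumed so no locking is shown.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a parsed template (e.g. a jinja::program).
struct ParsedTemplate {
    std::string source;
};

// Small least-recently-used cache keyed on the literal template source.
class TemplateLru {
public:
    explicit TemplateLru(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached entry or nullptr on a miss; a hit becomes most recent.
    const ParsedTemplate* get(const std::string& src) {
        auto it = index_.find(src);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used when full.
    void put(const std::string& src, ParsedTemplate prog) {
        if (capacity_ == 0) return;
        auto it = index_.find(src);
        if (it != index_.end()) {
            it->second->second = std::move(prog);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(src, std::move(prog));
        index_[src] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<std::string, ParsedTemplate>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

A capacity of 1 behaves like the single-entry cache, so the current single-template behavior would be a special case.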

cubic-dev-ai

1 issue found across 7 files

sanastasiou · 2026-05-21T22:32:08Z

Thanks @cubic-dev-ai for catching this — issue addressed in fa86c97c (or whatever the new commit hash is, see latest push).

The CLI parser now matches the help text: a 0-byte --chat-template-file is treated as "no template", logs a warning, and falls through to the hardcoded QWEN3/LAGUNA renderer instead of aborting startup. Short-read errors (partial file) still abort because that's an integrity issue, not an empty-file case.

cubic-dev-ai · 2026-05-21T22:32:17Z

@sanastasiou Nice, that looks right. The commit message and code both match the behavior now: zero-byte file → warning + empty chat_template_src → falls through to the hardcoded renderer, while short reads (partial file) still hard-abort since that's a genuine I/O issue. Clean fix.

Adds a `--chat-template-file PATH` CLI flag to dflash_server that loads a Jinja chat template from disk and uses it to render the prompt, overriding the hardcoded QWEN3 / LAGUNA renderer in chat_template.cpp.

Why
---

The existing hardcoded Qwen3.5 ChatML template + tool preamble is adequate for plain chat but it ships with one specific way of telling the model how to emit tool calls (the `<tool_call><function=NAME>` XML format). Real-world Qwen3.6 deployments need template flexibility:

* Community-fine-tuned variants of Qwen3.6 (e.g. froggeric's "Qwen-Fixed-Chat-Templates") publish their own .jinja files. Without --chat-template-file the server can't use them.
* Agentic clients like claude-agent-sdk send tool definitions in Anthropic shape and expect the model to emit tool calls that the server's tool_parser can lift back into Anthropic tool_use blocks. Different templates give the model different XML-format instructions, which directly affects how reliably the model emits well-formed `<tool_call>...</tool_call>` blocks across long, tool-heavy contexts.
* llama.cpp ships ~50 reference templates in models/templates/*.jinja; most users will want to point at one of those rather than write their own hardcoded C++ renderer.

This mirrors llama-server's existing `--jinja --chat-template-file` flow but lives directly in dflash_server.

What
----

1. New `render_chat_template_jinja(template_src, messages, bos, eos, add_generation_prompt, enable_thinking, tools_json)` in chat_template.cpp. Mirrors llama.cpp's common_chat_template_direct_apply_impl: builds a JSON input matching the field names every Jinja chat template expects (messages, tools, bos_token, eos_token, add_generation_prompt, enable_thinking), parses + runs the template, returns the rendered prompt string.
2. Thread-local cache of the most-recently parsed jinja::program keyed on the literal template source. Steady-state cost is one runtime::execute() per request, with no re-lex/re-parse and no global mutable state.
3. The 6 jinja sources from `deps/llama.cpp/common/jinja/` (lexer/parser/runtime/value/string/caps) plus `common/unicode.cpp` (used by jinja's tojson() helper) are pulled into the dflash_common static lib. `deps/llama.cpp/common` is added as a PRIVATE include path. nlohmann_json was already a PUBLIC link dep.
4. New ServerConfig::chat_template_src / chat_template_path fields. server_main.cpp parses `--chat-template-file PATH`, reads the file into memory once at startup, logs the load. http_server.cpp's chat handler routes to render_chat_template_jinja() when the template source is non-empty, falling back to the hardcoded QWEN3/LAGUNA render when it's empty.
5. BOS/EOS strings are pulled from `tokenizer_.raw_token(bos_id())` / `raw_token(eos_id())` rather than decoded, because special tokens like `<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2 byte-level decode would otherwise produce mojibake.
6. Render failures (lex/parse/runtime/bad tools JSON) throw std::runtime_error, surfaced as a 500 response on the chat handler.

Verified by
-----------

7 new unit tests in test_server_unit.cpp covering:

- basic message render
- add_generation_prompt off
- tools array injected and accessible via {{ tools[0].name }}
- "[]" tools list correctly treated as empty (no `tools` key in ctx)
- bos_token / eos_token threaded through to template
- empty template_src throws
- malformed tools JSON throws

End-to-end smoke against /v1/messages with the froggeric Qwen3.6 template: a get_weather tool definition + a "what's the weather in Tokyo" prompt produced a proper Anthropic tool_use block (`{"type":"tool_use","name":"get_weather","input":{"city":"Tokyo"}}`).

Files
-----

    dflash/CMakeLists.txt               +16   (jinja sources + include path)
    dflash/src/server/chat_template.h   +26   (new fn declaration)
    dflash/src/server/chat_template.cpp +109  (impl + thread-local cache)
    dflash/src/server/http_server.h     +6    (ServerConfig fields)
    dflash/src/server/http_server.cpp   +37   (dispatch in chat handler)
    dflash/src/server/server_main.cpp   +31   (CLI flag + file read)
    dflash/test/test_server_unit.cpp    +105  (7 jinja unit tests)

The usage text added in the previous commit promised:

    --chat-template-file <path>  Load a Jinja chat template file. Overrides the hardcoded Qwen3/Laguna renderer. Empty or missing falls back to the hardcoded template.

... but the CLI parser was aborting startup with `return 1` whenever the file length was <= 0, contradicting the "empty falls back" half of the promise.

Behavior change: when --chat-template-file points at a 0-byte file we now log a warning and leave ServerConfig::chat_template_src empty, so http_server.cpp's chat handler falls through to render_chat_template() (the hardcoded QWEN3/LAGUNA path) as documented. Non-empty files are unchanged; short-read errors still abort.

This makes scripted launches resilient to a transient empty template file (e.g. a half-written sed pipe, a checked-out-but-not-populated template path): the server starts and serves with the hardcoded template instead of refusing to come up.

Identified by cubic.
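The startup behavior described in this commit can be sketched as below. This is an illustration, not the code in `server_main.cpp`: `LoadResult` and `load_chat_template` are hypothetical names, the log wording is made up, and treating a file that cannot be opened as a hard error is an assumption (the commit only specifies the 0-byte and short-read cases).

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>

enum class LoadResult { Loaded, EmptyFallback, Error };

// Reads the whole template file into out_src.
//   0-byte file -> warning, out_src stays empty (caller uses hardcoded renderer)
//   short read  -> Error (caller aborts startup)
LoadResult load_chat_template(const std::string& path, std::string& out_src) {
    out_src.clear();
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {  // assumption: an unreadable path is treated as fatal
        std::cerr << "error: cannot open chat template: " << path << "\n";
        return LoadResult::Error;
    }
    const std::streamoff len = in.tellg();
    if (len < 0) {
        std::cerr << "error: cannot determine size of: " << path << "\n";
        return LoadResult::Error;
    }
    if (len == 0) {
        std::cerr << "warning: chat template is empty, using hardcoded template\n";
        return LoadResult::EmptyFallback;
    }
    std::string buf(static_cast<std::size_t>(len), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&buf[0], len);
    if (in.gcount() != len) {  // partial file: integrity problem, not "empty"
        std::cerr << "error: short read on chat template: " << path << "\n";
        return LoadResult::Error;
    }
    out_src = std::move(buf);
    return LoadResult::Loaded;
}
```

A caller would return a non-zero exit code only on `Error`, and continue with an empty template source on `EmptyFallback`.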

davide221 · 2026-05-22T10:00:16Z

Thanks for the contribution @sanastasiou

sanastasiou · 2026-05-22T10:02:58Z

@davide221 you're welcome. I was able to solve real coding challenges with these new additions while using claude-code as the harness and Qwen as the backend, driven by lucebox. Unfortunately, after a while something apparently fills up (not sure if it's dflash or the context) and it crashes :D. Also, the speculative decode drops to 40% or less when the context is above 50k or so.

cubic-dev-ai Bot reviewed May 21, 2026

Comment thread dflash/src/server/server_main.cpp Outdated

s.anastasiou added 2 commits May 22, 2026 11:50

davide221 force-pushed the feat/chat-template-file-jinja branch from b7ad386 to 8d6ad73 Compare May 22, 2026 09:52

davide221 merged commit 0f9ac25 into Luce-Org:main May 22, 2026
1 of 2 checks passed

davide221 merged 2 commits into Luce-Org:main from sanastasiou:feat/chat-template-file-jinja
