feat(server): --chat-template-file flag for Jinja chat templates #248

Merged
davide221 merged 2 commits into Luce-Org:main from sanastasiou:feat/chat-template-file-jinja on May 22, 2026

@sanastasiou
Contributor

Why

The existing hardcoded Qwen3.5 ChatML template + tool preamble in `chat_template.cpp` is adequate for plain chat but ships with one specific way of telling the model how to emit tool calls (the `<tool_call><function=NAME>` XML format). Real Qwen3.6 deployments need template flexibility:

  • Community fine-tuned variants of Qwen3.6 (e.g. froggeric's Qwen-Fixed-Chat-Templates) publish their own `.jinja` files with refined tool-use instructions. Without `--chat-template-file`, dflash_server can't use them.
  • Agentic clients like claude-agent-sdk send tool definitions in Anthropic shape and expect the model to emit tool calls that the server's `tool_parser` can lift back into Anthropic `tool_use` blocks. Different templates give the model different XML-format instructions, which directly affects how reliably the model emits well-formed `<tool_call>...</tool_call>` blocks across long, tool-heavy contexts.
  • `llama.cpp` ships ~50 reference templates in `models/templates/*.jinja` — most users will want to point at one of those rather than write a hardcoded C++ renderer.

This mirrors `llama-server`'s existing `--jinja --chat-template-file` flow but lives directly in `dflash_server` so users don't have to layer two binaries.

What

  1. New `render_chat_template_jinja(template_src, messages, bos, eos, add_generation_prompt, enable_thinking, tools_json)` in `chat_template.cpp`. Mirrors `llama.cpp`'s `common_chat_template_direct_apply_impl`: builds a JSON input matching the field names every Jinja chat template expects (`messages`, `tools`, `bos_token`, `eos_token`, `add_generation_prompt`, `enable_thinking`), parses + runs the template, returns the rendered prompt string.

  2. Thread-local cache of the most-recently parsed `jinja::program` keyed on the literal template source. Steady-state cost is one `runtime::execute()` per request — no re-lex/re-parse — without introducing global mutable state.

  3. Build wiring — the 6 jinja sources from `deps/llama.cpp/common/jinja/` (`lexer/parser/runtime/value/string/caps`) plus `common/unicode.cpp` (`common_parse_utf8_codepoint` used by jinja's `tojson()` helper) are added to the `dflash_common` static lib. `deps/llama.cpp/common` is added as a `PRIVATE` include path. `nlohmann_json` is already a `PUBLIC` link dep.

  4. CLI + ServerConfig — `server_main.cpp` parses `--chat-template-file PATH`, reads the file into memory once at startup, stores it on `ServerConfig::chat_template_src` and logs the load. `http_server.cpp`'s chat handler routes to `render_chat_template_jinja()` when the source is non-empty, falling back to the hardcoded QWEN3/LAGUNA render when it's empty.

  5. BOS/EOS handling — pulled from `tokenizer_.raw_token(bos_id())` / `raw_token(eos_id())` rather than `token_text()` — special tokens like `<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2 byte-level decode would otherwise produce mojibake.

  6. Error handling — lex/parse/runtime/bad-tools-JSON failures throw `std::runtime_error`, surfaced as a 500 response on the chat handler with the underlying error message.
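
The routing in item 4 and the error mapping in item 6 can be sketched roughly as follows. This is a minimal illustration, not the handler from `http_server.cpp`: `HttpResult`, `Renderer`, and `render_prompt` are hypothetical names, the two renderers are passed in as callables so the snippet stays self-contained, and the 500 body format is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in; the real ServerConfig carries more fields.
struct ServerConfig {
    std::string chat_template_src;  // empty means "no Jinja template loaded"
};

// Hypothetical response shape used only for this sketch.
struct HttpResult {
    int status;
    std::string body;
};

using Renderer = std::function<std::string()>;

// Uses the Jinja renderer when a template source was loaded, otherwise the
// hardcoded one, and maps a render failure to a 500 carrying the message.
HttpResult render_prompt(const ServerConfig& cfg,
                         const Renderer& render_jinja,
                         const Renderer& render_hardcoded) {
    assert(render_jinja && render_hardcoded);  // both callables must be set
    try {
        const Renderer& render =
            cfg.chat_template_src.empty() ? render_hardcoded : render_jinja;
        return {200, render()};
    } catch (const std::runtime_error& e) {
        return {500, std::string("chat template render failed: ") + e.what()};
    }
}
```

Passing the renderers as callables is only a convenience for the sketch; in the PR the handler calls `render_chat_template_jinja()` or the hardcoded render directly.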

Usage

```bash
./dflash_server /path/to/target.gguf \
--draft /path/to/draft.gguf \
--chat-template-file /path/to/qwen3.6-froggeric.jinja \
--port 18080 ...
```

If `--chat-template-file` is omitted, behavior is identical to today (hardcoded QWEN3/LAGUNA renderer).

Test plan

  • 7 new unit tests in `test_server_unit.cpp`:
    • basic message render (system + user + assistant turn prefix)
    • `add_generation_prompt=false` suppresses the trailing assistant turn
    • tools array injected and accessible via `{{ tools[0].name }}`
    • `"[]"` tools list correctly treated as empty (no `tools` key in ctx)
    • `bos_token` / `eos_token` threaded through to template
    • empty `template_src` throws
    • malformed tools JSON throws
  • End-to-end smoke against `/v1/messages` with the froggeric Qwen3.6 template + a get_weather tool definition + a "what's the weather in Tokyo" prompt → response contains a proper Anthropic tool_use block (`{"type":"tool_use","name":"get_weather","input":{"city":"Tokyo"}}`).
  • All existing tests still pass.

Files

```
dflash/CMakeLists.txt +16 (jinja sources + include path)
dflash/src/server/chat_template.h +26 (new fn declaration)
dflash/src/server/chat_template.cpp +109 (impl + thread-local cache)
dflash/src/server/http_server.h +6 (ServerConfig fields)
dflash/src/server/http_server.cpp +37 (dispatch in chat handler)
dflash/src/server/server_main.cpp +31 (CLI flag + file read)
dflash/test/test_server_unit.cpp +105 (7 jinja unit tests)
```

Design notes / open questions

  • Layering — the jinja sources are compiled directly into `dflash_common`. An alternative would be to add a separate `dflash_jinja` static lib that `dflash_common` links to (cleaner dependency graph, ~unchanged build time). Happy to refactor if maintainers prefer that shape.
  • Thread-local cache — single template entry per thread is correct for the common single-template case. If a future use case wants multiple templates per process (e.g. multiple models with different jinja), this can become a small LRU.
  • Independent of PR #247 (fix(server): emit Anthropic tool_use content blocks (non-stream + stream)). This PR doesn't depend on #247 (the Anthropic tool_use serialization fix), but the two together are what make dflash_server a viable backend for Anthropic SDK clients. The two can be reviewed and merged independently.
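
For reference, the single-entry thread-local cache discussed in the notes above could look roughly like this. It is a sketch under stated assumptions: `Program`, `parse_template`, `parse_count`, and `get_program` are hypothetical names, `Program` stands in for the parsed `jinja::program`, and the parse step is a placeholder rather than the real lexer/parser.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a parsed template.
struct Program {
    std::string source;
};

// Per-thread parse counter, present only to make the caching observable.
inline int& parse_count() {
    thread_local int n = 0;
    return n;
}

// Placeholder for the lex + parse step.
inline std::shared_ptr<const Program> parse_template(const std::string& src) {
    ++parse_count();
    return std::make_shared<Program>(Program{src});
}

// One cached entry per thread, keyed on the literal template source:
// re-parse only when the source text differs from the cached one.
inline std::shared_ptr<const Program> get_program(const std::string& src) {
    assert(!src.empty());  // this sketch expects a non-empty source
    thread_local std::string cached_src;
    thread_local std::shared_ptr<const Program> cached;
    if (!cached || cached_src != src) {
        cached = parse_template(src);
        cached_src = src;
    }
    return cached;
}
```

Because the storage is `thread_local`, no locking is needed, at the cost of one parse per worker thread. Moving to an LRU would mean replacing the two cached variables with a small bounded map.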

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 7 files

Comment thread dflash/src/server/server_main.cpp Outdated
@sanastasiou
Contributor Author

Thanks @cubic-dev-ai for catching this — issue addressed in fa86c97c (or whatever the new commit hash is, see latest push).

The CLI parser now matches the help text: a 0-byte --chat-template-file is treated as "no template", logs a warning, and falls through to the hardcoded QWEN3/LAGUNA renderer instead of aborting startup. Short-read errors (partial file) still abort because that's an integrity issue, not an empty-file case.
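
A rough sketch of that load path, for illustration only: `load_chat_template` is a hypothetical helper, not the code in `server_main.cpp`, and it throws where the real CLI parser aborts startup. Treating an unopenable file as an error is also an assumption of this sketch.

```cpp
#include <cassert>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Reads the template file fully into memory.
// A 0-byte file logs a warning and returns an empty string, so the caller
// falls back to the hardcoded renderer. Open failures and short reads throw.
std::string load_chat_template(const std::string& path) {
    assert(!path.empty());  // the flag is expected to carry a path argument
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        throw std::runtime_error("cannot open chat template: " + path);
    }
    const std::streamoff end = in.tellg();
    if (end < 0) {
        throw std::runtime_error("cannot determine size of: " + path);
    }
    if (end == 0) {
        std::cerr << "warning: " << path
                  << " is empty, using the hardcoded template\n";
        return std::string();
    }
    const std::streamsize size = static_cast<std::streamsize>(end);
    std::string src(static_cast<std::size_t>(size), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&src[0], size);
    if (in.gcount() != size) {
        throw std::runtime_error("short read on chat template: " + path);
    }
    return src;
}
```

The returned string would be stored on `ServerConfig::chat_template_src`; leaving it empty is what selects the hardcoded path.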

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented May 21, 2026

@sanastasiou Nice, that looks right. The commit message and code both match the behavior now: zero-byte file → warning + empty chat_template_src → falls through to the hardcoded renderer, while short reads (partial file) still hard-abort since that's a genuine I/O issue. Clean fix.

s.anastasiou added 2 commits May 22, 2026 11:50
Adds a `--chat-template-file PATH` CLI flag to dflash_server that loads a
Jinja chat template from disk and uses it to render the prompt, overriding
the hardcoded QWEN3 / LAGUNA renderer in chat_template.cpp.

Why
---

The existing hardcoded Qwen3.5 ChatML template + tool preamble is
adequate for plain chat but it ships with one specific way of telling
the model how to emit tool calls (the `<tool_call><function=NAME>` XML
format). Real-world Qwen3.6 deployments need template flexibility:

  * Community-fine-tuned variants of Qwen3.6 (e.g. froggeric's
    "Qwen-Fixed-Chat-Templates") publish their own .jinja files. Without
    --chat-template-file the server can't use them.
  * Agentic clients like claude-agent-sdk send tool definitions in
    Anthropic shape, expect the model to emit tool calls that the
    server's tool_parser can lift back into Anthropic tool_use blocks.
    Different templates give the model different XML-format instructions,
    which directly affects how reliably the model emits well-formed
    `<tool_call>...</tool_call>` blocks across long, tool-heavy contexts.
  * llama.cpp ships ~50 reference templates in models/templates/*.jinja
    — most users will want to point at one of those rather than write
    their own hardcoded C++ renderer.

This mirrors llama-server's existing `--jinja --chat-template-file`
flow but lives directly in dflash_server.

What
----

1. New `render_chat_template_jinja(template_src, messages, bos, eos,
   add_generation_prompt, enable_thinking, tools_json)` in
   chat_template.cpp. Mirrors llama.cpp's
   common_chat_template_direct_apply_impl: builds a JSON input matching
   the field names every Jinja chat template expects (messages, tools,
   bos_token, eos_token, add_generation_prompt, enable_thinking),
   parses + runs the template, returns the rendered prompt string.

2. Thread-local cache of the most-recently parsed jinja::program keyed
   on the literal template source. Steady-state cost is one
   runtime::execute() per request — no re-lex/re-parse — without
   introducing global mutable state.

3. The 6 jinja sources from `deps/llama.cpp/common/jinja/`
   (lexer/parser/runtime/value/string/caps) plus `common/unicode.cpp`
   (used by jinja's tojson() helper) are pulled into the dflash_common
   static lib. `deps/llama.cpp/common` is added as a PRIVATE include
   path. nlohmann_json was already a PUBLIC link dep.

4. New ServerConfig::chat_template_src / chat_template_path fields.
   server_main.cpp parses `--chat-template-file PATH`, reads the file
   into memory once at startup, logs the load. http_server.cpp's chat
   handler routes to render_chat_template_jinja() when the template
   source is non-empty, falling back to the hardcoded QWEN3/LAGUNA
   render when it's empty.

5. BOS/EOS strings are pulled from `tokenizer_.raw_token(bos_id())` /
   `raw_token(eos_id())` rather than decoded — special tokens like
   `<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2
   byte-level decode would otherwise produce mojibake.

6. Render failures (lex/parse/runtime/bad tools JSON) throw
   std::runtime_error, surfaced as a 500 response on the chat handler.

Verified by
-----------

7 new unit tests in test_server_unit.cpp covering:
  - basic message render
  - add_generation_prompt off
  - tools array injected and accessible via {{ tools[0].name }}
  - "[]" tools list correctly treated as empty (no `tools` key in ctx)
  - bos_token / eos_token threaded through to template
  - empty template_src throws
  - malformed tools JSON throws

End-to-end smoke against /v1/messages with the froggeric Qwen3.6
template: a get_weather tool definition + a "what's the weather in
Tokyo" prompt produced a proper Anthropic tool_use block
(`{"type":"tool_use","name":"get_weather","input":{"city":"Tokyo"}}`).

Files
-----

  dflash/CMakeLists.txt               +16   (jinja sources + include path)
  dflash/src/server/chat_template.h   +26   (new fn declaration)
  dflash/src/server/chat_template.cpp +109  (impl + thread-local cache)
  dflash/src/server/http_server.h     +6    (ServerConfig fields)
  dflash/src/server/http_server.cpp   +37   (dispatch in chat handler)
  dflash/src/server/server_main.cpp   +31   (CLI flag + file read)
  dflash/test/test_server_unit.cpp    +105  (7 jinja unit tests)

The usage text added in the previous commit promised:

    --chat-template-file <path>  Load a Jinja chat template file.
                                 Overrides the hardcoded Qwen3/Laguna
                                 renderer. Empty or missing falls back
                                 to the hardcoded template.

… but the CLI parser was aborting startup with `return 1` whenever the
file length was <= 0, contradicting the "empty falls back" half of the
promise.

Behavior change: when --chat-template-file points at a 0-byte file we
now log a warning and leave ServerConfig::chat_template_src empty, so
http_server.cpp's chat handler falls through to render_chat_template()
(the hardcoded QWEN3/LAGUNA path) as documented. Non-empty files are
unchanged; short-read errors still abort.

This makes scripted launches resilient to a transient empty template
file (e.g. a half-written sed pipe, a checked-out-but-not-populated
template path) — the server starts and serves with the hardcoded
template instead of refusing to come up.

Identified by cubic.
@davide221 davide221 force-pushed the feat/chat-template-file-jinja branch from b7ad386 to 8d6ad73 Compare May 22, 2026 09:52
@davide221
Contributor

Thanks for the contribution @sanastasiou

@davide221 davide221 merged commit 0f9ac25 into Luce-Org:main May 22, 2026
1 of 2 checks passed
@sanastasiou
Contributor Author

@davide221 you're welcome. I was able to solve real coding challenges with these new additions while using claude-code as the harness and Qwen as the backend, driven by lucebox. Unfortunately, after a while something apparently fills up (not sure whether it's dflash or the context) and it crashes :D. Also, the speculative decode drops to 40% or less when the context is above 50k or so.
