to_hf.py: rescue chat_template.jinja before deleting it

DongjiGao · DongjiGao · commit 5be34e633f34 · 2026-04-20T11:26:18.000-07:00
Modern HuggingFace transformers (~4.42+) moves long chat_template
strings out of tokenizer_config.json into a separate
chat_template.jinja file to keep the JSON readable. Qwen3-1.7B's
4168-char template triggers this split; Nemotron-Nano's shorter
template stays inline.

The old code deleted chat_template.jinja before reading
tokenizer_config.json, assuming the inline copy was always complete.
For Qwen3 that meant the exported checkpoint shipped with an empty
chat_template -- vLLM's apply_chat_template returned a prompt
without the <|audio|> placeholder, which broke multimodal prompt
replacement (Failed to apply prompt replacement for mm_items['audio'][0]).

Now read chat_template.jinja, inline it into tokenizer_config.json
when non-empty, and only then delete the file. Nemotron's inline-only
path is unchanged because .jinja doesn't get written for small
templates.
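
The read-then-delete ordering can be sketched in isolation as below. This
is a minimal illustration, not the code in to_hf.py: the helper name
inline_chat_template and the immediate write-back of
tokenizer_config.json are assumptions made so the example runs on its
own against a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def inline_chat_template(output_dir: Path) -> dict:
    """Hypothetical helper: fold chat_template.jinja into tokenizer_config.json."""
    cfg_path = output_dir / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text())
    jinja_file = output_dir / "chat_template.jinja"
    if jinja_file.exists():
        # Read BEFORE deleting; the old ordering lost the template here.
        template = jinja_file.read_text()
        # Only overwrite when the file has content, so an existing inline
        # template survives an empty .jinja file.
        if template.strip():
            cfg["chat_template"] = template
        jinja_file.unlink()
    # Assumption: write back right away to keep the sketch self-contained.
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg


# Simulate the split layout: template only in the .jinja file.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "tokenizer_config.json").write_text(json.dumps({"model_max_length": 8}))
    (out / "chat_template.jinja").write_text("{{ messages }}")
    split_cfg = inline_chat_template(out)
    jinja_removed = not (out / "chat_template.jinja").exists()
```

In the inline-only layout no .jinja file exists, so the helper leaves
whatever chat_template is already in tokenizer_config.json untouched.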

Made-with: Cursor
diff --git a/examples/speechlm2/to_hf.py b/examples/speechlm2/to_hf.py
@@ -183,15 +183,22 @@ def prepare_for_vllm(output_dir: str, model_cfg: dict) -> None:
     if _AUDIO_TOKEN not in tok.get_vocab():
         tok.add_special_tokens({"additional_special_tokens": [_AUDIO_TOKEN]})
     tok.save_pretrained(str(output_dir))
-    # A separate chat_template.jinja file, if present, overrides the inline copy
-    # in tokenizer_config.json. Remove it so tokenizer_config.json wins.
+    # Newer transformers writes long chat templates to a separate
+    # ``chat_template.jinja`` file instead of inlining them in
+    # ``tokenizer_config.json`` (Qwen3's 4k-char template triggers this,
+    # Nemotron's shorter one stays inline). Read whichever is populated,
+    # inline it into tokenizer_config.json, and delete the .jinja file so
+    # downstream tooling sees a single canonical location.
+    tok_cfg_path = output_dir / "tokenizer_config.json"
+    tok_cfg = json.loads(tok_cfg_path.read_text())
     jinja_file = output_dir / "chat_template.jinja"
     if jinja_file.exists():
+        jinja_from_file = jinja_file.read_text()
+        if jinja_from_file.strip():
+            tok_cfg["chat_template"] = jinja_from_file
         jinja_file.unlink()
     # Normalize extra_special_tokens: transformers writes our added audio token
     # as a list, but HF/vLLM loaders expect a dict keyed by semantic name.
-    tok_cfg_path = output_dir / "tokenizer_config.json"
-    tok_cfg = json.loads(tok_cfg_path.read_text())
     tok_cfg["extra_special_tokens"] = {"audio_token": _AUDIO_TOKEN}
     # Some reasoning backbones (e.g. nemotron-nano-v3) ship a chat_template whose
     # default ``enable_thinking`` is ``True``; our SpeechLM fine-tuning renders