Description
Follow-up to #1098. OpenFang's OpenAI-compat driver re-emits persisted thinking as reasoning_content on assistant messages. vLLM deprecated and removed this field in v0.19.0 (PR #33402), renaming it to reasoning per OpenAI's GPT-OSS Responses-API convention.
Effect: for any reasoning model served by vLLM ≥ 0.19.0 (MiniMax M2, DeepSeek-R1, Qwen-thinking, GLM-thinking, GPT-OSS, …), persisted thinking is silently stripped by vLLM and never reaches the model — including for intra-turn agentic tool loops where it matters most.
Reproduction (direct vLLM /tokenize endpoint, vllm 0.19.2): sending an assistant message with reasoning_content: "MARKER-..." produces a rendered prompt with no block; sending the same message with reasoning: "MARKER-..." produces \nMARKER-...\n in the prompt as expected.
Suggested fix: rename the outbound field from reasoning_content to reasoning in the OpenAI-compat driver. For backwards compatibility with older vLLM and other servers (some forks may still accept the old name), emit both fields.
Refs: vLLM RFC #27755, PR #33402.
Expected Behavior
Persisted assistant thinking (per the #1098 fix) should reach the model on subsequent turns when running against vLLM-served reasoning models — i.e. inside an agentic tool loop, the model's earlier reasoning blocks should appear as ... content in the prompt sent to the model, so the model can build on its own prior thinking across tool-calling iterations.
Steps to Reproduce
Reproduction (direct vLLM /tokenize endpoint, vllm 0.19.2): sending an assistant message with reasoning_content: "MARKER-..." produces a rendered prompt with no block; sending the same message with reasoning: "MARKER-..." produces \nMARKER-...\n in the prompt as expected.
OpenFang Version
0.6.4
Operating System
Linux (x86_64)
Logs / Screenshots
No response
Description
Follow-up to #1098. OpenFang's OpenAI-compat driver re-emits persisted thinking as reasoning_content on assistant messages. vLLM deprecated and removed this field in v0.19.0 (PR #33402), renaming it to reasoning per OpenAI's GPT-OSS Responses-API convention.
Effect: for any reasoning model served by vLLM ≥ 0.19.0 (MiniMax M2, DeepSeek-R1, Qwen-thinking, GLM-thinking, GPT-OSS, …), persisted thinking is silently stripped by vLLM and never reaches the model — including for intra-turn agentic tool loops where it matters most.
Reproduction (direct vLLM /tokenize endpoint, vllm 0.19.2): sending an assistant message with reasoning_content: "MARKER-..." produces a rendered prompt with no block; sending the same message with reasoning: "MARKER-..." produces \nMARKER-...\n in the prompt as expected.
Suggested fix: rename the outbound field from reasoning_content to reasoning in the OpenAI-compat driver. For backwards compatibility with older vLLM and other servers (some forks may still accept the old name), emit both fields.
Refs: vLLM RFC #27755, PR #33402.
Expected Behavior
Persisted assistant thinking (per the #1098 fix) should reach the model on subsequent turns when running against vLLM-served reasoning models — i.e. inside an agentic tool loop, the model's earlier reasoning blocks should appear as ... content in the prompt sent to the model, so the model can build on its own prior thinking across tool-calling iterations.
Steps to Reproduce
Reproduction (direct vLLM /tokenize endpoint, vllm 0.19.2): sending an assistant message with reasoning_content: "MARKER-..." produces a rendered prompt with no block; sending the same message with reasoning: "MARKER-..." produces \nMARKER-...\n in the prompt as expected.
OpenFang Version
0.6.4
Operating System
Linux (x86_64)
Logs / Screenshots
No response