Skip to content

feat: Support deployment-level thinking defaults for chat templates#11047

Open
indrajit96 wants to merge 3 commits into
mainfrom
ibhosale/deployment-thinking-toggle
Open

feat: Support deployment-level thinking defaults for chat templates#11047
indrajit96 wants to merge 3 commits into
mainfrom
ibhosale/deployment-thinking-toggle

Conversation

@indrajit96

@indrajit96 indrajit96 commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Overview:

Adds a deployment-level default thinking toggle for chat-template rendering.
This lets operators set whether a deployment should default to thinking enabled or disabled, while preserving request-level override priority.

Details:

  • Adds --dyn-default-thinking-mode enabled|disabled / DYN_DEFAULT_THINKING_MODE.
  • Publishes the configured value through model runtime metadata as default_thinking_mode.
  • Wires metadata publication through vLLM, SGLang, and TRTLLM worker registration paths.
  • Applies the deployment default in the Python frontend processors and Rust OpenAI preprocessor.
  • Injects compatible chat-template args when no request-level thinking control is present:
    • thinking
    • enable_thinking
    • thinking_mode
  • Preserves request precedence for root-level thinking, chat_template_args, and chat_template_kwargs.

Validation:

  • Unit coverage for Rust preprocessor default injection.
  • Python processor unit coverage for vLLM/SGLang paths.
  • E2E validated with Dynamo frontend + vLLM using Qwen3 and SmolLM3:
    • deployment default disabled
    • deployment default enabled
    • per-request override via chat_template_kwargs.enable_thinking

Open in Devin Review

Summary by CodeRabbit

  • New Features

    • Added support for a deployment-level default “thinking mode” that can be set via runtime configuration or CLI.
    • Chat request preprocessing now applies this default when requests do not already specify thinking-related options.
    • The setting is carried through worker startup and engine metadata so it is available across supported runtimes.
  • Tests

    • Added coverage for enabled/disabled defaults and for preserving user-provided thinking settings.

Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
@indrajit96 indrajit96 requested review from a team as code owners June 29, 2026 16:57
@github-actions github-actions Bot added feat backend::vllm Relates to the vllm backend backend::sglang Relates to the sglang backend backend::trtllm Relates to the trtllm backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Jun 29, 2026

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

Open in Devin Review

Comment on lines +140 to 145
chat_template_kwargs = apply_default_thinking_mode_to_template_kwargs(
chat_template_kwargs,
default_thinking_mode,
request_has_root_thinking=(isinstance(request, dict) and "thinking" in request),
)
chat_template_kwargs["reasoning_effort"] = request_for_sampling.reasoning_effort

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Behavioral difference between SGLang and vLLM/Rust in how reasoning_effort interacts with default_thinking_mode

In the SGLang path, _normalize_openai_thinking_template_kwargs (components/src/dynamo/frontend/sglang_prepost.py:366-401) processes reasoning_effort: "none" into thinking: False / enable_thinking: False via setdefault_reasoning(False) BEFORE apply_default_thinking_mode_to_template_kwargs runs. This means reasoning_effort: "none" effectively blocks any deployment default from being applied.

In the vLLM path (components/src/dynamo/frontend/prepost.py:139-145) and the Rust preprocessor (lib/llm/src/preprocessor.rs:402-442), reasoning_effort is NOT checked before the default is applied — it's simply added as a separate template kwarg afterward. So a request with reasoning_effort: "none" + default_thinking_mode: "enabled" would end up with both enable_thinking: true (from the default) and reasoning_effort: "none" in the template kwargs. Whether this is a problem depends on the chat template's precedence rules for these fields.

This is likely intentional — SGLang normalizes OpenAI thinking controls into template flags, while vLLM/Rust let the template interpret reasoning_effort directly — but it's worth confirming the desired behavior for mixed-signal scenarios.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds a deployment-level default_thinking_mode option (enabled/disabled) surfaced via --dyn-default-thinking-mode CLI flag. The value flows from CLI config through Rust WorkerConfig, engine runtime metadata (vLLM, SGLang, TRT-LLM), Python/Rust preprocessors, and frontend processors, injecting thinking/enable_thinking/thinking_mode template kwargs only when the request does not already specify them.

Changes

Default Thinking Mode Feature

Layer / File(s) Summary
thinking.py helpers and Rust preprocessor logic
components/src/dynamo/frontend/thinking.py, lib/llm/src/preprocessor.rs
New thinking.py defines DEFAULT_THINKING_MODE_RUNTIME_KEY, THINKING_CONTROL_KEYS, runtime_default_thinking_mode, and apply_default_thinking_mode_to_template_kwargs. Rust preprocessor.rs adds parallel helpers and calls apply_default_thinking_mode on inbound chat-completion requests before preprocessing.
CLI config and Rust WorkerConfig field
components/src/dynamo/common/configuration/groups/runtime_args.py, lib/backend-common/src/worker.rs
DynamoRuntimeConfig adds dyn_default_thinking_mode and the corresponding --dyn-default-thinking-mode CLI arg. Rust WorkerConfig adds the default_thinking_mode field and build_local_model injects it into runtime_data.
Engine runtime metadata publication
components/src/dynamo/vllm/main.py, components/src/dynamo/sglang/register.py, components/src/dynamo/trtllm/workers/llm_worker.py
Each engine worker conditionally JSON-serializes dyn_default_thinking_mode and publishes it as an engine-specific runtime_config key.
Python WorkerConfig binding and PyO3 stub
components/src/dynamo/common/backend/worker.py, lib/bindings/python/rust/backend.rs, lib/bindings/python/src/dynamo/_core.pyi
Python WorkerConfig dataclass adds default_thinking_mode, from_runtime_config reads it, Worker.run() forwards it. The PyO3 constructor and .pyi stub expose the new parameter.
Frontend processors wiring
components/src/dynamo/frontend/prepost.py, components/src/dynamo/frontend/sglang_prepost.py, components/src/dynamo/frontend/vllm_processor.py, components/src/dynamo/frontend/sglang_processor.py
All four frontend modules accept default_thinking_mode from runtime config and thread it into preprocess_chat_request / apply_default_thinking_mode_to_template_kwargs. SGLang processor also propagates it into worker pool initargs.
Tests
components/src/dynamo/common/backend/tests/test_backend_bindings.py, lib/backend-common/src/worker.rs, lib/llm/src/preprocessor.rs, components/src/dynamo/frontend/tests/test_sglang_processor_unit.py, components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
Tests cover WorkerConfig binding constructor/defaults, Rust build_local_model runtime_data injection, Rust preprocessor injection/non-override cases, and Python frontend disabled/non-override behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers overview and details, but it omits the required "Where should the reviewer start?" and Related Issues sections. Add the missing reviewer-start guidance and complete the required Related Issues section with the issue link or the no-issue confirmation.
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: deployment-level thinking defaults for chat templates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/bindings/python/rust/backend.rs (1)

297-345: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Append the new backend.WorkerConfig parameter instead of inserting it mid-signature.

This changes the positional constructor contract for backend.WorkerConfig(...): every argument after reasoning_parser now shifts one slot. Please move default_thinking_mode to the end of the PyO3 signature/parameter list and update lib/bindings/python/src/dynamo/_core.pyi to match.

Suggested fix
     #[pyo3(signature = (
         namespace,
         component = "backend".to_string(),
         endpoint = "generate".to_string(),
         model_name = String::new(),
         served_model_name = None,
         model_input = ModelInput::Tokens,
         endpoint_types = "chat,completions".to_string(),
         custom_jinja_template = None,
         tool_call_parser = None,
         reasoning_parser = None,
-        default_thinking_mode = None,
         exclude_tools_when_tool_choice_none = true,
         enable_local_indexer = true,
         enable_kv_routing = true,
         metrics_labels = Vec::new(),
         runtime = None,
         disaggregation_mode = DisaggregationMode::Aggregated,
         health_check_payload = None,
         structural_tag_mode = "off".to_string(),
         structural_tag_scope = "auto".to_string(),
         structural_tag_schema = "auto".to_string(),
         route_to_encoder = false,
+        default_thinking_mode = None,
     ))]
@@
         custom_jinja_template: Option<String>,
         tool_call_parser: Option<String>,
         reasoning_parser: Option<String>,
-        default_thinking_mode: Option<String>,
         exclude_tools_when_tool_choice_none: bool,
         enable_local_indexer: bool,
         enable_kv_routing: bool,
         metrics_labels: Vec<(String, String)>,
         runtime: Option<RuntimeConfig>,
         disaggregation_mode: DisaggregationMode,
         health_check_payload: Option<PyObject>,
         structural_tag_mode: String,
         structural_tag_scope: String,
         structural_tag_schema: String,
         route_to_encoder: bool,
+        default_thinking_mode: Option<String>,
     ) -> PyResult<Self> {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/bindings/python/rust/backend.rs` around lines 297 - 345, The
backend.WorkerConfig constructor signature changed the positional argument order
by inserting a new parameter in the middle, which breaks existing callers.
Update the PyO3 constructor in backend.rs so default_thinking_mode is appended
at the end of the signature and parameter list after route_to_encoder, and
mirror the same ordering in the backend.WorkerConfig type stub in
lib/bindings/python/src/dynamo/_core.pyi to keep the Python API contract stable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/common/backend/worker.py`:
- Line 120: Move the new WorkerConfig field default_thinking_mode to the end of
the class so the dataclass stays append-only and existing positional callers
keep binding correctly. Update the WorkerConfig definition by appending
default_thinking_mode after the current optional fields, rather than inserting
it before exclude_tools_when_tool_choice_none or any existing parameters.

---

Outside diff comments:
In `@lib/bindings/python/rust/backend.rs`:
- Around line 297-345: The backend.WorkerConfig constructor signature changed
the positional argument order by inserting a new parameter in the middle, which
breaks existing callers. Update the PyO3 constructor in backend.rs so
default_thinking_mode is appended at the end of the signature and parameter list
after route_to_encoder, and mirror the same ordering in the backend.WorkerConfig
type stub in lib/bindings/python/src/dynamo/_core.pyi to keep the Python API
contract stable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b0055b16-3d34-4118-ad8d-6ab9157ae694

📥 Commits

Reviewing files that changed from the base of the PR and between 1e05725 and 3443e08.

📒 Files selected for processing (17)
  • components/src/dynamo/common/backend/tests/test_backend_bindings.py
  • components/src/dynamo/common/backend/worker.py
  • components/src/dynamo/common/configuration/groups/runtime_args.py
  • components/src/dynamo/frontend/prepost.py
  • components/src/dynamo/frontend/sglang_prepost.py
  • components/src/dynamo/frontend/sglang_processor.py
  • components/src/dynamo/frontend/tests/test_sglang_processor_unit.py
  • components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
  • components/src/dynamo/frontend/thinking.py
  • components/src/dynamo/frontend/vllm_processor.py
  • components/src/dynamo/sglang/register.py
  • components/src/dynamo/trtllm/workers/llm_worker.py
  • components/src/dynamo/vllm/main.py
  • lib/backend-common/src/worker.rs
  • lib/bindings/python/rust/backend.rs
  • lib/bindings/python/src/dynamo/_core.pyi
  • lib/llm/src/preprocessor.rs

custom_jinja_template: Optional[str] = None
tool_call_parser: Optional[str] = None
reasoning_parser: Optional[str] = None
default_thinking_mode: Optional[str] = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Keep WorkerConfig append-only.

Line 120 inserts default_thinking_mode before existing optional fields, but this class already documents that new fields must be appended to preserve positional callers. Any positional WorkerConfig(...) call that used exclude_tools_when_tool_choice_none or anything after it will now silently bind the wrong values.

Suggested fix
-    default_thinking_mode: Optional[str] = None
     exclude_tools_when_tool_choice_none: bool = True
     enable_local_indexer: bool = True
@@
     # Appended at the END of the dataclass to keep positional callers
     # working -- inserting mid-class would silently shift downstream args.
     route_to_encoder: bool = False
+    default_thinking_mode: Optional[str] = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
default_thinking_mode: Optional[str] = None
exclude_tools_when_tool_choice_none: bool = True
enable_local_indexer: bool = True
# Appended at the END of the dataclass to keep positional callers
# working -- inserting mid-class would silently shift downstream args.
route_to_encoder: bool = False
default_thinking_mode: Optional[str] = None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/common/backend/worker.py` at line 120, Move the new
WorkerConfig field default_thinking_mode to the end of the class so the
dataclass stays append-only and existing positional callers keep binding
correctly. Update the WorkerConfig definition by appending default_thinking_mode
after the current optional fields, rather than inserting it before
exclude_tools_when_tool_choice_none or any existing parameters.

@datadog-official

datadog-official Bot commented Jun 29, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 3 Pipeline jobs failed

PR | deploy-cleanup   View in Datadog   GitHub Actions

PR | deploy-operator   View in Datadog   GitHub Actions

PR | deploy-status-check   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 3443e08 | Docs | Give us feedback!

nvyutwu added a commit to nvyutwu/dynamo that referenced this pull request Jun 30, 2026
…namo#11047)

Adapt and apply ai-dynamo/dynamo PR ai-dynamo#11047 ("Support deployment-level
thinking defaults for chat templates") onto feat/glm51-v12-20260610.

Adds --dyn-default-thinking-mode enabled|disabled / DYN_DEFAULT_THINKING_MODE,
published through model runtime metadata (runtime_data.default_thinking_mode)
and applied in the SGLang/vLLM Python frontends and the Rust OpenAI
preprocessor when a request carries no thinking control of its own.
Request-level thinking/chat_template_args/chat_template_kwargs always win.

Adaptations for this branch's older base (~855 commits behind the PR base):
- preprocessor.rs: this branch's NvCreateChatCompletionRequest has no
  first-class `thinking` field, so request-level precedence checks
  `unsupported_fields` (where a root-level `thinking` lands) instead.
- worker.rs: this branch's EngineConfig has no `runtime_data`, so the
  default-thinking runtime_data map is built fresh rather than cloned.
- sglang_prepost.py: this branch lacks the newer-main
  `_normalize_openai_thinking_template_kwargs` helper; the deployment
  default is injected at the top of preprocess_chat_request (before
  reasoning gating and rendering) so both observe the same thinking state.
- vllm/main.py: add missing `import json` for the set_engine_specific call.
- The PR's frontend Python unit tests were not ported (written against
  newer-main fixtures); the Rust preprocessor unit tests are included.

Original-author: Indrajit Bhosale <iamindrajitb@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::sglang Relates to the sglang backend backend::trtllm Relates to the trtllm backend backend::vllm Relates to the vllm backend feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant