[None][fix] Align GPTOSS router tokenization and disagg draft scheduling by SimengLiu-nv · Pull Request #15605 · NVIDIA/TensorRT-LLM

SimengLiu-nv · 2026-06-24T23:18:52Z

Use tool-aware chat template and Harmony tokenization for KV-cache-aware router hashes without mutating forwarded OpenAI requests.

Sync disaggregated generation draft tokens from context-phase params before scheduling so batch capacity accounting sees transferred draft tokens.

Add unit coverage for router/server tokenization parity and disagg draft-token scheduler accounting.

Summary by CodeRabbit

New Features
- Added optional Harmony tokenization support for block-hash routing, including automatic detection for compatible model configs.
- Improved chat tokenization to pass tool and reasoning settings through correctly.
Bug Fixes
- Fixed scheduling for generation requests so draft tokens are counted consistently after transfer completion.
- Improved handling of requests with precomputed token IDs and prompts so they are no longer rewritten unnecessarily.

Description

Test Coverage

tests/unittest/_torch/executor/test_py_executor.py
tests/unittest/disaggregated/test_router.py

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-24T23:26:59Z

📝 Walkthrough

Walkthrough

PyExecutor now synchronizes disaggregated draft-token fields before batch scheduling and uses the synchronized count in scheduled-token estimation. Router tokenization can select Harmony from use_harmony or config.json, forwards tool schemas, and stops rewriting request prompts. Tests cover both paths.

Changes

Draft token synchronization

Layer / File(s)	Summary
Draft-token helpers and completion setup `tensorrt_llm/_torch/pyexecutor/py_executor.py`	Adds helpers that sync disaggregated draft-token fields and normalizes transmission-complete requests into Python-visible draft-token state.
Batch scheduling and token estimation `tensorrt_llm/_torch/pyexecutor/py_executor.py`	Calls the sync helper before scheduling and uses synchronized draft-token counts when estimating scheduled tokens.
Draft-token scheduler tests `tests/unittest/_torch/executor/test_py_executor.py`	Extends request mocks and adds tests for synced, empty, and ignored disaggregated draft-token cases.

Harmony tokenization routing

Layer / File(s)	Summary
Harmony initialization `tensorrt_llm/serve/router.py`	Adds `use_harmony` initialization, model config loading, and Harmony-selection logging.
Harmony and chat tokenization `tensorrt_llm/serve/router.py`	Adds Harmony selection helpers, forwards tool schemas into chat tokenization, and stops mutating request prompt fields.
Tokenization behavior tests `tests/unittest/disaggregated/test_router.py`	Adds tokenization helpers and tests for tool arguments, empty tools, completion prompts, and `prompt_token_ids` short-circuiting.
Harmony selection tests `tests/unittest/disaggregated/test_router.py`	Adds tests for GPT-OSS Harmony tokenization, router/server parity, explicit Harmony selection, and `config.json` inference.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

suyoggupta
MrGeva
galagam

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main changes: GPTOSS router tokenization and disaggregated draft scheduling.
Description check	✅ Passed	The description includes a clear summary, test coverage, and checklist, though the dedicated Description section is mostly left as a template placeholder.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/serve/router.py (1)

772-785: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Forward all worker-side chat-template inputs into router hashing.

The router now mirrors tools, but the worker also renders with request.documents and request.chat_template. Requests that differ only by RAG documents or a custom template can still produce identical router hashes while generating different worker prompts.

Suggested fix

 chat_template_kwargs = dict(request.chat_template_kwargs or {})
 chat_template_kwargs["tools"] = self._tool_dicts(request)
+chat_template_kwargs["documents"] = request.documents
+if request.chat_template is not None:
+    chat_template_kwargs["chat_template"] = request.chat_template
 result = tokenizer.apply_chat_template(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/router.py` around lines 772 - 785, Forward the remaining
worker-side chat-template inputs into the router hash computation in router.py’s
tokenizer.apply_chat_template path: the current hash only mirrors tools via
chat_template_kwargs, but it also needs request.documents and
request.chat_template so router hashing matches the worker-rendered prompt.
Update the chat_template_kwargs assembly in the same block that builds result to
include these request fields before calling apply_chat_template, ensuring
requests that differ by RAG documents or a custom template no longer collide.

🧹 Nitpick comments (3)

tests/unittest/disaggregated/test_router.py (2)
1118-1181: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Cover the disabled-Harmony selection path.

The selection tests cover explicit use_harmony=True and config.json inference, but not the server-compatible disabled path. Add a monkeypatch.setenv("DISABLE_HARMONY_ADAPTER", "1") case in tests/unittest/disaggregated/test_router.py asserting GPT-OSS requests use the tokenizer path and do not call Harmony. As per path instructions, coverage needs follow-up in this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_router.py` around lines 1118 - 1181, Add a
test case in test_use_harmony_flag_for_alias_model or a nearby test in
KvCacheAwareRouter coverage that sets DISABLE_HARMONY_ADAPTER via monkeypatch
before calling _tokenize; verify the disabled-Harmony path uses
tokenizer.apply_chat_template, does not invoke
harmony_adapter.get_harmony_adapter or openai_to_harmony_tokens, and leaves
req.prompt_token_ids populated as expected for GPT-OSS/server-compatible
requests. Use the existing test helpers and symbols like KvCacheAwareRouter,
_tokenize, and ChatCompletionRequest to keep the new case aligned with the
current Harmony selection tests.
Source: Path instructions

1037-1117: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add parity cases for documents and custom chat_template.

Coverage is insufficient for the remaining worker-side template inputs. Extend tests/unittest/disaggregated/test_router.py with parametrized router/server parity cases where request.documents and request.chat_template are set, so cache-hash tokenization stays aligned with serving tokenization. As per path instructions, this names the concrete test file and coverage gap.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_router.py` around lines 1037 - 1117, Add
router/server parity coverage for the remaining worker-side template inputs by
extending the existing
`test_gpt_oss_router_and_server_create_same_tokenized_input` in
`test_router.py`. Parametrize new cases where `ChatCompletionRequest.documents`
is populated and where a custom `chat_template` is set, then verify
`KvCacheAwareRouter._tokenize` and `OpenAIServer.chat_harmony` produce identical
tokenized inputs. Reuse the current `fake_harmony_tokens`,
`harmony.openai_to_harmony_tokens`, and `generate_async` assertions so the new
cases exercise the same cache-hash/tokenization path.
Source: Path instructions
tests/unittest/_torch/executor/test_py_executor.py (1)
378-408: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Coverage is insufficient for the executor integration paths.

tests/unittest/_torch/executor/test_py_executor.py now covers the static helper directly, but should also cover: (1) trans-complete draft_tokens=None with enable_spec_decode=True / max_total_draft_tokens > 0, and (2) the PP scheduling path that bypasses _prepare_and_schedule_batch. As per path instructions, tests under tests/** should be reviewed for actionable coverage sufficiency.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_py_executor.py` around lines 378 - 408,
The executor tests only cover the helper directly and miss key integration
coverage. Add a test around trans-complete requests with draft_tokens=None when
enable_spec_decode is true and max_total_draft_tokens is positive, and add a PP
scheduling test that exercises the path bypassing _prepare_and_schedule_batch.
Use the existing PyExecutor and
_sync_disagg_generation_trans_complete_draft_tokens / _compute_scheduled_tokens
helpers to locate the relevant flow, but assert behavior through the
higher-level scheduling integration rather than only the static helper.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2617-2633: The draft-token sync in
_sync_disagg_generation_trans_complete_draft_tokens is using a different draft
count rule than _prepare_disagg_gen_transmission_complete, which can undercount
capacity when spec decode is enabled. Update the transmission-complete sync to
use the same draft-token resolver/logic as
_prepare_disagg_gen_transmission_complete, and move this sync so it runs after
the spec-decode decision is applied. Keep the behavior aligned for empty or None
context drafts and ensure py_draft_tokens, draft_tokens, and
py_draft_pages_allocated are derived from the shared resolution path.
- Around line 2651-2652: The PP executor path is missing the same disagg
generation transition sync that `_prepare_and_schedule_batch` already performs,
so `pp_size > 1` can schedule requests with stale draft-token state. Add the
`_sync_disagg_generation_trans_complete_draft_tokens` call in
`_executor_loop_pp` before it invokes `_pp_schedule_and_propagate()` or
`_schedule()`, mirroring the existing behavior in `_prepare_and_schedule_batch`
and using `self.active_requests` there as the source.

In `@tensorrt_llm/serve/router.py`:
- Around line 736-740: The Harmony tokenization decision in
`_uses_harmony_tokenization` is still defaulting GPT-OSS models to Harmony even
when the server has disabled it. Update `KvCacheAwareRouter` to use the same
resolved `use_harmony` value that `OpenAIServer` derives from
`DISABLE_HARMONY_ADAPTER`, or otherwise apply that env gate here, so router KV
hashing and worker chat template rendering stay aligned.

---

Outside diff comments:
In `@tensorrt_llm/serve/router.py`:
- Around line 772-785: Forward the remaining worker-side chat-template inputs
into the router hash computation in router.py’s tokenizer.apply_chat_template
path: the current hash only mirrors tools via chat_template_kwargs, but it also
needs request.documents and request.chat_template so router hashing matches the
worker-rendered prompt. Update the chat_template_kwargs assembly in the same
block that builds result to include these request fields before calling
apply_chat_template, ensuring requests that differ by RAG documents or a custom
template no longer collide.

---

Nitpick comments:
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Around line 378-408: The executor tests only cover the helper directly and
miss key integration coverage. Add a test around trans-complete requests with
draft_tokens=None when enable_spec_decode is true and max_total_draft_tokens is
positive, and add a PP scheduling test that exercises the path bypassing
_prepare_and_schedule_batch. Use the existing PyExecutor and
_sync_disagg_generation_trans_complete_draft_tokens / _compute_scheduled_tokens
helpers to locate the relevant flow, but assert behavior through the
higher-level scheduling integration rather than only the static helper.

In `@tests/unittest/disaggregated/test_router.py`:
- Around line 1118-1181: Add a test case in
test_use_harmony_flag_for_alias_model or a nearby test in KvCacheAwareRouter
coverage that sets DISABLE_HARMONY_ADAPTER via monkeypatch before calling
_tokenize; verify the disabled-Harmony path uses tokenizer.apply_chat_template,
does not invoke harmony_adapter.get_harmony_adapter or openai_to_harmony_tokens,
and leaves req.prompt_token_ids populated as expected for
GPT-OSS/server-compatible requests. Use the existing test helpers and symbols
like KvCacheAwareRouter, _tokenize, and ChatCompletionRequest to keep the new
case aligned with the current Harmony selection tests.
- Around line 1037-1117: Add router/server parity coverage for the remaining
worker-side template inputs by extending the existing
`test_gpt_oss_router_and_server_create_same_tokenized_input` in
`test_router.py`. Parametrize new cases where `ChatCompletionRequest.documents`
is populated and where a custom `chat_template` is set, then verify
`KvCacheAwareRouter._tokenize` and `OpenAIServer.chat_harmony` produce identical
tokenized inputs. Reuse the current `fake_harmony_tokens`,
`harmony.openai_to_harmony_tokens`, and `generate_async` assertions so the new
cases exercise the same cache-hash/tokenization path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1001c8fd-15d1-46f1-9553-f45f1bea19cf

📥 Commits

Reviewing files that changed from the base of the PR and between f28a832 and 3081468.

📒 Files selected for processing (4)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/serve/router.py
tests/unittest/_torch/executor/test_py_executor.py
tests/unittest/disaggregated/test_router.py

coderabbitai · 2026-06-24T23:27:02Z

+    @staticmethod
+    def _sync_disagg_generation_trans_complete_draft_tokens(
+            requests: Iterable[LlmRequest]) -> None:
+        for request in requests:
+            if not getattr(request,
+                           "is_disagg_generation_transmission_complete", False):
+                continue
+
+            context_phase_params = request.context_phase_params
+            if context_phase_params is None:
+                continue
+
+            draft_tokens = context_phase_params.draft_tokens
+            request.py_draft_tokens = [] if draft_tokens is None else list(
+                draft_tokens)
+            request.draft_tokens = request.py_draft_tokens
+            request.py_draft_pages_allocated = len(request.py_draft_tokens)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use the same draft-token resolver before scheduling and transmission-complete setup.

Line 2630 treats None/empty ctx drafts as zero scheduled draft tokens, but Line 4478 later turns that same no-draft case into max_total_draft_tokens dummy drafts when model_engine.enable_spec_decode is true. That can undercount batch capacity before the request prepares more decode tokens. Move this sync after the current spec-decode decision and share the same resolver with _prepare_disagg_gen_transmission_complete.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2617 - 2633, The draft-token sync in _sync_disagg_generation_trans_complete_draft_tokens is using a different draft count rule than _prepare_disagg_gen_transmission_complete, which can undercount capacity when spec decode is enabled. Update the transmission-complete sync to use the same draft-token resolver/logic as _prepare_disagg_gen_transmission_complete, and move this sync so it runs after the spec-decode decision is applied. Keep the behavior aligned for empty or None context drafts and ensure py_draft_tokens, draft_tokens, and py_draft_pages_allocated are derived from the shared resolution path.

coderabbitai · 2026-06-24T23:27:02Z

+            self._sync_disagg_generation_trans_complete_draft_tokens(
+                self.active_requests)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Mirror this sync in the PP scheduling path.

Line 2651 only covers _prepare_and_schedule_batch; _executor_loop_pp bypasses that helper and calls _pp_schedule_and_propagate() / _schedule() directly, so pp_size > 1 disagg generation requests can still schedule with stale draft-token fields.

Suggested placement in `_executor_loop_pp`

if self.kv_cache_transceiver: self._check_disagg_ctx_schedulable_status(new_requests) self._check_disagg_gen_transfer_status() + self._sync_disagg_generation_trans_complete_draft_tokens( + self.active_requests)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2651 - 2652, The PP executor path is missing the same disagg generation transition sync that `_prepare_and_schedule_batch` already performs, so `pp_size > 1` can schedule requests with stale draft-token state. Add the `_sync_disagg_generation_trans_complete_draft_tokens` call in `_executor_loop_pp` before it invokes `_pp_schedule_and_propagate()` or `_schedule()`, mirroring the existing behavior in `_prepare_and_schedule_batch` and using `self.active_requests` there as the source.

coderabbitai · 2026-06-24T23:27:02Z

+    def _uses_harmony_tokenization(self,
+                                   request: ChatCompletionRequest) -> bool:
+        if self._use_harmony is not None:
+            return self._use_harmony
+        return self._get_model_type(request.model) == "gpt_oss"


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Honor the server-side Harmony disable gate.

OpenAIServer resolves DISABLE_HARMONY_ADAPTER=1 to use_harmony=False, but this default router inference still enables Harmony for GPT-OSS models. That makes router KV hashes use Harmony tokens while the worker renders the non-Harmony chat template. Consider applying the same env gate here or always passing the server’s resolved use_harmony value into KvCacheAwareRouter.

Suggested alignment

def _uses_harmony_tokenization(self, request: ChatCompletionRequest) -> bool: + if os.getenv("DISABLE_HARMONY_ADAPTER", "0") == "1": + return False if self._use_harmony is not None: return self._use_harmony return self._get_model_type(request.model) == "gpt_oss"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _uses_harmony_tokenization(self,

request: ChatCompletionRequest) -> bool:

if self._use_harmony is not None:

return self._use_harmony

return self._get_model_type(request.model) == "gpt_oss"

def _uses_harmony_tokenization(self,

request: ChatCompletionRequest) -> bool:

if os.getenv("DISABLE_HARMONY_ADAPTER", "0") == "1":

return False

if self._use_harmony is not None:

return self._use_harmony

return self._get_model_type(request.model) == "gpt_oss"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/router.py` around lines 736 - 740, The Harmony tokenization decision in `_uses_harmony_tokenization` is still defaulting GPT-OSS models to Harmony even when the server has disabled it. Update `KvCacheAwareRouter` to use the same resolved `use_harmony` value that `OpenAIServer` derives from `DISABLE_HARMONY_ADAPTER`, or otherwise apply that env gate here, so router KV hashing and worker chat template rendering stay aligned.

Use tool-aware chat template and Harmony tokenization for KV-cache-aware router hashes without mutating forwarded OpenAI requests. Sync disaggregated generation draft tokens from context-phase params before scheduling so batch capacity accounting sees transferred draft tokens. Add unit coverage for router/server tokenization parity and disagg draft-token scheduler accounting. Signed-off-by: Simeng Liu <simengl@nvidia.com>

SimengLiu-nv requested review from a team as code owners June 24, 2026 23:18

SimengLiu-nv requested a review from schetlur-nv June 24, 2026 23:18

github-actions Bot assigned SimengLiu-nv Jun 24, 2026

SimengLiu-nv requested review from Tabrizian and pcastonguay June 24, 2026 23:19

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

SimengLiu-nv force-pushed the migrate-bench-x branch from 3081468 to 1f3dde4 Compare June 24, 2026 23:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[None][fix] Align GPTOSS router tokenization and disagg draft scheduling#15605

[None][fix] Align GPTOSS router tokenization and disagg draft scheduling#15605
SimengLiu-nv wants to merge 1 commit into
NVIDIA:mainfrom
SimengLiu-nv:migrate-bench-x

SimengLiu-nv commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Uh oh!

coderabbitai Bot Jun 24, 2026

Uh oh!

coderabbitai Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		self._sync_disagg_generation_trans_complete_draft_tokens(
		self.active_requests)

Uh oh!

Conversation

SimengLiu-nv commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SimengLiu-nv commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading