Skip to content

[None][fix] Align GPTOSS router tokenization and disagg draft scheduling#15605

Open
SimengLiu-nv wants to merge 1 commit into
NVIDIA:mainfrom
SimengLiu-nv:migrate-bench-x
Open

[None][fix] Align GPTOSS router tokenization and disagg draft scheduling#15605
SimengLiu-nv wants to merge 1 commit into
NVIDIA:mainfrom
SimengLiu-nv:migrate-bench-x

Conversation

@SimengLiu-nv

@SimengLiu-nv SimengLiu-nv commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Use tool-aware chat template and Harmony tokenization for KV-cache-aware router hashes without mutating forwarded OpenAI requests.

Sync disaggregated generation draft tokens from context-phase params before scheduling so batch capacity accounting sees transferred draft tokens.

Add unit coverage for router/server tokenization parity and disagg draft-token scheduler accounting.

Summary by CodeRabbit

  • New Features

    • Added optional Harmony tokenization support for block-hash routing, including automatic detection for compatible model configs.
    • Improved chat tokenization to pass tool and reasoning settings through correctly.
  • Bug Fixes

    • Fixed scheduling for generation requests so draft tokens are counted consistently after transfer completion.
    • Improved handling of requests with precomputed token IDs and prompts so they are no longer rewritten unnecessarily.

Description

Test Coverage

tests/unittest/_torch/executor/test_py_executor.py
tests/unittest/disaggregated/test_router.py

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

PyExecutor now synchronizes disaggregated draft-token fields before batch scheduling and uses the synchronized count in scheduled-token estimation. Router tokenization can select Harmony from use_harmony or config.json, forwards tool schemas, and stops rewriting request prompts. Tests cover both paths.

Changes

Draft token synchronization

Layer / File(s) Summary
Draft-token helpers and completion setup
tensorrt_llm/_torch/pyexecutor/py_executor.py
Adds helpers that sync disaggregated draft-token fields and normalizes transmission-complete requests into Python-visible draft-token state.
Batch scheduling and token estimation
tensorrt_llm/_torch/pyexecutor/py_executor.py
Calls the sync helper before scheduling and uses synchronized draft-token counts when estimating scheduled tokens.
Draft-token scheduler tests
tests/unittest/_torch/executor/test_py_executor.py
Extends request mocks and adds tests for synced, empty, and ignored disaggregated draft-token cases.

Harmony tokenization routing

Layer / File(s) Summary
Harmony initialization
tensorrt_llm/serve/router.py
Adds use_harmony initialization, model config loading, and Harmony-selection logging.
Harmony and chat tokenization
tensorrt_llm/serve/router.py
Adds Harmony selection helpers, forwards tool schemas into chat tokenization, and stops mutating request prompt fields.
Tokenization behavior tests
tests/unittest/disaggregated/test_router.py
Adds tokenization helpers and tests for tool arguments, empty tools, completion prompts, and prompt_token_ids short-circuiting.
Harmony selection tests
tests/unittest/disaggregated/test_router.py
Adds tests for GPT-OSS Harmony tokenization, router/server parity, explicit Harmony selection, and config.json inference.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • suyoggupta
  • MrGeva
  • galagam
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main changes: GPTOSS router tokenization and disaggregated draft scheduling.
Description check ✅ Passed The description includes a clear summary, test coverage, and checklist, though the dedicated Description section is mostly left as a template placeholder.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/serve/router.py (1)

772-785: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Forward all worker-side chat-template inputs into router hashing.

The router now mirrors tools, but the worker also renders with request.documents and request.chat_template. Requests that differ only by RAG documents or a custom template can still produce identical router hashes while generating different worker prompts.

Suggested fix
 chat_template_kwargs = dict(request.chat_template_kwargs or {})
 chat_template_kwargs["tools"] = self._tool_dicts(request)
+chat_template_kwargs["documents"] = request.documents
+if request.chat_template is not None:
+    chat_template_kwargs["chat_template"] = request.chat_template
 result = tokenizer.apply_chat_template(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/router.py` around lines 772 - 785, Forward the remaining
worker-side chat-template inputs into the router hash computation in router.py’s
tokenizer.apply_chat_template path: the current hash only mirrors tools via
chat_template_kwargs, but it also needs request.documents and
request.chat_template so router hashing matches the worker-rendered prompt.
Update the chat_template_kwargs assembly in the same block that builds result to
include these request fields before calling apply_chat_template, ensuring
requests that differ by RAG documents or a custom template no longer collide.
🧹 Nitpick comments (3)
tests/unittest/disaggregated/test_router.py (2)

1118-1181: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Cover the disabled-Harmony selection path.

The selection tests cover explicit use_harmony=True and config.json inference, but not the server-compatible disabled path. Add a monkeypatch.setenv("DISABLE_HARMONY_ADAPTER", "1") case in tests/unittest/disaggregated/test_router.py asserting GPT-OSS requests use the tokenizer path and do not call Harmony. As per path instructions, coverage needs follow-up in this PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_router.py` around lines 1118 - 1181, Add a
test case in test_use_harmony_flag_for_alias_model or a nearby test in
KvCacheAwareRouter coverage that sets DISABLE_HARMONY_ADAPTER via monkeypatch
before calling _tokenize; verify the disabled-Harmony path uses
tokenizer.apply_chat_template, does not invoke
harmony_adapter.get_harmony_adapter or openai_to_harmony_tokens, and leaves
req.prompt_token_ids populated as expected for GPT-OSS/server-compatible
requests. Use the existing test helpers and symbols like KvCacheAwareRouter,
_tokenize, and ChatCompletionRequest to keep the new case aligned with the
current Harmony selection tests.

Source: Path instructions


1037-1117: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add parity cases for documents and custom chat_template.

Coverage is insufficient for the remaining worker-side template inputs. Extend tests/unittest/disaggregated/test_router.py with parametrized router/server parity cases where request.documents and request.chat_template are set, so cache-hash tokenization stays aligned with serving tokenization. As per path instructions, this names the concrete test file and coverage gap.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_router.py` around lines 1037 - 1117, Add
router/server parity coverage for the remaining worker-side template inputs by
extending the existing
`test_gpt_oss_router_and_server_create_same_tokenized_input` in
`test_router.py`. Parametrize new cases where `ChatCompletionRequest.documents`
is populated and where a custom `chat_template` is set, then verify
`KvCacheAwareRouter._tokenize` and `OpenAIServer.chat_harmony` produce identical
tokenized inputs. Reuse the current `fake_harmony_tokens`,
`harmony.openai_to_harmony_tokens`, and `generate_async` assertions so the new
cases exercise the same cache-hash/tokenization path.

Source: Path instructions

tests/unittest/_torch/executor/test_py_executor.py (1)

378-408: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Coverage is insufficient for the executor integration paths.

tests/unittest/_torch/executor/test_py_executor.py now covers the static helper directly, but should also cover: (1) trans-complete draft_tokens=None with enable_spec_decode=True / max_total_draft_tokens > 0, and (2) the PP scheduling path that bypasses _prepare_and_schedule_batch. As per path instructions, tests under tests/** should be reviewed for actionable coverage sufficiency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_py_executor.py` around lines 378 - 408,
The executor tests only cover the helper directly and miss key integration
coverage. Add a test around trans-complete requests with draft_tokens=None when
enable_spec_decode is true and max_total_draft_tokens is positive, and add a PP
scheduling test that exercises the path bypassing _prepare_and_schedule_batch.
Use the existing PyExecutor and
_sync_disagg_generation_trans_complete_draft_tokens / _compute_scheduled_tokens
helpers to locate the relevant flow, but assert behavior through the
higher-level scheduling integration rather than only the static helper.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2617-2633: The draft-token sync in
_sync_disagg_generation_trans_complete_draft_tokens is using a different draft
count rule than _prepare_disagg_gen_transmission_complete, which can undercount
capacity when spec decode is enabled. Update the transmission-complete sync to
use the same draft-token resolver/logic as
_prepare_disagg_gen_transmission_complete, and move this sync so it runs after
the spec-decode decision is applied. Keep the behavior aligned for empty or None
context drafts and ensure py_draft_tokens, draft_tokens, and
py_draft_pages_allocated are derived from the shared resolution path.
- Around line 2651-2652: The PP executor path is missing the same disagg
generation transition sync that `_prepare_and_schedule_batch` already performs,
so `pp_size > 1` can schedule requests with stale draft-token state. Add the
`_sync_disagg_generation_trans_complete_draft_tokens` call in
`_executor_loop_pp` before it invokes `_pp_schedule_and_propagate()` or
`_schedule()`, mirroring the existing behavior in `_prepare_and_schedule_batch`
and using `self.active_requests` there as the source.

In `@tensorrt_llm/serve/router.py`:
- Around line 736-740: The Harmony tokenization decision in
`_uses_harmony_tokenization` is still defaulting GPT-OSS models to Harmony even
when the server has disabled it. Update `KvCacheAwareRouter` to use the same
resolved `use_harmony` value that `OpenAIServer` derives from
`DISABLE_HARMONY_ADAPTER`, or otherwise apply that env gate here, so router KV
hashing and worker chat template rendering stay aligned.

---

Outside diff comments:
In `@tensorrt_llm/serve/router.py`:
- Around line 772-785: Forward the remaining worker-side chat-template inputs
into the router hash computation in router.py’s tokenizer.apply_chat_template
path: the current hash only mirrors tools via chat_template_kwargs, but it also
needs request.documents and request.chat_template so router hashing matches the
worker-rendered prompt. Update the chat_template_kwargs assembly in the same
block that builds result to include these request fields before calling
apply_chat_template, ensuring requests that differ by RAG documents or a custom
template no longer collide.

---

Nitpick comments:
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Around line 378-408: The executor tests only cover the helper directly and
miss key integration coverage. Add a test around trans-complete requests with
draft_tokens=None when enable_spec_decode is true and max_total_draft_tokens is
positive, and add a PP scheduling test that exercises the path bypassing
_prepare_and_schedule_batch. Use the existing PyExecutor and
_sync_disagg_generation_trans_complete_draft_tokens / _compute_scheduled_tokens
helpers to locate the relevant flow, but assert behavior through the
higher-level scheduling integration rather than only the static helper.

In `@tests/unittest/disaggregated/test_router.py`:
- Around line 1118-1181: Add a test case in
test_use_harmony_flag_for_alias_model or a nearby test in KvCacheAwareRouter
coverage that sets DISABLE_HARMONY_ADAPTER via monkeypatch before calling
_tokenize; verify the disabled-Harmony path uses tokenizer.apply_chat_template,
does not invoke harmony_adapter.get_harmony_adapter or openai_to_harmony_tokens,
and leaves req.prompt_token_ids populated as expected for
GPT-OSS/server-compatible requests. Use the existing test helpers and symbols
like KvCacheAwareRouter, _tokenize, and ChatCompletionRequest to keep the new
case aligned with the current Harmony selection tests.
- Around line 1037-1117: Add router/server parity coverage for the remaining
worker-side template inputs by extending the existing
`test_gpt_oss_router_and_server_create_same_tokenized_input` in
`test_router.py`. Parametrize new cases where `ChatCompletionRequest.documents`
is populated and where a custom `chat_template` is set, then verify
`KvCacheAwareRouter._tokenize` and `OpenAIServer.chat_harmony` produce identical
tokenized inputs. Reuse the current `fake_harmony_tokens`,
`harmony.openai_to_harmony_tokens`, and `generate_async` assertions so the new
cases exercise the same cache-hash/tokenization path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1001c8fd-15d1-46f1-9553-f45f1bea19cf

📥 Commits

Reviewing files that changed from the base of the PR and between f28a832 and 3081468.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/serve/router.py
  • tests/unittest/_torch/executor/test_py_executor.py
  • tests/unittest/disaggregated/test_router.py

Comment on lines +2617 to +2633
@staticmethod
def _sync_disagg_generation_trans_complete_draft_tokens(
requests: Iterable[LlmRequest]) -> None:
for request in requests:
if not getattr(request,
"is_disagg_generation_transmission_complete", False):
continue

context_phase_params = request.context_phase_params
if context_phase_params is None:
continue

draft_tokens = context_phase_params.draft_tokens
request.py_draft_tokens = [] if draft_tokens is None else list(
draft_tokens)
request.draft_tokens = request.py_draft_tokens
request.py_draft_pages_allocated = len(request.py_draft_tokens)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use the same draft-token resolver before scheduling and transmission-complete setup.

Line 2630 treats None/empty ctx drafts as zero scheduled draft tokens, but Line 4478 later turns that same no-draft case into max_total_draft_tokens dummy drafts when model_engine.enable_spec_decode is true. That can undercount batch capacity before the request prepares more decode tokens. Move this sync after the current spec-decode decision and share the same resolver with _prepare_disagg_gen_transmission_complete.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2617 - 2633, The
draft-token sync in _sync_disagg_generation_trans_complete_draft_tokens is using
a different draft count rule than _prepare_disagg_gen_transmission_complete,
which can undercount capacity when spec decode is enabled. Update the
transmission-complete sync to use the same draft-token resolver/logic as
_prepare_disagg_gen_transmission_complete, and move this sync so it runs after
the spec-decode decision is applied. Keep the behavior aligned for empty or None
context drafts and ensure py_draft_tokens, draft_tokens, and
py_draft_pages_allocated are derived from the shared resolution path.

Comment on lines +2651 to +2652
self._sync_disagg_generation_trans_complete_draft_tokens(
self.active_requests)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Mirror this sync in the PP scheduling path.

Line 2651 only covers _prepare_and_schedule_batch; _executor_loop_pp bypasses that helper and calls _pp_schedule_and_propagate() / _schedule() directly, so pp_size > 1 disagg generation requests can still schedule with stale draft-token fields.

Suggested placement in `_executor_loop_pp`
 if self.kv_cache_transceiver:
     self._check_disagg_ctx_schedulable_status(new_requests)
     self._check_disagg_gen_transfer_status()
+    self._sync_disagg_generation_trans_complete_draft_tokens(
+        self.active_requests)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2651 - 2652, The
PP executor path is missing the same disagg generation transition sync that
`_prepare_and_schedule_batch` already performs, so `pp_size > 1` can schedule
requests with stale draft-token state. Add the
`_sync_disagg_generation_trans_complete_draft_tokens` call in
`_executor_loop_pp` before it invokes `_pp_schedule_and_propagate()` or
`_schedule()`, mirroring the existing behavior in `_prepare_and_schedule_batch`
and using `self.active_requests` there as the source.

Comment on lines +736 to +740
def _uses_harmony_tokenization(self,
request: ChatCompletionRequest) -> bool:
if self._use_harmony is not None:
return self._use_harmony
return self._get_model_type(request.model) == "gpt_oss"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Honor the server-side Harmony disable gate.

OpenAIServer resolves DISABLE_HARMONY_ADAPTER=1 to use_harmony=False, but this default router inference still enables Harmony for GPT-OSS models. That makes router KV hashes use Harmony tokens while the worker renders the non-Harmony chat template. Consider applying the same env gate here or always passing the server’s resolved use_harmony value into KvCacheAwareRouter.

Suggested alignment
 def _uses_harmony_tokenization(self,
                                request: ChatCompletionRequest) -> bool:
+    if os.getenv("DISABLE_HARMONY_ADAPTER", "0") == "1":
+        return False
     if self._use_harmony is not None:
         return self._use_harmony
     return self._get_model_type(request.model) == "gpt_oss"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _uses_harmony_tokenization(self,
request: ChatCompletionRequest) -> bool:
if self._use_harmony is not None:
return self._use_harmony
return self._get_model_type(request.model) == "gpt_oss"
def _uses_harmony_tokenization(self,
request: ChatCompletionRequest) -> bool:
if os.getenv("DISABLE_HARMONY_ADAPTER", "0") == "1":
return False
if self._use_harmony is not None:
return self._use_harmony
return self._get_model_type(request.model) == "gpt_oss"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/router.py` around lines 736 - 740, The Harmony
tokenization decision in `_uses_harmony_tokenization` is still defaulting
GPT-OSS models to Harmony even when the server has disabled it. Update
`KvCacheAwareRouter` to use the same resolved `use_harmony` value that
`OpenAIServer` derives from `DISABLE_HARMONY_ADAPTER`, or otherwise apply that
env gate here, so router KV hashing and worker chat template rendering stay
aligned.

Use tool-aware chat template and Harmony tokenization for KV-cache-aware
router hashes without mutating forwarded OpenAI requests.

Sync disaggregated generation draft tokens from context-phase params before
scheduling so batch capacity accounting sees transferred draft tokens.

Add unit coverage for router/server tokenization parity and disagg draft-token
scheduler accounting.

Signed-off-by: Simeng Liu <simengl@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant