[TRTLLM-12958][feat] Enable speculative decoding for dis-agg gen-only requests by bo-nv · Pull Request #14546 · NVIDIA/TensorRT-LLM

bo-nv · 2026-05-26T04:17:07Z

Summary by CodeRabbit

Bug Fixes
- Improved compatibility for disaggregated execution with varying peer layer configurations.
- Enhanced handling of speculative decoding batch processing when draft tokens are unavailable, ensuring uniform input shapes across batch operations.
- Increased robustness of generation in disaggregated execution scenarios.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-26T04:23:37Z

📝 Walkthrough

This PR enables uniform draft token handling across disaggregated CTX and GEN peers by relaxing peer compatibility checks, padding empty draft token lists, and pre-filling missing context draft tokens in the executor to ensure consistent speculative decoding input shapes.

Changes

Uniform draft token handling in disaggregated generation

Layer / File(s)	Summary
Peer compatibility, transmission unpacking, and executor draft setup `tensorrt_llm/_torch/disaggregation/native/peer.py`, `tensorrt_llm/_torch/disaggregation/native/transfer.py`, `tensorrt_llm/_torch/pyexecutor/py_executor.py`	`PeerRegistrar._check_peer_compatible` logs info and allows partial layer transfer on layer-count mismatch instead of failing; `RxSession.unpack_aux` pads empty `draft_tokens` with zeros to match `_max_draft_len`; `PyExecutor._prepare_disagg_gen_transmission_complete` pre-fills missing `ctx_draft_tokens` with dummy tokens when speculative decoding is enabled, ensuring GEN receives uniform draft-length inputs.
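The padding step described for `RxSession.unpack_aux` can be illustrated with a minimal sketch. The helper name `pad_empty_draft_tokens`, the `pad_id` parameter, and the sample batch are hypothetical and are not taken from the PR; only the behavior (an empty `draft_tokens` list becomes zeros of length `_max_draft_len`) comes from the summary above.

```python
from typing import List


def pad_empty_draft_tokens(draft_tokens: List[int],
                           max_draft_len: int,
                           pad_id: int = 0) -> List[int]:
    """Hypothetical helper: replace an empty draft list with pad tokens.

    Non-empty lists are returned unchanged (as a copy); this sketch does
    not truncate or extend partially filled lists.
    """
    if draft_tokens:
        return list(draft_tokens)
    return [pad_id] * max_draft_len


# Illustrative batch: the second request arrived without draft tokens.
batch = [[11, 12, 13], [], [21, 22, 23]]
padded = [pad_empty_draft_tokens(d, max_draft_len=3) for d in batch]
```

After padding, every entry in the batch has the same length, which is the uniform input shape the GEN side is said to need.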

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request description is largely empty with only template placeholders and uncompleted sections.	Fill in the Description and Test Coverage sections with specific details about what changes are made and why. Provide at least one relevant test case safeguarding the changes.
Title check	❓ Inconclusive	The title clearly references the JIRA ticket and feature type, but lacks a descriptive summary of what 'gen-only spec dec' enables or the actual changes made.	Expand the title to be more descriptive of the actual changes, such as '[TRTLLM-12958][feat] Allow speculative decoding in generation-only phase' or similar, to clarify the change for developers scanning history.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/peer.py`:
- Around line 100-107: The current unconditional allowance for differing layer
counts in PeerRegistrar can register incompatible peers; change the logic so you
compute the actual transferable overlap (e.g., derive overlapping layer
names/ids via set(self_layers) ∩ set(peer_layers) or by using the existing
pool_mapping logic to compute transferable_layers) and only proceed with the
partial-transfer path when that overlap is non-empty or the difference matches
an explicitly allowed MTP-only delta; otherwise log an error and raise/fail
compatibility instead of silently allowing registration (update the code around
PeerRegistrar and the pool_mapping check and the logger call that currently
emits "layer count differs ... allowing partial layer transfer").
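A rough sketch of the overlap check suggested above, assuming layers can be identified by integer ids. The function name `check_layer_compat` and the `allowed_delta` parameter are hypothetical; the actual `PeerRegistrar` and `pool_mapping` code is not shown in this conversation.

```python
from typing import FrozenSet, Iterable, Set


def check_layer_compat(self_layers: Iterable[int],
                       peer_layers: Iterable[int],
                       allowed_delta: FrozenSet[int] = frozenset()) -> Set[int]:
    """Return the transferable layer ids, or raise if the peer is incompatible.

    Hypothetical stand-in for the compatibility check: partial transfer is
    accepted when the two sides share at least one layer, or when the
    differing layers are all inside an explicitly allowed delta.
    """
    mine, theirs = set(self_layers), set(peer_layers)
    overlap = mine & theirs
    delta = mine ^ theirs  # layers present on exactly one side
    if overlap or (delta and delta <= allowed_delta):
        return overlap
    raise ValueError(
        f"incompatible peer: no shared layers and delta {sorted(delta)} "
        "is not explicitly allowed")


# One side carries an extra layer (id 3); layers 0..2 can still be transferred.
shared = check_layer_compat([0, 1, 2, 3], [0, 1, 2])
```

Failing loudly in the disjoint case keeps a misconfigured peer from being registered silently, which is the concern raised in the comment.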

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05dfc83a-4bcf-4d42-9a00-bb3a8e3859b2

📥 Commits

Reviewing files that changed from the base of the PR and between 526d7ee and 2b57dd9.

📒 Files selected for processing (3)

tensorrt_llm/_torch/disaggregation/native/peer.py
tensorrt_llm/_torch/disaggregation/native/transfer.py
tensorrt_llm/_torch/pyexecutor/py_executor.py

bo-nv · 2026-05-26T10:56:13Z

/bot run --add-multi-gpu-test --disable-fail-fast

bo-nv · 2026-05-27T01:01:57Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-27T01:08:10Z

PR_Github #50415 [ run ] triggered by Bot. Commit: 2cebbef Link to invocation

tensorrt-cicd · 2026-05-27T09:08:58Z

PR_Github #50415 [ run ] completed with state SUCCESS. Commit: 2cebbef
/LLM/main/L0_MergeRequest_PR pipeline #39939 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bo-nv · 2026-06-02T02:04:41Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-02T02:11:15Z

PR_Github #51474 [ run ] triggered by Bot. Commit: a26e28d Link to invocation

tensorrt-cicd · 2026-06-02T09:39:50Z

PR_Github #51474 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #40880 completed with status: 'FAILURE'

bo-nv · 2026-06-02T11:36:45Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-02T14:11:07Z

PR_Github #51609 [ run ] triggered by Bot. Commit: a26e28d Link to invocation

tensorrt-cicd · 2026-06-02T19:45:07Z

PR_Github #51609 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #40992 completed with status: 'FAILURE'

bo-nv · 2026-06-03T01:27:56Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-03T01:33:59Z

PR_Github #51714 [ run ] triggered by Bot. Commit: a26e28d Link to invocation

tensorrt-cicd · 2026-06-03T06:07:28Z

PR_Github #51714 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #41090 completed with status: 'FAILURE'

bo-nv · 2026-06-03T06:56:25Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-03T07:01:51Z

PR_Github #51783 [ run ] triggered by Bot. Commit: a26e28d Link to invocation

tensorrt-cicd · 2026-06-03T09:40:35Z

PR_Github #51783 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #41149 completed with status: 'FAILURE'

bo-nv · 2026-06-03T10:08:44Z

/bot run

tensorrt-cicd · 2026-06-03T10:15:02Z

PR_Github #51823 [ run ] triggered by Bot. Commit: a26e28d Link to invocation

bo-nv · 2026-06-09T20:18:29Z

/bot run

tensorrt-cicd · 2026-06-09T20:24:40Z

PR_Github #53138 [ run ] triggered by Bot. Commit: 12f0890 Link to invocation

tensorrt-cicd · 2026-06-09T20:53:12Z

PR_Github #53138 [ run ] completed with state FAILURE. Commit: 12f0890
/LLM/main/L0_MergeRequest_PR pipeline #42341 completed with status: 'FAILURE'

bo-nv · 2026-06-09T21:18:46Z

/bot run

tensorrt-cicd · 2026-06-09T21:25:47Z

PR_Github #53149 [ run ] triggered by Bot. Commit: 12f0890 Link to invocation

tensorrt-cicd · 2026-06-09T22:01:19Z

PR_Github #53149 [ run ] completed with state FAILURE. Commit: 12f0890
/LLM/main/L0_MergeRequest_PR pipeline #42352 completed with status: 'FAILURE'

bo-nv · 2026-06-09T22:19:01Z

/bot run

tensorrt-cicd · 2026-06-09T22:25:34Z

PR_Github #53162 [ run ] triggered by Bot. Commit: 12f0890 Link to invocation

Signed-off-by: Bo Deng <deemod@nvidia.com>

bo-nv · 2026-06-10T02:23:21Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-10T02:29:01Z

PR_Github #53197 [ run ] triggered by Bot. Commit: 3ee277e Link to invocation

tensorrt-cicd · 2026-06-10T02:32:33Z

PR_Github #53162 [ run ] completed with state ABORTED. Commit: 12f0890

Link to invocation

chuangz0 · 2026-06-10T07:11:17Z

+            # Limit to prompt_len blocks, matching C++ cacheFormatter behavior.
+            # Extra blocks from num_extra_kv_tokens (speculative decoding) have
+            # uninitialized KV data and must not be transferred.
+            total_blocks = (req.prompt_len + tpb - 1) // tpb


Here, along with token_range, the requirement is to only transfer blocks for prompt_len. However, in practice, prompt + num_extra_kv_tokens blocks are allocated.
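To make the gap concrete, here is a small worked example of the block arithmetic. The numbers are illustrative only, and the assumption that allocation covers `prompt_len + num_extra_kv_tokens` tokens rounded up to whole blocks follows the description above rather than the actual allocator code.

```python
def num_blocks(num_tokens: int, tokens_per_block: int) -> int:
    """Ceiling division: blocks needed to hold num_tokens tokens."""
    return (num_tokens + tokens_per_block - 1) // tokens_per_block


# Illustrative values, not taken from the PR.
prompt_len = 96
num_extra_kv_tokens = 3
tpb = 32  # tokens per block

transferred = num_blocks(prompt_len, tpb)                      # 3 blocks
allocated = num_blocks(prompt_len + num_extra_kv_tokens, tpb)  # 4 blocks
untransferred = allocated - transferred                        # 1 block
```

With these values the prompt ends exactly on a block boundary, so the extra tokens spill into a fourth block that the `prompt_len`-based limit leaves behind.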

If MTP is enabled for both the context phase and the generation phase, then the current modification will only transfer prompt_len blocks, and any extra block that may have been allocated will not be transferred. The questions are:

Will the KV cache for num_extra_kv_tokens be written to during the context phase?

Will the KV cache written during the context phase be used by the generation phase?

When both context and generation have MTP enabled, do we need to transfer prompt_len + num_extra_kv_tokens KV blocks?
cc @lfr-0531

@chuangz0 these changes fix the accuracy issue with py-transceiver + eagle3. Without them, accuracy drops even if eagle3 is enabled for both ctx and gen.

Merge first

tensorrt-cicd · 2026-06-10T08:58:58Z

PR_Github #53197 [ run ] completed with state SUCCESS. Commit: 3ee277e
/LLM/main/L0_MergeRequest_PR pipeline #42394 completed with status: 'FAILURE'

bo-nv · 2026-06-10T09:11:33Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-10T09:18:41Z

PR_Github #53286 [ run ] triggered by Bot. Commit: 3ee277e Link to invocation

tensorrt-cicd · 2026-06-10T11:35:09Z

PR_Github #53286 [ run ] completed with state SUCCESS. Commit: 3ee277e
/LLM/main/L0_MergeRequest_PR pipeline #42476 completed with status: 'SUCCESS'

CI Report

Link to invocation

ADEngine subclasses the abstract ModelEngine and does not run PyTorchModelEngine.__init__, so it never set `enable_spec_decode`. After NVIDIA#14546 added an unguarded `self.model_engine.enable_spec_decode` read in `_prepare_disagg_gen_transmission_complete` (the disagg generation handoff path that ADEngine traverses via NVIDIA#14057 AutoDeploy Basic Disagg Support), AutoDeploy disaggregated runs crash with:

    AttributeError: 'ADEngine' object has no attribute 'enable_spec_decode'

NVIDIA#14546 and NVIDIA#14057 each passed CI independently but conflict semantically once both are on main.

Set `is_spec_decode`/`enable_spec_decode` in ADEngine.__init__, mirroring PyTorchModelEngine (enable_spec_decode == spec_config is not None), so ADEngine satisfies the ModelEngine attribute contract that shared PyExecutor code relies on.

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
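The failure mode and the fix described in this commit message can be reproduced with toy classes. Everything below is a hypothetical stand-in (`ModelEngineStub`, `EngineWithoutFlag`, `EngineWithFlag`, `read_spec_flag`), not the real TensorRT-LLM classes; setting `is_spec_decode` to the same value as `enable_spec_decode` is an assumption made for the sketch.

```python
class ModelEngineStub:
    """Stand-in for the abstract base; it defines no attributes itself."""


class EngineWithoutFlag(ModelEngineStub):
    """Mimics a subclass whose __init__ never sets enable_spec_decode."""

    def __init__(self, spec_config=None):
        self.spec_config = spec_config


class EngineWithFlag(ModelEngineStub):
    """Mimics the fix: set the flags the shared executor code reads."""

    def __init__(self, spec_config=None):
        self.spec_config = spec_config
        # Per the commit message: enable_spec_decode == spec_config is not None.
        self.enable_spec_decode = spec_config is not None
        # Assumption for this sketch: is_spec_decode carries the same value.
        self.is_spec_decode = self.enable_spec_decode


def read_spec_flag(engine: ModelEngineStub) -> bool:
    """Unguarded read, like the one in the shared handoff path."""
    return engine.enable_spec_decode


try:
    read_spec_flag(EngineWithoutFlag())
    crashed = False
except AttributeError:
    crashed = True  # the subclass never defined the attribute

flag_off = read_spec_flag(EngineWithFlag())
flag_on = read_spec_flag(EngineWithFlag(spec_config=object()))
```

A guarded read such as `getattr(engine, "enable_spec_decode", False)` would also avoid the crash, but the commit instead makes the subclass satisfy the attribute contract so shared code can stay unguarded.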

bo-nv self-assigned this May 26, 2026

bo-nv requested review from a team as code owners May 26, 2026 04:17

bo-nv requested review from dongxuy04, lancelly and leslie-fang25 May 26, 2026 04:17

bo-nv marked this pull request as draft May 26, 2026 04:17

coderabbitai Bot reviewed May 26, 2026

Comment thread tensorrt_llm/_torch/disaggregation/native/peer.py

bo-nv force-pushed the trtllm-12958 branch from 2cebbef to a26e28d Compare June 2, 2026 02:03

Shixiaowei02 requested review from Shixiaowei02 and chuangz0 June 3, 2026 09:54

bo-nv added 4 commits June 10, 2026 02:19

[TRTLLM-12958][feat] Enable gen-only spec dec

58e200a

Signed-off-by: Bo Deng <deemod@nvidia.com>

clean codes && add tests

124cc70

Signed-off-by: Bo Deng <deemod@nvidia.com>

fix eagle3 acc

e35a919

Signed-off-by: Bo Deng <deemod@nvidia.com>

fix

3ee277e

Signed-off-by: Bo Deng <deemod@nvidia.com>

bo-nv force-pushed the trtllm-12958 branch from 12f0890 to 3ee277e Compare June 10, 2026 02:22

chuangz0 approved these changes Jun 10, 2026

chuangz0 self-requested a review June 10, 2026 07:04

chuangz0 reviewed Jun 10, 2026

bo-nv enabled auto-merge (squash) June 11, 2026 02:47

bo-nv merged commit 228829c into NVIDIA:main Jun 11, 2026
8 checks passed

Shixiaowei02 reviewed Jun 11, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py

Shixiaowei02 mentioned this pull request Jun 11, 2026

[#11423][feat] AutoDeploy: Basic Disagg Support #14057

Merged

1 task

Shixiaowei02 changed the title ~~[TRTLLM-12958][feat] Enable gen-only spec dec~~ [TRTLLM-12958][feat] Enable speculative decoding for dis-agg gen-only requests Jun 29, 2026
