[TRTLLM-11558][feat] Acceptance rate based speculation off in one model path by zheyuf · Pull Request #12905 · NVIDIA/TensorRT-LLM

zheyuf · 2026-04-09T21:52:20Z

Summary by CodeRabbit

Updates
- Enhanced speculative decoding gating mechanism to use acceptance rate-based metrics for improved decision-making on when to disable speculation.
- Renamed speculative decoding configuration parameters: acceptance_window → acceptance_rate_window_size and acceptance_length_threshold → acceptance_rate_threshold.
- New acceptance rate threshold now enforced between 0.0 and 1.0 for consistency.

Description

This PR ports AR-based speculation-off to the one-model path and removes the previous implementation from the two-model path.

Highlights

Users can configure acceptance_rate_window_size and acceptance_rate_threshold through the LLM API.
If the rolling acceptance rate over the recent window drops below acceptance_rate_threshold, speculative decoding is permanently disabled.
This is compatible with batch-size-based dynamic speculation, so both mechanisms can be enabled at the same time.

Main changes compared with the previous two-model-path implementation

The gate is now based on true acceptance rate, rather than acceptance length. This is important for compatibility with dynamic draft length, where the draft length can vary over time.
AR history now records one batch-level acceptance-rate sample per speculation-enabled iteration. Previously, it recorded one request-level average after each request completed.
Parameter naming is now clearer and more consistent.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-09T23:37:27Z

📝 Walkthrough

Walkthrough

This change refactors speculative decoding acceptance-rate gating from per-request average-decoded-tokens tracking to batch-based true acceptance-rate sampling. Configuration parameters and gate API methods are renamed accordingly.

Changes

Cohort / File(s)	Summary
Speculative Decoding Core `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tensorrt_llm/_torch/speculative/speculation_gate.py`	Refactored acceptance tracking from per-request `record_avg_decoded(...)` to batch-based `record_acceptance_rate(...)`. Gate now records true acceptance-rate samples (`acceptance_rate = total_accepted_tokens / total_draft_tokens`) instead of derived metrics from average decoded tokens. Internal state tracking fields renamed (`acceptance_history` → `acceptance_rate_history`, etc.). Added proactive clearing of draft tokens when speculation is permanently disabled.
Configuration Schema `tensorrt_llm/llmapi/llm_args.py`	Renamed acceptance window/threshold fields in `DecodingBaseConfig`: `acceptance_window` → `acceptance_rate_window_size`, `acceptance_length_threshold` → `acceptance_rate_threshold`. New threshold is constrained to `[0.0, 1.0]` and now represents true acceptance rate rather than acceptance length.
Test Suite `tests/unittest/_torch/speculative/test_spec_gate.py`	Updated all speculation gate tests to use new `record_acceptance_rate` API instead of `record_avg_decoded`. Adjusted mock parameters, tracked fields, configuration references, and assertions to align with acceptance-rate semantics. Modified test configuration values for thresholds and cache settings.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: implementing acceptance rate-based speculation disabling in the one-model execution path.
Description check	✅ Passed	The PR description explains the key changes and highlights but lacks detailed Test Coverage section and some checklist items are incomplete.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

zheyuf · 2026-04-09T23:37:41Z

/bot run --disable-fail-fast

coderabbitai

🧹 Nitpick comments (1)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1167-1171: Remove unused new_tokens_lens retrieval.

The new_tokens_lens variable is retrieved from sample_state.host and converted to a list, but it is never used in the method body. This appears to be leftover code from a previous implementation.

♻️ Proposed fix to remove dead code

         if (getattr(self.dist, 'has_pp', False)
                 and not self.dist.is_last_pp_rank):
             return False, None
-        new_tokens_lens = getattr(sample_state.host, 'new_tokens_lens', None)
-        if new_tokens_lens is None:
-            return False, None
-        new_tokens_lens_list = (new_tokens_lens.tolist() if hasattr(
-            new_tokens_lens, 'tolist') else list(new_tokens_lens))
         total_draft_tokens = 0
         total_accepted_tokens = 0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1167 - 1171,
Remove the dead retrieval and conversion of new_tokens_lens: delete the lines
that get new_tokens_lens from sample_state.host, the conditional that checks if
new_tokens_lens is None and returns (False, None), and the conversion to
new_tokens_lens_list (the references to new_tokens_lens, new_tokens_lens_list
and their tolist/list conversion). Ensure no other code in the enclosing
function (e.g., methods using sample_state.host) depends on new_tokens_lens
before removing these statements.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1167-1171: Remove the dead retrieval and conversion of
new_tokens_lens: delete the lines that get new_tokens_lens from
sample_state.host, the conditional that checks if new_tokens_lens is None and
returns (False, None), and the conversion to new_tokens_lens_list (the
references to new_tokens_lens, new_tokens_lens_list and their tolist/list
conversion). Ensure no other code in the enclosing function (e.g., methods using
sample_state.host) depends on new_tokens_lens before removing these statements.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fcb3fcdc-4b58-4922-b217-d415972c41fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2a0be45 and c12fe3b.

📒 Files selected for processing (4)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/speculative/speculation_gate.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/speculative/test_spec_gate.py

tensorrt-cicd · 2026-04-09T23:43:23Z

PR_Github #42591 [ run ] triggered by Bot. Commit: c12fe3b Link to invocation

tensorrt-cicd · 2026-04-10T23:29:39Z

PR_Github #42591 [ run ] completed with state SUCCESS. Commit: c12fe3b
/LLM/main/L0_MergeRequest_PR pipeline #33317 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

zheyuf · 2026-04-11T00:13:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-11T00:19:34Z

PR_Github #42761 [ run ] triggered by Bot. Commit: cdda7c8 Link to invocation

tensorrt-cicd · 2026-04-11T15:50:28Z

PR_Github #42761 [ run ] completed with state SUCCESS. Commit: cdda7c8
/LLM/main/L0_MergeRequest_PR pipeline #33438 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

zheyuf · 2026-04-13T06:06:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-13T06:13:35Z

PR_Github #42966 [ run ] triggered by Bot. Commit: 79c0653 Link to invocation

tensorrt-cicd · 2026-04-13T22:37:55Z

PR_Github #42966 [ run ] completed with state SUCCESS. Commit: 79c0653
/LLM/main/L0_MergeRequest_PR pipeline #33622 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

zheyuf · 2026-04-14T22:30:35Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-14T22:37:18Z

PR_Github #43290 [ run ] triggered by Bot. Commit: ddd812e Link to invocation

tensorrt-cicd · 2026-04-15T03:50:09Z

PR_Github #43290 [ run ] completed with state FAILURE. Commit: ddd812e
/LLM/main/L0_MergeRequest_PR pipeline #33837 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

tensorrt-cicd · 2026-06-11T20:23:16Z

PR_Github #53662 [ run ] triggered by Bot. Commit: 672132c Link to invocation

tensorrt-cicd · 2026-06-12T03:16:06Z

PR_Github #53662 [ run ] completed with state SUCCESS. Commit: 672132c
/LLM/main/L0_MergeRequest_PR pipeline #42803 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zheyuf · 2026-06-12T07:33:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-12T07:39:32Z

PR_Github #53835 [ run ] triggered by Bot. Commit: 8098694 Link to invocation

tensorrt-cicd · 2026-06-12T17:23:52Z

PR_Github #53835 [ run ] completed with state SUCCESS. Commit: 8098694
/LLM/main/L0_MergeRequest_PR pipeline #42948 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zheyuf · 2026-06-12T19:41:48Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-12T19:48:34Z

PR_Github #53948 [ run ] triggered by Bot. Commit: eaa50b7 Link to invocation

tensorrt-cicd · 2026-06-13T02:53:25Z

PR_Github #53948 [ run ] completed with state FAILURE. Commit: eaa50b7
/LLM/main/L0_MergeRequest_PR pipeline #43038 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zheyuf · 2026-06-13T03:20:06Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-13T03:26:20Z

PR_Github #54002 [ run ] triggered by Bot. Commit: 734d277 Link to invocation

tensorrt-cicd · 2026-06-13T10:09:44Z

PR_Github #54002 [ run ] completed with state SUCCESS. Commit: 734d277
/LLM/main/L0_MergeRequest_PR pipeline #43087 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zheyuf · 2026-06-22T17:40:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-22T17:47:21Z

PR_Github #55059 [ run ] triggered by Bot. Commit: a695708 Link to invocation

zheyuf · 2026-06-22T18:29:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-22T18:35:42Z

PR_Github #55070 [ run ] triggered by Bot. Commit: 9cd2fe6 Link to invocation

tensorrt-cicd · 2026-06-22T18:39:26Z

PR_Github #55059 [ run ] completed with state ABORTED. Commit: a695708

Link to invocation

tensorrt-cicd · 2026-06-23T01:22:14Z

PR_Github #55070 [ run ] completed with state FAILURE. Commit: 9cd2fe6
/LLM/main/L0_MergeRequest_PR pipeline #44057 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

zheyuf · 2026-06-23T06:46:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-23T06:52:16Z

PR_Github #55192 [ run ] triggered by Bot. Commit: d30d2e3 Link to invocation

tensorrt-cicd · 2026-06-23T13:59:15Z

PR_Github #55192 [ run ] completed with state SUCCESS. Commit: d30d2e3
/LLM/main/L0_MergeRequest_PR pipeline #44155 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Gate speculative decoding on a rolling per-request acceptance rate. When the moving-average AR over acceptance_rate_window_size falls below acceptance_rate_threshold, speculation is disabled for that request; when it rises back above the threshold, speculation re-engages. - Wire SpeculationGate through the three hot-path call sites (PP, non- overlap, overlap scheduler) and hoist the early-return so the no-op-when-disabled guarantee is visible in the diff. - Rename _record_batch_acceptance_rate -> _update_batch_acceptance_rate and document the overlap-scheduler interaction. - Add the acceptance_rate_window_size / acceptance_rate_threshold knobs to TorchLlmArgs / DecodingBaseConfig. - Add unit tests under tests/unittest/_torch/speculative/test_spec_gate.py. Squashed from: 0e4906a AR based speculation off e099d11 fix CI ef245d7 Rename _record_batch_acceptance_rate and add overlap scheduler comment. 955bd69 [None][perf] Make AR gate no-op explicit at call sites Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>

zheyuf · 2026-06-23T16:47:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-23T16:53:47Z

PR_Github #55277 [ run ] triggered by Bot. Commit: fc0be10 Link to invocation

tensorrt-cicd · 2026-06-24T00:37:27Z

PR_Github #55277 [ run ] completed with state FAILURE. Commit: fc0be10
/LLM/main/L0_MergeRequest_PR pipeline #44230 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned zheyuf Apr 9, 2026

zheyuf marked this pull request as ready for review April 9, 2026 23:35

zheyuf requested review from a team as code owners April 9, 2026 23:35

zheyuf requested review from dongxuy04, hchings, mikeiovine and ziyixiong-nv and removed request for dongxuy04 and ziyixiong-nv April 9, 2026 23:36

coderabbitai Bot reviewed Apr 9, 2026

View reviewed changes

zheyuf force-pushed the 0408_AR_based branch from 2885a3e to cdda7c8 Compare April 11, 2026 00:12

zheyuf force-pushed the 0408_AR_based branch from cdda7c8 to 79c0653 Compare April 13, 2026 06:06

zheyuf force-pushed the 0408_AR_based branch from 79c0653 to ddd812e Compare April 14, 2026 22:30

zheyuf force-pushed the 0408_AR_based branch from ddd812e to 11e8736 Compare April 15, 2026 04:28

zheyuf enabled auto-merge (squash) April 15, 2026 04:28

zheyuf force-pushed the 0408_AR_based branch from 672132c to 8098694 Compare June 12, 2026 07:32

zheyuf force-pushed the 0408_AR_based branch from 8098694 to eaa50b7 Compare June 12, 2026 19:40

zheyuf force-pushed the 0408_AR_based branch from eaa50b7 to 734d277 Compare June 13, 2026 03:20

zheyuf force-pushed the 0408_AR_based branch from a695708 to 9cd2fe6 Compare June 22, 2026 18:29

zheyuf force-pushed the 0408_AR_based branch from 9cd2fe6 to d30d2e3 Compare June 23, 2026 06:45

zheyuf force-pushed the 0408_AR_based branch from d30d2e3 to fc0be10 Compare June 23, 2026 16:42

Conversation

zheyuf commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Highlights

Main changes compared with the previous two-model-path implementation

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

zheyuf commented Apr 9, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Apr 9, 2026

Uh oh!

tensorrt-cicd commented Apr 10, 2026

Uh oh!

zheyuf commented Apr 11, 2026

Uh oh!

tensorrt-cicd commented Apr 11, 2026

Uh oh!

tensorrt-cicd commented Apr 11, 2026

Uh oh!

zheyuf commented Apr 13, 2026

Uh oh!

tensorrt-cicd commented Apr 13, 2026

Uh oh!

tensorrt-cicd commented Apr 13, 2026

Uh oh!

zheyuf commented Apr 14, 2026

Uh oh!

tensorrt-cicd commented Apr 14, 2026

Uh oh!

tensorrt-cicd commented Apr 15, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 12, 2026

Uh oh!

zheyuf commented Jun 12, 2026

Uh oh!

tensorrt-cicd commented Jun 12, 2026

Uh oh!

tensorrt-cicd commented Jun 12, 2026

Uh oh!

zheyuf commented Jun 12, 2026

Uh oh!

tensorrt-cicd commented Jun 12, 2026

Uh oh!

tensorrt-cicd commented Jun 13, 2026

Uh oh!

zheyuf commented Jun 13, 2026

Uh oh!

tensorrt-cicd commented Jun 13, 2026

Uh oh!

tensorrt-cicd commented Jun 13, 2026

Uh oh!

zheyuf commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

zheyuf commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

zheyuf commented Jun 23, 2026

zheyuf commented Apr 9, 2026 •

edited

Loading

coderabbitai Bot commented Apr 9, 2026 •

edited

Loading