[TRTLLM-11558][feat] Acceptance rate based speculation off in one model path#12905
[TRTLLM-11558][feat] Acceptance rate based speculation off in one model path#12905zheyuf wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis change refactors speculative decoding acceptance-rate gating from per-request average-decoded-tokens tracking to batch-based true acceptance-rate sampling. Configuration parameters and gate API methods are renamed accordingly. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/bot run --disable-fail-fast |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1167-1171: Remove unusednew_tokens_lensretrieval.The
new_tokens_lensvariable is retrieved fromsample_state.hostand converted to a list, but it is never used in the method body. This appears to be leftover code from a previous implementation.♻️ Proposed fix to remove dead code
if (getattr(self.dist, 'has_pp', False) and not self.dist.is_last_pp_rank): return False, None - new_tokens_lens = getattr(sample_state.host, 'new_tokens_lens', None) - if new_tokens_lens is None: - return False, None - new_tokens_lens_list = (new_tokens_lens.tolist() if hasattr( - new_tokens_lens, 'tolist') else list(new_tokens_lens)) total_draft_tokens = 0 total_accepted_tokens = 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1167 - 1171, Remove the dead retrieval and conversion of new_tokens_lens: delete the lines that get new_tokens_lens from sample_state.host, the conditional that checks if new_tokens_lens is None and returns (False, None), and the conversion to new_tokens_lens_list (the references to new_tokens_lens, new_tokens_lens_list and their tolist/list conversion). Ensure no other code in the enclosing function (e.g., methods using sample_state.host) depends on new_tokens_lens before removing these statements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1167-1171: Remove the dead retrieval and conversion of
new_tokens_lens: delete the lines that get new_tokens_lens from
sample_state.host, the conditional that checks if new_tokens_lens is None and
returns (False, None), and the conversion to new_tokens_lens_list (the
references to new_tokens_lens, new_tokens_lens_list and their tolist/list
conversion). Ensure no other code in the enclosing function (e.g., methods using
sample_state.host) depends on new_tokens_lens before removing these statements.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fcb3fcdc-4b58-4922-b217-d415972c41fe
📒 Files selected for processing (4)
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/speculative/speculation_gate.pytensorrt_llm/llmapi/llm_args.pytests/unittest/_torch/speculative/test_spec_gate.py
|
PR_Github #42591 [ run ] triggered by Bot. Commit: |
|
PR_Github #42591 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42761 [ run ] triggered by Bot. Commit: |
|
PR_Github #42761 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42966 [ run ] triggered by Bot. Commit: |
|
PR_Github #42966 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #43290 [ run ] triggered by Bot. Commit: |
|
PR_Github #43290 [ run ] completed with state
|
|
PR_Github #53662 [ run ] triggered by Bot. Commit: |
|
PR_Github #53662 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53835 [ run ] triggered by Bot. Commit: |
|
PR_Github #53835 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53948 [ run ] triggered by Bot. Commit: |
|
PR_Github #53948 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #54002 [ run ] triggered by Bot. Commit: |
|
PR_Github #54002 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55059 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #55070 [ run ] triggered by Bot. Commit: |
|
PR_Github #55059 [ run ] completed with state |
|
PR_Github #55070 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55192 [ run ] triggered by Bot. Commit: |
|
PR_Github #55192 [ run ] completed with state
|
Gate speculative decoding on a rolling per-request acceptance rate. When the moving-average AR over acceptance_rate_window_size falls below acceptance_rate_threshold, speculation is disabled for that request; when it rises back above the threshold, speculation re-engages. - Wire SpeculationGate through the three hot-path call sites (PP, non- overlap, overlap scheduler) and hoist the early-return so the no-op-when-disabled guarantee is visible in the diff. - Rename _record_batch_acceptance_rate -> _update_batch_acceptance_rate and document the overlap-scheduler interaction. - Add the acceptance_rate_window_size / acceptance_rate_threshold knobs to TorchLlmArgs / DecodingBaseConfig. - Add unit tests under tests/unittest/_torch/speculative/test_spec_gate.py. Squashed from: 0e4906a AR based speculation off e099d11 fix CI ef245d7 Rename _record_batch_acceptance_rate and add overlap scheduler comment. 955bd69 [None][perf] Make AR gate no-op explicit at call sites Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
|
/bot run --disable-fail-fast |
|
PR_Github #55277 [ run ] triggered by Bot. Commit: |
|
PR_Github #55277 [ run ] completed with state
|
Summary by CodeRabbit
acceptance_window→acceptance_rate_window_sizeandacceptance_length_threshold→acceptance_rate_threshold.Description
This PR ports AR-based speculation-off to the one-model path and removes the previous implementation from the two-model path.
Highlights
acceptance_rate_window_sizeandacceptance_rate_thresholdthrough the LLM API.acceptance_rate_threshold, speculative decoding is permanently disabled.Main changes compared with the previous two-model-path implementation
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.