[TRTLLM-12982][perf] reuse multi-item scoring position_ids and params#15413
Conversation
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthrough
ChangesMulti-item part_lens: prepare() ownership refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
766-770: 💤 Low valueConsider adding
strict=Truetozip()and using iterable unpacking.Static analysis flagged two minor style issues:
- Line 766-767:
zip()withoutstrict=parameter. While both iterables are guaranteed same-length by construction, addingstrict=Trueimproves defensiveness.- Line 769-770: List concatenation can use iterable unpacking for clarity.
♻️ Optional style improvements
range_ends = torch.tensor( [ item_len + 1 for req_part_lens, token_pos_in_items_raw_len in zip( - multi_item_part_lens, token_pos_in_items_raw_lens) + multi_item_part_lens, token_pos_in_items_raw_lens, strict=True) for item_len in ( - req_part_lens[1:] + - [token_pos_in_items_len - token_pos_in_items_raw_len]) + [*req_part_lens[1:], + token_pos_in_items_len - token_pos_in_items_raw_len]) ],🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/attention_backend/flashinfer.py` around lines 766 - 770, In the nested list comprehension around the zip() call, add the strict=True parameter to the zip() function that iterates over multi_item_part_lens and token_pos_in_items_raw_lens to ensure both iterables have the same length. Additionally, replace the list concatenation operation that combines req_part_lens[1:] with a single-element list containing the computation (token_pos_in_items_len - token_pos_in_items_raw_len) with iterable unpacking syntax for improved clarity.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 4319-4321: The assert statement checking multi_item_part_lens is
None causes a hard failure (AssertionError) when multi-item scoring is requested
on the CUDA graph path, but eager execution already supports this input. Instead
of asserting and aborting the request, implement a fallback mechanism: when
multi_item_part_lens is present in the graph execution path, either route the
computation to non-graph (eager) execution, or disable graph selection upstream
before this code point is reached. Replace the assert statement with conditional
logic that gracefully handles the multi-item case without raising an exception.
---
Nitpick comments:
In `@tensorrt_llm/_torch/attention_backend/flashinfer.py`:
- Around line 766-770: In the nested list comprehension around the zip() call,
add the strict=True parameter to the zip() function that iterates over
multi_item_part_lens and token_pos_in_items_raw_lens to ensure both iterables
have the same length. Additionally, replace the list concatenation operation
that combines req_part_lens[1:] with a single-element list containing the
computation (token_pos_in_items_len - token_pos_in_items_raw_len) with iterable
unpacking syntax for improved clarity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 79445b4c-c3eb-454c-9dc2-d90ff60b1a42
📒 Files selected for processing (8)
tensorrt_llm/_torch/attention_backend/flashinfer.pytensorrt_llm/_torch/attention_backend/interface.pytensorrt_llm/_torch/attention_backend/star_flashinfer.pytensorrt_llm/_torch/attention_backend/trtllm.pytensorrt_llm/_torch/attention_backend/vanilla.pytensorrt_llm/_torch/modules/attention.pytensorrt_llm/_torch/pyexecutor/model_engine.pytensorrt_llm/llmapi/llm.py
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/modules/attention.py
|
PR_Github #54575 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #54578 [ run ] triggered by Bot. Commit: |
|
PR_Github #54575 [ run ] completed with state |
|
Note: SGLang is employing a similar optimization (sgl-project/sglang#10979) |
|
/bot run --disable-fail-fast |
|
PR_Github #54664 [ run ] triggered by Bot. Commit: |
|
PR_Github #54578 [ run ] completed with state
|
|
PR_Github #54664 [ run ] completed with state |
800c7ee to
1110424
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #54776 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
1 similar comment
|
/bot run --disable-fail-fast |
|
PR_Github #55225 [ run ] triggered by Bot. Commit: |
|
PR_Github #55225 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55252 [ run ] triggered by Bot. Commit: |
|
PR_Github #55252 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
/bot run --stage-list "GB200-4GPUs-PyTorch-PerfSanity-1,GB200-4GPUs-PyTorch-PerfSanity-2" |
|
/bot help |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run See details below for each supported subcommand. Details
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request. skip
Skip testing for latest commit on pull request. reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
PR_Github #55266 [ run ] triggered by Bot. Commit: |
|
PR_Github #55267 [ run ] triggered by Bot. Commit: |
|
PR_Github #55266 [ run ] completed with state |
|
/bot run --stage-list "GB200-4GPUs-PyTorch-PerfSanity-1,GB200-4GPUs-PyTorch-PerfSanity-2" |
|
PR_Github #55267 [ run ] completed with state
|
|
PR_Github #55273 [ run ] triggered by Bot. Commit: |
|
PR_Github #55273 [ run ] completed with state
|
|
/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-1,GB200-4_GPUs-PyTorch-PerfSanity-2" |
|
PR_Github #55426 [ run ] triggered by Bot. Commit: |
|
/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-1,GB200-4_GPUs-PyTorch-PerfSanity-2" --disable-fail-fast |
|
PR_Github #55426 [ run ] completed with state
|
|
PR_Github #55429 [ run ] triggered by Bot. Commit: |
|
PR_Github #55429 [ run ] completed with state |
|
Around 2026-06-23 10:00 CEST, all but two stages had passed. At the time, the branch had last been updated around 2026-06-22 20:00 CEST, the corresponding commit is now less than 50 commits behind |
|
/bot skip --comment "#15413 (comment)" |
|
PR_Github #55472 [ skip ] triggered by Bot. Commit: |
|
PR_Github #55472 [ skip ] completed with state |
Description
Implements performance improvements for multi-item scoring:
FlashInferMultiItemParamsare reused across model layersposition_idsare computed on GPU and reused across model layersNote for reviewers: Changeset contains moved code, so consider something like
git diff --color-moved=dimmed-zebra -w.Test Coverage
Covered by existing tests.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.Summary by CodeRabbit
Refactor
Performance
Improvements