[TRTLLM-13024][perf] Make chat template application non-blocking#15278
[TRTLLM-13024][perf] Make chat template application non-blocking#152782ez4bz wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThe PR refactors chat prompt preparation in both ChangesConcurrent chat prompt preparation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/serve/openai_server.py`:
- Around line 1380-1394: The MM-encoder path drops the server-level chat
template because openai_mm_encoder passes only request.chat_template into
_prepare_chat_prompt_inputs_nonblocking; restore the same fallback as
openai_chat by passing (request.chat_template or self.chat_template) into
_prepare_chat_prompt_inputs_nonblocking so the server-configured template is
used when the request omits chat_template (adjust call in openai_mm_encoder
where _prepare_chat_prompt_inputs_nonblocking is invoked).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 66f7efc5-ce6d-45dd-8a7e-56ec2687880d
📒 Files selected for processing (1)
tensorrt_llm/serve/openai_server.py
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2e018d2 to
5ba141e
Compare
| await server.serve(sockets=sockets) | ||
|
|
||
|
|
||
| async def _apply_chat_template_nonblocking(**kwargs: Any |
There was a problem hiding this comment.
It appears to me that the design in https://github.com/NVIDIA/TensorRT-LLM/pull/15284/changes#diff-d57e5f661eb5980543ac9d0b8bf7f53f62e6ac37fa788130b469ae1a4a7c9e52R709 is cleaner and can be better reused beyond openai_server.py. Can you reconcile these two MRs and do a quick perf retest?
Also please have a unittest for the apply_chat_template function.
|
Closed in favor of #15284 |
Summary by CodeRabbit
Release Notes
Description
Move chat template application out of the event loop
hot path.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.