Skip to content

[TRTLLM-13024][perf] Make chat template application non-blocking#15278

Closed
2ez4bz wants to merge 1 commit into
NVIDIA:mainfrom
2ez4bz:dev-async-chat-template
Closed

[TRTLLM-13024][perf] Make chat template application non-blocking#15278
2ez4bz wants to merge 1 commit into
NVIDIA:mainfrom
2ez4bz:dev-async-chat-template

Conversation

@2ez4bz

@2ez4bz 2ez4bz commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

Release Notes

  • Refactor
    • Improved OpenAI-compatible chat API performance through concurrent template rendering and multimodal data processing. Non-blocking operations reduce request latency while maintaining existing validation rules and API compatibility.

Description

Move chat template application out of the event loop
hot path.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz 2ez4bz requested a review from a team as a code owner June 11, 2026 22:52
@2ez4bz 2ez4bz requested a review from hchings June 11, 2026 22:52
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR refactors chat prompt preparation in both openai_chat and openai_mm_encoder endpoints to enable concurrent processing. New async helpers render chat templates in background threads while awaiting multimodal results, reducing blocking I/O. Both endpoints now merge multimodal data conditionally while preserving validation of mutual exclusivity between data and embeddings.

Changes

Concurrent chat prompt preparation

Layer / File(s) Summary
Async helpers for chat template and prompt preparation
tensorrt_llm/serve/openai_server.py
_apply_chat_template_nonblocking offloads template application to the thread pool; _prepare_chat_prompt_inputs_nonblocking overlaps template rendering with multimodal coroutine completion via asyncio.create_task, enforces mutual exclusivity of multimodal data and embeddings, and performs task cancellation cleanup in finally. Typing imports are reformatted.
openai_chat endpoint refactor
tensorrt_llm/serve/openai_server.py
Prompt building now calls _prepare_chat_prompt_inputs_nonblocking with embeddings allowed, replacing the prior synchronous template + multimodal await sequence with overlapping concurrent preparation while forwarding tool/document inputs and prompt token overrides.
openai_mm_encoder endpoint refactor
tensorrt_llm/serve/openai_server.py
Prompt building now calls _prepare_chat_prompt_inputs_nonblocking with allow_mm_embeddings=False, applying the same concurrent preparation pattern but forbidding embedding injection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description is incomplete. While it explains the purpose (moving chat template application out of event loop), the Test Coverage section is empty and lacks specific test information required by the template. Add explicit test case information to the Test Coverage section to demonstrate what tests safeguard these changes.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: making chat template application non-blocking for performance.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/serve/openai_server.py`:
- Around line 1380-1394: The MM-encoder path drops the server-level chat
template because openai_mm_encoder passes only request.chat_template into
_prepare_chat_prompt_inputs_nonblocking; restore the same fallback as
openai_chat by passing (request.chat_template or self.chat_template) into
_prepare_chat_prompt_inputs_nonblocking so the server-configured template is
used when the request omits chat_template (adjust call in openai_mm_encoder
where _prepare_chat_prompt_inputs_nonblocking is invoked).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66f7efc5-ce6d-45dd-8a7e-56ec2687880d

📥 Commits

Reviewing files that changed from the base of the PR and between b586ebf and 2e018d2.

📒 Files selected for processing (1)
  • tensorrt_llm/serve/openai_server.py

Comment thread tensorrt_llm/serve/openai_server.py
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-async-chat-template branch from 2e018d2 to 5ba141e Compare June 11, 2026 23:34
await server.serve(sockets=sockets)


async def _apply_chat_template_nonblocking(**kwargs: Any

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to me that the design in https://github.com/NVIDIA/TensorRT-LLM/pull/15284/changes#diff-d57e5f661eb5980543ac9d0b8bf7f53f62e6ac37fa788130b469ae1a4a7c9e52R709 is cleaner and can be better reused beyond openai_server.py. Can you reconcile these two MRs and do a quick perf retest?

Also please have a unittest for the apply_chat_template function.

@2ez4bz 2ez4bz closed this Jun 12, 2026
@2ez4bz

2ez4bz commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

Closed in favor of #15284

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants