[TRTLLM-13024][perf] Make chat template application non-blocking by 2ez4bz · Pull Request #15278 · NVIDIA/TensorRT-LLM

2ez4bz · 2026-06-11T22:52:34Z

Summary by CodeRabbit

Release Notes

Refactor
- Improved OpenAI-compatible chat API performance through concurrent template rendering and multimodal data processing. Non-blocking operations reduce request latency while maintaining existing validation rules and API compatibility.

Description

Move chat template application out of the event loop
hot path.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-11T22:57:35Z

📝 Walkthrough

Walkthrough

The PR refactors chat prompt preparation in both openai_chat and openai_mm_encoder endpoints to enable concurrent processing. New async helpers render chat templates in background threads while awaiting multimodal results, reducing blocking I/O. Both endpoints now merge multimodal data conditionally while preserving validation of mutual exclusivity between data and embeddings.

Changes

Concurrent chat prompt preparation

Layer / File(s)	Summary
Async helpers for chat template and prompt preparation `tensorrt_llm/serve/openai_server.py`	`_apply_chat_template_nonblocking` offloads template application to the thread pool; `_prepare_chat_prompt_inputs_nonblocking` overlaps template rendering with multimodal coroutine completion via `asyncio.create_task`, enforces mutual exclusivity of multimodal data and embeddings, and performs task cancellation cleanup in `finally`. Typing imports are reformatted.
openai_chat endpoint refactor `tensorrt_llm/serve/openai_server.py`	Prompt building now calls `_prepare_chat_prompt_inputs_nonblocking` with embeddings allowed, replacing the prior synchronous template + multimodal await sequence with overlapping concurrent preparation while forwarding tool/document inputs and prompt token overrides.
openai_mm_encoder endpoint refactor `tensorrt_llm/serve/openai_server.py`	Prompt building now calls `_prepare_chat_prompt_inputs_nonblocking` with `allow_mm_embeddings=False`, applying the same concurrent preparation pattern but forbidding embedding injection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description is incomplete. While it explains the purpose (moving chat template application out of event loop), the Test Coverage section is empty and lacks specific test information required by the template.	Add explicit test case information to the Test Coverage section to demonstrate what tests safeguard these changes.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: making chat template application non-blocking for performance.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/serve/openai_server.py`:
- Around line 1380-1394: The MM-encoder path drops the server-level chat
template because openai_mm_encoder passes only request.chat_template into
_prepare_chat_prompt_inputs_nonblocking; restore the same fallback as
openai_chat by passing (request.chat_template or self.chat_template) into
_prepare_chat_prompt_inputs_nonblocking so the server-configured template is
used when the request omits chat_template (adjust call in openai_mm_encoder
where _prepare_chat_prompt_inputs_nonblocking is invoked).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66f7efc5-ce6d-45dd-8a7e-56ec2687880d

📥 Commits

Reviewing files that changed from the base of the PR and between b586ebf and 2e018d2.

📒 Files selected for processing (1)

tensorrt_llm/serve/openai_server.py

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

hchings · 2026-06-12T07:32:07Z

        await server.serve(sockets=sockets)
+
+
+async def _apply_chat_template_nonblocking(**kwargs: Any


It appears to me that the design in https://github.com/NVIDIA/TensorRT-LLM/pull/15284/changes#diff-d57e5f661eb5980543ac9d0b8bf7f53f62e6ac37fa788130b469ae1a4a7c9e52R709 is cleaner and can be better reused beyond openai_server.py. Can you reconcile these two MRs and do a quick perf retest?

Also please have a unittest for the apply_chat_template function.

2ez4bz · 2026-06-12T21:50:10Z

Closed in favor of #15284

2ez4bz requested a review from a team as a code owner June 11, 2026 22:52

2ez4bz requested a review from hchings June 11, 2026 22:52

github-actions Bot assigned 2ez4bz Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread tensorrt_llm/serve/openai_server.py

[TRTLLM-13024][perf] Make chat template application non-blocking

5ba141e

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

2ez4bz force-pushed the dev-async-chat-template branch from 2e018d2 to 5ba141e Compare June 11, 2026 23:34

hchings reviewed Jun 12, 2026

View reviewed changes

2ez4bz closed this Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[TRTLLM-13024][perf] Make chat template application non-blocking#15278

[TRTLLM-13024][perf] Make chat template application non-blocking#15278
2ez4bz wants to merge 1 commit into
NVIDIA:mainfrom
2ez4bz:dev-async-chat-template

2ez4bz commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

hchings Jun 12, 2026

Uh oh!

2ez4bz commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		await server.serve(sockets=sockets)


		async def _apply_chat_template_nonblocking(**kwargs: Any

Uh oh!

Conversation

2ez4bz commented Jun 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Jun 11, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hchings Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

2ez4bz commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2ez4bz commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading