[None][perf] offload chat template rendering into async by yechank-nvidia · Pull Request #15284 · NVIDIA/TensorRT-LLM

yechank-nvidia · 2026-06-12T00:22:05Z

Summary by CodeRabbit

Performance Improvements
- Chat template processing is now asynchronous and non-blocking throughout the inference server, enabling concurrent execution with multimodal preprocessing and other operations. This improves request throughput and server responsiveness, with particular benefits for workflows involving multimodal content.
Tests
- Added tests to validate async chat template processing works correctly in concurrent execution scenarios without blocking.

yechank-nvidia · 2026-06-22T05:34:30Z

/bot run

coderabbitai · 2026-06-22T05:39:14Z

📝 Walkthrough

Walkthrough

Introduces async_apply_chat_template in tensorrt_llm/inputs/utils.py as an async wrapper around the existing synchronous apply_chat_template using asyncio.to_thread. Three serve modules (openai_server.py, resource_governor.py, responses_utils.py) are updated to use this async variant, enabling concurrent execution of chat-template rendering alongside multimodal preprocessing coroutines via asyncio.gather.

Changes

Async chat-template rendering and integration

Layer / File(s)	Summary
`async_apply_chat_template` wrapper and test `tensorrt_llm/inputs/utils.py`, `tests/unittest/inputs/test_chat_template_dispatch.py`	Adds `async_apply_chat_template` delegating to `apply_chat_template` via `asyncio.to_thread`. Test asserts the call executes on a worker thread (not the event-loop thread) and returns the expected rendered string.
`openai_server`: concurrent template + multimodal gather `tensorrt_llm/serve/openai_server.py`	Replaces synchronous template application in `openai_chat` and `openai_mm_encoder` with an async `prompt_task` coroutine awaited concurrently with `mm_coroutines` via `asyncio.gather`; multimodal results assigned conditionally on `prompt_token_ids`.
`resource_governor`: async `_convert_messages` `tensorrt_llm/serve/resource_governor.py`	Converts `_convert_messages` to `async def`, adds `asyncio` import, uses `async_apply_chat_template` and `parse_chat_messages_coroutines` gathered concurrently; both `_truncate_kv_cache` call sites updated to `await`.
`responses_utils`: concurrent token + multimodal gather `tensorrt_llm/serve/responses_utils.py`	Replaces sequential token/multimodal computation in `_create_input_tokens` with a concurrent `asyncio.gather` of the token coroutine and multimodal coroutines.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description only contains '`@coderabbitai` summary' without any substantive explanation of the changes, rationale, test coverage, or checklist items required by the template.	Add a comprehensive description explaining what was changed and why, document relevant test coverage, and complete the PR checklist items as required by the repository template.
Docstring Coverage	⚠️ Warning	Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: offloading chat template rendering into async operations for performance improvement.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/unittest/inputs/test_chat_template_dispatch.py (1)
332-359: 🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift

Coverage is insufficient for the full async integration surface.

This test validates the wrapper contract well, but it does not cover the new asyncio.gather integration paths. Please add follow-up tests for:

tests/unittest/serve/test_openai_server.py (both prompt_token_ids and non-prompt_token_ids branches),

tests/unittest/serve/test_resource_governor.py (_convert_messages tokenization path),

tests/unittest/serve/test_responses_utils.py (_create_input_tokens gather + multimodal unpack path).

As per coding guidelines, reviews under tests/** should explicitly assess whether coverage is sufficient and call out concrete follow-up files when it is not.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/inputs/test_chat_template_dispatch.py` around lines 332 - 359,
The test class TestAsyncApplyChatTemplate currently only validates the wrapper
contract through test_runs_in_worker_thread but does not cover the
asyncio.gather integration paths used in the actual application. Add follow-up
test cases in three additional files: test_openai_server.py covering both
prompt_token_ids and non-prompt_token_ids branches, test_resource_governor.py
covering the _convert_messages tokenization path, and test_responses_utils.py
covering the _create_input_tokens gather operation combined with multimodal
unpacking. Each test should verify that async_apply_chat_template integrates
correctly with its respective calling context and that asyncio.gather properly
coordinates the async operations.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/serve/resource_governor.py`:
- Around line 104-106: Replace the use of self.model_config.model_type in the
async_apply_chat_template call with a call to resolve_top_level_model_type to
ensure KV-truncation tokenization uses the same model-type resolver as the
serving paths and maintains consistency for aliased or top-level configs.

In `@tensorrt_llm/serve/responses_utils.py`:
- Line 838: The asyncio.gather assignment at line 838 incorrectly unpacks the
result from mm_coroutines. The mm_coroutines awaitable returns a tuple of
(mm_data, mm_embeddings), but the current code assigns this entire tuple
directly to mm_data, causing _create_input_tokens to return a tuple instead of
just the multimodal data dict as documented. Fix this by properly unpacking the
tuple result in the asyncio.gather assignment so that both mm_data and
mm_embeddings are correctly extracted, rather than assigning the tuple to
mm_data.

---

Nitpick comments:
In `@tests/unittest/inputs/test_chat_template_dispatch.py`:
- Around line 332-359: The test class TestAsyncApplyChatTemplate currently only
validates the wrapper contract through test_runs_in_worker_thread but does not
cover the asyncio.gather integration paths used in the actual application. Add
follow-up test cases in three additional files: test_openai_server.py covering
both prompt_token_ids and non-prompt_token_ids branches,
test_resource_governor.py covering the _convert_messages tokenization path, and
test_responses_utils.py covering the _create_input_tokens gather operation
combined with multimodal unpacking. Each test should verify that
async_apply_chat_template integrates correctly with its respective calling
context and that asyncio.gather properly coordinates the async operations.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd1178bd-f85c-45f3-b9e7-0112fb8b878a

📥 Commits

Reviewing files that changed from the base of the PR and between aef7d47 and 5c5024f.

📒 Files selected for processing (5)

tensorrt_llm/inputs/utils.py
tensorrt_llm/serve/openai_server.py
tensorrt_llm/serve/resource_governor.py
tensorrt_llm/serve/responses_utils.py
tests/unittest/inputs/test_chat_template_dispatch.py

tensorrt-cicd · 2026-06-22T05:40:15Z

PR_Github #55020 [ run ] triggered by Bot. Commit: 5c5024f Link to invocation

tensorrt-cicd · 2026-06-22T07:14:38Z

PR_Github #55020 [ run ] completed with state SUCCESS. Commit: 5c5024f
/LLM/main/L0_MergeRequest_PR pipeline #44011 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-06-22T07:51:53Z

/bot run

tensorrt-cicd · 2026-06-22T07:57:11Z

PR_Github #55028 [ run ] triggered by Bot. Commit: 45efcca Link to invocation

tensorrt-cicd · 2026-06-22T09:26:11Z

PR_Github #55028 [ run ] completed with state SUCCESS. Commit: 45efcca
/LLM/main/L0_MergeRequest_PR pipeline #44019 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-06-23T14:13:32Z

/bot run

tensorrt-cicd · 2026-06-23T14:19:43Z

PR_Github #55250 [ run ] triggered by Bot. Commit: 45efcca Link to invocation

tensorrt-cicd · 2026-06-23T14:58:52Z

PR_Github #55250 [ run ] completed with state SUCCESS. Commit: 45efcca
/LLM/main/L0_MergeRequest_PR pipeline #44208 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-06-24T00:50:46Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T00:56:12Z

PR_Github #55356 [ run ] triggered by Bot. Commit: 45efcca Link to invocation

tensorrt-cicd · 2026-06-24T08:08:21Z

PR_Github #55356 [ run ] completed with state SUCCESS. Commit: 45efcca
/LLM/main/L0_MergeRequest_PR pipeline #44305 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

- resource_governor: resolve the top-level model type (resolve_top_level_ model_type) in _convert_messages, matching the serving call sites, instead of the raw model_config.model_type. - responses_utils: unpack the (mm_data, mm_embeddings) tuple from the asyncio.gather result so _create_input_tokens returns mm_data (not the whole tuple) as its contract states. - tests: add async regression coverage for both gather paths (ResourceGovernor._convert_messages and _create_input_tokens). Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

tburt-nv · 2026-06-24T21:01:30Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T21:07:39Z

PR_Github #55585 [ run ] triggered by Bot. Commit: 644bddc Link to invocation

tensorrt-cicd · 2026-06-25T05:09:15Z

PR_Github #55585 [ run ] completed with state FAILURE. Commit: 644bddc
/LLM/main/L0_MergeRequest_PR pipeline #44503 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

2ez4bz · 2026-06-25T05:28:50Z

+
+    @pytest.mark.asyncio
+    async def test_resource_governor_convert_messages(self, monkeypatch):
+        from unittest.mock import Mock


Nit: move imports to module-level.

github-actions Bot assigned yechank-nvidia Jun 12, 2026

2ez4bz mentioned this pull request Jun 12, 2026

[TRTLLM-13024][perf] Make chat template application non-blocking #15278

Closed

1 task

yechank-nvidia marked this pull request as ready for review June 22, 2026 05:34

yechank-nvidia requested review from a team as code owners June 22, 2026 05:34

yechank-nvidia requested review from Superjomn, moraxu and tijyojwad June 22, 2026 05:34

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread tensorrt_llm/serve/resource_governor.py

Comment thread tensorrt_llm/serve/responses_utils.py Outdated

yechank-nvidia force-pushed the async_chat_template branch from 5c5024f to 45efcca Compare June 22, 2026 07:12

yechank-nvidia added 2 commits June 24, 2026 17:01

[None][perf] offload chat template rendering in serving

be3fa3d

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

tburt-nv force-pushed the async_chat_template branch from 45efcca to 644bddc Compare June 24, 2026 21:01

2ez4bz approved these changes Jun 25, 2026

View reviewed changes

Uh oh!

Conversation

yechank-nvidia commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

yechank-nvidia commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

yechank-nvidia commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

yechank-nvidia commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

yechank-nvidia commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

tburt-nv commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

2ez4bz Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yechank-nvidia commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading