[None][perf] offload chat template rendering into async #15284

Open
yechank-nvidia wants to merge 2 commits into NVIDIA:main from yechank-nvidia:async_chat_template

@yechank-nvidia yechank-nvidia commented Jun 12, 2026

Collaborator

Summary by CodeRabbit

  • Performance Improvements

    • Chat template processing is now asynchronous and non-blocking throughout the inference server, enabling concurrent execution with multimodal preprocessing and other operations. This improves request throughput and server responsiveness, with particular benefits for workflows involving multimodal content.
  • Tests

    • Added tests to validate async chat template processing works correctly in concurrent execution scenarios without blocking.

@yechank-nvidia

Collaborator Author

/bot run

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

Walkthrough

Introduces async_apply_chat_template in tensorrt_llm/inputs/utils.py as an async wrapper around the existing synchronous apply_chat_template using asyncio.to_thread. Three serve modules (openai_server.py, resource_governor.py, responses_utils.py) are updated to use this async variant, enabling concurrent execution of chat-template rendering alongside multimodal preprocessing coroutines via asyncio.gather.
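
The pattern described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `apply_chat_template` here is a toy stand-in for the real synchronous renderer, and `load_image` and `build_inputs` are hypothetical names. Only `asyncio.to_thread` and `asyncio.gather` are actual standard-library calls.

```python
import asyncio


def apply_chat_template(messages: list[dict]) -> str:
    # Toy stand-in for the synchronous renderer (assumption, not the real API).
    return "".join(f"<{m['role']}>{m['content']}" for m in messages)


async def async_apply_chat_template(messages: list[dict]) -> str:
    # Run the blocking renderer on a worker thread so the event loop stays free.
    return await asyncio.to_thread(apply_chat_template, messages)


async def load_image(url: str) -> dict:
    # Hypothetical multimodal preprocessing coroutine.
    await asyncio.sleep(0)
    return {"image": url}


async def build_inputs(messages: list[dict], url: str) -> tuple[str, dict]:
    # Template rendering and multimodal preprocessing proceed concurrently;
    # gather returns results in the order the awaitables were passed.
    prompt, mm_data = await asyncio.gather(
        async_apply_chat_template(messages),
        load_image(url),
    )
    return prompt, mm_data
```

Because `asyncio.gather` preserves argument order, the caller can unpack the prompt and the multimodal result positionally.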

Changes

Async chat-template rendering and integration

| Layer | File(s) | Summary |
|---|---|---|
| async_apply_chat_template wrapper and test | tensorrt_llm/inputs/utils.py, tests/unittest/inputs/test_chat_template_dispatch.py | Adds async_apply_chat_template delegating to apply_chat_template via asyncio.to_thread. Test asserts the call executes on a worker thread (not the event-loop thread) and returns the expected rendered string. |
| openai_server: concurrent template + multimodal gather | tensorrt_llm/serve/openai_server.py | Replaces synchronous template application in openai_chat and openai_mm_encoder with an async prompt_task coroutine awaited concurrently with mm_coroutines via asyncio.gather; multimodal results assigned conditionally on prompt_token_ids. |
| resource_governor: async _convert_messages | tensorrt_llm/serve/resource_governor.py | Converts _convert_messages to async def, adds asyncio import, uses async_apply_chat_template and parse_chat_messages_coroutines gathered concurrently; both _truncate_kv_cache call sites updated to await. |
| responses_utils: concurrent token + multimodal gather | tensorrt_llm/serve/responses_utils.py | Replaces sequential token/multimodal computation in _create_input_tokens with a concurrent asyncio.gather of the token coroutine and multimodal coroutines. |
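
The worker-thread assertion mentioned for the wrapper test could look something like the sketch below. It is an assumption-laden illustration: `render`, `async_render`, and `check_runs_in_worker_thread` are hypothetical names, and the real test in test_chat_template_dispatch.py may be structured differently.

```python
import asyncio
import threading


def render(messages: list[str]) -> tuple[str, int]:
    # Stand-in renderer that also reports the thread it ran on.
    return "|".join(messages), threading.get_ident()


async def async_render(messages: list[str]) -> tuple[str, int]:
    # Same delegation idea as the wrapper: hand the sync call to a worker thread.
    return await asyncio.to_thread(render, messages)


async def check_runs_in_worker_thread() -> bool:
    loop_thread = threading.get_ident()  # thread running the event loop
    text, worker_thread = await async_render(["a", "b"])
    # The rendered text must be intact and produced off the event-loop thread.
    return text == "a|b" and worker_thread != loop_thread
```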

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description only contains '@coderabbitai summary' without any substantive explanation of the changes, rationale, test coverage, or checklist items required by the template. | Add a comprehensive description explaining what was changed and why, document relevant test coverage, and complete the PR checklist items as required by the repository template. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.15%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: offloading chat template rendering into async operations for performance improvement. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/unittest/inputs/test_chat_template_dispatch.py (1)

332-359: 🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift

Coverage is insufficient for the full async integration surface.

This test validates the wrapper contract well, but it does not cover the new asyncio.gather integration paths. Please add follow-up tests for:

  • tests/unittest/serve/test_openai_server.py (both prompt_token_ids and non-prompt_token_ids branches),
  • tests/unittest/serve/test_resource_governor.py (_convert_messages tokenization path),
  • tests/unittest/serve/test_responses_utils.py (_create_input_tokens gather + multimodal unpack path).

As per coding guidelines, reviews under tests/** should explicitly assess whether coverage is sufficient and call out concrete follow-up files when it is not.
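
One way such a gather-path test might check that rendering no longer blocks the loop is sketched below. Every name here is hypothetical; the idea is that a blocking stand-in renderer can only finish if another coroutine gets to run on the event loop in the meantime.

```python
import asyncio
import threading


async def check_gather_does_not_block_loop() -> bool:
    released = threading.Event()

    def blocking_render() -> bool:
        # Blocks until the event-loop side releases it; returns False on timeout.
        return released.wait(timeout=5.0)

    async def other_work() -> str:
        # Can only run if the loop is free while the renderer is blocked.
        released.set()
        return "mm"

    rendered, mm = await asyncio.gather(
        asyncio.to_thread(blocking_render),
        other_work(),
    )
    return rendered and mm == "mm"
```

If the renderer were called directly on the event loop instead of through `asyncio.to_thread`, `other_work` would never get a chance to set the event and the wait would time out, so the check would return False.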

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/inputs/test_chat_template_dispatch.py` around lines 332 - 359,
The test class TestAsyncApplyChatTemplate currently only validates the wrapper
contract through test_runs_in_worker_thread but does not cover the
asyncio.gather integration paths used in the actual application. Add follow-up
test cases in three additional files: test_openai_server.py covering both
prompt_token_ids and non-prompt_token_ids branches, test_resource_governor.py
covering the _convert_messages tokenization path, and test_responses_utils.py
covering the _create_input_tokens gather operation combined with multimodal
unpacking. Each test should verify that async_apply_chat_template integrates
correctly with its respective calling context and that asyncio.gather properly
coordinates the async operations.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/serve/resource_governor.py`:
- Around line 104-106: Replace the use of self.model_config.model_type in the
async_apply_chat_template call with a call to resolve_top_level_model_type to
ensure KV-truncation tokenization uses the same model-type resolver as the
serving paths and maintains consistency for aliased or top-level configs.

In `@tensorrt_llm/serve/responses_utils.py`:
- Line 838: The asyncio.gather assignment at line 838 incorrectly unpacks the
result from mm_coroutines. The mm_coroutines awaitable returns a tuple of
(mm_data, mm_embeddings), but the current code assigns this entire tuple
directly to mm_data, causing _create_input_tokens to return a tuple instead of
just the multimodal data dict as documented. Fix this by properly unpacking the
tuple result in the asyncio.gather assignment so that both mm_data and
mm_embeddings are correctly extracted, rather than assigning the tuple to
mm_data.
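
The unpacking issue flagged for responses_utils.py can be illustrated with a small sketch. The coroutines below are hypothetical stand-ins, not the actual code at line 838; they only assume, as the comment states, that the multimodal awaitable returns a (mm_data, mm_embeddings) tuple.

```python
import asyncio


async def tokenize() -> list[int]:
    # Hypothetical token coroutine.
    return [1, 2, 3]


async def mm_coroutine() -> tuple[dict, dict]:
    # Hypothetical multimodal coroutine returning (mm_data, mm_embeddings).
    return {"image": ["img0"]}, {"image": ["emb0"]}


async def create_input_tokens_buggy():
    tokens, mm_data = await asyncio.gather(tokenize(), mm_coroutine())
    return tokens, mm_data  # mm_data is the whole (data, embeddings) tuple


async def create_input_tokens_fixed():
    # Nested unpacking pulls the dict out of the second gather result.
    tokens, (mm_data, _mm_embeddings) = await asyncio.gather(
        tokenize(), mm_coroutine()
    )
    return tokens, mm_data  # mm_data is only the multimodal data dict
```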


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd1178bd-f85c-45f3-b9e7-0112fb8b878a

📥 Commits

Reviewing files that changed from the base of the PR and between aef7d47 and 5c5024f.

📒 Files selected for processing (5)
  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/resource_governor.py
  • tensorrt_llm/serve/responses_utils.py
  • tests/unittest/inputs/test_chat_template_dispatch.py

Comment thread tensorrt_llm/serve/resource_governor.py
Comment thread tensorrt_llm/serve/responses_utils.py Outdated
@tensorrt-cicd

Collaborator

PR_Github #55020 [ run ] triggered by Bot. Commit: 5c5024f Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55020 [ run ] completed with state SUCCESS. Commit: 5c5024f
/LLM/main/L0_MergeRequest_PR pipeline #44011 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55028 [ run ] triggered by Bot. Commit: 45efcca Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55028 [ run ] completed with state SUCCESS. Commit: 45efcca
/LLM/main/L0_MergeRequest_PR pipeline #44019 completed with status: 'FAILURE'


@yechank-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55250 [ run ] triggered by Bot. Commit: 45efcca Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55250 [ run ] completed with state SUCCESS. Commit: 45efcca
/LLM/main/L0_MergeRequest_PR pipeline #44208 completed with status: 'FAILURE'


@yechank-nvidia

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55356 [ run ] triggered by Bot. Commit: 45efcca Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55356 [ run ] completed with state SUCCESS. Commit: 45efcca
/LLM/main/L0_MergeRequest_PR pipeline #44305 completed with status: 'FAILURE'


CI Agent Failure Analysis

Link to invocation

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
- resource_governor: resolve the top-level model type
  (resolve_top_level_model_type) in _convert_messages, matching the
  serving call sites, instead of the raw model_config.model_type.
- responses_utils: unpack the (mm_data, mm_embeddings) tuple from the
  asyncio.gather result so _create_input_tokens returns mm_data (not the
  whole tuple) as its contract states.
- tests: add async regression coverage for both gather paths
  (ResourceGovernor._convert_messages and _create_input_tokens).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@tburt-nv force-pushed the async_chat_template branch from 45efcca to 644bddc on June 24, 2026 21:01
@tburt-nv

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55585 [ run ] triggered by Bot. Commit: 644bddc Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55585 [ run ] completed with state FAILURE. Commit: 644bddc
/LLM/main/L0_MergeRequest_PR pipeline #44503 completed with status: 'FAILURE'


CI Agent Failure Analysis

Link to invocation


@pytest.mark.asyncio
async def test_resource_governor_convert_messages(self, monkeypatch):
    from unittest.mock import Mock

Collaborator

Nit: move imports to module-level.


4 participants