
fix: Use CompletionStreamResponse for streaming completions usage chunk#12758

Draft
jhaotingc wants to merge 2 commits into NVIDIA:main from jhaotingc:fix/completion-stream-object-type

Conversation

@jhaotingc
Collaborator


Description

The completion_stream_post_processor incorrectly uses ChatCompletionStreamResponse for the final usage-only chunk in streaming /v1/completions responses. This causes the
last SSE chunk to include "object": "chat.completion.chunk" instead of the expected "object": "text_completion", breaking OpenAI-compatible clients (e.g., aiperf) that
validate the object type field per endpoint.

Only the final usage-only chunk is affected — all regular token-streaming chunks already correctly use CompletionStreamResponse (line 492). The bug is only in the usage
chunk at line 512, and only triggers when stream_options.include_usage is true.

Repro

Start trtllm-serve with any model using the PyTorch backend, then send:

curl -s http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "...", "prompt": "Hello", "max_tokens": 3,
       "stream": true, "stream_options": {"include_usage": true}}'

Before (bug)

Regular token chunks have the correct type, but the final usage-only chunk has the wrong type:

data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Kitty"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Cafe"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" opens","finish_reason":"length"}],"usage":null}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk",...,"choices":[],"usage":{...}} <- BUG
data: [DONE]

After (fix)

data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Kitty"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Cafe"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" opens","finish_reason":"length"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[],"usage":{...}} <- FIXED
data: [DONE]

Fix

One-line change in tensorrt_llm/serve/postprocess_handlers.py:512: replace ChatCompletionStreamResponse with CompletionStreamResponse.
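Schematically, the change looks like the diff below. This is illustrative only: the variable name is invented and the constructor arguments are elided, not copied from the actual source.

```diff
--- a/tensorrt_llm/serve/postprocess_handlers.py
+++ b/tensorrt_llm/serve/postprocess_handlers.py
-        usage_chunk = ChatCompletionStreamResponse(...)
+        usage_chunk = CompletionStreamResponse(...)
```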

Test plan

  • Llama-3.1-8B-Instruct: streaming /v1/completions with include_usage returns correct text_completion for all chunks
  • Chat completions endpoint unaffected (uses ChatCompletionStreamResponse correctly at line 319)
  • Pre-commit hooks pass

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

The completion_stream_post_processor incorrectly used
ChatCompletionStreamResponse for the final usage-only chunk, causing
streaming /v1/completions responses to include "object":
"chat.completion.chunk" instead of the expected "object":
"text_completion". This breaks OpenAI-compatible clients (e.g., aiperf)
that validate the object type field per endpoint.

Replace ChatCompletionStreamResponse with CompletionStreamResponse at
line 512 to match the type already used for regular streaming chunks
(line 492).

Signed-off-by: Jhao-Ting Chen <jhaotingc@users.noreply.github.com>
Add assertions to test_completion_stream_options to verify that all
streaming chunks (including the final usage-only chunk) return
"object": "text_completion" for the /v1/completions endpoint. This
guards against regressions where the usage chunk might incorrectly
use ChatCompletionStreamResponse.

Signed-off-by: Jhao-Ting Chen <jhaotingc@users.noreply.github.com>
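The regression check described in this commit boils down to an assertion of the following shape. This is a sketch, not the actual `test_completion_stream_options` code: the helper name and the inline sample chunks are made up for illustration.

```python
import json

def assert_all_text_completion(raw_chunks):
    """Every streamed /v1/completions chunk, including the final
    usage-only chunk with empty choices, must carry
    object == "text_completion"."""
    for raw in raw_chunks:
        chunk = json.loads(raw)
        assert chunk["object"] == "text_completion", chunk

# The fixed stream passes, including the empty-choices usage chunk.
fixed = [
    '{"id":"cmpl-1","object":"text_completion","choices":[{"text":" Hi"}],"usage":null}',
    '{"id":"cmpl-1","object":"text_completion","choices":[],"usage":{"total_tokens":4}}',
]
assert_all_text_completion(fixed)
```

Run against the pre-fix server, the usage chunk's "chat.completion.chunk" value trips the assertion, which is exactly the regression this test guards against.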
