
fix: Use CompletionStreamResponse for streaming completions usage chunk#12758

Draft
jhaotingc wants to merge 2 commits into NVIDIA:main from jhaotingc:fix/completion-stream-object-type

Conversation

@jhaotingc
Collaborator


Description

The completion_stream_post_processor incorrectly uses ChatCompletionStreamResponse for the final usage-only chunk in streaming /v1/completions responses. This causes the
last SSE chunk to include "object": "chat.completion.chunk" instead of the expected "object": "text_completion", breaking OpenAI-compatible clients (e.g., aiperf) that
validate the object type field per endpoint.

Only the final usage-only chunk is affected — all regular token-streaming chunks already correctly use CompletionStreamResponse (line 492). The bug is only in the usage
chunk at line 512, and only triggers when stream_options.include_usage is true.

Repro

Start trtllm-serve with any model using the PyTorch backend, then send:

curl -s http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "...", "prompt": "Hello", "max_tokens": 3,
       "stream": true, "stream_options": {"include_usage": true}}'

Before (bug)

Regular token chunks have the correct type, but the final usage-only chunk has the wrong type:

data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Kitty"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Cafe"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" opens","finish_reason":"length"}],"usage":null}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk",...,"choices":[],"usage":{...}} <- BUG
data: [DONE]

After (fix)

data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Kitty"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" Cafe"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[{"text":" opens","finish_reason":"length"}],"usage":null}
data: {"id":"cmpl-...","object":"text_completion",...,"choices":[],"usage":{...}} <- FIXED
data: [DONE]

Fix

One-line change in tensorrt_llm/serve/postprocess_handlers.py:512: replace ChatCompletionStreamResponse with CompletionStreamResponse.
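Schematically, the change looks like the diff below. This is illustrative only: the variable name is invented and the constructor arguments are elided, not copied from the actual source.

```diff
--- a/tensorrt_llm/serve/postprocess_handlers.py
+++ b/tensorrt_llm/serve/postprocess_handlers.py
-        usage_chunk = ChatCompletionStreamResponse(...)
+        usage_chunk = CompletionStreamResponse(...)
```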

Test plan

  • Llama-3.1-8B-Instruct: streaming /v1/completions with include_usage returns correct text_completion for all chunks
  • Chat completions endpoint unaffected (uses ChatCompletionStreamResponse correctly at line 319)
  • Pre-commit hooks pass

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

The completion_stream_post_processor incorrectly used
ChatCompletionStreamResponse for the final usage-only chunk, causing
streaming /v1/completions responses to include "object":
"chat.completion.chunk" instead of the expected "object":
"text_completion". This breaks OpenAI-compatible clients (e.g., aiperf)
that validate the object type field per endpoint.

Replace ChatCompletionStreamResponse with CompletionStreamResponse at
line 512 to match the type already used for regular streaming chunks
(line 492).

Signed-off-by: Jhao-Ting Chen <jhaotingc@users.noreply.github.com>
Add assertions to test_completion_stream_options to verify that all
streaming chunks (including the final usage-only chunk) return
"object": "text_completion" for the /v1/completions endpoint. This
guards against regressions where the usage chunk might incorrectly
use ChatCompletionStreamResponse.

Signed-off-by: Jhao-Ting Chen <jhaotingc@users.noreply.github.com>
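The regression check described in this commit boils down to an assertion of the following shape. This is a sketch, not the actual `test_completion_stream_options` code: the helper name and the inline sample chunks are made up for illustration.

```python
import json

def assert_all_text_completion(raw_chunks):
    """Every streamed /v1/completions chunk, including the final
    usage-only chunk with empty choices, must carry
    object == "text_completion"."""
    for raw in raw_chunks:
        chunk = json.loads(raw)
        assert chunk["object"] == "text_completion", chunk

# The fixed stream passes, including the empty-choices usage chunk.
fixed = [
    '{"id":"cmpl-1","object":"text_completion","choices":[{"text":" Hi"}],"usage":null}',
    '{"id":"cmpl-1","object":"text_completion","choices":[],"usage":{"total_tokens":4}}',
]
assert_all_text_completion(fixed)
```

Run against the pre-fix server, the usage chunk's "chat.completion.chunk" value trips the assertion, which is exactly the regression this test guards against.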
