Skip to content

feat: response body mode framework for streaming support#169

Open
noyitz wants to merge 1 commit into
llm-d:mainfrom
noyitz:feat/response-body-mode-framework
Open

feat: response body mode framework for streaming support#169
noyitz wants to merge 1 commit into
llm-d:mainfrom
noyitz:feat/response-body-mode-framework

Conversation

@noyitz

@noyitz noyitz commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add BodyProcessingMode type (Skip, Chunks, Full) and ResponseBodyRequirement interface for response plugins to declare their body access needs
  • Pre-compute NeedsResponseBuffering per profile at startup based on plugin declarations
  • When NeedsResponseBuffering=false for the selected profile, ack each response body chunk immediately — enabling real-time client streaming
  • Plugins that don't implement the interface default to Full (backward compatible)

Body Processing Modes

Mode Behavior Use case
Skip Plugin doesn't need response body Headers-only plugins
Chunks Plugin processes chunks as they stream (must implement ChunkProcessor) Metering, logging
Full Framework buffers entire response before calling plugin API translation, body mutation

How it works

At startup, the config loader iterates each profile's response plugins and checks their BodyProcessingMode() declaration:

  • If ANY plugin declares FullNeedsResponseBuffering=true for that profile
  • Otherwise → NeedsResponseBuffering=false, chunks are acked immediately

This is a per-profile decision, so different profiles can have different buffering behavior.

Test plan

  • Unit tests for computeResponseBuffering covering all mode combinations
  • Tests for validation: Chunks without ChunkProcessor implementation fails at startup
  • Multi-profile test verifying independent buffering decisions

@github-actions github-actions Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 15, 2026
@github-actions

Copy link
Copy Markdown

⚠️ Large PR detected

Your PR is large. Please consider breaking it into multiple PRs.

The do-not-merge/hold label has been added and can be removed by the reviewers based on their judgement.

@github-actions

Copy link
Copy Markdown

⚠️ Large PR detected

Your PR is large. Please consider breaking it into multiple PRs.

The do-not-merge/hold label has been added and can be removed by the reviewers based on their judgement.

noyitz added a commit to noyitz/llm-d-inference-payload-processor that referenced this pull request Jun 17, 2026
Add ChunkProcessor interface that plugins declaring BodyChunked must
implement. The framework validates this at startup and pre-computes
the list of ChunkProcessors per profile. When streaming (no buffering),
each response body chunk is passed through all ChunkProcessors before
being acked to Envoy, enabling in-flight chunk transformation.

Depends on llm-d#169

Signed-off-by: Noy Itzikowitz <nitzikow@redhat.com>
@noyitz noyitz force-pushed the feat/response-body-mode-framework branch from bf93834 to 381fe43 Compare June 17, 2026 16:11
@github-actions

Copy link
Copy Markdown

⚠️ Large PR detected

Your PR is large. Please consider breaking it into multiple PRs.

The do-not-merge/hold label has been added and can be removed by the reviewers based on their judgement.

@nirrozenbaum nirrozenbaum removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2026

## Design

### Response Body Mode Declaration

@nirrozenbaum nirrozenbaum Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to propose an alternative.
instead of making plugins declare if they require full body or can operate on chunks, I propose the following:

go to the file pkg/frameworkinterfacerequesthandling. split the ResponseProcessor interface into two interfaces:

type ResponseBodyProcessor interface {
	plugin.Plugin
	// ProcessResponseBody runs the ResponseBodyProcessor plugin.
	// ResponseBodyProcessor can mutate the headers and/or the body of the response.
	ProcessResponseBody(ctx context.Context, cycleState *plugin.CycleState, response *InferenceResponse) error
}

type ResponseChunkProcessor interface {
	plugin.Plugin
	// ProcessResponseChunk runs the ResponseChunkProcessor plugin.
	// ResponseChunkProcessor can operate on a single chunk and can mutate the headers and/or the body of the response.
	ProcessResponseChunk(ctx context.Context, cycleState *plugin.CycleState, responseChunk *InferenceResponseChunk) error // InferenceRequestChunk doesn't exist. need to see if the response can be reused or a new type is needed.

then, Profile should be extended to include both ResponseChunkProcessor[] AND ResponseBodyProcessor[].

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — reworked the PR to follow your approach. Split ResponseProcessor into:

  • ResponseProcessor — processes full buffered body (existing interface, unchanged)
  • ResponseChunkProcessor — processes individual chunks without buffering (new)

The config loader sorts response plugins into the right list based on which interface they implement. Profile now has ResponsePlugins, ResponseChunkProcessors, and NeedsResponseBuffering (computed automatically from len(ResponsePlugins) > 0).

Removed the BodyProcessingMode enum and ResponseBodyRequirement interface entirely. Proposal doc removed too.

Comment thread pkg/config/loader/configloader.go Outdated
return nil, err
}

computeResponseBuffering(profiles, logger)

@nirrozenbaum nirrozenbaum Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't require a separate function. if you follow the proposal I've added, this can be folded into buildProfiles function, where we need to maintain for each profile a boolean field RequireResponseBuffering: bool (which is actually a result of if len(ResponseBodyProcessor) > 0, so maybe we don't even need the extra field).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — folded into buildProfiles. The buffering flag is computed inline: theProfile.NeedsResponseBuffering = len(theProfile.ResponsePlugins) > 0

Comment on lines +71 to +75
Plugins that don't implement `ResponseBodyRequirement` get a warning:
```
INFO Response plugin does not declare ResponseBodyRequirement, defaulting to BodyFull profile=default plugin=legacy-plugin/legacy
```

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this. llm-d IPP is a new repo and hasn't been through the process of a formal release yet. so we don't need to support backward compatibility, which can simplify a bit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. No backward compat shims.


## Future: Dynamic Mode Override

Envoy ext_proc supports `mode_override` in responses. A future enhancement could override `response_body_mode` to `NONE` per-request when no response plugin in the selected profile needs the body at all. This is out of scope for the initial implementation.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very interesting. I didn't know this option exist.
can you open an issue about this so we can track it and make sure it goes into next version?
100% agree this is not first priority.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will open a tracking issue for the FULL_DUPLEX_STREAMED per-chunk envoy mode.

…cessing

Split response processing into two interfaces per review feedback:

- ResponseProcessor: processes the complete buffered response body (existing)
- ResponseChunkProcessor: processes individual chunks without buffering (new)

The Profile now carries both processor lists and a NeedsResponseBuffering
flag. The config loader sorts response plugins into the right list based
on which interface they implement. A plugin can implement both.

When ResponseProcessor plugins are present → buffer the full response.
When only ResponseChunkProcessors are present → stream chunks through.
When neither is present → buffer (backward compatible).

Signed-off-by: Noy Itzikowitz <nitzikow@redhat.com>
@noyitz noyitz force-pushed the feat/response-body-mode-framework branch from 381fe43 to 5f2053c Compare June 22, 2026 12:36
@github-actions github-actions Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants