feat: response body mode framework for streaming support#169
Conversation
|
Your PR is large. Please consider breaking it into multiple PRs. The |
|
Your PR is large. Please consider breaking it into multiple PRs. The |
Add ChunkProcessor interface that plugins declaring BodyChunked must implement. The framework validates this at startup and pre-computes the list of ChunkProcessors per profile. When streaming (no buffering), each response body chunk is passed through all ChunkProcessors before being acked to Envoy, enabling in-flight chunk transformation. Depends on llm-d#169 Signed-off-by: Noy Itzikowitz <nitzikow@redhat.com>
bf93834 to
381fe43
Compare
|
Your PR is large. Please consider breaking it into multiple PRs. The |
|
|
||
| ## Design | ||
|
|
||
| ### Response Body Mode Declaration |
There was a problem hiding this comment.
I'd like to propose an alternative.
instead of making plugins declare if they require full body or can operate on chunks, I propose the following:
go to the file pkg/frameworkinterfacerequesthandling. split the ResponseProcessor interface into two interfaces:
type ResponseBodyProcessor interface {
plugin.Plugin
// ProcessResponseBody runs the ResponseBodyProcessor plugin.
// ResponseBodyProcessor can mutate the headers and/or the body of the response.
ProcessResponseBody(ctx context.Context, cycleState *plugin.CycleState, response *InferenceResponse) error
}
type ResponseChunkProcessor interface {
plugin.Plugin
// ProcessResponseChunk runs the ResponseChunkProcessor plugin.
// ResponseChunkProcessor can operate on a single chunk and can mutate the headers and/or the body of the response.
ProcessResponseChunk(ctx context.Context, cycleState *plugin.CycleState, responseChunk *InferenceResponseChunk) error // InferenceRequestChunk doesn't exist. need to see if the response can be reused or a new type is needed.then, Profile should be extended to include both ResponseChunkProcessor[] AND ResponseBodyProcessor[].
There was a problem hiding this comment.
Done — reworked the PR to follow your approach. Split ResponseProcessor into:
ResponseProcessor— processes full buffered body (existing interface, unchanged)ResponseChunkProcessor— processes individual chunks without buffering (new)
The config loader sorts response plugins into the right list based on which interface they implement. Profile now has ResponsePlugins, ResponseChunkProcessors, and NeedsResponseBuffering (computed automatically from len(ResponsePlugins) > 0).
Removed the BodyProcessingMode enum and ResponseBodyRequirement interface entirely. Proposal doc removed too.
| return nil, err | ||
| } | ||
|
|
||
| computeResponseBuffering(profiles, logger) |
There was a problem hiding this comment.
this doesn't require a separate function. if you follow the proposal I've added, this can be folded into buildProfiles function, where we need to maintain for each profile a boolean field RequireResponseBuffering: bool (which is actually a result of if len(ResponseBodyProcessor) > 0, so maybe we don't even need the extra field).
There was a problem hiding this comment.
Done — folded into buildProfiles. The buffering flag is computed inline: theProfile.NeedsResponseBuffering = len(theProfile.ResponsePlugins) > 0
| Plugins that don't implement `ResponseBodyRequirement` get a warning: | ||
| ``` | ||
| INFO Response plugin does not declare ResponseBodyRequirement, defaulting to BodyFull profile=default plugin=legacy-plugin/legacy | ||
| ``` | ||
|
|
There was a problem hiding this comment.
we don't need this. llm-d IPP is a new repo and hasn't been through the process of a formal release yet. so we don't need to support backward compatibility, which can simplify a bit.
There was a problem hiding this comment.
Removed. No backward compat shims.
|
|
||
| ## Future: Dynamic Mode Override | ||
|
|
||
| Envoy ext_proc supports `mode_override` in responses. A future enhancement could override `response_body_mode` to `NONE` per-request when no response plugin in the selected profile needs the body at all. This is out of scope for the initial implementation. |
There was a problem hiding this comment.
this is very interesting. I didn't know this option exist.
can you open an issue about this so we can track it and make sure it goes into next version?
100% agree this is not first priority.
There was a problem hiding this comment.
Will open a tracking issue for the FULL_DUPLEX_STREAMED per-chunk envoy mode.
…cessing Split response processing into two interfaces per review feedback: - ResponseProcessor: processes the complete buffered response body (existing) - ResponseChunkProcessor: processes individual chunks without buffering (new) The Profile now carries both processor lists and a NeedsResponseBuffering flag. The config loader sorts response plugins into the right list based on which interface they implement. A plugin can implement both. When ResponseProcessor plugins are present → buffer the full response. When only ResponseChunkProcessors are present → stream chunks through. When neither is present → buffer (backward compatible). Signed-off-by: Noy Itzikowitz <nitzikow@redhat.com>
381fe43 to
5f2053c
Compare
Summary
BodyProcessingModetype (Skip,Chunks,Full) andResponseBodyRequirementinterface for response plugins to declare their body access needsNeedsResponseBufferingper profile at startup based on plugin declarationsNeedsResponseBuffering=falsefor the selected profile, ack each response body chunk immediately — enabling real-time client streamingFull(backward compatible)Body Processing Modes
SkipChunksChunkProcessor)FullHow it works
At startup, the config loader iterates each profile's response plugins and checks their
BodyProcessingMode()declaration:Full→NeedsResponseBuffering=truefor that profileNeedsResponseBuffering=false, chunks are acked immediatelyThis is a per-profile decision, so different profiles can have different buffering behavior.
Test plan
computeResponseBufferingcovering all mode combinationsChunkswithoutChunkProcessorimplementation fails at startup