diff --git a/docs/internals/llm-gateway.md b/docs/internals/llm-gateway.md
Lines changed: 27 additions & 17 deletions
@@ -4,9 +4,9 @@ This document describes the current Layer 3 gateway entry point.
 
 The current implementation is intentionally narrow:
 
-- it only handles non-streaming chat requests
+- it handles both complete and streaming chat requests
 - it routes either through native format support or through the OpenAI Chat hub
-- it exposes a typed `ChatResponse<F>` envelope even though the current implementation only returns the `Complete` variant
+- it exposes a typed `ChatResponse<F>` envelope that can return either `Complete` or `Stream`
 
 ## Files
 
@@ -23,12 +23,12 @@ The current slice is implemented in two files:
 
 Its control flow is:
 
-1. reject `stream=true` up front
+1. ask the format whether the request is streaming
 2. ask the format whether the provider supports a native path
-3. if native support exists, call the native non-streaming path
-4. otherwise bridge through the hub request/response path
+3. if native support exists, call the native path
+4. otherwise use the hub request/response path for complete calls or the hub streaming path for stream calls
 
-That keeps the current implementation limited to one closed loop: typed request in, typed response out, with `Usage` attached.
+That keeps the current implementation limited to one closed loop: typed request in, typed response out, with `Usage` attached either directly or through a oneshot receiver.
 
 ## Hub Path
 
@@ -44,34 +44,44 @@ The sequence is:
 
 This keeps provider-specific JSON shape handling in the provider layer and format-specific response shape handling in the format layer. The gateway itself only orchestrates the sequence.
 
+For streaming hub calls, the sequence is:
+
+1. `F::to_hub()` converts the request into `ChatCompletionRequest`
+2. `Gateway::call_chat_hub_stream()` runs provider-side request transformation and the HTTP POST
+3. `select_chat_stream_reader()` chooses the raw response reader based on `StreamReaderKind`
+4. `HubChunkStream` parses the provider stream into hub `ChatCompletionChunk` values
+5. `BridgedStream<F>` converts those hub chunks into `F::StreamChunk` and sends final `Usage` through a oneshot channel
+
+Today the gateway only wires `StreamReaderKind::Sse`. Other reader kinds return a validation error until their readers are implemented.
+
 ## Native Path
 
 The native path is used when `F::native_support()` returns a `NativeHandler` for the chosen provider.
 
-The native path is still non-streaming only.
-
 The sequence is:
 
 1. `F::call_native()` chooses the endpoint path and request body
 2. `Gateway::call_chat_native()` executes the HTTP POST against the provider instance base URL
-3. `F::parse_native_response()` parses the JSON response into `F::Response`
+3. for complete calls, `F::parse_native_response()` parses the JSON response into `F::Response`
+4. for stream calls, `NativeStream<F>` converts provider-native chunks into `F::StreamChunk` and sends final `Usage` through a oneshot channel
 
-The gateway currently returns `Usage::default()` for native non-streaming calls because there is not yet a generic format hook for extracting usage out of arbitrary native response types.
+The gateway currently returns `Usage::default()` for native complete calls because there is not yet a generic format hook for extracting usage out of arbitrary native response types.
 
 ## `ChatResponse<F>`
 
-`ChatResponse<F>` is introduced now even though the current code path only emits `Complete`.
+`ChatResponse<F>` uses a single public shape for both complete and stream responses.
 
-That is deliberate for two reasons:
+That is deliberate for three reasons:
 
 - the public typed entry point should not need a return-type rewrite once streaming is enabled
 - usage has different delivery timing for complete vs stream responses
+- the gateway can box either bridged hub streams or native streams behind one alias
 
 The stream field uses a type-erased alias:
 
 - `ChatResponseStream<F> = Pin<Box<dyn Stream<Item = Result<F::StreamChunk>> + Send>>`
 
-That avoids hard-coding either `BridgedStream<F>` or `NativeStream<F>` into the response type. The later streaming work can box either stream adapter without changing the outer `ChatResponse<F>` shape.
+That avoids hard-coding either `BridgedStream<F>` or `NativeStream<F>` into the response type. The gateway can box either stream adapter without changing the outer `ChatResponse<F>` shape.
 
 ## Helper Naming
 
@@ -83,21 +93,21 @@ The gateway layer will later grow non-chat entry points such as embeddings, TTS,
 
 This module does not attempt to finish the full Layer 3 design.
 
-- streaming requests are rejected explicitly and deferred to the later streaming gateway work
 - `SessionStore` is not wired yet
 - only `chat_completion()` is implemented as a convenience helper today
 - `messages()` and `responses()` remain deferred until their corresponding formats land
-- native non-streaming usage extraction is still format-specific future work
+- only `StreamReaderKind::Sse` is wired today; `AwsEventStream` and `JsonArrayStream` are still deferred
+- native complete-call usage extraction is still format-specific future work
 
 ## Why This Slice Exists
 
 This is the first point where the new provider layer, format layer, and response envelope meet under one runtime entry point.
 
-Without this slice, later stream work would still be missing:
+Without this slice, the current gateway would still be missing:
 
 - a typed orchestration entry point
 - a place to choose native vs hub routing
 - a shared `ChatResponse<F>` shape
 - a common provider error mapping path
 
-That is why the implementation stops at non-streaming correctness first and leaves stream transport integration to later gateway work.
+That is why the implementation established typed complete-call orchestration first and then extended the same entry point to cover stream transport.
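
The routing and response envelope described in this change can be illustrated with a minimal, std-only sketch. It is not the gateway's actual code. The field names (`response`, `usage`, `stream`, `usage_rx`), the `Route` enum, the `call_chat` function, and the token counts are all hypothetical. A boxed `Iterator` stands in for the boxed async `Stream` behind `ChatResponseStream<F>`, and a `std::sync::mpsc` channel stands in for the oneshot channel that carries the final `Usage`.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical stand-in for the gateway's `Usage` type.
#[derive(Debug, Default, Clone, PartialEq)]
struct Usage {
    total_tokens: u32,
}

/// Stand-in for `ChatResponseStream<F>`: a boxed, type-erased chunk source.
/// The real alias boxes an async `Stream`; an `Iterator` keeps this sketch std-only.
type ChunkStream = Box<dyn Iterator<Item = Result<String, String>> + Send>;

/// Assumed envelope shape: one public type for both delivery modes.
enum ChatResponse {
    /// Usage is available immediately.
    Complete { response: String, usage: Usage },
    /// Usage arrives only after the stream has been drained.
    Stream { stream: ChunkStream, usage_rx: Receiver<Usage> },
}

/// Hypothetical routing decision (native support vs. the hub bridge).
#[derive(Clone, Copy)]
enum Route {
    Native,
    Hub,
}

/// Mirrors the documented control flow: check streaming, check routing, dispatch.
fn call_chat(streaming: bool, route: Route) -> ChatResponse {
    let label = match route {
        Route::Native => "native",
        Route::Hub => "hub",
    };

    if !streaming {
        let usage = match route {
            // Per the document, native complete calls return `Usage::default()`.
            Route::Native => Usage::default(),
            // Placeholder value; the real hub path reports provider usage.
            Route::Hub => Usage { total_tokens: 7 },
        };
        return ChatResponse::Complete {
            response: format!("{} complete", label),
            usage,
        };
    }

    let (tx, usage_rx) = channel();
    let mut tx = Some(tx);
    let mut chunks = vec![format!("{} chunk 1", label), format!("{} chunk 2", label)].into_iter();

    // Yield chunks, then send the final usage exactly once when the source ends.
    let stream = std::iter::from_fn(move || match chunks.next() {
        Some(chunk) => Some(Ok(chunk)),
        None => {
            if let Some(tx) = tx.take() {
                let _ = tx.send(Usage { total_tokens: 2 });
            }
            None
        }
    });

    ChatResponse::Stream {
        stream: Box::new(stream),
        usage_rx,
    }
}
```

A caller matches on the variant once. For `Complete` it reads `usage` directly; for `Stream` it drains the chunks and then receives `Usage` from the channel. Because the stream is boxed, the caller never learns whether a bridged or native adapter produced it.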