docs/internals/llm-gateway.md
Lines changed: 27 additions & 17 deletions
@@ -4,9 +4,9 @@ This document describes the current Layer 3 gateway entry point.

 The current implementation is intentionally narrow:

-- it only handles non-streaming chat requests
+- it handles both complete and streaming chat requests
 - it routes either through native format support or through the OpenAI Chat hub
-- it exposes a typed `ChatResponse<F>` envelope even though the current implementation only returns the `Complete` variant
+- it exposes a typed `ChatResponse<F>` envelope that can return either `Complete` or `Stream`

 ## Files

@@ -23,12 +23,12 @@ The current slice is implemented in two files:

 Its control flow is:

-1. reject `stream=true` up front
+1. ask the format whether the request is streaming
 2. ask the format whether the provider supports a native path
-3. if native support exists, call the native non-streaming path
-4. otherwise bridge through the hub request/response path
+3. if native support exists, call the native path
+4. otherwise use the hub request/response path for complete calls or the hub streaming path for stream calls

-That keeps the current implementation limited to one closed loop: typed request in, typed response out, with `Usage` attached.
+That keeps the current implementation limited to one closed loop: typed request in, typed response out, with `Usage` attached either directly or through a oneshot receiver.
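
The four routing steps in the updated control flow can be sketched as a two-question decision. This is a minimal illustration, not the gateway's code: the `Route` enum, the reduced `ChatFormat` trait, `is_stream`, `has_native_support`, and the toy format are all hypothetical names introduced here. The real format trait exposes `F::native_support()` returning a `NativeHandler`, which this sketch collapses to a boolean.

```rust
/// Which execution path was picked for a request.
/// Hypothetical enum, used only to make the routing decision visible.
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    NativeComplete,
    NativeStream,
    HubComplete,
    HubStream,
}

/// Hypothetical, heavily reduced stand-in for the format trait `F`.
pub trait ChatFormat {
    type Request;

    /// Step 1: does this request ask for a streaming response?
    fn is_stream(req: &Self::Request) -> bool;

    /// Step 2: does the format have a native path for this provider?
    /// (Assumption: the real hook returns a handler, not a bool.)
    fn has_native_support(provider: &str) -> bool;
}

/// Steps 3 and 4: pick the path from the two answers above.
pub fn route<F: ChatFormat>(req: &F::Request, provider: &str) -> Route {
    match (F::has_native_support(provider), F::is_stream(req)) {
        (true, false) => Route::NativeComplete,
        (true, true) => Route::NativeStream,
        (false, false) => Route::HubComplete,
        (false, true) => Route::HubStream,
    }
}

/// Toy request type carrying only the streaming flag.
pub struct ToyRequest {
    pub stream: bool,
}

/// Toy format that is native only for a provider named "native-provider".
pub struct ToyFormat;

impl ChatFormat for ToyFormat {
    type Request = ToyRequest;

    fn is_stream(req: &ToyRequest) -> bool {
        req.stream
    }

    fn has_native_support(provider: &str) -> bool {
        provider == "native-provider"
    }
}
```

Matching on the pair keeps all four outcomes explicit, so adding a new path forces the compiler to flag every routing site that does not handle it.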

 ## Hub Path

@@ -44,34 +44,44 @@ The sequence is:

 This keeps provider-specific JSON shape handling in the provider layer and format-specific response shape handling in the format layer. The gateway itself only orchestrates the sequence.

+For streaming hub calls, the sequence is:
+
+1. `F::to_hub()` converts the request into `ChatCompletionRequest`
+2. `Gateway::call_chat_hub_stream()` runs provider-side request transformation and the HTTP POST
+3. `select_chat_stream_reader()` chooses the raw response reader based on `StreamReaderKind`
+4. `HubChunkStream` parses the provider stream into hub `ChatCompletionChunk` values
+5. `BridgedStream<F>` converts those hub chunks into `F::StreamChunk` and sends final `Usage` through a oneshot channel
+
+Today the gateway only wires `StreamReaderKind::Sse`. Other reader kinds return a validation error until their readers are implemented.
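
The reader selection in step 3 might look roughly like the sketch below. Only the `StreamReaderKind` variant names (`Sse`, `AwsEventStream`, `JsonArrayStream`) come from this document. The function name `select_reader_sketch`, its signature, the `SseReader` placeholder, and the `GatewayError` type are assumptions made for illustration and are not the signature of the real `select_chat_stream_reader()`.

```rust
/// Reader kinds named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StreamReaderKind {
    Sse,
    AwsEventStream,
    JsonArrayStream,
}

/// Hypothetical error type; the real gateway error is not shown in this diff.
#[derive(Debug, PartialEq, Eq)]
pub enum GatewayError {
    Validation(String),
}

/// Hypothetical placeholder for the one raw response reader wired today.
#[derive(Debug, PartialEq, Eq)]
pub struct SseReader;

/// Sketch of reader selection: `Sse` is wired, every other kind is
/// rejected with a validation error until its reader exists.
pub fn select_reader_sketch(kind: StreamReaderKind) -> Result<SseReader, GatewayError> {
    match kind {
        StreamReaderKind::Sse => Ok(SseReader),
        other => Err(GatewayError::Validation(format!(
            "stream reader {:?} is not implemented yet",
            other
        ))),
    }
}
```

Rejecting unimplemented kinds with an error, rather than silently falling back to SSE, keeps a misconfigured provider from producing a stream that parses as garbage.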
+
 ## Native Path

 The native path is used when `F::native_support()` returns a `NativeHandler` for the chosen provider.

-The native path is still non-streaming only.
-
 The sequence is:

 1. `F::call_native()` chooses the endpoint path and request body
 2. `Gateway::call_chat_native()` executes the HTTP POST against the provider instance base URL
-3. `F::parse_native_response()` parses the JSON response into `F::Response`
+3. for complete calls, `F::parse_native_response()` parses the JSON response into `F::Response`
+4. for stream calls, `NativeStream<F>` converts provider-native chunks into `F::StreamChunk` and sends final `Usage` through a oneshot channel

-The gateway currently returns `Usage::default()` for native non-streaming calls because there is not yet a generic format hook for extracting usage out of arbitrary native response types.
+The gateway currently returns `Usage::default()` for native complete calls because there is not yet a generic format hook for extracting usage out of arbitrary native response types.

 ## `ChatResponse<F>`

-`ChatResponse<F>` is introduced now even though the current code path only emits `Complete`.
+`ChatResponse<F>` uses a single public shape for both complete and stream responses.

-That is deliberate for two reasons:
+That is deliberate for three reasons:

 - the public typed entry point should not need a return-type rewrite once streaming is enabled
 - usage has different delivery timing for complete vs stream responses
+- the gateway can box either bridged hub streams or native streams behind one alias
-That avoids hard-coding either `BridgedStream<F>` or `NativeStream<F>` into the response type. The later streaming work can box either stream adapter without changing the outer `ChatResponse<F>` shape.
+That avoids hard-coding either `BridgedStream<F>` or `NativeStream<F>` into the response type. The gateway can box either stream adapter without changing the outer `ChatResponse<F>` shape.
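
To make the "one outer shape, boxed inner stream" idea concrete, here is a small synchronous sketch under loud assumptions: a boxed `Iterator` stands in for the async stream, `std::sync::mpsc` stands in for the oneshot channel, and the field names, the `Usage` fields, `BoxedChunks`, `stream_response`, `collect_text`, and `TextFormat` are all hypothetical. Only the `Complete`/`Stream` split and the fact that usage arrives either directly or through a receiver come from this document.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical usage record; the real `Usage` fields are not shown here.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
}

/// Reduced stand-in for the format trait `F`.
pub trait ChatFormat {
    type Response;
    type StreamChunk;
}

/// One alias hides the concrete adapter type (bridged or native).
/// Assumption: an `Iterator` replaces the async stream to stay std-only.
pub type BoxedChunks<C> = Box<dyn Iterator<Item = C>>;

pub enum ChatResponse<F: ChatFormat> {
    /// Usage is known when the response is returned.
    Complete { response: F::Response, usage: Usage },
    /// Usage is delivered through a receiver, separate from the chunks.
    Stream {
        chunks: BoxedChunks<F::StreamChunk>,
        usage: Receiver<Usage>,
    },
}

/// Wrap any chunk iterator, whatever its concrete type, in `Stream`.
pub fn stream_response<F, I>(chunks: I, final_usage: Usage) -> ChatResponse<F>
where
    F: ChatFormat,
    I: Iterator<Item = F::StreamChunk> + 'static,
{
    let (tx, rx) = channel();
    // A real adapter would send usage when the provider stream ends.
    // The sketch sends it up front; it waits in the channel until read.
    tx.send(final_usage).expect("receiver is still in scope");
    ChatResponse::Stream { chunks: Box::new(chunks), usage: rx }
}

/// Toy format whose responses and chunks are plain strings.
pub struct TextFormat;

impl ChatFormat for TextFormat {
    type Response = String;
    type StreamChunk = String;
}

/// A caller handles both variants through the same outer type.
pub fn collect_text(resp: ChatResponse<TextFormat>) -> (String, Usage) {
    match resp {
        ChatResponse::Complete { response, usage } => (response, usage),
        ChatResponse::Stream { chunks, usage } => {
            let text: String = chunks.collect();
            // Fall back to the default if the sender dropped without sending.
            (text, usage.recv().unwrap_or_default())
        }
    }
}
```

Because `stream_response` accepts any iterator type, a plain `Vec` iterator and a mapped adapter both end up behind the same `BoxedChunks` alias, which is the property the paragraph above relies on.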

 ## Helper Naming

@@ -83,21 +93,21 @@ The gateway layer will later grow non-chat entry points such as embeddings, TTS,

 This module does not attempt to finish the full Layer 3 design.

-- streaming requests are rejected explicitly and deferred to the later streaming gateway work
 - `SessionStore` is not wired yet
 - only `chat_completion()` is implemented as a convenience helper today
 - `messages()` and `responses()` remain deferred until their corresponding formats land
-- native non-streaming usage extraction is still format-specific future work
+- only `StreamReaderKind::Sse` is wired today; `AwsEventStream` and `JsonArrayStream` are still deferred
+- native complete-call usage extraction is still format-specific future work

 ## Why This Slice Exists

 This is the first point where the new provider layer, format layer, and response envelope meet under one runtime entry point.

-Without this slice, later stream work would still be missing:
+Without this slice, the current gateway would still be missing:

 - a typed orchestration entry point
 - a place to choose native vs hub routing
 - a shared `ChatResponse<F>` shape
 - a common provider error mapping path

-That is why the implementation stops at non-streaming correctness first and leaves stream transport integration to later gateway work.
+That is why the implementation first established typed complete-call orchestration and then extended the same entry point to stream transport integration.