docs/en/latest/plugins/ai-proxy-multi.md (+3, -1)
@@ -99,7 +99,9 @@ In addition, the Plugin also supports logging LLM request information in the acc
 | instances.checks.active.unhealthy.http_statuses | array[integer]| False |[429,404,500,501,502,503,504,505]| status code between 200 and 599 inclusive | An array of HTTP status codes that defines an unhealthy node. |
 | instances.checks.active.unhealthy.http_failures | integer | False | 5 | between 1 and 254 inclusive | Number of HTTP failures to define an unhealthy node. |
 | instances.checks.active.unhealthy.timeout | integer | False | 3 | between 1 and 254 inclusive | Number of probe timeouts to define an unhealthy node. |
-| timeout | integer | False | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. |
+| timeout | integer | False | 30000 | greater than or equal to 1 | Request timeout in milliseconds when requesting the LLM service. Applied per socket operation (connect / send / read block); does not cap the total duration of a streaming response. |
+| max_stream_duration_ms | integer | False || greater than or equal to 1 | Maximum wall-clock duration (in milliseconds) for a streaming AI response. If the upstream keeps sending data past this deadline, the gateway closes the connection. Unset means no cap. Use this to protect the gateway from upstream bugs that produce tokens indefinitely. When the limit is hit mid-stream, the downstream SSE stream is truncated (no protocol-specific terminator such as `[DONE]`, `message_stop`, or `response.completed`); well-behaved clients should treat a missing terminator as an incomplete response. |
+| max_response_bytes | integer | False || greater than or equal to 1 | Maximum total bytes read from the upstream for a single AI response (streaming or non-streaming). If exceeded, the gateway closes the connection. For non-streaming responses with `Content-Length`, the check is performed before reading the body; for chunked (no-`Content-Length`) non-streaming responses and for streaming responses, the cap is enforced incrementally as bytes are received. Unset means no cap. |
 | keepalive | boolean | False | true || If true, keep the connection alive when requesting the LLM service. |
 | keepalive_timeout | integer | False | 60000 | greater than or equal to 1000 | Request timeout in milliseconds when requesting the LLM service. |
 | keepalive_pool | integer | False | 30 || Keepalive pool size for when connecting with the LLM service. |
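The `max_stream_duration_ms` row above notes that a truncated stream arrives with no protocol terminator. A minimal client-side sketch of that check, assuming an OpenAI-style SSE stream that normally ends with `data: [DONE]` (the function name is hypothetical, not part of the plugin):

```python
def is_complete(sse_lines):
    """Return True only if the SSE stream ended with the OpenAI-style terminator.

    When the gateway hits max_stream_duration_ms (or max_response_bytes)
    mid-stream, it closes the connection without sending "data: [DONE]",
    so the terminator's absence signals an incomplete response.
    """
    data_lines = [line for line in sse_lines if line.startswith("data:")]
    return bool(data_lines) and data_lines[-1].strip() == "data: [DONE]"

complete = ['data: {"id": "chatcmpl-1"}', "data: [DONE]"]
truncated = ['data: {"id": "chatcmpl-1"}']  # connection closed by the gateway
print(is_complete(complete))   # True
print(is_complete(truncated))  # False
```

For Anthropic- or Responses-API-shaped streams, the same idea applies with `message_stop` or `response.completed` as the expected final event.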
@@ -71,5 +71,7 @@
 | logging.payloads | boolean | False | false || If true, logs request and response payload. |
-| timeout | integer | False | 30000 | ≥ 1 | Request timeout in milliseconds when requesting the LLM service. |
+| timeout | integer | False | 30000 | ≥ 1 | Request timeout in milliseconds when requesting the LLM service. Applied per socket operation (connect / send / read block); does not cap the total duration of a streaming response. |
+| max_stream_duration_ms | integer | False || ≥ 1 | Maximum wall-clock duration (in milliseconds) for a streaming AI response. If the upstream keeps sending data past this deadline, the gateway closes the connection. Unset means no cap. Use this to protect the gateway from upstream bugs that produce tokens indefinitely. When the limit is hit mid-stream, the downstream SSE stream is truncated (no protocol-specific terminator such as `[DONE]`, `message_stop`, or `response.completed`); well-behaved clients should treat a missing terminator as an incomplete response. |
+| max_response_bytes | integer | False || ≥ 1 | Maximum total bytes read from the upstream for a single AI response (streaming or non-streaming). If exceeded, the gateway closes the connection. For non-streaming responses with `Content-Length`, the check is performed before reading the body; for chunked (no-`Content-Length`) non-streaming responses and for streaming responses, the cap is enforced incrementally as bytes are received. Unset means no cap. |
 | keepalive | boolean | False | true || If true, keeps the connection alive when requesting the LLM service. |
 | keepalive_timeout | integer | False | 60000 | ≥ 1000 | Keepalive timeout in milliseconds when connecting to the LLM service. |
 | keepalive_pool | integer | False | 30 || Keepalive pool size for the LLM service connection. |
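Taken together, the new rows describe two independent caps alongside the existing per-socket `timeout`. A hedged sketch of how they might appear in a plugin configuration (field placement at the top level of the plugin config follows the table above; the values are illustrative, and a real config also needs `instances` and provider credentials, omitted here):

```json
{
  "plugins": {
    "ai-proxy-multi": {
      "timeout": 30000,
      "max_stream_duration_ms": 120000,
      "max_response_bytes": 10485760
    }
  }
}
```

Here a single slow read still fails after 30 s, a stream is cut after 2 minutes of wall-clock time regardless of activity, and any response is cut after 10 MiB.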
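The `max_response_bytes` row distinguishes a `Content-Length` pre-check from incremental enforcement for chunked and streaming bodies. The incremental side can be sketched as a generator that stops once the running total passes the cap (this is an illustrative model of the behavior described in the table, not the plugin's implementation):

```python
def capped(chunks, max_response_bytes):
    """Yield upstream chunks until the byte cap is exceeded, then stop.

    Models the incremental enforcement used when no Content-Length is
    available: bytes are counted as they arrive, and the connection is
    closed as soon as the total exceeds the cap.
    """
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_response_bytes:
            return  # gateway closes the upstream connection here
        yield chunk

out = list(capped([b"a" * 4, b"b" * 4, b"c" * 4], max_response_bytes=8))
print(out)  # [b'aaaa', b'bbbb'] -- the third chunk pushed the total past the cap
```

Note that under this model a whole chunk is dropped once the cap is crossed, which matches the "closes the connection" semantics: the client sees a truncated body rather than an exactly-`max_response_bytes` one.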