treat vLLM input validation 500s as 400 for circuit breaker (#24)
* treat vLLM input validation 500s as 400 for circuit breaker
vLLM returns 500 for prompt-too-long errors instead of 400. This
causes the circuit breaker to penalise healthy workers for bad client
input. Rewrite the status to 400 when the response body matches known
input validation patterns.
* handle vLLM input validation 500s for streaming requests too
Move the 500 body inspection before the stream/non-stream branch so
both paths get the 400 rewrite. vLLM error responses are always
synchronous JSON even when the client requested streaming.
* handle 500 body read errors explicitly instead of unwrap_or_default
Log the transport error and return a diagnostic 500 to the caller
rather than silently forwarding an empty body with stale headers.
* fix double load decrement for 500 responses
Only decrement load in the early-return path for rewritten 400s
(input validation). Genuine 500s are retryable and the caller
retry closure already handles their load cleanup. Also properly
handle body read errors without swallowing them.

+                    "Failed to read 500 response body from worker_url={}: {}",
+                    worker_url, e
+                );
+                return (
+                    StatusCode::INTERNAL_SERVER_ERROR,
+                    format!("Failed to read upstream response: {}", e),
+                )
+                    .into_response();
+            }
+        }
+    }
+
     if !is_stream {
         // For non-streaming requests, preserve headers
         let response_headers = header_utils::preserve_response_headers(res.headers());
@@ -1887,6 +1953,23 @@ mod tests {
         }
     }
 
+    #[test]
+    fn test_is_vllm_input_validation_error() {
+        // Prompt too long error from vLLM
+        let body = br#"{"error":{"message":"The prompt is 65537 tokens, which exceeds the model's maximum context length of 65536 tokens. Please reduce the length of the input prompt.","type":"Internal Server Error","param":null,"code":500}}"#;
+        assert!(is_vllm_input_validation_error(body));
+
+        // Actual server error should not match
+        let body = br#"{"error":{"message":"CUDA out of memory","type":"Internal Server Error","param":null,"code":500}}"#;
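
The commit message describes rewriting a 500 to a 400 when the response body matches known input validation patterns, but the body of `is_vllm_input_validation_error` is not visible in this excerpt. The following is a minimal sketch of how such a check might look, not the code from this commit. The pattern list and the `effective_status` helper are assumptions made for illustration; only the function name and the two sample bodies come from the test above.

```rust
/// Hypothetical substrings that mark a vLLM 500 body as a client input error.
/// Assumption: the real pattern list is not shown in this excerpt; this single
/// entry is taken from the "prompt too long" message in the test.
const INPUT_VALIDATION_PATTERNS: &[&str] = &["maximum context length"];

/// Sketch: returns true when a 500 response body looks like an input
/// validation failure rather than a genuine server fault.
fn is_vllm_input_validation_error(body: &[u8]) -> bool {
    // Lossy decoding avoids failing on a body that is not valid UTF-8.
    let text = String::from_utf8_lossy(body);
    INPUT_VALIDATION_PATTERNS.iter().any(|p| text.contains(*p))
}

/// Hypothetical helper: the status to report to the caller and the circuit
/// breaker. Only a 500 whose body matches is rewritten to 400; every other
/// status passes through unchanged.
fn effective_status(upstream_status: u16, body: &[u8]) -> u16 {
    if upstream_status == 500 && is_vllm_input_validation_error(body) {
        400
    } else {
        upstream_status
    }
}
```

Substring matching keeps the check cheap and independent of the exact JSON shape, at the cost of needing a new pattern whenever vLLM words an input error differently. A genuine failure such as "CUDA out of memory" stays a 500 and therefore still counts against the worker.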