fix: improve reliability for slow backend streams#64
Conversation
Review Summary by QodoImprove reliability for slow backend streams with heartbeats and separate timeouts
WalkthroughsDescription• Emit real SSE heartbeat events while backend stream setup is waiting • Use separate connect and read timeouts for backend requests • Refactor heartbeat mechanism to start client stream immediately • Update documentation to clarify timeout behavior and client workarounds Diagramflowchart LR
A["Backend Request"] -->|slow setup| B["_stream_with_keepalive"]
B -->|emit heartbeats| C["Client SSE Stream"]
B -->|upstream ready| D["Stream Data"]
D --> C
E["get_backend_timeout"] -->|short connect| F["30s"]
E -->|long read| G["120s"]
F --> A
G --> A
File Changes1. src/open_responses_server/api_controller.py
|
Code Review by Qodo
1.
|
There was a problem hiding this comment.
Pull request overview
Improves reliability of /responses streaming when upstream backends are slow to emit the first token by keeping SSE connections active and by separating backend connect vs. read/write timeout budgets.
Changes:
- Add a keepalive wrapper that emits a real SSE
response.heartbeatevent while waiting for upstream stream activity (including before the upstream context manager yields). - Introduce
BACKEND_CONNECT_TIMEOUTand a sharedget_backend_timeout()profile (short connect/pool, longer read/write) and apply it across backend calls. - Update tests and docs/README to cover the new heartbeat behavior and timeout semantics.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_llm_client.py | Adds assertions for the new backend timeout profile and client defaults. |
| tests/test_api_controller_endpoints.py | Adds/updates tests verifying SSE heartbeat emission and keepalive wrapper behavior. |
| src/open_responses_server/common/llm_client.py | Adds get_backend_timeout() and applies it to the shared httpx.AsyncClient. |
| src/open_responses_server/common/config.py | Introduces BACKEND_CONNECT_TIMEOUT env var and logs it at startup. |
| src/open_responses_server/chat_completions_service.py | Switches chat-completions requests to use get_backend_timeout(). |
| src/open_responses_server/api_controller.py | Replaces old heartbeat wrapper with _stream_with_keepalive and uses get_backend_timeout() for backend calls. |
| docs/open-responses-server.md | Documents new timeout variables and updated heartbeat semantics. |
| docs/events-and-tool-handling.md | Documents heartbeat as a real SSE event and points to _stream_with_keepalive(). |
| README.md | Updates configuration docs and adds client-side idle-timeout note for slow local models. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Addressed the latest cleanup comments in
|
Problem
When the backend takes a long time to process the prompt before emitting the
first streamed token,
/responsescould sit idle for minutes.That exposed two independent failure modes:
SSE clients could treat the stream as idle and disconnect before the first
backend event arrived.
ORS used a flat backend timeout, so slow prompt processing could hit the
same timeout budget that should really apply to connection setup.
This is easiest to reproduce with large local models or very long prompts.
Changes
This MR improves slow-stream handling in two ways:
stream to yield its first item
The heartbeat is emitted as a real SSE event rather than an SSE comment so
clients that drive their idle timer from parsed SSE events can stay connected.
Testing
Tested with:
uv run pytest tests/test_api_controller_endpoints.py tests/test_llm_client.py tests/test_server.py tests/test_chat_completions_service.pyAlso verified manually against slow local llama.cpp-backed models where prompt
processing can take several minutes before the first chunk is streamed.
Notes
README and server docs were updated to clarify timeout behavior and document a
client-side idle-timeout workaround for slow local backends.
The heartbeat is sent as a parsed SSE event with a JSON payload, not only as an
SSE comment. This is intentional: a real event still produces ordinary stream
activity for proxies and load balancers, but it also resets idle timers in
clients that wait for parsed SSE events before considering the stream active.
This behavior was reproduced with Codex CLI, but the change is not specific to
it - it benefits any SSE client whose liveness is driven by parsed events
rather than ignored SSE comments.
Sending both a comment and an event would add noise without any confirmed
compatibility benefit so far.
May address one class of disconnects discussed in #43.