fix: improve reliability for slow backend streams #64
ivanopcode wants to merge 4 commits into teabranch:main
Conversation
Review Summary by Qodo
Improve reliability for slow backend streams with heartbeats and separate timeouts
Walkthrough

Description:
- Emit real SSE heartbeat events while backend stream setup is waiting
- Use separate connect and read timeouts for backend requests
- Refactor the heartbeat mechanism to start the client stream immediately
- Update documentation to clarify timeout behavior and client workarounds

Diagram (mermaid flowchart LR):
A["Backend Request"] -->|slow setup| B["_stream_with_keepalive"]
B -->|emit heartbeats| C["Client SSE Stream"]
B -->|upstream ready| D["Stream Data"]
D --> C
E["get_backend_timeout"] -->|short connect| F["30s"]
E -->|long read| G["120s"]
F --> A
G --> A
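The split budget shown in the diagram can be sketched as below. The field names mirror `httpx.Timeout`, but the dataclass, the exact defaults, and the way `BACKEND_CONNECT_TIMEOUT` is read are assumptions based on this summary, not the project's actual code:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendTimeout:
    """Mirrors the httpx.Timeout fields applied to the shared AsyncClient."""
    connect: float
    read: float
    write: float
    pool: float


def get_backend_timeout() -> BackendTimeout:
    # Short connect/pool budget (assumed overridable via BACKEND_CONNECT_TIMEOUT),
    # longer read/write budget so slow prompt processing is not charged against
    # the connection-setup limit. Values are illustrative.
    connect = float(os.getenv("BACKEND_CONNECT_TIMEOUT", "30"))
    return BackendTimeout(connect=connect, read=120.0, write=120.0, pool=connect)
```

The point of the split is that "the backend is unreachable" should fail fast, while "the backend is reachable but thinking" should be allowed a much longer window.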
File Changes
1. src/open_responses_server/api_controller.py
Code Review by Qodo
1.
Pull request overview
Improves reliability of /responses streaming when upstream backends are slow to emit the first token by keeping SSE connections active and by separating backend connect vs. read/write timeout budgets.
Changes:
- Add a keepalive wrapper that emits a real SSE response.heartbeat event while waiting for upstream stream activity (including before the upstream context manager yields).
- Introduce BACKEND_CONNECT_TIMEOUT and a shared get_backend_timeout() profile (short connect/pool, longer read/write) and apply it across backend calls.
- Update tests and docs/README to cover the new heartbeat behavior and timeout semantics.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_llm_client.py | Adds assertions for the new backend timeout profile and client defaults. |
| tests/test_api_controller_endpoints.py | Adds/updates tests verifying SSE heartbeat emission and keepalive wrapper behavior. |
| src/open_responses_server/common/llm_client.py | Adds get_backend_timeout() and applies it to the shared httpx.AsyncClient. |
| src/open_responses_server/common/config.py | Introduces BACKEND_CONNECT_TIMEOUT env var and logs it at startup. |
| src/open_responses_server/chat_completions_service.py | Switches chat-completions requests to use get_backend_timeout(). |
| src/open_responses_server/api_controller.py | Replaces old heartbeat wrapper with _stream_with_keepalive and uses get_backend_timeout() for backend calls. |
| docs/open-responses-server.md | Documents new timeout variables and updated heartbeat semantics. |
| docs/events-and-tool-handling.md | Documents heartbeat as a real SSE event and points to _stream_with_keepalive(). |
| README.md | Updates configuration docs and adds client-side idle-timeout note for slow local models. |
async def _stream_with_keepalive(async_iter_factory, interval, keepalive_event=_HEARTBEAT_EVENT):
    """Yield keepalive comments while a producer waits on upstream stream activity.

    Uses asyncio.wait with timeout so the underlying task is never cancelled.
    This keeps SSE connections alive when the backend LLM is slow to respond.
    Starts the client-facing SSE stream immediately and runs the upstream work
    in a background task, so heartbeats continue even before the backend stream
    context manager yields control. The heartbeat is a real SSE data event so
    clients like Codex reset idle timers after parsing it.
    """
The _stream_with_keepalive docstring says it yields "keepalive comments", but the implementation emits a real SSE data event (data: heartbeat, via _HEARTBEAT_EVENT). Please update the wording to avoid confusion for future maintainers.
- from open_responses_server.common.llm_client import LLMClient
- from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS, STREAM_TIMEOUT
+ from open_responses_server.common.llm_client import LLMClient, get_backend_timeout
+ from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS
OPENAI_BASE_URL_INTERNAL and OPENAI_API_KEY are imported here but not used anywhere in this module. Removing unused imports will reduce confusion and avoid potential lint failures if flake8/pylint are enabled.
Suggested change:
- from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS
+ from open_responses_server.common.config import logger, MAX_TOOL_CALL_ITERATIONS
  async def shutdown_llm_client():
      """Function to be called on application shutdown."""
-     await LLMClient.close_client()  (no newline at end of file)
+     await LLMClient.close_client()
There is trailing whitespace at the end of this line. Please remove it to keep formatting clean and avoid noisy diffs in future edits.
Suggested change (removes the trailing whitespace):
- await LLMClient.close_client()
+ await LLMClient.close_client()
Problem

When the backend takes a long time to process the prompt before emitting the first streamed token, /responses could sit idle for minutes. That exposed two independent failure modes:

- SSE clients could treat the stream as idle and disconnect before the first backend event arrived.
- ORS used a flat backend timeout, so slow prompt processing could hit the same timeout budget that should really apply to connection setup.

This is easiest to reproduce with large local models or very long prompts.
Changes

This MR improves slow-stream handling in two ways:

- Emit real SSE heartbeat events while waiting for the backend stream to yield its first item.
- Use separate connect and read/write timeout budgets for backend requests, so connection setup and slow streaming are governed independently.

The heartbeat is emitted as a real SSE event rather than an SSE comment so clients that drive their idle timer from parsed SSE events can stay connected.
Testing

Tested with:

uv run pytest tests/test_api_controller_endpoints.py tests/test_llm_client.py tests/test_server.py tests/test_chat_completions_service.py

Also verified manually against slow local llama.cpp-backed models where prompt processing can take several minutes before the first chunk is streamed.
Notes
README and server docs were updated to clarify timeout behavior and document a
client-side idle-timeout workaround for slow local backends.
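The client-side workaround amounts to an idle timer that any parsed event resets. A minimal sketch of such a client loop (this is illustrative, not the actual recipe from the README, whose details are not shown in this PR text):

```python
import asyncio


async def read_with_idle_timeout(source, idle_timeout: float):
    """Re-yield items from an async SSE source, failing if it goes quiet.

    Any received event, including a server heartbeat, resets the idle
    clock, which is why heartbeats keep such a client connected.
    """
    it = source.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(it.__anext__(), timeout=idle_timeout)
        except StopAsyncIteration:
            return  # upstream finished normally
        except asyncio.TimeoutError:
            raise TimeoutError("stream idle for too long")
        yield item
```

With server heartbeats arriving faster than `idle_timeout`, this loop never trips, even when the first real token takes minutes.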
The heartbeat is sent as a parsed SSE event with a JSON payload, not only as an
SSE comment. This is intentional: a real event still produces ordinary stream
activity for proxies and load balancers, but it also resets idle timers in
clients that wait for parsed SSE events before considering the stream active.
This behavior was reproduced with Codex CLI, but the change is not specific to it; it benefits any SSE client whose liveness is driven by parsed events rather than by SSE comments, which such clients ignore.
Sending both a comment and an event would add noise without any confirmed
compatibility benefit so far.
May address one class of disconnects discussed in #43.