
fix: improve reliability for slow backend streams #64

Open

ivanopcode wants to merge 4 commits into teabranch:main from ivanopcode:fix/slow-backend-streams

Conversation


ivanopcode commented Apr 9, 2026

Problem

When the backend takes a long time to process the prompt before emitting the
first streamed token, /responses could sit idle for minutes.

That exposed two independent failure modes:

  1. SSE clients could treat the stream as idle and disconnect before the first
    backend event arrived.

  2. ORS used a flat backend timeout, so slow prompt processing was charged
    against the same timeout budget that should only apply to connection setup.

This is easiest to reproduce with large local models or very long prompts.

Changes

This PR improves slow-stream handling in two ways:

  • emit a real SSE heartbeat event while ORS is still waiting for the upstream
    stream to yield its first item
  • use separate backend timeout profiles:
    • short connect timeout
    • longer read timeout for slow prompt processing

The heartbeat is emitted as a real SSE event rather than an SSE comment so
clients that drive their idle timer from parsed SSE events can stay connected.

Testing

Tested with:

  • uv run pytest tests/test_api_controller_endpoints.py tests/test_llm_client.py tests/test_server.py tests/test_chat_completions_service.py

Also verified manually against slow local llama.cpp-backed models where prompt
processing can take several minutes before the first chunk is streamed.

Notes

README and server docs were updated to clarify timeout behavior and document a
client-side idle-timeout workaround for slow local backends.

The heartbeat is sent as a parsed SSE event with a JSON payload, not only as an
SSE comment. This is intentional: a real event still produces ordinary stream
activity for proxies and load balancers, but it also resets idle timers in
clients that wait for parsed SSE events before considering the stream active.
This behavior was reproduced with Codex CLI, but the change is not specific to
it: it benefits any SSE client whose liveness is driven by parsed events
rather than ignored SSE comments.
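
For illustration, a heartbeat frame on the wire can look like the following; the response.heartbeat event name is documented in this PR, while the exact JSON payload fields shown here are an assumption:

```
event: response.heartbeat
data: {"type": "response.heartbeat"}
```

An SSE comment, by contrast, is a line starting with a colon (for example `: keepalive`), which compliant parsers discard before the application ever sees it.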

Sending both a comment and an event would add noise without any confirmed
compatibility benefit so far.

May address one class of disconnects discussed in #43.

@qodo-code-review

Review Summary by Qodo

Improve reliability for slow backend streams with heartbeats and separate timeouts

🐞 Bug fix ✨ Enhancement


Walkthroughs

Description
• Emit real SSE heartbeat events while backend stream setup is waiting
• Use separate connect and read timeouts for backend requests
• Refactor heartbeat mechanism to start client stream immediately
• Update documentation to clarify timeout behavior and client workarounds
Diagram

```mermaid
flowchart LR
  A["Backend Request"] -->|slow setup| B["_stream_with_keepalive"]
  B -->|emit heartbeats| C["Client SSE Stream"]
  B -->|upstream ready| D["Stream Data"]
  D --> C
  E["get_backend_timeout"] -->|short connect| F["30s"]
  E -->|long read| G["120s"]
  F --> A
  G --> A
```


File Changes

1. src/open_responses_server/api_controller.py ✨ Enhancement +147/-123

Refactor heartbeat mechanism to real SSE events with queue-based producer

• Replace _with_heartbeat() with _stream_with_keepalive(), which uses a queue-based producer pattern (sketched below)
• Emit real SSE data events (event: response.heartbeat) instead of SSE comments
• Start client-facing stream immediately while upstream work runs in background task
• Replace STREAM_TIMEOUT with get_backend_timeout() for separate timeout handling
• Refactor create_response() streaming logic to wrap LLM event stream with keepalive

src/open_responses_server/api_controller.py
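
For orientation, a minimal Python sketch of the queue-based keepalive pattern described above. The names _stream_with_keepalive, _HEARTBEAT_EVENT, and the sentinel objects come from this PR's diff; the wrapper body is a reconstruction from the described behavior (asyncio.wait with a timeout, heartbeats before the upstream yields), not the PR's exact code.

```python
import asyncio

_STREAM_DONE = object()
_STREAM_ERROR = object()
# Event name matches the PR's docs; the JSON payload fields are illustrative.
_HEARTBEAT_EVENT = (
    b"event: response.heartbeat\n"
    b'data: {"type": "response.heartbeat"}\n\n'
)


async def _stream_with_keepalive(async_iter_factory, interval, keepalive_event=_HEARTBEAT_EVENT):
    # Unbounded, as in the PR's diff; the review below suggests bounding it.
    queue = asyncio.Queue()

    async def producer():
        try:
            async for item in async_iter_factory():
                await queue.put(item)
        except Exception as exc:
            await queue.put((_STREAM_ERROR, exc))
        finally:
            await queue.put(_STREAM_DONE)

    task = asyncio.create_task(producer())
    get_task = None
    try:
        while True:
            if get_task is None:
                get_task = asyncio.ensure_future(queue.get())
            # asyncio.wait with a timeout leaves get_task pending rather than
            # cancelling it, so no queued item is lost across heartbeats.
            done, _ = await asyncio.wait({get_task}, timeout=interval)
            if not done:
                yield keepalive_event  # upstream silent: keep the SSE stream alive
                continue
            item = get_task.result()
            get_task = None
            if item is _STREAM_DONE:
                break
            if isinstance(item, tuple) and item and item[0] is _STREAM_ERROR:
                raise item[1]
            yield item
    finally:
        if get_task is not None:
            get_task.cancel()
        task.cancel()
```

The persistent queue.get() task is the detail that lets heartbeats fire without ever cancelling the pending read, so no upstream item is dropped.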


2. src/open_responses_server/chat_completions_service.py ✨ Enhancement +10/-6

Use separate timeout profiles for backend requests

• Import get_backend_timeout() function from llm_client module
• Replace all STREAM_TIMEOUT references with get_backend_timeout() calls
• Apply to both streaming and non-streaming request handlers

src/open_responses_server/chat_completions_service.py


3. src/open_responses_server/common/llm_client.py ✨ Enhancement +14/-2

Add separate connect and read timeout configuration

• Add BACKEND_CONNECT_TIMEOUT constant set to 30 seconds
• Implement get_backend_timeout() function returning httpx.Timeout with separate connect/read timeouts
• Update LLMClient initialization to use get_backend_timeout() instead of flat timeout

src/open_responses_server/common/llm_client.py


4. tests/test_api_controller_endpoints.py 🧪 Tests +96/-54

Update tests for new keepalive event mechanism

• Rename _with_heartbeat to _stream_with_keepalive and _HEARTBEAT to _HEARTBEAT_EVENT
• Add new test test_responses_streaming_sends_keepalive_before_backend_yields() verifying heartbeats during backend setup delay (sketched below)
• Refactor TestWithHeartbeat class to TestStreamWithKeepalive with updated test cases
• Update tests to verify real SSE data events instead of sentinel objects

tests/test_api_controller_endpoints.py
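
In that spirit, a hedged sketch of what such a test can look like (assumes pytest-asyncio and the wrapper sketched earlier; the real test body in this PR may differ):

```python
import asyncio

import pytest

from open_responses_server.api_controller import _stream_with_keepalive


@pytest.mark.asyncio
async def test_keepalive_emitted_before_upstream_yields():
    async def slow_upstream():
        await asyncio.sleep(0.05)  # simulate slow prompt processing
        yield b"data: real-event\n\n"

    chunks = [chunk async for chunk in _stream_with_keepalive(slow_upstream, interval=0.01)]

    # At least one heartbeat precedes the first real upstream chunk.
    assert any(b"response.heartbeat" in chunk for chunk in chunks[:-1])
    assert chunks[-1] == b"data: real-event\n\n"
```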


5. tests/test_llm_client.py 🧪 Tests +13/-0

Add tests for separate timeout configuration

• Add test for get_backend_timeout() verifying separate connect and read timeouts
• Import BACKEND_CONNECT_TIMEOUT and get_backend_timeout from llm_client
• Verify client initialization uses new timeout configuration

tests/test_llm_client.py


6. README.md 📝 Documentation +11/-3

Document timeout behavior and client-side workarounds

• Clarify STREAM_TIMEOUT as backend read timeout with separate 30s connect timeout
• Add client-side configuration example for Codex CLI with longer idle timeout
• Update HEARTBEAT_INTERVAL description to reference SSE heartbeat events

README.md


7. docs/events-and-tool-handling.md 📝 Documentation +13/-9

Document real SSE heartbeat event format and behavior

• Update heartbeat documentation to describe real SSE data events instead of comments
• Document event format: event: response.heartbeat with JSON payload
• Explain rationale for using data events to reset client idle timers
• Reference updated _stream_with_keepalive() implementation

docs/events-and-tool-handling.md


8. docs/open-responses-server.md 📝 Documentation +2/-2

Update configuration documentation for timeout behavior

• Clarify STREAM_TIMEOUT as backend read timeout with separate 30s connect timeout
• Update HEARTBEAT_INTERVAL description to reference SSE heartbeat events

docs/open-responses-server.md




qodo-code-review Bot commented Apr 9, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)



Action required

1. Hardcoded BACKEND_CONNECT_TIMEOUT 📘
Description
BACKEND_CONNECT_TIMEOUT is introduced as a hardcoded constant instead of being sourced from
environment-based configuration. This violates the requirement that configuration be managed via
environment variables loaded from .env, and makes timeout behavior non-configurable in different
deploy environments.
Code

src/open_responses_server/common/llm_client.py[R4-14]

```diff
+BACKEND_CONNECT_TIMEOUT = 30.0
+
+
+def get_backend_timeout() -> httpx.Timeout:
+    """Use a long read timeout for slow local inference, but keep connect short."""
+    return httpx.Timeout(
+        connect=BACKEND_CONNECT_TIMEOUT,
+        read=STREAM_TIMEOUT,
+        write=STREAM_TIMEOUT,
+        pool=BACKEND_CONNECT_TIMEOUT,
+    )
```
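
For context, a usage sketch of this profile; the stream_chat wrapper and the endpoint path below are illustrative, not the PR's code:

```python
import httpx

from open_responses_server.common.llm_client import get_backend_timeout


async def stream_chat(base_url: str, payload: dict) -> None:
    # Connecting (and acquiring a pooled connection) must finish within 30s,
    # while each read may take up to STREAM_TIMEOUT while the model is busy
    # processing the prompt before its first token.
    async with httpx.AsyncClient(timeout=get_backend_timeout()) as client:
        async with client.stream("POST", f"{base_url}/v1/chat/completions", json=payload) as response:
            async for line in response.aiter_lines():
                print(line)  # each SSE line arrives as the backend emits chunks
```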
Evidence
PR Compliance ID 2 requires configuration to be managed via environment variables (per
common/config.py). The PR adds a new timeout setting (BACKEND_CONNECT_TIMEOUT = 30.0) directly
in code and uses it in get_backend_timeout() rather than reading it from os.environ via
common/config.py.

CLAUDE.md
src/open_responses_server/common/llm_client.py[4-14]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`BACKEND_CONNECT_TIMEOUT` is hardcoded in `llm_client.py`, but compliance requires configuration to come from environment variables loaded via `.env` (centralized in `common/config.py`).
## Issue Context
Timeout behavior can vary across environments (local dev vs. production). This value should be defined in `common/config.py` similarly to `STREAM_TIMEOUT` and then imported/used from there.
## Fix Focus Areas
- src/open_responses_server/common/llm_client.py[4-14]
- src/open_responses_server/common/config.py[25-28]




Remediation recommended

2. Unbounded stream buffering 🐞
Description
_stream_with_keepalive consumes the upstream stream in a background task and writes all items into
an unbounded asyncio.Queue, so downstream backpressure (slow client/network) won’t slow upstream
reads. This can grow memory without bound and potentially OOM/DoS the server for large/fast
responses with slow readers.
Code

src/open_responses_server/api_controller.py[R34-45]

```diff
+    queue = asyncio.Queue()
+
+    async def producer():
+        try:
+            async for item in async_iter_factory():
+                await queue.put(item)
+        except Exception as exc:
+            await queue.put((_STREAM_ERROR, exc))
+        finally:
+            await queue.put(_STREAM_DONE)
+
+    task = asyncio.create_task(producer())
```
Evidence
The keepalive wrapper creates an unbounded queue and a producer task that iterates the entire
upstream stream and enqueues every item; the consumer drains the queue at the pace the ASGI server
pulls items for the client, so a slow reader can cause queue growth. This wrapper is used to feed
the /responses StreamingResponse, and the upstream generator can emit many SSE events per request.

src/open_responses_server/api_controller.py[23-69]
src/open_responses_server/api_controller.py[230-337]
src/open_responses_server/responses_service.py[313-358]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_stream_with_keepalive()` uses an unbounded `asyncio.Queue()` between an always-running producer task (reading upstream) and the downstream consumer (StreamingResponse). If the downstream client is slow or stalls, the producer can continue enqueuing items and memory can grow without bound.
## Issue Context
This architecture is needed to emit keepalives before the upstream context manager yields, but it should still enforce backpressure to avoid buffering arbitrarily large streams.
## Fix Focus Areas
- src/open_responses_server/api_controller.py[23-69]
- src/open_responses_server/api_controller.py[325-329]
## Suggested fix
1. Make the queue bounded (e.g., `asyncio.Queue(maxsize=1)` or a small configurable size) so `await queue.put(...)` blocks when the consumer is behind (see the sketch after this list).
2. Optionally add a debug log/metric when the queue is full or exceeds a threshold to detect slow-reader pressure in production.
3. Ensure cancellation paths still work when the producer is blocked on `queue.put()` (cancelling the task should unblock it).
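
For illustration, a minimal sketch of the bounded-queue variant this suggestion describes (the maxsize, names, and cancellation handling are assumptions, not the PR's code):

```python
import asyncio

_STREAM_DONE = object()
_STREAM_ERROR = object()


def start_bounded_producer(async_iter_factory, maxsize: int = 1):
    """Hypothetical bounded variant of the producer shown above."""
    queue = asyncio.Queue(maxsize=maxsize)

    async def producer():
        try:
            async for item in async_iter_factory():
                # With maxsize=1 this suspends while the consumer is behind,
                # so a slow client throttles upstream reads (backpressure)
                # instead of letting the queue grow without bound.
                await queue.put(item)
        except asyncio.CancelledError:
            raise  # task.cancel() unblocks a pending put(); let it propagate
        except Exception as exc:
            await queue.put((_STREAM_ERROR, exc))
        finally:
            try:
                # Best-effort DONE sentinel: put_nowait() cannot block even
                # while the task is being cancelled.
                queue.put_nowait(_STREAM_DONE)
            except asyncio.QueueFull:
                pass  # consumer will observe the finished task instead

    return queue, asyncio.create_task(producer())
```

A consumer pairing this with the keepalive loop sketched earlier would also need to watch the producer task for the rare case where the DONE sentinel could not be enqueued.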






Comment thread on src/open_responses_server/common/llm_client.py (Outdated)

Copilot AI left a comment

Pull request overview

Improves reliability of /responses streaming when upstream backends are slow to emit the first token by keeping SSE connections active and by separating backend connect vs. read/write timeout budgets.

Changes:

  • Add a keepalive wrapper that emits a real SSE response.heartbeat event while waiting for upstream stream activity (including before the upstream context manager yields).
  • Introduce BACKEND_CONNECT_TIMEOUT and a shared get_backend_timeout() profile (short connect/pool, longer read/write) and apply it across backend calls.
  • Update tests and docs/README to cover the new heartbeat behavior and timeout semantics.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| tests/test_llm_client.py | Adds assertions for the new backend timeout profile and client defaults. |
| tests/test_api_controller_endpoints.py | Adds/updates tests verifying SSE heartbeat emission and keepalive wrapper behavior. |
| src/open_responses_server/common/llm_client.py | Adds get_backend_timeout() and applies it to the shared httpx.AsyncClient. |
| src/open_responses_server/common/config.py | Introduces BACKEND_CONNECT_TIMEOUT env var and logs it at startup. |
| src/open_responses_server/chat_completions_service.py | Switches chat-completions requests to use get_backend_timeout(). |
| src/open_responses_server/api_controller.py | Replaces old heartbeat wrapper with _stream_with_keepalive and uses get_backend_timeout() for backend calls. |
| docs/open-responses-server.md | Documents new timeout variables and updated heartbeat semantics. |
| docs/events-and-tool-handling.md | Documents heartbeat as a real SSE event and points to _stream_with_keepalive(). |
| README.md | Updates configuration docs and adds client-side idle-timeout note for slow local models. |


Comment on lines +23 to +29
```python
async def _stream_with_keepalive(async_iter_factory, interval, keepalive_event=_HEARTBEAT_EVENT):
    """Yield keepalive comments while a producer waits on upstream stream activity.

    Uses asyncio.wait with timeout so the underlying task is never cancelled.
    This keeps SSE connections alive when the backend LLM is slow to respond.
    Starts the client-facing SSE stream immediately and runs the upstream work
    in a background task, so heartbeats continue even before the backend stream
    context manager yields control. The heartbeat is a real SSE data event so
    clients like Codex reset idle timers after parsing it.
```
Copilot AI Apr 17, 2026

The _stream_with_keepalive docstring says it yields “keepalive comments”, but the implementation emits a real SSE event (event:/data: heartbeat lines via _HEARTBEAT_EVENT). Please update the wording to avoid confusion for future maintainers.

```diff
-from open_responses_server.common.llm_client import LLMClient
-from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS, STREAM_TIMEOUT
+from open_responses_server.common.llm_client import LLMClient, get_backend_timeout
+from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS
```
Copilot AI Apr 17, 2026

OPENAI_BASE_URL_INTERNAL and OPENAI_API_KEY are imported here but not used anywhere in this module. Removing unused imports will reduce confusion and avoid potential lint failures if flake8/pylint are enabled.

Suggested change
```diff
-from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS
+from open_responses_server.common.config import logger, MAX_TOOL_CALL_ITERATIONS
```

```diff
 async def shutdown_llm_client():
     """Function to be called on application shutdown."""
-    await LLMClient.close_client()
\ No newline at end of file
+    await LLMClient.close_client() 
```
Copilot AI Apr 17, 2026

There is trailing whitespace at the end of this line. Please remove it to keep formatting clean and avoid noisy diffs in future edits.

Suggested change

```diff
-    await LLMClient.close_client() 
+    await LLMClient.close_client()
```
