
fix: improve reliability for slow backend streams #64

Open

ivanopcode wants to merge 4 commits into teabranch:main from ivanopcode:fix/slow-backend-streams

Conversation


ivanopcode commented Apr 9, 2026

Problem

When the backend takes a long time to process the prompt before emitting the
first streamed token, /responses could sit idle for minutes.

That exposed two independent failure modes:

  1. SSE clients could treat the stream as idle and disconnect before the first
    backend event arrived.

  2. ORS used a flat backend timeout, so slow prompt processing was charged
    against the same timeout budget that should only apply to connection setup.

This is easiest to reproduce with large local models or very long prompts.

Changes

This PR improves slow-stream handling in two ways:

  • emit a real SSE heartbeat event while ORS is still waiting for the upstream
    stream to yield its first item
  • use separate backend timeout profiles:
    • short connect timeout
    • longer read timeout for slow prompt processing

The heartbeat is emitted as a real SSE event rather than an SSE comment so
clients that drive their idle timer from parsed SSE events can stay connected.

Testing

Tested with:

  • uv run pytest tests/test_api_controller_endpoints.py tests/test_llm_client.py tests/test_server.py tests/test_chat_completions_service.py

Also verified manually against slow local llama.cpp-backed models where prompt
processing can take several minutes before the first chunk is streamed.

Notes

README and server docs were updated to clarify timeout behavior and document a
client-side idle-timeout workaround for slow local backends.

The heartbeat is sent as a parsed SSE event with a JSON payload, not only as an
SSE comment. This is intentional: a real event still produces ordinary stream
activity for proxies and load balancers, but it also resets idle timers in
clients that wait for parsed SSE events before considering the stream active.
This behavior was reproduced with Codex CLI, but the change is not specific to
it: it benefits any SSE client whose liveness is driven by parsed events
rather than ignored SSE comments.
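
For illustration, a heartbeat frame on the wire can look like the following; the response.heartbeat event name is documented in this PR, while the exact JSON payload fields shown here are an assumption:

```
event: response.heartbeat
data: {"type": "response.heartbeat"}
```

An SSE comment, by contrast, is a line starting with a colon (for example `: keepalive`), which compliant parsers discard before the application ever sees it.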

Sending both a comment and an event would add noise without any confirmed
compatibility benefit so far.

May address one class of disconnects discussed in #43.

@qodo-code-review

Review Summary by Qodo

Improve reliability for slow backend streams with heartbeats and separate timeouts

🐞 Bug fix ✨ Enhancement


Walkthroughs

Description
• Emit real SSE heartbeat events while backend stream setup is waiting
• Use separate connect and read timeouts for backend requests
• Refactor heartbeat mechanism to start client stream immediately
• Update documentation to clarify timeout behavior and client workarounds
Diagram

```mermaid
flowchart LR
  A["Backend Request"] -->|slow setup| B["_stream_with_keepalive"]
  B -->|emit heartbeats| C["Client SSE Stream"]
  B -->|upstream ready| D["Stream Data"]
  D --> C
  E["get_backend_timeout"] -->|short connect| F["30s"]
  E -->|long read| G["120s"]
  F --> A
  G --> A
```


File Changes

1. src/open_responses_server/api_controller.py ✨ Enhancement +147/-123

Refactor heartbeat mechanism to real SSE events with queue-based producer

• Replace _with_heartbeat() with _stream_with_keepalive(), which uses a queue-based producer pattern (sketched below)
• Emit real SSE data events (event: response.heartbeat) instead of SSE comments
• Start client-facing stream immediately while upstream work runs in background task
• Replace STREAM_TIMEOUT with get_backend_timeout() for separate timeout handling
• Refactor create_response() streaming logic to wrap LLM event stream with keepalive

src/open_responses_server/api_controller.py
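
For orientation, a minimal Python sketch of the queue-based keepalive pattern described above. The names _stream_with_keepalive, _HEARTBEAT_EVENT, and the sentinel objects come from this PR's diff; the wrapper body is a reconstruction from the described behavior (asyncio.wait with a timeout, heartbeats before the upstream yields), not the PR's exact code.

```python
import asyncio

_STREAM_DONE = object()
_STREAM_ERROR = object()
# Event name matches the PR's docs; the JSON payload fields are illustrative.
_HEARTBEAT_EVENT = (
    b"event: response.heartbeat\n"
    b'data: {"type": "response.heartbeat"}\n\n'
)


async def _stream_with_keepalive(async_iter_factory, interval, keepalive_event=_HEARTBEAT_EVENT):
    # Unbounded, as in the PR's diff; the review below suggests bounding it.
    queue = asyncio.Queue()

    async def producer():
        try:
            async for item in async_iter_factory():
                await queue.put(item)
        except Exception as exc:
            await queue.put((_STREAM_ERROR, exc))
        finally:
            await queue.put(_STREAM_DONE)

    task = asyncio.create_task(producer())
    get_task = None
    try:
        while True:
            if get_task is None:
                get_task = asyncio.ensure_future(queue.get())
            # asyncio.wait with a timeout leaves get_task pending rather than
            # cancelling it, so no queued item is lost across heartbeats.
            done, _ = await asyncio.wait({get_task}, timeout=interval)
            if not done:
                yield keepalive_event  # upstream silent: keep the SSE stream alive
                continue
            item = get_task.result()
            get_task = None
            if item is _STREAM_DONE:
                break
            if isinstance(item, tuple) and item and item[0] is _STREAM_ERROR:
                raise item[1]
            yield item
    finally:
        if get_task is not None:
            get_task.cancel()
        task.cancel()
```

The persistent queue.get() task is the detail that lets heartbeats fire without ever cancelling the pending read, so no upstream item is dropped.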


2. src/open_responses_server/chat_completions_service.py ✨ Enhancement +10/-6

Use separate timeout profiles for backend requests

• Import get_backend_timeout() function from llm_client module
• Replace all STREAM_TIMEOUT references with get_backend_timeout() calls
• Apply to both streaming and non-streaming request handlers

src/open_responses_server/chat_completions_service.py


3. src/open_responses_server/common/llm_client.py ✨ Enhancement +14/-2

Add separate connect and read timeout configuration

• Add BACKEND_CONNECT_TIMEOUT constant set to 30 seconds
• Implement get_backend_timeout() function returning httpx.Timeout with separate connect/read timeouts
• Update LLMClient initialization to use get_backend_timeout() instead of flat timeout

src/open_responses_server/common/llm_client.py


4. tests/test_api_controller_endpoints.py 🧪 Tests +96/-54

Update tests for new keepalive event mechanism

• Rename _with_heartbeat to _stream_with_keepalive and _HEARTBEAT to _HEARTBEAT_EVENT
• Add new test test_responses_streaming_sends_keepalive_before_backend_yields() verifying heartbeats during backend setup delay (sketched below)
• Refactor TestWithHeartbeat class to TestStreamWithKeepalive with updated test cases
• Update tests to verify real SSE data events instead of sentinel objects

tests/test_api_controller_endpoints.py
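
In that spirit, a hedged sketch of what such a test can look like (assumes pytest-asyncio and the wrapper sketched earlier; the real test body in this PR may differ):

```python
import asyncio

import pytest

from open_responses_server.api_controller import _stream_with_keepalive


@pytest.mark.asyncio
async def test_keepalive_emitted_before_upstream_yields():
    async def slow_upstream():
        await asyncio.sleep(0.05)  # simulate slow prompt processing
        yield b"data: real-event\n\n"

    chunks = [chunk async for chunk in _stream_with_keepalive(slow_upstream, interval=0.01)]

    # At least one heartbeat precedes the first real upstream chunk.
    assert any(b"response.heartbeat" in chunk for chunk in chunks[:-1])
    assert chunks[-1] == b"data: real-event\n\n"
```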


5. tests/test_llm_client.py 🧪 Tests +13/-0

Add tests for separate timeout configuration

• Add test for get_backend_timeout() verifying separate connect and read timeouts
• Import BACKEND_CONNECT_TIMEOUT and get_backend_timeout from llm_client
• Verify client initialization uses new timeout configuration

tests/test_llm_client.py


6. README.md 📝 Documentation +11/-3

Document timeout behavior and client-side workarounds

• Clarify STREAM_TIMEOUT as backend read timeout with separate 30s connect timeout
• Add client-side configuration example for Codex CLI with longer idle timeout
• Update HEARTBEAT_INTERVAL description to reference SSE heartbeat events

README.md


7. docs/events-and-tool-handling.md 📝 Documentation +13/-9

Document real SSE heartbeat event format and behavior

• Update heartbeat documentation to describe real SSE data events instead of comments
• Document event format: event: response.heartbeat with JSON payload
• Explain rationale for using data events to reset client idle timers
• Reference updated _stream_with_keepalive() implementation

docs/events-and-tool-handling.md


8. docs/open-responses-server.md 📝 Documentation +2/-2

Update configuration documentation for timeout behavior

• Clarify STREAM_TIMEOUT as backend read timeout with separate 30s connect timeout
• Update HEARTBEAT_INTERVAL description to reference SSE heartbeat events

docs/open-responses-server.md




qodo-code-review Bot commented Apr 9, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)



Action required

1. Hardcoded BACKEND_CONNECT_TIMEOUT 📘
Description
BACKEND_CONNECT_TIMEOUT is introduced as a hardcoded constant instead of being sourced from
environment-based configuration. This violates the requirement that configuration be managed via
environment variables loaded from .env, and makes timeout behavior non-configurable in different
deploy environments.
Code

src/open_responses_server/common/llm_client.py[R4-14]

```diff
+BACKEND_CONNECT_TIMEOUT = 30.0
+
+
+def get_backend_timeout() -> httpx.Timeout:
+    """Use a long read timeout for slow local inference, but keep connect short."""
+    return httpx.Timeout(
+        connect=BACKEND_CONNECT_TIMEOUT,
+        read=STREAM_TIMEOUT,
+        write=STREAM_TIMEOUT,
+        pool=BACKEND_CONNECT_TIMEOUT,
+    )
```
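
For context, a usage sketch of this profile; the stream_chat wrapper and the endpoint path below are illustrative, not the PR's code:

```python
import httpx

from open_responses_server.common.llm_client import get_backend_timeout


async def stream_chat(base_url: str, payload: dict) -> None:
    # Connecting (and acquiring a pooled connection) must finish within 30s,
    # while each read may take up to STREAM_TIMEOUT while the model is busy
    # processing the prompt before its first token.
    async with httpx.AsyncClient(timeout=get_backend_timeout()) as client:
        async with client.stream("POST", f"{base_url}/v1/chat/completions", json=payload) as response:
            async for line in response.aiter_lines():
                print(line)  # each SSE line arrives as the backend emits chunks
```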
Evidence
PR Compliance ID 2 requires configuration to be managed via environment variables (per
common/config.py). The PR adds a new timeout setting (BACKEND_CONNECT_TIMEOUT = 30.0) directly
in code and uses it in get_backend_timeout() rather than reading it from os.environ via
common/config.py.

CLAUDE.md
src/open_responses_server/common/llm_client.py[4-14]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`BACKEND_CONNECT_TIMEOUT` is hardcoded in `llm_client.py`, but compliance requires configuration to come from environment variables loaded via `.env` (centralized in `common/config.py`).
## Issue Context
Timeout behavior can vary across environments (local dev vs. production). This value should be defined in `common/config.py` similarly to `STREAM_TIMEOUT` and then imported/used from there.
## Fix Focus Areas
- src/open_responses_server/common/llm_client.py[4-14]
- src/open_responses_server/common/config.py[25-28]




Remediation recommended

2. Unbounded stream buffering 🐞
Description
_stream_with_keepalive consumes the upstream stream in a background task and writes all items into
an unbounded asyncio.Queue, so downstream backpressure (slow client/network) won’t slow upstream
reads. This can grow memory without bound and potentially OOM/DoS the server for large/fast
responses with slow readers.
Code

src/open_responses_server/api_controller.py[R34-45]

```diff
+    queue = asyncio.Queue()
+
+    async def producer():
+        try:
+            async for item in async_iter_factory():
+                await queue.put(item)
+        except Exception as exc:
+            await queue.put((_STREAM_ERROR, exc))
+        finally:
+            await queue.put(_STREAM_DONE)
+
+    task = asyncio.create_task(producer())
```
Evidence
The keepalive wrapper creates an unbounded queue and a producer task that iterates the entire
upstream stream and enqueues every item; the consumer drains the queue at the pace the ASGI server
pulls items for the client, so a slow reader can cause queue growth. This wrapper is used to feed
the /responses StreamingResponse, and the upstream generator can emit many SSE events per request.

src/open_responses_server/api_controller.py[23-69]
src/open_responses_server/api_controller.py[230-337]
src/open_responses_server/responses_service.py[313-358]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_stream_with_keepalive()` uses an unbounded `asyncio.Queue()` between an always-running producer task (reading upstream) and the downstream consumer (StreamingResponse). If the downstream client is slow or stalls, the producer can continue enqueuing items and memory can grow without bound.
## Issue Context
This architecture is needed to emit keepalives before the upstream context manager yields, but it should still enforce backpressure to avoid buffering arbitrarily large streams.
## Fix Focus Areas
- src/open_responses_server/api_controller.py[23-69]
- src/open_responses_server/api_controller.py[325-329]
## Suggested fix
1. Make the queue bounded (e.g., `asyncio.Queue(maxsize=1)` or a small configurable size) so `await queue.put(...)` blocks when the consumer is behind (see the sketch after this list).
2. Optionally add a debug log/metric when the queue is full or exceeds a threshold to detect slow-reader pressure in production.
3. Ensure cancellation paths still work when the producer is blocked on `queue.put()` (cancelling the task should unblock it).
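
For illustration, a minimal sketch of the bounded-queue variant this suggestion describes (the maxsize, names, and cancellation handling are assumptions, not the PR's code):

```python
import asyncio

_STREAM_DONE = object()
_STREAM_ERROR = object()


def start_bounded_producer(async_iter_factory, maxsize: int = 1):
    """Hypothetical bounded variant of the producer shown above."""
    queue = asyncio.Queue(maxsize=maxsize)

    async def producer():
        try:
            async for item in async_iter_factory():
                # With maxsize=1 this suspends while the consumer is behind,
                # so a slow client throttles upstream reads (backpressure)
                # instead of letting the queue grow without bound.
                await queue.put(item)
        except asyncio.CancelledError:
            raise  # task.cancel() unblocks a pending put(); let it propagate
        except Exception as exc:
            await queue.put((_STREAM_ERROR, exc))
        finally:
            try:
                # Best-effort DONE sentinel: put_nowait() cannot block even
                # while the task is being cancelled.
                queue.put_nowait(_STREAM_DONE)
            except asyncio.QueueFull:
                pass  # consumer will observe the finished task instead

    return queue, asyncio.create_task(producer())
```

A consumer pairing this with the keepalive loop sketched earlier would also need to watch the producer task for the rare case where the DONE sentinel could not be enqueued.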






Comment thread on src/open_responses_server/common/llm_client.py (Outdated)

Copilot AI left a comment

Pull request overview

Improves reliability of /responses streaming when upstream backends are slow to emit the first token by keeping SSE connections active and by separating backend connect vs. read/write timeout budgets.

Changes:

  • Add a keepalive wrapper that emits a real SSE response.heartbeat event while waiting for upstream stream activity (including before the upstream context manager yields).
  • Introduce BACKEND_CONNECT_TIMEOUT and a shared get_backend_timeout() profile (short connect/pool, longer read/write) and apply it across backend calls.
  • Update tests and docs/README to cover the new heartbeat behavior and timeout semantics.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| tests/test_llm_client.py | Adds assertions for the new backend timeout profile and client defaults. |
| tests/test_api_controller_endpoints.py | Adds/updates tests verifying SSE heartbeat emission and keepalive wrapper behavior. |
| src/open_responses_server/common/llm_client.py | Adds get_backend_timeout() and applies it to the shared httpx.AsyncClient. |
| src/open_responses_server/common/config.py | Introduces BACKEND_CONNECT_TIMEOUT env var and logs it at startup. |
| src/open_responses_server/chat_completions_service.py | Switches chat-completions requests to use get_backend_timeout(). |
| src/open_responses_server/api_controller.py | Replaces old heartbeat wrapper with _stream_with_keepalive and uses get_backend_timeout() for backend calls. |
| docs/open-responses-server.md | Documents new timeout variables and updated heartbeat semantics. |
| docs/events-and-tool-handling.md | Documents heartbeat as a real SSE event and points to _stream_with_keepalive(). |
| README.md | Updates configuration docs and adds client-side idle-timeout note for slow local models. |


Comment on lines +23 to +29
```python
async def _stream_with_keepalive(async_iter_factory, interval, keepalive_event=_HEARTBEAT_EVENT):
    """Yield keepalive comments while a producer waits on upstream stream activity.

    Uses asyncio.wait with timeout so the underlying task is never cancelled.
    This keeps SSE connections alive when the backend LLM is slow to respond.
    Starts the client-facing SSE stream immediately and runs the upstream work
    in a background task, so heartbeats continue even before the backend stream
    context manager yields control. The heartbeat is a real SSE data event so
    clients like Codex reset idle timers after parsing it.
```
Copilot AI Apr 17, 2026

The _stream_with_keepalive docstring says it yields “keepalive comments”, but the implementation emits a real SSE event (event:/data: heartbeat lines via _HEARTBEAT_EVENT). Please update the wording to avoid confusion for future maintainers.

```diff
-from open_responses_server.common.llm_client import LLMClient
-from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS, STREAM_TIMEOUT
+from open_responses_server.common.llm_client import LLMClient, get_backend_timeout
+from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS
```
Copilot AI Apr 17, 2026

OPENAI_BASE_URL_INTERNAL and OPENAI_API_KEY are imported here but not used anywhere in this module. Removing unused imports will reduce confusion and avoid potential lint failures if flake8/pylint are enabled.

Suggested change
```diff
-from open_responses_server.common.config import logger, OPENAI_BASE_URL_INTERNAL, OPENAI_API_KEY, MAX_TOOL_CALL_ITERATIONS
+from open_responses_server.common.config import logger, MAX_TOOL_CALL_ITERATIONS
```

```diff
 async def shutdown_llm_client():
     """Function to be called on application shutdown."""
-    await LLMClient.close_client()
\ No newline at end of file
+    await LLMClient.close_client() 
```
Copilot AI Apr 17, 2026

There is trailing whitespace at the end of this line. Please remove it to keep formatting clean and avoid noisy diffs in future edits.

Suggested change

```diff
-    await LLMClient.close_client() 
+    await LLMClient.close_client()
```
