Skip to content

fix(llm): propagate per-scope LLM timeout + retry policy to the provider (#2452)#2470

Merged
nicoloboschi merged 1 commit into
mainfrom
fix/llm-timeout-propagation-2452
Jun 30, 2026
Merged

fix(llm): propagate per-scope LLM timeout + retry policy to the provider (#2452)#2470
nicoloboschi merged 1 commit into
mainfrom
fix/llm-timeout-propagation-2452

Conversation

@nicoloboschi

@nicoloboschi nicoloboschi commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes #2452. Several per-operation LLM request settings were resolved into HindsightConfig but never reached the provider that uses them, so configuring them was a silent no-op. This PR threads them through for all operation scopes (default / retain / reflect / consolidation).

What was dead

  • *_llm_timeout (retain/reflect/consolidation) and the global llm_timeout never reached the provider impl — it fell back to HINDSIGHT_API_LLM_TIMEOUT/120s, so HINDSIGHT_API_RETAIN_LLM_TIMEOUT=300 did nothing (LiteLLM call exceeded timeout=120.0s).
  • reflect_llm_max_retries / initial_backoff / max_backoff and consolidation_llm_initial_backoff / max_backoff were never consumed. Reflect and consolidation used the hardcoded call()/call_with_tools() defaults (10/5), ignoring the documented "falls back to llm_max_retries" contract.

Fix

MemoryEngine resolves each operation's effective request defaults (per-op override else global) into a small _LLMCallDefaults bundle and carries them on the LLMProvider:

  • timeout → threaded config → LLMProvider → create_llm_provider → provider impl for the providers that honour a configurable request timeout (LiteLLM, LiteLLM Router, OpenAI-compatible, Nous). None preserves each provider's own default, so Anthropic/Gemini keep their bespoke timeouts and the no-config path is byte-identical.
  • max_retries / initial_backoff / max_backoff → become LLMProvider instance defaults that call()/call_with_tools() use when the per-call argument is omitted. Explicit per-call args still win (retain's resolved values; reflect's deliberately-fast structured-extraction call). Providers built without config (from_env, tests) keep the 10/5 method fallback.

The four operation scopes and multi-LLM chain members all share their operation's resolved values.

Behaviour note

Reflect/consolidation calls that previously fell through to the hardcoded call() default of 10 retries now follow the documented llm_max_retries (default 3) — i.e. the config the docs always claimed governed them. Per-op overrides (HINDSIGHT_API_REFLECT_LLM_MAX_RETRIES, …) now take effect.

max_concurrent — left as-is + doc clarified

The per-op *_LLM_MAX_CONCURRENT caps are already wired (process-global composed semaphores, added in a prior PR). They're intentionally not moved onto the per-op config path — they're server-level, env-only, fixed at startup. The docs are updated to call out that distinction vs. the timeout/retry/backoff knobs.

Drive-by

Fixes a pre-existing breakage in test_llm_router_provider's __new__-based helper (missing _default_headers after #2466) so the suite is green on this branch.

Tests

tests/test_llm_timeout_propagation.py:

  • timeout reaches _provider_impl.timeout (LiteLLM + OpenAI-compatible); NoneDEFAULT_LLM_TIMEOUT regression guard;
  • call() retry policy: falls back to the provider's configured default, an explicit per-call arg overrides it, and an unconfigured provider keeps the 10 method fallback;
  • MemoryEngine resolves each per-op timeout + retry/backoff override and falls back to global / default.

Verified locally: full focused LLM/provider/per-op/retry suite green (115 passed); ruff/ty clean.

@nicoloboschi nicoloboschi force-pushed the fix/llm-timeout-propagation-2452 branch from cde6e1a to e12b659 Compare June 30, 2026 09:19
@nicoloboschi nicoloboschi changed the title fix(retain): propagate per-scope LLM timeout to the provider impl (#2452) fix(llm): propagate per-scope LLM timeout + retry policy to the provider (#2452) Jun 30, 2026
…der (#2452)

The per-operation LLM request settings were resolved into HindsightConfig but
never reached the provider that uses them, so configuring them was a silent
no-op:

- *_llm_timeout (retain/reflect/consolidation) and the global llm_timeout never
  reached the provider impl; it fell back to HINDSIGHT_API_LLM_TIMEOUT/120s, so
  HINDSIGHT_API_RETAIN_LLM_TIMEOUT=300 did nothing ("LiteLLM call exceeded
  timeout=120.0s").
- reflect_llm_max_retries/initial_backoff/max_backoff and
  consolidation_llm_initial_backoff/max_backoff were never consumed; reflect and
  consolidation used the hardcoded call()/call_with_tools() defaults (10/5),
  ignoring the documented "falls back to llm_max_retries" contract.

Fix: resolve each operation's effective request defaults (per-op override else
global) in MemoryEngine and carry them on the LLMProvider:

- timeout is threaded config -> LLMProvider -> create_llm_provider -> provider
  impl for the providers that honour a configurable request timeout (LiteLLM,
  LiteLLM Router, OpenAI-compatible, Nous). None preserves each provider's own
  default, so Anthropic/Gemini keep their bespoke timeouts and the no-config
  path is byte-identical.
- max_retries/initial_backoff/max_backoff become LLMProvider instance defaults
  that call()/call_with_tools() use when the per-call arg is omitted. Explicit
  per-call args (retain's resolved values, reflect's fast structured-extraction
  path) still win; providers built without config (from_env, tests) keep the
  10/5 method fallback.

The four operation scopes (default/retain/reflect/consolidation) and multi-LLM
chain members all share their operation's resolved values via a small
_LLMCallDefaults bundle.

max_concurrent is intentionally left as-is (process-global semaphores read from
env at startup, server-level only); the docs are clarified to call out that
distinction.

Also fixes a pre-existing breakage in test_llm_router_provider's __new__-based
helper (missing _default_headers after #2466) so the suite is green.

Tests: tests/test_llm_timeout_propagation.py covers provider-impl timeout
threading, the call() retry-policy fallback/override, and per-op
resolution/fallback in MemoryEngine.
@nicoloboschi nicoloboschi force-pushed the fix/llm-timeout-propagation-2452 branch from e12b659 to e33d93a Compare June 30, 2026 09:31
@koriyoshi2041

Copy link
Copy Markdown
Contributor

I did a focused local pass on this branch:

cd hindsight-api-slim
env -u ALL_PROXY -u all_proxy -u HTTPS_PROXY -u https_proxy -u HTTP_PROXY -u http_proxy \
  uv run pytest tests/test_llm_timeout_propagation.py tests/test_multi_llm_config.py -q
uv run ruff check hindsight_api/engine/llm_wrapper.py hindsight_api/engine/memory_engine.py tests/test_llm_timeout_propagation.py tests/test_multi_llm_config.py

Result: 39 passed for the focused tests; ruff clean.

The current test-pip-slim CI failure looks timing-related rather than a test/assertion failure. The log shows the server starts at 09:32:17, Vertex AI verification takes 46.726s, migrations complete and the DB pool is created at 09:33:16.273, and the 60s health loop exits at 09:33:16.697. So the app was still finishing startup right at the health-check deadline.

A rerun may pass if Vertex AI is faster. If it keeps flaking, the smallest CI-only fix would be to give this smoke startup loop more headroom or skip startup LLM verification in test-pip-slim if the test only needs retain/recall smoke coverage.

@nicoloboschi nicoloboschi merged commit 40fe7aa into main Jun 30, 2026
94 of 101 checks passed
oliverv pushed a commit to oliverv/collabmind-hindsight that referenced this pull request Jul 1, 2026
…der (vectorize-io#2452) (vectorize-io#2470)

The per-operation LLM request settings were resolved into HindsightConfig but
never reached the provider that uses them, so configuring them was a silent
no-op:

- *_llm_timeout (retain/reflect/consolidation) and the global llm_timeout never
  reached the provider impl; it fell back to HINDSIGHT_API_LLM_TIMEOUT/120s, so
  HINDSIGHT_API_RETAIN_LLM_TIMEOUT=300 did nothing ("LiteLLM call exceeded
  timeout=120.0s").
- reflect_llm_max_retries/initial_backoff/max_backoff and
  consolidation_llm_initial_backoff/max_backoff were never consumed; reflect and
  consolidation used the hardcoded call()/call_with_tools() defaults (10/5),
  ignoring the documented "falls back to llm_max_retries" contract.

Fix: resolve each operation's effective request defaults (per-op override else
global) in MemoryEngine and carry them on the LLMProvider:

- timeout is threaded config -> LLMProvider -> create_llm_provider -> provider
  impl for the providers that honour a configurable request timeout (LiteLLM,
  LiteLLM Router, OpenAI-compatible, Nous). None preserves each provider's own
  default, so Anthropic/Gemini keep their bespoke timeouts and the no-config
  path is byte-identical.
- max_retries/initial_backoff/max_backoff become LLMProvider instance defaults
  that call()/call_with_tools() use when the per-call arg is omitted. Explicit
  per-call args (retain's resolved values, reflect's fast structured-extraction
  path) still win; providers built without config (from_env, tests) keep the
  10/5 method fallback.

The four operation scopes (default/retain/reflect/consolidation) and multi-LLM
chain members all share their operation's resolved values via a small
_LLMCallDefaults bundle.

max_concurrent is intentionally left as-is (process-global semaphores read from
env at startup, server-level only); the docs are clarified to call out that
distinction.

Also fixes a pre-existing breakage in test_llm_router_provider's __new__-based
helper (missing _default_headers after vectorize-io#2466) so the suite is green.

Tests: tests/test_llm_timeout_propagation.py covers provider-impl timeout
threading, the call() retry-policy fallback/override, and per-op
resolution/fallback in MemoryEngine.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(retain): per-scope retain timeout does not propagate to the underlying provider implementation

2 participants