Add call-level retry to Provider.complete() #151
Merged
Add an optional retry: RetryConfig parameter to complete(). When set, the wire call is retried in-call on transient provider errors per the config, so a node issuing several LLM calls in a loop does not re-run already-successful calls when a later call hits a transient failure. The request is built and validated once (pre-send errors are never retried) and the call stays terminal-only on the observability surface: exactly one LlmCompletionEvent or LlmFailedEvent fires per complete() call, with one call_id across attempts. The per-attempt span surface is deferred to a future sub-event; conformance.toml marks 0050 partial. Final piece of proposal 0050 (after failure isolation + the RetryConfig refactor). No spec-pin change.
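The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchProvider`, `TransientProviderError`, the synchronous call style, and the `on_retry(attempt, exc)` and `backoff(attempt)` signatures are all assumptions made for the example. Only the `RetryConfig` field names, the two event names, and `call_id` come from the description.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


class TransientProviderError(Exception):
    """Hypothetical stand-in for a retryable provider error."""


@dataclass
class RetryConfig:
    # Field names follow the PR text; types and defaults are assumptions.
    max_attempts: int = 3
    classifier: Callable[[Exception, object], bool] = (
        lambda exc, state: isinstance(exc, TransientProviderError)
    )
    backoff: Callable[[int], float] = lambda attempt: 0.0
    on_retry: Optional[Callable[[int, Exception], None]] = None


@dataclass
class SketchProvider:
    """Hypothetical provider; wire_call stands in for the HTTP request."""

    wire_call: Callable[[dict], str]
    events: list = field(default_factory=list)

    def complete(self, prompt: str, retry: Optional[RetryConfig] = None) -> str:
        if not prompt:
            # Pre-send validation error: raised before any attempt, never retried.
            raise ValueError("empty prompt")
        request = {"prompt": prompt}  # built and validated once
        call_id = uuid.uuid4().hex  # one call_id shared by every attempt
        try:
            result = self._with_retry(request, retry)
        except Exception as exc:
            # Terminal failure: exhaustion or a non-transient error.
            self.events.append(("LlmFailedEvent", call_id, repr(exc)))
            raise
        self.events.append(("LlmCompletionEvent", call_id, result))
        return result

    def _with_retry(self, request: dict, retry: Optional[RetryConfig]) -> str:
        if retry is None:
            return self.wire_call(request)
        attempt = 1
        while True:
            try:
                return self.wire_call(request)
            except Exception as exc:
                # State is None at the call boundary.
                if attempt >= retry.max_attempts or not retry.classifier(exc, None):
                    raise
                if retry.on_retry is not None:
                    retry.on_retry(attempt, exc)
                time.sleep(retry.backoff(attempt))
                attempt += 1
```

Because the event is appended only in `complete()`, outside the retry helper, intermediate attempts stay invisible on the event surface regardless of how many times the wire call runs.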
Pull request overview
Adds optional call-level retry support to the LLM provider complete() API so transient failures can be retried without re-running surrounding node work, while preserving a terminal-only observability event surface.
Changes:
- Add `retry: RetryConfig | None` to `Provider.complete()` and implement an in-call retry loop around the wire call for `OpenAIProvider`.
- Add unit tests covering transient-then-success, exhaustion behavior, non-transient skip, and `on_retry` callbacks.
- Document call-level retry usage and update conformance/changelog to mark proposal 0050 as `partial` (per-attempt spans deferred).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/unit/test_llm_provider.py | Adds call-level retry unit tests validating attempt counts and terminal-only events. |
| src/openarmature/llm/providers/openai.py | Implements complete(retry=...) via _do_complete_with_retry around the wire call. |
| src/openarmature/llm/provider.py | Extends the Provider protocol to include the optional retry parameter and documents it. |
| docs/concepts/llms.md | Documents call-level retry and contrasts it with node-level retry. |
| conformance.toml | Marks proposal 0050 as partial and documents deferred per-attempt span surface. |
| CHANGELOG.md | Records the new call-level retry feature in the Added section. |
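The contrast between node-level and call-level retry that the docs change describes can be illustrated with a small simulation. Everything here is hypothetical (`make_wire`, `call_with_retry`, `node`, and `TransientError` are names invented for the example, not part of openarmature); it only counts how many wire calls each strategy makes when the third of three chunked calls fails once.

```python
class TransientError(Exception):
    """Hypothetical transient provider failure."""


def make_wire(fail_on_call: int):
    """Return a fake wire call that fails once, on its Nth invocation."""
    counter = {"calls": 0}

    def wire(chunk: str) -> str:
        counter["calls"] += 1
        if counter["calls"] == fail_on_call:
            raise TransientError(chunk)
        return chunk.upper()

    return wire, counter


def call_with_retry(fn, arg, max_attempts: int):
    """Call fn(arg), retrying on TransientError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(arg)
        except TransientError:
            if attempt == max_attempts:
                raise


def node(chunks, wire, per_call_attempts: int = 1):
    """A node that issues one LLM call per chunk."""
    return [call_with_retry(wire, c, per_call_attempts) for c in chunks]


chunks = ["a", "b", "c"]

# Node-level retry: the whole node re-runs, repeating calls that had succeeded.
wire, node_level = make_wire(fail_on_call=3)
node_level_result = call_with_retry(lambda cs: node(cs, wire), chunks, max_attempts=2)

# Call-level retry: only the failing call is repeated.
wire, call_level = make_wire(fail_on_call=3)
call_level_result = node(chunks, wire, per_call_attempts=2)

print(node_level["calls"], call_level["calls"])  # prints "6 4"
```

Both strategies produce the same result list; the difference is the two already-successful calls that node-level retry repeats.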
Final PR in the proposal 0050 series (after #149 failure-isolation middleware and #150 the RetryConfig refactor). Adds call-level retry on the LLM provider.
What
`Provider.complete()` gains an optional `retry: RetryConfig | None` parameter. When supplied, the wire call is retried in-call on transient provider errors per the config (`classifier`, `backoff`, `on_retry`, `max_attempts`), so a node that issues several LLM calls in a loop (chunked processing, multi-step) does not re-run the already-successful calls when a later call hits a transient failure.
The implementation wraps only the wire call in a small retry helper, so the rest of `complete()` is unchanged:
- Exactly one `LlmCompletionEvent` (eventual success) or `LlmFailedEvent` (exhaustion or a non-transient error) fires per `complete()` call, with a single `call_id` shared across attempts. Intermediate transient retries emit no event.
- `None` is passed for state at the call boundary (the default classifier ignores it).
Deferred (conformance `partial`)
§7.1's per-attempt span surface (N per-attempt `openarmature.llm.complete` spans + the `openarmature.llm.attempt_index` attribute) is deferred to a future within-call sub-event (`LlmRetryAttemptEvent`): the Python LLM span is rendered from the terminal typed event, so per-attempt spans need a dedicated signal. `conformance.toml` marks proposal 0050 `partial` accordingly. Until then, retry visibility comes from the `on_retry` callback plus logs.
Scope
Tests
Four call-level-retry tests (transient-then-success, exhaustion emits one failed event, non-transient skips retry, `on_retry` fires per attempt), each asserting the terminal-only event surface and the attempt count. Full suite green (1265 passed), pyright and ruff clean, docs build clean.