Add RetryConfig record for RetryMiddleware (#150)

chris-colinsky · web-flow · commit 8dde25a939a5 · 2026-06-10T17:03:21.000-07:00
* Add RetryConfig record for RetryMiddleware

  Replace RetryMiddleware's individual constructor kwargs with a single frozen RetryConfig (max_attempts / classifier / backoff / on_retry), constructed as RetryMiddleware(RetryConfig(...)). This is the shared record the upcoming call-level complete(retry=...) parameter will take, so one config serves both the per-node and per-call retry layers.

  The config fields are Optional and resolve to the canonical defaults (default_classifier / exponential_jitter_backoff) once in the consumer, preserving the prior None-means-default behavior so fixture-driven construction stays robust.

  Breaking change to the RetryMiddleware constructor; all call sites across tests, examples, and docs are migrated.

  First of two refactor PRs splitting proposal 0050's remaining work; call-level retry follows.

* Guard RetryMiddleware against non-RetryConfig args

  From Copilot review of PR #150: RetryMiddleware now takes a positional config, so a non-RetryConfig argument (e.g. RetryMiddleware(3)) would construct and then fail with a cryptic AttributeError at retry time. Raise TypeError eagerly in __init__ with the correct-usage idiom, and add a test.
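For readers who want to see the pattern in isolation, here is a minimal, self-contained sketch of it: a frozen config record whose `None` fields are resolved to defaults by the consumer, plus the eager `TypeError` guard. It does not import openarmature. The `RetryConfig` and `RetryMiddleware` classes below are simplified stand-ins modeled on the diff, `default_classifier` here is an assumed placeholder that treats every exception as retry-eligible (the real one matches `category` against `TRANSIENT_CATEGORIES`), and `zero_backoff` is a hypothetical helper that replaces `exponential_jitter_backoff` so the demo does not sleep. The per-attempt metadata scoping and cancellation handling of the real middleware are omitted.

```python
import asyncio
from collections.abc import Awaitable, Callable, Mapping
from dataclasses import dataclass
from typing import Any, Optional

Classifier = Callable[[Exception, Any], bool]
BackoffStrategy = Callable[[int], float]
OnRetryCallback = Callable[[Exception, int], Awaitable[None]]
NextCall = Callable[[Any], Awaitable[Mapping[str, Any]]]


def default_classifier(exc: Exception, state: Any) -> bool:
    # Assumed placeholder: every Exception counts as retry-eligible.
    return True


def zero_backoff(attempt: int) -> float:
    # Hypothetical default for this sketch: no delay between attempts.
    return 0.0


@dataclass(frozen=True)
class RetryConfig:
    max_attempts: int = 3
    classifier: Optional[Classifier] = None
    backoff: Optional[BackoffStrategy] = None
    on_retry: Optional[OnRetryCallback] = None

    def __post_init__(self) -> None:
        # Validate at construction so a bad config never reaches a consumer.
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")


class RetryMiddleware:
    def __init__(self, config: Optional[RetryConfig] = None) -> None:
        if config is None:
            config = RetryConfig()
        # Fail eagerly on e.g. RetryMiddleware(3) rather than at retry time.
        if not isinstance(config, RetryConfig):
            raise TypeError(
                f"RetryMiddleware expects a RetryConfig (or None); "
                f"got {type(config).__name__}."
            )
        self.config = config

    async def __call__(self, state: Any, next_: NextCall) -> Mapping[str, Any]:
        # None fields select the defaults; resolve once, before the loop.
        classifier = self.config.classifier or default_classifier
        backoff = self.config.backoff or zero_backoff
        attempt = 0
        while True:
            try:
                return await next_(state)
            except Exception as exc:
                # Re-raise once attempts are exhausted or the error is not retryable.
                if attempt + 1 >= self.config.max_attempts or not classifier(exc, state):
                    raise
                if self.config.on_retry is not None:
                    await self.config.on_retry(exc, attempt)
                await asyncio.sleep(backoff(attempt))
                attempt += 1


async def demo() -> tuple[Mapping[str, Any], int]:
    calls = 0

    async def flaky(state: Any) -> Mapping[str, Any]:
        # Fails on the first two calls, succeeds on the third.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise RuntimeError("transient")
        return {"ok": True}

    retry = RetryMiddleware(RetryConfig(max_attempts=3))
    result = await retry({}, flaky)
    return result, calls


result, calls = asyncio.run(demo())
```

Because the record is frozen and carries its own validation, the same instance could in principle be handed to more than one consumer, which is the stated motivation for sharing it with the planned call-level `complete(retry=...)` parameter.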
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,6 +10,10 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 - **`FailureIsolationMiddleware`** (proposal 0050, pipeline-utilities §6.3). A third bundled middleware primitive alongside `RetryMiddleware` and `TimingMiddleware`. It catches exceptions escaping the wrapped node's inner chain and returns a configured degraded partial update, so a non-critical node can fail without aborting the whole invocation. Configuration: `degraded_update` (a static mapping or a `state -> partial_update` callable, resolved at catch time), `event_name` (required, no default, since a generic name makes downstream telemetry strictly worse), an optional `predicate` (`Exception -> bool`; only matching exceptions are caught, others propagate), and an optional async `on_caught` hook. It catches `Exception`; `BaseException` (cancellation) propagates, matching `RetryMiddleware`. On a catch it dispatches a new framework-emitted `FailureIsolatedEvent` (a distinct observer-event variant carrying `event_name`, the wrapped node's lineage identity, `pre_state` / `post_state`, and a `CaughtException` record of category plus message) onto the observer delivery queue; the bundled OTel and Langfuse observers render it as a marker span / observation. Compose it OUTER of `RetryMiddleware` for the "retry transients, degrade gracefully on exhaustion" pattern. Additive: existing pipelines see no behavior change, and the spec pin is unchanged (0050 is already within the v0.53.0 pin).
 
+### Changed
+
+- **`RetryMiddleware` now takes a `RetryConfig` record** instead of individual constructor kwargs (proposal 0050 prep). The four retry settings (`max_attempts` / `classifier` / `backoff` / `on_retry`, each optional) move onto a frozen `RetryConfig`; construct as `RetryMiddleware(RetryConfig(max_attempts=...))`, while bare `RetryMiddleware()` still applies the defaults. This is a breaking change to the `RetryMiddleware` constructor. The record is the same shape the upcoming call-level `complete(retry=...)` parameter will accept, so one retry config serves both the per-node and per-call layers. `None` fields resolve to the canonical defaults (`default_classifier` / `exponential_jitter_backoff`) at use, preserving the prior behavior.
+
 ## [0.13.0] — 2026-06-09
 
 LLM provider hardening release. The pinned spec advances from v0.46.0 to v0.53.0, absorbing four implemented proposals. Proposal 0049 introduces the first spec-normatively-typed observer event variant, `LlmCompletionEvent`, dispatched on every successful LLM provider call; proposal 0058 adds the failure-side counterpart, `LlmFailedEvent`; proposal 0057 extends the completion variant with eight request-side fields. The bundled `OpenAIProvider` retires its sentinel-namespace `NodeEvent` emission for LLM calls entirely, and the OTel and Langfuse observers now drive their LLM span / Generation from the typed events with back-dated timestamps so durations reflect the adapter boundary. Proposal 0047 closes implicit prefix-cache wire-byte stability: `Response.usage` gains cache-stat fields, the OTel observer emits `openarmature.llm.cache_read` attributes, and the OpenAI Chat Completions request body is byte-stable across equivalent inputs regardless of dict insertion order. Custom observers that filtered LLM calls by sentinel namespace MUST migrate to `isinstance` discrimination; `LLM_NAMESPACE` and `LlmEventPayload` remain as a documented compatibility surface.
diff --git a/docs/concepts/middleware.md b/docs/concepts/middleware.md
@@ -126,21 +126,23 @@ hand a transformed state down the chain, pass a new state instance to
 ## Built-in: RetryMiddleware
 
 ```python
-from openarmature.graph import RetryMiddleware, exponential_jitter_backoff
+from openarmature.graph import RetryConfig, RetryMiddleware, exponential_jitter_backoff
 
 
 async def on_retry(exc: Exception, attempt: int) -> None:
     log.warning("retrying after %r (attempt %d)", exc, attempt)
 
 
 retry = RetryMiddleware(
-    max_attempts=3,
-    backoff=exponential_jitter_backoff,
-    on_retry=on_retry,
+    RetryConfig(
+        max_attempts=3,
+        backoff=exponential_jitter_backoff,
+        on_retry=on_retry,
+    )
 )
 ```
 
-Four plug points, all optional:
+Configured with a `RetryConfig`; four fields, all optional:
 
 - **`max_attempts`** is the total attempt count including the first
   call. `1` disables retry. Default `3`.
@@ -277,7 +279,7 @@ builder.add_node(
             degraded_update={"summary": ""},
             event_name="summary_degraded",
         ),
-        RetryMiddleware(max_attempts=3),
+        RetryMiddleware(RetryConfig(max_attempts=3)),
     ],
 )
 ```
diff --git a/examples/fan-out-with-retry/main.py b/examples/fan-out-with-retry/main.py
@@ -84,6 +84,7 @@
     append,
 )
 from openarmature.graph.middleware import (
+    RetryConfig,
     RetryMiddleware,
     TimingMiddleware,
     TimingRecord,
@@ -261,10 +262,12 @@ def build_graph(error_policy: str = "fail_fast") -> CompiledGraph[BatchState]:
     headline_subgraph = build_headline_subgraph()
 
     retry = RetryMiddleware(
-        max_attempts=3,
-        # Short fixed delay so the demo isn't slow. A production app would
-        # use exponential_jitter_backoff (the default).
-        backoff=deterministic_backoff(0.2),
+        RetryConfig(
+            max_attempts=3,
+            # Short fixed delay so the demo isn't slow. A production app would
+            # use exponential_jitter_backoff (the default).
+            backoff=deterministic_backoff(0.2),
+        )
     )
     timing = TimingMiddleware(
         node_name="headline_run",
diff --git a/examples/parallel-branches/main.py b/examples/parallel-branches/main.py
@@ -76,6 +76,7 @@
     append,
 )
 from openarmature.graph.middleware import (
+    RetryConfig,
     RetryMiddleware,
     deterministic_backoff,
 )
@@ -268,8 +269,10 @@ def build_graph() -> CompiledGraph[ArticleState]:
     # the same policy on a longer summarize call (where a retry doubles
     # cost) or on a topic-extract that has different transient profile.
     sentiment_retry = RetryMiddleware(
-        max_attempts=3,
-        backoff=deterministic_backoff(0.2),
+        RetryConfig(
+            max_attempts=3,
+            backoff=deterministic_backoff(0.2),
+        )
     )
 
     return (
diff --git a/src/openarmature/graph/__init__.py b/src/openarmature/graph/__init__.py
@@ -51,6 +51,7 @@
     FailureIsolationMiddleware,
     Middleware,
     NextCall,
+    RetryConfig,
     RetryMiddleware,
     TimingMiddleware,
     TimingRecord,
@@ -115,6 +116,7 @@
     "Reducer",
     "ReducerError",
     "RemoveHandle",
+    "RetryConfig",
     "RetryMiddleware",
     "RoutingError",
     "RuntimeGraphError",
diff --git a/src/openarmature/graph/middleware/__init__.py b/src/openarmature/graph/middleware/__init__.py
@@ -24,6 +24,7 @@
     BackoffStrategy,
     Classifier,
     OnRetryCallback,
+    RetryConfig,
     RetryMiddleware,
     default_classifier,
     deterministic_backoff,
@@ -41,6 +42,7 @@
     "NextCall",
     "OnCompleteCallback",
     "OnRetryCallback",
+    "RetryConfig",
     "RetryMiddleware",
     "TRANSIENT_CATEGORIES",
     "TimingMiddleware",
diff --git a/src/openarmature/graph/middleware/retry.py b/src/openarmature/graph/middleware/retry.py
@@ -20,6 +20,7 @@
 import asyncio
 import random
 from collections.abc import Awaitable, Callable, Mapping
+from dataclasses import dataclass
 from typing import Any
 
 from openarmature.llm.errors import TRANSIENT_CATEGORIES
@@ -100,39 +101,63 @@ def fn(_attempt: int) -> float:
 OnRetryCallback = Callable[[Exception, int], Awaitable[None]]
 
 
-class RetryMiddleware:
-    """Canonical retry middleware.
-
-    Configuration:
+@dataclass(frozen=True)
+class RetryConfig:
+    """Canonical retry configuration record consumed by
+    :class:`RetryMiddleware`.
 
     - ``max_attempts``: total attempts including the first call. ``1``
       disables retry. Default ``3``.
-    - ``classifier``: predicate ``(exception, state) -> bool``. Default
-      :func:`default_classifier` (matches ``category`` against
+    - ``classifier``: predicate ``(exception, state) -> bool`` deciding
+      whether a failure is retry-eligible. ``None`` (the default)
+      selects :func:`default_classifier` (matches ``category`` against
       ``TRANSIENT_CATEGORIES``).
-    - ``backoff``: callable ``(attempt_index) -> seconds``. Default
-      :func:`exponential_jitter_backoff` (base 1s, cap 30s, full jitter).
+    - ``backoff``: callable ``(attempt_index) -> seconds``. ``None``
+      (the default) selects :func:`exponential_jitter_backoff` (base
+      1s, cap 30s, full jitter).
     - ``on_retry``: optional async callback ``(exception, attempt_index)
-      -> None``. Fires before each sleep.
+      -> None`` fired before each backoff sleep.
     """
 
-    def __init__(
-        self,
-        *,
-        max_attempts: int = 3,
-        classifier: Classifier | None = None,
-        backoff: BackoffStrategy | None = None,
-        on_retry: OnRetryCallback | None = None,
-    ) -> None:
-        if max_attempts < 1:
+    max_attempts: int = 3
+    classifier: Classifier | None = None
+    backoff: BackoffStrategy | None = None
+    on_retry: OnRetryCallback | None = None
+
+    def __post_init__(self) -> None:
+        if self.max_attempts < 1:
             raise ValueError("max_attempts must be >= 1")
-        self.max_attempts = max_attempts
-        self.classifier: Classifier = classifier or default_classifier
-        self.backoff: BackoffStrategy = backoff or exponential_jitter_backoff
-        self.on_retry: OnRetryCallback | None = on_retry
+
+
+class RetryMiddleware:
+    """Canonical retry middleware.
+
+    Configured with a :class:`RetryConfig` (or the default
+    ``RetryConfig()`` when omitted). Construct as
+    ``RetryMiddleware(RetryConfig(max_attempts=...))``.
+    """
+
+    def __init__(self, config: RetryConfig | None = None) -> None:
+        if config is None:
+            config = RetryConfig()
+        # Defensive guard for untyped callers: the static type already
+        # rules a non-RetryConfig out (pyright flags this as redundant),
+        # but an eager TypeError beats a cryptic AttributeError when a
+        # mistyped value (e.g. ``RetryMiddleware(3)``) reaches ``.config``.
+        if not isinstance(config, RetryConfig):  # pyright: ignore[reportUnnecessaryIsInstance]
+            raise TypeError(
+                f"RetryMiddleware expects a RetryConfig (or None); got "
+                f"{type(config).__name__}. Construct as "
+                f"RetryMiddleware(RetryConfig(max_attempts=...))."
+            )
+        self.config = config
 
     async def __call__(self, state: Any, next_: NextCall) -> Mapping[str, Any]:
         attempt = 0
+        # ``None`` config fields select the canonical defaults; resolve
+        # once here so the loop works against concrete callables.
+        classifier = self.config.classifier or default_classifier
+        backoff = self.config.backoff or exponential_jitter_backoff
         # Spec observability §3.4 per-attempt scoping: each retry
         # attempt sees only the metadata in scope at retry-loop entry
         # ("pre-attempt baseline") plus that attempt's own writes;
@@ -176,11 +201,11 @@ async def __call__(self, state: Any, next_: NextCall) -> Mapping[str, Any]:
                     # metadata for the error span) sees the baseline,
                     # not the failed attempt's transient state.
                     _reset_invocation_metadata(metadata_token)
-                    if attempt + 1 >= self.max_attempts or not self.classifier(exc, state):
+                    if attempt + 1 >= self.config.max_attempts or not classifier(exc, state):
                         raise
-                    if self.on_retry is not None:
-                        await self.on_retry(exc, attempt)
-                    await asyncio.sleep(self.backoff(attempt))
+                    if self.config.on_retry is not None:
+                        await self.config.on_retry(exc, attempt)
+                    await asyncio.sleep(backoff(attempt))
                     attempt += 1
                 except BaseException:
                     # Cancellation path. `CancelledError` (or other
@@ -202,6 +227,7 @@ async def __call__(self, state: Any, next_: NextCall) -> Mapping[str, Any]:
     "BackoffStrategy",
     "Classifier",
     "OnRetryCallback",
+    "RetryConfig",
     "RetryMiddleware",
     "TRANSIENT_CATEGORIES",
     "default_classifier",
diff --git a/tests/conformance/test_observability.py b/tests/conformance/test_observability.py
@@ -678,7 +678,7 @@ async def _run_fixture_007_case(case: Mapping[str, Any]) -> None:
     from opentelemetry.trace import StatusCode
 
     from openarmature.graph import RuntimeGraphError
-    from openarmature.graph.middleware import RetryMiddleware
+    from openarmature.graph.middleware import RetryConfig, RetryMiddleware
     from openarmature.graph.middleware.retry import deterministic_backoff
 
     observer, exporter = _build_observer()
@@ -725,9 +725,11 @@ def _classifier(exc: Exception, _state: Any, _transient: frozenset[str] = transi
             classifier_fn = None
         node_middleware.setdefault(flaky_node_name, []).append(
             RetryMiddleware(
-                max_attempts=int(mw_spec.get("max_attempts", 3)),
-                backoff=backoff,
-                classifier=classifier_fn,
+                RetryConfig(
+                    max_attempts=int(mw_spec.get("max_attempts", 3)),
+                    backoff=backoff,
+                    classifier=classifier_fn,
+                )
             )
         )
 
diff --git a/tests/conformance/test_pipeline_utilities.py b/tests/conformance/test_pipeline_utilities.py
@@ -30,6 +30,7 @@
 from openarmature.graph.middleware import (
     Middleware,
     OnCompleteCallback,
+    RetryConfig,
     RetryMiddleware,
     TimingMiddleware,
     TimingRecord,
@@ -234,9 +235,11 @@ def _build_middleware(
         classifier_cfg = config.get("classifier")
         classifier = _build_classifier(classifier_cfg) if classifier_cfg is not None else None
         return RetryMiddleware(
-            max_attempts=int(config.get("max_attempts", 3)),
-            backoff=backoff,
-            classifier=classifier,
+            RetryConfig(
+                max_attempts=int(config.get("max_attempts", 3)),
+                backoff=backoff,
+                classifier=classifier,
+            )
         )
     if mw_type == "timing":
         on_complete_cfg = cast("dict[str, Any]", config.get("on_complete") or {})
diff --git a/tests/unit/test_failure_isolation_middleware.py b/tests/unit/test_failure_isolation_middleware.py
@@ -22,6 +22,7 @@
     FailureIsolationMiddleware,
     GraphBuilder,
     ObserverEvent,
+    RetryConfig,
     RetryMiddleware,
     State,
     append,
@@ -290,7 +291,7 @@ async def _flaky(_s: _DocState) -> Mapping[str, Any]:
                     degraded_update={"note": "gave_up"},
                     event_name="flaky_failed",
                 ),
-                RetryMiddleware(max_attempts=3, backoff=deterministic_backoff(0.0)),
+                RetryMiddleware(RetryConfig(max_attempts=3, backoff=deterministic_backoff(0.0))),
             ],
         )
         .add_edge("flaky", END)
diff --git a/tests/unit/test_fan_out.py b/tests/unit/test_fan_out.py
@@ -37,6 +37,7 @@
     FanOutFieldNotList,
     GraphBuilder,
     NodeException,
+    RetryConfig,
     RetryMiddleware,
     State,
     append,
@@ -578,7 +579,7 @@ async def maybe_fail(state: WorkerState) -> Mapping[str, Any]:
     inner_builder.add_edge("compute", END)
     inner = inner_builder.compile()
 
-    retry = RetryMiddleware(max_attempts=3, backoff=deterministic_backoff(0))
+    retry = RetryMiddleware(RetryConfig(max_attempts=3, backoff=deterministic_backoff(0)))
 
     builder: GraphBuilder[InstanceMwParentState] = GraphBuilder(InstanceMwParentState)
     builder.set_entry("process")
diff --git a/tests/unit/test_middleware.py b/tests/unit/test_middleware.py
diff --git a/tests/unit/test_observability_metadata.py b/tests/unit/test_observability_metadata.py
diff --git a/tests/unit/test_observability_otel.py b/tests/unit/test_observability_otel.py
