Commit 795c549

Add FailureIsolationMiddleware (proposal 0050) (#149)
* Add FailureIsolationMiddleware (proposal 0050)

  A third bundled middleware primitive alongside RetryMiddleware and TimingMiddleware. It catches exceptions escaping a wrapped node and returns a configured degraded partial update, so a non-critical node can fail without aborting the whole invocation.

  On a catch it emits a distinct FailureIsolatedEvent (with a CaughtException record) that the bundled OTel and Langfuse observers render, keeping the degradation visible in traces. A raising on_caught hook is isolated so a buggy telemetry hook cannot defeat the recovery.

  First of two PRs for proposal 0050; call-level retry follows. No spec-pin change (0050 is already within the v0.53.0 pin).

* Clarify failure-isolation docs from PR review

  Two doc-only changes from Copilot review of PR #149, no behavior change:

  - The module docstring now states that degraded_update is resolved once at catch time (which populates the event's post_state), with the numbered steps covering only the observable order after that.
  - A comment at the FailureIsolatedEvent dispatch documents that attempt_index is the intentional node-level baseline (not a per-attempt index) and that span parenting is unaffected.
1 parent b948372 commit 795c549

11 files changed

Lines changed: 853 additions & 6 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The

## [Unreleased]

### Added

- **`FailureIsolationMiddleware`** (proposal 0050, pipeline-utilities §6.3). A third bundled middleware primitive alongside `RetryMiddleware` and `TimingMiddleware`. It catches exceptions escaping the wrapped node's inner chain and returns a configured degraded partial update, so a non-critical node can fail without aborting the whole invocation. Configuration: `degraded_update` (a static mapping or a `state -> partial_update` callable, resolved at catch time), `event_name` (required, no default, since a generic name makes downstream telemetry strictly worse), an optional `predicate` (`Exception -> bool`; only matching exceptions are caught, others propagate), and an optional async `on_caught` hook. It catches `Exception`; `BaseException` (cancellation) propagates, matching `RetryMiddleware`. On a catch it dispatches a new framework-emitted `FailureIsolatedEvent` (a distinct observer-event variant carrying `event_name`, the wrapped node's lineage identity, `pre_state` / `post_state`, and a `CaughtException` record of category plus message) onto the observer delivery queue; the bundled OTel and Langfuse observers render it as a marker span / observation. Compose it OUTER of `RetryMiddleware` for the "retry transients, degrade gracefully on exhaustion" pattern. Additive: existing pipelines see no behavior change, and the spec pin is unchanged (0050 is already within the v0.53.0 pin).

## [0.13.0] — 2026-06-09

LLM provider hardening release. The pinned spec advances from v0.46.0 to v0.53.0, absorbing four implemented proposals. Proposal 0049 introduces the first spec-normatively-typed observer event variant, `LlmCompletionEvent`, dispatched on every successful LLM provider call; proposal 0058 adds the failure-side counterpart, `LlmFailedEvent`; proposal 0057 extends the completion variant with eight request-side fields. The bundled `OpenAIProvider` retires its sentinel-namespace `NodeEvent` emission for LLM calls entirely, and the OTel and Langfuse observers now drive their LLM span / Generation from the typed events with back-dated timestamps so durations reflect the adapter boundary. Proposal 0047 closes implicit prefix-cache wire-byte stability: `Response.usage` gains cache-stat fields, the OTel observer emits `openarmature.llm.cache_read` attributes, and the OpenAI Chat Completions request body is byte-stable across equivalent inputs regardless of dict insertion order. Custom observers that filtered LLM calls by sentinel namespace MUST migrate to `isinstance` discrimination; `LLM_NAMESPACE` and `LlmEventPayload` remain as a documented compatibility surface.

docs/concepts/middleware.md

Lines changed: 90 additions & 0 deletions
@@ -199,6 +199,96 @@ Two implementation details worth knowing:
globally patching `time.monotonic` (which would also distort
asyncio's scheduling).

## Built-in: FailureIsolationMiddleware

```python
from openarmature.graph import FailureIsolationMiddleware

builder.add_node(
    "extract_segments",
    extract_fn,
    middleware=[
        FailureIsolationMiddleware(
            degraded_update={"segments": []},
            event_name="segment_extraction_degraded",
        ),
    ],
)
```

`FailureIsolationMiddleware` catches an exception escaping the wrapped
chain and returns a degraded partial update instead of letting it abort
the invocation. Reach for it when a node is not load-bearing enough to
kill the whole run: a failed enrichment step degrades to an empty list,
the graph continues, and the failure is still visible in your traces.
It is the named, observable form of the "catch and recover" pattern
from [Error semantics](#error-semantics) above.

Configuration:

- **`degraded_update`** (required) is the partial update returned on a
  caught exception. It may be a static mapping, or a callable
  `state -> partial_update` when the fallback shape depends on the input
  state. The callable is resolved once, at catch time.
- **`event_name`** (required, no default) is a stable identifier for
  this catch site. It rides on the emitted event (below) and any
  downstream logging. There is no default on purpose: a generic name
  like `"failure_isolated"` collapses every degraded path into one
  indistinguishable bucket in a dashboard, so the name is forced at the
  construction site, where the context to name it well is available.
- **`predicate`** is an optional `Exception -> bool`. When supplied,
  only exceptions where it returns true are caught; everything else
  propagates. The default catches every `Exception`.
- **`on_caught`** is an optional async hook `Exception -> None`, fired
  when the middleware catches. Use it to pump the caught exception to
  caller-specific telemetry beyond the framework event. It fires inline
  before the degraded update returns, and an exception it raises is
  isolated (logged, not propagated) so a buggy hook cannot defeat the
  recovery.

Like `RetryMiddleware`, it catches `Exception` only; `BaseException`
(cancellation, keyboard interrupt) propagates so aborts still work.
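
The configuration rules above can be sketched as a standalone wrapper. This is a minimal illustration under stated assumptions, not the library's implementation: `isolate_failures` is a hypothetical name, the node is assumed to be an async callable from state to a partial-update mapping, event dispatch is omitted, and resolving the fallback before calling the hook is one plausible ordering.

```python
from __future__ import annotations

import asyncio
import logging
from collections.abc import Awaitable, Callable, Mapping
from typing import Any

logger = logging.getLogger(__name__)

State = Mapping[str, Any]
Update = Mapping[str, Any]
Node = Callable[[State], Awaitable[Update]]


def isolate_failures(
    node: Node,
    *,
    degraded_update: Update | Callable[[State], Update],
    event_name: str,
    predicate: Callable[[Exception], bool] | None = None,
    on_caught: Callable[[Exception], Awaitable[None]] | None = None,
) -> Node:
    """Hypothetical stand-in for the catch-and-degrade behavior."""

    async def wrapped(state: State) -> Update:
        try:
            return await node(state)
        except Exception as exc:  # BaseException (e.g. CancelledError) is not caught
            if predicate is not None and not predicate(exc):
                raise  # non-matching exceptions propagate unchanged
            # Resolve the fallback once, at catch time.
            update = degraded_update(state) if callable(degraded_update) else degraded_update
            if on_caught is not None:
                try:
                    await on_caught(exc)
                except Exception:
                    # A failing hook is logged and never defeats the recovery.
                    logger.exception("on_caught hook failed for %s", event_name)
            return update

    return wrapped


async def flaky(state: State) -> Update:
    raise ValueError("upstream unavailable")


async def main() -> Update:
    safe = isolate_failures(
        flaky,
        degraded_update=lambda state: {"segments": []},
        event_name="segment_extraction_degraded",
    )
    return await safe({"text": "hello"})


result = asyncio.run(main())
```

In this sketch a `predicate` such as `lambda exc: isinstance(exc, KeyError)` would let the `ValueError` above propagate instead of degrading.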

### The failure-isolated event

On a catch, the middleware dispatches a `FailureIsolatedEvent` onto the
observer stream. It is a distinct event variant, not a node event: it
carries the `event_name`, the wrapped node's lineage identity, the input
and degraded states, and a `CaughtException` record holding the
exception's `category` (when it has one) and message. Observers narrow
on it with `isinstance(event, FailureIsolatedEvent)`. The bundled OTel
and Langfuse observers render it as a marker span / observation so the
catch shows up alongside the node's own span. The default emission path
is the observer stream only, with no logging-library dependency;
`on_caught` is the escape hatch for anything else.
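
A custom observer might narrow on the event roughly as follows. The two dataclasses are local stand-ins that copy the field lists from this commit's `events.py` so the snippet runs without the library. `DegradationCounter`, its `__call__(event)` shape, and the `getattr`-based `record_exception` helper are assumptions for illustration, not the library's observer protocol or its category lookup.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class CaughtException:  # stand-in mirroring the committed dataclass
    category: str | None
    message: str


@dataclass(frozen=True)
class FailureIsolatedEvent:  # stand-in mirroring the committed dataclass
    event_name: str
    namespace: tuple[str, ...]
    attempt_index: int
    fan_out_index: int | None
    branch_name: str | None
    pre_state: Any
    post_state: Mapping[str, Any]
    caught_exception: CaughtException


def record_exception(exc: Exception) -> CaughtException:
    # Assumed derivation: category only when the exception carries one,
    # message from str(exc) (empty string when there is no message).
    return CaughtException(category=getattr(exc, "category", None), message=str(exc))


class DegradationCounter:
    """Hypothetical observer that counts degradations per catch site."""

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def __call__(self, event: object) -> None:
        # Type discrimination, as the docs describe.
        if isinstance(event, FailureIsolatedEvent):
            self.counts[event.event_name] += 1


observer = DegradationCounter()
observer(
    FailureIsolatedEvent(
        event_name="segment_extraction_degraded",
        namespace=("extract_segments",),
        attempt_index=0,
        fan_out_index=None,
        branch_name=None,
        pre_state={"text": "hello"},
        post_state={"segments": []},
        caught_exception=record_exception(ValueError("upstream unavailable")),
    )
)
observer(object())  # any other event kind is ignored
```

Keying the counter on `event_name` is what makes the required, site-specific name useful: each degraded path gets its own bucket.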

### Composing with RetryMiddleware

The two compose into the canonical "retry transients, then give up
gracefully" pattern. The order is load-bearing: failure isolation is the
**outer** layer, retry is **inner**.

```python
builder.add_node(
    "summarize",
    summarize_fn,
    middleware=[
        FailureIsolationMiddleware(
            degraded_update={"summary": ""},
            event_name="summary_degraded",
        ),
        RetryMiddleware(max_attempts=3),
    ],
)
```

Retry sits closest to the node, so it sees raw transient failures first
and retries them. Only what escapes retry (an exhausted budget, or a
non-transient exception retry's classifier declines) reaches the outer
failure isolation, which degrades. Reverse the order and the inner
isolation would swallow transients before retry ever saw them, defeating
the retry entirely.
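
The ordering argument can be checked with two toy wrappers. These are simplified stand-ins, not the bundled middleware: the hypothetical `retry` retries on any `Exception` with no backoff or classifier, and the hypothetical `isolate` returns a fixed fallback.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any

Node = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


def retry(node: Node, max_attempts: int) -> Node:
    # Toy retry: assumes max_attempts >= 1, retries every Exception.
    async def wrapped(state):
        for attempt in range(max_attempts):
            try:
                return await node(state)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # budget exhausted: let the failure escape

    return wrapped


def isolate(node: Node, fallback: dict[str, Any]) -> Node:
    # Toy isolation: any Exception becomes the fallback update.
    async def wrapped(state):
        try:
            return await node(state)
        except Exception:
            return fallback

    return wrapped


def make_flaky(fail_times: int):
    calls = {"n": 0}

    async def node(state):
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise ConnectionError("transient")
        return {"summary": "ok"}

    return node, calls


async def main():
    # Isolation outer, retry inner: the two transient failures are retried away.
    node, calls = make_flaky(fail_times=2)
    good = isolate(retry(node, max_attempts=3), {"summary": ""})
    good_result, good_calls = await good({}), calls["n"]

    # Reversed: inner isolation swallows the first failure, so retry sees success.
    node, calls = make_flaky(fail_times=2)
    bad = retry(isolate(node, {"summary": ""}), max_attempts=3)
    bad_result, bad_calls = await bad({}), calls["n"]
    return good_result, good_calls, bad_result, bad_calls


good_result, good_calls, bad_result, bad_calls = asyncio.run(main())
```

With isolation outermost the node runs three times and the real summary comes back; with the order reversed it runs once and the degraded value is returned even though a retry would have succeeded.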

## Related

- [Parallel branches](parallel-branches.md): per-branch middleware

src/openarmature/graph/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -36,6 +36,8 @@
    UnreachableNode,
)
from .events import (
    CaughtException,
    FailureIsolatedEvent,
    InvocationCompletedEvent,
    InvocationStartedEvent,
    LlmCompletionEvent,
@@ -45,6 +47,8 @@
)
from .fan_out import FanOutConfig, FanOutNode
from .middleware import (
    DegradedUpdate,
    FailureIsolationMiddleware,
    Middleware,
    NextCall,
    RetryMiddleware,
@@ -64,15 +68,19 @@

__all__ = [
    "END",
    "CaughtException",
    "CompileError",
    "CompiledGraph",
    "ConditionalEdge",
    "ConflictingReducers",
    "DanglingEdge",
    "DegradedUpdate",
    "DrainSummary",
    "EdgeException",
    "EndSentinel",
    "ExplicitMapping",
    "FailureIsolatedEvent",
    "FailureIsolationMiddleware",
    "FanOutConfig",
    "FanOutCountModeAmbiguous",
    "FanOutEmpty",

src/openarmature/graph/events.py

Lines changed: 62 additions & 0 deletions
@@ -659,7 +659,69 @@ class LlmFailedEvent:
    caller_invocation_metadata: Mapping[str, AttributeValue] | None = None


@dataclass(frozen=True)
class CaughtException:
    """Structured record of an exception caught by
    ``FailureIsolationMiddleware``.

    - ``category``: the exception's failure category when it carries
      one (e.g. an llm-provider error's ``category`` attribute), else
      ``None`` for a bare exception that carries no category.
    - ``message``: the human-readable exception message (``str(exc)``);
      the empty string when the exception carried no message.
    """

    category: str | None
    message: str


# Spec: realizes pipeline-utilities §6.3 failure-isolation middleware
# (proposal 0050). Emitted by FailureIsolationMiddleware when it
# catches an exception escaping the inner chain and substitutes a
# degraded partial update. A distinct framework-emitted event kind
# (NOT a NodeEvent — does not reuse node_name / namespace / error),
# mirroring the proposal 0040 MetadataAugmentationEvent mechanism:
# enqueued on the engine's serial observer-delivery queue via
# ``current_dispatch()`` and NOT subject to the observer ``phases``
# filter (matches MetadataAugmentationEvent / InvocationStartedEvent /
# InvocationCompletedEvent / LlmCompletionEvent / LlmFailedEvent
# treatment).
@dataclass(frozen=True)
class FailureIsolatedEvent:
    """A failure-isolation event delivered to observers.

    Reports that ``FailureIsolationMiddleware`` caught an exception at
    a node and substituted a degraded partial update for the node's
    output. Observer code filters by type discrimination
    (``isinstance(event, FailureIsolatedEvent)``).

    Field set:

    - ``event_name``: the caller-supplied identifier for this catch
      site, from the middleware's configuration.
    - ``namespace`` / ``attempt_index`` / ``fan_out_index`` /
      ``branch_name``: the wrapped node's lineage identity, surfaced
      for correlation with the node's other events.
    - ``pre_state``: the state the wrapped node received.
    - ``post_state``: the degraded partial update the middleware
      returned in place of the node's output.
    - ``caught_exception``: a :class:`CaughtException` record of the
      caught exception (category + message).
    """

    event_name: str
    namespace: tuple[str, ...]
    attempt_index: int
    fan_out_index: int | None
    branch_name: str | None
    pre_state: Any
    post_state: Mapping[str, Any]
    caught_exception: CaughtException


__all__ = [
    "CaughtException",
    "FailureIsolatedEvent",
    "FanOutEventConfig",
    "InvocationCompletedEvent",
    "InvocationStartedEvent",

src/openarmature/graph/middleware/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -18,6 +18,7 @@
"""

from ._core import ChainCall, Middleware, NextCall, compose_chain
from .failure_isolation import DegradedUpdate, FailureIsolationMiddleware
from .retry import (
    TRANSIENT_CATEGORIES,
    BackoffStrategy,
@@ -34,6 +35,8 @@
    "BackoffStrategy",
    "ChainCall",
    "Classifier",
    "DegradedUpdate",
    "FailureIsolationMiddleware",
    "Middleware",
    "NextCall",
    "OnCompleteCallback",
