Add FailureIsolationMiddleware (proposal 0050) #149
Merged
A third bundled middleware primitive alongside RetryMiddleware and TimingMiddleware. It catches exceptions escaping a wrapped node and returns a configured degraded partial update, so a non-critical node can fail without aborting the whole invocation. On a catch it emits a distinct FailureIsolatedEvent (with a CaughtException record) that the bundled OTel and Langfuse observers render, keeping the degradation visible in traces. A raising on_caught hook is isolated so a buggy telemetry hook cannot defeat the recovery. First of two PRs for proposal 0050; call-level retry follows. No spec-pin change (0050 is already within the v0.53.0 pin).
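The catch-and-degrade behavior described above can be sketched in a few lines of plain asyncio. This is a minimal illustration, not the middleware's actual implementation: `isolate`, `CaughtRecord`, `IsolatedRecord`, and the `emit` callback are hypothetical stand-ins, using the exception's type name as the category is an assumption, and so is emitting the event before running the hook.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Mapping, Optional, Union

State = Mapping[str, Any]
Update = Mapping[str, Any]
Node = Callable[[State], Awaitable[Update]]

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class CaughtRecord:
    """Hypothetical stand-in for the CaughtException record."""
    category: str  # assumption: the exception's type name
    message: str


@dataclass(frozen=True)
class IsolatedRecord:
    """Hypothetical stand-in for FailureIsolatedEvent (subset of fields)."""
    event_name: str
    caught: CaughtRecord
    post_state: dict


def isolate(
    node: Node,
    *,
    degraded_update: Union[Update, Callable[[State], Update]],
    event_name: str,
    predicate: Optional[Callable[[Exception], bool]] = None,
    on_caught: Optional[Callable[[Exception], Awaitable[None]]] = None,
    emit: Optional[Callable[[IsolatedRecord], None]] = None,
) -> Node:
    """Wrap `node` so a matching failure yields the degraded update instead."""

    async def wrapped(state: State) -> Update:
        try:
            return await node(state)
        except Exception as exc:
            # BaseException subclasses (e.g. asyncio.CancelledError) are
            # deliberately not caught, so cancellation still propagates.
            if predicate is not None and not predicate(exc):
                raise
            # Resolve the degraded update once, at catch time.
            update = degraded_update(state) if callable(degraded_update) else degraded_update
            if emit is not None:
                record = CaughtRecord(type(exc).__name__, str(exc))
                emit(IsolatedRecord(event_name, record, {**state, **update}))
            if on_caught is not None:
                try:
                    await on_caught(exc)
                except Exception:
                    # A buggy hook is warned about, never allowed to undo recovery.
                    logger.warning("on_caught hook raised; ignoring", exc_info=True)
            return dict(update)

    return wrapped
```

Wrapping a node as `isolate(node, degraded_update={"summary": None}, event_name="summary_degraded")` would then return `{"summary": None}` whenever `node` raises an `Exception`, while a cancellation still unwinds the caller.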
Pull request overview
Adds proposal-0050 failure isolation as a first-class middleware primitive, allowing non-critical nodes to degrade to a configured partial update instead of aborting an invocation, while emitting an observer-visible event for traceability.
Changes:
- Introduces FailureIsolationMiddleware, CaughtException, and FailureIsolatedEvent, including public exports.
- Extends bundled OTel and Langfuse observers to render failure-isolation markers.
- Adds comprehensive unit + integration tests plus documentation and changelog entry.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/test_failure_isolation_middleware.py | New unit/integration tests for middleware behavior, event emission, and observer rendering. |
| src/openarmature/observability/otel/observer.py | Handle FailureIsolatedEvent by emitting a marker span. |
| src/openarmature/observability/langfuse/observer.py | Handle FailureIsolatedEvent by emitting a marker observation. |
| src/openarmature/observability/correlation.py | Extends dispatch typing to include FailureIsolatedEvent. |
| src/openarmature/graph/observer.py | Extends ObserverEvent union and documentation to include FailureIsolatedEvent. |
| src/openarmature/graph/middleware/failure_isolation.py | New middleware implementation and event emission via current_dispatch(). |
| src/openarmature/graph/middleware/__init__.py | Re-exports FailureIsolationMiddleware / DegradedUpdate. |
| src/openarmature/graph/events.py | Adds CaughtException and FailureIsolatedEvent dataclasses and exports. |
| src/openarmature/graph/__init__.py | Public exports for new middleware and event types. |
| docs/concepts/middleware.md | Documents FailureIsolationMiddleware, event semantics, and retry composition ordering. |
| CHANGELOG.md | Unreleased “Added” entry for the new middleware/event. |
Two doc-only changes from Copilot review of PR #149, no behavior change:
- The module docstring now states that degraded_update is resolved once at catch time (which populates the event's post_state), with the numbered steps covering only the observable order after that.
- A comment at the FailureIsolatedEvent dispatch documents that attempt_index is the intentional node-level baseline (not a per-attempt index) and that span parenting is unaffected.
This was referenced Jun 10, 2026
First of two PRs implementing proposal 0050 (retry and degradation primitives). This one adds the failure-isolation middleware; call-level retry on LlmProvider.complete() follows in a second PR.
What
FailureIsolationMiddleware is a third bundled middleware primitive alongside RetryMiddleware and TimingMiddleware. It catches an exception escaping the wrapped node's chain and returns a configured degraded partial update, so a non-critical node can fail without aborting the whole invocation.
Configuration: degraded_update (a static mapping or a state -> partial_update callable), a required event_name, an optional predicate (Exception -> bool), and an optional async on_caught hook. It catches Exception; BaseException (cancellation) propagates, matching RetryMiddleware. Compose it as the outer layer around RetryMiddleware for the "retry transients, degrade gracefully on exhaustion" pattern.
On a catch it dispatches a new framework-emitted FailureIsolatedEvent (carrying event_name, the wrapped node's lineage, pre/post state, and a CaughtException record of category plus message) onto the observer delivery queue, via the same current_dispatch() path that set_invocation_metadata uses. No engine changes were needed. The bundled OTel and Langfuse observers render it as a marker span / observation so the degradation stays visible in traces.
A raising on_caught hook is isolated (warned, not propagated) so a buggy telemetry hook cannot turn a recovered node back into a crash.
Scope
Tests
15 tests in tests/unit/test_failure_isolation_middleware.py: unit coverage (static and callable degraded update, predicate filtering, on_caught fires and is exception-isolated, cancellation propagates, event-field population, bare vs categorized exception), full-engine integration (degrade-via-invoke with a recording observer, three-piece composition with RetryMiddleware), and rendering by both bundled observers. Full suite green (1259 passed), pyright and ruff clean, docs build clean.
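The composition with RetryMiddleware exercised by the integration test depends on ordering: isolation has to wrap retry, so that it only sees the exception retry re-raises after its last attempt. A rough sketch of that layering follows, using hypothetical `with_retry` and `with_isolation` helpers rather than the bundled middleware classes, whose constructor signatures are not shown here.

```python
import asyncio
from typing import Any, Awaitable, Callable, Mapping

State = Mapping[str, Any]
Update = Mapping[str, Any]
Node = Callable[[State], Awaitable[Update]]


def with_retry(node: Node, *, max_attempts: int) -> Node:
    """Hypothetical retry layer: re-run the node, re-raise on exhaustion."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    async def wrapped(state: State) -> Update:
        for attempt in range(1, max_attempts + 1):
            try:
                return await node(state)
            except Exception:
                if attempt == max_attempts:
                    raise  # exhausted: let the outer layer decide
        raise AssertionError("unreachable")

    return wrapped


def with_isolation(node: Node, *, degraded_update: Update) -> Node:
    """Hypothetical isolation layer: swap a failure for a degraded update."""

    async def wrapped(state: State) -> Update:
        try:
            return await node(state)
        except Exception:
            return dict(degraded_update)

    return wrapped


def compose(node: Node, *, max_attempts: int, degraded_update: Update) -> Node:
    # Isolation is the OUTER layer, so every retry attempt runs first and
    # the degraded update is used only once retry has given up.
    inner = with_retry(node, max_attempts=max_attempts)
    return with_isolation(inner, degraded_update=degraded_update)
```

Reversing the order would make retry pointless: the isolation layer would swallow the first failure and retry would never see an exception to act on.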