
Add FailureIsolationMiddleware (proposal 0050)#149

Merged
chris-colinsky merged 2 commits into
mainfrom
feature/0050-failure-isolation-middleware
Jun 10, 2026

@chris-colinsky

Member

First of two PRs implementing proposal 0050 (retry and degradation primitives). This one adds the failure-isolation middleware; call-level retry on LlmProvider.complete() follows in a second PR.

What

FailureIsolationMiddleware is a third bundled middleware primitive alongside RetryMiddleware and TimingMiddleware. It catches an exception escaping the wrapped node's chain and returns a configured degraded partial update, so a non-critical node can fail without aborting the whole invocation.
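The core behavior described above can be sketched without the library. This is a minimal illustration, not the actual implementation: `isolate_failures`, the `(state, next_)` middleware shape, and the `flaky_enrichment` node are all hypothetical names chosen for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Mapping

State = Mapping[str, Any]
PartialUpdate = Mapping[str, Any]
Next = Callable[[State], Awaitable[PartialUpdate]]
Middleware = Callable[[State, Next], Awaitable[PartialUpdate]]


def isolate_failures(degraded_update: PartialUpdate) -> Middleware:
    """Hypothetical helper: build a middleware that degrades instead of failing."""

    async def middleware(state: State, next_: Next) -> PartialUpdate:
        try:
            # Run the rest of the chain (inner middleware plus the node itself).
            return await next_(state)
        except Exception:
            # The node failed: hand back the configured partial update so the
            # invocation can continue without this node's real output.
            return dict(degraded_update)

    return middleware


async def flaky_enrichment(state: State) -> PartialUpdate:
    # Stand-in for a non-critical node that fails.
    raise RuntimeError("enrichment service unavailable")


async def run_demo() -> PartialUpdate:
    middleware = isolate_failures({"enrichment": None})
    return await middleware({"query": "hello"}, flaky_enrichment)


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The point of the shape is that the caller of the middleware cannot tell a degraded result from a normal partial update, which is why the separate observer event matters for visibility.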

Configuration: degraded_update (a static mapping or a state -> partial_update callable), a required event_name, an optional predicate (Exception -> bool), and an optional async on_caught hook. It catches Exception only; BaseException (for example, cancellation) propagates, matching RetryMiddleware. Compose it as the outer layer around RetryMiddleware for the "retry transients, degrade gracefully on exhaustion" pattern.
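A rough sketch of how these options and the outer/inner composition could fit together follows. It does not use the real classes: `with_retry` and `with_isolation` are hypothetical function wrappers standing in for RetryMiddleware and FailureIsolationMiddleware, and their parameters are assumptions modeled on the option names listed above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Mapping, Optional, Union

State = Mapping[str, Any]
Update = Mapping[str, Any]
Node = Callable[[State], Awaitable[Update]]
DegradedUpdate = Union[Update, Callable[[State], Update]]


def with_retry(node: Node, attempts: int) -> Node:
    """Hypothetical retry layer: re-run the node up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    async def wrapped(state: State) -> Update:
        for attempt in range(1, attempts + 1):
            try:
                return await node(state)
            except Exception:
                if attempt == attempts:
                    raise  # retries exhausted: let the outer layer decide
        raise AssertionError("unreachable")

    return wrapped


def with_isolation(
    node: Node,
    degraded_update: DegradedUpdate,
    predicate: Optional[Callable[[Exception], bool]] = None,
) -> Node:
    """Hypothetical isolation layer: degrade when the predicate accepts the error."""

    async def wrapped(state: State) -> Update:
        try:
            return await node(state)
        except Exception as exc:
            if predicate is not None and not predicate(exc):
                raise  # not an error this layer is configured to absorb
            if callable(degraded_update):
                return degraded_update(state)  # state -> partial update
            return dict(degraded_update)  # static mapping

    return wrapped


calls: list[str] = []


async def always_times_out(state: State) -> Update:
    calls.append("attempt")
    raise TimeoutError("upstream timed out")


# Isolation is the outer layer, so it only sees the error after retries run out.
guarded = with_isolation(
    with_retry(always_times_out, attempts=3),
    degraded_update=lambda state: {"summary": f"unavailable for doc {state['doc_id']}"},
    predicate=lambda exc: isinstance(exc, TimeoutError),
)

if __name__ == "__main__":
    print(asyncio.run(guarded({"doc_id": 42})), len(calls))
```

Reversing the order would change the behavior: with isolation on the inside, the first failure would already be converted into a degraded update and the retry layer would never see an exception to retry.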

On a catch it dispatches a new framework-emitted FailureIsolatedEvent (carrying event_name, the wrapped node's lineage, pre/post state, and a CaughtException record of category plus message) onto the observer delivery queue, via the same current_dispatch() path set_invocation_metadata uses. No engine changes were needed. The bundled OTel and Langfuse observers render it as a marker span / observation so the degradation stays visible in traces.

A raising on_caught hook is isolated (warned, not propagated) so a buggy telemetry hook cannot turn a recovered node back into a crash.
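One way to get this property is to wrap the hook call in its own try/except and downgrade any failure to a warning. The sketch below assumes a hook signature of `async (Exception) -> None` and uses hypothetical names (`run_hook_isolated`, `buggy_hook`, `recover`); the real hook signature and warning mechanism may differ.

```python
import asyncio
import warnings
from typing import Awaitable, Callable, Optional

OnCaught = Callable[[Exception], Awaitable[None]]


async def run_hook_isolated(hook: Optional[OnCaught], exc: Exception) -> None:
    """Call the hook, turning any exception it raises into a warning."""
    if hook is None:
        return
    try:
        await hook(exc)
    except Exception as hook_exc:
        # The node has already been recovered; a broken hook must not undo that.
        warnings.warn(
            f"on_caught hook raised {hook_exc!r}; ignoring",
            RuntimeWarning,
            stacklevel=2,
        )


async def buggy_hook(exc: Exception) -> None:
    # Stand-in for a telemetry hook with a bug.
    raise KeyError("missing metric name")


async def recover(hook: Optional[OnCaught]) -> dict:
    try:
        raise RuntimeError("node failed")
    except Exception as exc:
        await run_hook_isolated(hook, exc)
        return {"status": "degraded"}


if __name__ == "__main__":
    print(asyncio.run(recover(buggy_hook)))
```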

Scope

  • No spec-pin change: proposal 0050 is already within the current v0.53.0 pin.
  • Call-level retry (the other half of 0050) lands in a follow-up PR.

Tests

15 tests in tests/unit/test_failure_isolation_middleware.py: unit coverage (static and callable degraded update, predicate filtering, on_caught fires and is exception-isolated, cancellation propagates, event-field population, bare vs categorized exception), full-engine integration (degrade-via-invoke with a recording observer, three-piece composition with RetryMiddleware), and rendering by both bundled observers. Full suite green (1259 passed), pyright and ruff clean, docs build clean.
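As an illustration of the "cancellation propagates" case, the following sketch shows why catching Exception leaves task cancellation alone: since Python 3.8, `asyncio.CancelledError` derives from BaseException. The `isolated` wrapper and `slow_node` are hypothetical stand-ins, not code from the test file.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def isolated(node: Callable[[], Awaitable[Dict[str, Any]]]) -> Dict[str, Any]:
    try:
        return await node()
    except Exception:
        # CancelledError is a BaseException, so it never reaches this branch.
        return {"degraded": True}


async def slow_node() -> Dict[str, Any]:
    await asyncio.sleep(60)
    return {"degraded": False}


async def cancellation_propagates() -> bool:
    task = asyncio.create_task(isolated(slow_node))
    await asyncio.sleep(0)  # let the task start and suspend inside slow_node
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return True  # cancellation escaped the isolation wrapper
    return False


if __name__ == "__main__":
    print(asyncio.run(cancellation_propagates()))
```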

A third bundled middleware primitive alongside RetryMiddleware and
TimingMiddleware. It catches exceptions escaping a wrapped node and
returns a configured degraded partial update, so a non-critical node
can fail without aborting the whole invocation. On a catch it emits a
distinct FailureIsolatedEvent (with a CaughtException record) that the
bundled OTel and Langfuse observers render, keeping the degradation
visible in traces. A raising on_caught hook is isolated so a buggy
telemetry hook cannot defeat the recovery.

First of two PRs for proposal 0050; call-level retry follows. No
spec-pin change (0050 is already within the v0.53.0 pin).
Copilot AI review requested due to automatic review settings June 10, 2026 21:58

Copilot AI left a comment

Pull request overview

Adds proposal-0050 failure isolation as a first-class middleware primitive, allowing non-critical nodes to degrade to a configured partial update instead of aborting an invocation, while emitting an observer-visible event for traceability.

Changes:

  • Introduces FailureIsolationMiddleware, CaughtException, and FailureIsolatedEvent, including public exports.
  • Extends bundled OTel and Langfuse observers to render failure-isolation markers.
  • Adds comprehensive unit + integration tests plus documentation and changelog entry.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/unit/test_failure_isolation_middleware.py | New unit/integration tests for middleware behavior, event emission, and observer rendering. |
| src/openarmature/observability/otel/observer.py | Handle FailureIsolatedEvent by emitting a marker span. |
| src/openarmature/observability/langfuse/observer.py | Handle FailureIsolatedEvent by emitting a marker observation. |
| src/openarmature/observability/correlation.py | Extends dispatch typing to include FailureIsolatedEvent. |
| src/openarmature/graph/observer.py | Extends ObserverEvent union and documentation to include FailureIsolatedEvent. |
| src/openarmature/graph/middleware/failure_isolation.py | New middleware implementation and event emission via current_dispatch(). |
| src/openarmature/graph/middleware/__init__.py | Re-exports FailureIsolationMiddleware / DegradedUpdate. |
| src/openarmature/graph/events.py | Adds CaughtException and FailureIsolatedEvent dataclasses and exports. |
| src/openarmature/graph/__init__.py | Public exports for new middleware and event types. |
| docs/concepts/middleware.md | Documents FailureIsolationMiddleware, event semantics, and retry composition ordering. |
| CHANGELOG.md | Unreleased “Added” entry for the new middleware/event. |

Comment thread src/openarmature/graph/middleware/failure_isolation.py Outdated
Comment thread src/openarmature/graph/middleware/failure_isolation.py
Two doc-only changes from the Copilot review of PR #149, no behavior
change:

- The module docstring now states that degraded_update is resolved
  once at catch time (which populates the event's post_state), with
  the numbered steps covering only the observable order after that.
- A comment at the FailureIsolatedEvent dispatch documents that
  attempt_index is the intentional node-level baseline (not a
  per-attempt index) and that span parenting is unaffected.
@chris-colinsky chris-colinsky merged commit 795c549 into main Jun 10, 2026
6 checks passed
@chris-colinsky chris-colinsky deleted the feature/0050-failure-isolation-middleware branch June 10, 2026 22:26