Commit d803356
Add failure-isolation catch gate and cause-chain classification primitive (0074) (#174)
* Pin spec v0.65.0 for proposal 0074

  Advance the spec submodule pin v0.64.0 -> v0.65.0 for accepted proposal 0074 (failure-isolation catch gate + §6.4 cause-chain classification primitive). Updates __spec_version__, the pyproject spec_version, the smoke assertion, the conformance.toml spec_pin, and regenerates the bundled AGENTS.md. conformance.toml records 0074 as implemented.

* Add failure-isolation catch gate and §6.4 primitive (0074)

  FailureIsolationMiddleware gains an optional `catch` set of error categories: an exception is caught only if the derived category of its cause chain (resolved through node_exception carriers) is in the set, conjoined with `predicate` (catch checked first, short-circuiting). The carrier-skipping walk behind `catch` and `caught_exception` becomes a public primitive, classify_cause_chain(exc) -> CaughtException. The cause-chain types (CauseLink, CaughtException) move into the new cause_chain module alongside it, so the concept has one home and events consumes it; the public openarmature.graph paths are unchanged. The default retry classifier's single-level depth is documented as deliberate (no behavior change). Unit tests cover the gate, the short-circuit, and the primitive.

* Wire failure-isolation catch conformance fixture 072

  Parse the `catch` directive on the failure_isolation fixture middleware config and add fixture 072 to the failure-isolation fixture set. 072 (two cases) drives the catch gate matching through a §9.7 instance node_exception carrier (degrade) and a non-matching catch (propagate).

* Document failure-isolation catch + classification (0074)

  Document the `catch` category gate and the public classify_cause_chain primitive in the middleware concepts page, and add the 0.15.0 changelog entry (advancing the spec-pin bullet to v0.65.0).

* Harden catch typing and tighten derived-category wording

  PR #174 review: reject a bare str for FailureIsolationMiddleware.catch (a str is a Collection[str], so it would substring-match and silently mis-gate) and normalize to a frozenset. Tighten the derived-category wording in the docstring, the concepts page, and the classify example to the outermost non-carrier link with a category (an uncategorized surface link resolves to the deeper categorized cause). Fix the stale events/errors import comment now that cause_chain imports only errors.

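
Reviewer note on the bare-str rejection: the hazard is that `in` on a `str` is a substring test, so a `catch` passed as a single string would gate on substrings rather than on category membership. The snippet below is a minimal sketch of that kind of validation, not the code in this commit. The helper name `normalize_catch` and the category names `rate_limit` and `timeout` are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from collections.abc import Collection


def normalize_catch(catch: Collection[str] | None) -> frozenset[str] | None:
    """Hypothetical helper: validate a `catch` argument and freeze it."""
    if catch is None:
        # Unset: the category gate stays permissive.
        return None
    if isinstance(catch, str):
        # A str is itself a Collection[str]; `category in catch` would then
        # be a substring test instead of a set-membership test.
        raise TypeError("catch must be a collection of category names, not a bare str")
    # Freezing also de-duplicates and makes the configured set immutable.
    return frozenset(catch)


# The failure mode being guarded against (illustrative category names):
assert "rate" in "rate_limit"        # substring test passes by accident
assert "rate" not in {"rate_limit"}  # set membership does not
```

Passing `{"rate_limit"}` or `["rate_limit"]` goes through this sketch unchanged in meaning; passing `"rate_limit"` fails loudly at construction instead of mis-gating later.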
1 parent a881c5f commit d803356

15 files changed

Lines changed: 403 additions & 143 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -12,10 +12,11 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 - **Per-attempt LLM spans under call-level retry** (proposal 0050, observability §5.5 / llm-provider §7.1). Completes proposal 0050, which shipped `partial` in v0.14.0 (failure-isolation middleware and the `complete(retry=...)` loop landed then; the per-attempt span surface was deferred). Under call-level retry the OTel observer now emits one `openarmature.llm.complete` span per attempt, each carrying `openarmature.llm.attempt_index` (0-based, 0..N-1, and 0 for a no-retry call). An intermediate failed attempt's span carries ERROR status plus its error category and the request-side attributes; the final attempt's span carries the terminal outcome and, on success, the full response surface. A python-internal `LlmRetryAttemptEvent`, dispatched once per attempt, is the sole source of the OTel span; the terminal `LlmCompletionEvent` / `LlmFailedEvent` stay one per call (payload, latency, Langfuse Generation) and no longer drive the OTel span. Langfuse renders one terminal Generation per call, with the per-attempt detail on the OTel span surface only (a spec-side §8 clarification to pin this is tracked, non-blocking). `conformance.toml` flips proposal 0050 to `implemented`; the call-level fixtures 056-058 are driven through the provider plus OTel observer and the single-attempt observability fixture 057 is wired.
 - **Langfuse `trace.userId` / `trace.sessionId` population** (proposal 0064, observability §8.4.1, spec v0.62.0). The Langfuse observer now promotes a recognized `userId` key in the caller-supplied invocation metadata to Langfuse's first-class `trace.userId` field (the Users dashboard), additively: the key also remains at `trace.metadata.userId`. Promotion is automatic and unconditional; an absent key leaves `trace.userId` unset. The `LangfuseClient.trace()` surface (the Protocol, the in-memory client, and the SDK adapter) gains `session_id` / `user_id`. `trace.sessionId` is sourced from `openarmature.session_id`, which the sessions capability (proposal 0020) establishes; that capability is not yet implemented in python, so the `sessionId` plumbing is in place but dormant (no source) and unset in the interim. `conformance.toml` records proposal 0064 `partial` on that basis: fixture 084 cases 2/3/4 (not session-bound, `userId` present additively, `userId` absent) run, and the session-bound cases 1/5 defer until 0020. Langfuse-only: the OTel side already carries `openarmature.session_id` and `openarmature.user.*` as span attributes, and OTel has no trace-level session/user field.
 - **Per-fetch prompt cache control: `cache_ttl_seconds`** (proposal 0072, prompt-management §5 / §6, spec v0.63.0). `PromptBackend.fetch`, `PromptManager.fetch`, and `PromptManager.get` gain an optional `cache_ttl_seconds` read-side control: `None` preserves current behavior, `0` forces a fresh read past any client-side cache, and `N > 0` bounds a served entry's staleness to N seconds; a negative value is rejected at the manager. It governs only which cached entry may be served, not whether or how results are cached. The bundled filesystem backend is cacheless and ignores it; the bundled Langfuse backend forwards it to the Langfuse SDK's `get_prompt` cache. Conformance fixtures 033/034 run through a caching harness backend (conformance-adapter §6.8: `source_read_count` plus a controllable `advance_clock`).
+- **Failure-isolation `catch` gate + cause-chain classification primitive** (proposal 0074, pipeline-utilities §6.3 / §6.4, spec v0.65.0). `FailureIsolationMiddleware` gains an optional `catch`: a set of error categories. An exception is caught only if the *derived category* of its cause chain (the outermost non-carrier link's category, resolved through the engine's `node_exception` carriers, the same value reported as `caught_exception.category`) is in the set. This closes a degrade-into-crash footgun: at a wrapping placement (subgraph, fan-out instance, branch) the engine wraps the originating failure in a carrier, so a `predicate` inspecting the surface exception sees only the carrier and misses it, whereas `catch` classifies through the carrier. `catch` composes with `predicate` as a conjunction; both default permissive (both unset stays catch-all), and a null derived category never matches a non-empty set. The carrier-skipping walk behind `catch` and `caught_exception` is promoted to a public primitive, `classify_cause_chain(exc) -> CaughtException` (the ordered `chain`, the derived `category`, and its `message` — the same record the event carries), exported from `openarmature.graph` for use in a custom `predicate`, a router, a metric, or a full-chain retry classifier. The default retry classifier stays deliberately single-level (it classifies at re-attempt granularity); this is now documented, with no behavior change. Conformance fixture 072 (catch matches through an instance-placement carrier and degrades; a non-matching catch propagates with no event). The optional native-exception-type `catch` form (spec MAY) is not shipped.

 ### Changed

-- **Pinned spec advances v0.60.0 → v0.64.0** across the v0.15.0 cycle: v0.61.0 (proposal 0061, the detached-trace invocation span above), v0.62.0 (proposal 0064, the Langfuse session/user population above), v0.63.0 (proposal 0072, the prompt cache control above), the v0.63.1 patch (pipeline-utilities coverage fixtures 070/071 for the already-implemented 0069 / 0070 behavior, no new proposal), and v0.64.0 (proposal 0073, GenAI semconv adoption reconciliation: OA retains `gen_ai.system` despite the upstream rename to `gen_ai.provider.name`; textual-only, with no emitted-attribute or fixture change, so the existing `gen_ai.*` fixtures stand as the retention regression). `conformance.toml` records 0061 / 0072 `implemented`, 0064 `partial` (its `sessionId` half is dormant pending the sessions capability), and 0073 `textual-only`. Proposal 0050 needed no pin bump of its own (it was already within the pin from its v0.42.0 acceptance); its v0.14.0 `partial` entry flips to `implemented` with the per-attempt span surface above.
+- **Pinned spec advances v0.60.0 → v0.65.0** across the v0.15.0 cycle: v0.61.0 (proposal 0061, the detached-trace invocation span above), v0.62.0 (proposal 0064, the Langfuse session/user population above), v0.63.0 (proposal 0072, the prompt cache control above), the v0.63.1 patch (pipeline-utilities coverage fixtures 070/071 for the already-implemented 0069 / 0070 behavior, no new proposal), and v0.64.0 (proposal 0073, GenAI semconv adoption reconciliation: OA retains `gen_ai.system` despite the upstream rename to `gen_ai.provider.name`; textual-only, with no emitted-attribute or fixture change, so the existing `gen_ai.*` fixtures stand as the retention regression), and v0.65.0 (proposal 0074, the failure-isolation `catch` gate above). `conformance.toml` records 0061 / 0072 / 0074 `implemented`, 0064 `partial` (its `sessionId` half is dormant pending the sessions capability), and 0073 `textual-only`. Proposal 0050 needed no pin bump of its own (it was already within the pin from its v0.42.0 acceptance); its v0.14.0 `partial` entry flips to `implemented` with the per-attempt span surface above.

 ## [0.14.0] — 2026-06-17

conformance.toml

Lines changed: 8 additions & 1 deletion
@@ -32,7 +32,7 @@

 [manifest]
 implementation = "openarmature-python"
-spec_pin = "v0.64.0"
+spec_pin = "v0.65.0"

 # Status values:
 # implemented — shipped behavior matches the proposal's contract
@@ -719,3 +719,10 @@ note = "PromptBackend.fetch / PromptManager.fetch / get gain an optional cache_t
 status = "textual-only"
 since = "0.15.0"
 note = "Governance + observability §5.5 rationale change: reconciles the gen_ai.* adoption with upstream reality (the whole GenAI semconv surface is at Development status, and gen_ai.system was removed upstream in favor of gen_ai.provider.name). Adds a GenAI-scoped de-facto-interoperability carve-out (OA adopts the recognized core gen_ai.* names directly even at Development; peripheral attributes are mirrored to openarmature.*) and a post-adoption RETENTION rule (an adopted name is kept through an upstream rename / removal). No emitted-attribute change and no conformance-expectation change: python already emits the recognized core gen_ai.* set (including gen_ai.system, now RETAINED despite the upstream rename), so the existing gen_ai.* observability fixtures (e.g. 019-021) stand as the retention regression coverage. No python code and no new fixtures. The gen_ai.system -> gen_ai.provider.name migration is a deferred follow-on."
+
+# Spec v0.65.0 (proposal 0074). Failure-isolation `catch` cause-chain category
+# gate (§6.3) + public cause-chain classification primitive (§6.4).
+[proposals."0074"]
+status = "implemented"
+since = "0.15.0"
+note = "FailureIsolationMiddleware gains an optional `catch` set of error categories (§6.3): an exception is caught only if the DERIVED category of its cause chain (the outermost non-carrier link, resolved THROUGH node_exception carriers -- the same value reported as caught_exception.category) is in the set, composing with `predicate` as a conjunction (both default permissive, both unset = catch-all; a null derived category never matches a non-empty set). This classifies a carrier-wrapped failure correctly at a wrapping placement where a surface check sees only the carrier. The §6.4 cause-chain classification walk is promoted to a public primitive classify_cause_chain(exc) -> CaughtException (the existing failure-isolation record: chain + derived category + message) in openarmature.graph, shared by the catch gate, the emitted event, and any consumer. §6.1: the default retry classifier's single-level depth is documented as deliberate (re-run granularity vs §6.3 full-chain degrade); no behavior change. Fixture 072 (catch matches through an instance-placement carrier and degrades; a non-matching catch propagates with no event). The optional native-exception-type catch sugar (spec MAY) is not shipped."
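
Reviewer note on the gate semantics recorded in the 0074 manifest entry: the rules (conjunction with `predicate`, `catch` checked first, both unset meaning catch-all, a null derived category never matching a non-empty set) can be written down as a small decision function. This is a sketch under stated assumptions rather than the middleware's implementation: `admits` is a hypothetical name, the derived category is passed in as a plain argument instead of being computed from the cause chain, and the behaviour for an empty `catch` set (here: matches nothing) is not stated in the source.

```python
from __future__ import annotations

from collections.abc import Callable


def admits(
    exc: Exception,
    derived_category: str | None,
    catch: frozenset[str] | None = None,
    predicate: Callable[[Exception], bool] | None = None,
) -> bool:
    """Hypothetical gate: True means the exception would be caught."""
    # `catch` is evaluated first; when it rejects, `predicate` is never called.
    # A None category is never a member of a set of strings, so it cannot match.
    if catch is not None and derived_category not in catch:
        return False
    # `predicate` sees the surface exception only.
    if predicate is not None and not predicate(exc):
        return False
    # Both gates unset (or both admitting) means the exception is caught.
    return True
```

With both arguments left at `None` the function returns `True` for any exception, which mirrors the catch-all default described in the entry above.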

docs/concepts/middleware.md

Lines changed: 44 additions & 3 deletions
@@ -238,9 +238,26 @@ Configuration:
 like `"failure_isolated"` collapses every degraded path into one
 indistinguishable bucket in a dashboard, so the name is forced at the
 construction site, where the context to name it well is available.
-- **`predicate`** is an optional `Exception -> bool`. When supplied,
-only exceptions where it returns true are caught; everything else
-propagates. The default catches every `Exception`.
+- **`catch`** is an optional set of error categories. When supplied, an
+exception is caught only if the *derived category* of its cause chain
+is in the set: the category of the outermost non-carrier link that
+carries one, resolved *through* the engine's `node_exception` carriers (the same value the
+event reports as `caught_exception.category`). This is the recommended
+gate for category-scoped degradation. At a wrapping placement (a
+subgraph, a fan-out instance, a branch) the engine wraps the real
+failure in a carrier, so a check on the surface exception sees only the
+carrier and misses it; `catch` classifies through the carrier and
+matches the originating category. A bare uncategorized error has no
+derived category and is not matched, so it propagates.
+- **`predicate`** is an optional `Exception -> bool` over the *surface*
+(caught) exception. When supplied, only exceptions where it returns true
+are caught; everything else propagates. The default is always-true. It
+composes with `catch` as a conjunction (both must admit), and both
+default permissive, so the both-unset default catches every
+`Exception`. Because `predicate` sees the surface exception, it
+misclassifies a carrier-wrapped failure at a wrapping placement; reach
+for `catch` for category gating, or classify the chain yourself with
+`classify_cause_chain` (below).
 - **`on_caught`** is an optional async hook `Exception -> None`, fired
 when the middleware catches. Use it to pump the caught exception to
 caller-specific telemetry beyond the framework event. It fires inline
@@ -267,6 +284,30 @@ catch shows up alongside the node's own span. The default emission path
 is the observer stream only, with no logging-library dependency;
 `on_caught` is the escape hatch for anything else.

+### Cause-chain classification
+
+The walk behind `catch` and `caught_exception` is exposed as a public
+primitive, `classify_cause_chain`, so any consumer classifies a
+carrier-wrapped failure the same way the framework does:
+
+```python
+from openarmature.graph import classify_cause_chain
+
+result = classify_cause_chain(exc)
+result.category # derived category (outermost non-carrier link with a category), or None
+result.message # the message that category came from
+result.chain # the ordered CauseLink chain, carriers flagged
+```
+
+It returns a `CaughtException` (the same record the failure-isolated
+event's `caught_exception` field holds) carrying the ordered `chain` (one
+`CauseLink` per exception, carriers flagged), the derived `category`, and
+its `message`. Use it in a custom `predicate` that needs to see through
+carriers, in a router or metric keyed on the originating category, or in a
+retry classifier that wants full-chain depth (the default retry classifier
+is deliberately single-level, classifying at re-attempt granularity rather
+than walking the full chain).
+
 ### Composing with RetryMiddleware

 The two compose into the canonical "retry transients, then give up
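
Reviewer note on the docs hunk above: the "outermost non-carrier link with a category" rule may be easier to check against a toy model. The sketch below is not `classify_cause_chain` and does not import openarmature. `Carrier`, `CategorizedError`, `Link`, `Classified`, and `classify` are hypothetical stand-ins, and it assumes the chain is followed through `__cause__`, that a category is exposed as a `category` attribute, and that `message` is `None` when no category is found. None of those details are confirmed by this diff.

```python
from __future__ import annotations

from dataclasses import dataclass


class Carrier(Exception):
    """Stand-in for an engine carrier wrapping the originating failure."""


class CategorizedError(Exception):
    """Stand-in for an error that carries a category."""

    def __init__(self, message: str, category: str | None = None) -> None:
        super().__init__(message)
        self.category = category


@dataclass(frozen=True)
class Link:
    type_name: str
    message: str
    category: str | None
    is_carrier: bool


@dataclass(frozen=True)
class Classified:
    chain: tuple[Link, ...]
    category: str | None
    message: str | None


def classify(exc: BaseException) -> Classified:
    # Collect one link per exception, outermost first, guarding against cycles.
    chain: list[Link] = []
    seen: set[int] = set()
    current: BaseException | None = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(
            Link(
                type_name=type(current).__name__,
                message=str(current),
                category=getattr(current, "category", None),
                is_carrier=isinstance(current, Carrier),
            )
        )
        current = current.__cause__
    # Derived category: the first (outermost) non-carrier link that has one.
    for link in chain:
        if not link.is_carrier and link.category is not None:
            return Classified(tuple(chain), link.category, link.message)
    return Classified(tuple(chain), None, None)


def demo() -> Classified:
    # A categorized failure wrapped in a carrier, as at a wrapping placement.
    try:
        try:
            raise CategorizedError("upstream throttled", category="rate_limit")
        except CategorizedError as inner:
            raise Carrier("fan-out instance failed") from inner
    except Carrier as surface:
        return classify(surface)
```

In this toy, a check on the surface exception alone would see only `Carrier`, while `demo()` resolves the category from the wrapped cause. An uncategorized surface link over a categorized cause resolves to the deeper category in the same way.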

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ Specification = "https://github.com/LunarCommand/openarmature-spec"
 openarmature = "openarmature.cli:main"

 [tool.openarmature]
-spec_version = "0.64.0"
+spec_version = "0.65.0"

 [dependency-groups]
 dev = [

src/openarmature/AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # OpenArmature — Agent documentation

-*This is the agent guide bundled with the openarmature Python package, version 0.14.0 (spec v0.64.0). For the full docs site see [openarmature.ai](https://openarmature.ai). For the canonical spec text see [openarmature.org/capabilities](https://openarmature.org/capabilities/). For project-specific conventions for the code you're editing, see the host project's `AGENTS.md` or `CLAUDE.md`.*
+*This is the agent guide bundled with the openarmature Python package, version 0.14.0 (spec v0.65.0). For the full docs site see [openarmature.ai](https://openarmature.ai). For the canonical spec text see [openarmature.org/capabilities](https://openarmature.org/capabilities/). For project-specific conventions for the code you're editing, see the host project's `AGENTS.md` or `CLAUDE.md`.*

 ## TL;DR

@@ -10,7 +10,7 @@ OpenArmature is a workflow framework for LLM pipelines and tool-calling agents:

 ## Capability contracts

-_Sourced from openarmature-spec v0.64.0. Each entry below reproduces §1 (Purpose) and §2 (Concepts) of the capability's `spec.md` verbatim — including additions from accepted proposals that this Python implementation may not yet ship. For per-proposal implementation status (implemented / partial / textual-only / not-yet), see the `conformance.toml` manifest at the repo root. For the full spec text (execution model, error semantics, determinism, observer hooks, etc.) see the linked docs site._
+_Sourced from openarmature-spec v0.65.0. Each entry below reproduces §1 (Purpose) and §2 (Concepts) of the capability's `spec.md` verbatim — including additions from accepted proposals that this Python implementation may not yet ship. For per-proposal implementation status (implemented / partial / textual-only / not-yet), see the `conformance.toml` manifest at the repo root. For the full spec text (execution model, error semantics, determinism, observer hooks, etc.) see the linked docs site._

 ### Capability: `graph-engine`
1616

src/openarmature/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
 """

 __version__ = "0.14.0"
-__spec_version__ = "0.64.0"
+__spec_version__ = "0.65.0"
 # Proposal 0052 (spec observability §5.1 / §8.4.1): canonical
 # package-registry name for this implementation. Surfaces on every
 # OTel invocation span as ``openarmature.implementation.name`` and on

src/openarmature/graph/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -10,6 +10,7 @@
 """

 from .builder import GraphBuilder
+from .cause_chain import CaughtException, CauseLink, classify_cause_chain
 from .compiled import CompiledGraph
 from .edges import END, ConditionalEdge, EndSentinel, StaticEdge
 from .errors import (
@@ -37,8 +38,6 @@
     UnreachableNode,
 )
 from .events import (
-    CaughtException,
-    CauseLink,
     FailureIsolatedEvent,
     InvocationCompletedEvent,
     InvocationStartedEvent,
@@ -135,6 +134,7 @@
     "TimingRecord",
     "UnreachableNode",
     "append",
+    "classify_cause_chain",
     "concat_flatten",
     "default_classifier",
     "deterministic_backoff",
