Skip to content

Commit 44b38f9

Browse files
Surface failure-isolation cause in fan-out example (#160)
The fan-out-with-retry example already runs FailureIsolation wrapping Retry at a fan-out instance, but it only showed the degraded state, never the FailureIsolatedEvent. Add a failure_isolation_observer that captures the events and, in degrade mode, print each one's event_name, caught_exception.category, and attempt_index. The category is the point: at an instance placement it resolves to the originating cause (provider_unavailable) rather than the node_exception the engine wrapped it in, so the demo now shows the failure telemetry naming what actually failed. Update the example's docstring and doc page to describe the new block, and add a MODE=degrade line to the Run with section.
1 parent 0cc6233 commit 44b38f9

2 files changed

Lines changed: 62 additions & 1 deletion

File tree

docs/examples/fan-out-with-retry.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,6 +56,12 @@ intact.
5656
recording the resolved `item_count` / `concurrency` /
5757
`error_policy` at runtime. Inner-instance events carry
5858
`fan_out_index` but not the config.
59+
- In `degrade` mode, a `failure_isolation_observer` captures each
60+
`FailureIsolatedEvent` and the demo prints its
61+
`caught_exception.category`. At a fan-out instance placement the
62+
category resolves to the originating cause (`provider_unavailable`)
63+
rather than the masking `node_exception`, so the telemetry names what
64+
actually failed.
5965

6066
## Composing with checkpointing
6167

@@ -182,3 +188,16 @@ exhausted-retry failure and returned the degraded partial, so the
182188
fan-out recorded the instance as a (degraded) success. The
183189
per-instance timings still show the sentinel's failed attempts, so
184190
you can see the retries happened before the instance was degraded.
191+
192+
The degrade run also prints a `Failure-isolation events` block from the
193+
`failure_isolation_observer`:
194+
195+
```
196+
Failure-isolation events (1):
197+
event='headline_degraded' cause=provider_unavailable attempt_index=2
198+
```
199+
200+
`cause` is the resolved originating category, `provider_unavailable`,
201+
not the masking `node_exception` the engine wraps the failure in before
202+
isolation catches it. `attempt_index` is the final, exhausting attempt
203+
of the three the retry middleware made.

examples/fan-out-with-retry/main.py

Lines changed: 43 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,11 @@
5454
the fan-out node's own started / completed pair and gives observers
5555
a record of the resolved item_count, concurrency, and error_policy
5656
at dispatch time.
57+
- In ``degrade`` mode a ``failure_isolation_observer`` captures each
58+
``FailureIsolatedEvent``; the demo prints its ``event_name``, the
59+
resolved ``caught_exception.category`` (the originating cause, e.g.
60+
``provider_unavailable``, not the masking ``node_exception``), and the
61+
exhausting ``attempt_index``.
5762
5863
**Configuration** (env vars; OpenAI defaults shown):
5964
@@ -68,6 +73,10 @@
6873
uv sync --group examples
6974
cd examples/fan-out-with-retry
7075
LLM_API_KEY=sk-... uv run python main.py
76+
77+
# exercise the degrade failure-path: prepends a synthetic failing
78+
# headline and prints the Failure-isolation events block
79+
MODE=degrade LLM_API_KEY=sk-... uv run python main.py
7180
"""
7281

7382
from __future__ import annotations
@@ -83,6 +92,7 @@
8392
from openarmature.graph import (
8493
END,
8594
CompiledGraph,
95+
FailureIsolatedEvent,
8696
GraphBuilder,
8797
NodeEvent,
8898
ObserverEvent,
@@ -238,6 +248,28 @@ async def _record_timing(record: TimingRecord) -> None:
238248
_timings.append(record)
239249

240250

251+
# Captured failure-isolation events, populated by the observer below.
252+
# Only fires in ``degrade`` mode, where FailureIsolationMiddleware catches
253+
# an exhausted instance and emits one FailureIsolatedEvent per degraded
254+
# instance.
255+
_isolated: list[FailureIsolatedEvent] = []
256+
257+
258+
async def failure_isolation_observer(event: ObserverEvent) -> None:
259+
"""Capture each FailureIsolatedEvent so the demo can surface the
260+
resolved failure cause.
261+
262+
When FailureIsolation wraps Retry at a fan-out instance, the engine
263+
has already wrapped the originating error as a node_exception carrier
264+
by the time isolation catches it. ``caught_exception.category``
265+
resolves through that carrier to the originating cause, so a degraded
266+
instance reports ``provider_unavailable`` (what actually failed)
267+
rather than the masking ``node_exception``.
268+
"""
269+
if isinstance(event, FailureIsolatedEvent):
270+
_isolated.append(event)
271+
272+
241273
# ---------------------------------------------------------------------------
242274
# Outer graph
243275
# ---------------------------------------------------------------------------
@@ -371,8 +403,9 @@ async def fan_out_config_observer(event: ObserverEvent) -> None:
371403

372404
async def main() -> None:
373405
# Reset module-level capture so a REPL or repeated-main() driver
374-
# doesn't accumulate timings across invocations.
406+
# doesn't accumulate timings / isolation events across invocations.
375407
_timings.clear()
408+
_isolated.clear()
376409

377410
# MODE selects the per-instance failure posture: fail_fast (default,
378411
# abort on the first exhausted-retry failure), collect (record
@@ -382,6 +415,7 @@ async def main() -> None:
382415
mode = os.environ.get("MODE", "fail_fast")
383416
graph = build_graph(mode=mode)
384417
graph.attach_observer(fan_out_config_observer)
418+
graph.attach_observer(failure_isolation_observer)
385419

386420
# collect and degrade both need a failure to demonstrate, so prepend
387421
# a deliberately-failing headline that summarize() always raises on.
@@ -430,6 +464,14 @@ async def main() -> None:
430464
for err in final.instance_errors:
431465
print(f" {err}")
432466
print()
467+
if _isolated:
468+
print(f"Failure-isolation events ({len(_isolated)}):")
469+
for ev in _isolated:
470+
print(
471+
f" event={ev.event_name!r} cause={ev.caught_exception.category} "
472+
f"attempt_index={ev.attempt_index}"
473+
)
474+
print()
433475
print("Per-instance timings (in completion order):")
434476
for nth, record in enumerate(_timings):
435477
print(f" #{nth} {record.duration_ms:7.1f} ms outcome={record.outcome}")

0 commit comments

Comments
 (0)