Skip to content

Commit 7a3e0b2

Browse files
Reset metadata between retry attempts per spec 3.4 (#130)
* Reset metadata between retry attempts per spec 3.4 Spec observability 3.4 requires per-attempt scoping under retry middleware: each attempt sees only the metadata in scope at retry-loop entry plus that attempt's own writes; failed-attempt writes are discarded along with the attempt itself. The python RetryMiddleware was only managing the attempt_index ContextVar and leaving the invocation-metadata ContextVar untouched across attempts, so a set_invocation_metadata call inside a failed attempt remained visible on the next attempt. Capture the pre-attempt baseline once at retry-loop entry. Reset the metadata ContextVar to that baseline at the start of each iteration. Discard the failed attempt's writes on both retry-eligible and terminal failure paths. Leave the successful attempt's writes in place so downstream nodes see them. Two new unit tests pin the contract: one mirrors spec fixture 045's case shape (attempt 0 writes a marker and fails, attempt 1 reads and confirms the marker is absent, writes a new marker, and succeeds; downstream node reads the successful attempt's marker), the other exercises the middleware via compose_chain to verify that after a terminal failure the metadata is back at baseline with no leak from the final failed attempt. Closes the v0.12.0 cycle's partial claim on proposal 0048; conformance manifest flips back to implemented and the "Per-attempt retry scoping is partial in v0.12.0" docs callout disappears. * Reset metadata on cancellation path too The original PR-2c except-Exception branch caught transient retryable errors but missed cancellation. CancelledError extends BaseException, not Exception, so a node body that called set_invocation_metadata and then got cancelled would leak its writes upward into any code that caught the cancellation. The except-Exception was deliberate for retry semantics (cancellation MUST propagate, never retry), but it left a metadata cleanup hole. Add a parallel except-BaseException branch that resets the metadata token and re-raises. No retry, no on_retry callback, no sleep on this path — just clean up and propagate. New unit test mirrors the exact scenario: node writes a marker, raises CancelledError, retry middleware lets cancellation propagate (single attempt, no swallow per spec 6.1), and the post-failure metadata view is back at the pre-attempt baseline per spec 3.4. Conformance manifest 0048 pin enumeration extended to name the new test.
1 parent 39cb4f3 commit 7a3e0b2

5 files changed

Lines changed: 217 additions & 36 deletions

File tree

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,10 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
2626
- **Three new patterns docs.** `docs/patterns/state-migration-on-resume.md`, `docs/patterns/caller-supplied-trace-identifiers.md`, and `docs/patterns/observer-state-reconciliation.md` graduate the corresponding entries from `docs/agent/non-obvious-shapes.md` into full pattern recipes with code snippets and "when this is right / when it isn't" guidance. The programmatic patterns API (`openarmature.patterns.list()` / `get(name)`) grows from 4 to 7 entries.
2727
- **HyperDX OTel integration test path and "Production swap" docs in the observer-hooks example.** `examples/observer-hooks/main.py`'s module docstring grows a "Production swap" section showing how to substitute the demo's `SimpleSpanProcessor` + `ConsoleSpanExporter` for `BatchSpanProcessor` + `OTLPSpanExporter` pointed at HyperDX (or any other OTLP-HTTP collector). A new opt-in integration test (`tests/integration/test_otel_hyperdx_export.py`, gated by `HYPERDX_API_KEY` + `HYPERDX_OTLP_ENDPOINT` env vars and `@pytest.mark.integration`) drives the same production export path end-to-end against a live endpoint. `opentelemetry-exporter-otlp-proto-http` lands as a dev-only dep; not promoted to a public extras group yet.
2828

29+
### Fixed
30+
31+
- **`RetryMiddleware` now enforces per-attempt invocation-metadata scoping** (proposal 0048 / spec observability §3.4). Each retry attempt sees only the metadata in scope at retry-loop entry plus that attempt's own writes; writes from a prior attempt that subsequently failed are discarded. Prior to this fix the retry middleware only managed the `attempt_index` ContextVar and left the invocation-metadata ContextVar unchanged across attempts, so a `set_invocation_metadata(...)` call inside a failed attempt remained visible on the retry. The fix captures the pre-attempt baseline once at retry-loop entry, resets the metadata ContextVar to that baseline on each iteration, discards the failed attempt's writes on retry-eligible and terminal failure paths, and leaves the successful attempt's writes in place so downstream nodes see them. Closes the v0.12.0 cycle's `partial` claim on proposal 0048; manifest entry flips back to `implemented`.
32+
2933
### Changed (breaking)
3034

3135
- **`OpenAIProvider.ready()` default probe flipped to `chat_completions`.** A new constructor kwarg `readiness_probe: Literal["models", "chat_completions", "both"]` selects which wire path `ready()` exercises; the default is now the chat-completions path (`POST /v1/chat/completions` with `max_tokens=1`), which actually exercises the inference path. The previous catalog-only behavior is still available as `readiness_probe="models"`, and `readiness_probe="both"` runs catalog then chat for the strongest signal. Motivation: OpenAI-compatible proxies (Bifrost and similar) can return 200 on `GET /v1/models` while rejecting `POST /v1/chat/completions`, leaving the catalog probe green while every real call fails. The new default surfaces that class of failure at preflight rather than at first inference. Non-200 chat-probe responses route through `classify_http_error`, so the canonical error categories (`provider_authentication`, `provider_unavailable`, `provider_invalid_model`, etc.) surface consistently. Callers that depended on the catalog-only behavior (cost-sensitive cloud setups where every `ready()` would now bill prompt tokens) can opt back in by passing `readiness_probe="models"`.

conformance.toml

Lines changed: 14 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -251,11 +251,14 @@ status = "not-yet"
251251
# Spec v0.40.0 (proposal 0048). Read-symmetric invocation metadata.
252252
# Adds ``get_invocation_metadata()`` symmetric to the existing
253253
# ``set_invocation_metadata()`` write API. The python implementation
254-
# satisfies most of the §3.4 read contract via the pre-existing
254+
# satisfies the §3.4 read contract via the pre-existing
255255
# ``current_invocation_metadata`` machinery: returns ``MappingProxyType``
256256
# snapshot of the current async context's view, silent no-op outside
257257
# an invocation, no observer emission, per-async-context scoping under
258-
# fan-out and parallel-branches (inherited from 0034/0040/0045).
258+
# fan-out and parallel-branches (inherited from 0034/0040/0045),
259+
# per-attempt scoping under retry (engine-side reset of
260+
# ``_invocation_metadata_var`` around each ``RetryMiddleware`` iteration,
261+
# landed in the same release cycle as 0048).
259262
# ``get_invocation_metadata`` lands as the canonical spec-idiomatic
260263
# public name; ``current_invocation_metadata`` stays as a stable
261264
# alias (Option A per the proposal-0048-implementation coord thread).
@@ -265,24 +268,15 @@ status = "not-yet"
265268
# lifecycle. §9 is convention-only per spec — no new abstract
266269
# surface on ``Observer``.
267270
#
268-
# Open gap: §3.4 *per-attempt scoping under retry middleware* is NOT
269-
# enforced in v0.12.0 — the retry middleware
270-
# (``src/openarmature/graph/middleware/retry.py``) manages the
271-
# ``attempt_index`` ContextVar but does NOT reset the
272-
# invocation-metadata ContextVar between attempts. A write from a
273-
# failed attempt remains visible on the next attempt. Closing this
274-
# gap is a follow-on PR (engine-side reset of ``_invocation_metadata_var``
275-
# around each retry iteration + a pinning unit test against spec
276-
# fixture 045's contract).
277-
#
278271
# Conformance fixtures 043-049 introduce new directive shapes
279272
# (augment_metadata / capture_invocation_metadata_into /
280273
# capture_queryable_observer_read_into / per_attempt_behavior +
281274
# queryable_observers / inner_subgraphs / caller_metadata /
282275
# direct_call / sequential_invocations top-level keys) that the
283276
# cross-capability parser doesn't model; fixture-shape activation is
284-
# queued for a future PR. The shipped portion of the contract is
285-
# pinned by explicit unit tests:
277+
# queued for a future PR slotted after the upcoming spec
278+
# conformance-adapter capability ratifies the directive vocabulary.
279+
# The shipped contract is pinned by explicit unit tests:
286280
# - alias identity + roundtrip + immutable-mapping return:
287281
# ``tests/unit/test_observability_metadata.py::
288282
# test_get_invocation_metadata_is_same_callable_as_current``,
@@ -291,11 +285,14 @@ status = "not-yet"
291285
# - mid-invocation augmentation visible to subsequent reads:
292286
# ``::test_mid_invocation_augmentation_persists_to_next_node``;
293287
# - outside-invocation empty:
294-
# ``::test_current_invocation_metadata_empty_outside_invocation``.
288+
# ``::test_current_invocation_metadata_empty_outside_invocation``;
289+
# - per-attempt scoping under retry (mirrors spec fixture 045):
290+
# ``::test_per_attempt_scoping_under_retry_discards_failed_attempt_writes``,
291+
# ``::test_terminal_failure_discards_final_failed_attempt_writes``,
292+
# ``::test_cancellation_discards_in_flight_attempt_writes``.
295293
[proposals."0048"]
296-
status = "partial"
294+
status = "implemented"
297295
since = "0.12.0"
298-
note = "Read API + §9 queryable observer pattern docs ship in v0.12.0. Spec §3.4 per-attempt scoping under retry middleware is NOT yet enforced — the python retry middleware does not reset the invocation-metadata ContextVar between attempts, so a failed attempt's writes remain visible on retry. Per-async-context scoping under fan-out and parallel-branches is fully implemented. Closing the retry-side gap is tracked for a follow-on PR."
299296

300297
# Spec v0.41.0 (proposal 0049). Typed LLM Completion Event — first
301298
# typed event variant on the observer event union. Queued for

docs/concepts/observability.md

Lines changed: 4 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -391,22 +391,10 @@ that context, per normal `ContextVar` semantics.
391391
`openarmature.observability.get_invocation_metadata()` returns an
392392
immutable `MappingProxyType` snapshot of the entries visible in the
393393
current async context's view, or an empty mapping outside any active
394-
invocation. Reads do NOT emit a metadata-augmentation event; the
395-
augmentation event signals mutations to backends, not consumer
396-
reads.
397-
398-
!!! info "Per-attempt retry scoping is partial in v0.12.0"
399-
Spec §3.4 specifies that values written during a failed retry
400-
attempt MUST NOT carry over to subsequent attempts. The v0.12.0
401-
python retry middleware manages the `attempt_index` ContextVar
402-
but does NOT reset the invocation-metadata ContextVar between
403-
attempts; a write from a failed attempt remains visible on the
404-
next attempt. Closing this gap is tracked for a follow-on PR
405-
(engine-side reset of the metadata ContextVar around each retry
406-
iteration, with the per-attempt scoping unit test pinned by
407-
spec fixture 045). Per-async-context scoping under fan-out and
408-
parallel-branches is fully implemented; only the retry-side
409-
reset is the open gap.
394+
invocation. The read is per-attempt scoped under retry middleware:
395+
values written in a prior failed attempt are not visible. Reads do
396+
NOT emit a metadata-augmentation event; the augmentation event
397+
signals mutations to backends, not consumer reads.
410398

411399
The existing `current_invocation_metadata()` is a stable alias
412400
pointing at the same function; both names live in `__all__`. Pick

src/openarmature/graph/middleware/retry.py

Lines changed: 46 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,11 @@
2424

2525
from openarmature.llm.errors import TRANSIENT_CATEGORIES
2626
from openarmature.observability.correlation import _reset_attempt_index, _set_attempt_index
27+
from openarmature.observability.metadata import (
28+
_invocation_metadata_var,
29+
_reset_invocation_metadata,
30+
_set_invocation_metadata,
31+
)
2732

2833
from ._core import NextCall
2934

@@ -128,6 +133,14 @@ def __init__(
128133

129134
async def __call__(self, state: Any, next_: NextCall) -> Mapping[str, Any]:
130135
attempt = 0
136+
# Spec observability §3.4 per-attempt scoping: each retry
137+
# attempt sees only the metadata in scope at retry-loop entry
138+
# ("pre-attempt baseline") plus that attempt's own writes;
139+
# writes from a prior attempt that subsequently failed do NOT
140+
# carry over. Captured once outside the loop because the
141+
# baseline is the metadata view at retry-middleware entry, not
142+
# at each iteration.
143+
pre_attempt_baseline = _invocation_metadata_var.get()
131144
while True:
132145
# Spec graph-engine §6 (clarified in v0.16.1): the wrapping
133146
# retry's attempt counter MUST propagate to events emitted
@@ -137,22 +150,52 @@ async def __call__(self, state: Any, next_: NextCall) -> Mapping[str, Any]:
137150
# reset on exit; Python's ContextVar token stack gives
138151
# innermost-wins precedence for free when retry middlewares
139152
# nest.
140-
token = _set_attempt_index(attempt)
153+
attempt_token = _set_attempt_index(attempt)
154+
# Reset the metadata ContextVar to the pre-attempt baseline.
155+
# The token captures the var's state at the moment of the
156+
# set call — on failure we reset against this token to
157+
# discard any writes the attempt's node body issued.
158+
metadata_token = _set_invocation_metadata(pre_attempt_baseline)
141159
try:
142160
try:
143-
return await next_(state)
161+
result = await next_(state)
162+
# Success path: keep the successful attempt's
163+
# metadata writes in scope so downstream nodes see
164+
# them. Do NOT reset metadata_token here — the
165+
# engine's outer reset (around the whole invoke)
166+
# pops the stack at invocation exit.
167+
return result
144168
except Exception as exc:
145169
# Spec §6.1: cancellation propagates by virtue of
146170
# `CancelledError` extending `BaseException`, not
147171
# `Exception` — it never enters this branch in Python.
172+
# Failure path (retry-eligible OR terminal):
173+
# discard the failed attempt's metadata writes per
174+
# §3.4. Reset BEFORE the re-raise so the caller's
175+
# error-handling path (e.g., observer hooks reading
176+
# metadata for the error span) sees the baseline,
177+
# not the failed attempt's transient state.
178+
_reset_invocation_metadata(metadata_token)
148179
if attempt + 1 >= self.max_attempts or not self.classifier(exc, state):
149180
raise
150181
if self.on_retry is not None:
151182
await self.on_retry(exc, attempt)
152183
await asyncio.sleep(self.backoff(attempt))
153184
attempt += 1
185+
except BaseException:
186+
# Cancellation path. `CancelledError` (or other
187+
# `BaseException`) ends the attempt without retry —
188+
# spec §6.1 cancellation MUST propagate, never get
189+
# swallowed or retried. But spec §3.4 per-attempt
190+
# scoping still applies: cancellation IS a failed
191+
# attempt from the metadata-scoping perspective, so
192+
# its writes must be discarded too. Reset the token,
193+
# then re-raise. NO on_retry, NO sleep — straight
194+
# propagation.
195+
_reset_invocation_metadata(metadata_token)
196+
raise
154197
finally:
155-
_reset_attempt_index(token)
198+
_reset_attempt_index(attempt_token)
156199

157200

158201
__all__ = [

tests/unit/test_observability_metadata.py

Lines changed: 149 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
from __future__ import annotations
1414

1515
import asyncio
16+
from collections.abc import Mapping
1617
from typing import Any
1718

1819
import pytest
@@ -555,3 +556,151 @@ async def _read_after_write(_s: _SimpleState) -> dict[str, Any]:
555556
# Caller baseline + in-node write, both visible to the read.
556557
assert captured == {"tenantId": "T1", "audit_kind": "fraud"}
557558
assert captured_type == [MappingProxyType]
559+
560+
561+
# Spec observability §3.4 *Per-attempt scoping*: under retry
562+
# middleware, each attempt sees only the metadata in scope at
563+
# retry-entry plus that attempt's own writes; failed-attempt
564+
# writes are discarded along with the attempt itself. The pin
565+
# below mirrors the spec's fixture 045 case shape (attempt 0
566+
# writes + fails, attempt 1 asserts marker absent + writes +
567+
# succeeds, downstream reads successful attempt's marker).
568+
# Companion test verifies the same discard discipline on
569+
# terminal failure (all retries exhausted).
570+
571+
572+
class _RetryTransient(Exception):
573+
"""Carries a transient category so the default classifier
574+
treats it as retryable. Matches the ``provider_rate_limit``
575+
category used in ``tests/unit/test_middleware.py``."""
576+
577+
category = "provider_rate_limit"
578+
579+
580+
async def test_per_attempt_scoping_under_retry_discards_failed_attempt_writes() -> None:
581+
from openarmature.graph.middleware import RetryMiddleware
582+
583+
captured_attempt_1_read: dict[str, Any] = {}
584+
captured_downstream_read: dict[str, Any] = {}
585+
attempts: list[int] = []
586+
587+
async def _retried(_s: _SimpleState) -> dict[str, Any]:
588+
attempt_n = len(attempts)
589+
attempts.append(attempt_n)
590+
if attempt_n == 0:
591+
# First attempt: write a marker, then raise transient.
592+
set_invocation_metadata(attempt_marker="first")
593+
raise _RetryTransient()
594+
# Second attempt: read first — assert the failed-attempt's
595+
# marker is NOT visible — then write a new marker and succeed.
596+
captured_attempt_1_read.update(dict(get_invocation_metadata()))
597+
set_invocation_metadata(attempt_marker="second")
598+
return {"counter": 1}
599+
600+
async def _downstream(_s: _SimpleState) -> dict[str, Any]:
601+
captured_downstream_read.update(dict(get_invocation_metadata()))
602+
return {"counter": 2}
603+
604+
graph = (
605+
GraphBuilder(_SimpleState)
606+
.add_node(
607+
"retried",
608+
_retried,
609+
middleware=[RetryMiddleware(max_attempts=2, backoff=lambda _i: 0.0)],
610+
)
611+
.add_node("downstream", _downstream)
612+
.add_edge("retried", "downstream")
613+
.add_edge("downstream", END)
614+
.set_entry("retried")
615+
.compile()
616+
)
617+
await graph.invoke(_SimpleState(), metadata={"tenantId": "T1"})
618+
619+
assert attempts == [0, 1]
620+
# Attempt 1's read: baseline only — attempt 0's transient
621+
# ``attempt_marker=first`` write was discarded on failure.
622+
assert captured_attempt_1_read == {"tenantId": "T1"}
623+
# Downstream node: baseline + the successful attempt's write
624+
# persists past the retry boundary.
625+
assert captured_downstream_read == {"tenantId": "T1", "attempt_marker": "second"}
626+
627+
628+
async def test_terminal_failure_discards_final_failed_attempt_writes() -> None:
629+
# Exercises the middleware directly via ``compose_chain`` so the
630+
# post-retry metadata view is readable in the test scope (the
631+
# engine's outer invoke() reset would otherwise pop the var back
632+
# to empty before control returns to the test, masking the
633+
# middleware's own discard). The contract pinned here is that
634+
# AFTER the retry middleware re-raises a terminal failure, the
635+
# metadata ContextVar is back at the pre-attempt baseline — no
636+
# leak of the final failed attempt's writes.
637+
from openarmature.graph.middleware import RetryMiddleware, compose_chain
638+
from openarmature.observability.metadata import (
639+
_reset_invocation_metadata,
640+
_set_invocation_metadata,
641+
validate_invocation_metadata,
642+
)
643+
644+
attempts: list[int] = []
645+
646+
async def _always_fails(_state: Any) -> Mapping[str, Any]:
647+
attempts.append(len(attempts))
648+
set_invocation_metadata(attempt_marker=f"attempt_{len(attempts) - 1}")
649+
raise _RetryTransient()
650+
651+
retry = RetryMiddleware(max_attempts=2, backoff=lambda _i: 0.0)
652+
chain = compose_chain([retry], _always_fails)
653+
654+
# Establish a baseline outside the middleware so we can read it
655+
# back post-failure. Mirrors how the engine sets the baseline
656+
# at the invoke() boundary.
657+
baseline_token = _set_invocation_metadata(validate_invocation_metadata({"tenantId": "T1"}))
658+
try:
659+
with pytest.raises(_RetryTransient):
660+
await chain(_SimpleState())
661+
# Both attempts ran.
662+
assert attempts == [0, 1]
663+
# Post-failure view: the pre-attempt baseline, with NO
664+
# ``attempt_marker`` leaked from the final failed attempt.
665+
assert dict(get_invocation_metadata()) == {"tenantId": "T1"}
666+
finally:
667+
_reset_invocation_metadata(baseline_token)
668+
669+
670+
async def test_cancellation_discards_in_flight_attempt_writes() -> None:
671+
# Spec §3.4: failed-attempt metadata writes are discarded along
672+
# with the attempt. When ``CancelledError`` (or any other
673+
# ``BaseException``) ends the attempt, the same discard discipline
674+
# applies — cancellation IS a failed attempt from the
675+
# metadata-scoping perspective. Spec §6.1: cancellation MUST
676+
# propagate (no retry, no swallow), so the reset must happen IN
677+
# ADDITION to, not instead of, propagating ``CancelledError``.
678+
from openarmature.graph.middleware import RetryMiddleware, compose_chain
679+
from openarmature.observability.metadata import (
680+
_reset_invocation_metadata,
681+
_set_invocation_metadata,
682+
validate_invocation_metadata,
683+
)
684+
685+
attempts: list[int] = []
686+
687+
async def _writes_then_cancels(_state: Any) -> Mapping[str, Any]:
688+
attempts.append(len(attempts))
689+
set_invocation_metadata(attempt_marker="leaked")
690+
raise asyncio.CancelledError("aborted")
691+
692+
retry = RetryMiddleware(max_attempts=3, backoff=lambda _i: 0.0)
693+
chain = compose_chain([retry], _writes_then_cancels)
694+
695+
baseline_token = _set_invocation_metadata(validate_invocation_metadata({"tenantId": "T1"}))
696+
try:
697+
with pytest.raises(asyncio.CancelledError):
698+
await chain(_SimpleState())
699+
# Cancellation propagated — exactly ONE attempt ran (retry
700+
# MUST NOT swallow ``CancelledError`` per spec §6.1).
701+
assert attempts == [0]
702+
# The cancelled attempt's metadata write was discarded per
703+
# §3.4 — post-failure view is the pre-attempt baseline.
704+
assert dict(get_invocation_metadata()) == {"tenantId": "T1"}
705+
finally:
706+
_reset_invocation_metadata(baseline_token)

0 commit comments

Comments
 (0)