
Commit a3a22c6

Implement proposal 0010: bounded drain timeout (#69)
* Add drain timeout + DrainSummary (proposal 0010)

  CompiledGraph.drain() gains an optional timeout parameter and returns a DrainSummary frozen dataclass (undelivered_count, timeout_reached). The timeout-fired path cancels in-flight delivery workers cleanly so the graph remains usable for subsequent invocations.

  Per-invocation dispatched/delivered counters on _InvocationContext track undelivered events; _active_workers changes from set[Task] to dict[Task, _InvocationContext] so drain() can read each worker's counters at cancellation time.

  Solves the slow-observer-blocks-process-exit footgun for short-lived processes (CLIs, scripts, serverless functions).

* Extend conformance harness for drain fixtures

  Adds the slow-observer directive (sleep_ms_per_event, int form or dict form with first_invocation / subsequent_invocations keys), drain timeout passthrough (invoke.drain.timeout_seconds), DrainSummary assertions (timeout_reached, undelivered_count, undelivered_count_min), invariants block (drain_returned_within_timeout, graph_state_intact_after_timeout, drain_waited_for_all_events), and multi-invocation invocations: array handling for fixture 024's cross-invocation cleanliness contract.

  Per-event observer comparison switches from full equality to a key-subset check so fixtures that omit pre_state / post_state (the drain fixtures) do not fail on incidental keys present in the recorded event.

  Pydantic fixture-parsing models extended for the new directives so fixture parse-tests cover the new shapes.

* Bump spec to v0.19.0; refresh docs and CHANGELOG

  Submodule pin advances from v0.18.1 to v0.19.0 (proposal 0010 drain timeout). The retagged v0.19.0 commit carries the fixture 052 results literal fix backported from v0.18.1, so fixture 052 passes cleanly under the new pin. Runtime spec_version pins in pyproject and __init__ updated to match; smoke test asserts v0.19.0.

  CHANGELOG Unreleased section gains drain timeout + DrainSummary entries under Added; drain() return-type change noted under Changed; cumulative pin-bump summary updated to v0.17.0 -> v0.19.0 across four spec versions absorbed in this cycle.

  docs/concepts/observability.md drain section rewritten to describe the DrainSummary return value and the new Bounded drain (optional timeout) subsection.

* Address PR #69 review: validate timeout + gather all workers

  drain() now validates `timeout` is non-negative (and not NaN) at the API boundary. Negative values previously fell through to asyncio.wait as an immediate cancel; surface as ValueError with a clear message.

  Restructured the post-wait branch to gather all workers (both _done and pending) with return_exceptions=True after cancellation. Previous shape only awaited pending in the timeout-fired branch and skipped the gather entirely on the clean path, so any exception escaping a delivery worker would surface as a "Task exception was never retrieved" warning. Defensive (deliver_loop catches observer exceptions internally), but cheap, and it prevents the silent-failure mode.

  Two new unit tests cover the validation behavior (negative + NaN inputs raise ValueError with the expected message).
1 parent f381fef commit a3a22c6
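
The review follow-up in the commit message says `drain()` rejects negative and NaN timeouts with a single comparison. The sketch below isolates that check as a standalone function so the three outcomes are easy to see. `validate_timeout` is a hypothetical name used only for illustration; in the commit the check is inline in `drain()`.

```python
from typing import Optional


def validate_timeout(timeout: Optional[float]) -> None:
    """Reject negative or NaN timeouts; None means "wait indefinitely".

    Hypothetical standalone version of the inline check in drain().
    """
    # `not (timeout >= 0)` is True for negative numbers and also for NaN,
    # because every ordered comparison involving NaN evaluates to False.
    # Non-numeric input raises TypeError from the comparison itself.
    if timeout is not None and not (timeout >= 0):
        raise ValueError(f"drain timeout must be non-negative, got {timeout!r}")
```

Writing the condition as `timeout < 0` would let NaN through, since `nan < 0` is also `False`; negating `>=` is what covers both cases at once.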

16 files changed

Lines changed: 697 additions & 48 deletions

CHANGELOG.md

Lines changed: 4 additions & 1 deletion
@@ -8,19 +8,22 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 ### Added
 
+- **Bounded drain timeout on `CompiledGraph.drain()`** (proposal 0010, accepted in spec v0.19.0). `drain()` accepts an optional `timeout: float | None = None` parameter (non-negative seconds). When supplied, drain returns no later than the deadline; any observer events still queued or in-flight are reported as undelivered. Workers are cancelled cleanly so the compiled graph remains usable for subsequent invocations — partial delivery state from one drain does NOT leak into the next. Solves the "slow / hung / misbehaving observer blocks process exit" footgun for short-lived processes (CLIs, scripts, serverless functions). Observers SHOULD be cancellation-safe (idempotent writes, `try/finally` cleanup); the spec doesn't mandate it but the docs recommend it.
+- **`DrainSummary` frozen dataclass** at `openarmature.graph.DrainSummary`. Returned from every `drain()` call (with or without `timeout`). Fields: `undelivered_count: int`, `timeout_reached: bool`. The shape is consistent across timed and untimed drains — callers receive the same dataclass whether the timeout was supplied or not. Per the v0.19.0 contract the two declared fields are the spec-mandated minimum; richer diagnostic detail (per-observer counts, sampled event metadata) is reserved for follow-on PRs.
 - **Per-instance fan-out resume contract** (proposal 0009, accepted in spec v0.18.0). The engine now writes a checkpoint record at every `completed` event inside a fan-out instance (in addition to the existing outermost-graph + subgraph-internal + fan-out node completion saves). On resume the engine consults the saved record's `fan_out_progress` field and treats each instance as `completed` (skip, contribution rolls forward), `in_flight` (re-run from subgraph entry), or `not_started` (dispatch normally). The `append` reducer's no-double-merge guarantee holds across resume because `completed` is a one-shot accumulator state.
 - **`FanOutProgress` and `FanOutInstanceProgress` public dataclasses** on `openarmature.checkpoint`. The `CheckpointRecord.fan_out_progress` field is now `tuple[FanOutProgress, ...]` (default empty tuple), with per-instance state, result, and `completed_inner_positions` observability. Was a `None` placeholder under proposal 0008.
 - **`FanOutInternalSaveBatching` config** on `InMemoryCheckpointer`. Backends MAY opt into batching scoped to fan-out instance internal saves to bound the write volume of high-instance-count fan-outs. Outermost-graph, subgraph-internal, and the fan-out node's own completion save remain synchronous regardless. Default off. Buffered-but-unflushed saves are lost on crash by design; on resume, instances whose `completed` state was only buffered revert and re-run. Surfaces a new optional `save_fan_out_internal` / `save_fan_out_in_flight_failure` Checkpointer Protocol seam; backends that don't implement either fall back to the standard `save`.
 - **Patterns docs section** at `docs/patterns/`, sibling to Concepts. Seeded with four recipes drawn from downstream usage and proposal 0008's alternatives section: parameterized entry point, tool-dispatch-as-node, session-as-checkpoint-resume, and bypass-if-output-exists. Patterns are user-level how-to recipes composing existing primitives, not framework contracts; new patterns can be added without spec coordination. Each page follows a problem / approach / snippet / when this is the right pattern / when it isn't / cross-references structure.
 
 ### Changed
 
+- **`CompiledGraph.drain()` return type** changed from `None` to `DrainSummary` (pre-1.0; per proposal 0010 v0.19.0 contract). Callers that ignored the return are unaffected — `await graph.drain()` discards the returned dataclass exactly as before. Callers that explicitly typed the return as `None` will need to update their annotation.
 - **Fan-out resume behavior** flipped from atomic restart (0008's v1 contract) to per-instance resume. A crash mid-fan-out used to re-run the entire fan-out on resume; now only the instances that did not complete-and-record their contribution re-run. The economics matter for large fan-outs of expensive work (LLM calls, long extractions): an 80% complete fan-out crash now restores 80% of its results rather than discarding them.
 - **`SQLiteCheckpointer` schema** picks up a new `fan_out_progress_blob` column (added via `ALTER TABLE` for backward compatibility with pre-0009 databases). Pre-0009 rows back-fill as NULL on load and round-trip as the empty-tuple default. Both `pickle` and `json` serialization modes round-trip the new field.
 
 ### Notes
 
-- **Pinned spec version bumped from v0.17.0 to v0.18.1 over this Unreleased cycle.** Three spec versions absorbed: v0.17.1 (proposal 0019, multi-provider wire-format extension; purely textual reframe of llm-provider §8 as a catalog of wire-format mappings, OpenAI-compatible body nested under §8.1, code references updated to §8.1 / §8.1.1 / §8.1.2 / §8.1.3 / §8.1.5.1 / §8.1.1.1), v0.18.0 (proposal 0009, per-instance fan-out resume; pipeline-utilities §10.3 / §10.7 revised, §10.11 added with per-instance state machine plus composition rules plus configurable batching; the `append` reducer no-double-merge invariant from §10.11.1 is the load-bearing correctness story; see Added / Changed above), and v0.18.1 (fixture-only patch on `release/v0.18.1` correcting an off-by-one literal in fixture 052's expected `results`). All existing conformance fixtures continue to pass.
+- **Pinned spec version bumped from v0.17.0 to v0.19.0 over this Unreleased cycle.** Four spec versions absorbed: v0.17.1 (proposal 0019, multi-provider wire-format extension; purely textual reframe of llm-provider §8 as a catalog of wire-format mappings, OpenAI-compatible body nested under §8.1, code references updated to §8.1 / §8.1.1 / §8.1.2 / §8.1.3 / §8.1.5.1 / §8.1.1.1), v0.18.0 (proposal 0009, per-instance fan-out resume; pipeline-utilities §10.3 / §10.7 revised, §10.11 added with per-instance state machine plus composition rules plus configurable batching; the `append` reducer no-double-merge invariant from §10.11.1 is the load-bearing correctness story; see Added / Changed above), v0.18.1 (fixture-only patch on `release/v0.18.1` correcting an off-by-one literal in fixture 052's expected `results`), and v0.19.0 (proposal 0010, bounded drain timeout; graph-engine §6 amended with the `timeout` parameter and `DrainSummary` return contract; see Added / Changed above). All existing conformance fixtures continue to pass.
 
 ## [0.8.0] — 2026-05-23

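The CHANGELOG entries above describe `DrainSummary` as a frozen dataclass with two fields. As a rough illustration of how a short-lived process might consume it, the sketch below defines a local stand-in with the same two fields and a hypothetical `exit_code_for` helper. Neither the stand-in class body nor the helper is taken from the commit; the real class lives in `openarmature.graph`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainSummary:
    """Local stand-in mirroring the two fields named in the CHANGELOG."""

    undelivered_count: int
    timeout_reached: bool


def exit_code_for(summary: DrainSummary) -> int:
    """Hypothetical helper: map a drain outcome to a CLI exit status."""
    # Treat a fired timeout or any undelivered event as a soft failure.
    if summary.timeout_reached or summary.undelivered_count > 0:
        return 1
    return 0
```

A CLI could call something like `sys.exit(exit_code_for(await compiled.drain(timeout=5.0)))` inside its async entry point; whether a lost observer event should fail the process is a policy choice left to the caller.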
docs/concepts/observability.md

Lines changed: 30 additions & 2 deletions
@@ -221,11 +221,13 @@ finished. For long-running services that's fine. For short-lived
 processes (scripts, serverless, CLIs), events dispatched late in the
 run may not be delivered before the process exits.
 
-`drain()` blocks until every dispatched event has been delivered:
+`drain()` waits until every dispatched event has been delivered and
+returns a `DrainSummary` reporting the outcome:
 
 ```python
 final = await compiled.invoke(initial)
-await compiled.drain()
+summary = await compiled.drain()
+# DrainSummary(undelivered_count=0, timeout_reached=False)
 ```
 
 - Per-graph, not per-invoke. Drain awaits *all* prior invocations'
@@ -239,6 +241,32 @@ await compiled.drain()
 If you forget `drain()` in a CLI, the symptom is an empty trace file
 or missing log entries.
 
+### Bounded drain (optional timeout)
+
+`drain()` accepts an optional `timeout` parameter (non-negative
+seconds) — `await compiled.drain(timeout=5.0)` bounds the wait at five
+seconds. When the deadline fires, in-flight workers are cancelled
+cleanly so the compiled graph stays usable for subsequent invocations
+— partial delivery state from one drain does NOT leak into the next.
+
+The returned `DrainSummary` carries:
+
+- `timeout_reached: bool` — `True` only when the timeout actually
+  fired. A drain that finishes before the deadline reports `False`.
+- `undelivered_count: int` — events dispatched but not fully delivered
+  to every subscribed observer before the deadline. Always `0` when
+  `timeout_reached is False`.
+
+Observers **should** be cancellation-safe (idempotent writes,
+`try/finally` cleanup) so that interruption by drain timeout does not
+leave partial side effects in an inconsistent state.
+
+When to set a timeout: short-lived processes (CLIs, scripts,
+serverless functions) where a misbehaving observer holding drain
+indefinitely would stall process exit. Long-running services that
+control their own lifecycle can leave the timeout off and let drain
+wait for natural completion.
+
 ## Error isolation
 
 An observer that raises:

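The observability.md hunk above recommends that observers be cancellation-safe with `try/finally` cleanup. The following is a minimal sketch of what that could look like, assuming an observer is an async callable that receives one event at a time. The `BufferedObserver` class, its string event type, and the `demo` driver are all hypothetical and are not the library's observer protocol.

```python
import asyncio


class BufferedObserver:
    """Hypothetical observer that flushes its buffer even when cancelled."""

    def __init__(self) -> None:
        self.buffer = []   # events accepted but not yet written out
        self.flushed = []  # stand-in for a durable sink (file, socket, ...)

    async def __call__(self, event: str) -> None:
        self.buffer.append(event)
        try:
            await asyncio.sleep(10)  # stand-in for slow I/O
        finally:
            # Runs on normal completion and on Task.cancel(), so a drain
            # timeout does not strand buffered events.
            self.flushed.extend(self.buffer)
            self.buffer.clear()


async def demo() -> BufferedObserver:
    observer = BufferedObserver()
    task = asyncio.create_task(observer("node_completed"))
    await asyncio.sleep(0)  # let the observer reach its slow await
    task.cancel()           # what a timed-out drain does to its workers
    await asyncio.gather(task, return_exceptions=True)
    return observer
```

Keeping the `finally` body synchronous and short matters here: awaiting inside cleanup after a cancellation can itself be interrupted, so a real observer would likely limit cleanup to non-blocking work.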
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ Repository = "https://github.com/LunarCommand/openarmature-python"
 Specification = "https://github.com/LunarCommand/openarmature-spec"
 
 [tool.openarmature]
-spec_version = "0.18.1"
+spec_version = "0.19.0"
 
 [dependency-groups]
 dev = [

src/openarmature/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 """OpenArmature: workflow framework for LLM pipelines and tool-calling agents."""
 
 __version__ = "0.8.0"
-__spec_version__ = "0.18.1"
+__spec_version__ = "0.19.0"

src/openarmature/graph/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -48,7 +48,7 @@
     exponential_jitter_backoff,
 )
 from .nodes import FunctionNode, Node
-from .observer import Observer, RemoveHandle, SubscribedObserver
+from .observer import DrainSummary, Observer, RemoveHandle, SubscribedObserver
 from .parallel_branches import BranchSpec, ParallelBranchesNode
 from .projection import ExplicitMapping, FieldNameMatching, ProjectionStrategy
 from .reducers import Reducer, append, last_write_wins, merge
@@ -62,6 +62,7 @@
     "ConditionalEdge",
     "ConflictingReducers",
     "DanglingEdge",
+    "DrainSummary",
     "EdgeException",
     "EndSentinel",
     "ExplicitMapping",

src/openarmature/graph/compiled.py

Lines changed: 90 additions & 26 deletions
@@ -93,6 +93,7 @@
 from .nodes import Node
 from .observer import (
     _DRAIN_SENTINEL,
+    DrainSummary,
     Observer,
     RemoveHandle,
     SubscribedObserver,
@@ -523,10 +524,14 @@ class CompiledGraph[StateT: State]:
     # dataclass: the list reference is fixed but its contents change.
     # Parameterized factories so pyright infers the element types.
     _attached_observers: list[SubscribedObserver] = field(default_factory=list[SubscribedObserver])
-    # `set` (not list) so a per-task `add_done_callback(self._active_workers.discard)`
-    # auto-removes completed workers — long-running services that never call
-    # drain() don't accumulate completed Task references indefinitely.
-    _active_workers: set[asyncio.Task[None]] = field(default_factory=set[asyncio.Task[None]])
+    # Per-task `add_done_callback` auto-removes completed workers — long-
+    # running services that never call drain() don't accumulate completed
+    # Task references indefinitely. Values are the per-invocation
+    # `_InvocationContext` so `drain()` can read each worker's
+    # `drain_counters` to compute the undelivered-event count at timeout.
+    _active_workers: dict[asyncio.Task[None], _InvocationContext] = field(
+        default_factory=dict[asyncio.Task[None], _InvocationContext]
+    )
     # Single-element list so the frozen-dataclass binding is stable but
     # the user can swap the registered Checkpointer via
     # ``attach_checkpointer``. ``None`` when no backend is registered.
@@ -680,35 +685,90 @@ async def _migrate_record(
         )
         return migrated, summary
 
-    async def drain(self) -> None:
+    async def drain(self, timeout: float | None = None) -> DrainSummary:
         """Await delivery of every observer event produced by prior
-        invocations of this graph.
+        invocations of this graph, optionally bounded by ``timeout``.
 
         Callers running in short-lived processes (scripts, serverless
-        functions, CLIs) MUST use drain to avoid losing observer
-        events that were dispatched but not yet delivered.
+        functions, CLIs) MUST use drain to avoid losing observer events
+        that were dispatched but not yet delivered.
 
         Only events dispatched before this call are awaited; events
         from invocations started concurrently with drain may or may
         not be included. Subgraph events from active invocations are
         part of the parent invocation's worker and are covered
         automatically.
 
-        **Unbounded by design.** Drain blocks until every queued event has
-        been delivered to every subscribed observer. A slow, hung, or
-        misbehaving observer can therefore hold drain, and the calling
-        process, indefinitely. If you need a bounded wait, wrap the call
-        in `asyncio.wait_for` and accept that events still queued when the
-        deadline elapses will not be delivered::
-
-            await asyncio.wait_for(compiled.drain(), timeout=5.0)
+        ``timeout`` is a non-negative duration in seconds. If omitted
+        or ``None``, drain waits indefinitely — a slow, hung, or
+        misbehaving observer can therefore hold drain (and the calling
+        process) indefinitely. If supplied, drain returns no later
+        than ``timeout`` seconds after the call begins; any observer
+        events still queued or in-flight at that point are considered
+        undelivered. Workers are cancelled via ``Task.cancel()`` so
+        the compiled graph remains usable for subsequent invocations
+        — partial delivery state from one drain does NOT leak into
+        the next invocation.
+
+        Returns a :class:`DrainSummary` with ``undelivered_count`` and
+        ``timeout_reached`` fields. The shape is the same whether or
+        not a timeout was supplied; on the no-timeout / timeout-not-
+        fired path both fields are zero / false.
+
+        Observers SHOULD be written to be cancellation-safe
+        (idempotent writes, try/finally cleanup) so that interruption
+        by drain timeout does not leave partial side effects in an
+        inconsistent state.
+
+        Raises ``ValueError`` if ``timeout`` is negative or NaN.
+        Non-numeric input raises ``TypeError`` from the comparison.
         """
+        # ``not (timeout >= 0)`` is the right check: catches negative
+        # values, catches NaN (all comparisons with NaN return False),
+        # and lets non-numeric input raise ``TypeError`` from the
+        # comparison operator itself. Silently treating a negative
+        # timeout as "immediate cancel" would be a user-hostile failure
+        # mode — the spec contract is non-negative seconds.
+        if timeout is not None and not (timeout >= 0):
+            raise ValueError(f"drain timeout must be non-negative, got {timeout!r}")
         if not self._active_workers:
-            return
-        # Snapshot the set: each worker's done-callback removes itself
-        # from `_active_workers`, so iterating it directly while gather
-        # awaits would mutate during iteration.
-        await asyncio.gather(*list(self._active_workers), return_exceptions=True)
+            return DrainSummary(undelivered_count=0, timeout_reached=False)
+        # Snapshot the dict: each worker's done-callback removes its
+        # entry from `_active_workers`, so iterating directly while
+        # `asyncio.wait` awaits would mutate during iteration.
+        snapshot = dict(self._active_workers)
+        workers = list(snapshot.keys())
+
+        _done, pending = await asyncio.wait(
+            workers,
+            timeout=timeout,
+            return_when=asyncio.ALL_COMPLETED,
+        )
+
+        if pending:
+            undelivered = sum(
+                snapshot[w].drain_counters.dispatched - snapshot[w].drain_counters.delivered for w in pending
+            )
+            timeout_reached = True
+            for w in pending:
+                w.cancel()
+        else:
+            undelivered = 0
+            timeout_reached = False
+
+        # Gather ALL workers (done + pending) so any exception that
+        # escaped a delivery worker surfaces here instead of leaking
+        # as a "Task exception was never retrieved" warning. The
+        # ``return_exceptions=True`` absorbs both the synthetic
+        # ``CancelledError`` from cancelled workers and any genuine
+        # bug-escape from a ``deliver_loop`` that ever raised past
+        # its inner ``warnings.warn`` isolation. Also load-bearing
+        # for the cross-invocation cleanliness contract — done-
+        # callbacks fire on cancellation, so ``_active_workers`` is
+        # empty by the time we return.
+        await asyncio.gather(*workers, return_exceptions=True)
+
+        return DrainSummary(undelivered_count=undelivered, timeout_reached=timeout_reached)
 
     # ------------------------------------------------------------------
     # Public invocation
@@ -893,12 +953,16 @@ async def invoke(
         # "per-invocation is OUTERMOST invoke" wording).
         correlation_token = _set_correlation_id(resolved_correlation_id)
         invocation_token = _set_invocation_id(invocation_id)
-        worker = asyncio.create_task(deliver_loop(queue))
-        self._active_workers.add(worker)
+        worker = asyncio.create_task(deliver_loop(queue, context.drain_counters))
+        self._active_workers[worker] = context
         # Auto-prune: when the worker completes (after the sentinel is
-        # processed), remove it from the active set so long-running
-        # services don't leak Task references between drain() calls.
-        worker.add_done_callback(self._active_workers.discard)
+        # processed, or after cancellation by drain() on timeout), remove
+        # it from the active set so long-running services don't leak Task
+        # references between drain() calls. ``pop(key, None)`` is the
+        # idempotent form — if a concurrent drain() removed the entry
+        # already (it shouldn't with the current design, but the no-arg
+        # form would raise KeyError), this is a safe no-op.
+        worker.add_done_callback(lambda t: self._active_workers.pop(t, None))
         # Per spec §6 cross-ref in proposal 0014: dispatch the
         # ``checkpoint_migrated`` event as soon as the delivery
         # worker is alive but before any node runs, so the OTel

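The compiled.py hunks above combine three pieces: a task-keyed dict pruned by done-callbacks, `asyncio.wait` with a timeout, and a final `gather` over every worker. The sketch below reproduces that shape in isolation so the control flow can be run without the library. It is a simplification under stated assumptions: `bounded_wait` and `demo` are hypothetical names, and each task maps to a fixed integer backlog in place of the `drain_counters.dispatched - drain_counters.delivered` difference used in the commit.

```python
import asyncio


async def bounded_wait(workers, timeout):
    """Wait up to `timeout` seconds for `workers` (task -> backlog count).

    Returns (undelivered, timeout_reached). Hypothetical stand-in for the
    body of drain(); backlog values replace the real per-worker counters.
    """
    snapshot = dict(workers)  # done-callbacks mutate `workers` while we wait
    tasks = list(snapshot)
    if not tasks:
        return 0, False
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    undelivered = sum(snapshot[t] for t in pending)
    for t in pending:
        t.cancel()
    # Gather done and pending alike so no task exception goes unretrieved.
    await asyncio.gather(*tasks, return_exceptions=True)
    return undelivered, bool(pending)


async def demo():
    active = {}
    fast = asyncio.create_task(asyncio.sleep(0))
    slow = asyncio.create_task(asyncio.sleep(30))
    for task, backlog in ((fast, 0), (slow, 2)):
        active[task] = backlog
        # pop(key, None) is a no-op if the entry is already gone.
        task.add_done_callback(lambda t: active.pop(t, None))
    result = await bounded_wait(active, timeout=0.05)
    await asyncio.sleep(0)  # give any remaining done-callbacks a turn
    return result, len(active), slow.cancelled()
```

Running `asyncio.run(demo())` should report the slow worker's backlog as undelivered, an empty `active` dict, and a cancelled slow task, which is the "graph stays usable" property the commit message describes.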