Extend production-observability example with accumulator pattern (#133)

chris-colinsky · web-flow · commit e0c738d282b8 · 2026-06-05T18:38:50.000-07:00
* Extend production-observability with accumulator pattern

Two pre-release polish items for the v0.12.0 cycle.

CHANGELOG: the Unreleased Changed entry for the spec-pin advance
originally said proposal 0052 "lands in a follow-on PR of this
cycle" and 0054 "lands in a follow-on PR". Both have since landed
(PRs 131 + 132). Rewrites the bullet to factually describe the
final state: three proposals (0048, 0052, 0054) ship as fully
implemented, two (0051, 0053) ship as textual-only.

production-observability example: adds an LlmUsageAccumulator
class plus a terminal persist node that demonstrates the queryable
observer + drain_events_for pattern end-to-end. The accumulator
subscribes to the LLM-namespace event stream, accumulates per-
invocation token totals via current_invocation_id() bucket keys,
and exposes convention-only get_bucket / drop methods. The persist
node calls drain_events_for to synchronize on the deliver loop
before reading the bucket so the rollup reflects every LLM call
in the invocation, drops the bucket per the explicit-cleanup
discipline, and prints a cost summary. The graph grows from
respond -> END to respond -> persist -> END. Module-level
singletons (_accumulator + _compiled_graph) keep the persist
node closure-free and follow the existing _provider_instance
precedent.
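Below is a minimal, stdlib-only sketch of the accumulator shape. The event and bucket classes are hypothetical stand-ins. The event carries its invocation id directly, instead of the observer reading `current_invocation_id()`; that is an assumption made to keep the block self-contained. The example's real classes and fields may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UsageEvent:
    """Hypothetical stand-in for an LLM-namespace event carrying token usage."""

    invocation_id: str
    prompt_tokens: int
    completion_tokens: int


@dataclass
class UsageBucket:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    calls: int = 0


class LlmUsageAccumulator:
    """Observer-shaped callable that rolls token usage up per invocation."""

    def __init__(self) -> None:
        self._buckets: dict[str, UsageBucket] = {}

    def __call__(self, event: object) -> None:
        # Receives every event; records only LLM usage events.
        if not isinstance(event, UsageEvent):
            return
        bucket = self._buckets.setdefault(event.invocation_id, UsageBucket())
        bucket.prompt_tokens += event.prompt_tokens
        bucket.completion_tokens += event.completion_tokens
        bucket.calls += 1

    # Convention-only read methods: not part of the observer call surface.
    def get_bucket(self, invocation_id: str) -> UsageBucket | None:
        return self._buckets.get(invocation_id)

    def drop(self, invocation_id: str) -> None:
        self._buckets.pop(invocation_id, None)


acc = LlmUsageAccumulator()
acc(UsageEvent("inv-1", prompt_tokens=42, completion_tokens=38))
acc("node-started")  # not an LLM usage event: ignored
print(acc.get_bucket("inv-1"))  # UsageBucket(prompt_tokens=42, completion_tokens=38, calls=1)
acc.drop("inv-1")  # explicit cleanup after the read
```

Only code that knows it holds this particular accumulator calls `get_bucket` / `drop`. The observer contract itself stays a single callable.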

Walkthrough doc updates the H1, overview, what-it-teaches list,
captured-output sample, and reading-the-output walkthrough to
cover the new pattern.

* Surface invocation-span attribution attrs in OTel formatter

The example's _format_otel_spans listing was missing the root
openarmature.invocation span. Two issues were responsible: one
kept the span from ever landing in the in-memory exporter, and
the other kept it from showing anything useful when it did:

1. The OTel observer's shutdown() was never called, so the root
   invocation span stayed open and never moved into the
   exporter's finished-spans list. Adds otel_observer.shutdown()
   to the finally block after drain(), mirroring the pattern in
   the OTel unit tests.

2. The formatter's curated key set didn't include the
   invocation-level attributes the new span carries
   (openarmature.graph.entry_node, .spec_version,
   openarmature.implementation.name + .version). The formatter
   now picks the right key set based on span name: the
   invocation span surfaces its four invocation-level attrs
   only, inner-node spans surface the per-node + cross-cutting
   user.* + GenAI semconv attrs. Skipping cross-cutting attrs
   on the invocation line avoids repeating data that appears
   three more times below.
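The sketch below selects the curated key set by span name. The key tuple and the prefix matching are assumptions that approximate the sets described above. They are not copied from `_format_otel_spans`.

```python
from __future__ import annotations

# Invocation-level attributes named above (assumed to be the full curated set).
INVOCATION_KEYS = (
    "openarmature.graph.entry_node",
    "openarmature.graph.spec_version",
    "openarmature.implementation.name",
    "openarmature.implementation.version",
)
NODE_KEYS = ("openarmature.node.name",)
# Cross-cutting user.* and GenAI semconv families, matched by prefix.
NODE_PREFIXES = ("openarmature.user.", "gen_ai.")


def select_attrs(span_name: str, attrs: dict[str, object]) -> dict[str, object]:
    """Pick a curated attribute subset based on the span name (sketch)."""
    if span_name == "openarmature.invocation":
        # Root line: invocation-level attrs only; cross-cutting ones are skipped.
        return {k: attrs[k] for k in INVOCATION_KEYS if k in attrs}
    return {
        k: v
        for k, v in attrs.items()
        if k in NODE_KEYS or k.startswith(NODE_PREFIXES)
    }


def format_line(span_name: str, duration_ms: float, attrs: dict[str, object]) -> str:
    shown = ", ".join(f"{k}={v!r}" for k, v in select_attrs(span_name, attrs).items())
    return f"  [{span_name}] {duration_ms:.1f}ms  {shown}"


root_attrs = {
    "openarmature.graph.entry_node": "respond",
    "openarmature.implementation.name": "openarmature-python",
    "openarmature.user.tenantId": "demo-acme",  # cross-cutting: not shown on the root line
}
print(format_line("openarmature.invocation", 1240.0, root_attrs))
```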

Net visible change: the captured-OTel-spans block now opens with
an [openarmature.invocation] line carrying
implementation_name='openarmature-python' +
implementation_version + spec_version + entry_node. Operators
filtering traces by library version in Phoenix / Datadog /
Honeycomb / Tempo / HyperDX read these directly from the root
invocation span.

Walkthrough doc's reading-the-output bullet now distinguishes
the three OTel attribute families (invocation-level 5.1,
cross-cutting 5.6, GenAI semconv) and explains why the
invocation span only closes on observer shutdown().
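The ordering can be modelled with stand-ins. `FakeExporter` and `FakeObserver` are illustrative only, not the OTel observer's implementation. A span reaches the exporter's finished list only once it ends. In this model the root invocation span ends only on `shutdown()`, so `shutdown()` has to follow `drain()` in the `finally` block.

```python
from __future__ import annotations


class FakeExporter:
    """Hypothetical stand-in for an in-memory exporter's finished-spans list."""

    def __init__(self) -> None:
        self.finished: list[str] = []


class FakeObserver:
    """Hypothetical model: inner spans end during delivery, the root on shutdown."""

    def __init__(self, exporter: FakeExporter) -> None:
        self._exporter = exporter
        self._queued = ["respond", "openarmature.llm.complete"]
        self._root_open = True

    def drain(self) -> None:
        # Deliver queued events; each inner span ends and is exported.
        self._exporter.finished.extend(self._queued)
        self._queued.clear()

    def shutdown(self) -> None:
        # Close the still-open root invocation span so it gets exported.
        if self._root_open:
            self._exporter.finished.append("openarmature.invocation")
            self._root_open = False


exporter = FakeExporter()
observer = FakeObserver(exporter)
try:
    pass  # the graph invocation would run here
finally:
    observer.drain()
    root_after_drain = "openarmature.invocation" in exporter.finished
    observer.shutdown()

print(root_after_drain, exporter.finished)
```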

* Tighten accumulator example per PR review

Eight PR review threads, addressing four distinct issues.

state.invocation_id -> current_invocation_id() in the example
module docstring and walkthrough doc. The runnable persist() uses
current_invocation_id() because State has no invocation_id field
by default; the docstring snippets had drifted to the wrong shape.

assert -> RuntimeError in persist(). The three runtime
preconditions (_compiled_graph not None, _accumulator not None,
current_invocation_id() not None) now raise explicit
RuntimeError so the failure mode stays informative under python
-O, which strips asserts and would otherwise let a None slip
through to a confusing error further downstream.

InvocationCompletedEvent backstop cleanup in the accumulator.
persist()'s drop is the fast path; if drain_events_for times out
and the deliver loop later processes late-arriving LLM events,
setdefault() would recreate a bucket that nothing ever cleans up.
Handling InvocationCompletedEvent at the top of __call__ now
drops any leftover bucket when the invocation completes. The drop
is idempotent, so it composes harmlessly with persist()'s drop.
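The backstop and the idempotent drop can be sketched with stand-ins. `LlmUsage` and `InvocationCompleted` are hypothetical substitutes for the real event types, and only a running total is kept. The replay at the bottom walks the drain-timeout scenario: fast-path drop, then a late event, then completion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LlmUsage:  # hypothetical stand-in for an event carrying LlmEventPayload
    invocation_id: str
    total_tokens: int


@dataclass
class InvocationCompleted:  # hypothetical stand-in for InvocationCompletedEvent
    invocation_id: str


class LlmUsageAccumulator:
    def __init__(self) -> None:
        self._totals: dict[str, int] = {}

    def __call__(self, event: object) -> None:
        # Backstop first: completion clears anything a late event recreated.
        if isinstance(event, InvocationCompleted):
            self.drop(event.invocation_id)
            return
        if isinstance(event, LlmUsage):
            self._totals.setdefault(event.invocation_id, 0)
            self._totals[event.invocation_id] += event.total_tokens

    def get_total(self, invocation_id: str) -> int | None:
        return self._totals.get(invocation_id)

    def drop(self, invocation_id: str) -> None:
        # pop() with a default makes repeated drops a no-op.
        self._totals.pop(invocation_id, None)


acc = LlmUsageAccumulator()
acc(LlmUsage("inv-1", 80))
acc.drop("inv-1")                  # fast path: the terminal node reads, then drops
acc(LlmUsage("inv-1", 15))         # late event after a drain timeout recreates the bucket
acc(InvocationCompleted("inv-1"))  # backstop removes it
acc(InvocationCompleted("inv-1"))  # second drop is harmless
print(acc.get_total("inv-1"))      # None
```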

Defensive total_tokens derivation. LlmEventPayload makes all
three usage fields optional; providers that emit prompt +
completion but no total (anything non-OpenAI in practice) would
leave bucket.total_tokens at zero while the sub-fields are
correct. Now derives total from prompt + completion when total
is None on the payload.
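The derivation reduces to a small helper. The function name is hypothetical, and treating a missing sub-field as zero is an assumption. The example's own code may handle that case differently.

```python
from __future__ import annotations


def effective_total(prompt: int | None, completion: int | None, total: int | None) -> int:
    """Prefer the provider's total; otherwise derive it from the sub-fields."""
    if total is not None:
        return total
    # Assumption: a missing sub-field contributes zero.
    return (prompt or 0) + (completion or 0)


print(effective_total(42, 38, None))      # 80
print(effective_total(42, 38, 80))        # 80
print(effective_total(None, None, None))  # 0
```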

build_graph() self-contained per the demo convention. Previously,
persist() depended on _compiled_graph + _accumulator module
globals that only main() populated, so a copy-pasting reader
doing `graph = build_graph(); await graph.invoke(...)` would
hit RuntimeError at persist time. build_graph() now owns the
accumulator construction, the graph attach, and the global
wiring. main() drops the duplicate construction and just
attaches OTel + Langfuse on top.

* Reconcile docstrings + comments with refactor

Four stale-text findings from a second PR review pass — all
caused by the previous review pass changing behavior without
fully sweeping the surrounding documentation.

Module docstring snippet now shows the full call shape:
``await graph.drain_events_for(current_invocation_id(),
timeout=2.0)``, matching the runnable persist() pattern (the
previous snippet stripped the await + timeout for brevity but
under-described the API).
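The drain can be modelled with a stand-in deliver loop. `FakeDeliverLoop` and returning the still-pending count from `drain_events_for` are assumptions for illustration. Neither is OpenArmature's implementation or documented return type. Events are queued ahead of delivery. Awaiting the per-invocation drain before reading means the rollup includes every event emitted so far, and the timeout bounds how long the terminal node waits.

```python
from __future__ import annotations

import asyncio


class FakeDeliverLoop:
    """Hypothetical deliver loop with a per-invocation drain."""

    def __init__(self, observer) -> None:
        self._observer = observer
        self._queue: asyncio.Queue = asyncio.Queue()
        self._pending: dict[str, int] = {}
        self._cond = asyncio.Condition()

    def emit(self, invocation_id: str, event) -> None:
        # Count the event as pending until the observer has seen it.
        self._pending[invocation_id] = self._pending.get(invocation_id, 0) + 1
        self._queue.put_nowait((invocation_id, event))

    async def run(self) -> None:
        while True:
            invocation_id, event = await self._queue.get()
            await asyncio.sleep(0.01)  # simulate slow delivery
            self._observer(invocation_id, event)
            async with self._cond:
                self._pending[invocation_id] -= 1
                self._cond.notify_all()

    async def drain_events_for(self, invocation_id: str, timeout: float) -> int:
        """Wait until this invocation's events are delivered; return how many remain."""

        async def _wait() -> None:
            async with self._cond:
                await self._cond.wait_for(lambda: self._pending.get(invocation_id, 0) == 0)

        try:
            await asyncio.wait_for(_wait(), timeout)
        except asyncio.TimeoutError:
            pass  # caller decides how to report an incomplete drain
        return self._pending.get(invocation_id, 0)


async def demo() -> tuple[int, int]:
    totals: dict[str, int] = {}

    def record(invocation_id: str, tokens: int) -> None:
        totals[invocation_id] = totals.get(invocation_id, 0) + tokens

    loop = FakeDeliverLoop(record)
    worker = asyncio.create_task(loop.run())
    for tokens in (30, 50):
        loop.emit("inv-1", tokens)
    pending = await loop.drain_events_for("inv-1", timeout=2.0)
    worker.cancel()
    return pending, totals.get("inv-1", 0)


print(asyncio.run(demo()))  # (0, 80)
```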

The accumulator's drop() comment block was rewritten to describe
the actual two-step lifecycle: fast-path explicit drop after read
by the terminal node, plus the InvocationCompletedEvent backstop
that the prior pass added. The old comment claimed "does NOT
auto-drop on InvocationCompletedEvent" which directly contradicted
the implementation.

RuntimeError message in persist() now points readers at
build_graph() instead of main() for the initialization pattern —
the prior pass moved the singleton wiring into build_graph but
left the error message pointing at the old call site.

Walkthrough doc's "three observers attached at compile time"
becomes "three observers attached before invoke", which is honest
for both the build_graph-side accumulator attachment and the
main-side OTel + Langfuse attachments. attach_observer happens
after compile() in the OA API regardless of which function calls
it.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,7 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 ### Changed
 
-- **Pinned spec advanced from v0.38.0 to v0.46.0.** Submodule + `[tool.openarmature].spec_version` + `conformance.toml` `spec_pin` advance together. Absorbs eight new proposals (0047-0054) into the conformance manifest. Two of them ship as part of the v0.12.0 cycle as textual-only acknowledgments with no code change required: **proposal 0051** (observability §8.4.1 Langfuse `trace.input` / `trace.output` implementation-surface caveat — documents that vendor SDK round-trip is required to project caller-side trace I/O updates onto the wire; the v0.11.0 (proposal 0043) caller-hook shape already matches the documented behavior) and **proposal 0053** (observability §3.4 shared-parent boundary clarification — tightens the structural-shared-parent classification to predicate the invocation span on whether at least one fan-out or parallel-branches dispatch is on the augmenter's call-stack path; behavior already matches via fixtures 034 + 039). Two more ship as implemented in this cycle: **proposal 0048** (read-symmetric metadata + queryable observer pattern docs — see *Added* below) and **proposal 0052** (implementation attribution attributes — landing in a follow-on PR of this cycle). The remaining proposals are marked `not-yet` in the conformance manifest with roadmap targets: 0047 + 0049 (v0.13.0 LLM provider hardening batch), 0050 (v0.14.0 retry & reliability batch), 0054 (per-invocation observer event drain — bundled into this v0.12.0 cycle alongside 0048 per the §9.4 accumulator-lifecycle pairing; lands in a follow-on PR).
+- **Pinned spec advanced from v0.38.0 to v0.46.0.** Submodule + `[tool.openarmature].spec_version` + `conformance.toml` `spec_pin` advance together. Absorbs eight new proposals (0047-0054) into the conformance manifest. Two ship as textual-only acknowledgments with no code change required: **proposal 0051** (observability §8.4.1 Langfuse `trace.input` / `trace.output` implementation-surface caveat — documents that vendor SDK round-trip is required to project caller-side trace I/O updates onto the wire; the v0.11.0 (proposal 0043) caller-hook shape already matches the documented behavior) and **proposal 0053** (observability §3.4 shared-parent boundary clarification — tightens the structural-shared-parent classification to predicate the invocation span on whether at least one fan-out or parallel-branches dispatch is on the augmenter's call-stack path; behavior already matches via fixtures 034 + 039). Three ship as fully implemented this cycle: **proposal 0048** (read-symmetric metadata + queryable observer pattern docs — see *Added* below), **proposal 0052** (implementation attribution attributes — see *Added* below), and **proposal 0054** (per-invocation observer event drain — see *Added* below; bundled with 0048 as the §9.4 accumulator-lifecycle pair). The remaining proposals are marked `not-yet` in the conformance manifest with roadmap targets: 0047 + 0049 (v0.13.0 LLM provider hardening batch) and 0050 (v0.14.0 retry & reliability batch).
 - **README and docs homepage refreshed around reasons-to-choose.** Replaced the 10-bullet "Why OpenArmature" feature inventory in `README.md` with 5 differentiating reasons (LLM-infused workflows to agents on one engine; crash-safe resume by contract; destination-pluggable observability with OTel + Langfuse, no SaaS lock-in; compile-time topology checks; spec + conformance). The docs homepage (`docs/index.md`) card grid carries the same five plus a sixth card retained from the previous grid for async-first / LLM-agnostic: workflows-to-agents, crash-safe, pluggable observability, bad-graphs-don't-compile, parallelism (fan-out + parallel-branches + nested correctness), async-first.
 - **Docs sweep: stale references and em-dash normalization.** Fixed three definite stale references (`spec_version='0.26.0'` in the Langfuse example output now reads `'0.38.0'`; the dangling `v0.16.1` qualifier dropped from the parallel-branches concept page; `compiled.attach_observer` corrected to `graph.attach_observer` in `non-obvious-shapes.md` for variable-name consistency with the rest of the docs). Swept em dashes out of the user-facing docs (130 instances across 17 files) per the convention set during the patterns expansion. mkdocs strict build clean; no broken intra-docs links.
 - **The checkpointing-and-migration example grows a crash-and-resume drama.** The first invoke of the v1 graph now hits a simulated transient failure inside `size_crew` (raises a `RuntimeError` on its first attempt only). The example catches `NodeException` at the `invoke()` boundary, prints what's saved on disk (`define_objective`'s position is already in `completed_positions`), then re-invokes with `resume_invocation=<id>`. The retried `size_crew` succeeds, `draft_timeline` runs, and the pipeline finishes - dramatizing the synchronous-checkpoint-by-contract reliability claim from the README pitch. The existing v1->v2 migration phase rides on top of the crash-survived checkpoint, so both reliability stories compose in one demo. Walk-through doc rewritten to cover both phases.
diff --git a/docs/examples/production-observability.md b/docs/examples/production-observability.md
@@ -1,4 +1,4 @@
-# Production observability with dual observers and timing middleware
+# Production observability with dual observers, timing middleware, and per-invocation cost rollup
 
 !!! info "Source"
     [https://github.com/LunarCommand/openarmature-python/blob/main/examples/production-observability/main.py](https://github.com/LunarCommand/openarmature-python/blob/main/examples/production-observability/main.py){target="_blank" rel="noopener"}
@@ -7,14 +7,19 @@ A single-turn lunar-mission Q&A endpoint instrumented the way you'd
 ship it: BOTH OTel and Langfuse observers attached to the same
 graph, caller hooks deriving domain-shaped `trace.input` /
 `trace.output` from State, the built-in `TimingMiddleware`
-recording per-node duration, and multi-tenant caller-supplied
-metadata propagating to both observers in one `invoke()` call.
+recording per-node duration, multi-tenant caller-supplied
+metadata propagating to both observers in one `invoke()` call, AND
+a third queryable-accumulator observer that a terminal `persist`
+node reads at request scope after synchronizing on the deliver
+loop with `drain_events_for`.
 
 ## Overview
 
-One node, one LLM call, two production-grade observability
-backends. The pipeline takes a question, calls the LLM, returns the
-answer. The interesting part is the observability wiring:
+Two nodes (`respond` then `persist`), one LLM call, three observers
+attached before invoke. The pipeline takes a question, calls the
+LLM, returns the answer, then synchronizes on the observer queue
+and rolls up token cost. The interesting part is the observability
+wiring:
 
 - `OTelObserver` attached with an `InMemorySpanExporter`
   (production swaps this for `BatchSpanProcessor` +
@@ -77,6 +82,23 @@ sees the same logical events represented two ways.
   `InMemorySpanExporter` records every Span. Production
   deployments swap each for a real exporter / SDK adapter; the
   observer call surface doesn't change.
+- **Queryable accumulator + `drain_events_for`**
+  ([queryable observer pattern](../concepts/observability.md)).
+  A third observer — `LlmUsageAccumulator` — subscribes to the
+  same event stream but only records the LLM-namespace events
+  carrying an `LlmEventPayload`. It accumulates per-invocation
+  token totals in memory, indexed by `current_invocation_id()`.
+  The terminal `persist` node calls
+  `await graph.drain_events_for(current_invocation_id(), timeout=2.0)`
+  to synchronize on the deliver loop, then reads the accumulator's
+  bucket and drops it. Without the drain, the bucket might be
+  missing the most-recent LLM event's tokens (the deliver loop
+  hasn't reached them yet). The `Observer` protocol itself stays
+  a single-callable shape; the accumulator just exposes its own
+  read methods (`get_bucket` / `drop`) that the persist node knows
+  about. This is the canonical shape for per-invocation cost
+  attribution at request scope, replacing the round-trip-through-
+  State workarounds that pre-v0.12.0 deployments used.
 
 ## How to run
 
@@ -105,13 +127,15 @@ request id:  <uuid>
 feature flag:v2-canary
 
 [timing] respond: 1234.5ms (success)
+[persist] LLM usage: prompt=42, completion=38, total=80 across 1 call(s)
 answer:      The primary objective of Apollo 11 was ...
 model:       gpt-4o-mini-2024-07-18
 
 --- captured OTel spans ---
-  [openarmature.invocation] 1240.0ms  openarmature.user.tenantId='demo-acme', ...
+  [openarmature.invocation] 1240.0ms  openarmature.graph.entry_node='respond', openarmature.graph.spec_version='0.46.0', openarmature.implementation.name='openarmature-python', openarmature.implementation.version='0.12.0'
   [respond] 1235.0ms  openarmature.node.name='respond', openarmature.user.tenantId='demo-acme', ...
-  [openarmature.llm.complete] 1200.0ms  gen_ai.system='openai', gen_ai.usage.input_tokens=42, ...
+  [openarmature.llm.complete] 1200.0ms  openarmature.user.tenantId='demo-acme', gen_ai.system='openai', gen_ai.usage.input_tokens=42, ...
+  [persist] 2.0ms  openarmature.node.name='persist', openarmature.user.tenantId='demo-acme', ...
 
 --- captured Langfuse trace ---
 Trace id=<uuid>
@@ -133,12 +157,33 @@ Trace id=<uuid>
   `TimingMiddleware` callback as soon as the respond chain returns.
   `outcome` is `"success"` here; a `ProviderRateLimit` would surface
   as `outcome="exception"` with `exception_category="provider_rate_limit"`.
+- **`[persist] LLM usage: ...`**: emitted by the `persist` node
+  after it drains the deliver loop and reads the
+  `LlmUsageAccumulator`'s bucket for this invocation. If the drain
+  times out (slow / hung observer), the persist line is prefixed by
+  a `[persist] drain incomplete: N events still pending after 2.0s`
+  surface — the production version of that log would also flip an
+  SLO-breach metric.
 - **OTel spans block**: one line per captured span, sorted by
   start time. The relevant attributes shown are a curated subset
   for readability; the full attribute set is on each `Span` object
-  for any reader inspecting them programmatically. Note the
-  `openarmature.user.*` attributes appearing on every span (the
-  cross-cutting attribute propagation from `invoke(metadata=...)`).
+  for any reader inspecting them programmatically. Note three
+  attribute families worth telling apart:
+    - The root `openarmature.invocation` span carries
+      `openarmature.graph.spec_version` plus the
+      `openarmature.implementation.name` / `.version` attribution
+      attributes. These are invocation-span-only (per spec §5.1) —
+      operators filtering by library version use these.
+    - The `openarmature.user.*` attributes appear on every span,
+      reflecting the cross-cutting propagation from
+      `invoke(metadata=...)`.
+    - `gen_ai.usage.*` lands on the LLM span only, sourced from the
+      provider's wire response.
+
+    The invocation span only lands in the exporter after the OTel
+    observer's `shutdown()` is called (closing the root span). The
+    demo calls it after `drain()` in the `finally` block; production
+    long-running processes call it at process exit.
 - **Langfuse trace block**: the same invocation as seen by the
   Langfuse data model. `trace.input` / `trace.output` come from the
   caller hooks (`{"question": ...}` / `{"answer": ..., "model": ...}`)
diff --git a/examples/production-observability/main.py b/examples/production-observability/main.py