Migrate LlmUsageAccumulator to typed event filter (#138)

chris-colinsky · web-flow · commit 5e98f4e35bb5 · 2026-06-07T10:48:56.000-07:00
* Migrate LlmUsageAccumulator to typed event filter

Switch the reference accumulator's filter from the 4-stage
NodeEvent + namespace + phase + payload-narrow check to a single
isinstance(event, LlmCompletionEvent) call. Field access moves
from event.pre_state.X to event.usage.X (reusing the same Usage
shape the sentinel payload mirrored). Per-invocation bucketing
reads event.invocation_id directly off the typed event instead
of going through current_invocation_id(), removing a level of
indirection, which the typed-event surface makes possible.
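
As a rough illustration of the filter change, the sketch below contrasts
the two predicates. It uses stand-in dataclasses rather than the real
openarmature types: the field names follow the diff in this commit, but
the class bodies, the LLM_NAMESPACE placeholder value, and the helper
names legacy_filter / typed_filter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder value; the real constant lives in openarmature.observability.
LLM_NAMESPACE = "llm-namespace-placeholder"


@dataclass
class Usage:
    """Stand-in for the Usage shape (three optional token counts)."""
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


@dataclass
class LlmEventPayload(Usage):
    """Stand-in for the sentinel payload, which mirrors Usage."""


@dataclass
class NodeEvent:
    """Stand-in for the sentinel event; only the filtered fields."""
    namespace: str
    phase: str
    pre_state: object


@dataclass
class LlmCompletionEvent:
    """Stand-in for the typed event; only the fields used here."""
    invocation_id: str
    usage: Optional[Usage]


def legacy_filter(event: object) -> bool:
    """Four-stage sentinel check: type, namespace, phase, payload."""
    return (
        isinstance(event, NodeEvent)
        and event.namespace == LLM_NAMESPACE
        and event.phase == "completed"
        and isinstance(event.pre_state, LlmEventPayload)
    )


def typed_filter(event: object) -> bool:
    """Single check against the typed event variant."""
    return isinstance(event, LlmCompletionEvent)
```

With the typed filter, the token counts are then read from event.usage
and the bucket key from event.invocation_id, so no payload narrowing or
ContextVar lookup is needed.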

Add a defensive guard against a None usage record. The spec types the
field as nullable; the Python provider always passes a Usage instance
today, but the guard keeps the accumulator robust against future
providers that exercise the null option.

Document the success-only semantic of LlmCompletionEvent and its
effect on bucket.call_count: failed LLM calls flow through the
exception path and do not emit the typed event, so call_count now
reflects successful calls only. Production code migrating an
existing accumulator from the sentinel pattern should expect this
counting shift if it was previously counting failure-path events.
A pipeline tracking attempt-level failure rates needs a separate
listener (sentinel NodeEvent pair, or a future failure-event
typed variant if that proposal lands).

Add a disclaimer to the docs walkthrough noting that the captured
OTel span name and gen_ai.usage.* attribute family still come from
the sentinel handler; the OTel and Langfuse observers have not yet
migrated to consuming the typed event. Span names and attribute
paths may shift when the observer migration lands.

Reframe comment and docs language to describe the architectural
state without pinning specific release numbers. Forward-looking
promises and historical version pins both age poorly; the
CHANGELOG is the authoritative reference for cutoff timing.

* Count successful LLM calls without reported usage

Address PR review feedback: the early return on `usage is None` was
tying call_count to "successful AND reported usage" rather than just
"successful." The spec contract treats the two as independent: providers
may legitimately omit usage on a successful call.

Restructure so the bucket creation and call_count increment happen
unconditionally for any LlmCompletionEvent, with the usage-None guard
gating only the token-counting math. Successful calls without reported
usage now count toward call_count and contribute zero tokens (the only
honest value we can record).

A smoke test confirms that a sequence of one event with usage populated
and one event with usage=None produces call_count=2, with token totals
matching the populated event only.
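
The restructured handler and the smoke-test scenario can be sketched
roughly as follows. This is a minimal stand-alone approximation, not the
code in main.py: the dataclasses are stand-ins for the openarmature types,
record and UsageBucket are hypothetical names (the example itself uses
__call__ and _UsageBucket), and the token values are arbitrary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Usage:
    """Stand-in for the Usage shape (three optional token counts)."""
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


@dataclass
class LlmCompletionEvent:
    """Stand-in for the typed event; only the fields used here."""
    invocation_id: str
    usage: Optional[Usage]


@dataclass
class UsageBucket:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    call_count: int = 0


def record(buckets: Dict[str, UsageBucket], event: LlmCompletionEvent) -> None:
    # Every typed event is a successful call, so count it unconditionally.
    bucket = buckets.setdefault(event.invocation_id, UsageBucket())
    bucket.call_count += 1
    # Gate only the token math on usage being reported.
    usage = event.usage
    if usage is None:
        return
    bucket.prompt_tokens += usage.prompt_tokens or 0
    bucket.completion_tokens += usage.completion_tokens or 0
    # Prefer the reported total; otherwise derive it from the parts.
    if usage.total_tokens is not None:
        bucket.total_tokens += usage.total_tokens
    else:
        bucket.total_tokens += (usage.prompt_tokens or 0) + (usage.completion_tokens or 0)


# Smoke-test scenario: one event with usage, one with usage=None.
buckets: Dict[str, UsageBucket] = {}
record(buckets, LlmCompletionEvent("inv-1", Usage(12, 8, 20)))
record(buckets, LlmCompletionEvent("inv-1", None))
result = buckets["inv-1"]
```

Both events land in the same bucket because they share an invocation ID;
the second one bumps call_count but leaves the token totals untouched.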
diff --git a/docs/examples/production-observability.md b/docs/examples/production-observability.md
@@ -85,10 +85,11 @@ sees the same logical events represented two ways.
 - **Queryable accumulator + `drain_events_for`**
   ([queryable observer pattern](../concepts/observability.md)).
   A third observer — `LlmUsageAccumulator` — subscribes to the
-  same event stream but only records the LLM-namespace events
-  carrying an `LlmEventPayload`. It accumulates per-invocation
-  token totals in memory, indexed by `current_invocation_id()`.
-  The terminal `persist` node calls
+  same event stream but only records the typed
+  `LlmCompletionEvent` variant (one event per successful LLM call;
+  outcome fields read directly off the event). It accumulates
+  per-invocation token totals in memory, indexed by
+  `event.invocation_id`. The terminal `persist` node calls
   `await graph.drain_events_for(current_invocation_id(), timeout=2.0)`
   to synchronize on the deliver loop, then reads the accumulator's
   bucket and drops it. Without the drain, the bucket might be
@@ -97,8 +98,32 @@ sees the same logical events represented two ways.
   a single-callable shape; the accumulator just exposes its own
   read methods (`get_bucket` / `drop`) that the persist node knows
   about. This is the canonical shape for per-invocation cost
-  attribution at request scope, replacing the round-trip-through-
-  State workarounds that pre-v0.12.0 deployments used.
+  attribution at request scope, in place of routing every token
+  count through State (a workaround that pollutes the state
+  schema with non-pipeline data).
+
+  The filter shape is `isinstance(event, LlmCompletionEvent)` —
+  one isinstance check against the typed event variant on the
+  observer event union. The provider also dual-emits a sentinel
+  `NodeEvent` pair during the transition period for backwards
+  compatibility with older accumulators; this example's
+  accumulator ignores the sentinel pair because the typed event
+  carries the same outcome data without the pair-join logic. New
+  accumulators should follow the isinstance-based filter shape
+  here; the CHANGELOG tracks when the sentinel emission is
+  removed.
+
+  `LlmCompletionEvent` is success-only by spec design. Failed LLM
+  calls flow through the exception path and do not emit the typed
+  event, so `bucket.call_count` reflects successful calls only.
+  This is the right semantic for a usage accumulator (failed
+  calls produce no tokens). A pipeline tracking attempt-level
+  failure rates needs a separate listener — either a custom
+  observer on the sentinel `NodeEvent` pair or a future
+  failure-event typed variant if and when that proposal lands.
+  Production code migrating an existing accumulator from the
+  sentinel pattern should expect this counting shift if it was
+  previously counting failure-path events.
 
 ## How to run
 
@@ -167,8 +192,15 @@ Trace id=<uuid>
 - **OTel spans block**: one line per captured span, sorted by
   start time. The relevant attributes shown are a curated subset
   for readability; the full attribute set is on each `Span` object
-  for any reader inspecting them programmatically. Note three
-  attribute families worth telling apart:
+  for any reader inspecting them programmatically. The
+  `openarmature.llm.complete` span name + the `gen_ai.usage.*`
+  attribute family come from the OTel observer's current
+  sentinel-`NodeEvent` handler — the OTel and Langfuse observers
+  have not yet migrated to consuming the typed `LlmCompletionEvent`
+  variant. Span names and attribute paths may shift when the
+  observer migration lands; the example's emitted span structure
+  tracks the current observer behavior. Note three attribute
+  families worth telling apart:
     - The root `openarmature.invocation` span carries
       `openarmature.graph.spec_version` plus the
       `openarmature.implementation.name` / `.version` attribution
diff --git a/examples/production-observability/main.py b/examples/production-observability/main.py
@@ -49,9 +49,10 @@
   bucket and drops it. Without the drain, the bucket would be
   missing the most recent LLM event's tokens (the deliver loop
   hasn't reached them yet). This is the canonical shape for
-  per-invocation cost attribution at request scope, replacing the
-  round-trip-through-State workarounds that pre-v0.12.0 deployments
-  used. The pattern is convention-only at the observer level:
+  per-invocation cost attribution at request scope, in place of
+  routing every token count through State (a workaround pattern
+  that pollutes the state schema with non-pipeline data). The
+  pattern is convention-only at the observer level:
   ``Observer`` itself stays a single-callable protocol; the
   queryable accumulator just exposes its own read methods
   (``get_bucket`` / ``drop``) that the persist node knows about.
@@ -99,7 +100,7 @@
     CompiledGraph,
     GraphBuilder,
     InvocationCompletedEvent,
-    NodeEvent,
+    LlmCompletionEvent,
     NodeException,
     ObserverEvent,
     State,
@@ -112,7 +113,6 @@
     SystemMessage,
     UserMessage,
 )
-from openarmature.observability import LLM_NAMESPACE, LlmEventPayload
 from openarmature.observability.correlation import current_invocation_id
 from openarmature.observability.langfuse import (
     InMemoryLangfuseClient,
@@ -163,13 +163,31 @@ class BriefingState(State):
 # consume.  Convention only; openarmature does not ship a base class
 # for accumulators.
 #
-# The accumulator subscribes to every event but only records the LLM-
-# namespace ones (provider-emitted ``openarmature.llm.complete`` event
-# pair carrying an LlmEventPayload on ``pre_state``).  Per-invocation
-# isolation is by ``current_invocation_id()`` — read inside the
-# observer callback from the worker's Context, populated by the
-# engine at worker create time. Concurrent invocations on one
-# observer each get their own bucket.
+# The accumulator subscribes to every event but only records the
+# typed ``LlmCompletionEvent`` variant — one event per successful LLM
+# call, structured outcome fields read directly off the event without
+# the namespace-string-match + payload-narrow dance the legacy
+# sentinel pattern needed. The provider also dual-emits a sentinel
+# ``NodeEvent`` pair during the transition period for backwards
+# compatibility with older accumulators; this accumulator ignores
+# the sentinel pair because the typed event carries the same outcome
+# data without the pair-join logic. New accumulators should follow
+# the isinstance-based filter shape here; the CHANGELOG tracks when
+# the sentinel emission is removed.
+#
+# Per-invocation isolation is by ``LlmCompletionEvent.invocation_id``
+# — read directly off the event, no ContextVar lookup needed.
+# Concurrent invocations on one observer each get their own bucket.
+#
+# ``LlmCompletionEvent`` is success-only by spec design. Failed LLM
+# calls flow through the exception path and do NOT emit the typed
+# event, so ``bucket.call_count`` here reflects successful calls
+# only. This is the right semantic for a usage accumulator (failed
+# calls produce no tokens / cost). A pipeline tracking attempt-level
+# failure rates needs a separate listener — either a custom observer
+# on the sentinel ``NodeEvent`` pair, or a future
+# ``LlmCallFailedEvent`` typed variant if and when that proposal
+# lands.
 
 
 @dataclass
@@ -199,39 +217,40 @@ async def __call__(self, event: ObserverEvent) -> None:
         if isinstance(event, InvocationCompletedEvent):
             self._by_invocation.pop(event.invocation_id, None)
             return
-        if not isinstance(event, NodeEvent):
-            return
-        if event.namespace != LLM_NAMESPACE:
-            return
-        # Only the completed half of the pair carries the token counts.
-        if event.phase != "completed":
+        if not isinstance(event, LlmCompletionEvent):
             return
-        if not isinstance(event.pre_state, LlmEventPayload):
-            return
-        # NodeEvent doesn't carry invocation_id on the dataclass;
-        # observers read it from the ContextVar, which the
-        # deliver-loop worker's Context carries from the engine task
-        # at worker create-time (per-invocation worker, per-invocation
-        # Context).
-        invocation_id = current_invocation_id()
-        if invocation_id is None:
+        # call_count tracks successful LLM calls (the typed event is
+        # success-only by spec design). Spec contract has "call
+        # happened" and "usage reported" as INDEPENDENT — a provider
+        # may legitimately omit usage on a successful call. Create the
+        # bucket and increment call_count unconditionally so the
+        # counter reflects all successful calls; gate only the
+        # token-counting math on usage being populated.
+        bucket = self._by_invocation.setdefault(event.invocation_id, _UsageBucket())
+        bucket.call_count += 1
+        # The typed event's usage field is nullable per the spec
+        # contract ("may be null when the provider does not report
+        # usage"). Python's provider always passes a Usage instance
+        # (with all-None fields when not reported), but the defensive
+        # guard keeps the accumulator robust against future providers
+        # that exercise the null option. Calls without reported usage
+        # contribute zero tokens (the only honest value we can record).
+        usage = event.usage
+        if usage is None:
             return
-        payload = event.pre_state
-        bucket = self._by_invocation.setdefault(invocation_id, _UsageBucket())
-        if payload.prompt_tokens is not None:
-            bucket.prompt_tokens += payload.prompt_tokens
-        if payload.completion_tokens is not None:
-            bucket.completion_tokens += payload.completion_tokens
+        if usage.prompt_tokens is not None:
+            bucket.prompt_tokens += usage.prompt_tokens
+        if usage.completion_tokens is not None:
+            bucket.completion_tokens += usage.completion_tokens
         # Prefer the provider-reported total when present; otherwise
         # derive from prompt + completion when at least one is known.
-        # A payload with all three None (rare; provider didn't report
-        # usage at all) contributes zero, which is the only honest
-        # value we can record.
-        if payload.total_tokens is not None:
-            bucket.total_tokens += payload.total_tokens
-        elif payload.prompt_tokens is not None or payload.completion_tokens is not None:
-            bucket.total_tokens += (payload.prompt_tokens or 0) + (payload.completion_tokens or 0)
-        bucket.call_count += 1
+        # A usage record with all three None (rare; provider didn't
+        # report counts at all) contributes zero, which is the only
+        # honest value we can record.
+        if usage.total_tokens is not None:
+            bucket.total_tokens += usage.total_tokens
+        elif usage.prompt_tokens is not None or usage.completion_tokens is not None:
+            bucket.total_tokens += (usage.prompt_tokens or 0) + (usage.completion_tokens or 0)
 
     # Consumers MUST synchronize on ``drain_events_for`` before
     # calling ``get_bucket`` if completeness matters — without the