LunarCommand · chris-colinsky · Jun 10, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,22 +8,20 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 ### Added
 
-- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_dict_keys` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
+- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). Closes proposal 0047 end-to-end across three pieces all landing in v0.13.0: (1) `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from the OpenAI `prompt_tokens_details` payload (PR #136); (2) the OTel observer emits `openarmature.llm.cache_read.input_tokens` and optional `openarmature.llm.cache_creation.input_tokens` when the corresponding usage field is populated (PR #140); (3) the OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding) via a new `_canonicalize_dict_keys` helper that recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering, plus a top-level belt-and-suspenders canonicalization pass over the assembled body (PR #145). Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
 - **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.
+- **`LlmCompletionEvent` extended with proposal 0057 request-side fields** (spec v0.51.0). The typed event now carries `input_messages`, `output_content`, `request_params`, `request_extras`, `active_prompt`, `active_prompt_group`, `call_id`, and `response_model` alongside the existing v0.49.0 fields. `request_id` renamed to `response_id` per the proposal's response-side naming. Inline image bytes in `input_messages` stay redacted per observability §5.5.5 — the OpenAI provider reuses the existing message-serialization helper for the projection. Observer-side privacy gates (OTel `disable_llm_payload`, Langfuse equivalents) apply at rendering, symmetric with the §5.5.1 span attribute path.
 
 ### Changed
 
 - **Sentinel-namespace `NodeEvent` emission for LLM events retired entirely from `OpenAIProvider`** (proposal 0058 cleanup). The provider no longer dispatches the `("openarmature.llm.complete",)`-namespaced `NodeEvent`s on either outcome path; both success and failure flow through their respective typed variants exclusively. The `_make_llm_event` helper is removed. External custom observers that filtered LLM calls by `event.namespace == LLM_NAMESPACE` MUST migrate to `isinstance(event, LlmCompletionEvent)` for success and `isinstance(event, LlmFailedEvent)` for failure to keep receiving LLM-call notifications. `LlmEventPayload` and `LLM_NAMESPACE` remain in `openarmature.observability.llm_event` as a documented compatibility surface for custom providers that haven't migrated; neither is referenced by the bundled provider or observers anymore.
-- **Pinned spec advances from v0.51.0 to v0.53.0** (absorbs proposals 0023 + 0058). Proposal 0023 (canonical state reducers) ships in spec v0.52.0 but is not implemented this cycle — `conformance.toml` marks 0023 as `not-yet`; fixtures 034–038 stay parser-deferred.
+- **Pinned spec advances from v0.46.0 to v0.53.0** across the v0.13.0 cycle. Absorbs four implemented proposals (0047 — implicit prefix-cache wire-byte stability; 0049 — typed `LlmCompletionEvent`; 0057 — `LlmCompletionEvent` request-side field-set extension; 0058 — typed `LlmFailedEvent`) plus 0023 (canonical state reducers, v0.52.0) carried as `not-yet` in the manifest. Pin journey: v0.46.0 → v0.51.0 (PR #141 absorbs 0057) → v0.53.0 (PR #144 absorbs 0058; spec v0.52.0's 0023 entry rides along as `not-yet`). Fixtures 034–038 (0023) stay parser-deferred.
+- **`tool_call.arguments` JSON encoding now uses `sort_keys=True`** (proposal 0047 §8 byte-stability requirement for caller-supplied dicts JSON-encoded into a string field). Functionally equivalent — the encoded string parses to the same dict — but byte-different from the previous insertion-order encoding. Downstream consumers that snapshot wire bodies (golden-file tests, audit logging, recorded fixtures) will see byte-different `tool_calls[].function.arguments` strings across this upgrade for any call whose argument dict was emitted in non-sorted insertion order before.
 - **OTel and Langfuse observers drive the `openarmature.llm.complete` span / Generation observation lifecycle from the typed `LlmCompletionEvent`** (proposal 0049 + 0057, observability §5.5.7). Successful LLM-provider calls now open + close the OTel span and the Langfuse Generation in one shot at typed-event arrival, with `start_time` back-dated by `LlmCompletionEvent.latency_ms` so duration reflects the adapter-boundary measurement rather than dispatcher queue delay. The §5.5 attribute set and §8.4 Generation metadata are unchanged. (Failure paths land on `LlmFailedEvent` later in the same cycle — see the proposal 0058 entry above.)
 - **`OpenAIProvider.complete()` no longer emits the sentinel `NodeEvent` pair on the success path** (v0.13.0 cleanup). The bundled OTel and Langfuse observers now consume the typed `LlmCompletionEvent` directly; the sentinel pair was kept on the success path through earlier releases for compatibility with pre-typed-event observers. External custom observers that filtered LLM calls by `event.namespace == LLM_NAMESPACE` MUST migrate to `isinstance(event, LlmCompletionEvent)` to continue seeing successful LLM calls. (The failure-path sentinel emission is retired entirely later in the same cycle — see the proposal 0058 entry above.)
 - **`LangfuseClient` Protocol gains optional `start_time` / `end_time` timestamps** on `generation(...)` and the Generation/Span handles' `end(...)`. The Langfuse observer passes back-dated timestamps on the typed-event success path so the Langfuse UI shows the actual adapter-boundary duration. The SDK adapter handles v4 Langfuse SDK quirks transparently: `Langfuse.start_observation()` does NOT accept `start_time`, so back-dated generations are routed through the private `_otel_tracer.start_span(name=..., start_time=int_ns)` API (mirroring the SDK's own `create_event` precedent) and the resulting OTel span is wrapped in `LangfuseGeneration` directly; the non-back-dated path still uses `start_observation`. `LangfuseSpan.end()` is typed `Optional[int]` (nanoseconds), so the adapter converts the Protocol's `datetime` surface to int nanoseconds before forwarding. The `InMemoryLangfuseClient` stores both fields verbatim on `LangfuseObservation` for test assertions.
 - **`OpenAIProvider(populate_caller_metadata=...)` default flipped from `False` to `True`.** The python implementation now populates `LlmCompletionEvent.caller_invocation_metadata` by default so the bundled OTel and Langfuse observers can emit the §5.6 `openarmature.user.<key>` span-attribute family without a separate opt-in. Pass `populate_caller_metadata=False` to suppress the snapshot when no downstream consumer needs it. The spec-defined opt-in mechanism is unchanged; only the python default flips.
 
-### Added
-
-- **`LlmCompletionEvent` extended with proposal 0057 request-side fields** (spec v0.51.0). The typed event now carries `input_messages`, `output_content`, `request_params`, `request_extras`, `active_prompt`, `active_prompt_group`, `call_id`, and `response_model` alongside the existing v0.49.0 fields. `request_id` renamed to `response_id` per the proposal's response-side naming. Inline image bytes in `input_messages` stay redacted per observability §5.5.5 — the OpenAI provider reuses the existing message-serialization helper for the projection. Observer-side privacy gates (OTel `disable_llm_payload`, Langfuse equivalents) apply at rendering, symmetric with the §5.5.1 span attribute path.
-
 ## [0.12.0] — 2026-06-05
 
 Observability release. The pinned spec advances from v0.38.0 to v0.46.0, absorbing eight accepted proposals (0047-0054). Three ship as fully implemented this cycle: proposal 0048 grows a read-symmetric `get_invocation_metadata()` API + a §9 *Queryable observer pattern* concept doc section; proposal 0052 puts `openarmature.implementation.name` + `.version` attribution attributes on every OTel invocation span + every Langfuse Trace; proposal 0054 ships `CompiledGraph.drain_events_for(invocation_id, *, timeout)` as the architectural pair to 0048's §9.4 accumulator lifecycle. Two ship as textual-only acks (0051 Langfuse trace I/O caveat; 0053 §3.4 shared-parent boundary clarification). One Fixed: the retry middleware now resets the invocation-metadata ContextVar between attempts per §3.4. The production-observability example grows the queryable accumulator + drain_events_for pattern end-to-end so the new APIs have a runnable demo.

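The canonicalization rule described in the 0047 changelog entry (sort object keys at every nesting level, keep array order as supplied) can be illustrated with a minimal sketch. This is not the `_canonicalize_dict_keys` helper from the PR, whose body is not shown in this diff; `canonicalize_dict_keys` and `encode_arguments` below are hypothetical names, and the sketch assumes all dict keys are strings, as they would be for JSON-bound payloads.

```python
import json
from typing import Any


def canonicalize_dict_keys(value: Any) -> Any:
    """Hypothetical sketch: recursively sort dict keys, preserve list order."""
    if isinstance(value, dict):
        # Rebuild the dict in sorted-key order; assumes string keys.
        return {key: canonicalize_dict_keys(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Array order is caller-supplied and must not change.
        return [canonicalize_dict_keys(item) for item in value]
    return value


def encode_arguments(arguments: dict[str, Any]) -> str:
    """Hypothetical sketch of a sort_keys=True encoding for a string field."""
    return json.dumps(arguments, sort_keys=True)


# Two equivalent inputs that differ only in insertion order.
first = {"b": 1, "a": {"d": [3, 1, 2], "c": None}}
second = {"a": {"c": None, "d": [3, 1, 2]}, "b": 1}

same_bytes = json.dumps(canonicalize_dict_keys(first)) == json.dumps(
    canonicalize_dict_keys(second)
)
```

Under these assumptions `same_bytes` is true, while the list `[3, 1, 2]` keeps its order. The `sort_keys=True` call also shows why snapshot consumers see a byte shift: a dict previously encoded in `z, a` insertion order now encodes with `a` first.
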
diff --git a/conformance.toml b/conformance.toml
@@ -266,33 +266,38 @@ status = "implemented"
 since = "0.11.0"
 
 # Spec v0.39.0 (proposal 0047).  Implicit prefix-cache wire-byte
-# stability.  Cross-capability proposal landed in v0.13.0 across
-# three pieces: (1) ``Response.usage`` cache-stat fields
-# (``cached_tokens`` / ``cache_creation_tokens``) sourced from the
-# OpenAI ``prompt_tokens_details`` payload, with conditional emission
+# stability.  Cross-capability proposal landed end-to-end in the
+# v0.13.0 cycle across three pieces, all post-v0.12.0:
+# (1) ``Response.usage`` cache-stat fields (``cached_tokens`` /
+# ``cache_creation_tokens``) sourced from the OpenAI
+# ``prompt_tokens_details`` payload, with conditional emission
 # preserved (absent-vs-zero distinction stays observable) — landed
-# in the v0.12.0 cycle as the proposal's payload-side prerequisite;
+# in PR #136 as the proposal's payload-side prerequisite;
 # (2) OTel observer emits ``openarmature.llm.cache_read.input_tokens``
 # (and optional ``openarmature.llm.cache_creation.input_tokens``)
-# when the corresponding usage field is populated — also v0.12.0;
-# (3) §8.1 intra-impl wire-byte canonicalization in the OpenAI
-# adapter — landed here. The canonicalizer recursively sorts dict
-# keys at every nesting level while preserving caller-supplied
-# array order, applied at the four user-input boundaries
+# when the corresponding usage field is populated — landed in
+# PR #140; (3) §8.1 intra-impl wire-byte canonicalization in the
+# OpenAI adapter — landed in PR #145. The canonicalizer recursively
+# sorts dict keys at every nesting level while preserving caller-
+# supplied array order, applied at the four user-input boundaries
 # (``tool.parameters`` / ``tool.function`` record top-level per
 # spec Q5, ``response_format.json_schema.schema``, ``RuntimeConfig``
 # extras, ``tool_call.arguments`` JSON encoding) plus a top-level
-# belt-and-suspenders pass over the assembled request body.  Scope
-# is the Chat Completions endpoint only; the OpenAI Responses API
-# endpoint is deferred to a future cycle (no python consumer
-# today).  Prompt-management §13 cross-variable substring stability
-# is satisfied by the existing Jinja2 ``StrictUndefined`` render
-# path; pinned by ``tests/unit/test_prompts.py::
+# belt-and-suspenders pass over the assembled request body.
+# Downstream-observable wire-byte shift on
+# ``tool_call.arguments``: the encoded string now uses
+# ``sort_keys=True`` (functionally equivalent — parses to the same
+# dict — but byte-different for golden-file / audit-snapshot
+# consumers).  Scope is the Chat Completions endpoint only; the
+# OpenAI Responses API endpoint is deferred to a future cycle (no
+# python consumer today).  Prompt-management §13 cross-variable
+# substring stability is satisfied by the existing Jinja2
+# ``StrictUndefined`` render path; pinned by
+# ``tests/unit/test_prompts.py::
 # test_cross_variable_substring_stability_text_prompt`` and
 # ``test_cross_variable_substring_stability_chat_prompt``.
-# Anthropic / Gemini
-# wire-byte conformance fixtures stay deferred — neither provider
-# is implemented in python today.
+# Anthropic / Gemini wire-byte conformance fixtures stay deferred
+# — neither provider is implemented in python today.
 [proposals."0047"]
 status = "implemented"
 since = "0.13.0"
@@ -344,16 +349,24 @@ status = "implemented"
 since = "0.12.0"
 
 # Spec v0.41.0 (proposal 0049).  Typed LLM Completion Event — first
-# typed event variant on the observer event union.  Shipped in
-# v0.13.0: provider dual-emits the typed event alongside the sentinel
-# NodeEvent pair (success-only per spec scope); LlmCompletionEvent
-# carries identity/scoping/outcome fields per the spec field table.
-# Conformance fixtures 050-056 activated by the typed_event_collector
-# harness directive.  The OTel + Langfuse observers continue to drive
-# their §5.5 / §8.4.4 surface off the sentinel NodeEvent pair during
-# the dual-emit transition window; type-discrimination migration
-# lands once the follow-on request-side-fields extension (proposal
-# 0057) ships.
+# typed event variant on the observer event union.  Shipped fully in
+# v0.13.0 across PRs #141 (typed-event definition + provider
+# emission + 0057 field-set extension), #142 (OTel observer migration
+# to type discrimination), #143 (Langfuse observer migration +
+# success-side sentinel emission dropped), and #144 (0058 typed
+# LlmFailedEvent + sentinel-namespace NodeEvent emission for LLM
+# events retired entirely from the bundled OpenAIProvider).
+# LlmCompletionEvent carries identity/scoping/outcome fields per
+# the spec field table.  Both bundled observers (OTel + Langfuse)
+# consume the typed events via isinstance discrimination on both
+# outcome paths.  Conformance fixtures 050-056 activated by the
+# typed_event_collector harness directive.  Fixtures 057-068
+# (proposal 0057 request-side fields) and 069-073 (proposal 0058
+# typed failure event) stay parser-deferred pending the harness's
+# typed_event_collector directive schema catch-up + the event_counts
+# list directive introduced by fixture 071; behavior pinned by
+# unit tests in tests/unit/test_llm_provider.py +
+# test_observability_otel.py + test_observability_langfuse.py.
 [proposals."0049"]
 status = "implemented"
 since = "0.13.0"

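For custom observers, the migration from namespace filtering to `isinstance` discrimination might look roughly like the sketch below. The three dataclasses are simplified stand-ins defined here for illustration only; the real `NodeEvent`, `LlmCompletionEvent`, and `LlmFailedEvent` carry many more fields and their import paths are not shown in this diff. The `openarmature.` prefix check on `namespace[0]` is likewise an assumption about how the remaining checkpoint sentinels would be filtered.

```python
from dataclasses import dataclass


@dataclass
class NodeEvent:  # stand-in, not the real class
    namespace: tuple[str, ...]
    node_name: str


@dataclass
class LlmCompletionEvent:  # stand-in, not the real class
    latency_ms: float


@dataclass
class LlmFailedEvent:  # stand-in, not the real class
    error_category: str
    error_message: str
    latency_ms: float


def classify(event: object) -> str:
    """Hypothetical routing helper for a custom observer."""
    # LLM calls arrive as typed variants, one per outcome path.
    if isinstance(event, LlmCompletionEvent):
        return "llm-success"
    if isinstance(event, LlmFailedEvent):
        return "llm-failure"
    # Remaining internal events (e.g. checkpoint sentinels) ride on
    # NodeEvent; the prefix check here is an assumed filter.
    if (
        isinstance(event, NodeEvent)
        and event.namespace
        and event.namespace[0].startswith("openarmature.")
    ):
        return "internal"
    return "user-node"
```

An observer that previously compared `event.namespace` against `LLM_NAMESPACE` would, in this sketch, branch on the two typed variants first and keep the namespace check only for the sentinels that still use `NodeEvent`.
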
diff --git a/docs/agent/non-obvious-shapes.md b/docs/agent/non-obvious-shapes.md
@@ -127,7 +127,7 @@ Catching `Exception` works but is too broad; catching one hierarchy misses the o
 
 ### Filter `openarmature.*`-namespaced events when your observer only cares about user nodes
 
-OA emits observer events under sentinel node-names for its own internal dispatch: `openarmature.llm.complete` for LLM provider calls (proposal 0024), `openarmature.checkpoint.migrate` for state-migration runs (proposal 0014), `openarmature.checkpoint.save` for checkpoint saves (proposal 0010). These events let the OTel / Langfuse observers emit LLM-provider spans, checkpoint-migrate spans, etc., but a custom observer that only cares about user-defined node activity sees them as noise:
+OA emits observer events under sentinel node-names for some internal dispatch: `openarmature.checkpoint.migrate` for state-migration runs (proposal 0014) and `openarmature.checkpoint.save` for checkpoint saves (proposal 0010) ride on `NodeEvent` with a sentinel namespace. (LLM provider calls used to follow the same pattern but moved to typed `LlmCompletionEvent` / `LlmFailedEvent` variants in v0.13.0 per proposals 0049 + 0058 — those are filtered by `isinstance` instead.) The sentinel-namespace events let the OTel / Langfuse observers emit checkpoint-migrate spans, etc., but a custom observer that only cares about user-defined node activity sees them as noise:
 
 ```python
 async def __call__(self, event: NodeEvent) -> None:
@@ -137,7 +137,7 @@ async def __call__(self, event: NodeEvent) -> None:
     # … user-node handling
 ```
 
-`event.namespace[0]` is the safest discriminator (the leaf `event.node_name` would also work for LLM events but won't match the checkpoint sentinels since those repurpose `node_name` differently). Don't try to filter on `current_invocation_id() is None`: OA-internal events are dispatched within the same invocation context as user-node events, so `invocation_id` is set for both; the namespace-prefix check is the stable contract.
+`event.namespace[0]` is the safest discriminator. Don't try to filter on `current_invocation_id() is None`: OA-internal events are dispatched within the same invocation context as user-node events, so `invocation_id` is set for both; the namespace-prefix check is the stable contract.
 
 ### Fan-out subgraphs that emit `list[X]` per instance produce `list[list[X]]` at `target_field`