Implement wire-byte stability (proposal 0047) (#145)

chris-colinsky · web-flow · commit a6b6f265cf99 · 2026-06-09T17:08:36.000-07:00
* Implement wire-byte stability (proposal 0047)

Add intra-impl wire-byte stability to the OpenAI provider so
equivalent OA inputs produce byte-identical wire output regardless
of dict insertion order. A new ``_canonicalize_dict_keys`` helper
recursively sorts dict keys at every nesting level while preserving
caller-supplied array ordering (the spec's split: object keys are
sorted, array order is caller-controlled).
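
A minimal sketch of the key-sorting behavior described above, assuming
string-keyed JSON-like input. ``canonicalize_dict_keys`` is an
illustrative stand-in for the private helper, and both schemas are
hypothetical inputs, not fixtures from this PR:

```python
import json
from typing import Any


def canonicalize_dict_keys(value: Any) -> Any:
    """Illustrative stand-in: sort dict keys recursively, keep list order."""
    if isinstance(value, dict):
        return {k: canonicalize_dict_keys(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [canonicalize_dict_keys(v) for v in value]
    return value


# Two equivalent (hypothetical) tool-parameter schemas built in different key order.
schema_a = {
    "type": "object",
    "properties": {"b": {"type": "string"}, "a": {"type": "integer"}},
    "required": ["b", "a"],
}
schema_b = {
    "required": ["b", "a"],
    "properties": {"a": {"type": "integer"}, "b": {"type": "string"}},
    "type": "object",
}

# After canonicalization both serialize to the same bytes.
wire_a = json.dumps(canonicalize_dict_keys(schema_a), separators=(",", ":"))
wire_b = json.dumps(canonicalize_dict_keys(schema_b), separators=(",", ":"))
```

Here ``wire_a`` and ``wire_b`` compare equal, while the ``required``
array keeps its caller-supplied ``["b", "a"]`` order.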

Canonicalization applies at four user-supplied-dict boundaries: tool
definitions (the ``function`` record top-level plus the parameters
JSON Schema), ``response_format.json_schema.schema``, RuntimeConfig
extras, and the JSON encoding of ``tool_call.arguments`` (via
``sort_keys=True`` rather than the helper). A top-level
belt-and-suspenders pass over the assembled body catches anything
the per-field passes miss.

Closes proposal 0047 end-to-end: pieces 1 and 2 (Response.usage
cache fields sourced from prompt_tokens_details + OTel observer
emits the cache attributes) landed in v0.12.0; this is piece 3.
Prompt-management §13 cross-variable substring stability is
satisfied by the existing Jinja2 strict-undefined render path on
both TextPrompt and ChatPrompt; pinned by new tests.

A new ``docs/concepts/prompts.md`` section explains APC, what OA
handles for users (wire-byte canonicalization, deterministic
rendering), what users own (the spec's five informative authoring
patterns), and a vLLM debugging callout for the cache-attribute-
not-appearing case (server-side ``--enable-prefix-caching`` plus
``--enable-prompt-tokens-details``).

Scope is the Chat Completions endpoint only. The OpenAI Responses
API endpoint and the Anthropic / Gemini wire-format mappings are
deferred (no Python consumer today).

Behavior change worth flagging: ``tool_call.arguments`` JSON
encoding now uses ``sort_keys=True``. Functionally equivalent
(parses to the same dict) but byte-different from the previous
insertion-order encoding.
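
A small sketch of that difference using only the stdlib ``json``
module. The argument dicts are hypothetical; the ``separators`` value
matches the compact form the provider already used:

```python
import json

# Same logical tool-call arguments, built in different insertion order.
args_a = {"unit": "celsius", "city": "Paris"}
args_b = {"city": "Paris", "unit": "celsius"}

# Previous encoding: bytes follow dict insertion order.
old_a = json.dumps(args_a, separators=(",", ":"))
old_b = json.dumps(args_b, separators=(",", ":"))

# New encoding: keys are sorted, so equivalent dicts yield identical bytes.
new_a = json.dumps(args_a, separators=(",", ":"), sort_keys=True)
new_b = json.dumps(args_b, separators=(",", ":"), sort_keys=True)
```

``old_a`` and ``old_b`` differ byte-for-byte, ``new_a`` and ``new_b``
are identical, and all four strings parse back to the same dict.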

* Address PR 145 review

Two dead-pointer fixes flagged by Copilot, both review-round-rename
casualties:

1. CHANGELOG entry referenced ``_canonicalize_json_schema``; the
   helper was renamed to ``_canonicalize_dict_keys`` because it
   canonicalizes every user-supplied dict on the wire, not just
   JSON Schemas.

2. ``conformance.toml`` 0047 leading-comment block pointed at
   ``test_cross_variable_substring_stability``; that test got
   split into ``..._text_prompt`` and ``..._chat_prompt`` when
   coverage extended to the ChatPrompt variant.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
 
 ### Added
 
+- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs: equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_dict_keys` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by new tests for both `TextPrompt` and `ChatPrompt`. Scope is the Chat Completions endpoint only; the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in Python today).
 - **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.
 
 ### Changed
diff --git a/conformance.toml b/conformance.toml
@@ -266,11 +266,36 @@ status = "implemented"
 since = "0.11.0"
 
 # Spec v0.39.0 (proposal 0047).  Implicit prefix-cache wire-byte
-# stability.  Cross-provider invariant requiring intra-impl byte
-# equality across calls with equivalent inputs.  Queued for v0.13.0
-# alongside 0049 (LLM provider hardening + typed event batch).
+# stability.  Cross-capability proposal completed in v0.13.0 across
+# three pieces: (1) ``Response.usage`` cache-stat fields
+# (``cached_tokens`` / ``cache_creation_tokens``) sourced from the
+# OpenAI ``prompt_tokens_details`` payload, with conditional emission
+# preserved (absent-vs-zero distinction stays observable) — landed
+# in the v0.12.0 cycle as the proposal's payload-side prerequisite;
+# (2) OTel observer emits ``openarmature.llm.cache_read.input_tokens``
+# (and optional ``openarmature.llm.cache_creation.input_tokens``)
+# when the corresponding usage field is populated — also v0.12.0;
+# (3) §8.1 intra-impl wire-byte canonicalization in the OpenAI
+# adapter — landed here. The canonicalizer recursively sorts dict
+# keys at every nesting level while preserving caller-supplied
+# array order, applied at the four user-input boundaries
+# (``tool.parameters`` / ``tool.function`` record top-level per
+# spec Q5, ``response_format.json_schema.schema``, ``RuntimeConfig``
+# extras, ``tool_call.arguments`` JSON encoding) plus a top-level
+# belt-and-suspenders pass over the assembled request body.  Scope
+# is the Chat Completions endpoint only; the OpenAI Responses API
+# endpoint is deferred to a future cycle (no python consumer
+# today).  Prompt-management §13 cross-variable substring stability
+# is satisfied by the existing Jinja2 ``StrictUndefined`` render
+# path; pinned by ``tests/unit/test_prompts.py::
+# test_cross_variable_substring_stability_text_prompt`` and
+# ``test_cross_variable_substring_stability_chat_prompt``.
+# Anthropic / Gemini
+# wire-byte conformance fixtures stay deferred — neither provider
+# is implemented in python today.
 [proposals."0047"]
-status = "not-yet"
+status = "implemented"
+since = "0.13.0"
 
 # Spec v0.40.0 (proposal 0048).  Read-symmetric invocation metadata.
 # Adds ``get_invocation_metadata()`` symmetric to the existing
diff --git a/docs/concepts/prompts.md b/docs/concepts/prompts.md
@@ -365,6 +365,73 @@ The filesystem backend layout is
 `<root>/<label>/<name>.j2`; for the example above,
 `./prompts/production/greeting.j2`.
 
+## Prefix-cache friendly authoring (APC)
+
+Inference engines that implement Automatic Prefix Caching
+(vLLM with `--enable-prefix-caching`, OpenAI's hosted prompt
+caching, llama.cpp's prefix reuse, others) skip recomputing
+attention for token prefixes they have already processed in
+a recent request. The cache hit is decided by **byte equality**
+of the prefix. A single reordered key, a shuffled tool
+definition, or a timestamp embedded in the system prompt
+invalidates the cache and re-runs full attention from the
+first changed byte.
+
+OpenArmature handles the wire-byte half of this contract for
+you. The OpenAI provider canonicalizes every user-supplied dict
+on the wire — tool parameter schemas, response-format schemas,
+`RuntimeConfig` extras, tool-call arguments — so equivalent OA
+inputs produce byte-identical wire output regardless of dict
+insertion order. Prompt rendering is deterministic by
+construction: same `Prompt` plus same variables produces
+byte-identical `PromptResult.messages` (spec
+[prompt-management §13](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/prompt-management/spec.md#13-determinism)).
+
+Authoring discipline that maximizes APC hit rates is
+out of OA's hands — it's about how you structure the prompts.
+The spec's [llm-provider §14 *APC-friendly authoring
+guidance*](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/llm-provider/spec.md#14-apc-friendly-authoring-guidance-informative)
+lists five informative patterns. In brief:
+
+1. **Place variables and chat history at the end of templates.**
+   Stable static prefix at the front maximizes cacheable bytes.
+2. **No timestamps, UUIDs, or other nondeterministic values
+   in static segments.** They poison the cache prefix on every
+   request.
+3. **Stable few-shot ordering.** Pick once, reuse across
+   requests; don't shuffle.
+4. **Sort retrieval results before injecting** when the
+   downstream consumer doesn't care about order.
+5. **Cache-friendly tool ordering.** Define tools in a stable
+   order across calls.
+
+### Debugging "the cache attribute isn't showing up"
+
+When the OTel observer is running but
+`openarmature.llm.cache_read.input_tokens` doesn't appear on
+your `openarmature.llm.complete` spans, the cause is almost
+always server-side: the inference engine either isn't
+configured to surface cache stats, or isn't running with prefix
+caching enabled at all.
+
+- **vLLM**: launch with `--enable-prefix-caching` AND
+  `--enable-prompt-tokens-details`. The first turns APC on;
+  the second tells vLLM to populate
+  `usage.prompt_tokens_details.cached_tokens` on the wire
+  response. Both flags are required for the attribute to
+  surface.
+- **OpenAI hosted (Chat Completions)**: prompt caching is
+  on automatically for prompts ≥1024 tokens; the
+  `prompt_tokens_details.cached_tokens` field appears on
+  qualifying responses without configuration.
+
+OA's role is to source the field when present (provider-side)
+and emit the attribute when populated (observer-side); without
+the upstream signal, neither happens — and that's the right
+behavior (per the spec's absent-vs-zero distinction, an absent
+attribute means "the provider didn't report," not "zero
+hits").
+
 ## What's out of scope (for now)
 
 - **Specific vendor backends**: Langfuse, PromptLayer, etc.,
diff --git a/src/openarmature/llm/providers/openai.py b/src/openarmature/llm/providers/openai.py
@@ -714,18 +714,24 @@ def _build_request_body(
                 body["stop"] = config.stop_sequences
             # Pass-through any provider-specific extras (extra="allow"
             # on RuntimeConfig); spec §6 mandates implementations MUST
-            # accept and forward undeclared fields untouched.
+            # accept and forward undeclared fields untouched. Spec 0047
+            # §8: canonicalize each extra value at the user-input
+            # boundary so dict-typed extras (vLLM ``guided_decoding``,
+            # etc.) render with stable key ordering.
             extras = config.model_extra or {}
             for k, v in extras.items():
-                body.setdefault(k, v)
+                body.setdefault(k, _canonicalize_dict_keys(v))
         # response_format is omitted entirely on the fallback path —
         # the schema travels in the augmented system message instead.
         if schema_dict is not None and include_response_format:
+            # Spec 0047 §8.1.5 / Q5 ack: response_format.json_schema.schema
+            # is a user-supplied JSON Schema and flows through the same
+            # canonicalization path as tool.parameters.
             body["response_format"] = {
                 "type": "json_schema",
                 "json_schema": {
                     "name": _derive_schema_name(schema_dict),
-                    "schema": schema_dict,
+                    "schema": _canonicalize_dict_keys(schema_dict),
                     "strict": strict_mode_supported(schema_dict),
                 },
             }
@@ -752,7 +758,12 @@ def _build_request_body(
                 }
             else:
                 body["tool_choice"] = tool_choice
-        return body
+        # Spec 0047 §8 belt-and-suspenders: walk the assembled body
+        # once more sorting any dict at every nesting level, in case
+        # a future code path introduces a user-input boundary the
+        # per-field canonicalization above doesn't cover. Cheap (the
+        # body is small) and explicit.
+        return _canonicalize_dict_keys(body)
 
     # ------------------------------------------------------------------
     # Response parsing (spec §8.1.2)
@@ -1132,8 +1143,11 @@ def _message_to_wire(msg: Message) -> dict[str, Any]:
                         "name": tc.name,
                         # Canonical compact form (no inter-token spaces). Matches
                         # the spec's wire-mapping fixture (005, cases shape) and
-                        # the form OpenAI itself emits.
-                        "arguments": json.dumps(tc.arguments or {}, separators=(",", ":")),
+                        # the form OpenAI itself emits. ``sort_keys=True`` per
+                        # spec 0047 §8 — tool-call arguments are a
+                        # caller-supplied dict and the JSON-encoded string
+                        # MUST be byte-stable across equivalent inputs.
+                        "arguments": json.dumps(tc.arguments or {}, separators=(",", ":"), sort_keys=True),
                     },
                 }
                 for tc in msg.tool_calls
@@ -1169,14 +1183,55 @@ def _block_to_wire(block: ContentBlock) -> dict[str, Any]:
     return {"type": "image_url", "image_url": image_url}
 
 
+# Spec 0047 §8 *Intra-impl wire-byte stability* canonicalizer.
+# Recursively sorts dict keys at every nesting level; preserves list
+# ordering (per Q5 ack on the proposal-0047 coord thread — array
+# ORDER is caller-supplied and stays as-is; object KEYS inside
+# arrays get sorted via the dict-recursion branch). Applied at every
+# user-supplied-dict boundary in the wire body so equivalent OA
+# inputs produce byte-identical wire output for APC hit reliability.
+#
+# Recursion depth: bounded by the depth of the input dict, not by
+# any internal accumulator. Python's default recursion limit (1000)
+# is two orders of magnitude above realistic JSON Schema depths
+# (typical schemas top out at 5-10 nesting levels — OpenAI's API
+# rejects deeper ones at the wire layer before the cache prefix
+# matters). We don't impose our own cap; if a caller hands us a
+# 1000-deep nested dict, RecursionError surfaces immediately at
+# canonicalization time rather than producing silently-broken wire
+# bytes downstream.
+#
+# Byte-stability requires Python's dict insertion-order preservation
+# guarantee (language spec, 3.7+) AND httpx serializing the body via the
+# stdlib ``json.dumps`` default (which respects dict iteration
+# order). Both are stable contracts on the supported Python versions
+# + httpx 0.27+. If a future httpx release internalizes ordering
+# (e.g., switches to alphabetical key emission), the canonicalizer
+# becomes redundant but tests would continue to pass; if it
+# randomizes ordering, the wire-byte tests in
+# ``tests/unit/test_llm_provider.py`` would fail loudly.
+def _canonicalize_dict_keys(value: Any) -> Any:
+    if isinstance(value, dict):
+        return {k: _canonicalize_dict_keys(value[k]) for k in sorted(cast("dict[str, Any]", value))}
+    if isinstance(value, list):
+        return [_canonicalize_dict_keys(v) for v in cast("list[Any]", value)]
+    return value
+
+
 def _tool_to_wire(tool: Tool) -> dict[str, Any]:
+    # Per spec 0047 §8 ack (coord Q5): the byte-stability rule covers
+    # tool DEFINITIONS broadly — not just the parameters subtree.
+    # Sort the function record's top-level keys + recursively
+    # canonicalize the parameters JSON Schema.
     return {
         "type": "function",
-        "function": {
-            "name": tool.name,
-            "description": tool.description,
-            "parameters": tool.parameters,
-        },
+        "function": _canonicalize_dict_keys(
+            {
+                "name": tool.name,
+                "description": tool.description,
+                "parameters": tool.parameters,
+            }
+        ),
     }
 
 
diff --git a/tests/unit/test_llm_provider.py b/tests/unit/test_llm_provider.py
diff --git a/tests/unit/test_prompts.py b/tests/unit/test_prompts.py