Skip to content

Commit a6b6f26

Browse files
Implement wire-byte stability (proposal 0047) (#145)
* Implement wire-byte stability (proposal 0047) Add intra-impl wire-byte stability to the OpenAI provider so equivalent OA inputs produce byte-identical wire output regardless of dict insertion order. A new ``_canonicalize_dict_keys`` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split: object keys are sorted, array order is caller-controlled). The helper applies at four user-supplied-dict boundaries: tool definitions (the ``function`` record top-level plus the parameters JSON Schema), ``response_format.json_schema.schema``, RuntimeConfig extras, and the JSON encoding of ``tool_call.arguments``. A top- level belt-and-suspenders pass over the assembled body catches anything the per-field passes miss. Closes proposal 0047 end-to-end: pieces 1 and 2 (Response.usage cache fields sourced from prompt_tokens_details + OTel observer emits the cache attributes) landed in v0.12.0; this is piece 3. Prompt-management §13 cross-variable substring stability is satisfied by the existing Jinja2 strict-undefined render path on both TextPrompt and ChatPrompt; pinned by new tests. A new ``docs/concepts/prompts.md`` section explains APC, what OA handles for users (wire-byte canonicalization, deterministic rendering), what users own (the spec's five informative authoring patterns), and a vLLM debugging callout for the cache-attribute- not-appearing case (server-side ``--enable-prefix-caching`` plus ``--enable-prompt-tokens-details``). Scope is the Chat Completions endpoint only. The OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (no python consumer today). Behavior change worth flagging: ``tool_call.arguments`` JSON encoding now uses ``sort_keys=True``. Functionally equivalent (parses to the same dict) but byte-different from the previous insertion-order encoding. * Address PR 145 review Two dead-pointer fixes flagged by CoPilot, both review-round-rename casualties: 1. CHANGELOG entry referenced ``_canonicalize_json_schema``; the helper was renamed to ``_canonicalize_dict_keys`` because it canonicalizes every user-supplied dict on the wire, not just JSON Schemas. 2. ``conformance.toml`` 0047 leading-comment block pointed at ``test_cross_variable_substring_stability``; that test got split into ``..._text_prompt`` and ``..._chat_prompt`` when coverage extended to the ChatPrompt variant.
1 parent d2d387a commit a6b6f26

6 files changed

Lines changed: 683 additions & 16 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
88

99
### Added
1010

11+
- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_dict_keys` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
1112
- **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.
1213

1314
### Changed

conformance.toml

Lines changed: 29 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -266,11 +266,36 @@ status = "implemented"
266266
since = "0.11.0"
267267

268268
# Spec v0.39.0 (proposal 0047). Implicit prefix-cache wire-byte
269-
# stability. Cross-provider invariant requiring intra-impl byte
270-
# equality across calls with equivalent inputs. Queued for v0.13.0
271-
# alongside 0049 (LLM provider hardening + typed event batch).
269+
# stability. Cross-capability proposal landed in v0.13.0 across
270+
# three pieces: (1) ``Response.usage`` cache-stat fields
271+
# (``cached_tokens`` / ``cache_creation_tokens``) sourced from the
272+
# OpenAI ``prompt_tokens_details`` payload, with conditional emission
273+
# preserved (absent-vs-zero distinction stays observable) — landed
274+
# in the v0.12.0 cycle as the proposal's payload-side prerequisite;
275+
# (2) OTel observer emits ``openarmature.llm.cache_read.input_tokens``
276+
# (and optional ``openarmature.llm.cache_creation.input_tokens``)
277+
# when the corresponding usage field is populated — also v0.12.0;
278+
# (3) §8.1 intra-impl wire-byte canonicalization in the OpenAI
279+
# adapter — landed here. The canonicalizer recursively sorts dict
280+
# keys at every nesting level while preserving caller-supplied
281+
# array order, applied at the four user-input boundaries
282+
# (``tool.parameters`` / ``tool.function`` record top-level per
283+
# spec Q5, ``response_format.json_schema.schema``, ``RuntimeConfig``
284+
# extras, ``tool_call.arguments`` JSON encoding) plus a top-level
285+
# belt-and-suspenders pass over the assembled request body. Scope
286+
# is the Chat Completions endpoint only; the OpenAI Responses API
287+
# endpoint is deferred to a future cycle (no python consumer
288+
# today). Prompt-management §13 cross-variable substring stability
289+
# is satisfied by the existing Jinja2 ``StrictUndefined`` render
290+
# path; pinned by ``tests/unit/test_prompts.py::
291+
# test_cross_variable_substring_stability_text_prompt`` and
292+
# ``test_cross_variable_substring_stability_chat_prompt``.
293+
# Anthropic / Gemini
294+
# wire-byte conformance fixtures stay deferred — neither provider
295+
# is implemented in python today.
272296
[proposals."0047"]
273-
status = "not-yet"
297+
status = "implemented"
298+
since = "0.13.0"
274299

275300
# Spec v0.40.0 (proposal 0048). Read-symmetric invocation metadata.
276301
# Adds ``get_invocation_metadata()`` symmetric to the existing

docs/concepts/prompts.md

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -365,6 +365,73 @@ The filesystem backend layout is
365365
`<root>/<label>/<name>.j2`; for the example above,
366366
`./prompts/production/greeting.j2`.
367367

368+
## Prefix-cache friendly authoring (APC)
369+
370+
Inference engines that implement Automatic Prefix Caching
371+
(vLLM with `--enable-prefix-caching`, OpenAI's hosted prompt
372+
caching, llama.cpp's prefix reuse, others) skip recomputing
373+
attention for token prefixes they have already processed in
374+
a recent request. The cache hit is decided by **byte equality**
375+
of the prefix. A single reordered key, a shuffled tool
376+
definition, or a timestamp embedded in the system prompt
377+
invalidates the cache and re-runs full attention from the
378+
first changed byte.
379+
380+
OpenArmature handles the wire-byte half of this contract for
381+
you. The OpenAI provider canonicalizes every user-supplied dict
382+
on the wire — tool parameter schemas, response-format schemas,
383+
`RuntimeConfig` extras, tool-call arguments — so equivalent OA
384+
inputs produce byte-identical wire output regardless of dict
385+
insertion order. Prompt rendering is deterministic by
386+
construction: same `Prompt` plus same variables produces
387+
byte-identical `PromptResult.messages` (spec
388+
[prompt-management §13](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/prompt-management/spec.md#13-determinism)).
389+
390+
Authoring discipline that maximizes APC hit rates is
391+
out of OA's hands — it's about how you structure the prompts.
392+
The spec's [llm-provider §14 *APC-friendly authoring
393+
guidance*](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/llm-provider/spec.md#14-apc-friendly-authoring-guidance-informative)
394+
lists five informative patterns; the headline:
395+
396+
1. **Place variables and chat history at the end of templates.**
397+
Stable static prefix at the front maximizes cacheable bytes.
398+
2. **No timestamps, UUIDs, or other nondeterministic values
399+
in static segments.** They poison the cache prefix on every
400+
request.
401+
3. **Stable few-shot ordering.** Pick once, reuse across
402+
requests; don't shuffle.
403+
4. **Sort retrieval results before injecting** when the
404+
downstream consumer doesn't care about order.
405+
5. **Cache-friendly tool ordering.** Define tools in a stable
406+
order across calls.
407+
408+
### Debugging "the cache attribute isn't showing up"
409+
410+
When the OTel observer is running but
411+
`openarmature.llm.cache_read.input_tokens` doesn't appear on
412+
your `openarmature.llm.complete` spans, the cause is almost
413+
always server-side: the inference engine either isn't
414+
configured to surface cache stats, or isn't running with prefix
415+
caching enabled at all.
416+
417+
- **vLLM**: launch with `--enable-prefix-caching` AND
418+
`--enable-prompt-tokens-details`. The first turns APC on;
419+
the second tells vLLM to populate
420+
`usage.prompt_tokens_details.cached_tokens` on the wire
421+
response. Both flags are required for the attribute to
422+
surface.
423+
- **OpenAI hosted (Chat Completions)**: prompt caching is
424+
on automatically for prompts ≥1024 tokens; the
425+
`prompt_tokens_details.cached_tokens` field appears on
426+
qualifying responses without configuration.
427+
428+
OA's role is to source the field when present (provider-side)
429+
and emit the attribute when populated (observer-side); without
430+
the upstream signal, neither happens — and that's the right
431+
behavior (per the spec's absent-vs-zero distinction, an absent
432+
attribute means "the provider didn't report," not "zero
433+
hits").
434+
368435
## What's out of scope (for now)
369436

370437
- **Specific vendor backends**: Langfuse, PromptLayer, etc.,

src/openarmature/llm/providers/openai.py

Lines changed: 66 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -714,18 +714,24 @@ def _build_request_body(
714714
body["stop"] = config.stop_sequences
715715
# Pass-through any provider-specific extras (extra="allow"
716716
# on RuntimeConfig); spec §6 mandates implementations MUST
717-
# accept and forward undeclared fields untouched.
717+
# accept and forward undeclared fields untouched. Spec 0047
718+
# §8: canonicalize each extra value at the user-input
719+
# boundary so dict-typed extras (vLLM ``guided_decoding``,
720+
# etc.) render with stable key ordering.
718721
extras = config.model_extra or {}
719722
for k, v in extras.items():
720-
body.setdefault(k, v)
723+
body.setdefault(k, _canonicalize_dict_keys(v))
721724
# response_format is omitted entirely on the fallback path —
722725
# the schema travels in the augmented system message instead.
723726
if schema_dict is not None and include_response_format:
727+
# Spec 0047 §8.1.5 / Q5 ack: response_format.json_schema.schema
728+
# is a user-supplied JSON Schema and flows through the same
729+
# canonicalization path as tool.parameters.
724730
body["response_format"] = {
725731
"type": "json_schema",
726732
"json_schema": {
727733
"name": _derive_schema_name(schema_dict),
728-
"schema": schema_dict,
734+
"schema": _canonicalize_dict_keys(schema_dict),
729735
"strict": strict_mode_supported(schema_dict),
730736
},
731737
}
@@ -752,7 +758,12 @@ def _build_request_body(
752758
}
753759
else:
754760
body["tool_choice"] = tool_choice
755-
return body
761+
# Spec 0047 §8 belt-and-suspenders: walk the assembled body
762+
# once more sorting any dict at every nesting level, in case
763+
# a future code path introduces a user-input boundary the
764+
# per-field canonicalization above doesn't cover. Cheap (the
765+
# body is small) and explicit.
766+
return _canonicalize_dict_keys(body)
756767

757768
# ------------------------------------------------------------------
758769
# Response parsing (spec §8.1.2)
@@ -1132,8 +1143,11 @@ def _message_to_wire(msg: Message) -> dict[str, Any]:
11321143
"name": tc.name,
11331144
# Canonical compact form (no inter-token spaces). Matches
11341145
# the spec's wire-mapping fixture (005, cases shape) and
1135-
# the form OpenAI itself emits.
1136-
"arguments": json.dumps(tc.arguments or {}, separators=(",", ":")),
1146+
# the form OpenAI itself emits. ``sort_keys=True`` per
1147+
# spec 0047 §8 — tool-call arguments are a
1148+
# caller-supplied dict and the JSON-encoded string
1149+
# MUST be byte-stable across equivalent inputs.
1150+
"arguments": json.dumps(tc.arguments or {}, separators=(",", ":"), sort_keys=True),
11371151
},
11381152
}
11391153
for tc in msg.tool_calls
@@ -1169,14 +1183,55 @@ def _block_to_wire(block: ContentBlock) -> dict[str, Any]:
11691183
return {"type": "image_url", "image_url": image_url}
11701184

11711185

1186+
# Spec 0047 §8 *Intra-impl wire-byte stability* canonicalizer.
1187+
# Recursively sorts dict keys at every nesting level; preserves list
1188+
# ordering (per Q5 ack on the proposal-0047 coord thread — array
1189+
# ORDER is caller-supplied and stays as-is; object KEYS inside
1190+
# arrays get sorted via the dict-recursion branch). Applied at every
1191+
# user-supplied-dict boundary in the wire body so equivalent OA
1192+
# inputs produce byte-identical wire output for APC hit reliability.
1193+
#
1194+
# Recursion depth: bounded by the depth of the input dict, not by
1195+
# any internal accumulator. Python's default recursion limit (1000)
1196+
# is two orders of magnitude above realistic JSON Schema depths
1197+
# (typical schemas top out at 5-10 nesting levels — OpenAI's API
1198+
# rejects deeper ones at the wire layer before the cache prefix
1199+
# matters). We don't impose our own cap; if a caller hands us a
1200+
# 1000-deep nested dict, RecursionError surfaces immediately at
1201+
# canonicalization time rather than producing silently-broken wire
1202+
# bytes downstream.
1203+
#
1204+
# Byte-stability requires Python's dict insertion-order preservation
1205+
# guarantee (PEP 468, 3.7+) AND httpx serializing the body via the
1206+
# stdlib ``json.dumps`` default (which respects dict iteration
1207+
# order). Both are stable contracts on the supported Python versions
1208+
# + httpx 0.27+. If a future httpx release internalizes ordering
1209+
# (e.g., switches to alphabetical key emission), the canonicalizer
1210+
# becomes redundant but tests would continue to pass; if it
1211+
# randomizes ordering, the wire-byte tests in
1212+
# ``tests/unit/test_llm_provider.py`` would fail loudly.
1213+
def _canonicalize_dict_keys(value: Any) -> Any:
1214+
if isinstance(value, dict):
1215+
return {k: _canonicalize_dict_keys(value[k]) for k in sorted(cast("dict[str, Any]", value))}
1216+
if isinstance(value, list):
1217+
return [_canonicalize_dict_keys(v) for v in cast("list[Any]", value)]
1218+
return value
1219+
1220+
11721221
def _tool_to_wire(tool: Tool) -> dict[str, Any]:
1222+
# Per spec 0047 §8 ack (coord Q5): the byte-stability rule covers
1223+
# tool DEFINITIONS broadly — not just the parameters subtree.
1224+
# Sort the function record's top-level keys + recursively
1225+
# canonicalize the parameters JSON Schema.
11731226
return {
11741227
"type": "function",
1175-
"function": {
1176-
"name": tool.name,
1177-
"description": tool.description,
1178-
"parameters": tool.parameters,
1179-
},
1228+
"function": _canonicalize_dict_keys(
1229+
{
1230+
"name": tool.name,
1231+
"description": tool.description,
1232+
"parameters": tool.parameters,
1233+
}
1234+
),
11801235
}
11811236

11821237

0 commit comments

Comments
 (0)