Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The

### Added

- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_dict_keys` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
- **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.

### Changed
33 changes: 29 additions & 4 deletions conformance.toml
@@ -266,11 +266,36 @@ status = "implemented"
since = "0.11.0"

# Spec v0.39.0 (proposal 0047). Implicit prefix-cache wire-byte
# stability. Cross-provider invariant requiring intra-impl byte
# equality across calls with equivalent inputs. Queued for v0.13.0
# alongside 0049 (LLM provider hardening + typed event batch).
# stability. Cross-capability proposal landed in v0.13.0 across
# three pieces: (1) ``Response.usage`` cache-stat fields
# (``cached_tokens`` / ``cache_creation_tokens``) sourced from the
# OpenAI ``prompt_tokens_details`` payload, with conditional emission
# preserved (absent-vs-zero distinction stays observable) — landed
# in the v0.12.0 cycle as the proposal's payload-side prerequisite;
# (2) OTel observer emits ``openarmature.llm.cache_read.input_tokens``
# (and optional ``openarmature.llm.cache_creation.input_tokens``)
# when the corresponding usage field is populated — also v0.12.0;
# (3) §8.1 intra-impl wire-byte canonicalization in the OpenAI
# adapter — landed here. The canonicalizer recursively sorts dict
# keys at every nesting level while preserving caller-supplied
# array order, applied at the four user-input boundaries
# (``tool.parameters`` / ``tool.function`` record top-level per
# spec Q5, ``response_format.json_schema.schema``, ``RuntimeConfig``
# extras, ``tool_call.arguments`` JSON encoding) plus a top-level
# belt-and-suspenders pass over the assembled request body. Scope
# is the Chat Completions endpoint only; the OpenAI Responses API
# endpoint is deferred to a future cycle (no python consumer
# today). Prompt-management §13 cross-variable substring stability
# is satisfied by the existing Jinja2 ``StrictUndefined`` render
# path; pinned by ``tests/unit/test_prompts.py::
# test_cross_variable_substring_stability_text_prompt`` and
# ``test_cross_variable_substring_stability_chat_prompt``.
# Anthropic / Gemini
# wire-byte conformance fixtures stay deferred — neither provider
# is implemented in python today.
[proposals."0047"]
status = "not-yet"
status = "implemented"
since = "0.13.0"

# Spec v0.40.0 (proposal 0048). Read-symmetric invocation metadata.
# Adds ``get_invocation_metadata()`` symmetric to the existing
67 changes: 67 additions & 0 deletions docs/concepts/prompts.md
@@ -365,6 +365,73 @@ The filesystem backend layout is
`<root>/<label>/<name>.j2`; for the example above,
`./prompts/production/greeting.j2`.

## Prefix-cache friendly authoring (APC)

Inference engines that implement Automatic Prefix Caching
(vLLM with `--enable-prefix-caching`, OpenAI's hosted prompt
caching, llama.cpp's prefix reuse, others) skip recomputing
attention for token prefixes they have already processed in
a recent request. The cache hit is decided by **byte equality**
of the prefix. A single reordered key, a shuffled tool
definition, or a timestamp embedded in the system prompt
invalidates the cache and re-runs full attention from the
first changed byte.

OpenArmature handles the wire-byte half of this contract for
you. The OpenAI provider canonicalizes every user-supplied dict
on the wire — tool parameter schemas, response-format schemas,
`RuntimeConfig` extras, tool-call arguments — so equivalent OA
inputs produce byte-identical wire output regardless of dict
insertion order. Prompt rendering is deterministic by
construction: same `Prompt` plus same variables produces
byte-identical `PromptResult.messages` (spec
[prompt-management §13](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/prompt-management/spec.md#13-determinism)).
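
As a rough illustration of what key canonicalization buys you,
the sketch below sorts dict keys recursively while leaving list
order alone, then checks that two equivalent but differently
ordered schemas serialize to the same bytes. The `canonicalize`
function is a standalone stand-in written for this example (it
is not OA's internal helper), and the schema is made up.

```python
import json
from typing import Any


def canonicalize(value: Any) -> Any:
    """Recursively sort dict keys; keep list order as supplied."""
    if isinstance(value, dict):
        return {k: canonicalize(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value


# Two equivalent schemas that differ only in key insertion order.
schema_a = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "unit": {"enum": ["c", "f"]}},
    "required": ["city"],
}
schema_b = {
    "required": ["city"],
    "properties": {"unit": {"enum": ["c", "f"]}, "city": {"type": "string"}},
    "type": "object",
}

# Without canonicalization the serialized bytes differ.
raw_a = json.dumps(schema_a).encode()
raw_b = json.dumps(schema_b).encode()

# With canonicalization they are identical.
bytes_a = json.dumps(canonicalize(schema_a), separators=(",", ":")).encode()
bytes_b = json.dumps(canonicalize(schema_b), separators=(",", ":")).encode()
```

List order is deliberately left untouched: the `enum` array above
keeps whatever order the caller supplied.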

The authoring discipline that maximizes APC hit rates is
outside OA's control; it depends on how you structure your
prompts. The spec's [llm-provider §14 *APC-friendly authoring
guidance*](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/llm-provider/spec.md#14-apc-friendly-authoring-guidance-informative)
lists five informative patterns, summarized here:

1. **Place variables and chat history at the end of templates.**
Stable static prefix at the front maximizes cacheable bytes.
2. **No timestamps, UUIDs, or other nondeterministic values
in static segments.** They poison the cache prefix on every
request.
3. **Stable few-shot ordering.** Pick once, reuse across
requests; don't shuffle.
4. **Sort retrieval results before injecting** when the
downstream consumer doesn't care about order.
5. **Cache-friendly tool ordering.** Define tools in a stable
order across calls.
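
The patterns above can be sketched with plain string assembly.
This is a minimal illustration, assuming a hypothetical
`build_prompt` helper and made-up prompt text; it does not use
OA's `Prompt` API. Static segments come first, retrieval results
are sorted, and the per-request question goes last, so two
different requests share the longest possible prefix.

```python
import os

# Static segments: no timestamps, no UUIDs, fixed few-shot order.
STATIC_SYSTEM = (
    "You are a support assistant.\n"
    "Answer using only the provided documents.\n"
)
FEW_SHOT = "Q: How do I reset my password?\nA: Use the account settings page.\n"


def build_prompt(documents: list[str], question: str) -> str:
    # Sort retrieval results so arrival order does not change the bytes.
    docs_block = "\n".join(sorted(documents))
    # The per-request variable goes at the very end.
    return (
        f"{STATIC_SYSTEM}{FEW_SHOT}"
        f"Documents:\n{docs_block}\n"
        f"Question: {question}\n"
    )


p1 = build_prompt(["doc-b", "doc-a"], "Where is my invoice?")
p2 = build_prompt(["doc-a", "doc-b"], "How do I cancel?")

# Number of leading characters the two prompts have in common.
shared = len(os.path.commonprefix([p1, p2]))
```

Here the two prompts diverge only at the question text, so
everything before it is a candidate for prefix reuse.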

### Debugging "the cache attribute isn't showing up"

When the OTel observer is running but
`openarmature.llm.cache_read.input_tokens` doesn't appear on
your `openarmature.llm.complete` spans, the cause is almost
always server-side: the inference engine either isn't
configured to surface cache stats, or isn't running with prefix
caching enabled at all.

- **vLLM**: launch with `--enable-prefix-caching` AND
`--enable-prompt-tokens-details`. The first turns APC on;
the second tells vLLM to populate
`usage.prompt_tokens_details.cached_tokens` on the wire
response. Both flags are required for the attribute to
surface.
- **OpenAI hosted (Chat Completions)**: prompt caching is
on automatically for prompts ≥1024 tokens; the
`prompt_tokens_details.cached_tokens` field appears on
qualifying responses without configuration.

OA's role is to source the field when present (provider-side)
and emit the attribute when populated (observer-side); without
the upstream signal, neither happens — and that's the right
behavior (per the spec's absent-vs-zero distinction, an absent
attribute means "the provider didn't report," not "zero
hits").
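
The absent-vs-zero distinction can be sketched as follows. This
is an illustrative stand-in, not OA's actual provider or
observer code: the function names are hypothetical, and only the
`prompt_tokens_details.cached_tokens` field and the attribute
name come from the text above.

```python
from typing import Any, Optional

CACHE_READ_ATTR = "openarmature.llm.cache_read.input_tokens"


def extract_cached_tokens(usage: dict[str, Any]) -> Optional[int]:
    """Return the reported cached-token count, or None if unreported."""
    details = usage.get("prompt_tokens_details")
    if not isinstance(details, dict):
        return None
    value = details.get("cached_tokens")
    return value if isinstance(value, int) else None


def cache_attributes(usage: dict[str, Any]) -> dict[str, int]:
    """Build span attributes; emit nothing when the provider is silent."""
    cached = extract_cached_tokens(usage)
    if cached is None:
        return {}
    # A reported zero is still emitted: it means "no hits", not "unknown".
    return {CACHE_READ_ATTR: cached}


unreported = cache_attributes({"prompt_tokens": 1200})
zero_hits = cache_attributes(
    {"prompt_tokens": 1200, "prompt_tokens_details": {"cached_tokens": 0}}
)
some_hits = cache_attributes(
    {"prompt_tokens": 1200, "prompt_tokens_details": {"cached_tokens": 1024}}
)
```

If the attribute is missing from a span, check the server flags
first; a present attribute with value `0` means the engine
reported stats but found no reusable prefix.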

## What's out of scope (for now)

- **Specific vendor backends**: Langfuse, PromptLayer, etc.,
77 changes: 66 additions & 11 deletions src/openarmature/llm/providers/openai.py
@@ -714,18 +714,24 @@ def _build_request_body(
body["stop"] = config.stop_sequences
# Pass-through any provider-specific extras (extra="allow"
# on RuntimeConfig); spec §6 mandates implementations MUST
# accept and forward undeclared fields untouched.
# accept and forward undeclared fields untouched. Spec 0047
# §8: canonicalize each extra value at the user-input
# boundary so dict-typed extras (vLLM ``guided_decoding``,
# etc.) render with stable key ordering.
extras = config.model_extra or {}
for k, v in extras.items():
body.setdefault(k, v)
body.setdefault(k, _canonicalize_dict_keys(v))
# response_format is omitted entirely on the fallback path —
# the schema travels in the augmented system message instead.
if schema_dict is not None and include_response_format:
# Spec 0047 §8.1.5 / Q5 ack: response_format.json_schema.schema
# is a user-supplied JSON Schema and flows through the same
# canonicalization path as tool.parameters.
body["response_format"] = {
"type": "json_schema",
"json_schema": {
"name": _derive_schema_name(schema_dict),
"schema": schema_dict,
"schema": _canonicalize_dict_keys(schema_dict),
"strict": strict_mode_supported(schema_dict),
},
}
@@ -752,7 +758,12 @@ def _build_request_body(
}
else:
body["tool_choice"] = tool_choice
return body
# Spec 0047 §8 belt-and-suspenders: walk the assembled body
# once more sorting any dict at every nesting level, in case
# a future code path introduces a user-input boundary the
# per-field canonicalization above doesn't cover. Cheap (the
# body is small) and explicit.
return _canonicalize_dict_keys(body)

# ------------------------------------------------------------------
# Response parsing (spec §8.1.2)
@@ -1132,8 +1143,11 @@ def _message_to_wire(msg: Message) -> dict[str, Any]:
"name": tc.name,
# Canonical compact form (no inter-token spaces). Matches
# the spec's wire-mapping fixture (005, cases shape) and
# the form OpenAI itself emits.
"arguments": json.dumps(tc.arguments or {}, separators=(",", ":")),
# the form OpenAI itself emits. ``sort_keys=True`` per
# spec 0047 §8 — tool-call arguments are a
# caller-supplied dict and the JSON-encoded string
# MUST be byte-stable across equivalent inputs.
"arguments": json.dumps(tc.arguments or {}, separators=(",", ":"), sort_keys=True),
},
}
for tc in msg.tool_calls
@@ -1169,14 +1183,55 @@ def _block_to_wire(block: ContentBlock) -> dict[str, Any]:
return {"type": "image_url", "image_url": image_url}


# Spec 0047 §8 *Intra-impl wire-byte stability* canonicalizer.
# Recursively sorts dict keys at every nesting level; preserves list
# ordering (per Q5 ack on the proposal-0047 coord thread — array
# ORDER is caller-supplied and stays as-is; object KEYS inside
# arrays get sorted via the dict-recursion branch). Applied at every
# user-supplied-dict boundary in the wire body so equivalent OA
# inputs produce byte-identical wire output for APC hit reliability.
#
# Recursion depth: bounded by the depth of the input dict, not by
# any internal accumulator. Python's default recursion limit (1000)
# is two orders of magnitude above realistic JSON Schema depths
# (typical schemas top out at 5-10 nesting levels — OpenAI's API
# rejects deeper ones at the wire layer before the cache prefix
# matters). We don't impose our own cap; if a caller hands us a
# 1000-deep nested dict, RecursionError surfaces immediately at
# canonicalization time rather than producing silently-broken wire
# bytes downstream.
#
# Byte-stability requires Python's dict insertion-order preservation
# guarantee (a language-level guarantee since 3.7) AND httpx
# serializing the body via the stdlib ``json.dumps`` default (which
# respects dict iteration order). Both are stable contracts on the
# supported Python versions + httpx 0.27+. If a future httpx release
# internalizes ordering
# (e.g., switches to alphabetical key emission), the canonicalizer
# becomes redundant but tests would continue to pass; if it
# randomizes ordering, the wire-byte tests in
# ``tests/unit/test_llm_provider.py`` would fail loudly.
def _canonicalize_dict_keys(value: Any) -> Any:
    if isinstance(value, dict):
        return {k: _canonicalize_dict_keys(value[k]) for k in sorted(cast("dict[str, Any]", value))}
    if isinstance(value, list):
        return [_canonicalize_dict_keys(v) for v in cast("list[Any]", value)]
    return value


def _tool_to_wire(tool: Tool) -> dict[str, Any]:
# Per spec 0047 §8 ack (coord Q5): the byte-stability rule covers
# tool DEFINITIONS broadly — not just the parameters subtree.
# Sort the function record's top-level keys + recursively
# canonicalize the parameters JSON Schema.
return {
"type": "function",
"function": {
"name": tool.name,
"description": tool.description,
"parameters": tool.parameters,
},
"function": _canonicalize_dict_keys(
{
"name": tool.name,
"description": tool.description,
"parameters": tool.parameters,
}
),
}

