Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The

### Added

- **Implicit prefix-cache wire-byte stability** (proposal 0047, spec v0.39.0). The OpenAI Chat Completions wire body is now byte-stable across equivalent OA inputs — equivalent calls produce byte-identical request bodies regardless of dict insertion order at every user-supplied-dict boundary (tool definitions including the top-level `function` record + the `parameters` JSON Schema, `response_format.json_schema.schema`, `RuntimeConfig` extras, `tool_call.arguments` JSON encoding). A new `_canonicalize_json_schema` helper recursively sorts dict keys at every nesting level while preserving caller-supplied array ordering (the spec's split between "object keys MUST be sorted" and "array order MUST be preserved per caller-supplied order"). A top-level belt-and-suspenders canonicalization pass over the assembled body catches anything the per-field passes miss. Combined with the existing `Response.usage.cached_tokens` / `cache_creation_tokens` fields sourced from `prompt_tokens_details` (v0.12.0) and the OTel observer's `openarmature.llm.cache_read.input_tokens` + `openarmature.llm.cache_creation.input_tokens` attributes (also v0.12.0), this closes proposal 0047 end-to-end. Prompt-management §13 *Cross-variable substring stability* is satisfied by the existing Jinja2 `StrictUndefined` render path; pinned by a new test. Scope is the Chat Completions endpoint only — the OpenAI Responses API endpoint and the Anthropic / Gemini wire-format mappings are deferred (the providers aren't implemented in python today).
Comment thread
chris-colinsky marked this conversation as resolved.
Outdated
- **`LlmFailedEvent` typed event variant** (proposal 0058, spec v0.53.0). Carves LLM provider failures into a spec-normatively-typed event variant alongside `LlmCompletionEvent`. 17 mirrored identity / scoping / request-side fields + 3 failure-specific fields (`error_category` always-present from the llm-provider §7 normative category enumeration; optional `error_type` for vendor-specific detail or upstream exception class name; always-present `error_message`). `OpenAIProvider.complete()` emits the typed event alongside the §7 exception on both raise paths — adapter-caught provider exceptions AND pre-send validation raises. Caller-side exception flow unchanged; the exception still raises out of `complete()`. Mutually exclusive with `LlmCompletionEvent` on the same call. Both bundled observers (OTel + Langfuse) consume `LlmFailedEvent` directly: same `openarmature.llm.complete` span / Generation shape as the success path with ERROR status / level + `openarmature.error.category` attribute (OTel) / `error_category` as statusMessage (Langfuse), `start_time` back-dated by `latency_ms` so the failure duration reflects the time-to-raise.

### Changed
Expand Down
31 changes: 27 additions & 4 deletions conformance.toml
Original file line number Diff line number Diff line change
Expand Up @@ -266,11 +266,34 @@ status = "implemented"
since = "0.11.0"

# Spec v0.39.0 (proposal 0047). Implicit prefix-cache wire-byte
# stability. Cross-provider invariant requiring intra-impl byte
# equality across calls with equivalent inputs. Queued for v0.13.0
# alongside 0049 (LLM provider hardening + typed event batch).
# stability. Cross-capability proposal landed in v0.13.0 across
# three pieces: (1) ``Response.usage`` cache-stat fields
# (``cached_tokens`` / ``cache_creation_tokens``) sourced from the
# OpenAI ``prompt_tokens_details`` payload, with conditional emission
# preserved (absent-vs-zero distinction stays observable) — landed
# in the v0.12.0 cycle as the proposal's payload-side prerequisite;
# (2) OTel observer emits ``openarmature.llm.cache_read.input_tokens``
# (and optional ``openarmature.llm.cache_creation.input_tokens``)
# when the corresponding usage field is populated — also v0.12.0;
# (3) §8.1 intra-impl wire-byte canonicalization in the OpenAI
# adapter — landed here. The canonicalizer recursively sorts dict
# keys at every nesting level while preserving caller-supplied
# array order, applied at the four user-input boundaries
# (``tool.parameters`` / ``tool.function`` record top-level per
# spec Q5, ``response_format.json_schema.schema``, ``RuntimeConfig``
# extras, ``tool_call.arguments`` JSON encoding) plus a top-level
# belt-and-suspenders pass over the assembled request body. Scope
# is the Chat Completions endpoint only; the OpenAI Responses API
# endpoint is deferred to a future cycle (no python consumer
# today). Prompt-management §13 cross-variable substring stability
# is satisfied by the existing Jinja2 ``StrictUndefined`` render
# path; pinned by ``tests/unit/test_prompts.py::
# test_cross_variable_substring_stability``. Anthropic / Gemini
# wire-byte conformance fixtures stay deferred — neither provider
Comment thread
chris-colinsky marked this conversation as resolved.
# is implemented in python today.
[proposals."0047"]
status = "not-yet"
status = "implemented"
since = "0.13.0"

# Spec v0.40.0 (proposal 0048). Read-symmetric invocation metadata.
# Adds ``get_invocation_metadata()`` symmetric to the existing
Expand Down
67 changes: 67 additions & 0 deletions docs/concepts/prompts.md
Original file line number Diff line number Diff line change
Expand Up @@ -365,6 +365,73 @@ The filesystem backend layout is
`<root>/<label>/<name>.j2`; for the example above,
`./prompts/production/greeting.j2`.

## Prefix-cache friendly authoring (APC)

Inference engines that implement Automatic Prefix Caching
(vLLM with `--enable-prefix-caching`, OpenAI's hosted prompt
caching, llama.cpp's prefix reuse, others) skip recomputing
attention for token prefixes they have already processed in
a recent request. The cache hit is decided by **byte equality**
of the prefix. A single reordered key, a shuffled tool
definition, or a timestamp embedded in the system prompt
invalidates the cache and re-runs full attention from the
first changed byte.

OpenArmature handles the wire-byte half of this contract for
you. The OpenAI provider canonicalizes every user-supplied dict
on the wire — tool parameter schemas, response-format schemas,
`RuntimeConfig` extras, tool-call arguments — so equivalent OA
inputs produce byte-identical wire output regardless of dict
insertion order. Prompt rendering is deterministic by
construction: same `Prompt` plus same variables produces
byte-identical `PromptResult.messages` (spec
[prompt-management §13](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/prompt-management/spec.md#13-determinism)).

Authoring discipline that maximizes APC hit rates is
out of OA's hands — it's about how you structure the prompts.
The spec's [llm-provider §14 *APC-friendly authoring
guidance*](https://github.com/LunarCommand/openarmature-spec/blob/main/spec/llm-provider/spec.md#14-apc-friendly-authoring-guidance-informative)
lists five informative patterns; the headline:

1. **Place variables and chat history at the end of templates.**
Stable static prefix at the front maximizes cacheable bytes.
2. **No timestamps, UUIDs, or other nondeterministic values
in static segments.** They poison the cache prefix on every
request.
3. **Stable few-shot ordering.** Pick once, reuse across
requests; don't shuffle.
4. **Sort retrieval results before injecting** when the
downstream consumer doesn't care about order.
5. **Cache-friendly tool ordering.** Define tools in a stable
order across calls.

### Debugging "the cache attribute isn't showing up"

When the OTel observer is running but
`openarmature.llm.cache_read.input_tokens` doesn't appear on
your `openarmature.llm.complete` spans, the cause is almost
always server-side: the inference engine either isn't
configured to surface cache stats, or isn't running with prefix
caching enabled at all.

- **vLLM**: launch with `--enable-prefix-caching` AND
`--enable-prompt-tokens-details`. The first turns APC on;
the second tells vLLM to populate
`usage.prompt_tokens_details.cached_tokens` on the wire
response. Both flags are required for the attribute to
surface.
- **OpenAI hosted (Chat Completions)**: prompt caching is
on automatically for prompts ≥1024 tokens; the
`prompt_tokens_details.cached_tokens` field appears on
qualifying responses without configuration.

OA's role is to source the field when present (provider-side)
and emit the attribute when populated (observer-side); without
the upstream signal, neither happens — and that's the right
behavior (per the spec's absent-vs-zero distinction, an absent
attribute means "the provider didn't report," not "zero
hits").

## What's out of scope (for now)

- **Specific vendor backends**: Langfuse, PromptLayer, etc.,
Expand Down
77 changes: 66 additions & 11 deletions src/openarmature/llm/providers/openai.py
Original file line number Diff line number Diff line change
Expand Up @@ -714,18 +714,24 @@ def _build_request_body(
body["stop"] = config.stop_sequences
# Pass-through any provider-specific extras (extra="allow"
# on RuntimeConfig); spec §6 mandates implementations MUST
# accept and forward undeclared fields untouched.
# accept and forward undeclared fields untouched. Spec 0047
# §8: canonicalize each extra value at the user-input
# boundary so dict-typed extras (vLLM ``guided_decoding``,
# etc.) render with stable key ordering.
extras = config.model_extra or {}
for k, v in extras.items():
body.setdefault(k, v)
body.setdefault(k, _canonicalize_dict_keys(v))
# response_format is omitted entirely on the fallback path —
# the schema travels in the augmented system message instead.
if schema_dict is not None and include_response_format:
# Spec 0047 §8.1.5 / Q5 ack: response_format.json_schema.schema
# is a user-supplied JSON Schema and flows through the same
# canonicalization path as tool.parameters.
body["response_format"] = {
"type": "json_schema",
"json_schema": {
"name": _derive_schema_name(schema_dict),
"schema": schema_dict,
"schema": _canonicalize_dict_keys(schema_dict),
"strict": strict_mode_supported(schema_dict),
},
}
Expand All @@ -752,7 +758,12 @@ def _build_request_body(
}
else:
body["tool_choice"] = tool_choice
return body
# Spec 0047 §8 belt-and-suspenders: walk the assembled body
# once more sorting any dict at every nesting level, in case
# a future code path introduces a user-input boundary the
# per-field canonicalization above doesn't cover. Cheap (the
# body is small) and explicit.
return _canonicalize_dict_keys(body)

# ------------------------------------------------------------------
# Response parsing (spec §8.1.2)
Expand Down Expand Up @@ -1132,8 +1143,11 @@ def _message_to_wire(msg: Message) -> dict[str, Any]:
"name": tc.name,
# Canonical compact form (no inter-token spaces). Matches
# the spec's wire-mapping fixture (005, cases shape) and
# the form OpenAI itself emits.
"arguments": json.dumps(tc.arguments or {}, separators=(",", ":")),
# the form OpenAI itself emits. ``sort_keys=True`` per
# spec 0047 §8 — tool-call arguments are a
# caller-supplied dict and the JSON-encoded string
# MUST be byte-stable across equivalent inputs.
"arguments": json.dumps(tc.arguments or {}, separators=(",", ":"), sort_keys=True),
},
}
for tc in msg.tool_calls
Expand Down Expand Up @@ -1169,14 +1183,55 @@ def _block_to_wire(block: ContentBlock) -> dict[str, Any]:
return {"type": "image_url", "image_url": image_url}


# Spec 0047 §8 *Intra-impl wire-byte stability* canonicalizer.
# Recursively sorts dict keys at every nesting level; preserves list
# ordering (per Q5 ack on the proposal-0047 coord thread — array
# ORDER is caller-supplied and stays as-is; object KEYS inside
# arrays get sorted via the dict-recursion branch). Applied at every
# user-supplied-dict boundary in the wire body so equivalent OA
# inputs produce byte-identical wire output for APC hit reliability.
#
# Recursion depth: bounded by the depth of the input dict, not by
# any internal accumulator. Python's default recursion limit (1000)
# is two orders of magnitude above realistic JSON Schema depths
# (typical schemas top out at 5-10 nesting levels — OpenAI's API
# rejects deeper ones at the wire layer before the cache prefix
# matters). We don't impose our own cap; if a caller hands us a
# 1000-deep nested dict, RecursionError surfaces immediately at
# canonicalization time rather than producing silently-broken wire
# bytes downstream.
#
# Byte-stability requires Python's dict insertion-order preservation
# guarantee (PEP 468, 3.7+) AND httpx serializing the body via the
# stdlib ``json.dumps`` default (which respects dict iteration
# order). Both are stable contracts on the supported Python versions
# + httpx 0.27+. If a future httpx release internalizes ordering
# (e.g., switches to alphabetical key emission), the canonicalizer
# becomes redundant but tests would continue to pass; if it
# randomizes ordering, the wire-byte tests in
# ``tests/unit/test_llm_provider.py`` would fail loudly.
def _canonicalize_dict_keys(value: Any) -> Any:
if isinstance(value, dict):
return {k: _canonicalize_dict_keys(value[k]) for k in sorted(cast("dict[str, Any]", value))}
if isinstance(value, list):
return [_canonicalize_dict_keys(v) for v in cast("list[Any]", value)]
return value


def _tool_to_wire(tool: Tool) -> dict[str, Any]:
# Per spec 0047 §8 ack (coord Q5): the byte-stability rule covers
# tool DEFINITIONS broadly — not just the parameters subtree.
# Sort the function record's top-level keys + recursively
# canonicalize the parameters JSON Schema.
return {
"type": "function",
"function": {
"name": tool.name,
"description": tool.description,
"parameters": tool.parameters,
},
"function": _canonicalize_dict_keys(
{
"name": tool.name,
"description": tool.description,
"parameters": tool.parameters,
}
),
}


Expand Down
Loading