6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to `openarmature-python` are documented in this file.

The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The package follows [Semantic Versioning](https://semver.org/); pre-1.0 minor bumps may carry behavioral changes per [spec governance](https://github.com/LunarCommand/openarmature-spec/blob/main/GOVERNANCE.md).

## [Unreleased]

### Changed (breaking)

- **`OpenAIProvider.ready()` default probe flipped to `chat_completions`.** A new constructor kwarg `readiness_probe: Literal["models", "chat_completions", "both"]` selects which wire path `ready()` exercises. The default is now the chat-completions path (`POST /v1/chat/completions` with `max_tokens=1`), which exercises the inference path itself. The previous catalog-only behavior remains available as `readiness_probe="models"`, and `readiness_probe="both"` runs catalog then chat for the strongest signal. Motivation: OpenAI-compatible proxies (Bifrost and similar) can return 200 on `GET /v1/models` while rejecting `POST /v1/chat/completions`, leaving the catalog probe green while every real call fails. The new default surfaces that class of failure at preflight rather than at first inference. Non-200 responses from either probe route through `classify_http_error`, so the canonical error categories (`provider_authentication`, `provider_unavailable`, `provider_invalid_model`, etc.) surface consistently. Callers that depended on the catalog-only behavior (for example, cost-sensitive cloud setups where every `ready()` would now bill prompt tokens) can opt back in by passing `readiness_probe="models"`.

## [0.11.0] — 2026-06-01

Observability + prompt-management release. The pinned spec advances from v0.27.1 to v0.38.0, absorbing eight accepted proposals (0039-0046). Two headlines: (1) the Langfuse observer grows native `trace.input` / `trace.output` sourcing with caller hooks (0043) and the per-async-context augmentation boundary becomes lineage-aware for nested fan-out / parallel-branches topologies (0045); (2) prompt-management gains a Chat-prompt variant alongside the existing Text-prompt (0046) and `LangfusePromptBackend` lands for both Langfuse text and chat prompts. Caller-supplied `invocation_id` (0039), mid-invocation open-span metadata update (0040), three reserved-key surfaces (0041 + 0042), and the parallel-branches OTel dispatch span (0044) round out the cycle.
4 changes: 4 additions & 0 deletions docs/agent/non-obvious-shapes.md
@@ -80,6 +80,10 @@ A common shape is "after this LLM call, route to either a JSON-extraction node o

When the branches operate on different sub-shapes of state — e.g., one path is "extract JSON, then validate" while another is "dispatch tools, loop until done, then summarize" — encapsulate each as a `SubgraphNode` and route from the LLM node to the right subgraph. Each subgraph has its own state schema (projected from the parent), its own entry node, and its own internal topology. The parent graph becomes a switchboard with a few edges; the complexity lives one layer down where it composes cleanly.

### `OpenAIProvider.ready()` exercises `chat/completions` by default; opt back into the catalog-only probe for cost-sensitive callers

`OpenAIProvider(..., readiness_probe=...)` accepts `"chat_completions"` (default), `"models"`, or `"both"`. The default issues `POST /v1/chat/completions` with a `max_tokens=1` body so a green `ready()` actually proves the inference wire path works, not just that the catalog endpoint answers. The motivating failure class: OpenAI-compatible proxies (Bifrost is the field-reported case) that return 200 on `GET /v1/models` while 405'ing the completions endpoint — the previous catalog-only default reported ready and every real call broke. The `"models"` opt-in is the old behavior, useful for cost-sensitive cloud callers where every `ready()` would otherwise bill one prompt's worth of tokens. `"both"` runs catalog then chat — strongest signal at double the cost. Non-200 responses on either probe route through `classify_http_error`, so the canonical error categories (`ProviderAuthentication`, `ProviderUnavailable`, `ProviderInvalidModel`, etc.) surface consistently regardless of which probe ran.

### Be explicit with `tool_choice`; don't trust the provider's default

`Provider.complete(messages, tools, tool_choice=...)` accepts `"auto"`, `"required"`, `"none"`, or a `ForceTool(name=...)` record. When you omit `tool_choice`, the OpenAI provider's own default applies — usually `"auto"` when `tools` is non-empty, but documented per-provider. A pipeline that wants deterministic tool-calling (a routing node that MUST produce a tool call, a guarded LLM call that MUST NOT call tools) should pin `tool_choice` explicitly rather than relying on the provider default.
64 changes: 64 additions & 0 deletions docs/concepts/llms.md
@@ -85,6 +85,70 @@ stateless calls. Conversational memory (if you want it) is the
caller's responsibility: thread it through state and pass the
accumulated message list into each call.

## Pre-flight readiness check

`Provider.ready()` is the optional pre-flight call you make before
your application starts taking real traffic. It raises one of the
canonical [`LlmProviderError`](../reference/llm.md) categories on
failure and returns `None` on success, so a typical startup hook
looks like:

```python
async def startup() -> None:
    provider = _get_provider()
    try:
        await provider.ready()
    except ProviderAuthentication:
        # Bad API key: fail fast at boot.
        raise
    except ProviderInvalidModel:
        # Bound model isn't served by this endpoint: same.
        raise
    except ProviderUnavailable:
        # Endpoint is down or unreachable: fail fast too.
        raise
```

`OpenAIProvider` ships three probe shapes selected via the
`readiness_probe` constructor kwarg:

- **`"chat_completions"`** (default): issues `POST /v1/chat/completions`
  with a `max_tokens=1` body, so it exercises the inference wire path
  itself. A stronger signal than the catalog probe, at the cost of one
  prompt's worth of tokens on cloud endpoints.
- **`"models"`**: issues `GET /v1/models` and verifies the bound
  model appears in the catalog. Cheaper (no completion billing) but
  blind to proxy wire-mismatch cases: some OpenAI-compatible proxies
  (Bifrost is the motivating example) serve `/v1/models` correctly
  while returning 405 on the completions endpoint, so a green catalog
  probe doesn't prove `complete()` will work.
- **`"both"`**: runs the catalog probe first (cheap fail-fast on
  model-not-in-catalog with the cleaner `seen_ids` diagnostic), then
  the chat probe. Strongest signal, at double the round-trip cost.

```python
import os

# Module path taken from the source file; a shorter re-export may exist.
from openarmature.llm.providers.openai import OpenAIProvider

# Local server (LM Studio, vLLM, llama.cpp): chat probe is free.
provider = OpenAIProvider(
    base_url="http://localhost:8000",
    model="qwen2.5-coder",
    readiness_probe="chat_completions",  # default
)

# Cloud endpoint, cost-sensitive: opt back into the catalog-only probe.
provider = OpenAIProvider(
    base_url="https://api.openai.com",
    model="gpt-4o-mini",
    api_key=os.environ["LLM_API_KEY"],
    readiness_probe="models",
)
```
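
The ordering of the three modes can be modeled without any network.
The sketch below is a toy stand-in, not the library's class:
`ProbeDispatcher`, `probes_run`, and the recorded strings are
hypothetical names for illustration, and each probe only logs the
request it would send. It also shows why the mode string needs a
runtime check, since a `Literal` annotation alone does not reject a
typo.

```python
import asyncio

VALID_PROBES = frozenset({"models", "chat_completions", "both"})


class ProbeDispatcher:
    """Toy stand-in for the provider: records which probes ran, in order."""

    def __init__(self, readiness_probe: str = "chat_completions") -> None:
        # Literal annotations are not enforced at runtime, so validate here;
        # an unknown string would otherwise skip both branches in ready().
        if readiness_probe not in VALID_PROBES:
            raise ValueError(
                f"readiness_probe must be one of {sorted(VALID_PROBES)} "
                f"(got {readiness_probe!r})"
            )
        self._readiness_probe = readiness_probe
        self.calls: list[str] = []

    async def _probe_models(self) -> None:
        # Assumption: the real probe sends this request; here we only log it.
        self.calls.append("GET /v1/models")

    async def _probe_chat_completions(self) -> None:
        self.calls.append("POST /v1/chat/completions")

    async def ready(self) -> None:
        # Catalog first (cheap), then chat, matching the "both" ordering.
        if self._readiness_probe in ("models", "both"):
            await self._probe_models()
        if self._readiness_probe in ("chat_completions", "both"):
            await self._probe_chat_completions()


def probes_run(mode: str) -> list[str]:
    """Run ready() for a mode and return the requests it would issue."""
    dispatcher = ProbeDispatcher(mode)
    asyncio.run(dispatcher.ready())
    return dispatcher.calls
```

In this model `probes_run("both")` records the catalog request followed
by the chat request, while a misspelled mode such as `"model"` raises
`ValueError` at construction instead of silently skipping both probes.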

The chat probe is the default because the catalog probe's
false-green failure mode (Bifrost-style proxy mismatch) is silent at
boot but fatal at first real call, and that's worse than the extra
token spend for the small set of cost-sensitive callers who can opt
out explicitly.
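
The false-green case can be reproduced on paper. The following sketch
assumes a simulated proxy described by a hypothetical `PROXY_ROUTES`
table (200 on the catalog, 405 on completions) and reduces each probe
to "every request returned 200". The real catalog probe also checks
that the bound model is listed, which this toy version skips.

```python
# Hypothetical proxy behaviour: the catalog works, completions are rejected.
PROXY_ROUTES = {
    ("GET", "/v1/models"): 200,
    ("POST", "/v1/chat/completions"): 405,
}

# Requests each readiness mode issues, in order.
MODE_REQUESTS = {
    "models": [("GET", "/v1/models")],
    "chat_completions": [("POST", "/v1/chat/completions")],
    "both": [("GET", "/v1/models"), ("POST", "/v1/chat/completions")],
}


def status_for(method: str, path: str) -> int:
    """Status the simulated proxy returns for a request (404 if unrouted)."""
    return PROXY_ROUTES.get((method, path), 404)


def probe_passes(mode: str) -> bool:
    """True when every request the mode issues comes back 200."""
    return all(status_for(m, p) == 200 for m, p in MODE_REQUESTS[mode])
```

Under these assumptions `probe_passes("models")` is `True` while
`probe_passes("chat_completions")` and `probe_passes("both")` are
`False`, which is the gap the new default closes.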

## Structured output

Every LLM-using node that produces typed data ends up with the same
4 changes: 4 additions & 0 deletions src/openarmature/AGENTS.md
@@ -943,6 +943,10 @@ A common shape is "after this LLM call, route to either a JSON-extraction node o

When the branches operate on different sub-shapes of state — e.g., one path is "extract JSON, then validate" while another is "dispatch tools, loop until done, then summarize" — encapsulate each as a `SubgraphNode` and route from the LLM node to the right subgraph. Each subgraph has its own state schema (projected from the parent), its own entry node, and its own internal topology. The parent graph becomes a switchboard with a few edges; the complexity lives one layer down where it composes cleanly.

### `OpenAIProvider.ready()` exercises `chat/completions` by default; opt back into the catalog-only probe for cost-sensitive callers

`OpenAIProvider(..., readiness_probe=...)` accepts `"chat_completions"` (default), `"models"`, or `"both"`. The default issues `POST /v1/chat/completions` with a `max_tokens=1` body so a green `ready()` actually proves the inference wire path works, not just that the catalog endpoint answers. The motivating failure class: OpenAI-compatible proxies (Bifrost is the field-reported case) that return 200 on `GET /v1/models` while 405'ing the completions endpoint — the previous catalog-only default reported ready and every real call broke. The `"models"` opt-in is the old behavior, useful for cost-sensitive cloud callers where every `ready()` would otherwise bill one prompt's worth of tokens. `"both"` runs catalog then chat — strongest signal at double the cost. Non-200 responses on either probe route through `classify_http_error`, so the canonical error categories (`ProviderAuthentication`, `ProviderUnavailable`, `ProviderInvalidModel`, etc.) surface consistently regardless of which probe ran.

### Be explicit with `tool_choice`; don't trust the provider's default

`Provider.complete(messages, tools, tool_choice=...)` accepts `"auto"`, `"required"`, `"none"`, or a `ForceTool(name=...)` record. When you omit `tool_choice`, the OpenAI provider's own default applies — usually `"auto"` when `tools` is non-empty, but documented per-provider. A pipeline that wants deterministic tool-calling (a routing node that MUST produce a tool call, a guarded LLM call that MUST NOT call tools) should pin `tool_choice` explicitly rather than relying on the provider default.
123 changes: 101 additions & 22 deletions src/openarmature/llm/providers/openai.py
@@ -22,20 +22,30 @@
 | HTTP 5xx (other) | provider_unavailable |
 | 200 OK that fails to parse into Response shape | provider_invalid_response |
 
-**``ready()`` probe.** Hits ``GET /v1/models`` and:
-
-- 401/403 → ``provider_authentication``.
-- 5xx / connection error → ``provider_unavailable``.
-- 200 + bound model in returned list → success.
-- 200 + bound model NOT in list → ``provider_invalid_model``.
-
-The ``provider_model_not_loaded`` distinction needs a server-specific
-probe (LM Studio's loaded-vs-configured endpoint, vLLM's health
-endpoint, llama.cpp's runtime-status endpoint) that this base
-provider can't generically emit. Subclasses or purpose-built
-local-server provider variants close that gap; the base
-``OpenAIProvider`` documents the limitation here rather than silently
-treating "model in catalog" as "model loaded."
+**``ready()`` probe.** Three modes selected by the constructor kwarg
+``readiness_probe``:
+
+- ``"chat_completions"`` (default): issues ``POST /v1/chat/completions``
+  with a minimal ``max_tokens=1`` body. Actually exercises the inference
+  path, so OpenAI-compatible proxies (Bifrost, custom gateways) that
+  return 200 on ``GET /v1/models`` but reject ``POST /v1/chat/completions``
+  surface immediately rather than at first real call.
+- ``"models"``: hits ``GET /v1/models`` and verifies the bound model
+  appears in the returned catalog. Cheaper (no completion tokens
+  billed) but doesn't catch the wire-mismatch case above.
+- ``"both"``: runs the catalog probe first, then the chat probe.
+  Catalog check short-circuits with the cleaner "model not in catalog"
+  diagnostic before any billable call.
+
+All three modes map non-200 responses through ``classify_http_error``
+so the canonical error categories (``provider_authentication``,
+``provider_unavailable``, ``provider_invalid_model``,
+``provider_model_not_loaded``, ``provider_rate_limit``) surface
+consistently regardless of which probe ran.
+
+The previous default was ``"models"``; flipped to ``"chat_completions"``
+because the catalog probe missed a real failure class (proxy
+wire-format mismatch) in field use.
 """
 
 from __future__ import annotations
@@ -104,6 +114,13 @@
 )
 from ..response import FinishReason, ParsedValue, Response, RuntimeConfig, Usage
 
+# Runtime guard for ``OpenAIProvider(..., readiness_probe=...)``. The
+# Literal type narrows callers under static checkers but is not enforced
+# at runtime, so an unknown string would silently no-op both dispatch
+# branches in ``ready()`` and return None, a false-green readiness
+# signal. Validate in ``__init__`` against this set instead.
+_VALID_READINESS_PROBES = frozenset({"models", "chat_completions", "both"})
+
 
 class OpenAIProvider:
     """OpenAI Chat Completions wire-compatible provider.
@@ -139,6 +156,7 @@ def __init__(
         timeout: float = 60.0,
         force_prompt_augmentation_fallback: bool = False,
         genai_system: str = "openai",
+        readiness_probe: Literal["models", "chat_completions", "both"] = "chat_completions",
     ) -> None:
         self.base_url = _validate_and_normalize_base_url(base_url)
         self.model = model
@@ -157,6 +175,20 @@ def __init__(
         # those servers, and a wrong inference is worse than the explicit
         # opt-in.
         self._genai_system = genai_system
+        # ``readiness_probe`` selects which wire path ``ready()`` exercises.
+        # The default ``"chat_completions"`` actually tests inference; the
+        # opt-in ``"models"`` is the older catalog-only probe for
+        # cost-sensitive cloud callers (every chat probe bills prompt
+        # tokens). ``"both"`` runs catalog then chat for the strongest
+        # signal at double the round-trip cost. Same explicit-opt-in
+        # rationale as ``genai_system``: no base_url sniffing, since the
+        # right probe shape depends on what's on the other end and a
+        # wrong inference is worse than a wrong default.
+        if readiness_probe not in _VALID_READINESS_PROBES:
+            raise ValueError(
+                f"readiness_probe must be one of {sorted(_VALID_READINESS_PROBES)} (got {readiness_probe!r})"
+            )
+        self._readiness_probe = readiness_probe
         self._headers: dict[str, str] = {"Content-Type": "application/json"}
         if api_key is not None:
             self._headers["Authorization"] = f"Bearer {api_key}"
@@ -188,20 +220,36 @@ async def aclose(self) -> None:
     # ------------------------------------------------------------------
 
     async def ready(self) -> None:
-        """Verify the bound model is reachable and listed by the
-        provider. Hits ``GET /v1/models`` and matches ``self.model``
-        against the returned ``data[].id`` entries."""
+        """Verify the bound model is reachable. Dispatches on the
+        ``readiness_probe`` mode chosen at construction:
+
+        - ``"chat_completions"`` (default) issues a ``max_tokens=1``
+          chat call against ``POST /v1/chat/completions``.
+        - ``"models"`` issues ``GET /v1/models`` and matches
+          ``self.model`` against the returned ``data[].id`` entries.
+        - ``"both"`` runs the catalog probe first (cheaper, surfaces
+          model-not-in-catalog with the catalog diagnostic), then the
+          chat probe.
+        """
+        if self._readiness_probe in ("models", "both"):
+            await self._probe_models()
+        if self._readiness_probe in ("chat_completions", "both"):
+            await self._probe_chat_completions()
+
+    async def _probe_models(self) -> None:
+        """Catalog probe: ``GET /v1/models`` + bound-model presence
+        check. Cheaper than the chat probe (no completion tokens
+        billed) and surfaces the model-not-in-catalog case with the
+        cleaner ``seen_ids`` diagnostic; misses wire-format mismatches
+        on proxies that serve the catalog correctly but reject
+        completions."""
         try:
             resp = await self._client.get("/v1/models")
         except httpx.HTTPError as exc:
             raise ProviderUnavailable(str(exc)) from exc
 
-        if resp.status_code in (401, 403):
-            raise ProviderAuthentication(f"GET /v1/models returned {resp.status_code}")
-        if 500 <= resp.status_code < 600:
-            raise ProviderUnavailable(f"GET /v1/models returned {resp.status_code}")
         if resp.status_code != 200:
-            raise ProviderUnavailable(f"GET /v1/models returned unexpected {resp.status_code}")
+            raise classify_http_error(resp)
 
         try:
             body_raw = resp.json()
@@ -246,6 +294,37 @@ async def ready(self) -> None:
                 f"model {self.model!r} is configured but not loaded (status={status_field!r})"
             )
 
+    async def _probe_chat_completions(self) -> None:
+        """Inference probe: ``POST /v1/chat/completions`` with a
+        ``max_tokens=1`` body. Surfaces wire-format mismatches that
+        the catalog probe can't see (the motivating case:
+        Bifrost-style proxies that 200 on ``/v1/models`` but 405/404 on
+        ``/v1/chat/completions``). Bills one prompt's worth of tokens
+        on cloud endpoints, which is why this defaults on but is
+        opt-out via ``readiness_probe="models"``."""
+        body = {
+            "model": self.model,
+            "messages": [{"role": "user", "content": "."}],
+            "max_tokens": 1,
+        }
+        try:
+            resp = await self._client.post("/v1/chat/completions", json=body)
+        except httpx.HTTPError as exc:
+            raise ProviderUnavailable(str(exc)) from exc
+        if resp.status_code != 200:
+            raise classify_http_error(resp)
+        # Validate the response shape so a proxy answering 200 with an
+        # error payload or non-OpenAI-shape JSON doesn't pass the probe.
+        # Mirrors ``_do_complete``'s parse step. The returned Response
+        # is discarded; the validation itself is the point.
+        try:
+            payload_raw = resp.json()
+        except ValueError as exc:
+            raise ProviderInvalidResponse("POST /v1/chat/completions returned non-JSON body") from exc
+        if not isinstance(payload_raw, dict):
+            raise ProviderInvalidResponse("POST /v1/chat/completions returned a non-object body")
+        self._parse_response(cast("dict[str, Any]", payload_raw), None, None)
+
     # ------------------------------------------------------------------
     # complete(): single completion call
     # ------------------------------------------------------------------