Skip to content

Commit 0661af0

Browse files
Harden OpenAIProvider readiness probe (#109)
* Harden OpenAIProvider readiness probe Add a ``readiness_probe`` constructor kwarg accepting "models", "chat_completions", or "both", and flip the default from the older catalog-only ``GET /v1/models`` probe to a new ``POST /v1/chat/ completions`` probe with ``max_tokens=1``. Motivation: OpenAI-compatible proxies can return 200 on the catalog endpoint while rejecting completions (Bifrost is the field-reported case), so the previous probe reported ready while every real call failed. The new default actually exercises the inference wire path, so that failure class surfaces at preflight. Non-200 chat-probe responses route through ``classify_http_error`` so canonical error categories surface consistently. Catalog-only behavior remains opt-in for cost-sensitive cloud callers. Conformance harness picks ``readiness_probe`` mode from the fixture's mocked ``health_endpoint.path`` so fixture 007's catalog semantics keep working without spec changes. * Tighten readiness probe per CoPilot review Three findings on PR #109: 1. Runtime guard for ``readiness_probe``. The Literal type is a static hint; an unknown string would silently no-op both dispatch branches in ``ready()`` and report ready. Validate in ``__init__`` against a module-level frozenset and raise ValueError. 2. Route ``_probe_models`` non-200 responses through ``classify_http_error``. Previously hard-coded 401/403 to ProviderAuthentication and everything-else to ProviderUnavailable, missing ProviderRateLimit (429), ProviderModelNotLoaded (503+marker), and ProviderInvalidModel (404+marker). The docstring's mode-independence claim is now true. 3. Validate ``_probe_chat_completions`` 200 response shape. A proxy answering 200 with an error payload or non-OpenAI-shape JSON previously passed the probe. Mirror ``_do_complete``'s parse + ``_parse_response(payload, None, None)`` step. Adds five new tests covering: invalid mode at construction, catalog probe 429 → ProviderRateLimit, catalog probe 503+marker → ProviderModelNotLoaded, chat probe 200 with error payload, and chat probe 200 with non-JSON body.
1 parent e07a3e4 commit 0661af0

7 files changed

Lines changed: 587 additions & 24 deletions

File tree

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,12 @@ All notable changes to `openarmature-python` are documented in this file.
44

55
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The package follows [Semantic Versioning](https://semver.org/); pre-1.0 minor bumps may carry behavioral changes per [spec governance](https://github.com/LunarCommand/openarmature-spec/blob/main/GOVERNANCE.md).
66

7+
## [Unreleased]
8+
9+
### Changed (breaking)
10+
11+
- **`OpenAIProvider.ready()` default probe flipped to `chat_completions`.** A new constructor kwarg `readiness_probe: Literal["models", "chat_completions", "both"]` selects which wire path `ready()` exercises; the default is now the chat-completions path (`POST /v1/chat/completions` with `max_tokens=1`), which actually exercises the inference path. The previous catalog-only behavior is still available as `readiness_probe="models"`, and `readiness_probe="both"` runs catalog then chat for the strongest signal. Motivation: OpenAI-compatible proxies (Bifrost and similar) can return 200 on `GET /v1/models` while rejecting `POST /v1/chat/completions`, leaving the catalog probe green while every real call fails. The new default surfaces that class of failure at preflight rather than at first inference. Non-200 chat-probe responses route through `classify_http_error`, so the canonical error categories (`provider_authentication`, `provider_unavailable`, `provider_invalid_model`, etc.) surface consistently. Callers that depended on the catalog-only behavior (cost-sensitive cloud setups where every `ready()` would now bill prompt tokens) can opt back in by passing `readiness_probe="models"`.
12+
713
## [0.11.0] — 2026-06-01
814

915
Observability + prompt-management release. The pinned spec advances from v0.27.1 to v0.38.0, absorbing eight accepted proposals (0039-0046). Two headlines: (1) the Langfuse observer grows native `trace.input` / `trace.output` sourcing with caller hooks (0043) and the per-async-context augmentation boundary becomes lineage-aware for nested fan-out / parallel-branches topologies (0045); (2) prompt-management gains a Chat-prompt variant alongside the existing Text-prompt (0046) and `LangfusePromptBackend` lands for both Langfuse text and chat prompts. Caller-supplied `invocation_id` (0039), mid-invocation open-span metadata update (0040), three reserved-key surfaces (0041 + 0042), and the parallel-branches OTel dispatch span (0044) round out the cycle.

docs/agent/non-obvious-shapes.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,10 @@ A common shape is "after this LLM call, route to either a JSON-extraction node o
8080

8181
When the branches operate on different sub-shapes of state — e.g., one path is "extract JSON, then validate" while another is "dispatch tools, loop until done, then summarize" — encapsulate each as a `SubgraphNode` and route from the LLM node to the right subgraph. Each subgraph has its own state schema (projected from the parent), its own entry node, and its own internal topology. The parent graph becomes a switchboard with a few edges; the complexity lives one layer down where it composes cleanly.
8282

83+
### `OpenAIProvider.ready()` exercises `chat/completions` by default; opt back into the catalog-only probe for cost-sensitive callers
84+
85+
`OpenAIProvider(..., readiness_probe=...)` accepts `"chat_completions"` (default), `"models"`, or `"both"`. The default issues `POST /v1/chat/completions` with a `max_tokens=1` body so a green `ready()` actually proves the inference wire path works, not just that the catalog endpoint answers. The motivating failure class: OpenAI-compatible proxies (Bifrost is the field-reported case) that return 200 on `GET /v1/models` while 405'ing the completions endpoint — the previous catalog-only default reported ready and every real call broke. The `"models"` opt-in is the old behavior, useful for cost-sensitive cloud callers where every `ready()` would otherwise bill one prompt's worth of tokens. `"both"` runs catalog then chat — strongest signal at double the cost. Non-200 responses on either probe route through `classify_http_error`, so the canonical error categories (`ProviderAuthentication`, `ProviderUnavailable`, `ProviderInvalidModel`, etc.) surface consistently regardless of which probe ran.
86+
8387
### Be explicit with `tool_choice`; don't trust the provider's default
8488

8589
`Provider.complete(messages, tools, tool_choice=...)` accepts `"auto"`, `"required"`, `"none"`, or a `ForceTool(name=...)` record. When you omit `tool_choice`, the OpenAI provider's own default applies — usually `"auto"` when `tools` is non-empty, but documented per-provider. A pipeline that wants deterministic tool-calling (a routing node that MUST produce a tool call, a guarded LLM call that MUST NOT call tools) should pin `tool_choice` explicitly rather than relying on the provider default.

docs/concepts/llms.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -85,6 +85,70 @@ stateless calls. Conversational memory (if you want it) is the
8585
caller's responsibility: thread it through state and pass the
8686
accumulated message list into each call.
8787

88+
## Pre-flight readiness check
89+
90+
`Provider.ready()` is the optional pre-flight call you make before
91+
your application starts taking real traffic. It raises one of the
92+
canonical [`LlmProviderError`](../reference/llm.md) categories on
93+
failure and returns `None` on success, so a typical startup hook
94+
looks like:
95+
96+
```python
97+
async def startup() -> None:
98+
provider = _get_provider()
99+
try:
100+
await provider.ready()
101+
except ProviderAuthentication:
102+
# Bad API key — fail fast at boot.
103+
raise
104+
except ProviderInvalidModel:
105+
# Bound model isn't served by this endpoint — same.
106+
raise
107+
except ProviderUnavailable:
108+
# Endpoint is down or unreachable — fail fast too.
109+
raise
110+
```
111+
112+
`OpenAIProvider` ships three probe shapes selected via the
113+
`readiness_probe` constructor kwarg:
114+
115+
- **`"chat_completions"`** (default) — issues `POST /v1/chat/completions`
116+
with a `max_tokens=1` body. Actually exercises the inference wire
117+
path. Strongest signal at the cost of one prompt's worth of tokens
118+
on cloud endpoints.
119+
- **`"models"`** — issues `GET /v1/models` and verifies the bound
120+
model appears in the catalog. Cheaper (no completion billing) but
121+
blind to proxy wire-mismatch cases: some OpenAI-compatible proxies
122+
(Bifrost is the motivating example) serve `/v1/models` correctly
123+
while 405'ing the completions endpoint, so a green catalog probe
124+
doesn't prove `complete()` will work.
125+
- **`"both"`** — runs the catalog probe first (cheap fail-fast on
126+
model-not-in-catalog with the cleaner `seen_ids` diagnostic), then
127+
the chat probe. Strongest signal at double the round-trip cost.
128+
129+
```python
130+
# Local server (LM Studio, vLLM, llama.cpp) — chat probe is free.
131+
provider = OpenAIProvider(
132+
base_url="http://localhost:8000",
133+
model="qwen2.5-coder",
134+
readiness_probe="chat_completions", # default
135+
)
136+
137+
# Cloud endpoint, cost-sensitive — opt back into the catalog-only probe.
138+
provider = OpenAIProvider(
139+
base_url="https://api.openai.com",
140+
model="gpt-4o-mini",
141+
api_key=os.environ["LLM_API_KEY"],
142+
readiness_probe="models",
143+
)
144+
```
145+
146+
The chat probe is the default because the catalog probe's
147+
false-green failure mode (Bifrost-style proxy mismatch) is silent at
148+
boot but fatal at first real call, and that's worse than the extra
149+
token spend for the small set of cost-sensitive callers who can opt
150+
out explicitly.
151+
88152
## Structured output
89153

90154
Every LLM-using node that produces typed data ends up with the same

src/openarmature/AGENTS.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -943,6 +943,10 @@ A common shape is "after this LLM call, route to either a JSON-extraction node o
943943

944944
When the branches operate on different sub-shapes of state — e.g., one path is "extract JSON, then validate" while another is "dispatch tools, loop until done, then summarize" — encapsulate each as a `SubgraphNode` and route from the LLM node to the right subgraph. Each subgraph has its own state schema (projected from the parent), its own entry node, and its own internal topology. The parent graph becomes a switchboard with a few edges; the complexity lives one layer down where it composes cleanly.
945945

946+
### `OpenAIProvider.ready()` exercises `chat/completions` by default; opt back into the catalog-only probe for cost-sensitive callers
947+
948+
`OpenAIProvider(..., readiness_probe=...)` accepts `"chat_completions"` (default), `"models"`, or `"both"`. The default issues `POST /v1/chat/completions` with a `max_tokens=1` body so a green `ready()` actually proves the inference wire path works, not just that the catalog endpoint answers. The motivating failure class: OpenAI-compatible proxies (Bifrost is the field-reported case) that return 200 on `GET /v1/models` while 405'ing the completions endpoint — the previous catalog-only default reported ready and every real call broke. The `"models"` opt-in is the old behavior, useful for cost-sensitive cloud callers where every `ready()` would otherwise bill one prompt's worth of tokens. `"both"` runs catalog then chat — strongest signal at double the cost. Non-200 responses on either probe route through `classify_http_error`, so the canonical error categories (`ProviderAuthentication`, `ProviderUnavailable`, `ProviderInvalidModel`, etc.) surface consistently regardless of which probe ran.
949+
946950
### Be explicit with `tool_choice`; don't trust the provider's default
947951

948952
`Provider.complete(messages, tools, tool_choice=...)` accepts `"auto"`, `"required"`, `"none"`, or a `ForceTool(name=...)` record. When you omit `tool_choice`, the OpenAI provider's own default applies — usually `"auto"` when `tools` is non-empty, but documented per-provider. A pipeline that wants deterministic tool-calling (a routing node that MUST produce a tool call, a guarded LLM call that MUST NOT call tools) should pin `tool_choice` explicitly rather than relying on the provider default.

src/openarmature/llm/providers/openai.py

Lines changed: 101 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -22,20 +22,30 @@
2222
| HTTP 5xx (other) | provider_unavailable |
2323
| 200 OK that fails to parse into Response shape | provider_invalid_response |
2424
25-
**``ready()`` probe.** Hits ``GET /v1/models`` and:
26-
27-
- 401/403 → ``provider_authentication``.
28-
- 5xx / connection error → ``provider_unavailable``.
29-
- 200 + bound model in returned list → success.
30-
- 200 + bound model NOT in list → ``provider_invalid_model``.
31-
32-
The ``provider_model_not_loaded`` distinction needs a server-specific
33-
probe (LM Studio's loaded-vs-configured endpoint, vLLM's health
34-
endpoint, llama.cpp's runtime-status endpoint) that this base
35-
provider can't generically emit. Subclasses or purpose-built
36-
local-server provider variants close that gap; the base
37-
``OpenAIProvider`` documents the limitation here rather than silently
38-
treating "model in catalog" as "model loaded."
25+
**``ready()`` probe.** Three modes selected by the constructor kwarg
26+
``readiness_probe``:
27+
28+
- ``"chat_completions"`` (default) — issues ``POST /v1/chat/completions``
29+
with a minimal ``max_tokens=1`` body. Actually exercises the inference
30+
path, so OpenAI-compatible proxies (Bifrost, custom gateways) that
31+
return 200 on ``GET /v1/models`` but reject ``POST /v1/chat/completions``
32+
surface immediately rather than at first real call.
33+
- ``"models"`` — hits ``GET /v1/models`` and verifies the bound model
34+
appears in the returned catalog. Cheaper (no completion tokens
35+
billed) but doesn't catch the wire-mismatch case above.
36+
- ``"both"`` — runs the catalog probe first, then the chat probe.
37+
Catalog check short-circuits with the cleaner "model not in catalog"
38+
diagnostic before any billable call.
39+
40+
All three modes map non-200 responses through ``classify_http_error``
41+
so the canonical error categories (``provider_authentication``,
42+
``provider_unavailable``, ``provider_invalid_model``,
43+
``provider_model_not_loaded``, ``provider_rate_limit``) surface
44+
consistently regardless of which probe ran.
45+
46+
The previous default was ``"models"``; flipped to ``"chat_completions"``
47+
because the catalog probe missed a real failure class (proxy wire-
48+
format mismatch) in field use.
3949
"""
4050

4151
from __future__ import annotations
@@ -104,6 +114,13 @@
104114
)
105115
from ..response import FinishReason, ParsedValue, Response, RuntimeConfig, Usage
106116

117+
# Runtime guard for ``OpenAIProvider(..., readiness_probe=...)``. The
118+
# Literal type narrows callers under static checkers but is not enforced
119+
# at runtime, so an unknown string would silently no-op both dispatch
120+
# branches in ``ready()`` and return None — a false-green readiness
121+
# signal. Validate in ``__init__`` against this set instead.
122+
_VALID_READINESS_PROBES = frozenset({"models", "chat_completions", "both"})
123+
107124

108125
class OpenAIProvider:
109126
"""OpenAI Chat Completions wire-compatible provider.
@@ -139,6 +156,7 @@ def __init__(
139156
timeout: float = 60.0,
140157
force_prompt_augmentation_fallback: bool = False,
141158
genai_system: str = "openai",
159+
readiness_probe: Literal["models", "chat_completions", "both"] = "chat_completions",
142160
) -> None:
143161
self.base_url = _validate_and_normalize_base_url(base_url)
144162
self.model = model
@@ -157,6 +175,20 @@ def __init__(
157175
# those servers, and a wrong inference is worse than the explicit
158176
# opt-in.
159177
self._genai_system = genai_system
178+
# ``readiness_probe`` selects which wire path ``ready()`` exercises.
179+
# The default ``"chat_completions"`` actually tests inference; the
180+
# opt-in ``"models"`` is the older catalog-only probe for
181+
# cost-sensitive cloud callers (every chat probe bills prompt
182+
# tokens). ``"both"`` runs catalog then chat for the strongest
183+
# signal at double the round-trip cost. Same explicit-opt-in
184+
# rationale as ``genai_system``: no base_url sniffing, since the
185+
# right probe shape depends on what's on the other end and a
186+
# wrong inference is worse than a wrong default.
187+
if readiness_probe not in _VALID_READINESS_PROBES:
188+
raise ValueError(
189+
f"readiness_probe must be one of {sorted(_VALID_READINESS_PROBES)} (got {readiness_probe!r})"
190+
)
191+
self._readiness_probe = readiness_probe
160192
self._headers: dict[str, str] = {"Content-Type": "application/json"}
161193
if api_key is not None:
162194
self._headers["Authorization"] = f"Bearer {api_key}"
@@ -188,20 +220,36 @@ async def aclose(self) -> None:
188220
# ------------------------------------------------------------------
189221

190222
async def ready(self) -> None:
191-
"""Verify the bound model is reachable and listed by the
192-
provider. Hits ``GET /v1/models`` and matches ``self.model``
193-
against the returned ``data[].id`` entries."""
223+
"""Verify the bound model is reachable. Dispatches on the
224+
``readiness_probe`` mode chosen at construction:
225+
226+
- ``"chat_completions"`` (default) issues a ``max_tokens=1``
227+
chat call against ``POST /v1/chat/completions``.
228+
- ``"models"`` issues ``GET /v1/models`` and matches
229+
``self.model`` against the returned ``data[].id`` entries.
230+
- ``"both"`` runs the catalog probe first (cheaper, surfaces
231+
model-not-in-catalog with the catalog diagnostic), then the
232+
chat probe.
233+
"""
234+
if self._readiness_probe in ("models", "both"):
235+
await self._probe_models()
236+
if self._readiness_probe in ("chat_completions", "both"):
237+
await self._probe_chat_completions()
238+
239+
async def _probe_models(self) -> None:
240+
"""Catalog probe — ``GET /v1/models`` + bound-model presence
241+
check. Cheaper than the chat probe (no completion tokens
242+
billed) and surfaces the model-not-in-catalog case with the
243+
cleaner ``seen_ids`` diagnostic; misses wire-format mismatches
244+
on proxies that serve the catalog correctly but reject
245+
completions."""
194246
try:
195247
resp = await self._client.get("/v1/models")
196248
except httpx.HTTPError as exc:
197249
raise ProviderUnavailable(str(exc)) from exc
198250

199-
if resp.status_code in (401, 403):
200-
raise ProviderAuthentication(f"GET /v1/models returned {resp.status_code}")
201-
if 500 <= resp.status_code < 600:
202-
raise ProviderUnavailable(f"GET /v1/models returned {resp.status_code}")
203251
if resp.status_code != 200:
204-
raise ProviderUnavailable(f"GET /v1/models returned unexpected {resp.status_code}")
252+
raise classify_http_error(resp)
205253

206254
try:
207255
body_raw = resp.json()
@@ -246,6 +294,37 @@ async def ready(self) -> None:
246294
f"model {self.model!r} is configured but not loaded (status={status_field!r})"
247295
)
248296

297+
async def _probe_chat_completions(self) -> None:
298+
"""Inference probe — ``POST /v1/chat/completions`` with a
299+
``max_tokens=1`` body. Surfaces wire-format mismatches that
300+
the catalog probe can't see (the motivating case: Bifrost-
301+
style proxies that 200 on ``/v1/models`` but 405/404 on
302+
``/v1/chat/completions``). Bills one prompt's worth of tokens
303+
on cloud endpoints, which is why this defaults on but is
304+
opt-out via ``readiness_probe="models"``."""
305+
body = {
306+
"model": self.model,
307+
"messages": [{"role": "user", "content": "."}],
308+
"max_tokens": 1,
309+
}
310+
try:
311+
resp = await self._client.post("/v1/chat/completions", json=body)
312+
except httpx.HTTPError as exc:
313+
raise ProviderUnavailable(str(exc)) from exc
314+
if resp.status_code != 200:
315+
raise classify_http_error(resp)
316+
# Validate the response shape so a proxy answering 200 with an
317+
# error payload or non-OpenAI-shape JSON doesn't pass the probe.
318+
# Mirrors ``_do_complete``'s parse step. The returned Response
319+
# is discarded — the validation itself is the point.
320+
try:
321+
payload_raw = resp.json()
322+
except ValueError as exc:
323+
raise ProviderInvalidResponse("POST /v1/chat/completions returned non-JSON body") from exc
324+
if not isinstance(payload_raw, dict):
325+
raise ProviderInvalidResponse("POST /v1/chat/completions returned a non-object body")
326+
self._parse_response(cast("dict[str, Any]", payload_raw), None, None)
327+
249328
# ------------------------------------------------------------------
250329
# complete() — single completion call
251330
# ------------------------------------------------------------------

0 commit comments

Comments
 (0)