diff --git a/adr/0030-ai-gateway-responses-api.md b/adr/0030-ai-gateway-responses-api.md
new file mode 100644
index 0000000..9b656e6
--- /dev/null
+++ b/adr/0030-ai-gateway-responses-api.md
@@ -0,0 +1,343 @@
# ADR-0030: AI Gateway — OpenAI Responses API Support & Dynamic Model Routing

**Status:** Proposed
**Date:** 2026-04-22

## Context

[ADR-0024](0024-ai-gateway-plugin.md) established the `ai-proxy` dispatcher with a unified OpenAI-compatible API and translation to Anthropic Messages. That decision pinned the client-facing contract to **OpenAI Chat Completions** (`POST /v1/chat/completions`). Two gaps have surfaced since:

### Gap 1 — Chat Completions is not the only OpenAI format clients use

OpenAI's **Responses API** (`POST /v1/responses`, GA March 2025) is a distinct protocol from Chat Completions. It is the format used by:

- `codex-cli` (OpenAI's coding agent)
- The OpenAI Agents SDK
- Most new OpenAI-native tooling released after early 2025

Key differences from Chat Completions:

| | Chat Completions | Responses API |
|---|---|---|
| Request shape | `messages[]` (role + string/array content) | `input[]` (typed items: `input_text`, `input_image`, `function_call`, `function_call_output`, `reasoning`…) |
| Tool calls | Inlined in assistant messages, correlated via `tool_call_id` | First-class items in `input` / `output` arrays |
| Structured outputs | JSON mode / JSON schema on response | Native typed output items |
| State model | Stateless (client sends full history) | **Optionally stateful** (`store: true` + `previous_response_id` chains turns server-side) |

Crucially, the Responses item model is a **closer match to Anthropic's Messages API content blocks** than Chat Completions is. Translating Responses → Anthropic is a more natural mapping: typed items map to content blocks nearly 1-to-1, tool use semantics align, and reasoning traces have a direct counterpart. The Chat Completions → Messages translation in `ai-proxy` v1 has known rough edges around multi-turn tool use precisely because of this impedance mismatch.

Users who need to point OpenAI-format clients at Anthropic models today (e.g. running `codex-cli` against Claude) cannot use Barbacane — they have to run a separate proxy. This has been reported by contributors and is the motivation for this ADR.

### Gap 2 — Model identifiers are pinned in gateway config

The current `ai-proxy` config requires each target to declare a `model` string:

```yaml
targets:
  premium:
    provider: anthropic
    model: claude-opus-4-6   # pinned here
```

The dispatcher honors the client's `model` field if present ([plugins/ai-proxy/src/lib.rs:457-460](../plugins/ai-proxy/src/lib.rs#L457-L460)), but the config schema still makes `model` required per target, and there is no way to route based on the model name the client sent. Operators who want "route any `claude-*` request to Anthropic, any `gpt-*` request to OpenAI" have to declare every model explicitly and keep the list in sync with provider releases.

This is ergonomic friction, not a correctness issue — but it compounds with Gap 1 (a single Responses-aware deployment is likely to expose many models simultaneously) and is cheap to fix.

### Non-goal: stateful Responses API

The stateful mode of the Responses API (`store: true` + `previous_response_id`, plus the companion `GET /v1/responses/{id}` retrieval and cancel endpoints) requires the gateway to persist conversation state across requests keyed by a response identifier it issues itself. Barbacane's WASM plugin runtime has no session-scoped storage primitive today (only per-request `context` and a global `cache` KV). Adding one is a larger refactor that deserves its own ADR. This ADR is **stateless-only**: clients must send the full `input[]` each turn.
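To make "stateless-only" concrete, every turn the client sends carries the full item history inline, with nothing referenced by ID. A minimal sketch (field names follow the public Responses API; the model name and tool details are illustrative):

```json
{
  "model": "claude-sonnet-4-6",
  "store": false,
  "input": [
    {"type": "message", "role": "user",
     "content": [{"type": "input_text", "text": "Read build.log and find the failure."}]},
    {"type": "function_call", "call_id": "call_1",
     "name": "read_file", "arguments": "{\"path\": \"build.log\"}"},
    {"type": "function_call_output", "call_id": "call_1",
     "output": "error: undefined symbol `frob`"},
    {"type": "message", "role": "user",
     "content": [{"type": "input_text", "text": "Suggest a fix."}]}
  ]
}
```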
The v1 behavior on `store: true` is deliberately permissive rather than rejecting — see section 2 below. The real boundary the gateway can't cross is `previous_response_id`, which is where the hard rejection lives.

## Decision

### 0. Principle: the model identifier is caller-owned

The `model` string in an AI request is part of the **client's** contract, not the gateway's. The gateway routes, authenticates, rate-limits, and translates — it does not decide which model a request runs against. This principle governs the rest of this ADR:

- Gateway config declares **providers** (where to go, with what credentials) and **routing rules** (how to pick a provider from a request), never an authoritative model list.
- The client's `model` field is passed to the upstream provider verbatim. No gateway-side override.
- Operators do not need to update the gateway when a provider ships a new model. Adding `claude-opus-5` on the Anthropic side requires zero Barbacane config change if `routes` matches `claude-*`.
- If the client omits `model`, the dispatcher returns `400 model_required`. This matches the upstream provider contracts (both OpenAI Chat Completions and Responses API mark `model` as required) and keeps the gateway out of the model-selection business.

#### Removal of `model` from `targets` and the flat config

The `model` field in `targets.<name>` and in the flat top-level config (inherited from ADR-0024) is **removed** by this ADR, not deprecated. Barbacane is pre-1.0 and the codebase convention is to avoid backward-compat shims at this stage. The migration path for existing ADR-0024 deployments is one line: delete the `model` field from each target. The dispatcher now derives provider credentials from the resolved target but always uses the client's `model` field for the outbound request.

The `targets` map itself is retained for the policy-driven routing case (logical tiers selected via `cel` writing `ai.target` into context). It simply no longer carries a `model` — it holds `provider`, `api_key`, and optionally `base_url`.

##### Migration UX

`additionalProperties: false` in `plugins/ai-proxy/config-schema.json` means a leftover `model:` field is rejected by the compiler at lint time, not silently ignored at runtime. The error surfaces through the existing `vacuum:barbacane` ruleset as a JSON Schema validation failure (E1015 / config-validation), pointing at the offending operation's spec line. The implementing PR ships a CHANGELOG entry under `### Changed` documenting the breaking change and the one-line migration, and the `vacuum:barbacane` error message names the field explicitly (`"model" is no longer accepted on ai-proxy targets — see ADR-0030`) so the failure is self-explanatory without forcing operators to read this ADR.

#### Caller-owned, gateway-gated

"The caller owns the model identifier" does not mean "the gateway forwards anything." The gateway is still responsible for enforcing policy on caller intent — exactly like an HTTP gateway lets the client pick the URL but returns `403` on paths the client is not authorized to hit.
For AI traffic there are two distinct gating layers, and they live in two distinct places:

| Layer | Scope | Example | Where it lives |
|---|---|---|---|
| **Catalog policy** | Static, operator-level — "this gateway serves these models, period" | *"Never route to `claude-opus-*`, it's too expensive"* | `ai-proxy` dispatcher config (`routes.allow` / `routes.deny`) |
| **Consumer policy** | Dynamic, tenant/tier-aware — depends on the caller identity | *"Free tier can't use `gpt-4o`, premium can"* | Middleware (`cel` or an `ai-prompt-guard` option) |

The catalog policy is enforced by the dispatcher before dispatch: a route match is a necessary but not sufficient condition; the requested model must also pass the route's `allow`/`deny` filter. On failure: `403` with `error.type = "model_not_permitted"` and `error.message` naming the offending model.

The consumer policy is enforced by middleware, which already has access to `request.claims` / `request.headers`. The existing `cel` middleware covers this — with one prerequisite: today the plugin exposes `request.body` as a raw string ([plugins/cel/src/lib.rs:124-127](../plugins/cel/src/lib.rs#L124-L127)), which makes JSON-field access unusable in policy expressions. The fix is small and local to the `cel` plugin: when the inbound `content-type` is `application/json` (or a `+json` suffix), parse the body once and bind it under a separate name in the CEL context so existing string-based policies continue to evaluate unchanged.

This ADR commits to that as a **prerequisite work item** before consumer-policy AI examples can ship. With the extension in place:

```yaml
- name: cel
  config:
    expression: "request.body_json.model.startsWith('gpt-4o') && request.claims.tier != 'premium'"
    on_match:
      deny:
        status: 403
        code: "model_not_permitted"
```

Naming (`request.body_json` vs. auto-overloading `request.body`), parse-error failure mode, and whether the parsed value is cached across stacked CEL middlewares are details for the implementing PR — settled on the `cel` plugin side, not in `ai-proxy`. No new capability is required: `cel` already declares `body_access = true` ([plugins/cel/plugin.toml](../plugins/cel/plugin.toml)).

Splitting the two layers matters because they have different audiences (platform team vs. tenant team), different change cadences (rarely vs. per-contract), and different failure modes (misconfiguration blocks everyone vs. one tenant). Collapsing them into a single config section would make both harder to reason about.

### 1. `ai-proxy` becomes protocol-aware

The existing `ai-proxy` dispatcher gains a second client-facing endpoint. It exposes:

- `POST /v1/chat/completions` — existing Chat Completions surface (ADR-0024)
- `POST /v1/responses` — new Responses API surface (this ADR)

Path-based dispatch inside the plugin chooses the translation layer. Targets, credentials, fallback chain, metrics, and context propagation are **shared** across both protocols — they live one layer below the protocol adapter.
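A minimal sketch of that seam, with hypothetical type and function names (not the actual plugin API): the path picks the adapter, everything below it is shared.

```rust
// Illustrative only — the real dispatcher works on its own request types.
enum Protocol {
    ChatCompletions, // POST /v1/chat/completions (ADR-0024)
    Responses,       // POST /v1/responses (this ADR)
    Models,          // GET /v1/models (section 4)
}

fn classify(method: &str, path: &str) -> Option<Protocol> {
    match (method, path) {
        ("POST", "/v1/chat/completions") => Some(Protocol::ChatCompletions),
        ("POST", "/v1/responses") => Some(Protocol::Responses),
        ("GET", "/v1/models") => Some(Protocol::Models),
        // Unreachable in practice if the spec only binds these three operations.
        _ => None,
    }
}
```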
Source layout refactor in `plugins/ai-proxy/src/`:

```
lib.rs                    dispatch entry + target resolution (shared)
protocols/
  chat_completion.rs      existing translate_{to,from}_anthropic (extracted)
  responses.rs            new — stateless Responses ↔ Messages translation
providers/
  openai.rs               passthrough for both protocols
  anthropic.rs            translate per incoming protocol
  ollama.rs               passthrough (Chat Completions only for now)
```

**Rationale for extending `ai-proxy` rather than shipping a second plugin:** targets, fallback, `/v1/models`, and observability are protocol-agnostic. A second plugin would duplicate all of that and force operators to pick which protocol surface they want per deployment. Keeping a single plugin lets a single `ai-proxy` instance serve mixed-format clients against the same target pool.

### 2. Responses API translation — stateless only

The `responses.rs` adapter:

- **Accepts** `POST /v1/responses` regardless of the `store` value (`true`, `false`, or omitted). The gateway always processes the request statelessly — upstream to Anthropic, no persistence. Rationale: most clients send `store: true` as an unexamined default (it is the OpenAI server-side default) without ever using the stateful features; rejecting `store: true` would break those clients gratuitously. The gateway fails at the real boundary (see next point), not at a flag the client didn't mean to invoke.
- **Emits observability signals** when `store ≠ false`:
  - HTTP response header `Warning: 299 - "store ignored; gateway is stateless, see ADR-0030"`
  - Metric counter `barbacane_plugin_ai_proxy_responses_store_downgrades_total` incremented — lets operators quantify stateful-API usage and decide whether to prioritize the session capability work
- **Generates a synthetic `id`** in the response (format `resp_<uuidv7>`). UUIDv7 (RFC 9562) is time-ordered, so a `resp_*` grep across access logs comes out in chronological order without a separate sort key — and the embedded timestamp leaks no information the response wasn't already carrying via `created_at`. Clients that read the ID only as an opaque tracking handle are unaffected; clients that try to reuse it fail explicitly at the next point.
- **Rejects `previous_response_id`** with `400` and `error.type = "previous_response_id_not_supported"`. This is the only genuinely stateful feature a client can invoke, and the rejection lands exactly where the client is trying to use state the gateway cannot provide. No silent behavior change, no distant mystery failure.
- **Related endpoints** (`GET /v1/responses/{id}`, `POST /v1/responses/{id}/cancel`, etc.) are not included in the ai-gateway spec fragment shipped by Barbacane (see section 4). Absence of the route → `404` from the standard router, no special handler. This keeps the spec-driven invariant: the gateway's capabilities are visible in the compiled `.bca` artifact.
- **Translates** `input[]` items to Anthropic Messages `content` blocks (a concrete example follows this list):
  - `input_text` / `input_image` → `text` / `image` content blocks
  - `function_call` + `function_call_output` → `tool_use` + `tool_result` blocks
  - `reasoning` items → dropped on the request side (Anthropic does not accept client-supplied reasoning input). Because dropping reasoning silently can degrade output quality on multi-turn agent flows in ways the client cannot detect, the dispatcher emits `Warning: 299 - "reasoning items dropped; Anthropic upstream does not accept client-supplied reasoning input"` and increments `barbacane_plugin_ai_proxy_responses_reasoning_dropped_total` whenever it strips at least one reasoning item from `input[]`. Documented as a known fidelity gap.
- **Translates** the Anthropic response back to Responses API shape:
  - `content` blocks → `output[]` items of matching types
  - Anthropic `usage` → Responses `usage` (`input_tokens` / `output_tokens` map directly; no `prompt_tokens` rename needed since Responses already uses the new naming)
- **Pins** the Anthropic API version to `2024-10-22` (same as ADR-0024 Chat Completions path), subject to the contract-test-and-bump process from ADR-0024.
- **Streaming:** Responses API SSE (`response.output_item.delta` events, etc.) maps to Anthropic SSE (`content_block_delta`). Initial release buffers Anthropic responses and emits a single terminal Responses event, matching the current Chat Completions path ([plugins/ai-proxy/src/lib.rs:286-292](../plugins/ai-proxy/src/lib.rs#L286-L292)). True token-by-token Responses SSE translation is deferred to the same future work item as Chat Completions SSE.
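To make the item-to-block mapping concrete, here is one plausible rendering of a single tool round-trip. Shapes follow the two public APIs; the exact serialization is the adapter's concern. The Responses-side input items:

```json
[
  {"type": "function_call", "call_id": "call_1",
   "name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
  {"type": "function_call_output", "call_id": "call_1",
   "output": "{\"temp_c\": 18}"}
]
```

map to Anthropic Messages as an assistant `tool_use` block followed by a user `tool_result` block:

```json
[
  {"role": "assistant", "content": [
    {"type": "tool_use", "id": "call_1", "name": "get_weather",
     "input": {"city": "Paris"}}]},
  {"role": "user", "content": [
    {"type": "tool_result", "tool_use_id": "call_1",
     "content": "{\"temp_c\": 18}"}]}
]
```

Note how the `call_id` correlation survives the translation intact — this is the near-isomorphism the Context section claims, and it is what the Chat Completions path lacks.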
### 3. Dynamic model routing (`routes` table)

A new `routes` section in `ai-proxy` config resolves the provider/credentials by matching the `model` field in the client request against glob patterns:

```yaml
x-barbacane-dispatch:
  name: ai-proxy
  config:
    routes:
      - pattern: "claude-*"
        provider: anthropic
        api_key: "${ANTHROPIC_API_KEY}"
        deny: ["claude-opus-*"]        # catalog policy: no Opus on this gateway
      - pattern: "gpt-*"
        provider: openai
        api_key: "${OPENAI_API_KEY}"
        allow: ["gpt-4o-mini", "gpt-4o"]
      - pattern: "o[1-4]*"
        provider: openai
        api_key: "${OPENAI_API_KEY}"
      - pattern: "*"
        provider: ollama
        base_url: http://ollama:11434
```

`allow` and `deny` are optional glob lists evaluated against the client's `model` field after the route matches. Semantics: if `allow` is set, the model must match at least one entry; if `deny` is set, the model must not match any entry; both can be combined (deny is evaluated after allow). A request that fails either check gets a `403` — it does not fall through to the next route pattern, because that would silently escalate a denied model to a different provider.

#### Catalog policy applies on every resolution path

A subtlety the first draft of this ADR left implicit: catalog `allow` / `deny` is meant as an *operator-level* statement ("this gateway never serves `claude-opus-*`, period"). That intent does not vary based on *how* the dispatcher arrived at a target. So the rule is: **`routes[].allow` / `deny` are evaluated against the inbound `model` field on every resolution path that produces a target carrying `allow` / `deny`** — including when the target was selected via `ai.target` from a `cel` middleware (path 1 below).

Concretely, the resolution algorithm is (a code sketch follows this subsection):

1. Resolve the target via the precedence list below — this picks the `provider` + credentials.
2. If the resolved target carries `allow` and/or `deny` rules, evaluate them against `request.body.model`. Failure → `403 model_not_permitted`.
3. Dispatch.

Resolution precedence:

1. `ai.target` context key (set by `cel` middleware — existing ADR-0024 behavior) — wins if present. Resolves to `targets[name]`. If that target was declared in the `routes` table (i.e. the operator merged the two by giving a route pattern a `name`), its `allow` / `deny` apply. If it was declared as a plain `targets.<name>` without `allow`/`deny`, none apply (consumer-policy CEL is the right enforcement layer for that case).
2. `routes` pattern match against the request's `model` field — new. `allow` / `deny` apply.
3. `default_target` → `targets[name]` — existing. `allow` / `deny` from that target apply if declared.
4. Flat `provider` config — existing (the `model` now always comes from the client; see section 0). No `allow` / `deny` (single-target config).

The point is: catalog policy is attached to **the target**, not to the resolution path. A `cel` misconfig that sets `ai.target: anthropic-tier` cannot leak a denied model, because the `allow` / `deny` on that target still gate the dispatch. This is the safe-by-default reading; operators who want CEL to fully bypass catalog policy can put their CEL logic on a target that simply has no `allow`/`deny`.
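A sketch of the resolve-then-gate order, assuming `globset` as the glob engine (the Patterns subsection below names it as the obvious fit; type and function names here are hypothetical):

```rust
use globset::GlobSet;

// Illustrative only — the real dispatcher has its own config types.
struct Target {
    provider: String,
    allow: Option<GlobSet>, // catalog policy: model must match if present
    deny: Option<GlobSet>,  // catalog policy: model must not match
}

fn gate(model: &str, target: &Target) -> Result<(), &'static str> {
    // Step 2: allow first, then deny, no fall-through to other routes.
    if let Some(allow) = &target.allow {
        if !allow.is_match(model) {
            return Err("model_not_permitted"); // surfaces as 403
        }
    }
    if let Some(deny) = &target.deny {
        if deny.is_match(model) {
            return Err("model_not_permitted"); // surfaces as 403
        }
    }
    Ok(())
}

fn resolve_and_gate<'a>(
    ctx_target: Option<&'a Target>,    // path 1: ai.target from cel middleware
    route_match: Option<&'a Target>,   // path 2: routes pattern on `model`
    default_target: Option<&'a Target>,// path 3: default_target
    model: &str,
) -> Result<&'a Target, &'static str> {
    // Step 1: precedence — context wins, then routes, then default.
    let target = ctx_target
        .or(route_match)
        .or(default_target)
        .ok_or("no_route")?; // surfaces as 400
    // Step 2 runs regardless of which path produced the target.
    gate(model, target)?;
    Ok(target) // step 3: dispatch
}
```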
#### Patterns

Patterns are glob (`*`, `?`, `[...]`), not full regex — simpler schema, covers the 95% case. First match wins.

Glob syntax, case-sensitivity, and anchoring are constrained at the **plugin's JSON schema** level (a `pattern` regex on `routes[].pattern`) so that invalid syntax is rejected at compile time by `vacuum:barbacane`, not at runtime. The implementing PR picks the underlying glob library (`globset` is the obvious fit) and the schema mirrors that library's accepted character set. This keeps the ADR out of the business of pinning a specific library version.

The `model` field is **intentionally absent from `routes` entries**, enforcing the principle above: routes declare which provider a model family belongs to, not what models exist. Operators do not track new model names in the gateway config. The `targets` map remains available for the policy-driven routing case (ADR-0024) but no longer carries a `model` either — see section 0.

#### `allow` / `deny` does not fall through — escape hatch

Because a denied model returns `403` and does not fall through, this config:

```yaml
routes:
  - pattern: "gpt-*"
    provider: openai
    api_key: "${OPENAI_API_KEY}"
    allow: ["gpt-4o", "gpt-4o-mini"]
  - pattern: "*"
    provider: ollama
    base_url: http://ollama:11434
```

returns `403` for `gpt-3.5-turbo` — it does **not** route to ollama. This is the safe default (a "deny" is not silently escalated to a different provider), but it is mildly surprising. The escape hatch is to make the route pattern more specific so non-matching OpenAI models miss the route entirely and reach the catch-all:

```yaml
routes:
  - pattern: "gpt-4o*"        # only matches what the allow-list would accept
    provider: openai
    api_key: "${OPENAI_API_KEY}"
  - pattern: "*"
    provider: ollama
    base_url: http://ollama:11434
```

Now `gpt-3.5-turbo` falls through to ollama, while `gpt-4o-mini` still routes to OpenAI. The trade-off is explicit: the `allow` list is the operator's promise *"only these from this provider"*; tightening the pattern is the operator's promise *"anything else doesn't belong to this provider at all."* The two read differently and should.

When no route matches the client's `model` (or when the request omits `model` entirely), the dispatcher returns `400` with `error.type = "no_route"` or `"model_required"` respectively. The gateway does not silently pick a default.
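For illustration — and assuming `ai-proxy` keeps the OpenAI-style error envelope that the `error.type` references above imply — the `no_route` rejection would look roughly like:

```json
{
  "error": {
    "type": "no_route",
    "message": "no route pattern matches model 'mistral-large-2'"
  }
}
```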
### 4. `/v1/models` served by the `ai-proxy` dispatcher

Model discovery (`GET /v1/models`) stays **inside the spec-driven routing model**: it is served by the same `ai-proxy` dispatcher, bound to a `GET /v1/models` operation declared in the tenant spec. No native data plane handler, no carve-out around the middleware pipeline.

An initial draft of this ADR proposed a native handler injected outside the spec. That was rejected as inconsistent with Barbacane's core invariant — **every data plane path is declared in the spec and flows through the standard dispatch + middleware pipeline** — and it also created a cross-plugin config problem (a discovery plugin would need to read the `ai-proxy` plugin's targets, which is not a capability Barbacane exposes).

#### Dispatcher behavior per path

The `ai-proxy` dispatcher selects its behavior from `req.path` (the same dispatch key already used in section 1):

| Incoming path | Behavior |
|---|---|
| `POST /v1/chat/completions` | Chat Completions translation (ADR-0024) |
| `POST /v1/responses` | Responses API translation (this ADR, section 2) |
| `GET /v1/models` | Return the model catalog derived from the dispatcher's own config |

The dispatcher's own config is the source of truth — there is no cross-plugin lookup to invent. With `model` removed from `targets` (section 0), the dispatcher has no local list of model identifiers to advertise, so discovery works by querying each configured provider's own `/models` endpoint and aggregating the results. Entries are filtered by the route `allow` / `deny` rules that would apply to a request with that model, so a deny-listed model never appears in `/v1/models` — no accidental advertising of models the gateway would refuse.

##### Caching

Aggregation is cached via the existing `cache` capability (the same KV that backs the [`cache` middleware plugin](../plugins/cache/plugin.toml) — `host_cache_get` / `host_cache_set` host functions). `ai-proxy` adds `cache = true` to its capability manifest; the cache key is derived from the dispatcher's resolved provider set + an optional `models_cache` config block:

```yaml
config:
  models_cache:
    ttl: 300        # seconds, default 300
    enabled: true   # set false to query providers on every request
```

Scope and failure mode — both worth flagging because they are non-obvious:

- **Per-data-plane-instance, not cross-replica.** The `cache` capability is process-local. Each replica builds and serves its own `/v1/models` snapshot. Two replicas can momentarily disagree if a provider has just added or removed a model; this is acceptable for a discovery endpoint that is itself eventually-consistent upstream.
- **Cache-miss thundering herd.** A simultaneous burst of `/v1/models` requests after a TTL expiry can fan out to every provider. Mitigated by single-flight inside the dispatcher (only one upstream call per (provider, instance) at a time; concurrent callers wait on the in-flight result). Implementation detail, not a config knob.
- **Partial provider failure.** If one provider's `/models` call fails (5xx, timeout, connection error), the dispatcher returns a partial response — `200 OK` with the available providers' models in `data[]`, plus a `partial: true` field on the response and a `warnings: [{provider, status}]` array describing each failure (an illustrative partial response follows the standard shape below). Rationale: a discovery endpoint that hard-fails when one of three providers is flaky causes more user-visible breakage than it prevents, and the partial flag lets sophisticated clients react. The failure is also tagged via the counter `barbacane_plugin_ai_proxy_models_provider_failures_total{provider}` so operators see it without polling clients.

Response shape (OpenAI-compatible):

```json
{
  "object": "list",
  "data": [
    {"id": "claude-sonnet-4-6", "object": "model", "owned_by": "anthropic"},
    {"id": "gpt-4o", "object": "model", "owned_by": "openai"}
  ]
}
```
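And under the partial-failure rule above, one flaky provider out of two would yield something like this (`partial` and `warnings` are this ADR's proposal, not OpenAI-standard fields; the `status` value is illustrative):

```json
{
  "object": "list",
  "partial": true,
  "warnings": [{"provider": "openai", "status": "timeout"}],
  "data": [
    {"id": "claude-sonnet-4-6", "object": "model", "owned_by": "anthropic"}
  ]
}
```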
#### Operator ergonomics: the ai-gateway spec fragment

Requiring the operator to declare three operations in their spec and bind `ai-proxy` to each — with the same config block duplicated three times — is friction we do not want to impose. The fix is **not** a native handler; it is a reusable spec fragment.

##### What the compiler already gives us

The compiler already supports multi-file specs. A `barbacane.yaml` manifest can point `specs:` at a folder; every `*.yaml` / `*.yml` / `*.json` in that folder is parsed and **all operations across all files merge into a single artifact** ([crates/barbacane-compiler/src/manifest.rs:268-322](../crates/barbacane-compiler/src/manifest.rs#L268-L322), [crates/barbacane-compiler/src/artifact.rs:417-426](../crates/barbacane-compiler/src/artifact.rs#L417-L426)). Duplicate routes and `operationId`s are rejected; otherwise files are interchangeable. Plugin config values that look like `env://VAR_NAME` are resolved at runtime from the environment ([crates/barbacane-wasm/src/secrets.rs:79-85](../crates/barbacane-wasm/src/secrets.rs#L79-L85)).

Together, these mean Barbacane can ship `schemas/ai-gateway.yaml` as a **standalone spec file** that operators drop into their `specs/` folder alongside their tenant spec. The shipped file declares the three operations and binds `ai-proxy` with `env://`-resolved provider keys. No new compiler mechanism is required to make this work — what we are calling "the ai-gateway spec fragment" is just a regular spec the compiler already knows how to consume.

```
my-api/
├── barbacane.yaml          # specs: ./specs/
└── specs/
    ├── api.yaml            # tenant spec
    └── ai-gateway.yaml     # shipped by Barbacane; declares the three operations
```

##### What's still duplicated, and what to do about it

The remaining friction is config-level, not file-level: each of the three operations in the shipped fragment carries its own `x-barbacane-dispatch: { name: ai-proxy, config: { … } }` block, and an operator who wants to override a route or add a target has to edit three blocks in lockstep. The current pattern for solving this *for middlewares* is the root-level `x-barbacane-middlewares`, which the compiler merges with each operation's chain ([crates/barbacane-compiler/src/spec_parser/parser.rs:109](../crates/barbacane-compiler/src/spec_parser/parser.rs#L109)). Dispatchers do not have a root-level equivalent — `x-barbacane-dispatch` is operation-only ([crates/barbacane-compiler/src/spec_parser/model.rs:52](../crates/barbacane-compiler/src/spec_parser/model.rs#L52)).

Two options for v1 of this ADR, both small:

1. **Bake the config into the shipped fragment with `env://` placeholders for credentials.** Operators set `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` / `OLLAMA_BASE_URL` in their environment and don't touch the file. Routes and price tables are in the shipped fragment; tweaking either means forking the file. Zero compiler changes. Works today.
2. **Add a root-level `x-barbacane-dispatch-defaults` block** keyed by plugin name, parallel to root-level `x-barbacane-middlewares`. Operations that bind to that plugin without their own `config` inherit the defaults. The shipped fragment carries `x-barbacane-dispatch: { name: ai-proxy }` (no config); the operator's tenant spec carries `x-barbacane-dispatch-defaults: { ai-proxy: { config: { … } } }` (sketched below). Generalizes beyond AI gateway. Modest compiler addition, mirroring the well-established middleware pattern.
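For illustration only — option 2 is not implemented, and the root-level key is a sketch — the split would look like:

```yaml
# specs/ai-gateway.yaml (shipped fragment): each operation binds the
# dispatcher with no config of its own:
#   x-barbacane-dispatch: { name: ai-proxy }

# specs/api.yaml (tenant spec): hypothetical root-level defaults block.
x-barbacane-dispatch-defaults:
  ai-proxy:
    config:
      routes:
        - pattern: "claude-*"
          provider: anthropic
          api_key: env://ANTHROPIC_API_KEY
        - pattern: "gpt-*"
          provider: openai
          api_key: env://OPENAI_API_KEY
```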
**This ADR's recommendation is to ship option 1 in v1 and treat option 2 as a follow-up improvement.** Option 1 unblocks `/v1/models` without any compiler change; option 2 is a clean ergonomic win when an operator is ready to invest. Operators on option 1 who want to customize routes or pricing fork the shipped fragment into their own `specs/` folder — a one-line cost for what should be a rare action.

Either path keeps the spec-driven invariant: the compiled `.bca` artifact lists every operation explicitly, regardless of how it got there.

The point of calling this out is: do not let the implementation PR sneak a half-designed merge mechanism in alongside the Responses API work. The two are decoupled and benefit from being decided separately.

#### Disabling `/v1/models`

An operator who does not want to expose `/v1/models` simply does not include that operation in their spec (or imports a variant of the fragment without it). No config flag needed — the spec is the source of truth. This is another reason the spec-driven approach wins: removing a route is a spec edit, not a config flag, and the absence is auditable via the compiled `.bca` artifact.

## Consequences

### Easier

- Running `codex-cli` and other Responses-API-native tools against Anthropic models through Barbacane, without a sidecar proxy.
- Tool-use-heavy workloads translate more cleanly (item-based → content-block mapping is near-isomorphic).
- Adding a new Anthropic or OpenAI model: no gateway config change if the operator is using `routes`.
- Model discovery for AI-aware clients (`/v1/models`) via an importable spec fragment, without operator-written plugins and without breaking spec-driven invariants.
- Mixed-format deployments: a single `ai-proxy` instance serves both Chat Completions and Responses clients against the same target pool.

### Harder

- `ai-proxy` is now protocol-aware. Translation bugs can be specific to one surface and not the other, doubling the contract-test matrix (Chat Completions × {OpenAI, Anthropic}, Responses × {OpenAI, Anthropic}).
- Streaming fidelity remains a known limitation on both protocols until real SSE translation lands.
- The `routes` table introduces a third resolution path, increasing the surface area of "why did my request go to provider X?" debugging. Mitigation: the dispatcher emits a metric label `resolution = {context|routes|default|flat}` and a debug log line.
- **Breaking change for ADR-0024 deployments**: the `model` field is removed from `targets.<name>` and from the flat top-level config. Existing specs must be updated — delete the field from each target. Justified by the project's pre-1.0 status and the codebase convention of avoiding backward-compat shims at this stage.

### Deferred (explicitly out of scope)

- **Stateful Responses API** (`previous_response_id`, `GET /v1/responses/{id}`, cancel). Requires a session-scoped storage capability in the WASM runtime. Track as a future ADR.
The `400 previous_response_id_not_supported` rejection in v1 is the forward-compatibility hook — a future release can remove it without breaking existing clients that were already failing on it. +- **True SSE translation** for either protocol against Anthropic. Inherits the deferral from ADR-0024. +- **Ollama Responses API support.** Ollama's OpenAI-compat surface is Chat Completions only as of 2026-04. If/when Ollama adds Responses, it is a one-line passthrough addition. + +## References + +- [OpenAI Responses API reference](https://developers.openai.com/api/reference/resources/responses/methods/create) +- [ADR-0023: WASM plugin streaming](0023-wasm-plugin-streaming.md) — prerequisite for SSE +- [ADR-0024: AI Gateway Plugin](0024-ai-gateway-plugin.md) — predecessor, defines Chat Completions path