OpenAIProvider talks to any server that implements OpenAI's Chat
Completions wire format (POST /v1/chat/completions) — including
self-hosted vLLM. This page
walks through the configuration nuances specific to vLLM
deployments.
import asyncio
from openarmature.llm import OpenAIProvider, RuntimeConfig, UserMessage
async def main() -> None:
provider = OpenAIProvider(
base_url="http://localhost:8000", # host root only — no /v1
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None, # vLLM doesn't require auth by default
genai_system="vllm", # surfaces on observability spans
)
messages = [UserMessage(content="hello")]
config = RuntimeConfig(temperature=0.0, max_tokens=128)
# await provider.ready() # pre-flight; needs a live endpoint
# response = await provider.complete(messages, config=config)
# print(response.message.content)
_ = (messages, config) # used once the calls above are uncommented
await provider.aclose()
asyncio.run(main())That's it for the happy path. The rest of the page covers the config nuances you'll hit in real deployments.
vLLM serves on http://<host>:<port>/v1/chat/completions and
http://<host>:<port>/v1/models. OpenAIProvider appends the
/v1/... paths itself, so the base_url you pass MUST be the host
root — no /v1 suffix:
from openarmature.llm import OpenAIProvider
# Correct — host root only, provider appends /v1/...
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
)
# Rejected at construction time — raises ValueError
try:
OpenAIProvider(
base_url="http://localhost:8000/v1",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
)
except ValueError as exc:
# "base_url must not end with '/v1' — the provider appends …"
_ = excThe check exists because httpx joins paths by appending, so an
unprefixed /v1 on base_url produces a doubled /v1/v1/... wire
path that silently 404/405s on most backends. The provider rejects
the misconfiguration synchronously rather than letting it surface as
a wire failure at lifespan startup.
Trailing slashes are stripped; other non-empty paths (proxy prefixes
like /api/openai-proxy) are left intact for intentional reverse-
proxy setups.
vLLM ships with auth off by default. Pass api_key=None for that
case. To enable auth on the vLLM side, launch with --api-key <your-key> and pass the same value to OpenAIProvider:
# vLLM server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key sk-vllm-local-secretfrom openarmature.llm import OpenAIProvider
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key="sk-vllm-local-secret",
)A wrong or missing key surfaces as ProviderAuthentication (mapped
from 401/403) — the same error category as OpenAI cloud auth
failures, so retry / surface logic is portable across backends.
The genai_system constructor kwarg sets the OTel gen_ai.system
span attribute. The default "openai" is correct for OpenAI's
hosted API; for self-hosted backends, set it to the backend name so
dashboards / cost-attribution UIs filter the traces correctly:
from openarmature.llm import OpenAIProvider
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
genai_system="vllm",
)Standard values for other backends running the same wire format:
"vllm", "lmstudio", "llamacpp", "sglang". No base_url
sniffing is done — the same host:port could be any of those servers,
and a wrong inference would be worse than the explicit opt-in.
OpenAI's native structured-output path uses the response_format
field on the request body. Older vLLM releases either reject this
field or silently ignore it. If you're targeting structured output
on an older vLLM:
from openarmature.llm import OpenAIProvider
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
genai_system="vllm",
force_prompt_augmentation_fallback=True,
)With the fallback flag set, the provider injects a JSON-Schema
directive into the system message instead of using
response_format. The wire body never carries response_format;
the model sees the schema in the prompt and is asked to produce
conforming JSON. Validation against the schema still runs on the
returned text — StructuredOutputInvalid surfaces when the model's
output doesn't match.
Recent vLLM releases (>=0.5.x) support response_format natively;
leave the flag at its default False for those.
provider.ready() hits GET /v1/models and:
- Matches the bound model against the returned
data[].identries; raisesProviderInvalidModelif absent. - Consults an optional per-entry
statusfield — if it containsloadingornot_loaded, raisesProviderModelNotLoaded. Local servers that report load state (some LM Studio / vLLM builds) get a real not-loaded signal through this path. - Maps 401/403 →
ProviderAuthentication, 5xx / connection error →ProviderUnavailable.
Limitation for vLLM specifically. vLLM's /v1/models doesn't
populate a status field — it returns the configured model with a
200 even during a slow first-load. So the status-based not-loaded
detection above doesn't fire for vLLM; the probe confirms the model
name matches but can't tell warmed from cold. For deployments where
cold-load takes seconds to minutes, layer your own warm-up call
after ready():
from openarmature.llm import OpenAIProvider, RuntimeConfig, UserMessage
async def warm_up(provider: OpenAIProvider) -> None:
await provider.ready()
# Synthetic warm-up — sends a 1-token request to force the model
# to finish loading before lifespan startup completes.
await provider.complete(
[UserMessage(content="ok")],
config=RuntimeConfig(temperature=0.0, max_tokens=1),
)A more discriminating readiness contract is on the roadmap (see post-release task: harden OpenAIProvider readiness probe).
vLLM supports OpenAI-style tool calling when launched with
--enable-auto-tool-choice and a tool-parser flag matching the
model family. The wire shape is identical to OpenAI's; from
OpenAIProvider's perspective, tool calls Just Work. The
fundamentals → tool calling page
covers the OA-side dispatch pattern; no vLLM-specific changes
needed.
# vLLM server with tool calling enabled
vllm serve <model-id> \
--enable-auto-tool-choice \
--tool-call-parser <parser-name>The --tool-call-parser flag MUST match the model family's training
format; mismatches produce assistant messages that vLLM tries to
parse as tool calls and silently returns as content (or vice versa).
Common families:
| Model family | --tool-call-parser value |
|---|---|
| Llama 3.x Instruct | llama3_json |
| Llama 4 (Maverick / Scout) | llama4_pythonic |
| Mistral Instruct families | mistral |
| Hermes, Qwen 2.5 tool-use | hermes |
| Qwen3 / Qwen3-Coder | qwen3_xml |
| DeepSeek V3 | deepseek_v3 |
| GPT-OSS (20B / 120B) | openai |
Anthropic Claude and Google Gemini models are proprietary cloud APIs,
not open weights; vLLM doesn't serve them, so they don't appear in
this table. Use their first-party endpoints (or an OpenAI-compatible
proxy) and skip the --tool-call-parser story entirely.
Gemma (Google open weights). Distinct from Gemini, but vLLM does
not currently ship a tool-call parser for the mainstream Gemma 2,
Gemma 3, or CodeGemma variants; tool calling is effectively
unsupported under vLLM for those. The one exception is Google's
specialized FunctionGemma (270M, edge-focused), which has its own
functiongemma parser. For general-purpose tool-calling workloads,
pick a model family from the table above rather than Gemma.
Qwen3-VL specifically. vLLM's docs don't currently document a
dedicated parser for the Qwen3-VL variants (Qwen3-VL-30B-A3B,
Qwen3-VL-72B). Check vLLM's release notes for the version you're
pinned to before assuming the Qwen3 row above carries over;
multimodal-instruct variants sometimes ship parser support behind
the text-instruct generation.
See vLLM's tool-calling docs for the current full list; the set grows release-over-release.
The 30-second snippet at the top of this page is enough for a local dev box. Production deployments hit three additional gotchas worth calling out.
OpenAIProvider keeps one httpx.AsyncClient per provider instance
and reuses connections across concurrent complete() calls per the
standard httpx pool idiom. vLLM's stock uvicorn keep-alive timeout
is 5 seconds; an idle pooled connection on the OA side can outlive
that window and the next request lands on a half-closed socket. The
visible symptom is httpcore.RemoteProtocolError: Server disconnected without sending a response or
httpx.RemoteProtocolError, surfaced through OpenAIProvider as
ProviderUnavailable.
The fix is to widen vLLM's keep-alive window via the
VLLM_HTTP_TIMEOUT_KEEP_ALIVE env var (the value feeds uvicorn's
timeout_keep_alive). 300 seconds covers most pool idle windows in
practice:
VLLM_HTTP_TIMEOUT_KEEP_ALIVE=300 vllm serve <model-id> --host 0.0.0.0 --port 8001Same applies behind a reverse proxy: the proxy's keep-alive window MUST be at least as wide as vLLM's. Otherwise the proxy closes connections vLLM still considers alive and the OA-side pool reuses a dead socket on the next call.
For long-running vLLM workloads, a systemd unit is the canonical launcher. The structural skeleton:
# /etc/systemd/system/vllm-<model>.service
[Unit]
Description=vLLM serving <model-id>
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=vllm
WorkingDirectory=/srv/vllm
EnvironmentFile=/etc/vllm/<model>.env
ExecStart=/srv/vllm/.venv/bin/vllm serve <model-id> \
--host 0.0.0.0 --port 8001 \
--enable-auto-tool-choice \
--tool-call-parser <parser-name>
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.targetThe EnvironmentFile pattern keeps VLLM_HTTP_TIMEOUT_KEEP_ALIVE,
CUDA_VISIBLE_DEVICES, HF_HOME, and other deploy-specific vars
out of the unit file itself, which makes the unit shippable across
hosts without per-machine edits. journalctl -u vllm-<model> is
then the canonical log surface for production triage.
Three vLLM flags interact directly with how many concurrent
complete() calls an OA graph can land before vLLM starts 429-ing:
--max-model-len: per-request context ceiling. Lower values fit more concurrent requests in the same KV-cache budget; higher values let individual requests carry longer prompts at the cost of concurrent capacity.--max-num-seqs: hard cap on concurrent sequences vLLM will schedule. Past this cap, the scheduler queues and (once queue fills) returns 429 withRetry-After.--gpu-memory-utilization: fraction of GPU VRAM vLLM may use. Higher values widen the KV-cache budget, which lets vLLM schedule closer to its--max-num-seqscap before evicting in-flight sequences; the cap itself doesn't move. Tune cautiously to avoid OOM on the resident model weights.
OA's OpenAIProvider shares one connection pool across the whole
graph, so a fan-out with concurrency=N lands N simultaneous wire
calls. When N exceeds --max-num-seqs minus vLLM's other
in-flight traffic, expect ProviderRateLimit with
retry_after populated; wrap the LLM-calling node in
RetryMiddleware (or set concurrency explicitly on the fan-out)
to avoid head-of-line stalls.
- Concurrency: vLLM batches requests internally.
OpenAIProvidershares onehttpx.AsyncClientper provider instance; concurrentcomplete()calls share the connection pool and round-trip through vLLM's batcher. - Token counting: vLLM returns OpenAI-shaped
usageblocks. OA records them onResponse.usageand surfaces them asgen_ai.usage.*span attributes per observability §5.5.3. Retry-Afteron 429: vLLM emits 429 when its scheduler queues fill.ProviderRateLimit.retry_afteris populated from the header. Retry middleware handles backoff if wrapped around the node.- Streaming: not supported on
provider.complete()(it's a single-completion call by contract). For streaming, write a capability against the same wire endpoint.