Commit c1b2f23

Add vLLM production deployment notes (#112)
Extend docs/model-providers/vllm.md with cross-cutting gotchas surfaced in real production work.

The "Tool calling" section grows a --tool-call-parser family table (verified against vLLM's docs: Llama 3.x, Llama 4, Mistral, Hermes, Qwen3, DeepSeek V3, GPT-OSS) plus explicit not-supported callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma (no parser ships).

A new "Production deployment" H2 covers the three gotchas:

- VLLM_HTTP_TIMEOUT_KEEP_ALIVE: vLLM's stock 5s uvicorn keep-alive lapses pooled OA-side httpx connections and surfaces as ProviderUnavailable; widen to roughly 300s. Includes the reverse-proxy variant of the same rule.
- systemd unit skeleton: structural, no model-specific paths; uses EnvironmentFile so the unit ships across hosts.
- Throughput knobs (--max-model-len, --max-num-seqs, --gpu-memory-utilization) framed OA-side: when fan-out concurrency exceeds the cap, expect ProviderRateLimit; wrap the LLM-calling node in RetryMiddleware.

Docs-only; no code or test changes. CHANGELOG bullet added under [Unreleased] ### Added.
1 parent ebbafd8 commit c1b2f23

2 files changed

Lines changed: 136 additions & 6 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
88

99
### Added
1010

11+
- **vLLM production deployment notes.** `docs/model-providers/vllm.md` grows a "Production deployment" section covering the `VLLM_HTTP_TIMEOUT_KEEP_ALIVE` gotcha (vLLM's stock 5s uvicorn keep-alive lapses pooled OA-side httpx connections and surfaces as `ProviderUnavailable`; widen to roughly 300s), a systemd unit skeleton, and the three throughput knobs that interact with OA's shared connection pool (`--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`). The existing "Tool calling" section grows a `--tool-call-parser` family table verified against vLLM's docs (Llama 3.x / Llama 4 / Mistral / Hermes / Qwen3 / DeepSeek V3 / GPT-OSS), plus explicit "not supported here" callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma (no vLLM parser).
1112
- **Three new patterns docs.** `docs/patterns/state-migration-on-resume.md`, `docs/patterns/caller-supplied-trace-identifiers.md`, and `docs/patterns/observer-state-reconciliation.md` graduate the corresponding entries from `docs/agent/non-obvious-shapes.md` into full pattern recipes with code snippets and "when this is right / when it isn't" guidance. The programmatic patterns API (`openarmature.patterns.list()` / `get(name)`) grows from 4 to 7 entries.
1213
- **HyperDX OTel integration test path and "Production swap" docs in example 03.** `examples/03-observer-hooks/main.py`'s module docstring grows a "Production swap" section showing how to substitute the demo's `SimpleSpanProcessor` + `ConsoleSpanExporter` for `BatchSpanProcessor` + `OTLPSpanExporter` pointed at HyperDX (or any other OTLP-HTTP collector). A new opt-in integration test (`tests/integration/test_otel_hyperdx_export.py`, gated by `HYPERDX_API_KEY` + `HYPERDX_OTLP_ENDPOINT` env vars and `@pytest.mark.integration`) drives the same production export path end-to-end against a live endpoint. `opentelemetry-exporter-otlp-proto-http` lands as a dev-only dep; not promoted to a public extras group yet.
1314

docs/model-providers/vllm.md

Lines changed: 135 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -197,21 +197,150 @@ post-release task: harden OpenAIProvider readiness probe).
197197

198198
vLLM supports OpenAI-style tool calling when launched with
199199
`--enable-auto-tool-choice` and a tool-parser flag matching the
200-
model family (e.g., `--tool-call-parser llama3_json` for Llama 3.1
201-
Instruct). The wire shape is identical to OpenAI's; from
200+
model family. The wire shape is identical to OpenAI's; from
202201
`OpenAIProvider`'s perspective, tool calls Just Work. The
203202
[fundamentals → tool calling](../concepts/llms.md#tool-calling) page
204203
covers the OA-side dispatch pattern; no vLLM-specific changes
205204
needed.
206205

207206
```bash
208-
# vLLM server — enable tool calling
209-
python -m vllm.entrypoints.openai.api_server \
210-
--model meta-llama/Llama-3.1-8B-Instruct \
207+
# vLLM server with tool calling enabled
208+
vllm serve <model-id> \
211209
--enable-auto-tool-choice \
212-
--tool-call-parser llama3_json
210+
--tool-call-parser <parser-name>
211+
```
212+
213+
The `--tool-call-parser` flag MUST match the model family's training
214+
format. On a mismatch, vLLM either fails to recognize a tool call and
215+
silently returns it as content, or parses plain content as a tool call.
216+
Common families:
217+
218+
| Model family | `--tool-call-parser` value |
219+
|-------------------------------|----------------------------|
220+
| Llama 3.x Instruct | `llama3_json` |
221+
| Llama 4 (Maverick / Scout) | `llama4_pythonic` |
222+
| Mistral Instruct families | `mistral` |
223+
| Hermes, Qwen 2.5 tool-use | `hermes` |
224+
| Qwen3 / Qwen3-Coder | `qwen3_xml` |
225+
| DeepSeek V3 | `deepseek_v3` |
226+
| GPT-OSS (20B / 120B) | `openai` |
227+
228+
Anthropic Claude and Google Gemini models are proprietary cloud APIs,
229+
not open weights; vLLM doesn't serve them, so they don't appear in
230+
this table. Use their first-party endpoints (or an OpenAI-compatible
231+
proxy) and skip the `--tool-call-parser` story entirely.
232+
233+
**Gemma (Google open weights).** Distinct from Gemini, but vLLM does
234+
not currently ship a tool-call parser for the mainstream Gemma 2,
235+
Gemma 3, or CodeGemma variants; tool calling is effectively
236+
unsupported under vLLM for those. The one exception is Google's
237+
specialized FunctionGemma (270M, edge-focused), which has its own
238+
`functiongemma` parser. For general-purpose tool-calling workloads,
239+
pick a model family from the table above rather than Gemma.
240+
241+
**Qwen3-VL specifically.** vLLM's docs don't currently document a
242+
dedicated parser for the Qwen3-VL variants (`Qwen3-VL-30B-A3B`,
243+
`Qwen3-VL-72B`). Check vLLM's release notes for the version you're
244+
pinned to before assuming the Qwen3 row above carries over;
245+
multimodal-instruct variants sometimes ship parser support behind
246+
the text-instruct generation.
247+
248+
See vLLM's
249+
[tool-calling docs](https://docs.vllm.ai/en/latest/features/tool_calling.html)
250+
for the current full list; the set grows release-over-release.
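
One way to keep the parser choice from drifting out of sync with the model is to encode the family table above as a lookup in deployment tooling. The following is a minimal sketch, not part of OpenArmature: the dictionary keys and the `build_serve_command` helper are hypothetical names chosen for illustration, while the parser values and flags are the ones listed on this page.

```python
import shlex

# Family -> parser values copied from the table above. The keys are
# hypothetical short labels invented for this sketch.
TOOL_CALL_PARSERS = {
    "llama3": "llama3_json",
    "llama4": "llama4_pythonic",
    "mistral": "mistral",
    "hermes": "hermes",
    "qwen2.5": "hermes",
    "qwen3": "qwen3_xml",
    "deepseek-v3": "deepseek_v3",
    "gpt-oss": "openai",
}


def build_serve_command(model_id: str, family: str) -> list[str]:
    """Return the `vllm serve` argv for a family with a known parser."""
    try:
        parser = TOOL_CALL_PARSERS[family]
    except KeyError:
        # Fail loudly rather than guess: a wrong parser fails silently
        # at request time, which is much harder to debug.
        raise ValueError(
            f"no tool-call parser recorded for family {family!r}; "
            "check vLLM's tool-calling docs before enabling tool calls"
        ) from None
    return [
        "vllm", "serve", model_id,
        "--enable-auto-tool-choice",
        "--tool-call-parser", parser,
    ]


if __name__ == "__main__":
    argv = build_serve_command("meta-llama/Llama-3.1-8B-Instruct", "llama3")
    print(shlex.join(argv))
```

Raising on an unknown family (Gemma, for instance) is deliberate: it forces a look at vLLM's current parser list instead of falling back to a default that may not match the model.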
251+
252+
## Production deployment
253+
254+
The 30-second snippet at the top of this page is enough for a local
255+
dev box. Production deployments hit three additional gotchas worth
256+
calling out.
257+
258+
### `VLLM_HTTP_TIMEOUT_KEEP_ALIVE` against `OpenAIProvider`
259+
260+
`OpenAIProvider` keeps one `httpx.AsyncClient` per provider instance
261+
and reuses connections across concurrent `complete()` calls per the
262+
standard httpx pool idiom. vLLM's stock uvicorn keep-alive timeout
263+
is 5 seconds; an idle pooled connection on the OA side can outlive
264+
that window and the next request lands on a half-closed socket. The
265+
visible symptom is `httpcore.RemoteProtocolError: Server
266+
disconnected without sending a response` or
267+
`httpx.RemoteProtocolError`, surfaced through `OpenAIProvider` as
268+
`ProviderUnavailable`.
269+
270+
The fix is to widen vLLM's keep-alive window via the
271+
`VLLM_HTTP_TIMEOUT_KEEP_ALIVE` env var (the value feeds uvicorn's
272+
`timeout_keep_alive`). 300 seconds covers most pool idle windows in
273+
practice:
274+
275+
```bash
276+
VLLM_HTTP_TIMEOUT_KEEP_ALIVE=300 vllm serve <model-id> --host 0.0.0.0 --port 8001
213277
```
214278

279+
The same rule applies behind a reverse proxy: the proxy's keep-alive window
280+
MUST be at least as wide as vLLM's. Otherwise the proxy closes
281+
connections vLLM still considers alive and the OA-side pool reuses a
282+
dead socket on the next call.
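
The ordering described above can be checked mechanically before a deploy. The sketch below assumes three numbers are known: the client pool's idle expiry (httpx exposes this as `keepalive_expiry` on `httpx.Limits`; whether `OpenAIProvider` lets you set it is not covered here), the value of `VLLM_HTTP_TIMEOUT_KEEP_ALIVE`, and, if present, the proxy's keep-alive window. The function name and the sample values are hypothetical.

```python
from typing import Optional


def keep_alive_problems(
    client_idle_s: float,
    vllm_keep_alive_s: float,
    proxy_keep_alive_s: Optional[float] = None,
) -> list[str]:
    """List violations of the keep-alive ordering described above."""
    problems = []

    # Rule from the text: a proxy must hold connections at least as long
    # as vLLM does.
    if proxy_keep_alive_s is not None and proxy_keep_alive_s < vllm_keep_alive_s:
        problems.append(
            f"proxy keep-alive ({proxy_keep_alive_s}s) is narrower than "
            f"vLLM's ({vllm_keep_alive_s}s)"
        )

    # The client talks to the proxy when there is one, otherwise to vLLM.
    # Its idle connections must expire before that hop closes them.
    front_s = vllm_keep_alive_s if proxy_keep_alive_s is None else proxy_keep_alive_s
    if client_idle_s >= front_s:
        problems.append(
            f"client idle expiry ({client_idle_s}s) is not shorter than the "
            f"server-side window ({front_s}s)"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical values: a 5s client pool against stock vLLM, then
    # against the widened 300s window.
    print(keep_alive_problems(5.0, 5.0))
    print(keep_alive_problems(5.0, 300.0))
```

An empty list means the chain is ordered safely; anything else names the hop to fix.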
283+
284+
### systemd unit shape
285+
286+
For long-running vLLM workloads, a systemd unit is the canonical
287+
launcher. The structural skeleton:
288+
289+
```ini
290+
# /etc/systemd/system/vllm-<model>.service
291+
[Unit]
292+
Description=vLLM serving <model-id>
293+
After=network-online.target
294+
Wants=network-online.target
295+
296+
[Service]
297+
Type=simple
298+
User=vllm
299+
WorkingDirectory=/srv/vllm
300+
EnvironmentFile=/etc/vllm/<model>.env
301+
ExecStart=/srv/vllm/.venv/bin/vllm serve <model-id> \
302+
--host 0.0.0.0 --port 8001 \
303+
--enable-auto-tool-choice \
304+
--tool-call-parser <parser-name>
305+
Restart=on-failure
306+
RestartSec=5
307+
308+
[Install]
309+
WantedBy=multi-user.target
310+
```
311+
312+
The `EnvironmentFile` pattern keeps `VLLM_HTTP_TIMEOUT_KEEP_ALIVE`,
313+
`CUDA_VISIBLE_DEVICES`, `HF_HOME`, and other deploy-specific vars
314+
out of the unit file itself, which makes the unit shippable across
315+
hosts without per-machine edits. `journalctl -u vllm-<model>` is
316+
then the canonical log surface for production triage.
317+
318+
### Throughput knobs and OA concurrency
319+
320+
Three vLLM flags interact directly with how many concurrent
321+
`complete()` calls an OA graph can land before vLLM starts 429-ing:
322+
323+
- `--max-model-len`: per-request context ceiling. Lower values fit
324+
more concurrent requests in the same KV-cache budget; higher
325+
values let individual requests carry longer prompts at the cost
326+
of concurrent capacity.
327+
- `--max-num-seqs`: hard cap on concurrent sequences vLLM will
328+
schedule. Past this cap, the scheduler queues requests and, once the
329+
queue fills, returns 429 with `Retry-After`.
330+
- `--gpu-memory-utilization`: fraction of GPU VRAM vLLM may use.
331+
Higher values widen the KV-cache budget, which lets vLLM schedule
332+
closer to its `--max-num-seqs` cap before evicting in-flight
333+
sequences; the cap itself doesn't move. Tune cautiously to avoid
334+
OOM on the resident model weights.
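
The interaction between the three knobs can be approximated with back-of-the-envelope arithmetic. This is a rough worst-case sketch, assuming every request fills its full `--max-model-len` and that the KV-cache capacity in tokens is already known (for example from the server's startup log); the figures below are invented for illustration, and real workloads with shorter prompts will fit more sequences.

```python
def worst_case_concurrency(
    kv_cache_tokens: int,
    max_model_len: int,
    max_num_seqs: int,
) -> int:
    """Estimate how many full-length requests can be in flight at once."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    # How many maximum-length sequences fit in the KV-cache budget.
    fits_in_cache = kv_cache_tokens // max_model_len
    # The scheduler cap applies regardless of how much cache is free.
    return min(fits_in_cache, max_num_seqs)


if __name__ == "__main__":
    kv_tokens = 262_144  # invented KV-cache capacity, in tokens
    print(worst_case_concurrency(kv_tokens, 32_768, 256))  # 8
    print(worst_case_concurrency(kv_tokens, 8_192, 256))   # 32
    print(worst_case_concurrency(kv_tokens, 8_192, 16))    # 16
```

The first two calls show `--max-model-len` trading context length for concurrency; the third shows `--max-num-seqs` becoming the binding limit once the cache is no longer the bottleneck.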
335+
336+
OA's `OpenAIProvider` shares one connection pool across the whole
337+
graph, so a fan-out with `concurrency=N` lands N simultaneous wire
338+
calls. When `N` exceeds `--max-num-seqs` minus vLLM's other
339+
in-flight traffic, expect `ProviderRateLimit` with
340+
`retry_after` populated; wrap the LLM-calling node in
341+
`RetryMiddleware` (or set `concurrency` explicitly on the fan-out)
342+
to avoid head-of-line stalls.
343+
215344
## Behaviour to be aware of
216345

217346
- **Concurrency**: vLLM batches requests internally. `OpenAIProvider`
