Extend docs/model-providers/vllm.md with cross-cutting gotchas
surfaced in real production work. The "Tool calling" section grows
a --tool-call-parser family table (verified against vLLM's docs:
Llama 3.x, Llama 4, Mistral, Hermes, Qwen3, DeepSeek V3, GPT-OSS)
plus explicit not-supported callouts for Anthropic / Gemini
(proprietary cloud) and mainstream Gemma (no parser ships).
A new "Production deployment" H2 covers the three gotchas:
- VLLM_HTTP_TIMEOUT_KEEP_ALIVE: vLLM's stock 5s uvicorn keep-alive
  lapses pooled OA-side httpx connections and surfaces as
  ProviderUnavailable; widen to roughly 300s. Includes the
  reverse-proxy variant of the same rule.
- systemd unit skeleton: structural, no model-specific paths; uses
  EnvironmentFile so the unit ships across hosts.
- Throughput knobs (--max-model-len, --max-num-seqs,
  --gpu-memory-utilization) framed OA-side: when fan-out
  concurrency exceeds the cap, expect ProviderRateLimit; wrap
  the LLM-calling node in RetryMiddleware.
Docs-only; no code or test changes. CHANGELOG bullet added under
[Unreleased] ### Added.
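
The retry advice in the third gotcha can be illustrated with a minimal, stdlib-only sketch. It is not OA's `RetryMiddleware`, whose interface this commit does not show: `call_with_retry` is a hypothetical helper, and the `ProviderRateLimit` class below is a local stand-in that only borrows the name from the message above. The attempt count and delays are illustrative assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ProviderRateLimit(Exception):
    """Hypothetical local stand-in for the error named in the commit message."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ProviderRateLimit with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ProviderRateLimit:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The base delay doubles on each attempt; the added jitter keeps
            # concurrent fan-out branches from retrying in lockstep.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a fan-out graph the same wrapper would sit around whichever callable issues the LLM request.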
CHANGELOG.md: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The
### Added
+- **vLLM production deployment notes.** `docs/model-providers/vllm.md` grows a "Production deployment" section covering the `VLLM_HTTP_TIMEOUT_KEEP_ALIVE` gotcha (vLLM's stock 5s uvicorn keep-alive lapses pooled OA-side httpx connections and surfaces as `ProviderUnavailable`; widen to roughly 300s), a systemd unit skeleton, and the three throughput knobs that interact with OA's shared connection pool (`--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`). The existing "Tool calling" section grows a `--tool-call-parser` family table verified against vLLM's docs (Llama 3.x / Llama 4 / Mistral / Hermes / Qwen3 / DeepSeek V3 / GPT-OSS), plus explicit "not supported here" callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma (no vLLM parser).
- **Three new patterns docs.** `docs/patterns/state-migration-on-resume.md`, `docs/patterns/caller-supplied-trace-identifiers.md`, and `docs/patterns/observer-state-reconciliation.md` graduate the corresponding entries from `docs/agent/non-obvious-shapes.md` into full pattern recipes with code snippets and "when this is right / when it isn't" guidance. The programmatic patterns API (`openarmature.patterns.list()` / `get(name)`) grows from 4 to 7 entries.
- **HyperDX OTel integration test path and "Production swap" docs in example 03.** `examples/03-observer-hooks/main.py`'s module docstring grows a "Production swap" section showing how to substitute the demo's `SimpleSpanProcessor` + `ConsoleSpanExporter` for `BatchSpanProcessor` + `OTLPSpanExporter` pointed at HyperDX (or any other OTLP-HTTP collector). A new opt-in integration test (`tests/integration/test_otel_hyperdx_export.py`, gated by `HYPERDX_API_KEY` + `HYPERDX_OTLP_ENDPOINT` env vars and `@pytest.mark.integration`) drives the same production export path end-to-end against a live endpoint. `opentelemetry-exporter-otlp-proto-http` lands as a dev-only dep; not promoted to a public extras group yet.
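
For the vLLM bullet added in this diff, the environment variable and the three throughput flags could be assembled into a launch configuration roughly as follows. This is a sketch, not the content of the new docs section: `build_vllm_launch` is a hypothetical helper, the model name is a placeholder supplied by the caller, and every default except the roughly 300s keep-alive is an illustrative assumption rather than a recommendation.

```python
import shlex


def build_vllm_launch(
    model: str,
    keep_alive_s: int = 300,
    max_model_len: int = 8192,             # illustrative value, not a recommendation
    max_num_seqs: int = 64,                # illustrative value, not a recommendation
    gpu_memory_utilization: float = 0.90,  # illustrative value, not a recommendation
) -> tuple[dict[str, str], list[str]]:
    """Return (environment overrides, argv) for a `vllm serve` process."""
    if keep_alive_s <= 0:
        raise ValueError("keep_alive_s must be positive")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    # Widening the server-side keep-alive keeps pooled client connections
    # from being closed underneath the caller between requests.
    env = {"VLLM_HTTP_TIMEOUT_KEEP_ALIVE": str(keep_alive_s)}
    argv = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    return env, argv


def render_env_file(env: dict[str, str]) -> str:
    """Render overrides as KEY=value lines, the shape an EnvironmentFile uses."""
    return "".join(f"{key}={value}\n" for key, value in sorted(env.items()))


env, argv = build_vllm_launch("placeholder-model")
print(render_env_file(env), end="")
print(shlex.join(argv))
```

Keeping the environment separate from the argument list mirrors the systemd skeleton's use of `EnvironmentFile`: host-specific values live in the file while the command line stays the same across hosts.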