Commit 3238080

extension/llm/server: OpenAI-compatible HTTP server
Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and non-streaming), /v1/models, /health.

Request validation rejects parameters the server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties, top_k, logit_bias, logprobs, response_format other than text, non-positive max_tokens, tool_choice = required / specific function) instead of silently ignoring them; stop sequences are applied before tool parsing; client cancellation calls runner.stop(); usage is reported.

runner_pool admits physical sessions per the engine's serving_capacity() (single-slot on XNNPACK, with concurrent requests queueing on the resident session) and routes by prefix affinity.

Hermetic tests (FakeRunner via dependency injection) cover the contract, templating, sampling params, tool calls and the pool; conformance/ is a black-box suite runnable against any live OpenAI server. READMEs document the flags and scope.

Last of four stacked commits; depends on the bindings and serving foundations.

ghstack-source-id: 83e480a
ghstack-comment-id: 4617263008
Pull-Request: #19994
1 parent c326f1a commit 3238080

12 files changed

Lines changed: 2492 additions & 0 deletions

File tree

extension/llm/server/README.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# ExecuTorch LLM Server

OpenAI-compatible serving for ExecuTorch LLMs, so any OpenAI-compatible agent
harness (pi, opencode, ...) can use ExecuTorch as a local backend.

```
extension/llm/server/
  spec/          # language-neutral OpenAI contract ExecuTorch targets
  conformance/   # one test suite every language server must pass
  python/        # Python server implementation (current)
  # cpp/         # future: no-Python single-binary server
```

Why this layout: the OpenAI contract is identical across languages, so the
**spec** and **conformance** suite are shared, and each language gets its own
implementation directory. The real cross-language reuse comes from the C++
`LLMEngine`/`LLMSession` primitives underneath (with `TextLLMRunner` as the
current adapter); each server is a thin protocol shell over that engine. See
`python/README.md` to run it.

Status: experimental, reliability-first and deliberately narrow. Implemented:
`/health`, `/v1/models`, `/v1/chat/completions` (streaming + non-streaming),
Hugging Face chat templates (`--hf-tokenizer`), `temperature` / `max_tokens` /
`max_completion_tokens` / `stop`, Hermes/Qwen tool calling
(`<tool_call>...</tool_call>`, complete calls only) with `tool_choice="none"`,
structured API errors, cancellation, and an opt-in conservative per-runner KV
prefix cache (`--enable-prefix-cache`). Unsupported params (including `top_p`,
`seed`, `n>1`, `reasoning_effort`, penalties, `logit_bias`, `response_format`,
`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
rather than silently ignored. See `python/README.md` to run it and
`spec/README.md` for the exact contract.
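
The reject-rather-than-ignore policy can be illustrated with a minimal sketch. The rule table and the name `first_unsupported` are hypothetical and are not taken from the server's code; the sketch only mirrors the parameters listed in this README and the commit message, and it assumes an OpenAI-style error envelope with type `invalid_request_error`. Treating a penalty of exactly 0 as acceptable is also an assumption.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical rule table: each predicate returns True when the supplied
# value is one the server could not honor. Fields set to None count as absent.
_RULES: Dict[str, Callable[[Any], bool]] = {
    "top_p": lambda v: v != 1,
    "seed": lambda v: True,
    "n": lambda v: v > 1,
    "reasoning_effort": lambda v: True,
    "frequency_penalty": lambda v: v != 0,  # assumption: 0 is a no-op
    "presence_penalty": lambda v: v != 0,   # assumption: 0 is a no-op
    "top_k": lambda v: True,
    "logit_bias": lambda v: bool(v),
    "logprobs": lambda v: bool(v),
    "response_format": lambda v: v.get("type") != "text",
    "max_tokens": lambda v: v <= 0,
    "tool_choice": lambda v: v == "required" or isinstance(v, dict),
}


def first_unsupported(body: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return an error envelope for the first offending field, or None."""
    for name, is_bad in _RULES.items():
        value = body.get(name)
        if value is not None and is_bad(value):
            return {
                "error": {
                    "message": f"Unsupported parameter: {name}={value!r}",
                    "type": "invalid_request_error",
                    "param": name,
                    "code": None,
                }
            }
    return None
```

A server built this way would return the envelope with HTTP status 400 before any generation starts, so a client never receives output that silently ignored one of its settings.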
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Language-neutral OpenAI-contract conformance tests.

Runs against any base URL (ExecuTorch, llama.cpp, mlx-lm, ...) so every server
implementation is validated against one shared spec. Point it at a running
server:

    OPENAI_BASE_URL=http://127.0.0.1:8000/v1 pytest test_openai_contract.py

Skips automatically if no server is reachable.
"""

import json
import os
import urllib.error
import urllib.request

import pytest

BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8000/v1").rstrip("/")
MODEL = os.environ.get("OPENAI_MODEL", "executorch")


def _post(path: str, body: dict, stream: bool = False):
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req, timeout=120)


def _server_up() -> bool:
    try:
        urllib.request.urlopen(f"{BASE_URL}/models", timeout=5)
        return True
    except Exception:
        return False


pytestmark = pytest.mark.skipif(
    not _server_up(), reason="no OpenAI server at OPENAI_BASE_URL"
)


def test_models_listing():
    with urllib.request.urlopen(f"{BASE_URL}/models", timeout=10) as r:
        data = json.loads(r.read())
    assert data["object"] == "list"
    assert any("id" in m for m in data["data"])


def test_chat_completion_nonstreaming():
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16,
        "temperature": 0.0,
    }
    with _post("/chat/completions", body) as r:
        data = json.loads(r.read())
    assert data["object"] == "chat.completion"
    assert data["choices"][0]["message"]["role"] == "assistant"
    assert isinstance(data["choices"][0]["message"]["content"], str)
    assert data["choices"][0]["finish_reason"] is not None


def test_chat_completion_streaming():
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Count to three."}],
        "max_tokens": 32,
        "stream": True,
    }
    saw_role = saw_content = saw_done = False
    with _post("/chat/completions", body, stream=True) as r:
        for raw in r:
            line = raw.decode().strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:") :].strip()
            if payload == "[DONE]":
                saw_done = True
                break
            chunk = json.loads(payload)
            assert chunk["object"] == "chat.completion.chunk"
            delta = chunk["choices"][0]["delta"]
            saw_role = saw_role or delta.get("role") == "assistant"
            saw_content = saw_content or bool(delta.get("content"))
    assert saw_role and saw_content and saw_done


def test_multibyte_streaming_integrity():
    # Byte-level BPE can split a multi-byte character across tokens; the stream
    # must reassemble it, not abort with a UTF-8 decode error.
    body = {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Reply with exactly: 你好世界 🌍 café"}
        ],
        "max_tokens": 32,
        "temperature": 0.0,
        "stream": True,
    }
    content, saw_done, saw_error = "", False, False
    with _post("/chat/completions", body, stream=True) as r:
        for raw in r:
            line = raw.decode().strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:") :].strip()
            if payload == "[DONE]":
                saw_done = True
                break
            chunk = json.loads(payload)
            if "error" in chunk:
                saw_error = True
            # delta.content may be absent or null (e.g. a role-only chunk),
            # so coerce it to "" before concatenating.
            content += (
                (chunk["choices"][0]["delta"].get("content") or "")
                if chunk.get("choices")
                else ""
            )
    assert saw_done and not saw_error
    assert isinstance(content, str) and content  # reassembled, valid UTF-8


def test_usage_chunk_in_stream():
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hi."}],
        "max_tokens": 16,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    usage = None
    with _post("/chat/completions", body, stream=True) as r:
        for raw in r:
            line = raw.decode().strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:") :].strip()
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            if chunk.get("usage"):
                usage = chunk["usage"]
    assert usage is not None, "no usage chunk emitted with include_usage"
    assert usage["prompt_tokens"] > 0 and usage["completion_tokens"] > 0
    assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]


WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def test_tool_call_response_shape():
    body = {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "What is the weather in Paris? Use the tool."}
        ],
        "tools": [WEATHER_TOOL],
        "max_tokens": 128,
        "temperature": 0.0,
    }
    with _post("/chat/completions", body) as r:
        data = json.loads(r.read())
    calls = data["choices"][0]["message"].get("tool_calls")
    assert calls, "expected tool_calls in response"
    tc = calls[0]
    assert tc["type"] == "function"
    assert tc["id"]
    assert tc["function"]["name"] == "get_weather"
    json.loads(tc["function"]["arguments"])  # arguments is a JSON string
    assert data["choices"][0]["finish_reason"] == "tool_calls"


def test_error_body_shape():
    # Over-long prompt -> structured 400 (OpenAI error envelope), not a 500/drop.
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "word " * 40000}],
        "max_tokens": 8,
    }
    try:
        _post("/chat/completions", body)
        raise AssertionError("expected an HTTP error for over-long prompt")
    except urllib.error.HTTPError as e:
        assert 400 <= e.code < 500
        err = json.loads(e.read())["error"]
        assert err["message"] and err["type"]
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# ExecuTorch LLM Server (Python)

A thin OpenAI-compatible HTTP server over ExecuTorch's `LLMEngine`/`LLMSession`
serving API (with `TextLLMRunner` as the underlying adapter).

## Install

```bash
pip install -r requirements.txt
# transformers is optional but recommended for model-correct chat templates
pip install transformers
```

Requires an ExecuTorch build with the LLM runner pybindings
(`EXECUTORCH_BUILD_PYBIND=ON`) so `executorch.extension.llm.runner` imports.

### Model & runtime requirements

LLM `.pte` files exported via `export_llm` use ExecuTorch custom/quantized ops:
`use_sdpa_with_kv_cache` → `llama::custom_sdpa`, and quantized exports
(`embedding_quantize`, `8da4w`, ...) → `quantized_decomposed` ops. These are the
Python-runtime equivalent of the C++ build flags in the canonical
[Llama README](../../../../examples/models/llama/README.md)
(`-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON`,
`-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON`). **The server registers them
automatically** (imports `executorch.extension.llm.custom_ops.custom_ops` and
`executorch.kernels.quantized` before constructing the runner); without them the
runner fails with `Missing operator ... load_method('forward') failed`.

Tokenizer: pass the model's tokenizer, either `tokenizer.json` (HF, e.g. Qwen3) or
`tokenizer.model` (Llama); the runner auto-detects. If you see an RE2 lookahead
warning, it falls back to PCRE2 and still works (build with
`-DSUPPORT_REGEX_LOOKAHEAD=ON` for the native regex path).

## Run

```bash
python -m executorch.extension.llm.server.python.server \
  --model-path /path/to/model.pte \
  --tokenizer-path /path/to/tokenizer.bin \
  --hf-tokenizer Qwen/Qwen2.5-Coder-7B-Instruct \
  --model-id qwen2.5-coder \
  --enable-prefix-cache \
  --host 127.0.0.1 --port 8000
```

`--hf-tokenizer` is **required** (it applies the model's real `chat_template`)
unless you pass `--allow-chatml-fallback` to opt into approximate generic ChatML,
which is wrong for many instruct/tool models and can't reproduce controls like
`enable_thinking`.

Key flags:

| Flag | Effect |
|------|--------|
| `--hf-tokenizer` | model's HF chat template (required unless fallback) |
| `--allow-chatml-fallback` | opt into approximate ChatML when no HF tokenizer |
| `--no-think` | default `enable_thinking=False` (e.g. Qwen3) |
| `--max-context N` | reject over-long prompts with 400 instead of failing mid-gen |
| `--num-runners N` | *requested* physical sessions (each = one KV cache, N × memory); the actual count is clamped by the engine's `serving_capacity()`: XNNPACK self-contained `.pte` is single-slot, so N>1 is clamped to 1 and extra requests queue |
| `--enable-prefix-cache` | opt-in turn-to-turn KV reuse (requires `--hf-tokenizer`; runs the LLMEngine/LLMSession path) |

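
How `--num-runners` interacts with capacity can be pictured with a small sketch. `effective_sessions`, `peak_in_flight`, and the semaphore-based queue are illustrative assumptions, not the code in `runner_pool.py`; the only behavior taken from the table is that the requested count is clamped by `serving_capacity()` and that extra requests wait for a free session.

```python
import asyncio


def effective_sessions(requested: int, serving_capacity: int) -> int:
    """Clamp the requested session count to the engine-reported capacity.

    The floor of 1 is an assumption made for this sketch.
    """
    return max(1, min(requested, serving_capacity))


async def peak_in_flight(requested: int, capacity: int, n_requests: int) -> int:
    """Run n_requests concurrently and return the peak number being served."""
    slots = asyncio.Semaphore(effective_sessions(requested, capacity))
    in_flight = 0
    peak = 0

    async def handle() -> None:
        nonlocal in_flight, peak
        async with slots:  # extra requests wait here until a session frees up
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for generation
            in_flight -= 1

    await asyncio.gather(*(handle() for _ in range(n_requests)))
    return peak
```

With a single-slot capacity, `asyncio.run(peak_in_flight(4, 1, 5))` never has more than one request in generation at a time, regardless of the requested count.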
## Use from an agent harness

- **opencode** (`opencode.json`):
  ```json
  { "provider": { "executorch": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://127.0.0.1:8000/v1" },
      "models": { "qwen2.5-coder": { "name": "Qwen2.5-Coder (ExecuTorch)" } } } } }
  ```
- **pi** (`~/.pi/agent/models.json`):
  ```json
  { "providers": { "executorch": {
      "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
      "apiKey": "x", "models": [ { "id": "qwen2.5-coder" } ] } } }
  ```

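
Underneath either harness, the client reads the OpenAI streaming wire format. The sketch below shows how a client might reassemble streamed text from `data:` lines; `collect_stream_text` is a hypothetical helper, and the sample lines are hand-written for illustration rather than captured from a real server.

```python
import json
from typing import Iterable, List


def collect_stream_text(lines: Iterable[bytes]) -> str:
    """Concatenate delta.content from server-sent-event lines until [DONE]."""
    parts: List[str] = []
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # content may be missing or null (e.g. a role-only chunk)
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Hand-written sample stream, for illustration only.
sample = [
    b'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"}}]}\n',
    b"\n",
    b'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}\n',
    b"\n",
    b'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"}}]}\n',
    b"\n",
    b"data: [DONE]\n",
]
text = collect_stream_text(sample)
```

Against a live server, the same function can be given the response object returned by `urllib.request.urlopen`, which yields the body line by line as bytes.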
## Validate

Two layers, both contract-focused (assert on the wire, not internals):

```bash
# 1. Hermetic unit tests: fake engine, no model/GPU, fast (CI-friendly).
pip install pytest httpx
pytest tests/

# 2. Conformance: black-box, against a LIVE server (real model, or llama.cpp/mlx-lm).
OPENAI_BASE_URL=http://127.0.0.1:8000/v1 pytest ../conformance/test_openai_contract.py
```

`tests/` swaps in a `FakeRunner` via `RunnerPool(runner_factory=...)`, so the real
server/protocol/streaming code is tested over HTTP without a `.pte`.

## Architecture

Control plane (this dir, Python): server, OpenAI protocol, chat templating,
session routing/streaming, and prefix-reuse *policy*. Data plane (C++): the
`LLMEngine`/`LLMSession` API owns token stepping and KV mutation (prefill/decode/
sampling) and releases the GIL. Python depends on `LLMEngine`/`LLMSession`, not on
`TextLLMRunner` token-step internals (`TextLLMRunner` is a legacy/direct runner
and a C++ implementation detail behind the session adapter). How many physical
sessions can exist without multiplying model memory is decided by
`serving_capacity()`, not by `--num-runners`. Tensor data never crosses into
Python element-wise.

| File | Role |
|------|------|
| `server.py` | FastAPI app, routes, CLI entrypoint |
| `protocol.py` | OpenAI request/response schemas |
| `chat_template.py` | messages (+tools) → prompt string |
| `runner_pool.py` | session pool + serving-capacity admission + affinity routing + async streaming bridge |
| `serving_chat.py` | `/v1/chat/completions` (streaming + non-streaming, stop, tools) |
| `prefix_cache.py` | turn-to-turn KV prefix-reuse policy over an `LLMSession` (opt-in) |
| `tool_parsers/` | Hermes/Qwen `<tool_call>` parser only |

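
The affinity routing listed for `runner_pool.py` can be sketched as picking the session whose cached tokens share the longest prefix with the incoming prompt. This illustrates the general technique under assumptions and is not the pool's actual algorithm; `common_prefix_len` and `pick_session` are hypothetical names, and a real router would also have to account for sessions that are busy.

```python
from typing import Dict, List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_session(prompt_tokens: Sequence[int], cached: Dict[str, List[int]]) -> str:
    """Return the id of the session with the longest reusable prefix.

    Ties go to the first session in dict order, because max() keeps the
    first maximal element it encounters.
    """
    return max(cached, key=lambda sid: common_prefix_len(prompt_tokens, cached[sid]))


# Hypothetical token ids already resident in each session's KV cache.
cached = {"s0": [1, 2, 3, 9], "s1": [1, 2, 3, 4, 5], "s2": []}
chosen = pick_session([1, 2, 3, 4, 5, 6], cached)
```

Routing a follow-up turn to the session that already holds most of its prompt lets the prefix cache skip prefill for the shared tokens.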
## Scope & caveats

Deliberately narrow (reliability-first): Hermes/Qwen tool calling only;
unsupported sampling params are rejected, not ignored. `--num-runners` is a
*request*, not a guarantee; the engine's `serving_capacity()` is authoritative,
and an XNNPACK self-contained `.pte` is conservative **single-slot** for v1
(packed weights may be per-method-instance, so extra physical sessions would
duplicate model memory): N>1 is clamped to 1 and concurrent requests queue on the
resident session. The engine serializes backend execution across sessions (op
kernels aren't assumed thread-safe; this is also what fixed the multi-runner
heap corruption). Prefix cache requires the LLMSession/engine path
(`--enable-prefix-cache` + `--hf-tokenizer`). Weight sharing across physical
sessions on a backend that supports it (e.g. CUDA/AOTI), adaptive thinking, and
multi-session subagents are future work.
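
To make the "Hermes/Qwen tool calling only" scope concrete, here is a minimal sketch of extracting complete `<tool_call>...</tool_call>` blocks and shaping them like OpenAI `tool_calls` entries. It is not the parser in `tool_parsers/`; `parse_tool_calls`, the regular expression, and the `call_` id format are assumptions made for illustration. Unterminated blocks are left untouched, in line with the "complete calls only" behavior described for the server.

```python
import json
import re
import uuid
from typing import Any, Dict, List, Tuple

# Matches one complete block whose body is a JSON object.
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Return (remaining text, tool calls) for complete blocks only."""
    calls: List[Dict[str, Any]] = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            obj = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed JSON: not reported as a call
        if not isinstance(obj, dict) or "name" not in obj:
            continue
        calls.append(
            {
                "id": f"call_{uuid.uuid4().hex[:8]}",  # hypothetical id format
                "type": "function",
                "function": {
                    "name": obj["name"],
                    # OpenAI clients expect arguments as a JSON string
                    "arguments": json.dumps(obj.get("arguments", {})),
                },
            }
        )
    remaining = _TOOL_CALL_RE.sub("", text).strip()
    return remaining, calls
```

For example, the text `Sure.` followed by a block containing `{"name": "get_weather", "arguments": {"city": "Paris"}}` yields the remaining text `Sure.` and one call whose `arguments` field decodes back to `{"city": "Paris"}`.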
