Commit 87e6641

extension/llm/server: worker-based OpenAI-compatible HTTP server
Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and non-streaming), /v1/models, /health. Request validation rejects parameters the server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties, top_k, logit_bias, logprobs, response_format other than text, non-positive max_tokens, tool_choice = required / specific function) instead of silently ignoring them; stop sequences are applied before tool parsing; usage is reported.

The Python process is control plane only: it loads no model and imports no runtime pybind. Model execution runs in a separate C++ worker process (cpp/text_llm_worker.cpp, over TextLLMEngine/TextLLMSession) that the control plane spawns and drives over a small JSONL protocol (worker_client.py). The protocol and the decode loop (reset, encode, context clamp, prefill, decode, UTF-8 assembly, stop handling, stats, finish_reason) live in a shared header, cpp/worker_loop.h, so model-specific workers reuse them; text_llm_worker only constructs the engine/session and runs the loop.

runner_pool is a pool of worker processes (one in-flight request per worker) with a blocking->async streaming bridge. V1 is single-slot; concurrent requests queue. There is no prefix cache and no Python-side KV state; cancellation is best-effort (the control plane stops consuming, the worker finishes the in-flight request).

Hermetic tests (a FakeRunner worker handle) cover the contract, templating, sampling params, tool calls, the pool, and the worker protocol; conformance/ is a black-box suite runnable against any live OpenAI server. READMEs document the flags and scope.

Depends on the serving foundations.

ghstack-source-id: 9e3b7b5
ghstack-comment-id: 4617263008
Pull-Request: #19994
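The commit message mentions a blocking->async streaming bridge in runner_pool but the diff shown here does not include it. The following is only a minimal sketch of one common way to build such a bridge, assuming a background thread that reads a blocking iterator and hands items to an `asyncio.Queue`. The names `bridge`, `fake_worker_tokens`, and `collect` are hypothetical and are not taken from the commit.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterator

_DONE = object()  # sentinel marking the end of the blocking stream


async def bridge(make_iter: Callable[[], Iterator[str]]) -> AsyncIterator[str]:
    """Yield items from a blocking iterator without blocking the event loop."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def pump() -> None:
        # Runs in a plain thread; each item is handed to the loop thread-safely.
        try:
            for item in make_iter():
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:  # forward failures to the async consumer
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=pump, daemon=True).start()
    while True:
        item = await queue.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def fake_worker_tokens() -> Iterator[str]:
    # Hypothetical stand-in for blocking reads of a worker's output.
    yield from ["Hel", "lo", "!"]


async def collect() -> list:
    return [tok async for tok in bridge(fake_worker_tokens)]


tokens = asyncio.run(collect())
```

In this sketch, a consumer that stops iterating early does not stop the pump thread, which loosely mirrors the best-effort cancellation described above: the control plane stops consuming while the producer runs to completion.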
1 parent 9e89df8 commit 87e6641

17 files changed

Lines changed: 3210 additions & 0 deletions

extension/llm/server/README.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# ExecuTorch LLM Server

OpenAI-compatible serving for ExecuTorch LLMs, so any OpenAI-compatible agent
harness (pi, opencode, ...) can use ExecuTorch as a local backend.

```
extension/llm/server/
  spec/          # language-neutral OpenAI contract ExecuTorch targets
  conformance/   # one test suite every language server must pass
  python/        # Python server implementation (current)
  # cpp/         # future: no-Python single-binary server
```

Why this layout: the OpenAI contract is identical across languages, so the
**spec** and **conformance** suite are shared, and each language gets its own
implementation directory. The real cross-language reuse comes from the C++
`LLMEngine`/`LLMSession` primitives underneath, packaged as a process-isolated
**worker binary** (`text_llm_worker`) that any control plane drives over a small
JSONL protocol — the server is a thin protocol shell that spawns and talks to
that worker. See `python/README.md` to run it.

Status: experimental, reliability-first and deliberately narrow. Implemented:
`/health`, `/v1/models`, `/v1/chat/completions` (streaming + non-streaming),
Hugging Face chat templates (`--hf-tokenizer`), `temperature` / `max_tokens` /
`max_completion_tokens` / `stop`, Hermes tool calling by default
(`<tool_call>...</tool_call>` JSON, complete calls only; model-specific launchers
may select the Qwen XML format) with `tool_choice="none"`,
structured API errors, and best-effort cancellation. V1 serving is single-slot
(one worker, one session) with no prefix cache; KV prefix reuse, if it returns,
lives inside the worker/session, not the control plane. Unsupported params (including `top_p`,
`seed`, `n>1`, `reasoning_effort`, penalties, `logit_bias`, `response_format`,
`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
rather than silently ignored. See `python/README.md` to run it and
`spec/README.md` for the exact contract.
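The "reject instead of silently ignore" rule can be illustrated with a small validator. This is a sketch under assumptions, not the server's actual code: the function names `find_unsupported` and `error_response` are hypothetical, the rule that any non-null value of a rejected parameter triggers an error is an assumption, and the exact contract lives in `spec/README.md`, which is not shown here. The envelope fields follow the public OpenAI error shape.

```python
from typing import Any, Dict, Optional, Tuple

# Assumption: these are rejected whenever they carry any non-null value.
REJECT_IF_SET = (
    "seed", "top_k", "logit_bias", "logprobs",
    "frequency_penalty", "presence_penalty", "reasoning_effort",
)


def find_unsupported(body: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (param, message) for the first unsupported setting, else None."""
    for name in REJECT_IF_SET:
        if body.get(name) is not None:
            return name, f"{name} is not supported"
    if body.get("top_p") not in (None, 1):
        return "top_p", "top_p must be 1 or omitted"
    if (body.get("n") or 1) > 1:
        return "n", "n > 1 is not supported"
    fmt = body.get("response_format")
    if fmt is not None and fmt.get("type") != "text":
        return "response_format", "only response_format type 'text' is supported"
    # Assumption: max_completion_tokens follows the same rule as max_tokens.
    for name in ("max_tokens", "max_completion_tokens"):
        value = body.get(name)
        if value is not None and value <= 0:
            return name, f"{name} must be positive"
    choice = body.get("tool_choice")
    if choice == "required" or isinstance(choice, dict):
        return "tool_choice", "only tool_choice 'auto' or 'none' is supported"
    return None


def error_response(param: str, message: str) -> Tuple[int, Dict[str, Any]]:
    """Build a 400 status plus an OpenAI-style error envelope."""
    envelope = {
        "error": {
            "message": message,
            "type": "invalid_request_error",
            "param": param,
            "code": None,
        }
    }
    return 400, envelope
```

A handler built this way would call `find_unsupported(body)` before touching the worker and return `error_response(...)` on the first hit, so a client never receives output generated with a parameter that was quietly dropped.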
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Language-neutral OpenAI-contract conformance tests.

Runs against any base URL (ExecuTorch, llama.cpp, mlx-lm, ...) so every server
implementation is validated against one shared spec. Point it at a running
server:

    OPENAI_BASE_URL=http://127.0.0.1:8000/v1 pytest test_openai_contract.py

Skips automatically if no server is reachable.
"""

import json
import os
import urllib.error
import urllib.request

import pytest

BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8000/v1").rstrip("/")
MODEL = os.environ.get("OPENAI_MODEL", "executorch")


def _post(path: str, body: dict, stream: bool = False):
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req, timeout=120)


def _server_up() -> bool:
    try:
        urllib.request.urlopen(f"{BASE_URL}/models", timeout=5)
        return True
    except Exception:
        return False


pytestmark = pytest.mark.skipif(
    not _server_up(), reason="no OpenAI server at OPENAI_BASE_URL"
)


def test_models_listing():
    with urllib.request.urlopen(f"{BASE_URL}/models", timeout=10) as r:
        data = json.loads(r.read())
    assert data["object"] == "list"
    assert any("id" in m for m in data["data"])


def test_chat_completion_nonstreaming():
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16,
        "temperature": 0.0,
    }
    with _post("/chat/completions", body) as r:
        data = json.loads(r.read())
    assert data["object"] == "chat.completion"
    assert data["choices"][0]["message"]["role"] == "assistant"
    assert isinstance(data["choices"][0]["message"]["content"], str)
    assert data["choices"][0]["finish_reason"] is not None


def test_chat_completion_streaming():
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Count to three."}],
        "max_tokens": 32,
        "stream": True,
    }
    saw_role = saw_content = saw_done = False
    with _post("/chat/completions", body, stream=True) as r:
        for raw in r:
            line = raw.decode().strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:") :].strip()
            if payload == "[DONE]":
                saw_done = True
                break
            chunk = json.loads(payload)
            assert chunk["object"] == "chat.completion.chunk"
            delta = chunk["choices"][0]["delta"]
            saw_role = saw_role or delta.get("role") == "assistant"
            saw_content = saw_content or bool(delta.get("content"))
    assert saw_role and saw_content and saw_done


def test_multibyte_streaming_integrity():
    # Byte-level BPE can split a multi-byte character across tokens; the stream
    # must reassemble it, not abort with a UTF-8 decode error.
    body = {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Reply with exactly: 你好世界 🌍 café"}
        ],
        "max_tokens": 32,
        "temperature": 0.0,
        "stream": True,
    }
    content, saw_done, saw_error = "", False, False
    with _post("/chat/completions", body, stream=True) as r:
        for raw in r:
            line = raw.decode().strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:") :].strip()
            if payload == "[DONE]":
                saw_done = True
                break
            chunk = json.loads(payload)
            if "error" in chunk:
                saw_error = True
            content += (
                chunk["choices"][0]["delta"].get("content", "")
                if chunk.get("choices")
                else ""
            )
    assert saw_done and not saw_error
    assert isinstance(content, str) and content  # reassembled, valid UTF-8


def test_usage_chunk_in_stream():
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hi."}],
        "max_tokens": 16,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    usage = None
    with _post("/chat/completions", body, stream=True) as r:
        for raw in r:
            line = raw.decode().strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:") :].strip()
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            if chunk.get("usage"):
                usage = chunk["usage"]
    assert usage is not None, "no usage chunk emitted with include_usage"
    assert usage["prompt_tokens"] > 0 and usage["completion_tokens"] > 0
    assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]


WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def test_tool_call_response_shape():
    body = {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "What is the weather in Paris? Use the tool."}
        ],
        "tools": [WEATHER_TOOL],
        "max_tokens": 128,
        "temperature": 0.0,
    }
    with _post("/chat/completions", body) as r:
        data = json.loads(r.read())
    calls = data["choices"][0]["message"].get("tool_calls")
    assert calls, "expected tool_calls in response"
    tc = calls[0]
    assert tc["type"] == "function"
    assert tc["id"]
    assert tc["function"]["name"] == "get_weather"
    json.loads(tc["function"]["arguments"])  # arguments is a JSON string
    assert data["choices"][0]["finish_reason"] == "tool_calls"


def test_error_body_shape():
    # Over-long prompt -> structured 400 (OpenAI error envelope), not a 500/drop.
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "word " * 40000}],
        "max_tokens": 8,
    }
    try:
        _post("/chat/completions", body)
        raise AssertionError("expected an HTTP error for over-long prompt")
    except urllib.error.HTTPError as e:
        assert 400 <= e.code < 500
        err = json.loads(e.read())["error"]
        assert err["message"] and err["type"]
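The comment in `test_multibyte_streaming_integrity` notes that byte-level BPE can split a multi-byte character across tokens. The commit places UTF-8 assembly in the C++ worker loop, which is not shown in this diff. As a rough Python illustration of the same idea, the sketch below uses the standard library's incremental decoder to hold back a trailing partial character until its remaining bytes arrive. The name `assemble_utf8` and the sample byte split are hypothetical.

```python
import codecs
from typing import Iterable, Iterator


def assemble_utf8(pieces: Iterable[bytes]) -> Iterator[str]:
    """Yield only complete characters from byte pieces that may split one."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for piece in pieces:
        # The decoder buffers an incomplete trailing sequence internally.
        text = decoder.decode(piece)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


raw = "你好 🌍".encode("utf-8")
# Split inside the first character to mimic a token boundary mid-codepoint.
pieces = [raw[:2], raw[2:7], raw[7:]]
chunks = list(assemble_utf8(pieces))
```

Decoding each piece independently with `bytes.decode()` would raise on the first two-byte fragment, which is the failure mode the test guards against.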
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Generic model-execution worker for standard .pte TextLLM models. One binary,
# no registry/factory: it constructs TextLLMEngine/TextLLMSession directly and
# speaks the JSONL worker protocol (worker_client.py). Model execution is C++
# only — the Python server is HTTP/control plane.
#
# Build like the example runners (standalone), e.g. from this directory:
#
#   cmake -S . -B <executorch-cmake-out>/extension/llm/server/cpp \
#     -DCMAKE_PREFIX_PATH=<executorch-cmake-out> -DEXECUTORCH_BUILD_XNNPACK=ON
#   cmake --build <...>/extension/llm/server/cpp --target text_llm_worker

cmake_minimum_required(VERSION 3.24)
project(llm_server_workers)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(EXECUTORCH_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../../../..)

include(${EXECUTORCH_ROOT}/tools/cmake/Utils.cmake)

set(_common_include_directories ${EXECUTORCH_ROOT}/..)
# Vendored single-include nlohmann/json for the worker protocol (no new dep).
set(_json_include
    ${EXECUTORCH_ROOT}/extension/llm/tokenizers/third-party/json/single_include
)

# gflags
set(gflags_DIR ${CMAKE_CURRENT_BINARY_DIR}/../../../../third-party/gflags)
find_package(gflags REQUIRED)

# executorch
list(APPEND CMAKE_FIND_ROOT_PATH ${CMAKE_CURRENT_BINARY_DIR}/../../../..)
find_package(executorch CONFIG REQUIRED FIND_ROOT_PATH_BOTH)
executorch_target_link_options_shared_lib(executorch)

set(link_libraries executorch gflags)

# CPU ops
list(APPEND link_libraries optimized_native_cpu_ops_lib cpublas eigen_blas)
executorch_target_link_options_shared_lib(optimized_native_cpu_ops_lib)

# Custom + quantized kernels that export_llm models need, whole-archived so the
# static op registrations survive the linker: llama::custom_sdpa (from
# use_sdpa_with_kv_cache) and quantized_decomposed ops (from quantized exports).
# Without these the model loads but execution fails with "Missing operator".
if(TARGET custom_ops)
  executorch_target_link_options_shared_lib(custom_ops)
  list(APPEND link_libraries custom_ops)
endif()
if(TARGET quantized_ops_lib)
  list(APPEND link_libraries quantized_kernels quantized_ops_lib)
  executorch_target_link_options_shared_lib(quantized_ops_lib)
endif()

# Extensions (Engine/Session lives in extension_llm_runner)
list(
  APPEND
  link_libraries
  extension_llm_runner
  extension_module
  extension_data_loader
  extension_tensor
  extension_flat_tensor
)

# XNNPACK: the standard CPU backend for normal .pte TextLLM models.
list(APPEND link_libraries xnnpack_backend)
executorch_target_link_options_shared_lib(xnnpack_backend)

# Tokenizer
list(APPEND link_libraries tokenizers::tokenizers)

add_executable(text_llm_worker text_llm_worker.cpp)
target_include_directories(
  text_llm_worker PUBLIC ${_common_include_directories} ${_json_include}
)
target_link_libraries(text_llm_worker PUBLIC ${link_libraries})

if(NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
  target_link_options_gc_sections(text_llm_worker)
  target_link_options(text_llm_worker PRIVATE "LINKER:-s")
endif()
