pytorch
diff --git a/extension/llm/server/README.md b/extension/llm/server/README.md
diff --git a/extension/llm/server/conformance/test_openai_contract.py b/extension/llm/server/conformance/test_openai_contract.py
diff --git a/extension/llm/server/python/README.md b/extension/llm/server/python/README.md
@@ -0,0 +1,31 @@
+# ExecuTorch LLM Server
+
+OpenAI-compatible serving for ExecuTorch LLMs, so any OpenAI-compatible agent
+harness (pi, opencode, ...) can use ExecuTorch as a local backend.
+
+```
+extension/llm/server/
+  spec/          # language-neutral OpenAI contract ExecuTorch targets
+  conformance/   # one test suite every language server must pass
+  python/        # Python server implementation (current)
+  # cpp/         # future: no-Python single-binary server
+```
+
+Why this layout: the OpenAI contract is identical across languages, so the
+**spec** and **conformance** suite are shared, and each language gets its own
+implementation directory. The real cross-language reuse comes from the C++
+`LLMEngine`/`LLMSession` primitives underneath (with `TextLLMRunner` as the
+current adapter) — each server is a thin protocol shell over that engine. See
+`python/README.md` to run it.
+
+Status: experimental, reliability-first and deliberately narrow. Implemented:
+`/health`, `/v1/models`, `/v1/chat/completions` (streaming + non-streaming),
+Hugging Face chat templates (`--hf-tokenizer`), `temperature` / `max_tokens` /
+`max_completion_tokens` / `stop`, Hermes/Qwen tool calling
+(`<tool_call>...</tool_call>`, complete calls only) with `tool_choice="none"`,
+structured API errors, cancellation, and an opt-in conservative per-runner KV
+prefix cache (`--enable-prefix-cache`). Unsupported params (including `top_p`,
+`seed`, `n>1`, `reasoning_effort`, penalties, `logit_bias`, `response_format`,
+`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
+rather than silently ignored. See `python/README.md` to run it and
+`spec/README.md` for the exact contract.
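
The reject-rather-than-ignore policy in the status paragraph can be sketched as below. This is a minimal illustration, not the server's actual validation code: `UNSUPPORTED_PARAMS`, `validate_request`, and the `type`/`code` strings in the envelope are hypothetical. Only the parameter names and the 400 status come from the README, and the two penalty fields use the standard OpenAI names.

```python
from typing import Optional, Tuple

# Hypothetical set mirroring the README's list of rejected parameters.
UNSUPPORTED_PARAMS = {
    "top_p", "seed", "reasoning_effort", "presence_penalty",
    "frequency_penalty", "logit_bias", "response_format", "logprobs",
}


def validate_request(body: dict) -> Optional[Tuple[int, dict]]:
    """Return (status, error_envelope) for an unsupported request, else None."""
    # Any listed parameter that is present and non-null is rejected.
    bad = sorted(k for k in UNSUPPORTED_PARAMS if body.get(k) is not None)
    n = body.get("n")
    if n is not None and n > 1:
        bad.append("n")
    if body.get("tool_choice") == "required":
        bad.append("tool_choice")
    if not bad:
        return None
    return 400, {
        "error": {
            "message": f"unsupported parameter(s): {', '.join(bad)}",
            "type": "invalid_request_error",  # assumed value
            "param": bad[0],
            "code": "unsupported_parameter",  # assumed value
        }
    }
```

A handler would call `validate_request(body)` before touching the engine and return the envelope as the response body with the given status, so a client never gets a completion that silently dropped one of its sampling settings.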
@@ -0,0 +1,207 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""Language-neutral OpenAI-contract conformance tests.
+
+Runs against any base URL (ExecuTorch, llama.cpp, mlx-lm, ...) so every server
+implementation is validated against one shared spec. Point it at a running
+server:
+
+    OPENAI_BASE_URL=http://127.0.0.1:8000/v1 pytest test_openai_contract.py
+
+Skips automatically if no server is reachable.
+"""
+
+import json
+import os
+import urllib.error
+import urllib.request
+
+import pytest
+
+BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8000/v1").rstrip("/")
+MODEL = os.environ.get("OPENAI_MODEL", "executorch")
+
+
+def _post(path: str, body: dict, stream: bool = False):
+    req = urllib.request.Request(
+        f"{BASE_URL}{path}",
+        data=json.dumps(body).encode(),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+    return urllib.request.urlopen(req, timeout=120)
+
+
+def _server_up() -> bool:
+    try:
+        urllib.request.urlopen(f"{BASE_URL}/models", timeout=5)
+        return True
+    except Exception:
+        return False
+
+
+pytestmark = pytest.mark.skipif(
+    not _server_up(), reason="no OpenAI server at OPENAI_BASE_URL"
+)
+
+
+def test_models_listing():
+    with urllib.request.urlopen(f"{BASE_URL}/models", timeout=10) as r:
+        data = json.loads(r.read())
+    assert data["object"] == "list"
+    assert data["data"] and all("id" in m for m in data["data"])
+
+
+def test_chat_completion_nonstreaming():
+    body = {
+        "model": MODEL,
+        "messages": [{"role": "user", "content": "Say hello in one word."}],
+        "max_tokens": 16,
+        "temperature": 0.0,
+    }
+    with _post("/chat/completions", body) as r:
+        data = json.loads(r.read())
+    assert data["object"] == "chat.completion"
+    assert data["choices"][0]["message"]["role"] == "assistant"
+    assert isinstance(data["choices"][0]["message"]["content"], str)
+    assert data["choices"][0]["finish_reason"] is not None
+
+
+def test_chat_completion_streaming():
+    body = {
+        "model": MODEL,
+        "messages": [{"role": "user", "content": "Count to three."}],
+        "max_tokens": 32,
+        "stream": True,
+    }
+    saw_role = saw_content = saw_done = False
+    with _post("/chat/completions", body, stream=True) as r:
+        for raw in r:
+            line = raw.decode().strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:") :].strip()
+            if payload == "[DONE]":
+                saw_done = True
+                break
+            chunk = json.loads(payload)
+            assert chunk["object"] == "chat.completion.chunk"
+            delta = chunk["choices"][0]["delta"]
+            saw_role = saw_role or delta.get("role") == "assistant"
+            saw_content = saw_content or bool(delta.get("content"))
+    assert saw_role and saw_content and saw_done
+
+
+def test_multibyte_streaming_integrity():
+    # Byte-level BPE can split a multi-byte character across tokens; the stream
+    # must reassemble it, not abort with a UTF-8 decode error.
+    body = {
+        "model": MODEL,
+        "messages": [
+            {"role": "user", "content": "Reply with exactly: 你好世界 🌍 café"}
+        ],
+        "max_tokens": 32,
+        "temperature": 0.0,
+        "stream": True,
+    }
+    content, saw_done, saw_error = "", False, False
+    with _post("/chat/completions", body, stream=True) as r:
+        for raw in r:
+            line = raw.decode().strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:") :].strip()
+            if payload == "[DONE]":
+                saw_done = True
+                break
+            chunk = json.loads(payload)
+            if "error" in chunk:
+                saw_error = True
+            content += (
+                chunk["choices"][0]["delta"].get("content", "")
+                if chunk.get("choices")
+                else ""
+            )
+    assert saw_done and not saw_error
+    assert isinstance(content, str) and content  # reassembled, valid UTF-8
+
+
+def test_usage_chunk_in_stream():
+    body = {
+        "model": MODEL,
+        "messages": [{"role": "user", "content": "Say hi."}],
+        "max_tokens": 16,
+        "stream": True,
+        "stream_options": {"include_usage": True},
+    }
+    usage = None
+    with _post("/chat/completions", body, stream=True) as r:
+        for raw in r:
+            line = raw.decode().strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:") :].strip()
+            if payload == "[DONE]":
+                break
+            chunk = json.loads(payload)
+            if chunk.get("usage"):
+                usage = chunk["usage"]
+    assert usage is not None, "no usage chunk emitted with include_usage"
+    assert usage["prompt_tokens"] > 0 and usage["completion_tokens"] > 0
+    assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
+
+
+WEATHER_TOOL = {
+    "type": "function",
+    "function": {
+        "name": "get_weather",
+        "description": "Get the current weather for a city.",
+        "parameters": {
+            "type": "object",
+            "properties": {"city": {"type": "string"}},
+            "required": ["city"],
+        },
+    },
+}
+
+
+def test_tool_call_response_shape():
+    body = {
+        "model": MODEL,
+        "messages": [
+            {"role": "user", "content": "What is the weather in Paris? Use the tool."}
+        ],
+        "tools": [WEATHER_TOOL],
+        "max_tokens": 128,
+        "temperature": 0.0,
+    }
+    with _post("/chat/completions", body) as r:
+        data = json.loads(r.read())
+    calls = data["choices"][0]["message"].get("tool_calls")
+    assert calls, "expected tool_calls in response"
+    tc = calls[0]
+    assert tc["type"] == "function"
+    assert tc["id"]
+    assert tc["function"]["name"] == "get_weather"
+    json.loads(tc["function"]["arguments"])  # arguments is a JSON string
+    assert data["choices"][0]["finish_reason"] == "tool_calls"
+
+
+def test_error_body_shape():
+    # Over-long prompt -> structured 4xx (OpenAI error envelope), not a 500/drop.
+    body = {
+        "model": MODEL,
+        "messages": [{"role": "user", "content": "word " * 40000}],
+        "max_tokens": 8,
+    }
+    try:
+        _post("/chat/completions", body)
+        raise AssertionError("expected an HTTP error for over-long prompt")
+    except urllib.error.HTTPError as e:
+        assert 400 <= e.code < 500
+        err = json.loads(e.read())["error"]
+        assert err["message"] and err["type"]
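
The three streaming tests above repeat the same `data:` / `[DONE]` parsing loop. One way to factor it out is sketched below; `iter_sse_chunks` is a hypothetical helper, not part of this PR. It takes any iterable of byte lines, which is what iterating the `urlopen` response yields.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON payloads from `data:` lines until `[DONE]`.

    Raises AssertionError if the stream ends without the `[DONE]` sentinel.
    """
    for raw in lines:
        line = raw.decode().strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
    raise AssertionError("stream ended without [DONE]")
```

With such a helper, each test body reduces to `for chunk in iter_sse_chunks(r): ...`, and a missing `[DONE]` fails uniformly instead of being tracked with a per-test `saw_done` flag.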
@@ -0,0 +1,130 @@
+# ExecuTorch LLM Server — Python
+
+A thin OpenAI-compatible HTTP server over ExecuTorch's `LLMEngine`/`LLMSession`
+serving API (with `TextLLMRunner` as the underlying adapter).
+
+## Install
+
+```bash
+pip install -r requirements.txt
+# transformers is optional but recommended for model-correct chat templates
+pip install transformers
+```
+
+Requires an ExecuTorch build with the LLM runner pybindings
+(`EXECUTORCH_BUILD_PYBIND=ON`) so `executorch.extension.llm.runner` imports.
+
+### Model & runtime requirements
+
+LLM `.pte` files exported via `export_llm` use ExecuTorch custom/quantized ops:
+`use_sdpa_with_kv_cache` → `llama::custom_sdpa`, and quantized exports
+(`embedding_quantize`, `8da4w`, ...) → `quantized_decomposed` ops. Loading them
+is the Python-runtime equivalent of the C++ build flags in the canonical
+[Llama README](../../../../examples/models/llama/README.md)
+(`-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON`,
+`-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON`). **The server registers them
+automatically** (imports `executorch.extension.llm.custom_ops.custom_ops` and
+`executorch.kernels.quantized` before constructing the runner); without them the
+runner fails with `Missing operator ... load_method('forward') failed`.
+
+Tokenizer: pass the model's tokenizer, either `tokenizer.json` (HF, e.g. Qwen3)
+or `tokenizer.model` (Llama); the runner auto-detects the format. If you see an
+RE2 lookahead warning, tokenization falls back to PCRE2 and still works (build
+with `-DSUPPORT_REGEX_LOOKAHEAD=ON` for the native regex path).
+
+## Run
+
+```bash
+python -m executorch.extension.llm.server.python.server \
+    --model-path /path/to/model.pte \
+    --tokenizer-path /path/to/tokenizer.json \
+    --hf-tokenizer Qwen/Qwen2.5-Coder-7B-Instruct \
+    --model-id qwen2.5-coder \
+    --enable-prefix-cache \
+    --host 127.0.0.1 --port 8000
+```
+
+`--hf-tokenizer` is **required** (it applies the model's real `chat_template`)
+unless you pass `--allow-chatml-fallback` to opt into approximate generic
+ChatML. That fallback is wrong for many instruct/tool models and can't
+reproduce controls like `enable_thinking`.
+
+Key flags:
+
+| Flag | Effect |
+|------|--------|
+| `--hf-tokenizer` | model's HF chat template (required unless fallback) |
+| `--allow-chatml-fallback` | opt into approximate ChatML when no HF tokenizer |
+| `--no-think` | default `enable_thinking=False` (e.g. Qwen3) |
+| `--max-context N` | reject over-long prompts with 400 instead of failing mid-gen |
+| `--num-runners N` | *requested* physical sessions (each = one KV cache, N × memory); the actual count is clamped by the engine's `serving_capacity()` — XNNPACK self-contained `.pte` is single-slot, so N>1 is clamped to 1 and extra requests queue |
+| `--enable-prefix-cache` | opt-in turn-to-turn KV reuse (requires `--hf-tokenizer`; runs the LLMEngine/LLMSession path) |
+
+## Use from an agent harness
+
+- **opencode** (`opencode.json`):
+  ```json
+  { "provider": { "executorch": {
+      "npm": "@ai-sdk/openai-compatible",
+      "options": { "baseURL": "http://127.0.0.1:8000/v1" },
+      "models": { "qwen2.5-coder": { "name": "Qwen2.5-Coder (ExecuTorch)" } } } } }
+  ```
+- **pi** (`~/.pi/agent/models.json`):
+  ```json
+  { "providers": { "executorch": {
+      "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
+      "apiKey": "x", "models": [ { "id": "qwen2.5-coder" } ] } } }
+  ```
+
+## Validate
+
+Two layers, both contract-focused (assert on the wire, not internals):
+
+```bash
+# 1. Hermetic unit tests — fake engine, no model/GPU, fast (CI-friendly).
+pip install pytest httpx
+pytest tests/
+
+# 2. Conformance — black-box, against a LIVE server (real model, or llama.cpp/mlx-lm).
+OPENAI_BASE_URL=http://127.0.0.1:8000/v1 pytest ../conformance/test_openai_contract.py
+```
+
+`tests/` swaps in a `FakeRunner` via `RunnerPool(runner_factory=...)`, so the real
+server/protocol/streaming code is tested over HTTP without a `.pte`.
+
+## Architecture
+
+Control plane (this dir, Python): server, OpenAI protocol, chat templating,
+session routing/streaming, and prefix-reuse *policy*. Data plane (C++): the
+`LLMEngine`/`LLMSession` API owns token stepping and KV mutation (prefill/decode/
+sampling) and releases the GIL. Python depends on `LLMEngine`/`LLMSession`, not on
+`TextLLMRunner` token-step internals (`TextLLMRunner` is a legacy/direct runner
+and a C++ implementation detail behind the session adapter). How many physical
+sessions can exist without multiplying model memory is decided by
+`serving_capacity()`, not by `--num-runners`. Tensor data never crosses into
+Python element-wise.
+
+| File | Role |
+|------|------|
+| `server.py` | FastAPI app, routes, CLI entrypoint |
+| `protocol.py` | OpenAI request/response schemas |
+| `chat_template.py` | messages (+tools) → prompt string |
+| `runner_pool.py` | session pool + serving-capacity admission + affinity routing + async streaming bridge |
+| `serving_chat.py` | `/v1/chat/completions` (streaming + non-streaming, stop, tools) |
+| `prefix_cache.py` | turn-to-turn KV prefix-reuse policy over an `LLMSession` (opt-in) |
+| `tool_parsers/` | Hermes/Qwen `<tool_call>` parser only |
+
+## Scope & caveats
+
+Deliberately narrow (reliability-first): Hermes/Qwen tool calling only;
+unsupported sampling params are rejected, not ignored. `--num-runners` is a
+*request*, not a guarantee — the engine's `serving_capacity()` is authoritative,
+and an XNNPACK self-contained `.pte` is conservatively **single-slot** for v1
+(packed weights may be per-method-instance, so extra physical sessions would
+duplicate model memory): N>1 is clamped to 1 and concurrent requests queue on the
+resident session. The engine serializes backend execution across sessions (op
+kernels aren't assumed thread-safe — this is also what fixed the multi-runner
+heap corruption). Prefix cache requires the LLMSession/engine path
+(`--enable-prefix-cache` + `--hf-tokenizer`). Weight sharing across physical
+sessions on a backend that supports it (e.g. CUDA/AOTI), adaptive thinking, and
+multi-session subagents are future work.
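
The relationship between `--num-runners` and `serving_capacity()` described above can be sketched as a clamp plus a semaphore. This is an illustration under stated assumptions, not the code in `runner_pool.py`: `effective_sessions` and `run_requests` are hypothetical names, the capacity is assumed to be a plain integer, and `asyncio.sleep(0)` stands in for generation.

```python
import asyncio


def effective_sessions(requested: int, capacity: int) -> int:
    """Clamp the requested session count to what the engine reports."""
    return max(1, min(requested, capacity))


async def run_requests(num_requests: int, requested: int, capacity: int) -> int:
    """Run dummy requests and return the peak number in flight at once."""
    slots = asyncio.Semaphore(effective_sessions(requested, capacity))
    in_flight = peak = 0

    async def handle() -> None:
        nonlocal in_flight, peak
        async with slots:  # requests beyond the clamp wait here
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for generation
            in_flight -= 1

    await asyncio.gather(*(handle() for _ in range(num_requests)))
    return peak
```

With a reported capacity of 1, `asyncio.run(run_requests(5, 4, 1))` never has more than one request in flight, which matches the single-slot XNNPACK behavior: the other four requests queue rather than allocating extra KV caches.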