extension/llm/server: document pi integration

mergennachin · mergennachin · commit 33f31ab77b27 · 2026-06-08T12:27:54.000-07:00
Add an operational recipe to the server README for pointing pi (or any OpenAI-compatible harness) at the ExecuTorch server for local tool-use: the launch command, useful flags (--no-think / --max-context / --allow-chatml-fallback), client base_url/model/api_key settings, the supported chat-completions + Hermes/Qwen tool-call contract (only tool_choice auto/none/unset; response_format/logprobs/top_p!=1/seed rejected), and reliability guidance. Docs only; no runtime or dependency changes. Part of #20001 ghstack-source-id: c141a21 ghstack-comment-id: 4617420672 Pull-Request: #19999
diff --git a/extension/llm/server/README.md b/extension/llm/server/README.md
@@ -32,3 +32,61 @@ lives inside the worker/session, not the control plane. Unsupported params (incl
 `logprobs`, and `tool_choice="required"`) are rejected with a structured 400
 rather than silently ignored. See `python/README.md` to run it and
 `spec/README.md` for the exact contract.
+
+## Use from pi (or any OpenAI-compatible harness)
+
+Point pi at the server to use ExecuTorch as a local backend for tool-use
+workflows. Launch the server:
+
+```bash
+python -m executorch.extension.llm.server.python.server \
+  --model-path <model.pte> \
+  --tokenizer-path <tokenizer.model-or-json> \
+  --hf-tokenizer <hf-model-or-local-dir> \
+  --model-id <model-id> \
+  --host 127.0.0.1 \
+  --port 8000
+```
+
+Useful optional flags (full reference in `python/README.md`):
+
+- `--no-think` — default `enable_thinking=false` for templates that support it
+  (e.g. Qwen3-style).
+- `--max-context N` — reject over-long prompts cleanly; use the export-time
+  context length.
+- `--allow-chatml-fallback` — approximate ChatML when the model has no HF
+  `chat_template`; experimentation only, not recommended for reliable tool use.
+
+Point pi at the server via `~/.pi/agent/models.json`:
+
+```json
+{ "providers": { "executorch": {
+    "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
+    "apiKey": "x", "models": [ { "id": "<model-id>" } ] } } }
+```
+
+Other OpenAI-compatible clients use their own schema — generically: base URL
+`http://127.0.0.1:8000/v1`, the model id you passed to `--model-id`, and a dummy
+API key if one is required.
+
+Supported contract for pi:
+
+- Endpoint `POST /v1/chat/completions`; streaming supported.
+- Tool calls: the model's Hermes-style `<tool_call>...</tool_call>` output is
+  parsed and returned as OpenAI `tool_calls`. This generic server uses Hermes by
+  default; a model-specific server may select the Qwen XML format.
+- `tool_choice`: only `"auto"`, `"none"`, or unset.
+- Rejected with a structured 400 (`unsupported_parameter`), not silently
+  ignored: `tool_choice="required"` or specific-function forcing,
+  `response_format` JSON/constrained output, `logprobs`, `top_p` other than
+  `1.0`, and `seed`.
+
+Reliability guidance:
+
+- Use the model's real HF `chat_template` (`--hf-tokenizer`) for tool use, kept
+  aligned with the exported tokenizer/model.
+- If tool calls come back as plain text, confirm the model is emitting the
+  configured tool-call format's markers (Hermes for the generic server) and that
+  `tools` were included in the request.
+- If a request fails with `unsupported_parameter`, remove or disable that
+  OpenAI knob in your pi/client config.