Commit 33f31ab

extension/llm/server: document pi integration
Add an operational recipe to the server README for pointing pi (or any OpenAI-compatible harness) at the ExecuTorch server for local tool-use: the launch command, useful flags (--no-think / --max-context / --allow-chatml-fallback), client base_url/model/api_key settings, the supported chat-completions + Hermes/Qwen tool-call contract (only tool_choice auto/none/unset; response_format/logprobs/top_p!=1/seed rejected), and reliability guidance.

Docs only; no runtime or dependency changes.

Part of #20001

ghstack-source-id: c141a21
ghstack-comment-id: 4617420672
Pull-Request: #19999
1 parent b0312cf commit 33f31ab

1 file changed

Lines changed: 58 additions & 0 deletions

extension/llm/server/README.md

@@ -32,3 +32,61 @@ lives inside the worker/session, not the control plane. Unsupported params (incl
`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
rather than silently ignored. See `python/README.md` to run it and
`spec/README.md` for the exact contract.

## Use from pi (or any OpenAI-compatible harness)

Point pi at the server to use ExecuTorch as a local backend for tool-use
workflows. Launch the server:

```bash
python -m executorch.extension.llm.server.python.server \
  --model-path <model.pte> \
  --tokenizer-path <tokenizer.model-or-json> \
  --hf-tokenizer <hf-model-or-local-dir> \
  --model-id <model-id> \
  --host 127.0.0.1 \
  --port 8000
```

Useful optional flags (full reference in `python/README.md`):

- `--no-think`: default `enable_thinking=false` for templates that support it
  (e.g. Qwen3-style).
- `--max-context N`: reject over-long prompts cleanly; use the export-time
  context length.
- `--allow-chatml-fallback`: approximate ChatML when the model has no HF
  `chat_template`; experimentation only, not recommended for reliable tool use.
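
When the server is started from a script or test harness, the launch command and the optional flags above can be assembled as an argument list. This is a minimal sketch, assuming the flags behave as this README describes; `build_launch_command` and its parameter names are hypothetical and not part of ExecuTorch, and the paths in the usage note are placeholders.

```python
import sys


def build_launch_command(model_path, tokenizer_path, hf_tokenizer, model_id,
                         host="127.0.0.1", port=8000, no_think=False,
                         max_context=None, allow_chatml_fallback=False):
    """Return an argv list for the server launch command.

    Hypothetical helper: it emits only flags named in this README.
    """
    cmd = [
        sys.executable, "-m", "executorch.extension.llm.server.python.server",
        "--model-path", str(model_path),
        "--tokenizer-path", str(tokenizer_path),
        "--hf-tokenizer", str(hf_tokenizer),
        "--model-id", str(model_id),
        "--host", str(host),
        "--port", str(port),
    ]
    if no_think:
        cmd.append("--no-think")
    if max_context is not None:
        # The README suggests using the export-time context length here.
        cmd += ["--max-context", str(max_context)]
    if allow_chatml_fallback:
        # Experimentation only, per the README.
        cmd.append("--allow-chatml-fallback")
    return cmd
```

The returned list can be handed to `subprocess.Popen`, for example `subprocess.Popen(build_launch_command("model.pte", "tokenizer.json", "hf-dir", "my-model", max_context=4096))`.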

Point pi at the server via `~/.pi/agent/models.json`:

```json
{ "providers": { "executorch": {
  "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
  "apiKey": "x", "models": [ { "id": "<model-id>" } ] } } }
```
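
If the host, port, or model id changes between runs, generating the provider entry keeps `baseUrl` consistent with the `--host`/`--port` values passed to the server. The sketch below only reproduces the JSON shape shown above; `pi_provider_config` is a hypothetical helper, and how pi merges this entry with other providers already in the file is not covered here.

```python
import json


def pi_provider_config(model_id, host="127.0.0.1", port=8000, api_key="x"):
    """Build the provider entry shown above for a given server address.

    Hypothetical helper: field names are copied from the README example.
    """
    return {
        "providers": {
            "executorch": {
                "baseUrl": f"http://{host}:{port}/v1",
                "api": "openai-completions",
                "apiKey": api_key,  # dummy value, as in the README example
                "models": [{"id": model_id}],
            }
        }
    }


def render_config(model_id, **kwargs):
    """Serialize the provider entry as indented JSON text."""
    return json.dumps(pi_provider_config(model_id, **kwargs), indent=2)
```

Calling `render_config("<model-id>")` yields text that can be reviewed before it is written to the pi models file.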

Other OpenAI-compatible clients use their own schema. Generically: base URL
`http://127.0.0.1:8000/v1`, the model id you passed to `--model-id`, and a dummy
API key if one is required.
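
For a client with no configuration schema at all, those three settings map directly onto a plain HTTP request to the chat-completions endpoint. This is a minimal sketch using only the Python standard library; `build_chat_request` is a hypothetical helper, and sending the dummy key as a `Bearer` token is an assumption based on common OpenAI-client behavior rather than something this README specifies.

```python
import json
import urllib.request


def build_chat_request(base_url, model_id, messages, tools=None, api_key="x"):
    """Build (but do not send) a POST to <base_url>/chat/completions."""
    payload = {"model": model_id, "messages": messages}
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: a dummy bearer token is accepted or ignored.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

With the server running, the request can be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.load`.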

Supported contract for pi:

- Endpoint `POST /v1/chat/completions`; streaming supported.
- Tool calls: the model's Hermes-style `<tool_call>...</tool_call>` output is
  parsed and returned as OpenAI `tool_calls`. This generic server uses Hermes by
  default; a model-specific server may select the Qwen XML format.
- `tool_choice`: only `"auto"`, `"none"`, or unset.
- Rejected with a structured 400 (`unsupported_parameter`), not silently
  ignored: `tool_choice="required"` or specific-function forcing,
  `response_format` JSON/constrained output, `logprobs`, `top_p` other than
  `1.0`, and `seed`.
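
A client can check a request against this list before sending it, which avoids a round trip that would end in a 400. The sketch below mirrors the bullets above and is not the server's validation code; `find_unsupported` is a hypothetical helper, and treating any non-null `response_format` and any truthy `logprobs` as rejected is a conservative assumption.

```python
def find_unsupported(params):
    """Return the names of request parameters this README lists as rejected."""
    problems = []
    # Only "auto", "none", or unset are supported; "required" and
    # specific-function objects are rejected.
    if params.get("tool_choice") not in (None, "auto", "none"):
        problems.append("tool_choice")
    # Assumption: any explicit response_format counts as constrained output.
    if params.get("response_format") is not None:
        problems.append("response_format")
    # Assumption: only a truthy logprobs value is treated as a request for it.
    if params.get("logprobs"):
        problems.append("logprobs")
    top_p = params.get("top_p")
    if top_p is not None and float(top_p) != 1.0:
        problems.append("top_p")
    if params.get("seed") is not None:
        problems.append("seed")
    return problems
```

An empty result means the request uses none of the listed parameters; the server remains the authority on what it accepts.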

Reliability guidance:

- Use the model's real HF `chat_template` (`--hf-tokenizer`) for tool use, kept
  aligned with the exported tokenizer/model.
- If tool calls come back as plain text, confirm the model is emitting the
  configured tool-call format's markers (Hermes for the generic server) and that
  `tools` were included in the request.
- If a request fails with `unsupported_parameter`, remove or disable that
  OpenAI knob in your pi/client config.
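
The plain-text tool-call check above can be partly automated on the client side. This is a rough sketch, assuming the common Hermes convention of a JSON object with `name` and `arguments` keys between the `<tool_call>` markers; it is not the server's parser, and `extract_hermes_tool_calls` and `diagnose` are hypothetical names.

```python
import json
import re

# Matches the text between Hermes-style markers, across line breaks.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_hermes_tool_calls(text):
    """Return (name, arguments) pairs for well-formed Hermes-style calls."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # markers present, but the payload is not valid JSON
        if isinstance(obj, dict) and "name" in obj:
            calls.append((obj["name"], obj.get("arguments", {})))
    return calls


def diagnose(message):
    """Classify an assistant message dict from a chat-completions response."""
    if message.get("tool_calls"):
        return "structured"  # the server returned OpenAI-style tool_calls
    content = message.get("content") or ""
    if "<tool_call>" not in content:
        return "no-markers"  # model did not emit the format, or no tools sent
    if extract_hermes_tool_calls(content):
        return "unparsed-markers"  # valid markers left in plain text
    return "malformed-markers"  # markers present but payload not parseable
```

A `no-markers` result points at the chat template or a missing `tools` field, while `malformed-markers` suggests the model is emitting a different or broken format.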
