Commit c4231cf

extension/llm/server: document pi integration
Add an operational recipe to the server README for pointing pi (or any OpenAI-compatible harness) at the ExecuTorch server for local tool-use: the launch command, useful flags (--no-think / --enable-prefix-cache / --max-context / --allow-chatml-fallback), client base_url/model/api_key settings, the supported chat-completions + Hermes/Qwen tool-call contract (only tool_choice auto/none/unset; response_format/logprobs/top_p!=1/seed rejected), and reliability guidance.

Docs only; no runtime or dependency changes.

ghstack-source-id: 672db61
ghstack-comment-id: 4617420672
Pull-Request: #19999
1 parent 3238080 commit c4231cf

1 file changed

Lines changed: 53 additions & 0 deletions

extension/llm/server/README.md

@@ -29,3 +29,56 @@ prefix cache (`--enable-prefix-cache`). Unsupported params (including `top_p`,
`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
rather than silently ignored. See `python/README.md` to run it and
`spec/README.md` for the exact contract.

## Use from pi (or any OpenAI-compatible harness)

Point pi at the server to use ExecuTorch as a local backend for tool-use
workflows. Launch the server:

```bash
python -m executorch.extension.llm.server.python.server \
  --model-path <model.pte> \
  --tokenizer-path <tokenizer.model-or-json> \
  --hf-tokenizer <hf-model-or-local-dir> \
  --model-id <model-id> \
  --host 127.0.0.1 \
  --port 8000
```

Useful optional flags (full reference in `python/README.md`):

- `--no-think`: defaults `enable_thinking` to `false` for templates that
  support it (e.g. Qwen3-style).
- `--enable-prefix-cache`: reuses the KV cache from turn to turn; enable it
  only when `--hf-tokenizer` matches the exported model/tokenizer.
- `--max-context N`: rejects over-long prompts cleanly; set `N` to the
  export-time context length.
- `--allow-chatml-fallback`: approximates ChatML when the model has no HF
  `chat_template`; for experimentation only, not recommended for reliable
  tool use.
59+
Client settings (pi, or any OpenAI-compatible provider — keep generic if your
60+
client's schema differs):
61+
62+
- `base_url: http://127.0.0.1:8000/v1`
63+
- `model: <model-id>`
64+
- `api_key:` any dummy value if the client requires one
65+

Supported contract for pi:

- Endpoint: `POST /v1/chat/completions`; streaming is supported.
- Tool calls: the model's Hermes/Qwen `<tool_call>...</tool_call>` output is
  parsed and returned as OpenAI `tool_calls`.
- `tool_choice`: only `"auto"`, `"none"`, or unset.
- Rejected with a structured 400 (`unsupported_parameter`), not silently
  ignored: `tool_choice="required"` or specific-function forcing,
  `response_format` JSON/constrained output, `logprobs`, `top_p` other than
  `1.0`, and `seed`.
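
To make the tool-call mapping concrete, here is a minimal sketch of how
`<tool_call>` blocks could be turned into OpenAI-style `tool_calls` entries.
It is not the server's actual parser: the function name `extract_tool_calls`,
the `call_N` id scheme, and the assumption that each block holds a JSON object
with `name` and `arguments` keys (the common Hermes convention) are all
illustrative.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block, including multi-line bodies.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return (remaining_text, tool_calls) for a model completion."""
    tool_calls = []

    def replace(match):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            return match.group(0)  # leave malformed blocks in the text
        if not isinstance(call, dict) or "name" not in call:
            return match.group(0)
        tool_calls.append({
            "id": f"call_{len(tool_calls)}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI clients expect arguments as a JSON-encoded string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
        return ""  # drop the parsed block from the visible content

    content = TOOL_CALL_RE.sub(replace, text).strip()
    return content, tool_calls
```

A completion with no markers passes through unchanged with an empty
`tool_calls` list, which is the "plain text" case a client sees when the model
never emits the markers.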
76+
77+
Reliability guidance:
78+
79+
- Use the model's real HF `chat_template` (`--hf-tokenizer`) for tool use, kept
80+
aligned with the exported tokenizer/model.
81+
- If tool calls come back as plain text, confirm the model is emitting the
82+
Hermes/Qwen tool-call markers and that `tools` were included in the request.
83+
- If a request fails with `unsupported_parameter`, remove or disable that
84+
OpenAI knob in your pi/client config.
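
One way to act on the last point is a client-side pre-flight check that flags
or strips the rejected parameters before a request is sent. The sketch below
is an assumption-laden illustration, not part of the server: the function
names are hypothetical, and it is deliberately conservative in treating any
explicit `response_format`, `logprobs`, or `seed` value as unsupported.

```python
ALLOWED_TOOL_CHOICES = {"auto", "none"}


def find_unsupported(payload):
    """List request keys the documented contract says will be rejected."""
    problems = []
    tool_choice = payload.get("tool_choice")
    # Dict values (specific-function forcing) and "required" are both flagged.
    if tool_choice is not None and not (
        isinstance(tool_choice, str) and tool_choice in ALLOWED_TOOL_CHOICES
    ):
        problems.append("tool_choice")
    # Conservative assumption: any explicit value for these keys is flagged.
    for name in ("response_format", "logprobs"):
        if payload.get(name) is not None:
            problems.append(name)
    top_p = payload.get("top_p")
    if top_p is not None and top_p != 1.0:
        problems.append("top_p")
    if payload.get("seed") is not None:
        problems.append("seed")
    return problems


def strip_unsupported(payload):
    """Return a copy of the payload without the flagged keys."""
    cleaned = dict(payload)
    for name in find_unsupported(payload):
        cleaned.pop(name)
    return cleaned
```

Stripping silently changes sampling behavior, so logging the output of
`find_unsupported` and fixing the client config is usually the safer choice.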
