PR-G6: gRPC chat REPL (scripts/chat_grpc.py) #61
Merged
A multi-turn chat client that uses the Python SDK to talk to a
running gRPC RuntimeService. Demonstrates the v0.3 session-bound
architecture's killer feature: the server keeps the running KV
cache, so every turn after the first prefills only the new user
message; per-turn prefill cost is independent of conversation length.
Compare to scripts/chat.py (v0.2): that REPL re-prefilled the
full conversation on every turn against an in-process
SpeculativeEngine. This REPL holds one Session open across turns
and the server keeps O(history) cache; per-turn prefill is
O(new_user_message).
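To make the difference concrete, here is a rough back-of-the-envelope sketch, not code from either REPL. It counts prefilled tokens under both schemes. The function names and the per-turn token counts are made up for illustration, and system-prompt and chat-template tokens are ignored.

```python
def prefill_tokens_full_replay(turns):
    """v0.2-style cost: every turn re-prefills the whole conversation so far.

    `turns` is a list of (user_tokens, reply_tokens) pairs (illustrative numbers).
    """
    total = 0
    history = 0  # tokens accumulated from earlier turns
    for user_len, reply_len in turns:
        total += history + user_len      # old history plus the new message
        history += user_len + reply_len  # the reply joins the history too
    return total


def prefill_tokens_session_bound(turns):
    """v0.3-style cost: the server keeps the cache, so only new user tokens are prefilled."""
    return sum(user_len for user_len, _ in turns)


if __name__ == "__main__":
    turns = [(10, 40)] * 5  # five turns: 10 user tokens, 40 reply tokens each
    print(prefill_tokens_full_replay(turns))    # 550
    print(prefill_tokens_session_bound(turns))  # 50
```

Under these assumed numbers the replay scheme grows quadratically with the number of turns, while the session-bound scheme grows only with the total size of the new messages.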
Files
-----
scripts/chat_grpc.py +220 lines
Multi-turn REPL using:
- kakeya.Client + kakeya.Session (the v0.3 SDK)
- Qwen3-family AutoTokenizer for chat-template encoding +
streaming detokenization
- Slash commands: /help, /reset (close + new session),
/info (history_length / kv_live_bytes / idle_seconds /
INV-1/2 violation counts), /exit
- Graceful Ctrl-C interrupts mid-generation; recovery via
automatic session re-creation on KakeyaError
- System prompt seeded on session creation
(--system-prompt to customize, '' to skip)
- Streaming token-by-token decode using the tokenizer's
running buffer (same pattern as scripts/chat.py)
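The running-buffer decode mentioned in the last bullet can be sketched as follows. This is an illustration of the general pattern, not the code in scripts/chat_grpc.py: `stream_decode` is a hypothetical helper, a toy byte-level `toy_decode` stands in for the Qwen3 tokenizer, and holding back text that ends in U+FFFD is an assumption about how partial characters are handled.

```python
def toy_decode(token_ids):
    """Toy stand-in for a tokenizer's decode(): each token id is one UTF-8 byte."""
    return bytes(token_ids).decode("utf-8", errors="replace")


def stream_decode(token_ids, decode):
    """Yield newly printable text as tokens arrive.

    Keeps every token seen so far in a running buffer, re-decodes the whole
    buffer, and emits only the suffix that has not been printed yet. Output is
    held back while the decoded text ends in U+FFFD, which signals a character
    whose remaining bytes have not arrived.
    """
    buffer = []
    emitted = 0  # number of characters already yielded
    for token in token_ids:
        buffer.append(token)
        text = decode(buffer)
        if text.endswith("\ufffd"):
            continue  # incomplete multi-byte character: wait for more tokens
        if len(text) > emitted:
            yield text[emitted:]
            emitted = len(text)


if __name__ == "__main__":
    ids = list("héllo".encode("utf-8"))  # "é" is split across two byte tokens
    for piece in stream_decode(ids, toy_decode):
        print(piece, end="", flush=True)
    print()
```

Re-decoding the whole buffer on each token costs quadratic time in the reply length. That is usually acceptable for interactive chat, but a windowed variant would be preferable for very long generations.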
scripts/review_pr_g6_on_mac.sh +90 lines
Mac M4 reviewer aid:
- Starts gRPC server with Qwen3-0.6B in background
- Pipes a 3-turn conversation through chat_grpc.py via stdin
- Asserts >=2 'kakeya>' response prompts in the output
- Produces pr-g6-mac-chat-smoke-<unix>.json acceptance evidence
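The acceptance check in the last two bullets amounts to counting response prompts and recording the verdict. The reviewer aid itself is a shell script; the Python sketch below only illustrates the logic. `build_smoke_evidence` is a hypothetical helper, and every JSON field other than `passed` is an assumption rather than the script's actual schema.

```python
import json
import time


def build_smoke_evidence(captured_output, min_prompts=2, now=None):
    """Return (filename, evidence dict) for a captured REPL transcript.

    Passes when the transcript contains at least `min_prompts` occurrences of
    the 'kakeya>' response prompt. Field names besides 'passed' are illustrative.
    """
    count = captured_output.count("kakeya>")
    timestamp = int(time.time() if now is None else now)
    evidence = {
        "passed": count >= min_prompts,
        "response_prompts": count,    # hypothetical field
        "min_required": min_prompts,  # hypothetical field
    }
    return f"pr-g6-mac-chat-smoke-{timestamp}.json", evidence


if __name__ == "__main__":
    transcript = "you> Hi.\nkakeya> Hello!\nyou> And?\nkakeya> Sure.\n"
    name, evidence = build_smoke_evidence(transcript)
    print(name)
    print(json.dumps(evidence, indent=2))
```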
Per CLI-plumbing convention this script is exempt from the unit-
test coverage gate (no Linux unit tests added). End-to-end
behavior is exercised by:
- tests/integration/test_sdk_real.py (the SDK methods this REPL
drives; integration-tested against real Qwen3-0.6B)
- scripts/review_pr_g6_on_mac.sh (the REPL flow itself, against
a running gRPC server)
Usage
-----
# 1. Start the runtime
PYTHONPATH=.:sdks/python python3 \
scripts/start_grpc_runtime_server.py \
--backend cpu --verifier-id Qwen/Qwen3-0.6B \
--bind 127.0.0.1:50051
# 2. Chat
PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
# > you> Hi.
# > kakeya> Hello! How can I help you today?
# > you> /info
# > history_length = 47
# > kv_live_bytes = 6,815,744
# > idle_seconds = 1.234
# > you> /exit
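One way to route the slash commands in the transcript above is a small dispatcher that separates commands from chat messages. This is a minimal sketch under stated assumptions: `ReplState` and `handle_line` are hypothetical stand-ins, not the real kakeya.Session or the logic in scripts/chat_grpc.py, and /info prints only one of the fields shown above.

```python
from dataclasses import dataclass, field

COMMANDS = ("/help", "/reset", "/info", "/exit")


@dataclass
class ReplState:
    """Hypothetical REPL state; the real script holds a kakeya.Session instead."""
    history_length: int = 0
    resets: int = 0
    running: bool = True
    log: list = field(default_factory=list)  # lines the REPL would print


def handle_line(line, state):
    """Classify one input line as 'empty', 'message', 'command', or 'unknown'."""
    text = line.strip()
    if not text:
        return "empty"
    if not text.startswith("/"):
        return "message"  # the caller would send this to the session
    command = text.split()[0].lower()
    if command == "/help":
        state.log.append("commands: " + ", ".join(COMMANDS))
    elif command == "/reset":
        state.history_length = 0  # stands in for closing and re-creating the session
        state.resets += 1
    elif command == "/info":
        state.log.append(f"history_length = {state.history_length}")
    elif command == "/exit":
        state.running = False
    else:
        state.log.append(f"unknown command: {command}")
        return "unknown"
    return "command"


if __name__ == "__main__":
    state = ReplState(history_length=47)
    for line in ["Hi.", "/info", "/exit"]:
        print(line, "->", handle_line(line, state))
    print(state.log)
```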
Stack
-----
PR-G6 is independent of PR-G3 (#59 README) and PR-G5 (#60 prewarm
CLI). All three can land in any order.
Per ADR 0008 §9: this PR ships a UX-only CLI. No new Linux unit
tests; no Linux gate impact. Mac M4 evidence required for merge
because the REPL is interactive against a real server.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Why
A multi-turn chat client that uses the Python SDK to talk to a running gRPC RuntimeService. Demonstrates the v0.3 session-bound architecture's killer feature: the server keeps the running KV cache, so every turn after the first prefills only the new user message; per-turn prefill cost is independent of conversation length.
Compare to scripts/chat.py (v0.2): that REPL re-prefilled the full conversation on every turn against an in-process SpeculativeEngine. PR-G6's REPL holds one Session open across turns; per-turn prefill is O(new_user_message).
Files
- scripts/chat_grpc.py: kakeya.Client + kakeya.Session, Qwen3 chat-template encoding, streaming detokenization. Slash commands /help, /reset, /info, /exit. Graceful Ctrl-C interrupts. Auto session re-creation on KakeyaError.
- scripts/review_pr_g6_on_mac.sh
User experience
The /info output makes the session-bound architecture visible: history_length grows turn-by-turn but kv_live_bytes plateaus at the (sink + window) cap.
Stack
PR-G6 is independent of PR-G3 (README, #59) and PR-G5 (prewarm CLI, #60). All three can land in any order. The README's quickstart already references scripts/chat_grpc.py as the recommended demo path; once PR-G6 lands, the docs become live.
Per ADR 0008 §9
UX-only CLI. No Linux unit tests added (per CLI-plumbing convention; SDK methods are integration-tested via tests/integration/test_sdk_real.py). Mac M4 evidence required for merge because the REPL is interactive against a real server.
Reviewer checklist
- pr-g6-mac-chat-smoke-*.json shows passed=true (≥2 response prompts in the captured output).
- /help, /reset, /info, /exit all behave as documented.
- ... [interrupted], returns to prompt).
- ... SessionNotFoundError from the server (e.g., LRU eviction on a long-running REPL).
- ... --system-prompt 'be brief' shapes responses; --system-prompt '' skips).
What ships in v0.3.x with this
After all three deployment-polish PRs (#59 G3, #60 G5, this) land, the v0.3 first-time-user flow looks like:
Plus a clear, current README pointing at docs/quickstart.md.
What's still missing for "ordinary user" deployment (PR-G1/G2/G4 territory)
- pip install kakeya-inference (PR-G1: pyproject.toml + PyPI publish)
- pip install kakeya for the SDK (PR-G2)
- npm install @kakeya/runtime (PR-G2)
- docker pull ghcr.io/fluffyaicode/kakeya:0.3.x (PR-G4)
Those are the next deployment-polish chapter; PR-G3+G5+G6 (this trilogy) close the "the docs and CLIs work" gap that's reasonable to ship before the publishing pipeline is wired.