PR-G6: gRPC chat REPL (scripts/chat_grpc.py) by FluffyAIcode · Pull Request #61 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-05T12:31:31Z

Why

A multi-turn chat client that uses the Python SDK to talk to a running gRPC RuntimeService. It demonstrates the key benefit of the v0.3 session-bound architecture: the server keeps the running KV cache, so every turn after the first prefills only the new user message, at a cost independent of conversation length.

Compare to scripts/chat.py (v0.2): that REPL re-prefilled the full conversation on every turn against an in-process SpeculativeEngine. PR-G6's REPL holds one Session open across turns; per-turn prefill is O(new_user_message).
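The difference in per-turn prefill cost can be sketched with plain token counting. This is only an illustration under stated assumptions: the function names and the per-turn token counts are hypothetical, and each entry stands for all tokens a turn adds to the conversation.

```python
def prefill_tokens_stateless(turn_lengths):
    """Tokens prefilled per turn when the full conversation is re-sent every turn."""
    costs, history = [], 0
    for new_tokens in turn_lengths:
        history += new_tokens
        costs.append(history)  # the whole history is prefilled again
    return costs


def prefill_tokens_session_bound(turn_lengths):
    """Tokens prefilled per turn when the server already holds the earlier history."""
    return list(turn_lengths)  # only the newly added tokens are prefilled


# Hypothetical token counts added by four consecutive turns.
turns = [12, 9, 15, 7]
print(prefill_tokens_stateless(turns))      # [12, 21, 36, 43]
print(prefill_tokens_session_bound(turns))  # [12, 9, 15, 7]
```

In the stateless case the per-turn cost grows with the conversation, while in the session-bound case it tracks only the size of the new message.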

Files

| File | Lines | Purpose |
|---|---|---|
| `scripts/chat_grpc.py` | +275 | Multi-turn REPL: `kakeya.Client` + `kakeya.Session`, Qwen3 chat-template encoding, streaming detokenization. Slash commands `/help`, `/reset`, `/info`, `/exit`. Graceful Ctrl-C interrupts. Auto session re-creation on `KakeyaError`. |
| `scripts/review_pr_g6_on_mac.sh` | +108 | Mac M4 reviewer aid: starts gRPC server, pipes a 3-turn conversation through stdin, asserts ≥2 response prompts, emits JSON evidence. |
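The streaming detokenization mentioned above is commonly done with a running buffer: decode every token received so far, print only the newly appeared suffix, and hold back output while the text ends in an incomplete character. The sketch below is not the code in `scripts/chat_grpc.py`. `StreamingDecoder` and `toy_decode` are hypothetical names, and a byte-level toy decoder stands in for the real tokenizer so the example runs without model downloads. With a Hugging Face tokenizer the `decode` callable would presumably be something like `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`.

```python
class StreamingDecoder:
    """Running-buffer detokenizer: re-decode all ids and emit only the new suffix."""

    def __init__(self, decode):
        self._decode = decode  # callable: list[int] -> str
        self._ids = []
        self._emitted = 0      # number of characters already handed to the caller

    def push(self, token_id):
        self._ids.append(token_id)
        text = self._decode(self._ids)
        # A trailing U+FFFD means the last character is still incomplete,
        # so wait for more tokens instead of printing a replacement glyph.
        if text.endswith("\ufffd"):
            return ""
        delta = text[self._emitted:]
        self._emitted = len(text)
        return delta


def toy_decode(ids):
    """Toy stand-in for a tokenizer: every token id is one UTF-8 byte."""
    return bytes(ids).decode("utf-8", errors="replace")


decoder = StreamingDecoder(toy_decode)
pieces = [decoder.push(b) for b in "héllo".encode("utf-8")]
print(pieces)  # ['h', '', 'é', 'l', 'l', 'o']
```

The empty string in the output is the step where only the first byte of "é" had arrived. Re-decoding the full buffer each step is quadratic in reply length, which is usually acceptable for interactive chat.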

User experience

$ PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
[chat] loading tokenizer Qwen/Qwen3-0.6B ...
Kakeya v0.3 chat — 127.0.0.1:50051  (Qwen/Qwen3-0.6B)
Session-bound runtime: server keeps history, you only send new tokens per turn.
Type /help for commands; Ctrl-D or empty line to quit.

you> Hi.
kakeya> Hello! How can I help you today?
you> What is your favorite color?
kakeya> I don't have personal preferences, but blue is often calming.
you> /info
  history_length = 73
  kv_live_bytes  = 6,815,744
  idle_seconds   = 0.142
  inv1_violations= 0
  inv2_violations= 0
you> /exit
[bye]

The /info output makes the session-bound architecture visible: history_length grows turn-by-turn but kv_live_bytes plateaus at the (sink + window) cap.
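A minimal sketch of why `kv_live_bytes` plateaus while `history_length` keeps growing, assuming a simple sink-plus-sliding-window retention policy. The sink size, window size, and bytes-per-token figure below are hypothetical placeholders; the PR does not state the real values, and the sample `/info` output above is not derived from them.

```python
def kv_live_tokens(history_length, sink, window):
    """Tokens whose KV entries stay resident under a sink + sliding-window cap."""
    return min(history_length, sink + window)


# Hypothetical values chosen only for illustration.
SINK, WINDOW = 4, 64
BYTES_PER_TOKEN = 1024

for history_length in (16, 47, 73, 500):
    live = kv_live_tokens(history_length, SINK, WINDOW)
    print(f"history_length={history_length:4d}  kv_live_bytes={live * BYTES_PER_TOKEN}")
```

Once the history exceeds `sink + window` tokens, the resident byte count stops changing no matter how long the conversation runs.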

Stack

PR-G6 is independent of PR-G3 (README, #59) and PR-G5 (prewarm CLI, #60). All three can land in any order. The README's quickstart already references scripts/chat_grpc.py as the recommended demo path; once PR-G6 lands, the docs become live.

Per ADR 0008 §9

UX-only CLI. No Linux unit tests added (per CLI-plumbing convention; SDK methods are integration-tested via tests/integration/test_sdk_real.py). Mac M4 evidence required for merge because the REPL is interactive against a real server.
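The acceptance check that produces the Mac M4 evidence can be approximated in a few lines. The actual reviewer aid is a shell script; this Python sketch only mirrors the described logic (count `kakeya>` response prompts, require at least two, emit JSON). Apart from `passed` and the `pr-g6-mac-chat-smoke-<unix>.json` naming pattern, the field names and the sample output are assumptions.

```python
import json
import time


def build_evidence(captured_output, min_responses=2):
    """Count 'kakeya>' response prompts and decide pass/fail."""
    count = captured_output.count("kakeya>")
    return {
        "response_prompts": count,      # hypothetical field name
        "min_required": min_responses,  # hypothetical field name
        "passed": count >= min_responses,
    }


# Hypothetical captured REPL output; piped stdin is not echoed, so the
# 'you>' and 'kakeya>' prompts can end up on the same line.
sample = "you> kakeya> Hello!\nyou> kakeya> Blue is often calming.\nyou> [bye]\n"

evidence = build_evidence(sample)
filename = f"pr-g6-mac-chat-smoke-{int(time.time())}.json"
print(filename)
print(json.dumps(evidence, indent=2))
```

Counting substring occurrences rather than line prefixes keeps the check robust to how the prompts interleave in the captured stream.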

Reviewer checklist

- Mac M4 evidence: pr-g6-mac-chat-smoke-*.json shows passed=true (≥2 response prompts in the captured output).
- /help, /reset, /info, /exit all behave as documented.
- Streaming detokenization produces readable text (no garbled BPE-merge artifacts).
- Ctrl-C mid-generation interrupts cleanly (prints [interrupted], returns to prompt).
- Session re-creates automatically after a SessionNotFoundError from the server (e.g., LRU eviction on a long-running REPL).
- System prompt seeding works (--system-prompt 'be brief' shapes responses; --system-prompt '' skips).
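The automatic session re-creation in the checklist can be sketched as a retry-once wrapper. The classes below are stand-ins written for this example, not the `kakeya` SDK: `SessionLost`, `FakeSession`, `FakeClient`, and `send_with_recreate` are all hypothetical names, and the real SDK's method names may differ.

```python
class SessionLost(Exception):
    """Stand-in for the SDK error raised when the server no longer has the session."""


class FakeSession:
    """Stand-in session that fails once it has been evicted."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.evicted = False

    def send(self, text):
        if self.evicted:
            raise SessionLost(self.session_id)
        return f"echo: {text}"


class FakeClient:
    """Stand-in client that hands out sessions with increasing ids."""

    def __init__(self):
        self.created = 0

    def create_session(self):
        self.created += 1
        return FakeSession(self.created)


def send_with_recreate(client, session, text):
    """Send one turn; if the session is gone, open a fresh one and retry once."""
    try:
        return session, session.send(text)
    except SessionLost:
        fresh = client.create_session()
        return fresh, fresh.send(text)


client = FakeClient()
session = client.create_session()
session.evicted = True  # simulate LRU eviction on the server
session, reply = send_with_recreate(client, session, "hi")
print(session.session_id, reply)  # 2 echo: hi
```

The wrapper returns the session it ended up using so the REPL loop can keep it for later turns. A re-created session starts without the earlier history, so a real client would presumably re-seed the system prompt at that point.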

What ships in v0.3.x with this

After all three deployment-polish PRs (#59 G3, #60 G5, this) land, the v0.3 first-time-user flow looks like:

git clone ... && cd ... && git checkout v0.3.0    # or v0.3.1
bash scripts/setup_mac.sh                          # auto-prewarm
PYTHONPATH=.:sdks/python python3 scripts/start_grpc_runtime_server.py &
PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
# Type messages, get streamed responses, /info to see KV bookkeeping

Plus a clear, current README pointing at docs/quickstart.md.

What's still missing for "ordinary user" deployment (PR-G1/G2/G4 territory)

pip install kakeya-inference (PR-G1: pyproject.toml + PyPI publish)
pip install kakeya for the SDK (PR-G2)
npm install @kakeya/runtime (PR-G2)
docker pull ghcr.io/fluffyaicode/kakeya:0.3.x (PR-G4)

Those belong to the next deployment-polish chapter. PR-G3, PR-G5, and PR-G6 (this trilogy) close the "docs and CLIs work" gap, which is reasonable to ship before the publishing pipeline is wired.

A multi-turn chat client that uses the Python SDK to talk to a running gRPC RuntimeService. Demonstrates the v0.3 session-bound architecture's killer feature: the server keeps the running KV cache, so every turn after the first appends only the new user message, independent of conversation length.

Compare to scripts/chat.py (v0.2): that REPL re-prefilled the full conversation on every turn against an in-process SpeculativeEngine. This REPL holds one Session open across turns and the server keeps O(history) cache; per-turn prefill is O(new_user_message).

Files
-----

scripts/chat_grpc.py  +220 lines
  Multi-turn REPL using:
  - kakeya.Client + kakeya.Session (the v0.3 SDK)
  - Qwen3-family AutoTokenizer for chat-template encoding + streaming detokenization
  - Slash commands: /help, /reset (close + new session), /info (history_length / kv_live_bytes / idle_seconds / INV-1/2 violation counts), /exit
  - Graceful Ctrl-C interrupts mid-generation; recovery via automatic session re-creation on KakeyaError
  - System prompt seeded on session creation (--system-prompt to customize, '' to skip)
  - Streaming token-by-token decode using the tokenizer's running buffer (same pattern as scripts/chat.py)

scripts/review_pr_g6_on_mac.sh  +90 lines
  Mac M4 reviewer aid:
  - Starts gRPC server with Qwen3-0.6B in background
  - Pipes a 3-turn conversation through chat_grpc.py via stdin
  - Asserts >=2 'kakeya>' response prompts in the output
  - Produces pr-g6-mac-chat-smoke-<unix>.json acceptance evidence

Per CLI-plumbing convention this script is exempt from the unit-test coverage gate (no Linux unit tests added). End-to-end behavior is exercised by:
- tests/integration/test_sdk_real.py (the SDK methods this REPL drives; integration-tested against real Qwen3-0.6B)
- scripts/review_pr_g6_on_mac.sh (the REPL flow itself, against a running gRPC server)

Usage
-----

  # 1. Start the runtime
  PYTHONPATH=.:sdks/python python3 \
      scripts/start_grpc_runtime_server.py \
      --backend cpu --verifier-id Qwen/Qwen3-0.6B \
      --bind 127.0.0.1:50051

  # 2. Chat
  PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
  # > you> Hi.
  # > kakeya> Hello! How can I help you today?
  # > you> /info
  # >   history_length = 47
  # >   kv_live_bytes  = 6,815,744
  # >   idle_seconds   = 1.234
  # > you> /exit

Stack
-----

PR-G6 is independent of PR-G3 (#59 README) and PR-G5 (#60 prewarm CLI). All three can land in any order.

Per ADR 0008 §9: this PR ships a UX-only CLI. No new Linux unit tests; no Linux gate impact. Mac M4 evidence required for merge because the REPL is interactive against a real server.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

cursoragent and others added 2 commits June 5, 2026 12:30

Mac M4 review evidence for PR-G6

fc72bc3

Co-authored-by: Cursor <cursoragent@cursor.com>

FluffyAIcode marked this pull request as ready for review June 5, 2026 13:05

FluffyAIcode merged commit b6fdec4 into main Jun 5, 2026
8 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-g6-chat-cli-repl-8e7f branch June 5, 2026 13:05

FluffyAIcode mentioned this pull request Jun 6, 2026

PR-R1 (research): ADR 0011 + cross-attention toy prototype #63

Draft

7 tasks
