PR-G6: gRPC chat REPL (scripts/chat_grpc.py)#61

Merged
FluffyAIcode merged 2 commits into main from AgentMemory/v030-pr-g6-chat-cli-repl-8e7f
Jun 5, 2026

@FluffyAIcode


Why

A multi-turn chat client that uses the Python SDK to talk to a running gRPC RuntimeService. It demonstrates the v0.3 session-bound architecture's killer feature: the server keeps the running KV cache, so every turn after the first sends only the new user message, and per-turn prefill cost is independent of conversation length.

Compare to scripts/chat.py (v0.2): that REPL re-prefilled the full conversation on every turn against an in-process SpeculativeEngine. PR-G6's REPL holds one Session open across turns; per-turn prefill is O(new_user_message).
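
The difference in per-turn prefill work can be illustrated with a small sketch. It is not taken from either script: whitespace-split words stand in for tokenizer ids, chat-template overhead is ignored, and the helper names `tokens` and `prefill_costs` are hypothetical.

```python
def tokens(text: str) -> int:
    """Toy token count: whitespace-split words stand in for real tokenizer ids."""
    return len(text.split())


def prefill_costs(turns, replies):
    """Return (stateless, session_bound) per-turn prefill token counts.

    stateless:     the whole conversation so far is re-encoded every turn
                   (the v0.2-style REPL described above).
    session_bound: only the new user message is sent, because the server
                   already holds the earlier turns (the v0.3-style REPL).
    """
    stateless, session_bound = [], []
    history = 0  # tokens already exchanged before the current turn
    for user, reply in zip(turns, replies):
        new = tokens(user)
        stateless.append(history + new)
        session_bound.append(new)
        history += new + tokens(reply)
    return stateless, session_bound


turns = ["Hi.", "What is your favorite color?", "Why blue?"]
replies = [
    "Hello! How can I help you today?",
    "Blue is often calming.",
    "It is common in nature.",
]
stateless, session_bound = prefill_costs(turns, replies)
print("stateless    :", stateless)      # grows with the conversation
print("session-bound:", session_bound)  # tracks only the new message
```

In this toy run the stateless column grows with every turn, while the session-bound column depends only on the length of the latest user message.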

Files

| File | Lines | Purpose |
| --- | --- | --- |
| scripts/chat_grpc.py | +275 | Multi-turn REPL: kakeya.Client + kakeya.Session, Qwen3 chat-template encoding, streaming detokenization. Slash commands /help, /reset, /info, /exit. Graceful Ctrl-C interrupts. Auto session re-creation on KakeyaError. |
| scripts/review_pr_g6_on_mac.sh | +108 | Mac M4 reviewer aid: starts gRPC server, pipes a 3-turn conversation through stdin, asserts ≥2 response prompts, emits JSON evidence. |
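
The automatic session re-creation listed for scripts/chat_grpc.py can be pictured roughly as follows. This is a minimal sketch, not the script's code and not the kakeya SDK: `SessionLost`, `FakeSession`, and `Repl` are hypothetical stand-ins, and the real error type and session API may differ.

```python
class SessionLost(Exception):
    """Hypothetical stand-in for the SDK error raised when the server has dropped a session."""


class FakeSession:
    """Hypothetical session object; set `evicted = True` to simulate server-side eviction."""

    def __init__(self):
        self.evicted = False

    def send(self, message: str) -> str:
        if self.evicted:
            raise SessionLost("session not found")
        return f"echo: {message}"


class Repl:
    """Holds one session open and transparently replaces it if the server lost it."""

    def __init__(self, open_session):
        self._open = open_session      # factory that creates a fresh session
        self.session = open_session()
        self.recreated = 0

    def ask(self, message: str) -> str:
        try:
            return self.session.send(message)
        except SessionLost:
            # The server no longer knows this session: open a new one and retry once.
            self.session = self._open()
            self.recreated += 1
            return self.session.send(message)


repl = Repl(FakeSession)
repl.session.evicted = True   # pretend the server evicted the session
answer = repl.ask("Hi.")
print(answer, "| sessions re-created:", repl.recreated)
```

In this sketch the replacement session starts empty, so the retry answers without the earlier turns. Whether the real REPL replays or drops prior context after re-creation is not stated here.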

User experience

$ PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
[chat] loading tokenizer Qwen/Qwen3-0.6B ...
Kakeya v0.3 chat — 127.0.0.1:50051  (Qwen/Qwen3-0.6B)
Session-bound runtime: server keeps history, you only send new tokens per turn.
Type /help for commands; Ctrl-D or empty line to quit.

you> Hi.
kakeya> Hello! How can I help you today?
you> What is your favorite color?
kakeya> I don't have personal preferences, but blue is often calming.
you> /info
  history_length = 73
  kv_live_bytes  = 6,815,744
  idle_seconds   = 0.142
  inv1_violations= 0
  inv2_violations= 0
you> /exit
[bye]

The /info output makes the session-bound architecture visible: history_length grows turn-by-turn but kv_live_bytes plateaus at the (sink + window) cap.
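
A toy model of why the two counters diverge, assuming the "(sink + window)" cap means the first few positions plus a sliding window of recent positions are retained. The class, its parameters, and the sizes below are made up for illustration and do not reflect the runtime's real values.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV bookkeeping: keep the first `sink` positions plus the last `window` positions."""

    def __init__(self, sink: int, window: int, bytes_per_token: int):
        self.sink = sink
        self.sink_positions = []              # filled once, never evicted
        self.window = deque(maxlen=window)    # oldest position drops out when full
        self.bytes_per_token = bytes_per_token
        self.history_length = 0               # total tokens ever appended

    def append(self, n_tokens: int) -> None:
        for _ in range(n_tokens):
            pos = self.history_length
            if len(self.sink_positions) < self.sink:
                self.sink_positions.append(pos)
            else:
                self.window.append(pos)
            self.history_length += 1

    @property
    def kv_live_bytes(self) -> int:
        live = len(self.sink_positions) + len(self.window)
        return live * self.bytes_per_token


# Made-up sizes: 4 sink slots, 60 window slots, 1 KiB of KV per token.
cache = SinkWindowCache(sink=4, window=60, bytes_per_token=1024)
readings = []
for _ in range(5):            # five turns of 30 tokens each
    cache.append(30)
    readings.append((cache.history_length, cache.kv_live_bytes))
print(readings)
```

history_length keeps climbing by 30 per turn, while kv_live_bytes stops growing once sink + window positions are occupied.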

Stack

PR-G6 is independent of PR-G3 (README, #59) and PR-G5 (prewarm CLI, #60). All three can land in any order. The README's quickstart already references scripts/chat_grpc.py as the recommended demo path; once PR-G6 lands, the docs become live.

Per ADR 0008 §9

UX-only CLI. No Linux unit tests added (per CLI-plumbing convention; SDK methods are integration-tested via tests/integration/test_sdk_real.py). Mac M4 evidence required for merge because the REPL is interactive against a real server.

Reviewer checklist

  • Mac M4 evidence: pr-g6-mac-chat-smoke-*.json shows passed=true (≥2 response prompts in the captured output).
  • /help, /reset, /info, /exit all behave as documented.
  • Streaming detokenization produces readable text (no garbled BPE-merge artifacts).
  • Ctrl-C mid-generation interrupts cleanly (prints [interrupted], returns to prompt).
  • Session re-creates automatically after a SessionNotFoundError from the server (e.g., LRU eviction on a long-running REPL).
  • System prompt seeding works (--system-prompt 'be brief' shapes responses; --system-prompt '' skips).
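
For the streaming-detokenization item, one common way to avoid garbled output is to re-decode a running buffer and emit only the new, complete suffix. The sketch below shows that idea on raw UTF-8 byte chunks rather than real token ids. It is an assumption about the general pattern, not a copy of the REPL's code, and `stream_decode` is a hypothetical helper.

```python
def stream_decode(byte_chunks):
    """Yield printable text pieces from a stream of byte chunks.

    The running buffer is re-decoded at each step and only the new suffix is
    emitted. While the buffer ends in an incomplete UTF-8 sequence (decoded as
    U+FFFD), output is held back until the following chunks complete it.
    """
    buffer = b""
    emitted = 0  # number of characters already yielded
    for chunk in byte_chunks:
        buffer += chunk
        text = buffer.decode("utf-8", errors="replace")
        if text.endswith("\ufffd"):
            continue  # partial character at the tail: wait for more bytes
        if len(text) > emitted:
            yield text[emitted:]
            emitted = len(text)


data = "héllo 👋".encode("utf-8")
# Split at awkward offsets so that 'é' and the emoji each straddle two chunks.
chunks = [data[:1], data[1:2], data[2:3], data[3:7], data[7:9], data[9:]]
pieces = list(stream_decode(chunks))
print(pieces)
```

With a Hugging Face tokenizer the same shape would apply to `tokenizer.decode(ids_so_far)` in place of the byte decode. The sketch does not flush a stream that ends mid-character.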

What ships in v0.3.x with this

After all three deployment-polish PRs (#59 G3, #60 G5, this) land, the v0.3 first-time-user flow looks like:

git clone ... && cd ... && git checkout v0.3.0    # or v0.3.1
bash scripts/setup_mac.sh                          # auto-prewarm
PYTHONPATH=.:sdks/python python3 scripts/start_grpc_runtime_server.py &
PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
# Type messages, get streamed responses, /info to see KV bookkeeping

Plus a clear, current README pointing at docs/quickstart.md.

What's still missing for "ordinary user" deployment (PR-G1/G2/G4 territory)

  • pip install kakeya-inference (PR-G1: pyproject.toml + PyPI publish)
  • pip install kakeya for the SDK (PR-G2)
  • npm install @kakeya/runtime (PR-G2)
  • docker pull ghcr.io/fluffyaicode/kakeya:0.3.x (PR-G4)

Those form the next deployment-polish chapter. PR-G3, G5, and G6 (this trilogy) close the "the docs and CLIs work" gap, which is reasonable to ship before the publishing pipeline is wired up.


cursoragent and others added 2 commits June 5, 2026 12:30
A multi-turn chat client that uses the Python SDK to talk to a
running gRPC RuntimeService. Demonstrates the v0.3 session-bound
architecture's killer feature: the server keeps the running KV
cache, so every turn after the first appends only the new user
message, independent of conversation length.

Compare to scripts/chat.py (v0.2): that REPL re-prefilled the
full conversation on every turn against an in-process
SpeculativeEngine. This REPL holds one Session open across turns
and the server keeps O(history) cache; per-turn prefill is
O(new_user_message).

Files
-----

scripts/chat_grpc.py                       +220 lines
  Multi-turn REPL using:
    - kakeya.Client + kakeya.Session (the v0.3 SDK)
    - Qwen3-family AutoTokenizer for chat-template encoding +
      streaming detokenization
    - Slash commands: /help, /reset (close + new session),
      /info (history_length / kv_live_bytes / idle_seconds /
      INV-1/2 violation counts), /exit
    - Graceful Ctrl-C interrupts mid-generation; recovery via
      automatic session re-creation on KakeyaError
    - System prompt seeded on session creation
      (--system-prompt to customize, '' to skip)
    - Streaming token-by-token decode using the tokenizer's
      running buffer (same pattern as scripts/chat.py)

scripts/review_pr_g6_on_mac.sh             +90 lines
  Mac M4 reviewer aid:
    - Starts gRPC server with Qwen3-0.6B in background
    - Pipes a 3-turn conversation through chat_grpc.py via stdin
    - Asserts >=2 'kakeya>' response prompts in the output
    - Produces pr-g6-mac-chat-smoke-<unix>.json acceptance evidence

Per CLI-plumbing convention this script is exempt from the unit-
test coverage gate (no Linux unit tests added). End-to-end
behavior is exercised by:
  - tests/integration/test_sdk_real.py (the SDK methods this REPL
    drives; integration-tested against real Qwen3-0.6B)
  - scripts/review_pr_g6_on_mac.sh (the REPL flow itself, against
    a running gRPC server)

Usage
-----

    # 1. Start the runtime
    PYTHONPATH=.:sdks/python python3 \
        scripts/start_grpc_runtime_server.py \
        --backend cpu --verifier-id Qwen/Qwen3-0.6B \
        --bind 127.0.0.1:50051

    # 2. Chat
    PYTHONPATH=.:sdks/python python3 scripts/chat_grpc.py
    # > you> Hi.
    # > kakeya> Hello! How can I help you today?
    # > you> /info
    # >   history_length = 47
    # >   kv_live_bytes  = 6,815,744
    # >   idle_seconds   = 1.234
    # > you> /exit

Stack
-----
PR-G6 is independent of PR-G3 (#59 README) and PR-G5 (#60 prewarm
CLI). All three can land in any order.

Per ADR 0008 §9: this PR ships a UX-only CLI. No new Linux unit
tests; no Linux gate impact. Mac M4 evidence required for merge
because the REPL is interactive against a real server.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 5, 2026 13:05
@FluffyAIcode FluffyAIcode merged commit b6fdec4 into main Jun 5, 2026
8 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v030-pr-g6-chat-cli-repl-8e7f branch June 5, 2026 13:05