Commit aa4e094
PR-G3: rewrite README + add docs/quickstart.md for v0.3 GA
The pre-PR-G3 README was severely out of date: the Release badge
pointed at v0.2.0, the ADR list stopped at 0006, and there was no
mention of gRPC / SessionStore / SDKs / the session-bound
architecture. Anyone landing on the GitHub repo today saw a
v0.2-era pitch even though v0.3.0 had already shipped.
Two artifacts:
README.md (rewrite, ~490 lines new vs ~490 lines old)
- Top: badges refreshed (v0.3.0 release, ADR 0008, MIT
license). Architecture diagram showing the SDK -> gRPC ->
coordinators -> verifier flow.
- Quickstart section near the top: 5-minute clone-to-first-
generated-token. Calls out the v0.3.0 GA tag explicitly,
uses source-layout PYTHONPATH (works today; PyPI/npm
publishing flagged as 'queued for v0.3.1' in the Roadmap
table at the bottom).
- 'What's in the v0.3 architecture' table mapping each
component to its module path.
- 'v0.3 GA evidence' table summarizing the binding Mac M4
results: 9 ms latency drift over 14400 s (4400x improvement
vs v0.2.0), 480 turns / 4 h, 1.829 s p50.
- SDK examples for both Python (kakeya) and TypeScript
(@kakeya/runtime).
- Deprecated-HTTP-shim section explaining the Deprecation /
Sunset headers and the curl-friendly migration path.
- 'Architecture & Background' as a SECONDARY section: keeps
the v0.2 Net-Bytes-per-Token analysis + compression-regime
measurements + 4-bit MLX section, but no longer fronting
them as the project's primary value prop.
- Project layout diagram updated for v0.3 directory shape.
- Roadmap table: v0.3.0 ✅ shipped, v0.3.1 deployment polish
queued, v0.4 (proposer-back-in / alignment / cross-request
KV reuse) in design.
- CI section split into Linux gate (verifier-independent)
vs Mac M4 gate (verifier-dependent, label-gated).
- ADR list now includes ADR 0008 with a 'load-bearing for
v0.3' marker.
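The figures quoted in the 'v0.3 GA evidence' bullet can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses just the numbers stated above, and the implied v0.2.0 drift rests on the assumption that the 4400x factor is a ratio of latency drift over a comparable run, which this message does not spell out.

```python
# Figures quoted in the 'v0.3 GA evidence' bullet above.
run_seconds = 14400   # total run length
turns = 480           # conversation turns in the run
drift_ms = 9          # latency drift over the whole run
improvement = 4400    # stated improvement factor vs v0.2.0

# 14400 s is the same 4 h window quoted next to the turn count.
run_hours = run_seconds / 3600

# Average wall-clock budget per turn over the run.
seconds_per_turn = run_seconds / turns

# ASSUMPTION: the 4400x factor compares drift over a comparable run,
# so the implied v0.2.0 drift is simply drift_ms * improvement.
implied_v02_drift_s = drift_ms * improvement / 1000

print(run_hours, seconds_per_turn, implied_v02_drift_s)
```

Under that assumption the v0.2.0 baseline would have drifted by tens of seconds, more than one average turn budget, which is consistent with the drift fix being treated as GA-blocking.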
docs/quickstart.md (new, ~390 lines)
Detailed 10-minute walkthrough:
- Time-budget table (Mac M4 warm/cold vs Linux x86 CPU)
- Step 1: clone + checkout v0.3.0
- Step 2: install (Mac via setup_mac.sh, Linux x86, Linux
CUDA, mainland-China mirror via HF_ENDPOINT, dllm import
workaround documented)
- Step 3: start gRPC server (full flag table)
- Step 4: SDK call examples (synthetic ids; real tokenizer
flow)
- Step 5: multi-turn conversation pattern that demonstrates
the session-bound architecture's killer feature
- Step 6: graceful stop
- Troubleshooting: kv_live_bytes=0 (pre-PR-E1c), capacity
exhaustion, mid-conversation NotFound, alternative
models (bf16 1.7B, 4-bit MLX), authentication / TLS
deferral notes, mainland-China networking
- Common patterns: session reset, single-session-per-
process, deprecated HTTP shim usage
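For callers still on the deprecated HTTP shim, the Deprecation / Sunset headers can be checked programmatically. The following is a minimal sketch, not code from the repository: the function name shim_sunset and the header values are hypothetical, and it assumes only that the shim emits a Deprecation header plus a Sunset header carrying a standard HTTP-date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def shim_sunset(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Sunset date if the response is marked deprecated, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "deprecation" not in lowered:
        return None
    sunset = lowered.get("sunset")
    if sunset is None:
        return None
    # Sunset carries an HTTP-date, which parsedate_to_datetime can parse.
    return parsedate_to_datetime(sunset)


# Hypothetical header values, for illustration only.
example_headers = {
    "Deprecation": "true",
    "Sunset": "Wed, 31 Dec 2025 23:59:59 GMT",
}
sunset_at = shim_sunset(example_headers)
print(sunset_at)
```

A client could log a warning whenever this returns a date, which gives shim users a concrete deadline for moving to the gRPC SDKs.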
Per ADR 0008 §9: docs-only PR. No Python source changes; no
unit-test impact; no Mac M4 evidence required for merge.
Stack
-----
PR-G3 is independent of PR-G5 (model prewarm CLI) and PR-G6
(chat REPL). The README references both as 'queued for v0.3.1'
in the Roadmap table; once they ship, a small follow-up commit
will insert the actual command lines into the quickstart.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent 6399546
2 files changed
Lines changed: 640 additions & 412 deletions