Commit 18dbfdd

ADR 0006: Project positioning as local agent infrastructure (#20)
* Add ADR 0006: Project positioning as local agent infrastructure

Records the strategic positioning decision that Kakeya is 'local agent
infrastructure for Mac', not a generic chat-acceleration engine. Pure
documentation, no code change.

Context
-------
Through v0.1.0 and v0.2.0 the project's external framing has been
chat-acceleration via DLM speculative decoding. A series of comparison
analyses (vs llama.cpp, mlx_lm, NVIDIA Nemotron self-spec) revealed that
Kakeya's design choices map cleanly to the requirements of local agentic
applications, but were never explicitly justified in that frame.

This ADR makes the agentic positioning explicit so future release
framing, integration documentation, benchmarking, and prioritization
decisions all flow from a coherent product story.

Decisions (§2)
--------------
2.1 Reframe v0.3+ release notes from 'chat acceleration' to 'local agent
    infrastructure for Mac'. Technical detail (acceptance rate, alignment
    training, speculative speedup) becomes implementation evidence, not
    the headline.

2.2 Ship docs/integrations/ as v0.3.0 first-class deliverable:
    - langchain.md (ChatOpenAI base_url config)
    - crewai.md (multi-agent Crew)
    - autogen.md (AssistantAgent)
    - cursor-bridge.md (Cursor custom endpoint)
    - openwebui.md (drop-in URL config)
    Each ~50 lines, demonstrating multi-agent concurrent execution as the
    discriminator vs mlx_lm.server.

2.3 Add scripts/bench_agentic/ alongside existing chat benchmarks:
    - bench_long_session.py (>=4-hour growing context)
    - bench_multi_agent.py (3 concurrent agents)
    - bench_tool_call_reliability.py (1000 tool calls JSON parse rate)
    - bench_cancellation.py (mid-stream cancellation latency)
    - bench_persistent_memory.py (cross-session recall)
    Each with mlx_lm.server comparison numbers. Headline release claims
    must be backed by these benchmarks.

2.4 Establish 'what we are not' stance:
    - Not a llama.cpp replacement (model coverage)
    - Not a vLLM replacement (data-center serving)
    - Not a complete agent framework (substrate, not framework)
    - Not a chat product (single-prompt parity is fine)
    - Not a multi-model gateway (Qwen3-only is fine)

2.5 Re-prioritize v0.3 / v0.4 work under agentic lens. Technical
    decisions don't reverse; ordering and presentation change. Concrete
    reframing matrix in §2.5.

Alternatives considered (§3)
----------------------------
- Generic local LLM engine framing (rejected: llama.cpp dominates)
- Pure research engine framing (rejected: understates engineering)
- Chat speedup head-to-head with mlx_lm (rejected: no structural
  advantage on that axis)
- Combined agent framework + engine (rejected: overlap with
  LangChain/CrewAI/AutoGen/Cursor)
- Defer reframing to v0.4 (rejected: v0.3 framing decision is now)

Documentation updates
---------------------
- docs/adr/0006-local-agent-infrastructure-positioning.md (new ADR)
- docs/adr/README.md: index updated to include 0004 (in flight), 0005
  (planned), 0006 (Accepted) so readers see the planned shape of the
  decision tree.
- README.md: ADR badge updated to '0001 | 0002 | 0003 | 0006'; ADR list
  at the bottom gets a one-paragraph entry for 0006.

Validation criteria (§5)
------------------------
1. v0.3.0 README and release notes lead with agentic-infrastructure
   framing.
2. At least three of the five integration examples ship with v0.3.0.
3. Agentic benchmark suite has at least bench_multi_agent.py and
   bench_long_session.py shipping with v0.3.0 + mlx_lm comparison.
4. Any post-v0.3.0 framing that contradicts §2.4 'what we are not'
   stance requires a follow-up ADR.

No code change. Full test suite still passes (sanity check: 93 tests in
0.28s on the touched-area-adjacent surface).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* Fix sse-starlette event-loop binding in streaming tests

Root cause:
sse_starlette.sse.AppStatus.should_exit_event is a class-level
anyio.Event lazily created on first SSE response. It binds to the
asyncio event loop that is running at that moment. pytest-asyncio
(>=0.21, function-scoped loop default in 1.x) creates a fresh event loop
per async test, so the second SSE test inherits an Event bound to a
now-closed loop and the SSE response raises:

    RuntimeError: <asyncio.locks.Event ...> is bound to a different event loop

This made 5 streaming tests fail on main:
- test_stream_concatenated_content_matches_engine_decode
- test_stream_finish_reason_length_on_max_tokens
- test_stream_returns_done_sentinel_at_end
- test_stream_each_chunk_has_required_openai_fields
- test_stream_completion_id_consistent_across_chunks

Fix:
Add an autouse fixture in tests/inference_engine/server/conftest.py that
resets AppStatus.should_exit_event to None before and after every test
in this package. Production code is unaffected: uvicorn stays on a
single loop for the lifetime of the process, so the lazy init runs
exactly once there.

Result:
  tests/inference_engine/server/test_app_streaming.py: 11 passed
  tests/inference_engine/server/ (full suite): 215 passed

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
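
The loop-binding failure described in the streaming-test fix can be reproduced with the standard library alone. The following is a minimal sketch, not the fixture from this commit: it assumes Python 3.10+, uses `asyncio.Event` in place of `anyio.Event`, and `AppStatusStandIn`, `reset_exit_event`, and `run_two_loops` are hypothetical names invented for illustration. Each `asyncio.run` call plays the role of one async test with its own fresh event loop.

```python
import asyncio


class AppStatusStandIn:
    """Hypothetical stand-in for a class holding a lazily created, class-level event."""

    should_exit_event = None  # shared by every event loop in the process

    @classmethod
    async def wait_briefly(cls):
        # Lazy init: the event is created on first use and then reused.
        if cls.should_exit_event is None:
            cls.should_exit_event = asyncio.Event()
        try:
            # Waiting on an unset event binds it to the currently running loop.
            await asyncio.wait_for(cls.should_exit_event.wait(), timeout=0.01)
        except asyncio.TimeoutError:
            pass  # nobody sets the event in this sketch; the timeout is expected


def reset_exit_event():
    """The reset an autouse fixture would perform before and after each test."""
    AppStatusStandIn.should_exit_event = None


def run_two_loops(reset_between):
    """Run the waiter on two fresh loops; return the error message, if any."""
    reset_exit_event()
    try:
        asyncio.run(AppStatusStandIn.wait_briefly())  # first "test"
        if reset_between:
            reset_exit_event()
        asyncio.run(AppStatusStandIn.wait_briefly())  # second "test", new loop
    except RuntimeError as exc:
        return str(exc)
    finally:
        reset_exit_event()
    return None


without_reset = run_two_loops(reset_between=False)
with_reset = run_two_loops(reset_between=True)
print("without reset:", without_reset)
print("with reset:", with_reset)
```

In a pytest suite the same reset would typically live in a `conftest.py` fixture declared with `autouse=True`, running the reset on both sides of a `yield`; that shape is inferred from the commit message, since the fixture's source is not shown here.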
1 parent 0305b1a commit 18dbfdd

3 files changed

Lines changed: 405 additions & 1 deletion

README.md

Lines changed: 11 additions & 1 deletion
@@ -3,7 +3,7 @@
 [![CI](https://github.com/FluffyAIcode/Kakeya-LLM-Inference-engine/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/FluffyAIcode/Kakeya-LLM-Inference-engine/actions/workflows/ci.yaml)
 [![Release](https://img.shields.io/badge/release-v0.1.0-blue)](https://github.com/FluffyAIcode/Kakeya-LLM-Inference-engine/releases/tag/v0.1.0)
 [![Platform](https://img.shields.io/badge/platform-Apple%20Silicon-lightgrey)](docs/local-inference-engine.md)
-[![ADRs](https://img.shields.io/badge/ADRs-0001%20%7C%200002-green)](docs/adr/)
+[![ADRs](https://img.shields.io/badge/ADRs-0001%20%7C%200002%20%7C%200003%20%7C%200006-green)](docs/adr/)

 Runs the speculative-decoding architecture designed in the prior product
 discussion using **real, public** weights:
@@ -468,3 +468,13 @@ explicitly rejected.
 and what intermediate step ships in v0.2 — `PooledVerifier`
 wrapper that makes pool memory accounting accurate without
 touching the model forward.
+- [ADR 0006 — Project positioning as local agent
+infrastructure](docs/adr/0006-local-agent-infrastructure-positioning.md):
+the strategic positioning decision that Kakeya is **local agent
+infrastructure for Mac**, not a generic chat-acceleration engine.
+Reframes v0.3+ release notes around multi-agent / long-session /
+personalized usage, commits to shipping `docs/integrations/`
+examples (LangChain / CrewAI / AutoGen / Cursor) and an agentic
+benchmark suite (`scripts/bench_agentic/`), and explicitly declines
+to compete with llama.cpp on chat speedup or with vLLM on
+data-center serving.
