Commit 18dbfdd
ADR 0006: Project positioning as local agent infrastructure (#20)
* Add ADR 0006: Project positioning as local agent infrastructure
Records the strategic positioning decision that Kakeya is
'local agent infrastructure for Mac', not a generic chat-acceleration
engine. Pure documentation, no code change.
Context
-------
Through v0.1.0 and v0.2.0 the project's external framing has been
chat-acceleration via DLM speculative decoding. A series of comparison
analyses (vs llama.cpp, mlx_lm, NVIDIA Nemotron self-spec) revealed
that Kakeya's design choices map cleanly to the requirements of local
agentic applications, but were never explicitly justified in that
frame. This ADR makes the agentic positioning explicit so future
release framing, integration documentation, benchmarking, and
prioritization decisions all flow from a coherent product story.
Decisions (§2)
--------------
2.1 Reframe v0.3+ release notes from 'chat acceleration' to
'local agent infrastructure for Mac'. Technical detail
(acceptance rate, alignment training, speculative speedup)
becomes implementation evidence, not the headline.
2.2 Ship docs/integrations/ as v0.3.0 first-class deliverable:
- langchain.md (ChatOpenAI base_url config)
- crewai.md (multi-agent Crew)
- autogen.md (AssistantAgent)
- cursor-bridge.md (Cursor custom endpoint)
- openwebui.md (drop-in URL config)
Each ~50 lines, demonstrating multi-agent concurrent execution
as the discriminator vs mlx_lm.server.
2.3 Add scripts/bench_agentic/ alongside existing chat benchmarks:
- bench_long_session.py (>=4-hour growing context)
- bench_multi_agent.py (3 concurrent agents)
- bench_tool_call_reliability.py (1000 tool calls JSON parse rate)
- bench_cancellation.py (mid-stream cancellation latency)
- bench_persistent_memory.py (cross-session recall)
Each with mlx_lm.server comparison numbers. Headline release
claims must be backed by these benchmarks.
2.4 Establish 'what we are not' stance:
- Not a llama.cpp replacement (model coverage)
- Not a vLLM replacement (data-center serving)
- Not a complete agent framework (substrate, not framework)
- Not a chat product (single-prompt parity is fine)
- Not a multi-model gateway (Qwen3-only is fine)
2.5 Re-prioritize v0.3 / v0.4 work under agentic lens. Technical
decisions don't reverse; ordering and presentation change.
The concrete reframing matrix is in §2.5 of the ADR.
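
Illustration for 2.3: bench_tool_call_reliability.py is described only as
measuring a JSON parse rate over 1000 tool calls. A minimal sketch of such a
metric, assuming each tool call arrives as a raw string and that a call counts
as reliable when it parses to a JSON object. The function name and the sample
payloads are hypothetical and not taken from the repository.

```python
import json


def json_parse_rate(raw_tool_calls):
    """Return the fraction of raw strings that parse as a JSON object."""
    if not raw_tool_calls:
        return 0.0
    ok = 0
    for raw in raw_tool_calls:
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # malformed or non-string output counts as a failure
        if isinstance(parsed, dict):
            ok += 1  # only JSON objects are treated as valid tool calls here
    return ok / len(raw_tool_calls)


# Hypothetical model outputs: two well-formed calls, one truncated, one wrong type.
samples = [
    '{"name": "search", "arguments": {"q": "mlx"}}',
    '{"name": "search", "arguments": {"q": ',
    '["not", "an", "object"]',
    '{"name": "read_file", "arguments": {"path": "README.md"}}',
]
rate = json_parse_rate(samples)
```

A real benchmark would likely also validate the parsed object against the
tool's schema; that step is left out of this sketch.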
Alternatives considered (§3)
----------------------------
- Generic local LLM engine framing (rejected: llama.cpp dominates)
- Pure research engine framing (rejected: understates engineering)
- Chat speedup head-to-head with mlx_lm (rejected: no structural
advantage on that axis)
- Combined agent framework + engine (rejected: overlap with
LangChain/CrewAI/AutoGen/Cursor)
- Defer reframing to v0.4 (rejected: the v0.3 framing decision has to
be made now)
Documentation updates
---------------------
- docs/adr/0006-local-agent-infrastructure-positioning.md (new ADR)
- docs/adr/README.md: index updated to include 0004 (in flight),
0005 (planned), 0006 (Accepted) so readers see the planned shape
of the decision tree.
- README.md: ADR badge updated to '0001 | 0002 | 0003 | 0006';
ADR list at the bottom gets a one-paragraph entry for 0006.
Validation criteria (§5)
------------------------
1. v0.3.0 README and release notes lead with agentic-infrastructure
framing.
2. At least three of the five integration examples ship with v0.3.0.
3. Agentic benchmark suite has at least bench_multi_agent.py and
bench_long_session.py shipping with v0.3.0 + mlx_lm comparison.
4. Any post-v0.3.0 framing that contradicts §2.4 'what we are not'
stance requires a follow-up ADR.
No code change. The test suite still passes (sanity check: 93 tests
in 0.28s on the code adjacent to the touched area).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
* Fix sse-starlette event-loop binding in streaming tests
Root cause:
sse_starlette.sse.AppStatus.should_exit_event is a class-level
anyio.Event lazily created on first SSE response. It binds to the
asyncio event loop that is running at that moment. pytest-asyncio
(>=0.21, function-scoped loop default in 1.x) creates a fresh event
loop per async test, so the second SSE test inherits an Event bound
to a now-closed loop and the SSE response raises:
RuntimeError: <asyncio.locks.Event ...> is bound to a different event loop
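
The failure mode can be reproduced with the standard library alone. A minimal
sketch, assuming a stand-in class (FakeAppStatus, hypothetical) in place of
sse_starlette.sse.AppStatus and a plain asyncio.Event in place of the
anyio.Event: each asyncio.run() call plays the role of one async test with its
own fresh loop.

```python
import asyncio


class FakeAppStatus:
    """Hypothetical stand-in for the class that holds the shared event."""
    should_exit_event = None  # class-level, created lazily on first use


async def fake_sse_response():
    """Lazily create the event, then wait on it briefly so it binds to the loop."""
    if FakeAppStatus.should_exit_event is None:
        FakeAppStatus.should_exit_event = asyncio.Event()
    try:
        await asyncio.wait_for(FakeAppStatus.should_exit_event.wait(), timeout=0.01)
    except asyncio.TimeoutError:
        pass  # the event is never set; the timeout is expected


def run_two_tests():
    """Run two 'tests' on separate loops and return the error from the second."""
    FakeAppStatus.should_exit_event = None
    asyncio.run(fake_sse_response())      # first test: event binds to loop 1
    try:
        asyncio.run(fake_sse_response())  # second test: new loop, stale event
    except RuntimeError as exc:
        return exc
    return None


error = run_two_tests()
```

The second run raises RuntimeError because the event created during the first
run still belongs to the loop that asyncio.run() has already closed. The exact
message text depends on the Python version.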
This made 5 streaming tests fail on main:
- test_stream_concatenated_content_matches_engine_decode
- test_stream_finish_reason_length_on_max_tokens
- test_stream_returns_done_sentinel_at_end
- test_stream_each_chunk_has_required_openai_fields
- test_stream_completion_id_consistent_across_chunks
Fix:
Add an autouse fixture in tests/inference_engine/server/conftest.py
that resets AppStatus.should_exit_event to None before and after
every test in this package. Production code is unaffected — uvicorn
stays on a single loop for the lifetime of the process so the lazy
init runs exactly once there.
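
A minimal sketch of the reset pattern, using the same kind of stdlib stand-in
rather than the real sse_starlette class so that it runs without third-party
packages. FakeAppStatus, fresh_exit_event and run_isolated_test are
hypothetical names; in conftest.py the same set-to-None, yield, set-to-None
body would be registered as an autouse pytest fixture instead of a context
manager.

```python
import asyncio
from contextlib import contextmanager


class FakeAppStatus:
    """Hypothetical stand-in for sse_starlette.sse.AppStatus."""
    should_exit_event = None


@contextmanager
def fresh_exit_event(status_cls=FakeAppStatus):
    """Clear the class-level event before and after the wrapped block."""
    status_cls.should_exit_event = None
    try:
        yield
    finally:
        status_cls.should_exit_event = None


async def fake_sse_response():
    """Lazily create the event and wait on it so it binds to the running loop."""
    if FakeAppStatus.should_exit_event is None:
        FakeAppStatus.should_exit_event = asyncio.Event()
    try:
        await asyncio.wait_for(FakeAppStatus.should_exit_event.wait(), timeout=0.01)
    except asyncio.TimeoutError:
        pass  # the event is never set; the timeout is expected
    return "ok"


def run_isolated_test():
    """Mimic one async test: a fresh loop, with the reset wrapped around it."""
    with fresh_exit_event():
        return asyncio.run(fake_sse_response())


# Three consecutive "tests", each on its own loop, now all succeed.
results = [run_isolated_test() for _ in range(3)]
```

Resetting on both sides of the test means a stale event cannot leak in from
an earlier test or out to a later one, whichever order the tests run in.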
Result:
tests/inference_engine/server/test_app_streaming.py: 11 passed
tests/inference_engine/server/ (full suite): 215 passed
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent 0305b1a commit 18dbfdd
3 files changed
Lines changed: 405 additions & 1 deletion