Skip to content

Commit 7ba95f0

Browse files
authored
Merge pull request #29 from FluffyAIcode/AgentMemory/adr-0006-section-2.3-revision-8e7f
ADR 0006 §2.3 revision: split memory-bounded vs latency-bounded (with 4h evidence)
2 parents 066d2c3 + 6730de7 commit 7ba95f0

1 file changed

Lines changed: 119 additions & 9 deletions

File tree

docs/adr/0006-local-agent-infrastructure-positioning.md

Lines changed: 119 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -58,13 +58,21 @@ For comparison, the same requirements against `mlx_lm.server`
5858
| Requirement | mlx_lm.server | Kakeya v0.2.0 | Kakeya v0.4 (planned) |
5959
| ------------------------------------------ | -------------- | ------------------ | --------------------- |
6060
| Multiple concurrent agents | Single-tenant | Multi-tenant | Multi-tenant |
61-
| Long-session memory stability | KV grows linearly | sink+window bounded | sink+window bounded |
61+
| Long-session **memory** stability | KV grows linearly | sink+window bounded | sink+window bounded |
62+
| Long-session **latency** stability | grows linearly with history | grows linearly with history (stateless API) | bounded via cross-request KV reuse |
6263
| Cross-session memory | None | Schema in ADR 0005 | Personal data store |
6364
| Per-user personalization | None | Stage 1 surgery | Personal LoRA layer |
6465
| Production metrics | None | Prometheus | Prometheus |
6566
| API key auth | None | Bearer token | Bearer token |
6667
| Mid-stream cancel | Basic | Full lifecycle | Full lifecycle |
6768

69+
The "memory stability" and "latency stability" rows are deliberately
70+
separated. Sink+window bounds **memory** at the verifier level, but
71+
the OpenAI chat-completions protocol is stateless — every turn the
72+
client re-sends the full chat history and the server re-prefills it
73+
end-to-end. v0.3 makes no claim that per-turn latency stays bounded
74+
across a long session; cross-request KV reuse is a v0.4 feature.
75+
6876
For single-user, single-agent, short-session use, `mlx_lm.server`
6977
is a better fit (simpler, more model selection). For
7078
**multi-agent / long-session / personalized** use, Kakeya's
@@ -90,11 +98,16 @@ scripts):
9098

9199
> Kakeya v0.3 is a production-grade **local agent infrastructure
92100
> for Mac**. It runs multiple concurrent agents on a single
93-
> machine with bounded per-session memory, learns per-user
94-
> codebase and workflow patterns through on-device alignment
95-
> training, retains conversation history across sessions, and
96-
> exposes Prometheus metrics + API-key auth for long-running
97-
> deployment.
101+
> machine with **per-session KV memory bounded by sink+window**
102+
> (verified to the byte against the theoretical limit on Mac M4 —
103+
> see §2.3.a), learns per-user codebase and workflow patterns
104+
> through on-device alignment training, retains conversation
105+
> history across sessions, and exposes Prometheus metrics +
106+
> API-key auth for long-running deployment.
107+
>
108+
> Per-turn latency is not bounded across long sessions in v0.3
109+
> because the OpenAI chat-completions protocol is stateless;
110+
> cross-request KV reuse is a v0.4 feature (see §2.3.b).
98111
99112
The technical detail (acceptance rate, alignment training,
100113
speculative speedup) becomes implementation evidence, not the
@@ -181,12 +194,90 @@ release evidence. Specifically, the v0.3.0 release notes claim
181194
must be backed by:
182195

183196
- "3 concurrent agents on M4 24GB" → measured by `bench_multi_agent.py`
184-
- "4-hour session without OOM" → measured by `bench_long_session.py`
197+
- "Per-session KV bounded across long sessions" → measured by
198+
`bench_long_session.py` (the §2.3.a sub-claim below)
185199
- "100% tool-call JSON validity" → measured by `bench_tool_call_reliability.py`
186200

187201
If any benchmark fails to back its claim, the release notes
188202
adjust the claim, not the benchmark.
189203

204+
The long-session benchmark validates **two distinct sub-claims**;
205+
v0.3 makes only the first:
206+
207+
#### 2.3.a Memory bounded across long sessions (v0.3 claim)
208+
209+
Per-session KV cache stays bounded by the configured sink+window
210+
regardless of session duration or generated-token count. This is
211+
the §2.3 headline `bench_long_session.py` exists to measure.
212+
213+
**Status (v0.3.0-rc1)**: VERIFIED on two independent runs:
214+
215+
| Run | Wall time | Successful turns | KV peak per turn | Spread |
216+
|---|---|---|---|---|
217+
| 30-min short test #3 | 1,800 s | 58 | 7,798,784 bytes (×58) | 0.00% |
218+
| 4-hour run | 14,400 s | 58 (first 30 min) | 7,798,784 bytes (×58) | 0.00% |
219+
220+
Both runs were on Mac M4 with Qwen3-1.7B and `sink_size=4
221+
window_size=64`. Both recorded a per-turn KV peak of exactly
222+
**7,798,784 bytes** — drift 0.00 MiB, observed/expected = 100.0000%
223+
to the byte. The 4-hour run additionally confirmed the
224+
orphan-session fix invariant (PR #25) over 4 hours: `idle
225+
pool_in_use` stayed at 0 throughout, even while the run was
226+
processing 182 timeout/429 cycles in the §2.3.b regime.
227+
228+
The observed value matches the theoretical sink+window bound:
229+
230+
```
231+
68 tokens × (28 layers × 2 (K+V) × 8 KV-heads × 128 head_dim × 2 bytes) = 7,798,784
232+
```
233+
234+
Evidence files:
235+
- `results/platform-tests/bench_long_session_mac_short3_1780208693.json`
236+
- `results/platform-tests/bench_long_session_mac_4h_1780211323.json`
237+
238+
#### 2.3.b Latency bounded across long sessions (NOT a v0.3 claim)
239+
240+
Per-turn latency does **not** stay bounded as chat history grows.
241+
The OpenAI chat-completions protocol is stateless: every turn the
242+
server re-tokenizes and re-prefills the full conversation history
243+
end-to-end. Sink+window only bounds the generation-phase KV
244+
footprint, not prefill compute.
245+
246+
**Status (v0.3.0-rc1)**: NOT achieved by sink+window alone, and
247+
empirically observed:
248+
249+
- Short test #3 (30 min): p50 latency drifted 15.5 s (0–10 min)
250+
→ 38.6 s (10–20 min) → 55.3 s (20–30 min) as the chat history
251+
grew from ~50 to ~3,700 tokens.
252+
- 4-hour run: completed only 58 turns of useful work (matching
253+
the 30-min run exactly), then accumulated 182 errors over the
254+
remaining 3.5 hours. Error breakdown: 96 client-side
255+
ReadTimeouts (per-turn latency exceeded the bench's
256+
`timeout_s=120`) interleaved with 86 HTTP 429s (the timed-out
257+
request's slab still held while the server worker thread
258+
finished its prefill). This is exactly the §2.3.b pattern:
259+
prefill cost grew past the bench's fixed timeout, and the
260+
long-session degraded into a timeout/recovery loop.
261+
262+
**Practical envelope on Mac M4 / Qwen3-1.7B with `--max-tokens 64
263+
--turn-spacing-s 5 --timeout-s 120`**: useful work for ~30 min /
264+
~60 multi-turn turns of a single continuous session. Past that,
265+
client-side prompt management is required.
266+
267+
**Mitigation in v0.3**: client-side prompt management (summarization,
268+
sliding windows, history truncation) is the user-side fix.
269+
Acceptable for short-turn tool-use agents; insufficient for
270+
hours-long single-session workloads.
271+
272+
**v0.4 plan**: cross-request KV reuse via session affinity — a
273+
follow-up ADR will design the protocol extension and the engine
274+
support for it. See ADR 0006 §2.5 for prioritization context.
275+
276+
The v0.3 release framing must therefore say "long-session
277+
**memory** stability", not "long-session stability". Bench
278+
reports always carry the latency-drift series alongside the
279+
KV-bounded series so the trade-off is transparent to operators.
280+
190281
### 2.4 Establish a "what we are not" stance
191282

192283
To keep the agentic positioning sharp, we explicitly **decline**
@@ -357,13 +448,32 @@ This ADR is considered validated when:
357448
3. The agentic benchmark suite (§2.3) has at least
358449
`bench_multi_agent.py` and `bench_long_session.py` shipped
359450
with v0.3.0, with comparison numbers vs `mlx_lm.server`.
360-
4. Any post-v0.3.0 release positioning that contradicts §2.4
451+
4. **The §2.3.a memory-bounded sub-claim is validated by an
452+
on-device measurement that matches the theoretical sink+window
453+
bound to within 1% across at least 30 minutes of continuous
454+
single-session traffic.** Validated by two independent
455+
v0.3.0-rc1 runs on Mac M4 with Qwen3-1.7B:
456+
- 30-min short test #3 — 58/58 turns at exactly 7,798,784 bytes
457+
(`bench_long_session_mac_short3_1780208693.json`)
458+
- 4-hour run — same 58 turns at the same 7,798,784 bytes,
459+
plus orphan-session invariant verified for 4 hours of
460+
continuous server uptime
461+
(`bench_long_session_mac_4h_1780211323.json`)
462+
Both runs: observed/expected = 100.0000% to the byte.
463+
5. **The §2.3.b latency-not-bounded caveat is documented in v0.3.0
464+
release notes**, README, and bench script docstrings — readers
465+
must not infer "long-session stability" without seeing the
466+
memory-vs-latency split.
467+
6. Any post-v0.3.0 release positioning that contradicts §2.4
361468
(the "what we are not" stance) requires a follow-up ADR
362469
superseding this one — not a unilateral marketing decision.
363470

364471
If item 1 is satisfied but items 2-3 are not, the v0.3.0 release
365472
notes acknowledge the gap and identify which integrations /
366-
benchmarks are deferred to v0.3.1.
473+
benchmarks are deferred to v0.3.1. Items 4 and 5 are GA gates —
474+
v0.3.0 cannot promote from rc to GA without the §2.3.a validation
475+
landing in `results/` and the §2.3.b caveat appearing in release
476+
notes.
367477

368478
## 6. References
369479

0 commit comments

Comments
 (0)