README.md

Platform test & benchmark archive

This directory is the historical archive of platform-test and benchmark artifacts collected during Kakeya v0.1 → v0.3 development. Reports here are referenced from ADRs (docs/adr/) and from release notes; do not edit or re-run files in place once they have been committed — they are evidence, not working data.

When a new run is needed, write to a fresh, timestamp-suffixed filename and add an entry below.
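
A minimal sketch of producing such a filename, assuming the numeric suffix used by the existing files (for example `bench_long_session_mac_1780130542.json`) is Unix epoch seconds. The helper name `fresh_artifact_path` and its parameters are hypothetical and not part of this repository.

```python
import time
from pathlib import Path


def fresh_artifact_path(directory, prefix, suffix=".json", now=None):
    """Return a new timestamp-suffixed path and refuse to reuse an existing name.

    Hypothetical helper: assumes the suffix convention is Unix epoch seconds.
    """
    stamp = int(time.time() if now is None else now)
    path = Path(directory) / f"{prefix}_{stamp}{suffix}"
    if path.exists():
        # Committed artifacts are evidence, so never overwrite one in place.
        raise FileExistsError(f"refusing to overwrite {path}")
    return path
```

Passing `now` explicitly is only there to make the sketch testable; a real run would let it default to the current time.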

File-naming conventions

Prefix or pattern	Meaning
`mac-mlx-1{a,b,c}-…`	Phase-1 MLX backend bring-up tests (v0.1 era).
`mac-phase-b-…`	Phase-B sparse-logits proposer tests.
`mac-streaming-…`	E2 server streaming integration tests.
`mlx-…`	First-pass MLX platform test on Mac M4 24 GB.
`bench_mac_kvcache_…`	Sink+window verifier KV peak comparison (vs baseline).
`bench_mac_m4_…`	Mac M4 micro-bench (token throughput, single prompt).
`bench_mlx_speculative_…`	MLX speculative decoder bench (CPU/CPU vs MLX/MLX).
`bench_mlx_verifier_…`	MLX verifier-only forward bench.
`bench_param_sweep_…`	Hyperparameter sweep (block size × num_diffusion_steps × proposer K).
`bench_sparse_vs_dense_…`	Proposer LM-head sparse vs dense logits comparison.
`bench_long_session_mac_…`	Long-session memory-stability run (the v0.3 §2.3.a / §2.3.b evidence).
`…junit.xml` / `…coverage.xml`	Companion test-runner artifacts for the matching `.json`.
`*.partial.json`	Live checkpoint written every N turns by `bench_long_session.py`.
`*.aborted.json`	Annotated abort note when a long run was terminated for triage.
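
The conventions above can be read back mechanically. The sketch below splits an artifact name into prefix, timestamp, and extension, assuming the numeric part is a 10-digit Unix epoch in seconds; the regular expression and the function name `parse_artifact_name` are illustrative, not part of the repository. Under that assumption, the run #1 filename decodes to 2026-05-30 08:42 UTC, which agrees with its entry in the chronological index.

```python
import re
from datetime import datetime, timezone

# Assumed shape: <prefix><"_" or "-"><10-digit epoch seconds>.<extension(s)>
_NAME_RE = re.compile(r"^(?P<prefix>.+?)[_-](?P<epoch>\d{10})\.(?P<ext>.+)$")


def parse_artifact_name(name):
    """Return (prefix, UTC datetime, extension), or None if the name does not fit."""
    m = _NAME_RE.match(name)
    if m is None:
        return None
    when = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
    return m.group("prefix"), when, m.group("ext")
```

Names without a timestamp, such as `README.md`, simply return `None`.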

v0.3 long-session archive (the ADR 0006 → ADR 0007 → ADR 0008 evidence chain)

These five runs were the empirical chain that drove the architecture from "OpenAI-compatible HTTP server with stateless turns" toward the session-bound runtime described in ADR 0008. Each run was originally pushed to its own `AgentMemory/bench-*-8e7f` branch; the files are consolidated here so that ADRs and release notes can reference stable paths on main.

Index (chronological)

#	UTC date	File (in this dir)	Wall time	Successful turns	Errors	Notes
1	2026-05-30 08:42	`bench_long_session_mac_1780130542.aborted.json`	12 412 s aborted	58	0*	First 4 h attempt; aborted at ~3.4 h. Triage notes only.
		`bench_long_session_mac_1780130542.partial.json`				Last live checkpoint from run #1 before abort.
2	2026-05-30 ~16:*	`bench_long_session_mac_short_1780146230.json`	1 800 s (30 min)	57	0	First clean 30 min after orphan-session fix.
		`bench_long_session_mac_short_1780146230.partial.json`
3	2026-05-31 09:*	`bench_long_session_mac_short2_1780196477.json`	1 800 s (30 min)	58	0	Adds in-flight metrics poller (`metrics_poll_interval_s=0.25`).
		`bench_long_session_mac_short2_1780196477.partial.json`
4	2026-05-31 13:*	`bench_long_session_mac_short3_1780208693.json`	1 800 s (30 min)	58	0	KV gauge gated to active sessions; KV peak = 7.4 MiB.
		`bench_long_session_mac_short3_1780208693.partial.json`
5	2026-05-31 14:*	`bench_long_session_mac_4h_1780211323.json`	14 400 s (4 h)	58	182	Memory bounded; throughput collapses to ~0 after turn 58.
		`bench_long_session_mac_4h_1780211323.partial.json`

*The aborted.json records 0 errors only because every later request was rejected with HTTP 429 by the scheduler before the bench client even sent it; the client did not classify those as turn errors. The server log showed sustained 429s — that is the bug the orphan-session fix addressed.
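
One way to keep scheduler rejections from hiding behind a zero error count is to classify every attempted turn explicitly. The sketch below is illustrative only: `classify_turn` and `summarize` are hypothetical names, not functions from `bench_long_session.py`, and it assumes the client records an HTTP status (or a timeout flag) for each attempt.

```python
from collections import Counter


def classify_turn(status_code, timed_out=False):
    """Map one attempted turn to an outcome label."""
    if timed_out:
        return "timeout"
    if status_code == 429:
        return "rejected"  # scheduler back-pressure, counted separately from "ok"
    if status_code is not None and 200 <= status_code < 300:
        return "ok"
    return "error"


def summarize(turns):
    """turns: iterable of (status_code, timed_out) pairs. Returns outcome counts."""
    counts = Counter(classify_turn(status, timed_out) for status, timed_out in turns)
    # Anything that is not "ok" counts against the run.
    counts["not_ok_total"] = sum(v for k, v in counts.items() if k != "ok")
    return dict(counts)
```

With a breakdown like this, a run that is throttled throughout shows a non-zero `rejected` count even when no turn fails outright.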

What the chain tells us

Memory is bounded. Runs #2-#5 hold KV peak at ≈ 0 MiB (runs #2 and #3) or 7.4 MiB (runs #4 and #5), depending on whether the gauge was wired to the engine yet, with KV drift of +0.00 MiB over 10-min buckets. The 4 h run (#5) holds the same bound as the 30-min run (#4). This is the evidence behind ADR 0006 §2.3.a (memory-bounded claim: VERIFIED).
Latency is not bounded. Every run shows positive latency_drift_p50 in the +38 s … +41 s range, and per-bucket p50 grows monotonically:
```
bucket 0 (0-10 min):   ~15 s p50
bucket 1 (10-20 min):  ~38 s p50
bucket 2 (20-30 min):  ~55 s p50
```
In run #5 (the only run long enough to expose this), p95 keeps rising until turns hit the 120 s client timeout and start to error — 182 such timeouts in the 3.5 h tail. This is the evidence behind ADR 0006 §2.3.b (latency-bounded claim — NOT achieved in v0.3).
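
The exact definition of latency_drift_p50 is not spelled out here. A plausible reading, used in the sketch below as an assumption, is "p50 of the last 10-min bucket minus p50 of the first". With the bucket values listed above that gives 55 - 15 = +40 s, inside the reported range. The function names and the `(elapsed_s, latency_s)` sample shape are hypothetical.

```python
from statistics import median

BUCKET_S = 600  # 10-minute buckets, matching the per-bucket listing


def bucket_p50(samples, bucket_s=BUCKET_S):
    """samples: iterable of (elapsed_s, latency_s). Returns {bucket_index: p50}."""
    buckets = {}
    for elapsed_s, latency_s in samples:
        buckets.setdefault(int(elapsed_s // bucket_s), []).append(latency_s)
    return {i: median(v) for i, v in sorted(buckets.items())}


def latency_drift_p50(samples, bucket_s=BUCKET_S):
    """Assumed definition: last-bucket p50 minus first-bucket p50."""
    p50 = bucket_p50(samples, bucket_s)
    return p50[max(p50)] - p50[min(p50)]
```

A bounded-latency run would show a drift near zero under this definition, however long it lasts.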
The cause is full-history prefill on every turn. The bench appends each prior assistant reply to the next prompt. With sink+window KV forced to reset at every request, prefill grows linearly with turn count. This is what made cross-request KV reuse a v0.3 hard requirement rather than a v0.4 nice-to-have, and is what motivated ADR 0007 (automatic prefix matching). When the Qwen3 chat template was found to inject generation-time-only placeholders that break token-id-level prefix matching, the design pivoted again, to the explicit session-bound protocol described in ADR 0008.
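
The linear growth can be illustrated with a toy model. It assumes every turn resends the full history and no KV state survives between requests; the token counts are made-up illustration values, not measurements from these runs, and `prefill_tokens_per_turn` is a hypothetical name.

```python
def prefill_tokens_per_turn(n_turns, user_tokens, reply_tokens, system_tokens=0):
    """Tokens prefilled at each turn when the whole history is resent with no KV reuse."""
    sizes = []
    history = system_tokens
    for _ in range(n_turns):
        history += user_tokens   # the new user message joins the prompt
        sizes.append(history)    # the whole prompt is prefilled from scratch
        history += reply_tokens  # the assistant reply is appended for the next turn
    return sizes


# Illustrative numbers only: 100-token user turns, 400-token replies.
sizes = prefill_tokens_per_turn(4, user_tokens=100, reply_tokens=400)
print(sizes)       # [100, 600, 1100, 1600]
print(sum(sizes))  # 3400
```

Per-turn prefill grows by a constant amount each turn, so the total prefill work over a session grows quadratically with turn count. That growth is what cross-request KV reuse is meant to remove.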

Source branches (audit trail)

File	Source branch
`bench_long_session_mac_1780130542.{aborted,partial}.json`	`AgentMemory/bench-long-session-mac-results-8e7f`
`bench_long_session_mac_short_1780146230.{,partial.}json`	`AgentMemory/bench-short-test-results-8e7f`
`bench_long_session_mac_short2_1780196477.{,partial.}json`	`AgentMemory/bench-short-test-results-2-8e7f`
`bench_long_session_mac_short3_1780208693.{,partial.}json`	`AgentMemory/bench-short-test-results-3-8e7f`
`bench_long_session_mac_4h_1780211323.{,partial.}json`	`AgentMemory/bench-long-4h-mac-results-8e7f`

These branches remain on origin/ for the original commit-hash audit trail; the JSON snapshots are reproduced verbatim here.

Earlier (v0.1 / v0.2) artifacts

The `mlx-…`, `mac-mlx-1{a,b,c}-…`, `mac-phase-b-…`, `mac-streaming-…`, `bench_mac_*`, `bench_mlx_*`, `bench_param_sweep_…`, and `bench_sparse_vs_dense_…` files predate the long-session work and were landed in earlier PRs (Phase-1 bring-up, MLX-1a/1b/1c probes, sparse-logits A/B, MLX speculative decoder bring-up). They are kept here unchanged for historical reproducibility and for ADR 0001 / ADR 0002 / ADR 0003 cross-references.
