Commit be0a35e

Merge pull request #146 from FluffyAIcode/AgentMemory/mac-continuous-decode-restoration-2815
fix(mlx-fused): long-decode degeneration past RotatingKVCache wrap + correct quality gate
2 parents c396d0f + e62b9bf commit be0a35e

9 files changed

Lines changed: 403 additions & 77 deletions

docs/kakeya-autonomous-iteration-and-self-correction.md

Lines changed: 63 additions & 19 deletions
@@ -34,7 +34,7 @@ How it slipped through — the **silent-fallback anti-pattern**, in its observed
3434
| B | f_θ bypassed under S5 ("free lunch" smoke opt) | "restoration engine" | `build_restoration` returns `{}`; no f_θ forward |
3535
| C | a proxy/plumbing run | "engine validated" | wrong model (Qwen3-4B), no trained f_θ/proposer, prompt inside window |
3636
| D | a simpler component shipped | "the engine" | verifier-only AR chat presented as the product |
37-
| E | long-decode degeneration | "the engine works (smoke passed)" | restoration covers only ≤ window decode tokens; a real long answer (780 tok ≫ 64) degenerated to garbage + throughput collapse (0.31 tok/s) — masked because every smoke answer was ≤ window |
37+
| E | long-decode degeneration | "the engine works (smoke passed)" | a long answer (>1024 tok) degenerated to a `由于由于…` loop — masked because every smoke answer was short. **Root cause (confirmed by debug loop, not the initial guess):** the fused spec-decode rollback's `trim_prompt_cache` silently fails once the native `RotatingKVCache` ring wraps at `max_size`≈1024, desyncing `cache.offset` from `past_len`. Fixed via single-token commits past the wrap. The *initial* hypothesis ("restoration only covers ≤ window=64") was disproved by runtime evidence — see §4b. |
3838

3939
Common root cause: an agent (or optimization) chose the **easy/robust path** and
4040
**relabeled it as the hard one**, and no automated check asserted the intended
@@ -146,12 +146,26 @@ the output was garbage and throughput collapsed. So the gate **also** asserts th
146146

147147
| Invariant | Manifest field | Gate assertion | Code |
148148
| --- | --- | --- | --- |
149-
| restoration covers the generation | `window`, per-turn `tokens` | restored run: `tokens <= window` (beyond it, evicted-during-decode positions are unrestored) | `RESTORATION_COVERAGE` |
150-
| output is not degenerate | per-turn `text` | no runaway repeat (≥8 identical short lines) | `OUTPUT_DEGENERATE` |
151-
152-
Verified: a PoW-style report (`tokens=780 > window=64`, repeated `* * *`) now
153-
**fails** the walker (CI + on-device) with both codes. **Liveness proves the
154-
components ran; quality proves they produced a valid result — the gate needs both.**
149+
| output is not degenerate | per-turn `text` | no runaway repeat — ≥8 identical short lines **or** a 1–8 char unit tiled ≥8× at the tail (catches the newline-free `由于由于…` collapse) | `OUTPUT_DEGENERATE` |
150+
151+
Verified: a PoW-style report (repeated `* * *` lines, or `"由于"×120` with no
152+
line breaks) **fails** the walker (CI + on-device); the real coherent long answer
153+
and templated `矿工 A/B/C` enumerations **pass**. **Liveness proves the components
154+
ran; quality proves they produced a valid result — the gate needs both.**
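
A minimal Python sketch of the two checks named in the `OUTPUT_DEGENERATE` row above. This is not the code in the repository's gate; the function names, the `max_line_len=16` cutoff for a "short" line, and the choice to count consecutive (rather than total) identical lines are assumptions made for illustration.

```python
def has_runaway_lines(text: str, min_repeats: int = 8, max_line_len: int = 16) -> bool:
    """True if a short, non-empty line repeats `min_repeats` times in a row."""
    run, prev = 0, None
    for raw in text.splitlines():
        line = raw.strip()
        if line and len(line) <= max_line_len and line == prev:
            run += 1
        else:
            # A new (or blank, or long) line restarts the run.
            run = 1 if line else 0
        prev = line
        if run >= min_repeats:
            return True
    return False


def has_runaway_tail(text: str, max_unit: int = 8, min_repeats: int = 8) -> bool:
    """True if the text ends in a 1..max_unit character unit tiled min_repeats times."""
    tail = text.rstrip()
    for unit_len in range(1, max_unit + 1):
        need = unit_len * min_repeats
        if len(tail) < need:
            break
        unit = tail[-unit_len:]
        if tail[-need:] == unit * min_repeats:
            return True
    return False


def is_degenerate(text: str) -> bool:
    """Hypothetical combined gate: either signal marks the output as degenerate."""
    return has_runaway_lines(text) or has_runaway_tail(text)
```

A detector this simple would also flag a legitimate trailing run such as `--------`, so a real gate probably needs extra guards that this sketch leaves out.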
155+
156+
**Correction (2026-06-17) — `RESTORATION_COVERAGE` removed.** An earlier gate
157+
fired when a restored run generated more tokens than the S5 `window` (=64), on the
158+
theory that decode-time evicted positions are "unrestored" and the output beyond
159+
the window must degenerate. **Mac runtime evidence disproved that theory** (see
160+
the §"long-decode degeneration" root-cause below): the decode cache is the model's
161+
native hybrid cache (sliding `RotatingKVCache` with `max_size`≈1024, not the S5
162+
window), so nothing is evicted until ~1024 tokens; and a 1300-token run with **332
163+
evicted-unrestored positions stayed fully coherent** once the *actual* bug was
164+
fixed. "tokens > window" and even "evicted > 0" are not degeneration signals, so
165+
the rule was a pure false-positive (it would have failed every coherent answer
166+
> 64 tokens). The only trustworthy quality gate is the **empirical** one:
167+
`OUTPUT_DEGENERATE`. This is itself an instance of the North-Star discipline —
168+
*verify against runtime, never trust a plausible code comment/hypothesis.*
155169

156170
---
157171

@@ -208,18 +222,48 @@ proposer live (`blocks=2/4`, `accept_len=4.0/3.5`), f_θ live by default
208222
(`f_theta_ran=TRUE`, 25 sliding layers), correct answers, bounded KV, natural EOS
209223
stop. One-command launcher: `scripts/run_kakeya_mac.sh`. (PR #144 + this PR.)
210224

211-
**Known limitation (anti-pattern E, found 2026-06-17):** the Mac fused engine's
212-
restoration is **prefill-amortized for the prompt only** — it covers ≤ `window`
213-
decode tokens (code comment, `k3_integrated_niah_eval_mac.py` §"Per-sample
214-
restoration"). Generations longer than the window degenerate (garbage + throughput
215-
collapse + KV growth). The §4b gate now **fails loud** on it; the *fix* is
216-
**continuous decode-time restoration** (re-restore positions evicted during decode,
217-
as the CUDA engine does) — the real open engineering work, not a gate matter.
218-
219-
**Open / next:** (1) continuous decode-time restoration so long generations don't
220-
degenerate (the engine fix); (2) full-attention model (Qwen/Llama) where f_θ is
221-
load-bearing for the large memory win. The gate (§4/§4b) now prevents silent
222-
regression to verifier-only AND silent long-decode degeneration.
225+
**Long-decode degeneration — root cause found and FIXED (2026-06-17).** The
226+
originally-hypothesised cause (anti-pattern E: "restoration covers only ≤ `window`
227+
decode tokens") was **wrong**, and the debug loop disproved it with runtime
228+
evidence — a textbook case of *verify, don't trust the comment*:
229+
230+
1. **Characterization (128 → 800 → 1300 tokens, Mac M4, prompt "请详细解释POW的工作原理"):**
231+
- The decode cache is the model's **native hybrid cache** — sliding layers are
232+
`RotatingKVCache` (`max_size`=1024, `keep`=0), full layers are `KVCache`. The
233+
S5 `--window-size 64` only feeds the analytical memory math; it does **not**
234+
bound the decode cache. So nothing is evicted until ~1024 tokens.
235+
- At 128 and 800 tokens the fused output was **fully coherent** (`max_run=1`);
236+
`lost=0`; the hypothesis predicted failure at 64 — disproved.
237+
- At **1300 tokens** the fused engine **degenerated** into a `由于由于…` loop
238+
(`cyc_frac=1.0`) starting at gen≈1064 — *only after the ring wrapped at
239+
gen≈1017*. The **native-greedy control on the same prompt stayed coherent**
240+
past the wrap (terminated cleanly at gen 1247), proving the model handles
241+
>1024 fine and the **fused engine** was at fault.
242+
2. **Root cause:** once the sliding `RotatingKVCache` ring wraps (`offset ≥ max_size`),
243+
`mlx_lm.trim_prompt_cache` is **all-or-nothing and refuses** (a rotating layer is
244+
`is_trimmable` only while `offset < max_size`). The fused speculative loop's
245+
rejected-draft rollback then silently fails — 15 `trim short:true` events — so
246+
`cache.offset` ran **+8 ahead of the committed `past_len`** on every post-wrap
247+
block, misaligning RoPE/causal masking → logit corruption → collapse.
248+
3. **Fix (`fused_specdecode.py`, `_sliding_ring_would_wrap` + `if wrap_l1: L=1`):**
249+
detect the impending wrap and commit **single-token blocks** past it. With L=1
250+
the bonus token is always accepted (it *is* `argmax(next_token_logits)`), so
251+
there is never a rejected tail to trim and `offset` stays `== past_len`.
252+
4. **Validated (re-run, 1300 tokens):** `trim short:true` 15→0; post-wrap
253+
offset-desync 76/76→0; post-wrap `cyc_frac` 1.0→0.158; fused output **coherent**,
254+
clean termination at gen 1241 — matching the native control. (Cost: spec-decode
255+
speedup is forgone past `max_size`; correctness-first.)
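
The mechanism in steps 2 and 3 can be illustrated with a toy model. The sketch below is plain Python with no MLX dependency: `ToyRotatingCache`, `ring_would_wrap`, and `run_blocks` are hypothetical stand-ins, not the real `RotatingKVCache` or `_sliding_ring_would_wrap`. It models only the two facts stated above (a rotating layer is trimmable only while `offset < max_size`, and the rollback trims the rejected tail) and does not try to reproduce the exact `+8` gap seen on device.

```python
from dataclasses import dataclass


@dataclass
class ToyRotatingCache:
    """Toy stand-in for a rotating KV cache; it tracks only the write offset."""
    max_size: int
    offset: int = 0

    def is_trimmable(self) -> bool:
        # Assumption taken from the text: trimmable only before the ring wraps.
        return self.offset < self.max_size

    def append(self, n: int) -> None:
        self.offset += n

    def trim(self, n: int) -> int:
        """All-or-nothing rollback: returns how many positions were trimmed."""
        if not self.is_trimmable():
            return 0
        n = min(n, self.offset)
        self.offset -= n
        return n


def ring_would_wrap(cache: ToyRotatingCache, block_len: int) -> bool:
    """Hypothetical guard: would writing this block reach max_size?"""
    return cache.offset + block_len >= cache.max_size


def run_blocks(cache: ToyRotatingCache, steps: int, block_len: int,
               accepted: int, guard: bool) -> int:
    """Run `steps` speculative blocks; return how many ended with offset != past_len."""
    past_len = cache.offset
    desynced = 0
    for _ in range(steps):
        length = 1 if (guard and ring_would_wrap(cache, block_len)) else block_len
        kept = min(accepted, length)   # with length == 1 the single token is kept
        cache.append(length)           # the verifier writes every drafted position
        past_len += kept
        cache.trim(length - kept)      # roll back the rejected tail (may refuse)
        if cache.offset != past_len:
            desynced += 1
    return desynced


unguarded = run_blocks(ToyRotatingCache(max_size=32), steps=20, block_len=8, accepted=4, guard=False)
guarded = run_blocks(ToyRotatingCache(max_size=32), steps=20, block_len=8, accepted=4, guard=True)
```

In this toy the unguarded run desyncs on every block once the offset reaches `max_size`, while the guarded run never has a rejected tail to trim, which is the property the fix relies on.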
256+
257+
So eviction past `max_size` is **normal and harmless** (it is gemma's native
258+
sliding-window behavior); "continuous decode-time restoration" is **not** required
259+
for ≤-context coherence. The §4b gate now keys purely on the empirical
260+
`OUTPUT_DEGENERATE` signal (above).
261+
262+
**Open / next:** (1) optional perf: a sound *wrapped-ring rollback* (snapshot/restore
263+
of the rotating cache) to keep speculative speedup past `max_size` — pure throughput,
264+
not correctness; (2) full-attention model (Qwen/Llama) where f_θ is load-bearing for
265+
the large memory win. The gate (§4/§4b) prevents silent regression to verifier-only
266+
AND silent long-decode degeneration.
223267

224268
> Maintenance: append to §7 every iteration; update §4 if new components/
225269
> invariants appear; never delete the §1 failure record — it is the reason for §0.

docs/kakeyainferenceenginebuildskill.md

Lines changed: 98 additions & 12 deletions
@@ -7,8 +7,9 @@ v0.5-cuda) into: what the engine is, where the code lives, how to run/benchmark
77
it, the milestone roadmap, the hard-won bugs+fixes, and — most important — the
88
**validation honesty standards** (the rules that keep claims defensible).
99

10-
> If you only read one section, read **§7 Validation & honesty standards**. The
11-
> most expensive mistakes in this project were *overclaims*, not bugs.
10+
> If you only read one section, read **§8 Validation & honesty standards**. The
11+
> most expensive mistakes in this project were *overclaims*, not bugs. For the
12+
> debugging method, **§7 is a reusable worked template.**
1213
1314
---
1415

@@ -142,7 +143,7 @@ Port lessons: `docs/mlx-port-lessons.md`.
142143
| **KIE-v1.1.z** (#139) | throughput + N=75 | **N=75 MET** (recall 1.0, 126.7 GB, ~4.8× vLLM; ~31 tok/s aggregate); **decode ≥ vLLM NOT met** (eager 26B-MoE wall) |
143144
| **KIE-v1.1.z2** | rebuild fused-MoE + graph forward | **abandoned** — superseded by KIE-v2 (run *on* vLLM) |
144145
| **KIE-v2** (#140) | **Kakeya Attention on vLLM** | decode **≥ vLLM (1.15–1.23×)** @16k, recall 1.0, measured to N=70 — inherits vLLM runtime |
145-
| **v0.5-cuda** (#141) | release `KakeyaVLLM` + consolidated reports | done (gemma-4 instantiation). Product concurrency claim = **`KakeyaVLLM` N→70 @16k** on vLLM; the **N=75 @62k is the *eager* `KakeyaEngine` substrate**, not the v0.5 product path — do not conflate. See §7 for exact validation scope |
146+
| **v0.5-cuda** (#141) | release `KakeyaVLLM` + consolidated reports | done (gemma-4 instantiation). Product concurrency claim = **`KakeyaVLLM` N→70 @16k** on vLLM; the **N=75 @62k is the *eager* `KakeyaEngine` substrate**, not the v0.5 product path — do not conflate. See §8 for exact validation scope |
146147
| **v0.6** (= ADR 0015 KIE-v1.2) | **restoration backend on full-attention models** (Qwen/Llama): train f_θ/proposer + inject restoration at vLLM prefill + graph-capturable quantized-exact kernel | **planned — the real memory differentiator (~6×)** |
147148
148149
> **N=16 vs N=24 (KIE-v1.1 precaution).** The evicting StaticCache alone at the
@@ -168,6 +169,7 @@ Port lessons: `docs/mlx-port-lessons.md`.
168169
| `torch.compile` attention 6.6× but **0% e2e decode gain** | decode dominated by **eager 26B-MoE full-model forward**, not attention | need fused-MoE + full-forward graph capture → that's vLLM's job → **KIE-v2** |
169170
| fused-MoE port blocked | HF `kernels` incompatible w/ transformers 5.12; vLLM `fused_moe` cross-venv surgery; from-scratch = multi-week | **run Kakeya ON vLLM** instead of rebuilding it (KIE-v2) |
170171
| `KakeyaVLLM` crash on text-only model | unconditional `text_config` nesting (gemma multimodal) breaks Qwen/Llama (`num_attention_heads` missing) | **auto-detect** `text_config` via `AutoConfig`: nested for gemma-4, flat for Qwen/Llama |
172+
| MLX fused engine **long-decode degeneration** (`由于由于…` runaway past ~1024 tok, throughput collapse) | once the native sliding `RotatingKVCache` ring **wraps** (`offset ≥ max_size`≈1024), `mlx_lm.trim_prompt_cache` refuses the spec-decode rejected-draft rollback (all-or-nothing; `is_trimmable` needs `offset < max_size`) → un-trimmed rejects leave `cache.offset` **+8 ahead of `past_len`** → RoPE/mask desync → logit corruption | detect the impending wrap (`_sliding_ring_would_wrap`) and commit **single-token blocks** past it (`L=1`): the bonus is always accepted, so there's no rejected tail to trim and `offset` stays `== past_len`. **Full worked template in §7.** |
171173
172174
---
173175
@@ -189,12 +191,96 @@ Port lessons: `docs/mlx-port-lessons.md`.
189191
190192
---
191193
192-
## 7. Validation & honesty standards (READ THIS)
194+
## 7. Worked case study: debugging the long-decode degeneration (a TEMPLATE)
195+
196+
This is the **model example** of how to debug a non-obvious runtime bug in this
197+
project. Reuse the *shape* of this process for any "it works in smoke tests but
198+
breaks in the real workload" bug. The actual fix is the `RotatingKVCache`-wrap
199+
row in §5; what follows is the **method**, written so it transfers.
200+
201+
### 7.A The symptom
202+
Mac (MLX) fused spec-decode engine produced **garbage on long answers**: a long
203+
reply (e.g. to the prompt "请详细解释POW的工作原理", "Please explain in detail how PoW works") started coherent, then collapsed into a runaway
204+
repeat (`由于由于由于…`) with throughput falling off. Short answers were fine, so it
205+
had slipped through every smoke test.
206+
207+
### 7.B The process (the reusable template)
208+
209+
> **Golden rule (this project's §6 principle made concrete): never fix from code
210+
> alone. Reproduce → instrument → measure → let runtime evidence pick the
211+
> hypothesis. Be ready to have your first hypothesis killed by the data.**
212+
213+
1. **Write down the initial hypothesis — then try to disprove it, not confirm it.**
214+
Initial guess (from a code comment): "restoration only covers ≤ `window`=64
215+
decode tokens, so output past 64 is unrestored → degenerate." Plausible, and
216+
**wrong**. Treat plausible hypotheses as suspects, not conclusions.
217+
2. **Reproduce at increasing scale, on the real device, with one fixed prompt.**
218+
Drive the Mac M4 via the bridge (`mlx-kakeya-degen-probe` preset). Sweep the
219+
one variable that matters (generation length):
220+
221+
| run | length | result | inference |
222+
| --- | --- | --- | --- |
223+
| 1 | 128 tok | coherent | kills "fails at window=64"; also reveals the decode cache is the model's **native `RotatingKVCache` (`max_size`≈1024)**, *not* the S5 window |
224+
| 2 | 800 tok | coherent | failure is past 800 → keep going |
225+
| 3 | 1300 tok | **degenerates** at gen≈1064 | reproduced; onset is right after the ring **wraps** at gen≈1017 |
226+
227+
3. **Add a discriminating control (the single highest-value step).** In run 3,
228+
also decode the **same prompt with a plain native-greedy loop** (`--chat-native-ref`)
229+
as an A/B. Native stayed **fully coherent** past the wrap (clean stop @ 1247) →
230+
*the model handles >1024 fine; the fused engine corrupts it.* A control that
231+
isolates "your code" from "the model/library" is worth more than ten more logs.
232+
4. **Instrument the exact mechanism the data now points at.** NDJSON per-block
233+
logs of cache `offset` vs committed `past_len`, and of every `trim_prompt_cache`
234+
call. Smoking gun: after the wrap, `offset` ran **+8 ahead of `past_len`** on
235+
every block, with **15 "trim refused" events** — only post-wrap.
236+
5. **State the root cause mechanistically** (see §5 row): wrapped ring →
237+
`trim_prompt_cache` refuses → rejected drafts linger → offset/`past_len` desync
238+
→ RoPE/mask misalignment → logit corruption.
239+
6. **Fix correctness-first**, then re-run the *identical* probe and show the
240+
metrics move the right way:
241+
242+
| signal | before | after |
243+
| --- | --- | --- |
244+
| "trim refused" events | 15 | **0** |
245+
| post-wrap offset desync | 76/76 blocks | **0/225** |
246+
| repetition `cyc_frac` | 1.0 (collapse) | **0.158** |
247+
| final text | `由于…` runaway | **coherent**, clean stop @ 1241 (= native) |
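
The measurements in steps 4 and 6 can be reduced to a small log summary. The sketch below assumes a hypothetical NDJSON schema (`event`, `offset`, `past_len`, `short`); the real probe's field names are not given here, so treat every key as a placeholder.

```python
import json
from typing import Iterable


def summarize_probe(ndjson_lines: Iterable[str], max_size: int) -> dict:
    """Count post-wrap blocks, offset/past_len desyncs, and refused trims."""
    stats = {"post_wrap_blocks": 0, "desynced": 0, "trim_refused": 0}
    for raw in ndjson_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the log
        rec = json.loads(raw)
        if rec.get("event") == "trim":
            # Assumed meaning of "short": the trim removed fewer tokens than asked.
            if rec.get("short"):
                stats["trim_refused"] += 1
        elif rec.get("event") == "block" and rec["offset"] >= max_size:
            stats["post_wrap_blocks"] += 1
            if rec["offset"] != rec["past_len"]:
                stats["desynced"] += 1
    return stats


sample = [
    '{"event": "block", "offset": 1000, "past_len": 1000}',
    '{"event": "block", "offset": 1032, "past_len": 1024}',
    '{"event": "trim", "short": true}',
]
summary = summarize_probe(sample, max_size=1024)
```

Running the same summary on the before and after logs gives the first two rows of the table directly, which keeps the comparison mechanical rather than eyeballed.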
248+
249+
### 7.C Two lessons that generalize (the "template" payload)
250+
251+
- **L1 — runtime evidence overrides plausible hypotheses (and code comments).**
252+
The comment-derived "≤ window restoration coverage" theory was disproved by run 1
253+
(128 tok coherent) and run 3's native control (332 evicted-yet-coherent tokens).
254+
Eviction past `max_size` is *normal* (native sliding-window behavior), not a
255+
degeneration cause. **Always verify the assumption against a run before building
256+
on it.**
257+
- **L2 — a gate built on a wrong hypothesis is a false-positive factory.** A
258+
`RESTORATION_COVERAGE` quality gate had shipped that fired on `tokens > window`.
259+
Once L1 disproved the theory, that gate was shown to flag **every** coherent
260+
answer > 64 tokens. It was removed; the quality gate now keys only on the
261+
**empirical** signal (did the text actually collapse? `_has_runaway_substring`
262+
catches the newline-free `由于…` case, and is conservative enough to *not* trip on
263+
legitimate templated text like `矿工 A/B/C` enumerations). **Gate on observed
264+
outcomes, not on theorized proxies.**
265+
266+
### 7.D Pointers
267+
- Fix + control flag: `inference_engine/backends/mlx/fused_specdecode.py`
268+
(`_sliding_ring_would_wrap`), `scripts/research/k3_integrated_niah_eval_mac.py`
269+
(`--chat-native-ref`).
270+
- Corrected gate: `inference_engine/bench/k3_report_gate.py`
271+
(`assert_quality`, `_has_runaway_substring`).
272+
- Full narrative + the disproved-hypothesis timeline:
273+
`docs/kakeya-autonomous-iteration-and-self-correction.md` (§"long-decode
274+
degeneration"). PR #146.
275+
276+
---
277+
278+
## 8. Validation & honesty standards (READ THIS)
193279
194280
The single most damaging error pattern in this project is **overclaiming a
195281
validation**. Follow these rules rigidly.
196282
197-
### 7.1 What counts as validating "the engine" vs "the plumbing"
283+
### 8.1 What counts as validating "the engine" vs "the plumbing"
198284
199285
- **Engine/algorithm validation** = the actual claim (recall, memory, throughput)
200286
measured **on the release model, through the release code path, exercising the
@@ -203,9 +289,9 @@ validation**. Follow these rules rigidly.
203289
generates" — proves the code runs, proves **nothing** about the algorithm.
204290
- **Label every artifact as one or the other.** Never let a smoke test masquerade
205291
as engine validation. (Case study: a Qwen3-4B run of `KakeyaVLLM` was wrongly
206-
presented as "end-to-end validation". It was plumbing-only — see §7.3.)
292+
presented as "end-to-end validation". It was plumbing-only — see §8.3.)
207293
208-
### 7.2 The Gemma-4 "S5 free lunch" — and why it does NOT generalize
294+
### 8.2 The Gemma-4 "S5 free lunch" — and why it does NOT generalize
209295
210296
- On **gemma-4-26B-A4B**, recall is **1.0 at `sliding_window=68` with NO
211297
restoration**, because **5 of 30 layers are native full-attention and carry
@@ -218,7 +304,7 @@ validation**. Follow these rules rigidly.
218304
recall** — so restoration is the *only* way to bound memory at full recall, and
219305
vLLM (no restoration) must keep full KV.
220306
221-
### 7.3 HARD RULE: never validate Kakeya Attention on a model without trained f_θ/proposer
307+
### 8.3 HARD RULE: never validate Kakeya Attention on a model without trained f_θ/proposer
222308
223309
A bounded window **without** trained restoration is **naive truncation, not Kakeya
224310
Attention.** On a full-attention model with no trained f_θ/proposer:
@@ -228,9 +314,9 @@ Attention.** On a full-attention model with no trained f_θ/proposer:
228314
229315
So you **cannot** demonstrate the engine on such a model. The v0.6 work is exactly
230316
"train f_θ/proposer for a full-attention model **then** validate". Until then, the
231-
only defensible engine evidence is gemma-4 (§7.2).
317+
only defensible engine evidence is gemma-4 (§8.2).
232318
233-
### 7.4 Decode-speed honesty
319+
### 8.4 Decode-speed honesty
234320
235321
- The **eager `KakeyaEngine`** wins memory/concurrency but is slow at decode
236322
(~25–31 tok/s aggregate; the eager 26B-MoE forward dominates). Report decode-only
@@ -241,7 +327,7 @@ only defensible engine evidence is gemma-4 (§7.2).
241327
inherits vLLM's fused-MoE + CUDA graphs + scheduler. Don't claim product decode
242328
speed from the eager engine.
243329
244-
### 7.5 Checklist before writing "validated" anywhere
330+
### 8.5 Checklist before writing "validated" anywhere
245331
246332
1. Did the **release code path** run (not a side script that approximates it)?
247333
2. Was the claim's **mechanism actually exercised** (restoration ran? eviction
@@ -257,7 +343,7 @@ If any answer is "no", write the weaker, true claim.
257343
258344
---
259345
260-
## 8. Pointers
346+
## 9. Pointers
261347
262348
- North star + algorithm + milestones: `docs/adr/0015-kakeya-attention-and-engine-substrate.md`
263349
- Engine architecture: `docs/design/kakeya-inference-engine-architecture.md`
