docs/kakeya-autonomous-iteration-and-self-correction.md
Lines changed: 63 additions & 19 deletions
@@ -34,7 +34,7 @@ How it slipped through — the **silent-fallback anti-pattern**, in its observed
34
34
| B | f_θ bypassed under S5 ("free lunch" smoke opt) | "restoration engine" |`build_restoration` returns `{}`; no f_θ forward |
35
35
| C | a proxy/plumbing run | "engine validated" | wrong model (Qwen3-4B), no trained f_θ/proposer, prompt inside window |
36
36
| D | a simpler component shipped | "the engine" | verifier-only AR chat presented as the product |
37
-
| E | long-decode degeneration | "the engine works (smoke passed)" |restoration covers only ≤ window decode tokens; a real long answer (780 tok ≫ 64) degenerated to garbage + throughput collapse (0.31 tok/s) — masked because every smoke answer was ≤ window |
37
+
| E | long-decode degeneration | "the engine works (smoke passed)" |a long answer (>1024 tok) degenerated to a `由于由于…` loop — masked because every smoke answer was short. **Root cause (confirmed by debug loop, not the initial guess):** the fused spec-decode rollback's `trim_prompt_cache` silently fails once the native `RotatingKVCache` ring wraps at `max_size`≈1024, desyncing `cache.offset` from `past_len`. Fixed via single-token commits past the wrap. The *initial* hypothesis ("restoration only covers ≤ window=64") was disproved by runtime evidence — see §4b.|
38
38
39
39
Common root cause: an agent (or optimization) chose the **easy/robust path** and
40
40
**relabeled it as the hard one**, and no automated check asserted the intended
@@ -146,12 +146,26 @@ the output was garbage and throughput collapsed. So the gate **also** asserts th
| restoration covers the generation |`window`, per-turn `tokens`| restored run: `tokens <= window` (beyond it, evicted-during-decode positions are unrestored) |`RESTORATION_COVERAGE`|
150
-
| output is not degenerate | per-turn `text`| no runaway repeat (≥8 identical short lines) |`OUTPUT_DEGENERATE`|
151
-
152
-
Verified: a PoW-style report (`tokens=780 > window=64`, repeated `* * *`) now
153
-
**fails** the walker (CI + on-device) with both codes. **Liveness proves the
154
-
components ran; quality proves they produced a valid result — the gate needs both.**
149
+
| output is not degenerate | per-turn `text`| no runaway repeat — ≥8 identical short lines **or** a 1–8 char unit tiled ≥8× at the tail (catches the newline-free `由于由于…` collapse) |`OUTPUT_DEGENERATE`|
150
+
151
+
Verified: a PoW-style report (repeated `* * *` lines, or `"由于"×120` with no
152
+
line breaks) **fails** the walker (CI + on-device); the real coherent long answer
153
+
and templated `矿工 A/B/C` enumerations **pass**. **Liveness proves the components
154
+
ran; quality proves they produced a valid result — the gate needs both.**
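
A minimal Python sketch of the two `OUTPUT_DEGENERATE` conditions in the table above, for illustration only. The function name `is_degenerate`, the `max_line_len` cutoff for what counts as a "short" line, and the choice to require the identical lines to be consecutive are assumptions, not the walker's actual implementation.

```python
def is_degenerate(text: str, min_repeats: int = 8, max_unit: int = 8,
                  max_line_len: int = 16) -> bool:
    """Return True if `text` shows a runaway repeat (hypothetical helper)."""
    # Rule 1: at least `min_repeats` consecutive identical short lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        if cur == prev and len(cur) <= max_line_len:
            run += 1
            if run >= min_repeats:
                return True
        else:
            run = 1

    # Rule 2: a 1..max_unit character unit tiled `min_repeats` times at the
    # tail. This catches newline-free collapses that rule 1 cannot see.
    tail = text.rstrip()
    for unit_len in range(1, max_unit + 1):
        span = unit_len * min_repeats
        if len(tail) < span:
            break  # longer units need even more text, so stop here
        unit = tail[-unit_len:]
        if tail[-span:] == unit * min_repeats:
            return True
    return False
```

Under these assumptions, `"由于" * 120` and ten consecutive `* * *` lines are flagged, while a three-line `矿工 A/B/C` enumeration is not, because its lines differ and its tail is not a tiled unit.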
155
+
156
+
**Correction (2026-06-17) — `RESTORATION_COVERAGE` removed.** An earlier gate
157
+
fired when a restored run generated more tokens than the S5 `window` (=64), on the
158
+
theory that decode-time evicted positions are "unrestored" and the output beyond
159
+
the window must degenerate. **Mac runtime evidence disproved that theory** (see
160
+
the §"long-decode degeneration" root-cause below): the decode cache is the model's
161
+
native hybrid cache (sliding `RotatingKVCache` with `max_size`≈1024, not the S5
162
+
window), so nothing is evicted until ~1024 tokens; and a 1300-token run with **332
163
+
evicted-unrestored positions stayed fully coherent** once the *actual* bug was
164
+
fixed. "tokens > window" and even "evicted > 0" are not degeneration signals, so
165
+
the rule was a pure false-positive (it would have failed every coherent answer
166
+
longer than 64 tokens). The only trustworthy quality gate is the **empirical** one:
167
+
`OUTPUT_DEGENERATE`. This is itself an instance of the North-Star discipline —
168
+
*verify against runtime, never trust a plausible code comment/hypothesis.*
155
169
156
170
---
157
171
@@ -208,18 +222,48 @@ proposer live (`blocks=2/4`, `accept_len=4.0/3.5`), f_θ live by default
208
222
(`f_theta_ran=TRUE`, 25 sliding layers), correct answers, bounded KV, natural EOS
209
223
stop. One-command launcher: `scripts/run_kakeya_mac.sh`. (PR #144 + this PR.)
210
224
211
-
**Known limitation (anti-pattern E, found 2026-06-17):** the Mac fused engine's
212
-
restoration is **prefill-amortized for the prompt only** — it covers ≤ `window`
|**v0.5-cuda** (#141) | release `KakeyaVLLM` + consolidated reports | done (gemma-4 instantiation). Product concurrency claim = **`KakeyaVLLM` N→70 @16k** on vLLM; the **N=75 @62k is the *eager* `KakeyaEngine` substrate**, not the v0.5 product path — do not conflate. See §7 for exact validation scope |
146
+
|**v0.5-cuda** (#141) | release `KakeyaVLLM` + consolidated reports | done (gemma-4 instantiation). Product concurrency claim = **`KakeyaVLLM` N→70 @16k** on vLLM; the **N=75 @62k is the *eager* `KakeyaEngine` substrate**, not the v0.5 product path — do not conflate. See §8 for exact validation scope |
146
147
|**v0.6** (= ADR0015KIE-v1.2) |**restoration backend on full-attention models** (Qwen/Llama): train f_θ/proposer + inject restoration at vLLM prefill + graph-capturable quantized-exact kernel |**planned — the real memory differentiator (~6×)**|
147
148
148
149
>**N=16 vs N=24 (KIE-v1.1 precaution).** The evicting StaticCache alone at the
@@ -168,6 +169,7 @@ Port lessons: `docs/mlx-port-lessons.md`.
168
169
|`torch.compile` attention 6.6× but **0% e2e decode gain**| decode dominated by **eager 26B-MoE full-model forward**, not attention | need fused-MoE + full-forward graph capture → that's vLLM's job → **KIE-v2**|
169
170
| fused-MoE port blocked | HF `kernels` incompatible w/ transformers 5.12; vLLM `fused_moe` cross-venv surgery; from-scratch = multi-week | **run Kakeya ON vLLM** instead of rebuilding it (KIE-v2) |
170
171
| `KakeyaVLLM` crash on text-only model | unconditional `text_config` nesting (gemma multimodal) breaks Qwen/Llama (`num_attention_heads` missing) | **auto-detect** `text_config` via `AutoConfig`: nested for gemma-4, flat for Qwen/Llama |
172
+
| MLX fused engine **long-decode degeneration** (`由于由于…` runaway past ~1024 tok, throughput collapse) | once the native sliding `RotatingKVCache` ring **wraps** (`offset ≥ max_size`≈1024), `mlx_lm.trim_prompt_cache` refuses the spec-decode rejected-draft rollback (all-or-nothing; `is_trimmable` needs `offset < max_size`) → un-trimmed rejects leave `cache.offset` **+8 ahead of `past_len`** → RoPE/mask desync → logit corruption | detect the impending wrap (`_sliding_ring_would_wrap`) and commit **single-token blocks** past it (`L=1`): the bonus is always accepted, so there's no rejected tail to trim and `offset` stays `== past_len`. **Full worked template in §7.** |
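
The wrap guard in the last row can be illustrated with a toy model. This is a sketch under stated assumptions, not the engine's code: `ToyRingCache`, `would_wrap`, and `decode` are hypothetical stand-ins (the real helper is named `_sliding_ring_would_wrap`, and its signature is not shown here). The toy cache tracks only an `offset` counter and refuses any rollback once `offset >= max_size`, mirroring the all-or-nothing behaviour described in the row. It does not reproduce the measured +8 lead; it only shows that un-trimmed rejects make `offset` drift away from `past_len`, and that single-token commits avoid the drift.

```python
class ToyRingCache:
    """Hypothetical stand-in for a sliding ring cache; tracks only `offset`."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.offset = 0

    def append(self, n: int) -> None:
        # The verifier forward pass writes n positions into the cache.
        self.offset += n

    def trim(self, n: int) -> bool:
        # All-or-nothing: once the ring has wrapped, the rollback is refused.
        if n > 0 and self.offset >= self.max_size:
            return False
        self.offset -= n
        return True


def would_wrap(cache: ToyRingCache, block_len: int) -> bool:
    """True if writing a full block would reach or pass `max_size`."""
    return cache.offset + block_len >= cache.max_size


def decode(cache: ToyRingCache, blocks: int, block_len: int,
           accepted: int, guard: bool) -> int:
    """Run `blocks` spec-decode steps; return final `offset - past_len`."""
    past_len = cache.offset
    for _ in range(blocks):
        # With the guard on, fall back to single-token blocks near the wrap:
        # the one token is always kept, so there is nothing to roll back.
        length = 1 if guard and would_wrap(cache, block_len) else block_len
        kept = min(accepted, length)
        cache.append(length)
        cache.trim(length - kept)  # may be refused after the wrap
        past_len += kept
    return cache.offset - past_len
```

In this toy, `decode(ToyRingCache(64), blocks=40, block_len=8, accepted=4, guard=False)` returns a positive drift, while the same call with `guard=True` returns 0.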
171
173
172
174
---
173
175
@@ -189,12 +191,96 @@ Port lessons: `docs/mlx-port-lessons.md`.
189
191
190
192
---
191
193
192
-
## 7. Validation & honesty standards (READ THIS)
194
+
## 7. Worked case study: debugging the long-decode degeneration (a TEMPLATE)
195
+
196
+
This is the **model example** of how to debug a non-obvious runtime bug in this
197
+
project. Reuse the *shape* of this process for any "it works in smoke tests but
198
+
breaks in the real workload" bug. The actual fix is the `RotatingKVCache`-wrap
199
+
row in §5; what follows is the **method**, written so it transfers.
200
+
201
+
### 7.A The symptom
202
+
Mac (MLX) fused spec-decode engine produced **garbage on long answers**: a long
203
+
reply (e.g. "请详细解释POW的工作原理") started coherent, then collapsed into a runaway
204
+
repeat (`由于由于由于…`) with throughput falling off. Short answers were fine, so it
205
+
had slipped through every smoke test.
206
+
207
+
### 7.B The process (the reusable template)
208
+
209
+
>**Golden rule (this project's §6 principle made concrete): never fix from code
210
+
> alone. Reproduce → instrument → measure → let runtime evidence pick the
211
+
> hypothesis. Be ready to have your first hypothesis killed by the data.**
212
+
213
+
1. **Write down the initial hypothesis — then try to disprove it, not confirm it.**
214
+
Initial guess (from a code comment): "restoration only covers ≤ `window`=64
215
+
decode tokens, so output past 64 is unrestored → degenerate." Plausible, and
216
+
**wrong**. Treat plausible hypotheses as suspects, not conclusions.
217
+
2. **Reproduce at increasing scale, on the real device, with one fixed prompt.**
218
+
Drive the Mac M4 via the bridge (`mlx-kakeya-degen-probe` preset). Sweep the
219
+
one variable that matters (generation length):
220
+
221
+
| run | length | result | inference |
222
+
|---|---|---|---|
223
+
|1|128 tok | coherent | kills "fails at window=64"; also reveals the decode cache is the model's **native `RotatingKVCache` (`max_size`≈1024)**, *not* the S5 window |
224
+
|2|800 tok | coherent | failure is past 800 → keep going |
225
+
|3|1300 tok |**degenerates** at gen≈1064| reproduced; onset is right after the ring **wraps** at gen≈1017|
226
+
227
+
3. **Add a discriminating control (the single highest-value step).** In run 3,
228
+
also decode the **same prompt with a plain native-greedy loop** (`--chat-native-ref`)
229
+
as an A/B. Native stayed **fully coherent** past the wrap (clean stop @1247) →
230
+
*the model handles >1024 fine; the fused engine corrupts it.* A control that
231
+
isolates "your code" from "the model/library" is worth more than ten more logs.
232
+
4. **Instrument the exact mechanism the data now points at.** NDJSON per-block
233
+
logs of cache `offset` vs committed `past_len`, and of every `trim_prompt_cache`
234
+
call. Smoking gun: after the wrap, `offset` ran **+8 ahead of `past_len`** on
235
+
every block, with **15 "trim refused" events**, all of them post-wrap.
236
+
5. **State the root cause mechanistically** (see §5 row): wrapped ring →