Commit d2fb852

unamedkr and claude committed
docs: v0.28.0 — MoE softmax temperature cliff-break release notes
Bundles b212194 (TQ_MOE_ROUTE_TEMP) + a4d0002 (bench report) into v0.28.0.

Full release entry includes: 5-line fix diff, 26-round investigation arc (R16-R26), temperature sweep table with outcomes, causal story (peaky MoE routing × DeltaNet positive feedback), what it doesn't fix, safety measurements, recommended user recipe, and scope.

Also updates:
- README.md v3.21 blurb with the key numbers
- README.ko.md v3.21 blurb (Korean mirror)
- bindings/python/pyproject.toml 0.27.0 → 0.28.0

Opt-in. Default behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a4d0002 commit d2fb852

4 files changed

Lines changed: 109 additions & 1 deletion


README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model, rather than saying **"I don't know"**,

> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We re-measured the v1 result on a larger grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models show a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). The 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. The honest reinterpretation: Beyond RAG only works for documents that fit inside *effective* working memory, whose size is 1/100 to 1/1000 of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK — one line of MoE softmax temperature (2026-04-22)**: A single new env flag, `TQ_MOE_ROUTE_TEMP=2.0`, eliminates the "It could do math! It could do math!" 117-token repetition loop in Qwen3.6-35B-A3B that had survived 40+ cumulative debug rounds. Temperature sweep on the drift-trigger prompt (`-n 200 T=0`): default T=1.0 → 117-tok loop; T=1.5 → **earlier** 87-token cliff; T=1.8 → still trapped; **T=2.0 → full 200-token Alex+sad-tree story, no rep-loop**; T=2.5 → full 200 tokens; T=3.0 → over-flat, 114-tok loop returns. Sweet spot: T=2.0 to 2.5. Cause: **peaky MoE routing forms a positive feedback loop with DeltaNet's persistent state**; spread the softmax and the loop can't form. The 26-round arc: R16-R19 per-layer DeltaNet state resets were null; R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt → zero drift → MoE confirmed as the sole cause; R25's MoE router probe found a 0.80+ collapse at L4; R26's temperature ablation broke the cliff. Regression 23/23 PASS (including T=2.0), "Paris" factual probe normal. The 150+ token tail still degrades at the character level — the hard cliff is gone, but essay-length generation needs further work. Opt-in, default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0.
> **v3.20 ★★ BPE encode/decode UTF-8 fix — international text silent quality disaster resolved (2026-04-21)**: Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and output containing non-ASCII characters (accents, CJK, Cyrillic, byte-fallback emoji) on **all Llama-3 / Qwen3 family models**. **Encode**: for GPT-2 direct-byte codepoints it emitted the raw byte ≥ 0x80 as-is (a standalone byte is invalid UTF-8), so `str_lookup` failed to match → characters fell back through byte-fallback to wrong low-id tokens. **Decode**: it emitted the UTF-8 encoding (`c3 83`) of codepoints U+0080-U+00FF instead of the raw byte (`0xC3`) → output got double-encoded ("café" → "cafÃ©"). Token-level HF reference parity verified: `café`/`naïve`/`日本語`/`привет` all match HF `AutoTokenizer` byte-for-byte. Discovered via the A/B output diff in the new [`tools/refparity/`](tools/refparity/) framework. Also synced to the `quant.h` single-header. Added `scripts/test_tokenizer.sh` regression fixtures. Scope: GPT-2-style byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); the Gemma/Phi-3 SentencePiece path is unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.

> **Practical Qwen3.6-35B recipe on a 16 GB Mac**: the current best configuration for long-form coherence is `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` + `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): the default config hits a repetition loop at 117 tokens, while Q5_K_M + rep-penalty generates the full 200-token budget with only gentle degradation near the end. 35B DeltaNet drift remains an open architectural investigation, but this is the best config users can run today.

README.md

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK via MoE softmax temperature (2026-04-22):** One new env flag — `TQ_MOE_ROUTE_TEMP=2.0` — breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds. Temperature sweep on the drift-trigger prompt (`-n 200 T=0`): default T=1.0 → 117-tok loop; T=1.5 → **earlier** 87-tok cliff (peakier in some heads); T=1.8 → still trapped; **T=2.0 → 200-tok coherent Alex+sad-tree story, no rep-loop**; T=2.5 → 200-tok coherent; T=3.0 → over-flat, 114-tok loop returns. Sweet spot: T=2.0 to 2.5. Causal story: **peaky MoE routing locks into a feedback loop with DeltaNet's persistent state**; spread the softmax and the feedback can't form. The 26-round investigation arc: R16-R19 tried per-layer DeltaNet state resets (null result), R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt and saw **zero drift** (isolating the cause to MoE), R25's MoE router probe found L4 at 0.80+ routing collapse, R26's temperature ablation closed it. 23/23 regression PASS at T=2.0, `"Paris"` factual probe correct. Tail quality beyond ~150 tokens still degrades — the hard cliff is broken, but full essay-length generation still has quantization noise margins to close. Opt-in; default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **v3.20 ★★ BPE encode/decode UTF-8 fix — international text silent quality disaster resolved (2026-04-21):** Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and every output containing non-ASCII chars (accents, CJK, Cyrillic, byte-fallback emoji) on all Llama-3 / Qwen3 family models. **Encode** emitted raw bytes ≥ 0x80 (invalid UTF-8) for GPT-2 direct-byte codepoints so `str_lookup` never matched; characters silently fell back to wrong low-id tokens. **Decode** emitted the UTF-8 encoding of codepoints U+0080-U+00FF instead of the raw byte they represent; output got double-encoded ("café" → "cafÃ©"). Token-level HF parity verified: `café`/`naïve`/`日本語`/`привет` all now tokenize byte-for-byte identical to HF `AutoTokenizer` on Qwen3. Discovered via the new [`tools/refparity/`](tools/refparity/) framework's A/B output diff. Also synced to `quant.h` single-header. Added `scripts/test_tokenizer.sh` fixtures so future refactors fail loudly. Scope: GPT-2-style byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); Gemma/Phi-3 SentencePiece path unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.

> **Practical Qwen3.6-35B recipe on 16 GB Mac**: for best long-form coherence, pair `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` with `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): default config hits a repeat loop at 117 tokens; Q5_K_M + rep-penalty runs the full 200-token budget with graceful degrade only near the end. 35B DeltaNet drift remains an open architectural investigation; this is the best user-facing config today.

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "quantcpp"
-version = "0.27.0"
+version = "0.28.0"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }

docs/RELEASE_NOTES.md

Lines changed: 104 additions & 0 deletions
@@ -6,6 +6,110 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
## [v0.28.0] — 2026-04-22 ★★★ (MoE softmax temperature breaks the Qwen3.6-35B 117-tok cliff)

### Headline
**One new env flag — `TQ_MOE_ROUTE_TEMP=2.0` — breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds.** 35B long-gen goes from 117 to 200+ coherent tokens on the standard drift-trigger prompt. Opt-in today; opt-out tomorrow.
### The fix

`src/engine/tq_moe.c::tq_moe_route` — a 5-line diff on the top-K softmax:

```c
float inv_temp = 1.0f / route_temp; /* default 1.0 = identity */
for (int k = 0; k < num_active; k++) {
    float e = expf((logits[out_expert_ids[k]] - max_val) * inv_temp);
    ...
}
```

### Why it works — 26-round investigation summary
Rounds 1-19 on this project chased the drift in the DeltaNet recurrent state, assuming that was the cause. R19's per-layer reset bisection proved that hypothesis wrong: no single DeltaNet layer carries the drift signal.

**R24** ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the exact drift-trigger prompt and got **200+ coherent tokens**, confirming the drift is MoE-specific, not DeltaNet alone.

**R25** added `TQ_MOE_PROBE` — a per-layer top-K router histogram — and found a persistent near-collapse at L4 (one expert getting 0.80+ of the softmax mass at tokens 100-115).

**R26** added `TQ_MOE_ROUTE_TEMP` — a softmax temperature knob. Sweep over T ∈ {1.0, 1.5, 1.8, 2.0, 2.5, 3.0}:

| TEMP | outcome |
|---:|:---|
| 1.0 (default) | 117-tok loop "It could do math!" |
| 1.5 | **87**-tok loop (earlier cliff, peakier in some heads) |
| 1.8 | 113-tok loop |
| **2.0** | **200 tokens, NO rep-loop**, coherent Alex+sad-tree story |
| **2.5** | **200 tokens, NO rep-loop**, Alex+magic-leaves story |
| 3.0 | 114-tok loop (over-flat, wrong expert mix) |

Sweet spot: T=2.0 to 2.5. The cliff is **caused by peaky MoE routing locking into a feedback loop** with DeltaNet's persistent state. Spread the routing distribution and the feedback can't form.

### What v0.28.0 does NOT fix

- Tail quality at 200+ tokens still degrades to character-level noise (alphabet-walking) on longer `-n 500` runs — probably quantization plus DeltaNet state accumulation still contributing at the margin.
- A "Sorry!" mini-loop appears around token 170 at T=2.0 — human-visible, but it doesn't trigger the engine's rep-loop detector.

So: this release **breaks the hard 117-tok cliff** and recovers ~50 additional coherent tokens; full essay-length generation still has a gap to close.

### Safety

- `"The capital of France is"` → `"Paris."` at T=2.0 ✓
- `bash scripts/test_models.sh` → **23/23 PASS** with T=2.0 (15 coherence + 8 tokenizer, no diff)

### Default behavior

**Unchanged.** `TQ_MOE_ROUTE_TEMP=1.0` is the default, so existing users get identical behavior. Adding the flag is opt-in. A later release may flip the `qwen35moe` arch to default T=2.0 after broader validation across prompts and `--chat` mode.

### Recommended Qwen3.6-35B recipe

```bash
TQ_MOE_ROUTE_TEMP=2.0 \
./build/quant models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
  -p "<your prompt>" -n 200 -T 0 --rep-penalty 1.3
```

### Files changed

- `src/engine/tq_moe.c` — 5-line softmax temperature insertion
- `docs/env_vars.md` — new `TQ_MOE_ROUTE_TEMP` row with measurements
- `docs/supported_models_tier.md` — 35B recipe updated
- `bench/results/2026-04-22_moe_temp_cliff_break.md` — full proof + ablation data + causal story

### Scope

- **Affected**: Qwen3.6-35B-A3B (MoE + DeltaNet hybrid) — all quants.
- **Default-mode unaffected**: every other model. All 40+ MoE layers get the same `route_temp`, but for non-pathological routing the difference between T=1.0 and T=2.0 is within normal quality noise.

Additional details: `bench/results/2026-04-22_moe_temp_cliff_break.md`.

---
## [v0.27.0] — 2026-04-21 ★★ (BPE encode+decode UTF-8 fix — international text silent quality disaster RESOLVED)

### Headline
