
Commit f2f0d8a

unamedkr and claude committed
docs: v0.28.0 auto-default flip — TQ_MOE_ROUTE_TEMP=2.0 is now default on qwen35moe
Updates the v0.28.0 release narrative from "opt-in" to "auto-default on qwen35moe arch" to match f0e51ab.

Files:
- docs/RELEASE_NOTES.md — "Default behavior" section: auto-flip on qwen35moe, precedent pattern (same as TQ_NO_AUTO_SERIAL), opt-outs documented.
- README.md v3.21 — headline now says "default-on for qwen35moe".
- README.ko.md v3.21 — Korean mirror.
- docs/env_vars.md — TQ_MOE_ROUTE_TEMP row notes the auto-flip; adds the TQ_NO_MOE_TEMP_AUTO opt-out.

Users running Qwen3.6-35B get the 117-tok cliff fix out of the box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f0e51ab commit f2f0d8a

4 files changed

Lines changed: 22 additions & 7 deletions


README.ko.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ When Chunk-RAG retrieves the wrong section, the model, rather than saying **"I don't know"**,

> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We extended the v1 measurements to a larger grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models show a sharp cliff below **1%** of the nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, both as a **step function**). The 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. The honest reinterpretation: Beyond RAG only works for documents that fit within *effective* working memory, which is one hundredth to one thousandth of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).

-> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK — one line of MoE softmax temperature (2026-04-22)**: A single new env flag, `TQ_MOE_ROUTE_TEMP=2.0`, removes the "It could do math! It could do math!" 117-token repetition loop in Qwen3.6-35B-A3B that had gone unresolved across 40+ rounds. Temperature sweep on the drift-trigger prompt `-n 200 T=0`: default T=1.0 → 117-token loop; T=1.5 → **earlier** 87-token cliff; T=1.8 → still trapped; **T=2.0 → full 200-token Alex+sad-tree story, no rep-loop**; T=2.5 → full 200 tokens; T=3.0 → over-dispersed, 114 recurs. Sweet spot T=2.0 ~ 2.5. Cause: **peaky MoE routing forms a positive feedback loop with DeltaNet's persistent state**. Spread the softmax and the loop cannot form. The 26-round arc: R16-R19 per-layer DeltaNet state resets were null; R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt → no drift at all → MoE confirmed as the sole cause; R25's MoE router probe found an L4 0.80+ collapse; R26's temperature ablation blocked the cliff. Regression 23/23 PASS (including T=2.0), "Paris" factual probe correct. The 150+ token tail still degrades at the character level — the hard cliff is gone; essay-length needs further work. Opt-in, default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0.
+> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK — MoE softmax temperature, now applied by default on qwen35moe (2026-04-22)**: One knob — `TQ_MOE_ROUTE_TEMP=2.0` — removes the "It could do math! It could do math!" 117-token repetition loop in Qwen3.6-35B-A3B that had gone unresolved across 40+ rounds. **The qwen35moe arch is auto-detected and the default applied automatically** (same pattern as the existing auto-serial mode), so users get the fix with no env var at all. No effect on Llama/Phi/Gemma/Qwen3 non-hybrid. Temperature sweep on the drift-trigger prompt `-n 200 T=0`: default T=1.0 → 117-token loop; T=1.5 → **earlier** 87-token cliff; T=1.8 → still trapped; **T=2.0 → full 200-token Alex+sad-tree story, no rep-loop**; T=2.5 → full 200 tokens; T=3.0 → over-dispersed, 114 recurs. Sweet spot T=2.0 ~ 2.5. Cause: **peaky MoE routing forms a positive feedback loop with DeltaNet's persistent state**. Spread the softmax and the loop cannot form. The 26-round arc: R16-R19 per-layer DeltaNet state resets were null; R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt → no drift at all → MoE confirmed as the sole cause; R25's MoE router probe found an L4 0.80+ collapse; R26's temperature ablation blocked the cliff. Regression 23/23 PASS (including T=2.0), "Paris" factual probe correct. The 150+ token tail still degrades at the character level — the hard cliff is gone; essay-length needs further work. Default-on for qwen35moe (opt out with `TQ_NO_MOE_TEMP_AUTO=1`); other archs unchanged. Explicit form: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0.

> **v3.20 ★★ BPE encode/decode UTF-8 fix — international text silent quality disaster resolved (2026-04-21)**: Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and output containing non-ASCII characters (accents, CJK, Cyrillic, byte-fallback emoji) on **all Llama-3 / Qwen3 family models**. **Encode**: for GPT-2 direct-byte codepoints it emitted raw bytes ≥ 0x80 as-is (a standalone byte is invalid UTF-8), so `str_lookup` failed to match → characters fell back through byte-fallback to wrong low-id tokens. **Decode**: it emitted the UTF-8 encoding (`c3 83`) of codepoints U+0080-U+00FF instead of the actual byte (`0xC3`) → double-encoded output ("café" → "cafÃ©"). Token-level HF reference parity verified: `café`/`naïve`/`日本語`/`привет` all match HF `AutoTokenizer` byte-for-byte. Found via the A/B output diff of the new [`tools/refparity/`](tools/refparity/) framework. Also synced to the `quant.h` single-header. Added `scripts/test_tokenizer.sh` regression fixtures. Scope: GPT-2 byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); Gemma/Phi-3 SentencePiece path unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.

README.md

Lines changed: 1 addition & 1 deletion
@@ -167,7 +167,7 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
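
A minimal sketch of the batched-prefill idea, under assumptions: the row-major layout and the `batch_matmul` helper are illustrative, not the actual `tq_forward_batch` internals.

```c
#include <Accelerate/Accelerate.h>  /* cblas_sgemm; use <cblas.h> off-macOS */

/* Y = X * W^T for all T prompt tokens in one GEMM, instead of T
 * separate matrix-vector calls. Assumed layout (row-major): X is
 * T x d_in activations, W is d_out x d_in weights, Y is T x d_out.
 * Amortizing the weight reads across the whole prompt is where the
 * prefill speedup comes from. */
static void batch_matmul(const float *X, const float *W, float *Y,
                         int T, int d_in, int d_out) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                T, d_out, d_in,
                1.0f, X, d_in,
                W, d_in,
                0.0f, Y, d_out);
}
```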

-> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK via MoE softmax temperature (2026-04-22):** One new env flag — `TQ_MOE_ROUTE_TEMP=2.0` — breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds. Temperature sweep on drift-trigger prompt `-n 200 T=0`: default T=1.0 → 117-tok loop; T=1.5 → **earlier** 87-tok cliff (peakier in some heads); T=1.8 → still trapped; **T=2.0 → 200-tok coherent Alex+sad-tree story, no rep-loop**; T=2.5 → 200-tok coherent; T=3.0 → over-flat, 114-tok loop returns. Sweet spot T=2.0 to 2.5. Causal story: **peaky MoE routing locks into a feedback loop with DeltaNet's persistent state**. Spread the softmax and the feedback can't form. The 26-round investigation arc: R16-R19 tried per-layer DeltaNet state reset (null), R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt and saw **zero drift** (isolated to MoE), R25 MoE router probe found L4 at 0.80+ routing collapse, R26 temperature ablation closed it. 23/23 regression PASS at T=2.0, `"Paris"` factual probe correct, Tail quality beyond ~150 tokens still degrades — hard cliff broken but full essay-length generation still has quantization noise margins. Opt-in; default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
+> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK via MoE softmax temperature — now default-on for qwen35moe (2026-04-22):** One knob — `TQ_MOE_ROUTE_TEMP=2.0` — breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds. **Auto-flipped as the default for `qwen35moe` arch** (same pattern as the existing auto-serial mode), so users get the fix out of the box. No effect on Llama/Phi/Gemma/Qwen3 non-hybrid. Temperature sweep on drift-trigger prompt `-n 200 T=0`: default T=1.0 → 117-tok loop; T=1.5 → **earlier** 87-tok cliff (peakier in some heads); T=1.8 → still trapped; **T=2.0 → 200-tok coherent Alex+sad-tree story, no rep-loop**; T=2.5 → 200-tok coherent; T=3.0 → over-flat, 114-tok loop returns. Sweet spot T=2.0 to 2.5. Causal story: **peaky MoE routing locks into a feedback loop with DeltaNet's persistent state**. Spread the softmax and the feedback can't form. The 26-round investigation arc: R16-R19 tried per-layer DeltaNet state reset (null), R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt and saw **zero drift** (isolated to MoE), R25 MoE router probe found L4 at 0.80+ routing collapse, R26 temperature ablation closed it. 23/23 regression PASS at T=2.0, `"Paris"` factual probe correct. Tail quality beyond ~150 tokens still degrades — hard cliff broken but full essay-length generation still has quantization noise margins. Default-on for `qwen35moe` (opt out with `TQ_NO_MOE_TEMP_AUTO=1`); other archs unchanged. Explicit form: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
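
Concretely, the knob is a plain temperature on the router softmax. A minimal sketch, assuming a flat per-token array of router logits; `route_softmax` is an illustrative name, not the `quant.c` symbol. Because dividing by T preserves the ordering of the logits, the top-K expert set never changes; only the mixture weights flatten.

```c
#include <math.h>

/* Temperature-scaled softmax over n expert-router logits. T=1.0 is the
 * standard softmax; T=2.0 flattens the distribution, which is what
 * breaks the MoE <-> DeltaNet feedback loop described above. */
static void route_softmax(const float *logits, float *probs, int n, float T) {
    float m = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > m) m = logits[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        probs[i] = expf((logits[i] - m) / T);  /* max-shift for stability */
        sum += probs[i];
    }
    for (int i = 0; i < n; i++) probs[i] /= sum;
}
```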
> **v3.20 ★★ BPE encode/decode UTF-8 fix — international text silent quality disaster resolved (2026-04-21):** Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and every output containing non-ASCII chars (accents, CJK, Cyrillic, byte-fallback emoji) on all Llama-3 / Qwen3 family models. **Encode** emitted raw bytes ≥ 0x80 (invalid UTF-8) for GPT-2 direct-byte codepoints so `str_lookup` never matched; characters silently fell back to wrong low-id tokens. **Decode** emitted the UTF-8 encoding of codepoints U+0080-U+00FF instead of the raw byte they represent; output got double-encoded ("café" → "cafÃ©"). Token-level HF parity verified: `café`/`naïve`/`日本語`/`привет` all now tokenize byte-for-byte identical to HF `AutoTokenizer` on Qwen3. Discovered via the new [`tools/refparity/`](tools/refparity/) framework's A/B output diff. Also synced to `quant.h` single-header. Added `scripts/test_tokenizer.sh` fixtures so future refactors fail loudly. Scope: GPT-2-style byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); Gemma/Phi-3 SentencePiece path unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.
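
For reference, the byte mapping at the heart of that fix, as a hedged sketch: GPT-2 byte-level BPE assigns every raw byte a printable codepoint, and decoding must invert that table back to the raw byte rather than emit the codepoint's UTF-8 encoding. The helper names are illustrative; the real code lives in `encode_byte_to_bpe_char` / `decode_bpe_token`.

```c
#include <stdint.h>

/* GPT-2 byte-level BPE byte<->codepoint table: printable bytes map to
 * themselves, the other 68 bytes shift into codepoints 256+n
 * (max codepoint 323, hence the 324-entry inverse table). */
static void build_byte_maps(int byte_to_cp[256], int cp_to_byte[324]) {
    int next = 0;
    for (int i = 0; i < 324; i++) cp_to_byte[i] = -1;  /* mark unused */
    for (int b = 0; b < 256; b++) {
        int printable = (b >= 0x21 && b <= 0x7E) ||    /* '!'..'~' */
                        (b >= 0xA1 && b <= 0xAC) ||    /* '¡'..'¬' */
                        (b >= 0xAE && b <= 0xFF);      /* '®'..'ÿ' */
        int cp = printable ? b : 256 + next++;
        byte_to_cp[b] = cp;
        cp_to_byte[cp] = b;
    }
}

/* Decode side of the fix: one mapped codepoint -> one RAW byte. The bug
 * emitted the codepoint's UTF-8 encoding instead (e.g. U+00C3 as the two
 * bytes 0xC3 0x83), double-encoding every non-ASCII character. */
static uint8_t decode_cp_to_byte(int cp, const int cp_to_byte[324]) {
    return (uint8_t)cp_to_byte[cp];
}
```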

docs/RELEASE_NOTES.md

Lines changed: 18 additions & 4 deletions
@@ -78,10 +78,24 @@ tokens. Full essay-length generation still has more to close.

### Default behavior

-**Unchanged.** `TQ_MOE_ROUTE_TEMP=1.0` is the default so existing users
-get identical behavior. Adding the flag is opt-in. A later release may
-flip `qwen35moe` arch to default T=2.0 after broader validation across
-prompts and `--chat` mode.
+**Auto-flipped for `qwen35moe` arch.** `tools/quant.c` auto-detects the
+MoE+DeltaNet hybrid at load time and sets `TQ_MOE_ROUTE_TEMP=2.0` when
+the user hasn't provided one (resolution order sketched below). No effect
+on Llama, Phi, Gemma, Qwen3 non-hybrid, or any other arch — only
+`qwen35moe` gets the new default.
+
+The validation signal that justified the flip:
+- 5/5 short-prompt A/B (Paris, fibonacci, math, ML description, Once upon
+  a time) give identical factual accuracy at T=1.0 vs T=2.0
+- Full regression 23/23 PASS with auto-default enabled
+- 117-tok cliff broken on the drift-trigger prompt
+
+Precedent: the same arch-scoped auto-mode pattern as auto-serial, which
+auto-forces `-j 1` on qwen35moe for determinism (opt-out:
+`TQ_NO_AUTO_SERIAL`).
+
+Opt-outs (any of):
+- `TQ_NO_MOE_TEMP_AUTO=1` — disable auto-default for this run
+- `TQ_MOE_ROUTE_TEMP=1.0` — explicit override to prior default
+- `TQ_MOE_ROUTE_TEMP=<other>` — explicit custom tuning
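
A minimal sketch of that resolution order; `moe_route_temp_default` and the `arch` string check are illustrative names, and the real logic lives in `tools/quant.c`.

```c
#include <stdlib.h>
#include <string.h>

/* Resolve the effective routing temperature at model load.
 * Precedence: explicit TQ_MOE_ROUTE_TEMP > TQ_NO_MOE_TEMP_AUTO opt-out
 * > qwen35moe auto-default (2.0) > global default (1.0). */
static float moe_route_temp_default(const char *arch) {
    const char *env = getenv("TQ_MOE_ROUTE_TEMP");
    if (env) return (float)atof(env);   /* explicit value always wins */
    if (getenv("TQ_NO_MOE_TEMP_AUTO"))
        return 1.0f;                    /* opt-out: prior default */
    if (strcmp(arch, "qwen35moe") == 0)
        return 2.0f;                    /* validated auto-default */
    return 1.0f;                        /* every other arch: identity */
}
```
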
### Recommended Qwen3.6-35B recipe

docs/env_vars.md

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,8 @@ here is opt-in; defaults are the tested production path.
| `TQ_MOE_BATCH_SELFTEST` | off | Route N=1 MoE through batch(N=1) kernel — proves equivalence vs per-token path |
| `TQ_PHI3_SPLIT` | 0 | Phi-3 fused QKV/FFN split to separate Q4 weights. **Off by default** — degrades chat quality per feedback/perf_commits_need_chat_test |
| `TQ_MOE_FAST_EXP` | off | Use Schraudolph fast-exp in MoE SwiGLU (vs exact expf default). ~2% per-call error; may re-introduce long-gen drift |
-| `TQ_MOE_ROUTE_TEMP` | `1.0` | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct. Recommended for long-form Qwen3.6-35B generation |
+| `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). The default for all other archs stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but the top-K set is unchanged. `"Paris"` factual probe still correct at T=2.0 |
+| `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |

## Quality / correctness
