Commit d2fb852

unamedkr and claude committed
docs: v0.28.0 — MoE softmax temperature cliff-break release notes
Bundles b212194 (TQ_MOE_ROUTE_TEMP) + a4d0002 (bench report) into v0.28.0.

Full release entry includes: 5-line fix diff, 26-round investigation arc (R16-R26), temperature sweep table with outcomes, causal story (peaky MoE routing × DeltaNet positive feedback), what it doesn't fix, safety measurements, recommended user recipe, and scope.

Also updates:
- README.md v3.21 blurb with the key numbers
- README.ko.md v3.21 blurb (Korean mirror)
- bindings/python/pyproject.toml 0.27.0 → 0.28.0

Opt-in. Default behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a4d0002 commit d2fb852

4 files changed

Lines changed: 109 additions & 1 deletion


README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model, rather than saying **"I don't know"**,

> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We re-measured the v1 result on a larger grid (1B/3B models, ctx 256-2048, 204 NIAH trials + an FP32-weights control experiment). Both models show a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). The 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. The honest reinterpretation: Beyond RAG only works for documents that fit inside *effective* working memory, whose size is 1/100 to 1/1000 of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK — one line of MoE softmax temperature (2026-04-22)**: A single new env flag, `TQ_MOE_ROUTE_TEMP=2.0`, eliminates the "It could do math! It could do math!" 117-token repetition loop in Qwen3.6-35B-A3B that had survived 40+ cumulative debug rounds. Temperature sweep on the drift-trigger prompt (`-n 200 T=0`): default T=1.0 → 117-tok loop; T=1.5 → **earlier** 87-token cliff; T=1.8 → still trapped; **T=2.0 → full 200-token Alex+sad-tree story, no rep-loop**; T=2.5 → full 200 tokens; T=3.0 → over-flat, 114-tok loop returns. Sweet spot: T=2.0 to 2.5. Cause: **peaky MoE routing forms a positive feedback loop with DeltaNet's persistent state**; spread the softmax and the loop can't form. The 26-round arc: R16-R19 per-layer DeltaNet state resets were null; R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt → zero drift → MoE confirmed as the sole cause; R25's MoE router probe found a 0.80+ collapse at L4; R26's temperature ablation broke the cliff. Regression 23/23 PASS (including T=2.0), "Paris" factual probe normal. The 150+ token tail still degrades at the character level — the hard cliff is gone, but essay-length generation needs further work. Opt-in, default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0.
> **v3.20 ★★ BPE encode/decode UTF-8 fix — international text silent quality disaster resolved (2026-04-21)**: Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and output containing non-ASCII characters (accents, CJK, Cyrillic, byte-fallback emoji) on **all Llama-3 / Qwen3 family models**. **Encode**: for GPT-2 direct-byte codepoints it emitted the raw byte ≥ 0x80 as-is (a standalone byte is invalid UTF-8), so `str_lookup` failed to match → characters fell back through byte-fallback to wrong low-id tokens. **Decode**: it emitted the UTF-8 encoding (`c3 83`) of codepoints U+0080-U+00FF instead of the raw byte (`0xC3`) → output got double-encoded ("café" → "cafÃ©"). Token-level HF reference parity verified: `café`/`naïve`/`日本語`/`привет` all match HF `AutoTokenizer` byte-for-byte. Discovered via the A/B output diff in the new [`tools/refparity/`](tools/refparity/) framework. Also synced to the `quant.h` single-header. Added `scripts/test_tokenizer.sh` regression fixtures. Scope: GPT-2-style byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); the Gemma/Phi-3 SentencePiece path is unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.

> **Practical Qwen3.6-35B recipe on a 16 GB Mac**: the current best configuration for long-form coherence is `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` + `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): the default config hits a repetition loop at 117 tokens, while Q5_K_M + rep-penalty generates the full 200-token budget with only gentle degradation near the end. 35B DeltaNet drift remains an open architectural investigation, but this is the best config users can run today.

README.md

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK via MoE softmax temperature (2026-04-22):** One new env flag — `TQ_MOE_ROUTE_TEMP=2.0` — breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds. Temperature sweep on the drift-trigger prompt (`-n 200 T=0`): default T=1.0 → 117-tok loop; T=1.5 → **earlier** 87-tok cliff (peakier in some heads); T=1.8 → still trapped; **T=2.0 → 200-tok coherent Alex+sad-tree story, no rep-loop**; T=2.5 → 200-tok coherent; T=3.0 → over-flat, 114-tok loop returns. Sweet spot: T=2.0 to 2.5. Causal story: **peaky MoE routing locks into a feedback loop with DeltaNet's persistent state**; spread the softmax and the feedback can't form. The 26-round investigation arc: R16-R19 tried per-layer DeltaNet state resets (null result), R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt and saw **zero drift** (isolating the cause to MoE), R25's MoE router probe found L4 at 0.80+ routing collapse, R26's temperature ablation closed it. 23/23 regression PASS at T=2.0, `"Paris"` factual probe correct. Tail quality beyond ~150 tokens still degrades — the hard cliff is broken, but full essay-length generation still has quantization noise margins to close. Opt-in; default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **v3.20 ★★ BPE encode/decode UTF-8 fix — international text silent quality disaster resolved (2026-04-21):** Two symmetric bugs in `encode_byte_to_bpe_char` / `decode_bpe_token` were silently corrupting every prompt and every output containing non-ASCII chars (accents, CJK, Cyrillic, byte-fallback emoji) on all Llama-3 / Qwen3 family models. **Encode** emitted raw bytes ≥ 0x80 (invalid UTF-8) for GPT-2 direct-byte codepoints so `str_lookup` never matched; characters silently fell back to wrong low-id tokens. **Decode** emitted the UTF-8 encoding of codepoints U+0080-U+00FF instead of the raw byte they represent; output got double-encoded ("café" → "cafÃ©"). Token-level HF parity verified: `café`/`naïve`/`日本語`/`привет` all now tokenize byte-for-byte identical to HF `AutoTokenizer` on Qwen3. Discovered via the new [`tools/refparity/`](tools/refparity/) framework's A/B output diff. Also synced to `quant.h` single-header. Added `scripts/test_tokenizer.sh` fixtures so future refactors fail loudly. Scope: GPT-2-style byte-level BPE (Llama-3.x, Qwen2.5/3.x/3.5/3.6); Gemma/Phi-3 SentencePiece path unaffected. Regression 15/15 + tokenizer 8/8 PASS. v0.27.0.

> **Practical Qwen3.6-35B recipe on 16 GB Mac**: for best long-form coherence, pair `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` with `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): default config hits a repeat loop at 117 tokens; Q5_K_M + rep-penalty runs the full 200-token budget with graceful degrade only near the end. 35B DeltaNet drift remains an open architectural investigation; this is the best user-facing config today.

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "quantcpp"
-version = "0.27.0"
+version = "0.28.0"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }

docs/RELEASE_NOTES.md

Lines changed: 104 additions & 0 deletions
@@ -6,6 +6,110 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
## [v0.28.0] — 2026-04-22 ★★★ (MoE softmax temperature breaks the Qwen3.6-35B 117-tok cliff)

### Headline
**One new env flag — `TQ_MOE_ROUTE_TEMP=2.0` — breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds.** 35B long-gen goes from 117 to 200+ coherent tokens on the standard drift-trigger prompt. Opt-in today; opt-out tomorrow.
### The fix

`src/engine/tq_moe.c::tq_moe_route` — a 5-line diff on the top-K softmax:

```c
float inv_temp = 1.0f / route_temp; /* default 1.0 = identity */
for (int k = 0; k < num_active; k++) {
    float e = expf((logits[out_expert_ids[k]] - max_val) * inv_temp);
    ...
}
```

### Why it works — 26-round investigation summary
Rounds 1-19 on this project chased the drift in the DeltaNet recurrent state, assuming that was the cause. R19's per-layer reset bisection proved that hypothesis wrong: no single DeltaNet layer carries the drift signal.

**R24** ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the exact drift-trigger prompt and got **200+ coherent tokens**, confirming the drift is MoE-specific, not DeltaNet alone.

**R25** added `TQ_MOE_PROBE` — a per-layer top-K router histogram — and found a persistent near-collapse at L4 (one expert getting 0.80+ of the softmax mass at tokens 100-115).

**R26** added `TQ_MOE_ROUTE_TEMP` — a softmax temperature knob. Sweep over T ∈ {1.0, 1.5, 1.8, 2.0, 2.5, 3.0}:

| TEMP | outcome |
|---:|:---|
| 1.0 (default) | 117-tok loop "It could do math!" |
| 1.5 | **87**-tok loop (earlier cliff, peakier in some heads) |
| 1.8 | 113-tok loop |
| **2.0** | **200 tokens, NO rep-loop**, coherent Alex+sad-tree story |
| **2.5** | **200 tokens, NO rep-loop**, Alex+magic-leaves story |
| 3.0 | 114-tok loop (over-flat, wrong expert mix) |

Sweet spot: T=2.0 to 2.5. The cliff is **caused by peaky MoE routing locking into a feedback loop** with DeltaNet's persistent state. Spread the routing distribution and the feedback can't form.

### What v0.28.0 does NOT fix

- Tail quality at 200+ tokens still degrades to character-level noise (alphabet-walking) on longer `-n 500` runs — probably quantization plus DeltaNet state accumulation still contributing at the margin.
- A "Sorry!" mini-loop appears around token 170 at T=2.0 — human-visible, but it doesn't trigger the engine's rep-loop detector.

So: this release **breaks the hard 117-tok cliff** and recovers ~50 additional coherent tokens; full essay-length generation still has a gap to close.

### Safety

- `"The capital of France is"` → `"Paris."` at T=2.0 ✓
- `bash scripts/test_models.sh` → **23/23 PASS** with T=2.0 (15 coherence + 8 tokenizer, no diff)

### Default behavior

**Unchanged.** `TQ_MOE_ROUTE_TEMP=1.0` is the default, so existing users get identical behavior. Adding the flag is opt-in. A later release may flip the `qwen35moe` arch to default T=2.0 after broader validation across prompts and `--chat` mode.

### Recommended Qwen3.6-35B recipe

```bash
TQ_MOE_ROUTE_TEMP=2.0 \
./build/quant models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
  -p "<your prompt>" -n 200 -T 0 --rep-penalty 1.3
```

### Files changed

- `src/engine/tq_moe.c` — 5-line softmax temperature insertion
- `docs/env_vars.md` — new `TQ_MOE_ROUTE_TEMP` row with measurements
- `docs/supported_models_tier.md` — 35B recipe updated
- `bench/results/2026-04-22_moe_temp_cliff_break.md` — full proof + ablation data + causal story

### Scope

- **Affected**: Qwen3.6-35B-A3B (MoE + DeltaNet hybrid) — all quants.
- **Default-mode unaffected**: every other model. All 40+ MoE layers get the same `route_temp`, but for non-pathological routing the difference between T=1.0 and T=2.0 is within normal quality noise.

Additional details: `bench/results/2026-04-22_moe_temp_cliff_break.md`.

---
## [v0.27.0] — 2026-04-21 ★★ (BPE encode+decode UTF-8 fix — international text silent quality disaster RESOLVED)

### Headline
