README.ko.md (2 additions, 0 deletions)
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model, instead of saying **"I don't know"**,
> **v2 follow-up, Working Memory Cliff (2026-04-11)**: We re-ran the v1 measurements on a larger grid (1B/3B models, ctx 256-2048, 204 NIAH trials plus an FP32-weights control experiment). Both models hit a sharp cliff below **1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, both as **step functions**). The 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells, so the cliff is a model property, not a KV/weight quantization artifact. The honest reinterpretation: Beyond RAG works only for documents that fit within the *effective* working memory, which is 1/100 to 1/1000 of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
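The "step function" claim above is easy to visualize: NIAH recall stays near-perfect up to some context length, then collapses between two adjacent grid points rather than decaying gradually. A minimal sketch of bracketing such a cliff, using made-up recall numbers (not the measured grid):

```python
# Illustrative recall-vs-context data shaped like the 1B Q8 result described
# above (values are hypothetical, not the paper's measurements).
recall = {256: 1.00, 512: 0.98, 1024: 0.05, 2048: 0.02}

def cliff_interval(recall, threshold=0.5):
    """Return the (last_good, first_bad) context-length bracket where
    recall drops through `threshold` between adjacent grid points."""
    ctxs = sorted(recall)
    for lo, hi in zip(ctxs, ctxs[1:]):
        if recall[lo] >= threshold > recall[hi]:
            return lo, hi
    return None  # no cliff inside the measured grid

print(cliff_interval(recall))  # → (512, 1024)
```

A step-function cliff shows up as a single bracketing interval like this; a gradual decay would instead cross the threshold with intermediate recall values on both sides.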
> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK via one line of MoE softmax temperature (2026-04-22)**: A single new env flag, `TQ_MOE_ROUTE_TEMP=2.0`, eliminates the "It could do math! It could do math!" 117-token repetition loop in Qwen3.6-35B-A3B that had survived 40+ debug rounds. Temperature sweep on the drift-trigger prompt at `-n 200 T=0`: default T=1.0 → 117-token loop; T=1.5 → an **earlier** 87-token cliff; T=1.8 → still trapped; **T=2.0 → a full 200-token Alex-and-sad-tree story, no rep-loop**; T=2.5 → full 200 tokens; T=3.0 → over-flattened, the loop returns at token 114. Sweet spot: T=2.0 to 2.5. Causal picture: **peaky MoE routing forms a positive feedback loop with DeltaNet's persistent state**; spreading the softmax prevents the loop from forming. The 26-round arc: R16-R19 per-layer DeltaNet state resets were null; R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt with no drift at all, pinning the cause on MoE alone; the R25 MoE router probe found a 0.80+ routing collapse at L4; R26 closed the cliff with the temperature ablation. Regressions 23/23 PASS (including T=2.0), "Paris" factual probe correct. The tail beyond 150+ tokens still degrades at the character level: the hard cliff is gone, but essay-length generation needs further work. Opt-in; defaults unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0.
> **Practical Qwen3.6-35B recipe for a 16 GB Mac**: For long-form coherence, the current best configuration is `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` with `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): the default setup hits a repetition loop at 117 tokens, while Q5_K_M + rep-penalty generates the full 200-token budget and degrades only gently near the end. 35B DeltaNet drift remains an open architectural investigation, but this is the best configuration users can apply today.
README.md (2 additions, 0 deletions)
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix multiplication via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
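The core idea behind batched prefill fits in a few lines: feeding a T-token prompt through a weight matrix one token at a time is T matrix-vector products, while batching turns it into a single matrix-matrix product that BLAS-class kernels (such as Apple AMX via `cblas_sgemm`) execute far more efficiently. A NumPy sketch with toy shapes (names and sizes are illustrative, not the actual `tq_forward_batch` code):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_out = 250, 64, 64                # ~250-token prompt, toy hidden sizes
X = rng.standard_normal((T, d_in), dtype=np.float32)   # one row per prompt token
W = rng.standard_normal((d_in, d_out), dtype=np.float32)

# Decode-style path: T separate matrix-vector products, one kernel call each.
per_token = np.stack([x @ W for x in X])

# Prefill-style path: one sgemm-style matrix-matrix product over all tokens.
batched = X @ W

# Same math, one kernel call; results agree to float32 tolerance.
print(np.allclose(per_token, batched, atol=1e-4))
```

The speedup comes from the kernel, not the math: a single large matmul amortizes memory traffic over many tokens, which is exactly where a BLAS-grade GEMM shines relative to repeated matvecs.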
> **v3.21 ★★★ Qwen3.6-35B 117-token cliff BREAK via MoE softmax temperature (2026-04-22):** One new env flag, `TQ_MOE_ROUTE_TEMP=2.0`, breaks the "It could do math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior debug rounds. Temperature sweep on the drift-trigger prompt at `-n 200 T=0`: default T=1.0 → 117-tok loop; T=1.5 → an **earlier** 87-tok cliff (peakier in some heads); T=1.8 → still trapped; **T=2.0 → 200-tok coherent Alex-and-sad-tree story, no rep-loop**; T=2.5 → 200-tok coherent; T=3.0 → over-flat, and a 114-tok loop returns. Sweet spot: T=2.0 to 2.5. Causal story: **peaky MoE routing locks into a positive feedback loop with DeltaNet's persistent state**; spread the softmax and the feedback cannot form. The 26-round investigation arc: R16-R19 tried per-layer DeltaNet state resets (null); R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the same prompt and saw **zero drift**, isolating the cause to MoE; the R25 MoE router probe found L4 at 0.80+ routing collapse; R26's temperature ablation closed it. 23/23 regressions PASS at T=2.0, and the `"Paris"` factual probe is correct. Tail quality beyond ~150 tokens still degrades: the hard cliff is broken, but full essay-length generation still runs into quantization noise margins. Opt-in; default unchanged. Recommended: `TQ_MOE_ROUTE_TEMP=2.0 ./build/quant Qwen3.6-35B-Q5_K_M.gguf -p "..." -n 200 --rep-penalty 1.3`. Full report: [`bench/results/2026-04-22_moe_temp_cliff_break.md`](bench/results/2026-04-22_moe_temp_cliff_break.md). v0.28.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
> **Practical Qwen3.6-35B recipe on a 16 GB Mac**: for best long-form coherence, pair `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` with `--rep-penalty 1.3`. Measured on "Once upon a time in a faraway land" (-n 200, T=0): the default config hits a repeat loop at 117 tokens; Q5_K_M + rep-penalty runs the full 200-token budget with graceful degradation only near the end. 35B DeltaNet drift remains an open architectural investigation; this is the best user-facing config today.
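Conceptually, a router temperature like `TQ_MOE_ROUTE_TEMP=2.0` divides the MoE router logits by T before the softmax, which flattens a peaky routing distribution so no single expert can dominate and feed the loop. A minimal sketch with made-up logits (not the actual router implementation):

```python
import numpy as np

def route_probs(logits, temp=1.0):
    """Softmax over router logits with a routing temperature.
    temp > 1 flattens the distribution; temp < 1 sharpens it."""
    z = np.asarray(logits, dtype=np.float64) / temp
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [4.0, 1.0, 0.5, 0.2]         # a "peaky" router: one expert dominates
p1 = route_probs(logits, temp=1.0)
p2 = route_probs(logits, temp=2.0)
print(round(p1[0], 2), round(p2[0], 2))  # → 0.91 0.65: top-expert mass drops
```

This also matches the sweep's shape: as T grows the distribution keeps flattening, so past some point (the report's T=3.0) routing becomes too diffuse and quality regresses again.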