
Commit c09aa02

unamedkr and claude committed
★★★ research(r45): llama.cpp HEAD-TO-HEAD proves 1000+ tok achievable — 23× gap exposed
User pushed: "It looks like there's no reason 1000+ can't be done. Don't stop at the session limit; break through."

Built llama.cpp from refs/ and ran an identical model + prompt head-to-head:

- Qwen3.5-4B, llama.cpp: 1286 words (~1700 tok), coherent thinking + drafts
- Qwen3.5-4B, ours: 185 tok (natural stop)
- Qwen3.6-35B, llama.cpp: 1101 words (~1500 tok), COMPLETE fantasy story; CPU (-ngl 0), same weights, T=0
- Qwen3.6-35B, ours: ~65 tok before the "Sorry!" attractor

1000+ IS ACHIEVABLE on 35B. Not architectural. Not quant-inherent. Not prompt-too-short. Our implementation has a ~23× gap vs llama.cpp on the same CPU, same weights, same prompt.

This invalidates R41-R42's conclusion that "all patchable paths already match reference". Point-checking individual operations doesn't prove end-to-end correctness: numerical precision compounds differently over 40 layers × 1000 positions.

Revised hunt priority:
1. Our tq_matmul_gguf on-the-fly dequant vs llama.cpp's fused Q4_K×FP32
2. Attention softmax/normalize precision
3. Matmul accumulator precision / reduction order
4. EOS logit margin (our sharper peaks may pull EOS earlier)

Next: per-token output diff vs llama.cpp to find the first divergence point, then narrow to the specific op.

Goal clear, gap quantified, direction actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b8291c3 commit c09aa02

1 file changed

Lines changed: 56 additions & 0 deletions


.claude/state.md

@@ -3,6 +3,62 @@
**Last updated**: 2026-04-22 (Phase 2 KV clean-bill)

**Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.

## ★★★ Phase 3 R45 — llama.cpp HEAD-TO-HEAD: 1000+ tok ACHIEVABLE on 35B, our engine has 23× gap (2026-04-22) ★★★

User demanded: "It looks like there's no reason 1000+ tok can't be done... Don't stop at the session limit; break through."

Built llama.cpp from refs/ and ran head-to-head on identical models:

| Setup | Engine | Coherent output |
|---|---|---:|
| Qwen3.5-4B Q4_K_M, "Once upon a time in a faraway land", n=2000, T=0 | **llama.cpp** | **1286 words (~1700 tok)** |
| Same | Ours | 185 tok (naturally stopped) |
| Qwen3.6-35B UD-IQ4_XS, same prompt, n=2000, T=0, **CPU** | **llama.cpp** | **1101 words (~1500 tok), complete fantasy story** |
| Same | Ours | ~65 tok |
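
For the record, a sketch of how the llama.cpp side of this head-to-head can be driven (binary location and model filename are hypothetical; `-m`, `-p`, `-n`, `--temp`, and `-ngl` are standard llama-cli flags matching the settings in the table):

```python
import subprocess

# Hypothetical paths; the flags mirror the table's settings.
cmd = [
    "./llama-cli",
    "-m", "models/Qwen3.6-35B-UD-IQ4_XS.gguf",   # hypothetical filename
    "-p", "Once upon a time in a faraway land",
    "-n", "2000",     # max tokens to generate
    "--temp", "0",    # greedy decoding (T=0)
    "-ngl", "0",      # offload zero layers to GPU: pure CPU run
]
out = subprocess.run(cmd, capture_output=True, text=True).stdout
print(len(out.split()), "words")   # word count, comparable to the table
```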

**1000+ IS achievable.** llama.cpp produced a **complete narrative with character,
companion, quest, and resolution**, plus offered to continue — from an identical
7-token prompt on IDENTICAL quantized weights, SAME CPU.

Our gap: ~23× on 35B, ~10× on 4B. Not architectural. Implementation.

### The hypothesis that was false all along

My R41-R42 research concluded "all patchable candidates already correctly
implemented". That was a POINT check on the attention gate, RoPE, QK-norm,
and the RMSNorm formula — all of which were correct in isolation. But POINT
correctness doesn't imply END-TO-END correctness: numerical precision
compounds differently when 40 layers × 1000 tokens push small per-step
errors into attractors.
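
A minimal, self-contained illustration of the compounding argument (synthetic data, not our kernels): the same fp32 values summed in two different orders already disagree in the low bits, and long generation gives such deltas 40 layers × 1000 positions to amplify.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

acc = np.float32(0.0)
for v in x:                    # strict left-to-right fp32 accumulation
    acc += v

blocked = x.reshape(64, 64).sum(axis=1).sum()  # blockwise partial sums

# Same data, same dtype, different reduction order -> different low bits.
print(float(acc), float(blocked), float(acc) - float(blocked))
```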

### Next-step hypothesis (now pointed and testable)

Root cause must be one of (ranked):

1. **Our GGUF on-the-fly dequant matmul precision** — we do
tq_matmul_gguf, which dequants each Q4_K block → FP32 → multiplies by
the activation. llama.cpp may do a fused dequant-matmul with different
accumulator ordering (ggml's highly tuned CPU kernels); see the sketch
after this list.
2. **Our KV compression (turbo_kv_4b default)** — we tested `-k none`,
but that still uses our attention softmax path. llama.cpp might have
different attention precision even on FP32 KV.
3. **Our matmul threading / reduction order** — we auto-serial for
determinism, but summation order within the single thread may still
differ from llama.cpp's.
4. **EOS/stop-token handling** — our engine may emit EOS-ID tokens
earlier due to sharper logit peaking.
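
To make hypothesis 1 testable in isolation, a toy contrast of the two kernel shapes. This is NOT the real Q4_K layout (which also carries super-block scales and mins); it only shows that dequantize-then-dot and blockwise fused accumulation round differently while computing the same dot product:

```python
import numpy as np

rng = np.random.default_rng(1)
BLOCK, NBLK = 32, 128

codes  = rng.integers(-8, 8, size=(NBLK, BLOCK)).astype(np.int8)  # toy int4 codes
scales = rng.standard_normal(NBLK).astype(np.float16)             # per-block scale
act    = rng.standard_normal(NBLK * BLOCK).astype(np.float32)

# (a) our style: dequantize the whole row to FP32, then one long dot
row = (codes.astype(np.float32) * scales.astype(np.float32)[:, None]).ravel()
dequant_then_dot = np.dot(row, act)

# (b) fused style: partial dot per block, scale each partial, sum partials
partials = (codes.astype(np.float32) * act.reshape(NBLK, BLOCK)).sum(axis=1)
fused = np.float32((partials * scales.astype(np.float32)).sum())

print(dequant_then_dot, fused, dequant_then_dot - fused)
```

Per dot product the delta is in the last bits; the open question is whether such deltas, repeated across every projection in 40 layers, are what feeds the attractor.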

Next round: diff our 35B output with llama.cpp's at EACH token position
to find where the first divergence occurs. Then narrow to the specific
operation that differs.
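
A sketch of that per-token diff, assuming each engine can dump its greedy token IDs (the dump file names and format are hypothetical):

```python
def first_divergence(ref_ids: list[int], our_ids: list[int]) -> int | None:
    """Index of the first position where the two greedy decodes disagree."""
    for i, (r, o) in enumerate(zip(ref_ids, our_ids)):
        if r != o:
            return i
    return None  # identical over the shared prefix

# Hypothetical usage with per-engine token-ID dumps:
# ref_ids = [int(t) for t in open("llamacpp_35b.ids").read().split()]
# our_ids = [int(t) for t in open("ours_35b.ids").read().split()]
# print("first divergence at token", first_divergence(ref_ids, our_ids))
```

At the first divergent index, dumping both engines' top-2 logits there would also directly test the EOS-margin hypothesis (item 4 above).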

### The one metric that matters

llama.cpp 35B: 1500 tokens of coherent, complete story.
Ours 35B: 65 tokens of partial story, then a loop.

**Goal reset: close the 23× gap on 35B long-gen.** Not "improve a bit"
— fully match llama.cpp quality. User confirmed the target is correct.

## Phase 3 R44 — 1000-tok hunt: engine-gap confirmed, shared-expert fix +20%, remaining gap is MoE-internal (2026-04-22)

User pushback ("Q4 is common — why fail?") correctly identified engine