All notable changes to quant.cpp are documented here. Format follows Keep a Changelog. Versioning follows Semantic Versioning.
One new env flag — TQ_MOE_ROUTE_TEMP=2.0 — breaks the "It could do
math! It could do math!" repetition loop that capped Qwen3.6-35B-A3B
coherent generation at 117 tokens across 40+ prior debug rounds. 35B
long-gen goes 117 → 200+ coherent tokens on the standard drift-trigger
prompt. Opt-in today; opt-out tomorrow.
`src/engine/tq_moe.c::tq_moe_route` — 5-line diff on the top-K softmax:

```c
float inv_temp = 1.0f / route_temp; /* default 1.0 = identity */
for (int k = 0; k < num_active; k++) {
    float e = expf((logits[out_expert_ids[k]] - max_val) * inv_temp);
    ...
}
```

Rounds 1-19 on this project chased the drift in the DeltaNet recurrent state, assuming that was the cause. R19's per-layer reset bisection proved that hypothesis wrong: no single DeltaNet layer carries the drift signal.
R24 ran Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the exact drift-trigger prompt and got 200+ coherent tokens. Confirmed the drift is MoE-specific, not DeltaNet alone.
R25 added TQ_MOE_PROBE — per-layer top-K router histogram —
found a persistent near-collapse at L4 (one expert getting 0.80+ of
the softmax mass at tokens 100-115).
R26 added TQ_MOE_ROUTE_TEMP — softmax temperature knob. Sweep
T ∈ {1.0, 1.5, 1.8, 2.0, 2.5, 3.0}:
| TEMP | outcome |
|---|---|
| 1.0 (default) | 117-tok loop "It could do math!" |
| 1.5 | 87-tok loop (earlier cliff, peakier in some heads) |
| 1.8 | 113-tok loop |
| 2.0 | 200 tokens, NO rep-loop, coherent Alex+sad-tree story |
| 2.5 | 200 tokens, NO rep-loop, Alex+magic-leaves story |
| 3.0 | 114-tok loop (over-flat, wrong expert mix) |
Sweet spot T=2.0 to 2.5. The cliff is caused by peaky MoE routing locking into a feedback loop with DeltaNet's persistent state. Spread the routing distribution and the feedback can't form.
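For reference, a minimal self-contained sketch of the temperature-scaled top-K renormalization described above (names and surrounding plumbing are illustrative, not the engine's exact tq_moe_route internals):

```c
#include <math.h>

/* Minimal sketch: renormalize the selected top-K router logits with a
 * softmax temperature. T=1.0 reproduces the old behavior; T=2.0 flattens
 * a 0.80+ single-expert spike toward a broader expert mix. */
static void topk_softmax_temp(const float *logits, const int *ids,
                              float *weights, int k, float route_temp) {
    float inv_temp = 1.0f / route_temp;
    float max_val = logits[ids[0]];
    for (int i = 1; i < k; i++)
        if (logits[ids[i]] > max_val) max_val = logits[ids[i]];
    float sum = 0.0f;
    for (int i = 0; i < k; i++) {
        weights[i] = expf((logits[ids[i]] - max_val) * inv_temp);
        sum += weights[i];
    }
    for (int i = 0; i < k; i++) weights[i] /= sum;
}
```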
- Tail quality at 200+ tokens still degrades to character-level noise (alphabet-walking) on longer `-n 500` runs. Probably quantization + DeltaNet state accumulation still contributing at the margin.
- A "Sorry!" mini-loop appears around token 170 at T=2.0 — human-visible but doesn't trigger the engine's rep-loop detector.
So: breaks the hard 117-tok cliff, recovers ~50 additional coherent tokens. Full essay-length generation still has more to close.
"The capital of France is"→"Paris."at T=2.0 ✓bash scripts/test_models.sh→ 23/23 PASS with T=2.0 (15 coherence + 11 tokenizer, no diff)
Auto-flipped for qwen35moe arch. tools/quant.c auto-detects the
MoE+DeltaNet hybrid at load time and sets TQ_MOE_ROUTE_TEMP=2.0 when
the user hasn't provided one. No effect on Llama, Phi, Gemma, Qwen3
non-hybrid, or any other arch — only qwen35moe gets the new default.
The validation signal that justified the flip:
- 5/5 short-prompt A/B (Paris, fibonacci, math, ML description, Once upon a time) give identical factual accuracy at T=1.0 vs T=2.0
- Full regression 23/23 PASS with auto-default enabled
- 117-tok cliff broken on the drift-trigger prompt
Precedent: same arch-scoped auto-mode pattern as TQ_NO_AUTO_SERIAL which
auto-forces -j 1 on qwen35moe for determinism.
Opt-outs (any of):
- `TQ_NO_MOE_TEMP_AUTO=1` — disable auto-default for this run
- `TQ_MOE_ROUTE_TEMP=1.0` — explicit override to prior default
- `TQ_MOE_ROUTE_TEMP=<other>` — explicit custom tuning
```
TQ_MOE_ROUTE_TEMP=2.0 \
./build/quant models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
  -p "<your prompt>" -n 200 -T 0 --rep-penalty 1.3
```

- `src/engine/tq_moe.c` — 5-line softmax temperature insertion
- `docs/env_vars.md` — `TQ_MOE_ROUTE_TEMP` row with measurements
- `docs/supported_models_tier.md` — 35B recipe updated
- `bench/results/2026-04-22_moe_temp_cliff_break.md` — full proof + ablation data + causal story
- Affected: Qwen3.6-35B-A3B (MoE + DeltaNet hybrid) — all quants.
- Default-mode unaffected: every other model. All 40+ MoE layers get the same `route_temp`, but for non-pathological routing the difference between T=1.0 and T=2.0 is within normal quality noise.
Additional details: bench/results/2026-04-22_moe_temp_cliff_break.md.
[v0.27.0] — 2026-04-21 ★★ (BPE encode+decode UTF-8 fix — international text silent quality disaster RESOLVED)
Two symmetric BPE bugs were silently corrupting every prompt and every output containing international characters (accents, CJK, Cyrillic, byte-fallback emoji) on all Llama-3 and Qwen3 family models. Fixed both sides of the GPT-2 byte-to-unicode mapping. Token-level parity with HF reference now 100% on tested inputs.
Both tq_tokenizer.c (split-source) and quant.h (single-header) had
mirrored bugs on the encode and decode paths for GPT-2-style BPE vocabs.
Encode (encode_byte_to_bpe_char): for "direct" bytes in the range
0xA1-0xAC and 0xAE-0xFF, we emitted the raw byte into the lookup string.
Standalone bytes ≥ 0x80 are invalid UTF-8, so str_lookup never matched
the vocab (which stores these as proper UTF-8 strings: byte 0xC3 → "Ã" =
UTF-8 c3 83). The character silently fell back to a wrong low-id token.
Decode (decode_bpe_token): for codepoints U+00A1-U+00AC and
U+00AE-U+00FF found in vocab pieces, we emitted the UTF-8 encoding of the
codepoint (2 bytes c3 83 for U+00C3 "Ã") instead of the raw byte 0xC3
that the codepoint represents in GPT-2's byte-to-unicode mapping. Output
got double-encoded: "café" (5 bytes: 63 61 66 c3 a9) became 63 61 66 c3 83 c2 a9 (7 bytes, renders as "cafÃ©").
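Both paths have to agree on GPT-2's byte-to-unicode convention. A minimal sketch of that convention (helper names here are illustrative, not the engine's API):

```c
#include <stdio.h>

/* GPT-2 BPE "direct" bytes: printable ASCII plus 0xA1-0xAC and 0xAE-0xFF.
 * These keep their own codepoint; all other bytes are remapped to 256+n. */
static int byte_is_direct(unsigned b) {
    return (b >= 0x21 && b <= 0x7E) ||
           (b >= 0xA1 && b <= 0xAC) ||
           (b >= 0xAE && b <= 0xFF);
}

/* Raw byte -> vocab codepoint (illustrative helper, not the engine's API). */
static unsigned byte_to_codepoint(unsigned b) {
    if (byte_is_direct(b)) return b;    /* e.g. byte 0xC3 -> U+00C3 "Ã" */
    unsigned n = 0;                     /* non-direct bytes: 256 + n    */
    for (unsigned x = 0; x < b; x++)
        if (!byte_is_direct(x)) n++;
    return 256 + n;
}

int main(void) {
    unsigned cp = byte_to_codepoint(0xC3);   /* U+00C3 */
    /* ENCODE must look up the UTF-8 encoding of the codepoint, NOT the
     * raw byte; DECODE must invert it: piece "Ã" emits raw byte 0xC3. */
    printf("byte 0xC3 -> U+%04X -> UTF-8 %02x %02x\n",
           cp, 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));   /* c3 83 */
    return 0;
}
```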
HF Qwen3 reference tokenization vs ours, before/after:
| Input | HF reference | Before | After |
|---|---|---|---|
| café | [924, 58858] | [68796] | [924, 58858] ✓ |
| naïve | [3376, 37572, 586] | [77, 523] | [3376, 37572, 586] ✓ |
| 日本語 | [101059, 102819] | [245, 250, 252] | [101059, 102819] ✓ |
| привет | [124436, 26991, 8178] | [222, 224] | [124436, 26991, 8178] ✓ |
All four strings now tokenize byte-for-byte identical to the HF tokenizer. Before: model saw a completely different sequence than its training distribution — silent quality degradation proportional to share of non-ASCII content in the prompt.
Both bugs surfaced from the tools/refparity/ framework added earlier
this session. The decode bug was flagged first by an A/B output diff
("café" artifact on Llama-3.2-1B); once fixed, a targeted encode
comparison vs HF tokenizer surfaced the symmetric encode bug.
- `src/engine/tq_tokenizer.c`: `encode_byte_to_bpe_char` and `decode_bpe_token` each get a direct-byte branch
- `quant.h`: synced (single-header had identical bugs)
- Regression: 15/15 PASS unchanged
- Affected: Llama-3.x, Qwen2.5, Qwen3.x, Qwen3.5, Qwen3.6, any model using GPT-2-style byte-level BPE (log line shows `is_sentencepiece=0`)
- Not affected: Gemma (SentencePiece), Phi-3 (SentencePiece path)
Latent silent-quality bug for users whose prompts touch international text. Now unblocked.
DeltaNet L2-normalization formulation mismatched llama.cpp's
ggml_l2_norm for 30+ rounds. Fixed to match reference. Qwen3.6-35B
coherent generation extends from ~117 → 160 tokens on the same
prompt (+36%), with noticeably more coherent mid-section content.
R26 had added eps = 1e-6f to our l2_normalize:

```c
/* OLD (R26 form) */
float inv = 1.0f / sqrtf(ss + eps);
```

But llama.cpp's ggml_compute_forward_l2_norm_f32 uses a different formulation:
```c
/* llama.cpp reference — eps is a floor on the DENOMINATOR */
const float scale = 1.0f / fmaxf(sqrtf(sum), eps);
```

For typical inputs (sum ~ 1), both give scale ~ 1 — no difference. But for near-zero inputs:

- Ours: scale = 1 / sqrt(0 + 1e-6) ≈ 1000
- llama.cpp: scale = 1 / max(0, 1e-6) = 1,000,000
Three orders of magnitude different for near-zero K/Q vectors. Over 30 DeltaNet layers × position, this systematic under-scaling of K,Q magnitudes compounds into the decode-length degradation we've been chasing across Pillars 1, 1.5, and 30+ rounds of Mission C.
`src/engine/tq_transformer.c:l2_normalize`:

```c
float denom = sqrtf(ss);
if (denom < eps) denom = eps;
float inv = 1.0f / denom;
```

Both NEON and scalar paths updated. Now bit-equivalent to ggml_l2_norm.
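For clarity, a self-contained scalar sketch of the corrected formulation (signature assumed for illustration; the engine also has a NEON path):

```c
#include <math.h>

/* L2-normalize v in place, matching llama.cpp's ggml_l2_norm semantics:
 * eps floors the DENOMINATOR, so a near-zero vector is scaled by 1/eps
 * (1e6 for eps=1e-6), not by 1/sqrt(eps) (~1000) as the old form did. */
static void l2_normalize(float *v, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += v[i] * v[i];
    float denom = sqrtf(ss);
    if (denom < eps) denom = eps;
    float inv = 1.0f / denom;
    for (int i = 0; i < n; i++) v[i] *= inv;
}
```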
| Version | Coherent content | Total tokens |
|---|---|---|
| v0.25.0 (old l2) | ~45 coherent then "the new normal" loop | 117 |
| v0.26.0 (ggml l2) | ~110 coherent content before mild drift | 160 |
New output (excerpt):
"Artificial Intelligence (AI) has rapidly evolved from a transformative force in the modern world, reshaping industries and transforming daily life across every sector from healthcare to education and entertainment. At its core, AI's role is to redefine what we know as 'intelligence itself.' In this context, the role of AI is both a tool and a teacher, shaping how we live and work today. AI's impact is profound: it is reshaping economies and societies globally."
R26's "epsilon fix" was the right diagnosis (missing eps) but the wrong formulation. Since typical inputs gave scale ~ 1 in both forms, regression tests pass. The bug only surfaces with near-zero K/Q magnitudes × many positions.
Discovered via direct reference-diff of llama.cpp's ggml-cpu/ops.cpp
ggml_compute_forward_l2_norm_f32 against our l2_normalize.
Not yet "1000+ char coherent generation." Still degrades after ~110 tokens on some prompts. But:
- 36% longer coherent window vs v0.25.0
- More varied content before drift (not stuck in "new normal" loop)
- Quantization-independent fix (IQ4_XS and Q5_K_M both benefit)
- Compounds with prior fixes (v0.19-0.25)
15/15 test_models + 4/4 test_tokenizer PASS.
Yet another ref-diff win. Mission C's 30 rounds missed this because the diagnosis stopped at "needs eps" rather than "needs THIS eps formulation." Always ship the EXACT reference implementation, then optimize — don't paraphrase.
Qwen3.6-35B with multi-thread matmul is non-deterministic at T=0. Same prompt run twice gives different output. Discovery: parallel FP reduction order variance compounds over 30 MoE layers × position feedback, amplifying to top-1 argmax flips. Auto-force single-thread for qwen35moe+DeltaNet hybrid models brings back determinism and extends coherent generation from ~60-70 → ~90-100 tokens.
tools/quant.c: detect is_moe && delta_n_heads > 0 (qwen35moe
hybrid) and auto-force -j 1 unless user explicitly passed -j or
sets TQ_NO_AUTO_SERIAL=1.
Visible on load:

```
Auto-serial: detected qwen35moe hybrid — forcing -j 1 for
deterministic correctness (TQ_NO_AUTO_SERIAL=1 to opt out)
Threads: 1 (auto-serial quality mode)
```
| Scenario | Multi-thread (-j 8) | Auto-serial (-j 1) |
|---|---|---|
| "Write a 300-word essay about AI." × 2 runs | Different outputs | Identical, coherent |
| 250-token gen | Degrades at 60-70 tok | ~95 tokens coherent then mild degradation |
| Decode speed | ~8 t/s | ~3 t/s (2-3× slower) |
| Prefill 280 words | 29s | ~75s (slower, but was garbled at multi-thread anyway) |
1000+ char coherent generation on Qwen3.6-35B still fails on some prompts. Auto-serial extends the coherence window but does not close it. Remaining bug class: numerical precision accumulation over 40 layers × MoE 8-expert weighted sum × decode positions. Even single-threaded, FP32 + IQ4_XS quantization errors compound enough to eventually drift into repetition.
- Before: every Qwen3.6 run on same prompt gave different answer (unusable for reproducible work)
- After: deterministic output, extended coherence window, explicit trade-off communicated to user.
- Opt-out documented: `TQ_NO_AUTO_SERIAL=1` restores multi-thread for users who want speed over stability.
- Find the exact parallel-reduction source of non-determinism (even -j 2 diverges). Candidate: FP32 matmul row partition ordering producing bit-level variance → cascades via MoE feedback.
- Higher-precision MoE accumulator (FP64 intermediate) — would dampen compound error growth even in single-thread.
- Router stability — top-K from softmax probs (llama.cpp convention) rather than raw logits for FP tie-break robustness.
| Ver | Root cause closed |
|---|---|
| 0.19.0 | BPE stale-entry (tokenizer) |
| 0.20.0 | QK-norm + NEOX RoPE (Qwen3 family structural) |
| 0.21.0 | MoE batched N>>1 → opt-in |
| 0.22.0 | Chunked batched prefill (+30% TTFT, correctness preserved) |
| 0.23.0 | Prompt buffer silent truncation (4K → 32K) |
| 0.24.0 | MoE SwiGLU exact expf (precision margin) |
| 0.25.0 | Qwen3.6 auto-serial quality mode (determinism + longer window) |
15/15 test_models + 4/4 test_tokenizer PASS.
MoE SwiGLU activation now uses exact expf by default, replacing the ~2% error Schraudolph approximation. On Qwen3.6-35B this pushes back the long-context degradation boundary — 400-word documents now produce noticeably more coherent continuation. Speed cost: unmeasurable (SwiGLU is not the hot path).
src/engine/tq_moe.c:swiglu_fused now routes through expf scalar
loop by default. Opt-out: TQ_MOE_FAST_EXP=1 reverts to NEON
Schraudolph path (for benchmarking only).
| Variant | Output |
|---|---|
| default (fast) | "most AI/ML (AI/ML) is a powerful tool for large-scale data processing." |
| exact expf | "most of the other ' is a very important. The democratization of this is a very important and another particularly powerful and even more so that" |
Longer, more varied output. Still not perfect at 400w+, but the degradation curve is noticeably softer. 280-word prose unchanged (already coherent pre-fix).
Speed test on Qwen3.6-35B 280-word prompt (TTFT + decode):
- default fast: 28-29s TTFT, 8.9-9.3 tok/s decode
- exact expf: 28-29s TTFT, 9.0-9.3 tok/s decode
Identical within noise. SwiGLU is not a bottleneck on CPU.
Qwen3.6-35B at 500+ words still degrades (repetition loops on some prompts). The MoE long-context accumulation bug has MULTIPLE compounding sources; exact expf is one contributor, not the full fix. Next investigation targets: MoE router softmax stability at long positions, expert scale factor correctness, DeltaNet state spectral radius monitoring.
15/15 test_models + 4/4 test_tokenizer PASS.
Silent prompt truncation at >4K chars FIXED. Any prompt longer than ~4096 chars (≈ 700 words of English) was being cut off at the initial BPE char-level step and silently treated as a shorter input. After fix, Qwen3.5-4B and other non-MoE models now handle 500+ word documents cleanly. Qwen3.6-35B MoE hybrid long-context bug isolated to MoE path (DeltaNet and tokenization both proven correct).
src/engine/tq_generate.c:217 allocated int prompt_tokens[4096]
and passed max_tokens=4096 to tq_encode. Our BPE does char-level
initial tokenization (one vocab token per UTF-8 char) then merges
them down. So a 4171-char text would hit the 4096 initial cap,
discarding everything past char ~4096 BEFORE merges could reduce.
The merged result (~684 tokens) would appear normal to the caller,
but the TEXT beyond char 4096 was silently gone.
- HF Qwen3-0.6B on text_1000.txt (561 words) + "Summarize..." → 698 tokens, coherent output.
- Our engine same input → 684 tokens, garbage output.
- Tokenization check: our first 5 tokens = HF first 5 tokens `[785 3840 315 24231 646]` ("The history of computing can") ✓
- Our last tokens decoded: ". The abacus" — from the BEGINNING of the text, not the end!
- Root cause: prompt was TRUNCATED; engine processed first 684 tokens of char-level initial tokenization, never reached the "Summarize..." suffix.
Buffer bumped 4096 → 32768 with dynamic max_tokens from
sizeof(prompt_tokens)/sizeof(...). 128 KB stack — fine on macOS
(8 MB default thread stack).
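The shape of the fix, as a minimal sketch (tq_encode's exact signature is assumed here for illustration):

```c
/* Before: int prompt_tokens[4096] with a hardcoded max_tokens=4096 capped
 * the CHAR-level initial tokenization, truncating text before merges ran. */
int prompt_tokens[32768];
int max_tokens = (int)(sizeof(prompt_tokens) / sizeof(prompt_tokens[0]));
int n_prompt = tq_encode(tokenizer, prompt_text, prompt_tokens, max_tokens);
```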
| Model | Before | After |
|---|---|---|
| Qwen3-0.6B (pure) | truncated → garbage | full text seen, model still weak at 698 tok |
| Qwen3.5-4B (dense hybrid) | truncated → garbage | coherent: "the future of AI is not just about what we can do with it - it's about how we think about what matters most to us" ✓ |
| Qwen3.6-35B (MoE hybrid) | truncated → garbage | full text seen, still garbage → MoE-specific bug isolated |
Qwen3.6-35B at 561 words produces 2019, 20191345688... repetition
loop in BOTH per-token and chunked-batched modes. Qwen3.5-4B with
the SAME DeltaNet architecture but DENSE FFN (no MoE) handles the
SAME input fine. Conclusion: the bug is in the MoE feedback loop at
long positions (expert accumulation, not DeltaNet state, not
tokenization). Future investigation target.
15/15 test_models + 4/4 test_tokenizer PASS.
Before concluding "long context broken," always verify the engine actually SAW the full input. Silent truncation at char buffers is a classic class of bug that hides underneath model-quality complaints.
Chunked batched prefill restores most of the batched-MoE speedup while keeping v0.21.0's correctness guarantee. Qwen3.6-35B on a 280-300 word document prefills in ~29s (vs ~38s per-token), producing the same correct summaries.
v0.21.0 made tq_forward_batch_moe_hybrid opt-in because at N≥40
the batched MoE kernel produced garbage. But dispatching the same
driver in small chunks (CHUNK tokens at a time) stays within
the safe N regime. State (KV cache, DeltaNet ssm, conv buffer) is
already persistent across driver calls, so chunking is semantically
correct.
src/engine/tq_generate.c — hybrid MoE prefill now loops over chunks
of TQ_MOE_BATCH_CHUNK tokens (default 8). Each batched call
satisfies the small-N safe region; chunks concatenate automatically
via position accumulation.
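A minimal sketch of the chunking loop, assuming a driver signature like the one named above (state persistence across calls is what makes chunk boundaries invisible):

```c
#include <stdlib.h>

/* Assumed driver signature for illustration (the real one may differ): */
extern void tq_forward_batch_moe_hybrid(void *model, const int *tokens,
                                        int n, int pos);

static void prefill_chunked(void *model, const int *prompt, int n_prompt) {
    int chunk = 8;                               /* TQ_MOE_BATCH_CHUNK default */
    const char *env = getenv("TQ_MOE_BATCH_CHUNK");
    if (env && atoi(env) > 0) chunk = atoi(env);
    /* KV cache / DeltaNet state persist across driver calls, so each
     * batched call stays inside the validated small-N regime while the
     * chunks concatenate via position accumulation. */
    for (int pos = 0; pos < n_prompt; pos += chunk) {
        int n = n_prompt - pos;
        if (n > chunk) n = chunk;
        tq_forward_batch_moe_hybrid(model, prompt + pos, n, pos);
    }
}
```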
| Input | v0.21.0 (per-token) | v0.22.0 (chunk=8) | Speedup |
|---|---|---|---|
| 44 words natural prose | 12.6s | 7.0s | +44% |
| 280 words natural prose | 38.0s | 29.4s | +29% |
| 294 words document Q&A | — | 29.4s | — |
Default chunk=8 tested on:
- Short (44w): "Artificial intelligence, powered by advanced algorithms and large-scale data, has transformed industries by enabling machines to learn and adapt like humans." ✓
- Medium (280w): "The democratization of AI has been a major driver of the change in how we do things…" ✓
- Long-medium (294w): "AI has become a ubiquitous technology, enabling billions of people to perform tasks previously impossible…" ✓
Tunable: TQ_MOE_BATCH_CHUNK=N (default 8, safe up to ~300w doc).
Chunk=32 shows degradation at long inputs; chunk=16 occasionally
leaks minor UTF-8 noise; chunk=8 is the validated default.
- 500+ word inputs: both per-token and chunked produce garbage at ~560 words. This is a DIFFERENT bug from the batched-MoE N>>1 issue (both paths fail) — likely KV cache or DeltaNet state accumulation at large token counts. Investigation deferred.
- Root cause of batched MoE N>>1 bug: still unidentified (sanity test only covers N=1). Chunked approach sidesteps it rather than fixing.
15/15 test_models + 4/4 test_tokenizer PASS.
No API change. Existing TQ_USE_MOE_BATCH, TQ_NO_MOE_BATCH,
TQ_NO_BATCH_PREFILL env vars unchanged. New TQ_MOE_BATCH_CHUNK
env overrides chunk size.
Qwen3.6-35B-A3B produces perfect coherent summaries on 40+ word natural prose via per-token prefill. Combined with v0.19.0 (BPE) and v0.20.0 (NEOX RoPE + QK-norm), the 35B MoE hybrid is now a genuine daily-driver tool for document Q&A and summarization on 16 GB Mac.
Bisection via A/B testing isolated the remaining Qwen3.6 long-prompt
bug to tq_forward_batch_moe_hybrid (specifically the batched MoE
kernel tq_moe_forward_batch at N≥40). Per-token prefill through
tq_forward produces perfect output on the same input. Root cause
inside the batched MoE scatter path is deferred (sanity test only
covers N=1; the bug is at N≫1).
src/engine/tq_generate.c line 318 — flipped the MoE hybrid driver
dispatch from default ON (Step 3i / R6) to opt-in via
TQ_USE_MOE_BATCH=1. Default behavior now falls back to per-token
forward, which is slower but correct.
| Default path | Output |
|---|---|
| v0.20.0 default (batched MoE) | ! inteligت sWith evolu tempr ت dóä¸�念ã�£ã�� assemb… UTF-8 garbage |
| v0.21.0 default (per-token) | "Artificial intelligence, particularly through deep learning and large language models, has transformed how we create and interact with content by generating coherent text from vast amounts of data." ✓ |
| Prompt | v0.20.0 | v0.21.0 |
|---|---|---|
| short_story "Once upon a time" | ✓ | ✓ |
| short_code "def fibonacci(n):" | ✗ (empty) | ✓ (Python with type hints) |
| short_qa "capital of France" | ✓ | ✓ |
| mid_tech hash table | ✓ | ✓ |
| long_essay supervised/unsupervised | ✓ | ✓ |
| mid_recipe, long_story, long_code | coherent but missed keyword | same |
4/8 → 5/8 PASS. All "FAIL"s are coherent outputs that simply don't contain the test's hardcoded keyword.
- Speed: TTFT on 44-word prompt 12.6s per-token vs ~4-7s batched (when batched works). Decode unchanged.
- Correctness: 100% vs ~50% garbage rate.
- Opt-back: speed-tolerant users can set `TQ_USE_MOE_BATCH=1` to re-enable batched MoE prefill (risks garbage on long prompts).
| Ver | Root cause | Symptom |
|---|---|---|
| v0.19.0 | BPE stale-entry (tokenizer.c:1442) | "Helll" for "Hello", all Qwen3 family |
| v0.20.0 fix 1 | R40 QK-norm over-broad disable | Layer 2 norm explosion on pure Qwen3 long prompts |
| v0.20.0 fix 2 | tq_rope LLaMA-pairs vs NEOX | Qwen3 full-rotary + all batched prefill |
| v0.21.0 | tq_moe_forward_batch at N≫1 | Qwen3.6-35B long-prompt garbage |
Six Pillar 1 + 1.5 rounds closed what 30+ empirical rounds (R26-R50) had not. HF reference diff methodology (OpenMythos-inspired) was the decisive tool.
The root cause of the batched MoE scatter bug at N≫1 is still
unidentified. The Mission A sanity test (TQ_MOE_BATCH_SELFTEST=1)
only covers N=1. Future work: extend sanity to N=40..200 range,
diff per-token vs batched expert outputs at specific positions.
No API change. tq_moe_forward_batch kernel still exported and
exercised by sanity mode. tq_forward_batch_moe_hybrid still
available via TQ_USE_MOE_BATCH=1. Existing code paths unchanged.
Two transformer-level bugs that blocked Qwen3 family long-prompt
coherence are fixed. Combined with v0.19.0's BPE tokenizer fix,
all three root causes of the 30+ round "Qwen3 drift" investigation
(R26-R50) are now closed. Discovered via HF reference diff
methodology (tools/pillar1/) after refs/OpenMythos analysis
crystallized the principle: compare to ground truth FIRST.
Fix 1 — Pure-Qwen3 QK-norm restored (tq_transformer.c:1204):
R40 had disabled QK-norm for ALL GGUF arch strings matching "qwen".
That was correct for Qwen3.5/3.6 HYBRID (DeltaNet + self-attn,
delta_n_heads > 0) — those degrade with QK-norm applied. But
pure Qwen3 (0.6B..32B) REQUIRES q_norm/k_norm per HF config. Without
them, the residual stream explodes at layer 2 (norm ~5400 vs HF ~10).
Fix: restrict the QK-norm disable to delta_n_heads > 0 only.
Pure Qwen3 now applies QK-norm as HF does.
Fix 2 — NEOX-ordering RoPE (tq_ops.c + two sites in
tq_transformer.c):
llama.cpp maps LLM_ARCH_QWEN3 / QWEN3MOE / QWEN35 / QWEN35MOE to
LLAMA_ROPE_TYPE_NEOX / IMROPE — half-split pairs (q[i], q[i+head_dim/2]). Our engine used LLaMA-style interleaved pairs
(q[2i], q[2i+1]). R34 had fixed this for the partial-rotary path
(Qwen3.5/3.6 hybrid) but pure Qwen3 (full rotary) and
tq_forward_batch were never converted.
Fix: new tq_rope_neox function + arch-detection at all three
relevant call sites. Per-token full-rotary, batched learned-freq,
batched fallback. TQ_ROPE_PAIRS=1 opt-out for legacy LLaMA/Qwen2.
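A minimal sketch of the two pair orderings (illustrative helper, not the engine's tq_rope_neox; both apply the same rotation, they just pair different elements):

```c
#include <math.h>

/* Rotate one head's query/key vector q[0..rot_dim) at position pos.
 * LLaMA-style pairs adjacent elements (q[2i], q[2i+1]); NEOX pairs
 * half-split elements (q[i], q[i + rot_dim/2]). Same math, different
 * memory pairing — mixing them up scrambles every rotated dimension. */
static void rope_rotate(float *q, int rot_dim, int pos, float base, int neox) {
    int half = rot_dim / 2;
    for (int i = 0; i < half; i++) {
        float theta = pos * powf(base, -2.0f * i / rot_dim);
        float c = cosf(theta), s = sinf(theta);
        int i0 = neox ? i        : 2 * i;      /* first element of the pair */
        int i1 = neox ? i + half : 2 * i + 1;  /* its rotation partner      */
        float x0 = q[i0], x1 = q[i1];
        q[i0] = x0 * c - x1 * s;
        q[i1] = x0 * s + x1 * c;
    }
}
```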
| Path | Before | After |
|---|---|---|
| Batched prefill | alyticsанcieaâ��à¹�… UTF-8 garbage | " Let me try to understand this" |
| Per-token prefill | lenameuously…catchØ� | " ... and so on… So, the problem is to find the number of possible ways" |
| Model | Output (first 20 tok) |
|---|---|
| Qwen3-0.6B | "The main features of AI technology are that it has the ability to process information…" ✓ |
| Qwen3.5-4B | "Artificial intelligence is a field of computer science that focuses on the development of intelligent machines…" ✓ |
- Zero UTF-8 garbage outputs (was 100% on 40+ words before v0.19.0).
- Short story, long essay, tech explanation, factual Q&A all coherent.
- Remaining weak spots are chat-template-induced early EOS (0 tokens on some raw-completion prompts) — model behavior, not engine bug.
refs/OpenMythos (RDT / MLA / ACT architecture reconstruction)
crystallized the principle that ENABLED this session's breakthroughs:
Compare to ground truth (HF reference diff) BEFORE guessing at kernels or recurrence state. 30+ rounds R26-R50 had all been empirical; Pillar 1 R1-R3 + Pillar 1.5 R1-R3 solved three distinct root causes in 6 rounds by diffing against HF output.
Saved as memory/project_openmythos_insights.md for future sessions.
- `src/engine/tq_tokenizer.c` — BPE stale-entry check (v0.19.0 fix retained)
- `src/engine/tq_transformer.c` — QK-norm scope + NEOX in 2 call sites
- `src/engine/tq_ops.c` — new `tq_rope_neox` function
- `include/turboquant/tq_engine.h` — export `tq_rope_neox`
- `scripts/test_models.sh` + `scripts/test_tokenizer.sh` — regression expanded
- `tools/pillar1/` — HF reference diff toolchain retained for follow-on
- `bench/results/2026-04-20_bpe_fix_proof.md` — before/after evidence
- `bench/results/2026-04-20_longseq_transformer_bug.md` — R7/R8 discovery trail
- `test_models.sh`: 15/15 PASS (unchanged through both fixes)
- `test_tokenizer.sh`: 4/4 PASS
- Qwen3.6-35B DeltaNet state accumulation on 40+ word natural prose can sometimes trigger repetition-loop detection. This is separate from the RoPE/QK-norm bugs and needs OpenMythos Insight #2 (spectral-radius monitoring of recurrent state) applied as diagnostic. Short-medium prompts fully coherent.
- Chat-template interactions producing 0-token responses on some coding prompts (Qwen3.6's thinking-mode prefix consuming the tokens).
No API change. Existing code using tq_rope continues to work for
LLaMA/Qwen2. New tq_rope_neox opt-in for Qwen3 family (auto-
detected via GGUF arch string).
One-line fix to src/engine/tq_tokenizer.c:1442 eliminates the
structural tokenization bug that caused every "Qwen3 drift" symptom
across 30+ rounds of kernel/MoE/DeltaNet investigation. Pillar 1
of the Mission E roadmap, closed in 3 rounds via HF reference diff.
```diff
  if (top.gen != gen[top.pos]) continue;
+ if (tokens[top.pos] < 0) continue; // ★ missing dead-slot guard
  int ri = next[top.pos];
  if (ri >= n_tokens || tokens[ri] < 0) continue;
```

Root cause: In the heap-based BPE merge loop, a position P that
dies as the RIGHT neighbor of some other merge has tokens[P] set to
-1 but gen[P] is not bumped. Stale heap entries at position P
pass the gen-based staleness check, then the code overwrites dead
tokens[P] with a new merge result — resurrecting the slot, scrambling
the linked list, and producing malformed token sequences.
| Tokens for "Hello" | Decoded | |
|---|---|---|
| HF reference | [9707] |
"Hello" |
| Our engine BEFORE | [32713, 654] |
"Helll" (extra 'l', lost 'o') |
| Our engine AFTER | [9707] |
"Hello" ✓ |
| Symptom (previous attributed cause) | Actual cause |
|---|---|
| Qwen3.5/3.6 "quicck bbrrown" char doubling | tokenizer |
| Qwen3.6-35B ≥40-word prompt → UTF-8 garbage | tokenizer |
| Phi-3.5 "What is 2+2?" → hallucinating "tti" | tokenizer |
| R32 Mission C "drift is Qwen-common architecture" | WRONG — was tokenizer |
| R46-50 Mission D "structural bug needs HF Python diff" | correct diagnosis; R3 finishes it |
- Regression: 15/15 `test_models.sh` + new `test_tokenizer.sh` 4/4
- Real output: Qwen3.6-35B on 40+ word prompts produces coherent Python code and full narrative text (previously garbage)
- Phi-3.5: "What is 2+2?" → "The sum of 2 and 2 is equal to four." (previously "I'm sorry but 'tti' doesn't appear to...")
Pillar 1 R1-R3 built Python + HF Qwen3-0.6B FP32 reference env
(tools/pillar1/) specifically to enable per-layer diff debugging.
Before the first layer diff was ever needed, just comparing
tokenizer output revealed the mismatch. The entire transformer
investigation from R26-R50 had been working with corrupted input.
Lesson: When debugging LLM coherence, compare tokens to HF
reference FIRST. Don't "rule out" the tokenizer without actually
running AutoTokenizer.encode(prompt) side-by-side.
- `src/engine/tq_tokenizer.c` — 1-line fix + comment
- `src/engine/tq_transformer.c` — env-gated per-layer dump (`TQ_DUMP_HIDDEN=dir`) retained as debugging infrastructure
- `scripts/test_models.sh` — Phi-3.5 expected "answer" → "sum" (Phi-3 now gives actual factual math answer)
- `scripts/test_tokenizer.sh` — NEW 4-test regression guard
- `tools/pillar1/` — HF reference env + hf_dump.py dump tool
- `bench/results/2026-04-20_bpe_fix_proof.md` — full before/after proof
- `quant.h` (single-header): uses naive O(n²) BPE merge, correct by construction. Embed/WASM users have NEVER hit this bug. Only the split-source engine needed the fix.
- No API change.
- No performance change (the stale check is O(1)).
No migration needed. Users of prior versions will simply see coherent output on previously-broken prompts. All existing models work.
CLI now reports TTFT and decode rate separately, replacing the blended "overall tok/s" that dominated short-query reports with cold-start latency. Individual developers evaluating the engine on a 16 GB Mac can now distinguish prefill cost from sustained decode.
R51 — TTFT measurement (tools/quant.c):
print_token callback records first-token timestamp via a
cli_timing_ctx_t struct passed through user_data. After
tq_generate returns, the summary line splits:
```
TTFT 0.99s | decode 29 tok in 1.99s (14.6 tok/s) | total 3.0s (10.1 tok/s overall)
```
Fallback to the old single-line format when n_generated ≤ 1.
25 LOC total. No engine behavior change.
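A minimal sketch of the pattern (field names are assumptions; the CLI's actual cli_timing_ctx_t may differ):

```c
#include <stdio.h>
#include <time.h>

/* Timing context threaded through the generate callback via user_data. */
typedef struct {
    struct timespec start;        /* set before tq_generate            */
    struct timespec first_token;  /* set on the first emitted token    */
    int n_generated;
} cli_timing_ctx_t;

static void print_token(const char *piece, void *user_data) {
    cli_timing_ctx_t *t = (cli_timing_ctx_t *)user_data;
    if (t->n_generated == 0)
        clock_gettime(CLOCK_MONOTONIC, &t->first_token); /* TTFT mark */
    t->n_generated++;
    fputs(piece, stdout);
    fflush(stdout);
}

static double secs_between(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}
/* After generation returns: TTFT = secs_between(start, first_token);
 * decode rate = (n_generated - 1) / secs_between(first_token, end). */
```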
R52 — Daily-driver baseline matrix (bench/results/2026-04-20_ttft_daily_driver.md):
Measured warm numbers on 16 GB M1 Pro CPU-only:
| Model | Warm TTFT | Warm decode |
|---|---|---|
| Phi-3.5 Q4_K_M (3.8B) | 2.3s | 14.5 t/s |
| Llama-3.2-3B Q8→Q4 | 0.97s | 29.0 t/s |
| Qwen3.6-35B IQ4_XS | 1.83s | 10.5 t/s |
Cold first-run Qwen3.6 TTFT is 9.6s (5.3× warm) due to cold SSD paging; subsequent runs benefit from macOS page cache.
R53 — README v3.11 blurb (README.md + README.ko.md):
External discoverability — first-visit users see the TTFT/decode
framing and warm numbers immediately above the v3.10 correctness
entry.
For short queries (-n 30), cold-start mmap + MADV traversal + transformer pass #1 can dominate wall time. Reporting only the blended rate understates engine compute speed. Example:
- Qwen3.6-35B cold: total 19.3s → 1.6 t/s overall
- Qwen3.6-35B warm: total 4.6s → 6.5 t/s overall, 10.5 t/s decode
The engine isn't 6× faster warm; it's doing the same compute. TTFT just dropped from 9.6s to 1.8s. Individual devs need to see both.
No API change. Existing tq_gen_config_t::on_token callbacks with
user_data = NULL continue to work — user_data is opaque from
the library's perspective; the CLI passes its own timing struct.
15/15 PASS (unchanged). Existing COHERENT + STRICT + Metal-ON tier tests traverse new stderr format without breakage.
Qwen3.5 / Qwen3.6 long-form generation restored across all formats. Two structural fixes (R34 NEOX RoPE + R40 arch-conditional QK-norm) eliminate prompt-sensitive drift that manifested as "quicck bbrrown" char doubling, digit/alphabet spam, and incoherent output past ~20 tokens.
refs/llama.cpp/src/llama-model.cpp:9298 maps
LLM_ARCH_QWEN35/QWEN35MOE → LLAMA_ROPE_TYPE_IMROPE. ggml.h:1826:
"NEOX ordering cannot be disabled for MROPE/VISION" (IMROPE is MROPE
family). Our partial-rotary path used LLaMA-style (q[2i], q[2i+1]);
Qwen3.x requires NEOX half-split (q[i], q[i+rope_pairs]). Opt-out:
TQ_ROPE_PAIRS=1.
Gemma 4 REQUIRES QK-norm (2+2=4 test breaks without). Qwen family
DEGRADES with QK-norm applied (40+ token drift). Empirical:
Qwen3.6 "Once upon a time" n=60 → drift WITH, perfect story WITHOUT.
Fix: arch detection gates QK-norm; Gemma keeps, Qwen skips.
Opt-in: TQ_FORCE_QK_NORM=1.
| Format | v0.16 | v0.17 |
|---|---|---|
| "Once upon a time" n=60 | drift at ~20 tok | 60 tok Jack/compass/map story |
| `def fibonacci(n):` | garbled | `if n <= 0: return "Invalid input"` |
| Haiku | char doubling | "Silence speaks loud, Silence speaks in the quietest way." |
| List 5 items | partial | "1. Apple 2. Banana 3. Orange" |
| Factual | ✓ | "Paris", "12", "1945" |
Supporting fixes now complete: l2_normalize epsilon; DeltaNet
softplus/decay/silu + attention sigmoid + conv1d silu all use exact
expf() (was Schraudolph fast_expf with ~2% error, compounding
across 30 DeltaNet layers).
Added R41 long-form guards: "Once upon a time" n=40 + "def fibonacci" n=30 strict substring checks. Gemma/Llama/Phi unchanged.
Qwen3.6-35B-A3B on 16 GB M1 Pro: v0.16 loaded Q5_K_M, v0.17 sustains coherent long generation across story/code/poem/list/explanation/Q&A.
- R1-15 Mission A (MoE batched +39%)
- R16-17 Q5_K_M 16 GB load
- R25-33 drift discovery + partial fixes
- R34 NEOX RoPE structural fix
- R40 arch-conditional QK-norm — ALL FORMATS WIN
- R41 long-form regression
- R42 v0.17.0 release
Qwen3.6-35B-A3B at Q5 quality (5.5 bpw) on 16 GB M1 Pro — first engine to load the 26.5 GB Q5_K_M GGUF and produce coherent decode at 7.9 t/s warm steady-state on a 16 GB Mac. Previously impossible (llama.cpp + Q5_K_M OOMs on the same hardware).
tq_model.c MoE GGUF loader now selects madvise strategy by
file_size vs hw.memsize:
- File ≤ 75% RAM: blanket `MADV_WILLNEED` (previous behavior; optimal read-ahead, no swap risk).
- File > 75% RAM: selective `MADV_WILLNEED` on non-expert tensors only (`attn_*`, `norm_*`, `token_embd`, `output.weight`, `ffn_*_shared_exp`). Routed `ffn_{gate,up,down}_exps` left at OS default (NORMAL with read-ahead). MoE sparsity (K=8/N=256 active) keeps working set bounded.
Result for Qwen3.6-UD-Q5_K_M (24.6 GB):
- Non-expert WILLNEED: 2.50 GB
- Routed-expert OS-managed: 22.13 GB
- RSS: 9.65 GB (36.7% of file) on 16 GB M1 Pro
- Decode warm steady-state: 7.9 t/s (interactive range)
Override envs: TQ_FLAT_MADV=1, TQ_SELECTIVE_MADV=1.
q5k_int_dot_worker: 5th-bit extraction chain shortened from
(AND + CEQ + AND + OR) to (SHL + AND + OR) using variable-shift
vshlq_u8 with runtime shift vector. Target bit moved directly to
position 4 via single shift — saves one instruction per qh
extraction.
- Before: 1.5 t/s cold (Round 16)
- After: 2.1 t/s cold (+40%), 7.9 t/s warm (+5-10× after cache warm)
| Quant | File | RSS | Decode (warm) |
|---|---|---|---|
| IQ2_XXS | 10.0 GB | ~6.5 GB | 16.1 t/s |
| IQ3_XXS | 12.3 GB | ~6.5 GB | 14.6 t/s |
| Q3_K_S | 14.3 GB | 5.24 GB | 14.3 t/s |
| IQ4_XS | 16.5 GB | 7.25 GB | 10.6 t/s |
| Q5_K_M | 24.6 GB | 9.65 GB | 7.9 t/s |
vs llama.cpp CPU 5.1 t/s (Q3_K_S): 2.8-3.2× faster across tiers. llama.cpp can't load Q5_K_M on 16 GB Mac at all.
- Layer prefetch pipelining (Round 15): `__builtin_prefetch` on next-layer non-expert weights during current MoE compute. Neutral on fits-in-RAM quants (Q3_K_S), positive on Q5_K_M page-cache pressure. TLB priming benefit.
- Dead LRU infrastructure removed (Round 13): −219 LOC of unreachable Q8 cache code and its support chain. Eliminated split-source vs `quant.h` drift (quant.h already shipped as no-op stubs).
- Full score.sh first run this session: 0.9979 / 1.0000 (99.8%) — new all-time high. Previously `--quick` hid quality/performance/integration dimensions (all actually at 100%).
13/13 test_models.sh PASS (added Q5_K_M tier in Round 21). Rounds 18 (2-row register pressure, −14%) and 19 (per-dispatch madvise, −70%) attempted and rolled back — both would now be auto-caught by the regression suite.
feedback_madvise_willneed_per_call.md: per-dispatch madvise on Apple Silicon is a trap (VM contention on resident pages). Only use at load time.
- 21 /grow rounds completed
- Net code change: −180 LOC (Round 13 cleanup vs Round 12/17/21 adds)
- Score: 0.9946 → 0.9979
- 5-tier Qwen3.6 coverage on 16 GB Mac (IQ2/IQ3/Q3/IQ4/Q5)
Qwen3.6-35B-A3B MoE prefill on 16 GB M1 Pro: 4.4 → 6.1 t/s (+39%), wall -29%, CPU work -41%.
The batched prefill path is now default-on. Opt out via TQ_NO_MOE_BATCH=1.
| Step | Wall | Prefill | vs baseline |
|---|---|---|---|
| baseline (per-token) | 103 s | 4.4 t/s | — |
| + 3e driver | 92 s | 4.9 t/s | +11% |
| + 3f cross-expert parallel | 85 s | 5.4 t/s | +23% |
| + 3h batched shared expert | 82 s | 5.5 t/s | +25% |
| + 3g dynamic FCFS queue | 73 s | 6.1 t/s | +39% |
With 951-token prompt (more favorable amortization): baseline 11.4 → 13.4 t/s (+17% over prior steps alone).
- `tq_batched_matmul_q8_0` (b7c42dd) — Q8_0 batched kernel, Qwen3.6 non-expert attn path.
- `fused_dot_iq3_xxs_int8_batched` (8dd4920, fixed 61d7ce8 — missing `qs += 8` bug caught by sanity) — 35.6% of Qwen3.6 prefill compute.
- `fused_dot_iq3_s_int8_batched` (30428f3) — 19% compute.
- `fused_dot_iq4_xs_int8_batched` (30428f3) — TBL-16 codebook.
- `fused_dot_q3_k_int8_batched` (f9e5af1) — for pure Q3_K MoE models.
- `tq_moe_forward_batch` (9fb237d) — 3-phase dispatch: batch-route → inverse index → expert-wise batched gather/matmul/scatter.
- `tq_forward_batch_moe_hybrid` (627b65e, f255b46) — Qwen3.6-style driver: per-token DeltaNet + per-token self-attn + batched MoE FFN.
- Cross-expert parallel dispatch (e5f721a) — 8 workers, one expert each, private scatter buffer reduced serially.
- Batched shared expert (3a34cbf) — `tq_batched_matmul_q4` × 3 (gate/up/down) for Q4-converted shared experts.
- `tq_tp_run_dynamic` (f195a78) — FCFS atomic-counter thread-pool dispatch, flattens expert-workload stragglers (see the sketch after this list). Opt-in via `TQ_MOE_BATCH_DYNAMIC=1`.
- `TQ_MOE_BATCH_SELFTEST=1` — N=1 sanity mode proves numerical equivalence (max_abs_diff 1.2e-7).
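A minimal sketch of FCFS dynamic dispatch with a C11 atomic counter (illustrative, not the engine's tq_tp_run_dynamic): every worker claims the next unclaimed index, so one slow expert can't leave the other threads idle.

```c
#include <stdatomic.h>
#include <pthread.h>

typedef struct {
    atomic_int next;                    /* next unclaimed task index */
    int n_tasks;
    void (*run)(int task, void *arg);   /* per-task work function    */
    void *arg;
} fcfs_queue_t;

/* Worker body: claim task indices until the queue is drained. Heavy
 * tasks (busy experts) naturally absorb fewer claims; light tasks
 * backfill, flattening stragglers. */
static void *fcfs_worker(void *p) {
    fcfs_queue_t *q = (fcfs_queue_t *)p;
    for (;;) {
        int i = atomic_fetch_add_explicit(&q->next, 1, memory_order_relaxed);
        if (i >= q->n_tasks) break;
        q->run(i, q->arg);
    }
    return NULL;
}
```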
- `TQ_MOE_BATCH=1` is now default-on (3f74f3e). Opt out with `TQ_NO_MOE_BATCH=1`.
- `fused_dot_iq3_xxs_int8_batched` missed the `qs += 8` advance per sub-block (61d7ce8). Same precedent as the single-query kernel bug. Caught by sanity infrastructure before release.
- `scripts/test_models.sh`: 12/12 PASS throughout all 7 commits.
- Sanity max_abs_diff: N=1 path = 1.2e-7, N=7 path ≤ 2e-4 (well under 1e-3 spec).
- Decode unchanged (13+ t/s warm peak on Qwen3.6).
- Dynamic FCFS queue (`TQ_MOE_BATCH_DYNAMIC`) is opt-in pending broader model coverage verification. Measured +17% when activated.
- Non-q4_converted shared experts fall back to per-token (not triggered on current Qwen3.6 UD quants).
- Decode path remains per-token (batched only affects prefill).
- v0.14.0: Q6_K NEON int8 (+115% on Q4_K_M models).
- v0.14.1: Q3 tier breakthrough (Q3_K/IQ3_XXS/IQ4_XS int8 kernels).
- v0.14.2: RoPE TLS sin/cos cache across all 4 branches; SwiGLU fast_exp_neon.
- v0.14.3: Q3_K_S tier on Qwen3.6 (RSS 5.24 GB on 16 GB Mac).
- v0.15.0 (this): batched MoE prefill default-on.
Cumulative Qwen3.6-35B-A3B arc (session start → v0.15.0):
- Decode: 3.08 → 16.1 t/s (IQ2_XXS peak); 2.8× faster than llama.cpp CPU.
- Prefill: 5 → 6.1 t/s at j=8 (+22%); 13.4 t/s at longer prompt (+17% over prior).
- RSS: 12 GB → 5.24 GB (TQ_NO_MLOCK).
Unsloth's UD-Q3_K_S (3.5 bpw, 14.3 GB) variant measured end-to-end after the Q3_K int8 kernel landed earlier in the day. Outcome: smallest RSS, best quality, same speed class. Recommended Qwen3.6 variant on 16 GB Macs as of this release.
Measurements on M1 Pro 16 GB, CPU 8t, TQ_NO_METAL=1 TQ_NO_MLOCK=1, warm 3-run peak:
| Variant | bpw | Disk | RSS | Decode | llama.cpp CPU | Speedup |
|---|---|---|---|---|---|---|
| UD-IQ2_XXS | 2.05 | 10.0 GB | 6.54 GB | 16.1 t/s | 5.07 | 3.2× |
| UD-IQ3_XXS | 3.06 | 12.3 GB | 6.82 GB | 14.6 t/s | 5.23 | 2.8× |
| UD-Q3_K_S | 3.5 | 14.3 GB | 5.24 GB | 14.3 t/s | 5.11 | 2.8× |
Quality step: "William Shakespeare wrote Hamlet" answers correctly on UD-Q3_K_S where UD-IQ3_XXS drifts. Decode prose reaches "Jack loved to play with his guitar" vs IQ2's "Jack lived in the small village of the mountains".
Why RSS is smaller despite higher bpw: Under TQ_NO_MLOCK=1 the OS page cache holds only hot expert pages. Q3_K_S uses uniform 256-element Q3_K blocks; UD-IQ3_XXS mixes IQ3_XXS + IQ3_S + IQ4_XS + Q4_K + Q6_K block sizes. Uniform layout → fewer distinct pages touched per matmul → smaller page-cache working set.
- RoPE TLS sin/cos cache extended to all remaining branches: Phi-3 LongRoPE (commit `e00ff21`, key includes `factors*` pointer) and Gemma 4 NeoX (commit `5a8c093`, key includes rope_base for sliding/global distinction). The earlier Qwen 3.x partial-rotary cache (`b4d7807`) and Llama / Qwen 2.5 `tq_rope()` cache (`27c6707`) remain. Only remaining uncached variant: learned `rope_freqs[]` (Gemma 4 global-attention freq_factors) — deferred.
- `bench/results/2026-04-18_q3_k_s_tier.md` — full Q3_K_S vs IQ3_XXS vs IQ2_XXS methodology + reproduce.
scripts/test_models.sh: 12/12 PASS — RoPE cache extensions verified; Q3_K kernel verified end-to-end on Q3_K_S as well.
```
# Qwen3.6-35B-A3B on 16 GB Mac (best quality + lowest RSS)
TQ_NO_METAL=1 TQ_NO_MLOCK=1 ./build/quant \
  models/Qwen3.6-35B-A3B-UD-Q3_K_S.gguf \
  --chat -p "..." -n 80 -T 0.7 -j 8
```

Two structural perf fixes discovered in the post-Q3 profile. Neither is a headline speedup on its own (Qwen3.6 decode is weight-read-bound), but both lower the instruction-level ceiling for any future fusion / pipelining work.
- RoPE TLS sin/cos cache (`src/engine/tq_transformer.c`, partial-rotary branch). Keyed on `(pos, rope_base, rope_dim)` — those are identical across all heads and all layers in one forward pass on models with `partial_rotary_factor` (every Qwen 3.x model). First layer pays `powf + cosf + sinf` per pair; remaining ~179 head-layer combinations do array reads. ~180× reduction in libc transcendental calls per token on Qwen3.6. (See the sketch after this list.)
- `fast_exp_neon` — lifts the Schraudolph bit-twiddle exp into a single NEON FMA + `vcvtq_s32_f32` + reinterpret, replacing the per-lane scalar round-trip in `swiglu_fused`. Halves the instruction count in the 8-element SwiGLU tile.
scripts/test_models.sh: 12/12 PASS.
b4d7807 (RoPE cache), d4c0fc6 (SwiGLU NEON).
Q3 weight-class unlocked on 16 GB Mac. Three more scalar fused_dot_* kernels replaced with vdotq_s32 int8 fast paths. Primary target: raise Qwen3.6-35B-A3B quantization from IQ2_XXS (2.05 bpw) to UD-IQ3_XXS (3.06 bpw) for a measurable quality step-up, without losing the speed lead over llama.cpp.
Measured on Qwen3.6-35B-A3B-UD-IQ3_XXS (M1 Pro 16 GB, CPU 8 threads, TQ_NO_MLOCK=1, warm peak):
| iteration | t/s | vs llama.cpp CPU |
|---|---|---|
| scalar baseline (new kernels disabled) | 7.9 | 1.5× |
| + Q3_K int8 | 12.2 | 2.3× |
| + IQ3_XXS int8 (post qs-advance fix) | 12.8 | 2.4× |
| + IQ4_XS int8 (TBL lookup) | 14.6 | 2.8× |
| llama.cpp CPU 8t reference | 5.23 | — |
RSS: 6.82 GB on 16 GB Mac (vs 6.54 GB for IQ2_XXS — only +0.28 GB for the quality step-up). Coherent decode persists ~2× longer before drift compared with IQ2_XXS.
- Q3_K × int8 NEON fast path (`fused_dot_q3_k_int8`). Scalar `fused_dot_q3_k` was latent since initial Q3_K support. 16 × `vdotq_s32` per 256-element block. `vbicq_u8` resolves the `(hmask_bit ? 0 : 4)` branch without a conditional. Env `TQ_Q3K_NOINT=1` reverts. Covers Q3_K_S / Q3_K_M / Q3_K_L / Q3_K_XL.
- IQ3_XXS × int8 NEON fast path (`fused_dot_iq3_xxs_int8`). Previous kernel was partial NEON (float FMA end). Reuses `iq3s_build8` helper from IQ3_S int8 path. Env `TQ_IQ3XXS_NOINT=1` reverts.
- IQ4_XS × int8 NEON fast path (`fused_dot_iq4_xs_int8`). `kvalues_iq4nl[16]` codebook fits in one ARM NEON TBL register — a single `vqtbl1q_s8` does 16 parallel byte lookups per sub-block, the cleanest possible NEON kernel shape. Env `TQ_IQ4XS_NOINT=1` reverts.
- `scripts/qwen36_quality_probe.sh` — factual Q&A (10 prompts, greedy T=0) + 100-token coherence probe + 3 multi-turn probes. Used for Q3 vs IQ2 A/B quality comparison.
- `bench/results/2026-04-18_q3_breakthrough.md` — full methodology, bug-caught-during-A/B writeup, reproduce commands.
- `fused_dot_iq3_xxs_int8` missing `qs += 8;` between sub-blocks (caught during A/B before commit). Without the advance, all 8 sub-blocks read the first sub-block's grid indices → 0/10 factual and digit-soup decode. A/B toggle (`TQ_IQ3XXS_NOINT=1` vs new kernel) isolated the bug in minutes. Precedent documented in commit `11e3c32`.
scripts/test_models.sh: 12/12 PASS across the full model suite (no Q3-family model is in the regression suite, so these kernels were validated via Qwen3.6 greedy-decode + A/B against scalar).
MoE & Q4_K_M throughput breakthrough — Qwen3.6-35B-A3B (MoE) now runs at 16.1 t/s on a 16 GB M1 Pro (CPU-only), 3.2× faster than llama.cpp's CPU path (5.07 t/s) at 35% lower RSS (6.5 GB vs ~10 GB). Every Q4_K_M model in the suite also picked up +115% to +180% decode throughput from a single kernel fix.
All three changes were driven by sample-based profiling done after model load (earlier samples were dominated by the single-threaded Q4 load conversion, which hid the real hot path).
- Q6_K × int8 NEON fast path (`fused_dot_q6_k_int8`, `src/engine/tq_gguf_quants.c`). The existing `fused_dot_q6_k` is pure scalar; Q4_K_M embeds Q6_K for `attention.wo` and `ffn_down`, so it silently dominated decode on every Q4_K_M model. New kernel pre-quantizes activation to int8 (Q8_0 layout) and issues one `vdotq_s32` per 16-element sub-block. Env `TQ_Q6K_NOINT=1` reverts for A/B.
- IQ3_S × int8 NEON fast path (`fused_dot_iq3_s_int8`). UD-IQ2_XXS quantizations (e.g., Qwen3.6) embed IQ3_S for some critical layers; same scalar-to-vdotq_s32 fix. Env `TQ_IQ3S_NOINT=1` reverts.
- MoE router NEON vectorize (`tq_moe_route` in `src/engine/tq_moe.c`). Previous scalar `for e in 256 experts: dot(hidden, row)` loop replaced with 4-accumulator FMA (16 floats/cycle peak; see the sketch after this list). Scratch buffers moved to `static __thread` — eliminates per-call `malloc`/`calloc` (60 allocs/token on Qwen3.6).
- `TQ_NO_MLOCK=1` environment variable — for MoE models on memory-constrained hosts, skips `mlock()` and uses `MADV_WILLNEED` instead. On a 16 GB M1 Pro this is both faster (OS page cache LRU tracks the small hot-expert set better than mlock pinning the whole 10 GB) and saves ~5 GB RSS.
- pthread QoS hint (`QOS_CLASS_USER_INTERACTIVE`) applied to thread-pool workers on macOS to prefer P-cores over E-cores on asymmetric Apple Silicon (M-series Pro / Max / Ultra).
- Dual-accumulator pair in `matmul_q4_rows` inner loop (kept for kernel readability even though it did not move the needle on M1 — FMA throughput was not the bound).
- `bench/results/2026-04-18_moe_and_q4_k_m_breakthrough.md` — full methodology, per-iteration numbers, and reproduce commands.
Measured on M1 Pro 16 GB, macOS 24, CPU-only (TQ_NO_METAL=1), 8 threads, warm 3-run peak, greedy decode.
| Model | before | after | vs llama.cpp CPU 8t |
|---|---|---|---|
| Qwen3.6-35B-A3B-UD-IQ2_XXS | 3.08 → 7.8 | 16.1 | 5.07 — 3.2× faster |
| Qwen3.5-4B Q4_K_M | 5.0 | 14.1 | 19.9 (71%) |
| Phi-3.5-mini Q4_K_M | 6.2 | 14.1 | 26.7 (53%) |
Qwen3.6 RSS: 12.0 GB → 6.5 GB with TQ_NO_MLOCK=1.
- `fused_dot_q6_k` scalar performance regression (latent since initial Q6_K support). Sample profiling attributed its cost to "matmul" in the wall-clock profile, hiding it for multiple releases. Fixed by the int8 fast path above.
- `tq_moe_route` hot-path heap churn — `malloc(num_experts)` and `calloc(num_experts)` on every router call (per layer, per token). Now thread-local.
```
# 16 GB Mac, Qwen3.6-35B-A3B MoE (best speed AND lowest RSS)
TQ_NO_METAL=1 TQ_NO_MLOCK=1 ./build/quant \
  models/Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf \
  --chat -p "..." -n 60 -T 0.7 -j 8

# Q4_K_M dense models (Phi-3.5, Qwen3.5-4B, Llama family)
TQ_NO_METAL=1 ./build/quant <model-Q4_K_M.gguf> -p "..." -n 50 -T 0 -j 8
```

`scripts/test_models.sh`: 12/12 PASS across Llama 3.1/3.2, Qwen 2.5/3.5/3.6, Phi-3.5, Gemma-4.
Phi-3 / Phi-3.5 architecture fully supported — the highest-value model quant.cpp was missing. Phi-3.5-mini (3.8B params, vocab 32K) is now the recommended default, delivering the best speed/quality combo:
```
pip install quantcpp
quantcpp   # downloads Phi-3.5-mini Q8_0 (~3.8 GB), starts chat
```

- Phi-3 / Phi-3.5 architecture support — fused QKV projection, fused gate+up FFN, LongRoPE with NeoX-style rotation. Validated end-to-end on Phi-3.5-mini-instruct-Q4_K_M and Q8_0.
- Phi-3.5-mini as default model — replaces SmolLM2-1.7B as the recommended model. Q8_0 variant is 2x faster than Q4_K_M on Apple Silicon NEON (3.0 vs 1.5 tok/s).
- ChatML template marker filter — 32-byte lookahead filter in `chat_accum_callback` catches BPE-split markers (`<|im_start|>`, `<|im_end|>`, `<end_of_turn>` etc.) across token boundaries. Prevents template tokens from leaking into chat output.
- Unsupported architecture hard-fail — loading a model with fused QKV that quant.cpp can't handle (e.g., before Phi-3 support) now fails fast with a clear error message instead of silently producing garbage tokens.
- quant-server-unified — new server binary built directly on `quant.h` (single-header amalgamation). Eliminates divergence between `quant.h` and `libturboquant` split sources. CLI `quantcpp serve` now prefers this binary.
- SmolLM2-1.7B and Phi-3.5-mini added to `_MODEL_REGISTRY` with CLI aliases (`smollm2`, `phi3.5`, `phi-3.5-mini` etc.).
- `ChatContextOverflow` exception — Python `Model.chat()` now raises a typed exception on context overflow instead of silently returning empty output.
- `docs/supported_models.md` — architecture compatibility matrix, vocab-size speed guide, model selection recommendations.
- `tools/gguf_inspect.c` — GGUF tensor/metadata inspector for architecture debugging.
- 16 chat-cache bugs eliminated (PRs #52, #53) — two audit passes found hidden bugs in KV cache prefix matching, text accumulation, server session management, WASM state handling.
- `tq_generate_continue` overflow — sliding-window truncation silently desynced `cached_text` from KV positions → garbage on long histories. Now returns `-2` on overflow.
- `chat_accum_callback` realloc failure — silently dropped tokens AND skipped user callback. Now always passes tokens through; marks accumulator tainted.
- Server error handling — `gen_rc == -1` produced HTTP 200 with empty content; now returns HTTP 500 with error JSON. Streaming sends `finish_reason: "error"`.
- Server session kv_type mismatch — reusing a session ID with different `kv_type`/`value_quant_bits` corrupted KV blocks. Now detects and rebuilds.
- WASM `wasm_load_model` — didn't reset `g_generating` flag → stuck busy after interrupted run.
- `rep_penalty` in fast-path — silently ignored in `tq_generate_chat_text`'s fast path (slow path applied it). Now consistent.
- BOS token for Phi-3/Llama — `<s>` added to BOS lookup chain. Phi-3 produces garbage without BOS.
- Python CLI overflow handling — `cmd_run` catches `ChatContextOverflow`, drops oldest turn, retries.
- Default model: `Llama-3.2-1B` → `SmolLM2-1.7B` → `Phi-3.5-mini` Q8_0.
- CLI examples and README quickstart updated to use Phi-3.5-mini.
- Metal GPU dispatch disabled for fused-tensor models (CPU is faster for sub-4B).
- Phi-3.5-mini Q8_0: 3.0 tok/s on Apple M3 (2x faster than Q4_K_M).
- Chat KV cache reuse: turn N+1 prefill is O(new tokens), not O(history). ~50% latency reduction on multi-turn chat.
Real-model validation, adaptive compression, and information-theoretic foundations. Every theoretical claim is now backed by measured data from actual model inference.
- Perplexity pipeline (`--ppl <file>`): Teacher-forced PPL measurement. Gemma 4B results: 1-bit K + Q4 V PPL = 36.00 vs FP16 PPL = 35.99 — +0.03% degradation (effectively lossless).
- Formal unbiasedness (`tests/test_unbiased.cpp`): 100K random vector pairs prove all quant.cpp types have < 0.2% relative bias. The "unbiased inner product" claim is empirically verified.
- Activation profiling (`--profile-kv`): Per-layer pre/post-RHT distribution statistics. RHT reduces kurtosis from 10-99 to 3.9-7.9 and eliminates skewness. Honest finding: post-RHT is not perfectly Gaussian.
- Memory bandwidth benchmark (`--bench-memory`): tok/s vs context length across KV types.
- Per-layer bit recommendation (`--recommend`): Profiles activation kurtosis, recommends 1-bit or 3-bit per layer. Gemma 270M: average 2.0 bits (vs 3.0 uniform) → 33% memory savings potential.
- Attention entropy analysis (`--attn-entropy`): Per-head Shannon entropy identifies sharp vs diffuse attention patterns.
- V highres window (`-V N`): Recent N tokens stored as FP16 alongside Q4/Q2 V. Test showed Q4 V already near-lossless (PPL +0.03%), so hybrid adds no measurable benefit.
- Online codebook calibration (`--calibrate`): Lloyd-Max iteration on real activation data. MSE improved 49.7% over default N(0,1) codebook — proves model-specific calibration matters.
- Fused Q4 domain attention: Weighted sum computed directly from packed nibbles without a dequantize buffer. NEON `vfmaq_f32` path. Reduces memory traffic.
- Prefill benchmark (`--bench-prefill`): Measures KV quantization overhead during prompt processing.
- CoW benchmark (`bench/cow_bench.sh`): Analytical memory savings for shared-prefix serving.
- Auto compression profile (`bench/auto_profile.sh`): Full pipeline: profile → recommend → calibrate → JSON output.
- Rate-distortion bounds (`tests/test_rate_distortion.cpp`): Computes info-theoretic minimum MSE at each bit-width. Q4 uniform: 2.41x gap. Lloyd-Max: < 0.15 bits wasted.
- Cumulative error analysis (`tests/test_cumulative_error.cpp`): 16-layer simulation shows errors grow sub-linearly. Cosine similarity after 16 layers: 0.998 (Q4), 0.951 (Q2).
| Metric | Value | Source |
|---|---|---|
| Gemma 4B PPL (uniform_4b) | 35.99 | --ppl |
| Gemma 4B PPL (1b K + Q4 V) | 36.00 (+0.03%) | --ppl |
| Gemma 4B PPL (1b K + Q2 V) | 42.23 (+17.3%) | --ppl |
| Unbiasedness (all types) | < 0.2% rel_bias | test_unbiased |
| Post-RHT kurtosis range | 3.9 – 7.9 | --profile-kv |
| Adaptive bit average | 2.0 bits (33% saving) | --recommend |
| Calibrated codebook MSE improvement | 49.7% | --calibrate |
| 16-layer cumulative cosine (Q4) | 0.998 | test_cumulative_error |
| Rate-distortion gap (Q4 uniform) | 2.41x | test_rate_distortion |
V cache quantization and expert-grade validation — total K+V compression reaches 4.9x (Q4) to 7.1x (Q2), with every claim backed by measured data.
- Q4 value quantization (`-v q4`): 4-bit per-block scale + packed nibbles. V compression 3.8x.
- Q2 value quantization (`-v q2`): 2-bit Lloyd-Max codebook. V compression 7.6x.
- FP16 value auto-enable: Values stored as FP16 when KV quantization is active (was FP32).
- Combined 1-bit K + Q4 V: 27.62 KB/token, 4.9x total K+V (was 136 KB FP16).
- Combined 1-bit K + Q2 V: 19.12 KB/token, 7.1x total K+V.
- CLI flag `-v q4|q2|fp16` for value quantization control.
- Memory reporting (`-M`) shows K and V breakdown separately.
- NEON/scalar consistency (`tests/test_neon_scalar.cpp`): 14 tests verify every NEON path against a pure C reference — Q4 dequant, Q2 dequant, RHT butterfly, RoPE, matmul, RMSNorm, Hamming attention.
- Attention distribution (`tests/test_attention_distribution.cpp`): 8 tests measure cosine similarity (0.996/0.918/0.634), Spearman rank correlation, top-k overlap. Proves compression is non-trivial (random K = 0.089).
- Codebook theory (`tests/test_codebook_theory.cpp`): 5 tests verify Lloyd-Max centroids match N(0,1) literature values within 0.001, MSE within 1.18x of the information-theoretic optimal.
- Edge cases (`tests/test_edge_cases.cpp`): 29 tests — n=1 (single token), dim=0, NaN input, Inf input, all-same values, all-zero, n=10000 large sequence.
- Numerical stability: 4 tests for overflow-safe norm computation and NaN/Inf input guards.
- `bench/ablation_test.sh`: Divergence analysis at 50-300 tokens across KV types.
- `bench/long_quality_test.sh`: Coherence at 200/500/1000 tokens.
- `bench/sampling_test.sh`: Temperature sampling (T=0.3, T=0.7) comparison.
- `bench/quant_time_bench.sh`: Quantization timing wrapper.
- `bench/bench_kv_overhead.cpp`: Microbenchmark — uniform 148 ns, 1b 659 ns, 3b 11066 ns per vector.
- `bench/attention_dist_test.sh`: Attention distribution analysis wrapper.
- `scripts/sanitize.sh`: ASan + UBSan build and full test run.
- Q4 dequant NEON nibble interleaving bug: Lo/hi nibbles were written contiguously instead of interleaved, causing MSE 0.525 (300x worse than correct). Fixed with `vzip_u8` interleave (see the sketch after this list).
- QJL sign bias: `proj >= 0.0f` → `proj > 0.0f` across 11 occurrences (CPU, CUDA, Metal). Eliminates asymmetric bias at the zero projection boundary.
- Norm overflow: QJL norm computation now uses max-abs rescaling to prevent float overflow on large vectors.
- NaN/Inf input guard: Quantization functions zero-fill the output block on NaN/Inf input instead of producing undefined output.
- Thread safety: Global Q8 workspace (`g_q8_buf`) and sampler probability index (`g_probindex`) protected by mutex against concurrent realloc races.
- RHT NEON vectorized: Walsh-Hadamard butterfly uses `float32x4_t` for stages with len >= 4.
- Q4 dequant NEON restored: Properly vectorized with `vzip_u8` after the bug fix (was scalar fallback).
- Test suite count: 23 → 26. Edge case count: 16 → 29.
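A minimal sketch of the interleave fix (assuming the Q4 block stores element 2i in the low nibble and element 2i+1 in the high nibble of each byte):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Unpack 8 packed Q4 bytes (16 nibbles) in ELEMENT order. Writing the 8
 * low nibbles then the 8 high nibbles contiguously scrambles the order;
 * vzip_u8 interleaves them back to lo0,hi0,lo1,hi1,... */
static void q4_unpack_interleaved(const uint8_t *packed, uint8_t *out16) {
    uint8x8_t b  = vld1_u8(packed);
    uint8x8_t lo = vand_u8(b, vdup_n_u8(0x0F));  /* elements 0,2,4,... */
    uint8x8_t hi = vshr_n_u8(b, 4);              /* elements 1,3,5,... */
    uint8x8x2_t z = vzip_u8(lo, hi);             /* interleave pairs   */
    vst1_u8(out16,     z.val[0]);
    vst1_u8(out16 + 8, z.val[1]);
}
```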
| Metric | Value | Source |
|---|---|---|
| Total K+V compression (1b K + Q4 V) | 4.9x | quant -M |
| Total K+V compression (1b K + Q2 V) | 7.1x | quant -M |
| 32K context savings (Q4 V) | 3.4 GB | calculated |
| Attention cosine (uniform_4b) | 0.996 | test_attention_distribution |
| Attention cosine (turbo_kv_3b) | 0.918 | test_attention_distribution |
| Attention cosine (turbo_kv_1b) | 0.634 (= 2/pi) | test_attention_distribution |
| Random K cosine | 0.089 | test_attention_distribution |
| Lloyd-Max MSE vs theory | < 1.18x | test_codebook_theory |
| RHT overhead | 147 ns/vec | bench_kv_overhead |
| 1-bit attention | 1.2 ns/key | bench_kv_overhead |
| ASan + UBSan | 26/26 clean | scripts/sanitize.sh |
Initial release — pure C inference engine with quant.cpp KV cache compression. 1-bit keys, 10.7x key compression, byte-identical greedy output at 100 tokens.
- Complete transformer inference engine in pure C11 (10,000+ lines).
- Multi-architecture support: Gemma 3 (sliding window, GeGLU, dual RoPE) + Qwen3.5 (DeltaNet hybrid).
- Multi-shard safetensors loading (Gemma 4B = 2 shards, 883 tensors).
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect.
- TQM binary format: pre-quantized mmap, instant loading.
- quant.cpp KV 1-bit: Sign-only after RHT. XOR + popcount attention (NEON `vcntq_u8`; see the sketch after this list).
- quant.cpp KV 3-bit: 2-bit Lloyd-Max codebook + 1-bit QJL residual.
- quant.cpp KV 4-bit: 3-bit codebook + 1-bit QJL.
- Uniform 4-bit / 2-bit: Standard min-max quantization.
- PolarQuant: Polar coordinate (theta + radius) quantization.
- QJL: Quantized Johnson-Lindenstrauss sign hash.
- Mixed / quant.cpp base: Combined polar + QJL.
- Q4 weight quantization (4-bit per-block).
- Q2 weight quantization (2-bit Lloyd-Max codebook, Q2xQ8 integer matmul).
- BF16 weight support.
- NEON vectorized: 2-row matmul batching, fused dot products, Hamming distance.
- Thread pool with configurable thread count.
- Apple Silicon optimized.
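As referenced in the 1-bit KV item above, a minimal sketch of the XOR + popcount Hamming kernel for sign-packed keys (illustrative; the engine works on its own block layout):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hamming distance between two bit-packed sign vectors: XOR, then
 * per-byte popcount (vcntq_u8), then a horizontal add. 16 bytes per
 * iteration; the per-register popcount sum (<= 128) fits the u8 add. */
static int hamming_bits(const uint8_t *a, const uint8_t *b, int nbytes) {
    int dist = 0, i = 0;
    for (; i + 16 <= nbytes; i += 16) {
        uint8x16_t x = veorq_u8(vld1q_u8(a + i), vld1q_u8(b + i));
        dist += vaddvq_u8(vcntq_u8(x));
    }
    for (; i < nbytes; i++)                  /* scalar tail */
        dist += __builtin_popcount((unsigned)(a[i] ^ b[i]));
    return dist;
}
```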
- 30/30 byte-identical greedy matches (K-only, 100 tokens, 10 diverse prompts).
- 23 test suites (Google Test).
- Qwen3.5: 0.999 cosine similarity vs PyTorch reference.
- Gemma 270M: per-layer exact match.
| Model | Params | Speed (Q4, 6T) |
|---|---|---|
| Gemma 3 4B | 4B | 20.2 tok/s |
| Qwen3.5-0.8B | 752M | 80.1 tok/s |
| Gemma 3 270M | 270M | 176 tok/s |
```
v{MAJOR}.{MINOR}.{PATCH}
MAJOR: Breaking API changes
MINOR: New features, backward-compatible
PATCH: Bug fixes, performance improvements
```
- Update version in `CMakeLists.txt` (`project(turboquant VERSION x.y.z)`)
- Add release section to this file (newest first)
- Update badge version in `README.md` and `README.ko.md`
- Run full validation:

```
cmake --build build -j$(nproc) && ctest --test-dir build
bash scripts/sanitize.sh
./build/quant gemma3-4b.tqm -p "The capital of France is" -j 6 -n 20 -T 0.0 -k turbo_kv_1b -v q4
```

- Tag: `git tag -a v0.x.0 -m "Release v0.x.0"`
- Push: `git push origin v0.x.0`
- Create GitHub release with this section's content
- Added: New features, new tests, new benchmarks
- Fixed: Bug fixes (with root cause and impact)
- Changed: Behavior changes, performance improvements
- Measured Results: Table of key metrics with source (test name or script)
- Breaking: API changes that require user code modification