
Commit ad30813

unamedkr and claude committed
docs: Phase 2 tier classification — honest supported-models matrix
docs/supported_models_tier.md — measured tiers by observed behavior:

- Tier 1 (production): Llama-3.2 1B/3B, Phi-3.5-mini, Qwen3-0.6B, Gemma-4-e2b
- Tier 2 (medium-gen): Qwen3.5-4B, Gemma-4-e4b
- Tier 3 (experimental): Qwen3.6-35B variants (117-tok drift, use --rep-penalty 1.3)
- Not tested: Qwen2.5, Mistral, Gemma-4 26B

Each row has measured decode rate, TTFT, recommended quant, known drift boundary, and a practical-config note where applicable. Explicitly calls out the v0.27.0 BPE UTF-8 fix as applicable to every GPT-2-byte-BPE tier (which is most of them).

Addresses task #196 (tier classification doc). Closes the "what should I use" gap that the growth-strategy memory flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 18223d8 commit ad30813

1 file changed

Lines changed: 88 additions & 0 deletions

File tree

docs/supported_models_tier.md
@@ -0,0 +1,88 @@
# Supported Models — Honest Tier Matrix

*Last updated: 2026-04-21 (post v0.27.0)*

This page groups supported GGUF models into tiers based on **measured** quality and stability on a 16 GB M1 Pro Mac, not marketing claims. If a model isn't listed, we haven't validated it end-to-end and you should treat it as unsupported.

## Tier 1 — Production (coherent + stable long-form)

Works out of the box for chat, code, and multi-hundred-token generation. The regression suite covers all of these.

| Model | Quant | Decode | TTFT | Notes |
|---|---|---:|---:|---|
| Llama-3.2-1B-Instruct | Q8_0 | 53-57 t/s | 0.12s | Best speed, small vocab |
| Llama-3.2-3B-Instruct | Q8→Q4 | 26-29 t/s | 0.97s | Balanced |
| Phi-3.5-mini-instruct | Q4_K_M | 14-16 t/s | 0.95-2.3s | SentencePiece; best chat quality at ~3.8B |
| Qwen3-0.6B | Q4_K_M | 50-60 t/s | 0.17s | Smallest Qwen3 |
| Gemma-4-e2b | Q8 | 24-25 t/s | 0.46s | Dual-FFN + PLE |

**Recommended for**: user-facing chat, code completion, embedding server, any single-call or short multi-turn workload.
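For reference, a minimal sketch of a Tier 1 invocation. The binary name `llm-run` and the `-m`/`-p` flags are illustrative placeholders, not the project's confirmed CLI; only `-n` and `--rep-penalty` appear elsewhere in this doc.

```bash
# Sketch only: "llm-run" and the -m/-p flags are hypothetical placeholders;
# substitute your build's actual binary and flag names.
./llm-run \
  -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
  -p "Explain TTFT in two sentences." \
  -n 256   # Tier 1 models stay coherent well past this budget
```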
## Tier 2 — Stable at short-to-medium generation

Coherent output up to ~200-400 tokens; long-form quality degrades gradually but remains usable.

| Model | Quant | Decode | Notes |
|---|---|---:|---|
| Qwen3.5-4B | Q4_K_M | 18-23 t/s | Dense DeltaNet hybrid; 561-word prompt coherent |
| Gemma-4-e4b | Q8 | ~3.5 t/s | Slow but stable; research use |

**Recommended for**: one-shot Q&A, short essays. Avoid long narrative without explicit length guardrails.
## Tier 3 — Experimental / short-generation only

Runs and produces English, but hits repetition loops on long generation. Use with user-facing guards (`--rep-penalty`, shorter `-n`).

| Model | Quant | Decode | Practical config | Drift boundary |
|---|---|---:|---|---:|
| Qwen3.6-35B-A3B | UD-IQ4_XS | 12-16 t/s warm | `--rep-penalty 1.3` | ~117 tok default; ~200 tok with rep-penalty |
| Qwen3.6-35B-A3B | UD-Q5_K_M | 10-13 t/s warm | `--rep-penalty 1.3` | 200+ tok (hits -n budget, graceful tail degrade) |
| Qwen3.6-35B-A3B | UD-Q3_K_S | 14 t/s warm | shorter `-n` | ~100 tok |

**Status**: The 117-token repetition cliff on Qwen3.6-35B is a distributed multi-layer DeltaNet-state accumulation phenomenon (see `.claude/state.md` R16-R19). No single-line fix applies. The `--rep-penalty 1.3` mitigation is the best user-facing option today.

**Recommended for**: short narrative continuations, summarization of moderate documents, technical Q&A. Not for >200-token open-ended generation without guards.
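Putting both guards together gives the strongest Tier 3 config. The `--rep-penalty 1.3` value and the ~200-token budget come straight from the table above; the binary name and `-m`/`-p` flags remain placeholders.

```bash
# Hypothetical sketch of the Tier 3 guards from the table: rep-penalty
# pushes the repetition cliff from ~117 to ~200 tokens, and -n keeps
# generation inside that window.
./llm-run \
  -m models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  --rep-penalty 1.3 \
  -n 200 \
  -p "Summarize the design doc below in about 150 tokens: ..."
```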
## Not Yet Tested / Not Supported

- Qwen2.5 family — likely works (BPE fix applies) but not in the regression suite
- Mistral / Mixtral — no GGUF loader path exercised
- Gemma-4 26B variants — known to fail Metal path (auto-CPU works but slow on 16 GB)
## BPE UTF-8 (v0.27.0, all tiers)

Until v0.27.0, international text (accents, CJK, Cyrillic, byte-fallback emoji) was silently double-encoded on both encode and decode for every Llama-3 and Qwen3-family model. If you were running a pre-v0.27.0 build on non-English prompts, please upgrade — prior outputs were in a different token distribution than the model was trained on.

See `bench/results/2026-04-21_bpe_utf8_fix_proof.md` for the end-to-end proof and 11/11 HF parity measurements.
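A quick way to check whether your build predates the fix is a multi-script round-trip. This is a sketch only: the binary name and the `-m`/`-p` flags are placeholders, and the exact failure signature on your build may differ.

```bash
# Hypothetical smoke test: feed multi-byte text and inspect the decoded
# output. Pre-v0.27.0 builds double-encode, which typically surfaces as
# mojibake (e.g. "cafÃ©" instead of "café"); v0.27.0+ round-trips cleanly.
./llm-run \
  -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
  -p "Répète: café 東京 Привет 🚀" \
  -n 32
```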
## Verification command

Reproducible regression:

```bash
bash scripts/test_models.sh # 15 coherence-tier + 11 tokenizer UTF-8 checks
# → PASS: 15 / 11, FAIL: 0 / 0
```

All numbers above are from a 16 GB M1 Pro, CPU-only, warm cache. Cold-start TTFT can be 3-10× higher on the first call.
