# Supported Models — Honest Tier Matrix

*Last updated: 2026-04-21 (post v0.27.0)*

This page groups supported GGUF models into tiers based on **measured**
quality and stability on a 16 GB M1 Pro Mac, not marketing claims. If a
model is not listed here, we have not validated it end-to-end and you
should treat it as unsupported.

## Tier 1 — Production (coherent + stable long-form)

Works out of the box for chat, code, and multi-hundred-token generation.
The regression suite covers all of these models.

| Model | Quant | Decode | TTFT | Notes |
|---|---|---:|---:|---|
| Llama-3.2-1B-Instruct | Q8_0 | 53-57 t/s | 0.12s | Best speed, small vocab |
| Llama-3.2-3B-Instruct | Q8→Q4 | 26-29 t/s | 0.97s | Balanced |
| Phi-3.5-mini-instruct | Q4_K_M | 14-16 t/s | 0.95-2.3s | SentencePiece; best chat quality at ~3.8B |
| Qwen3-0.6B | Q4_K_M | 50-60 t/s | 0.17s | Smallest Qwen3 |
| Gemma-4-e2b | Q8 | 24-25 t/s | 0.46s | Dual-FFN + PLE |

**Recommended for**: user-facing chat, code completion, the embedding server,
and any single-call or short multi-turn workload.

## Tier 2 — Stable at short-to-medium generation

Output stays coherent up to roughly 200-400 tokens; long-form quality
degrades gradually but remains usable.

| Model | Quant | Decode | Notes |
|---|---|---:|---|
| Qwen3.5-4B | Q4_K_M | 18-23 t/s | Dense DeltaNet hybrid; coherent on a 561-word prompt |
| Gemma-4-e4b | Q8 | ~3.5 t/s | Slow but stable; research use |

**Recommended for**: one-shot Q&A and short essays. Avoid long narrative
generation without explicit length guardrails.

## Tier 3 — Experimental / short-generation only

These models run and produce fluent English, but hit repetition loops on
long generation. Use them with user-facing guards (`--rep-penalty`, a
shorter `-n`).

| Model | Quant | Decode | Practical config | Drift boundary |
|---|---|---:|---|---:|
| Qwen3.6-35B-A3B | UD-IQ4_XS | 12-16 t/s warm | `--rep-penalty 1.3` | ~117 tok default; ~200 tok with rep-penalty |
| Qwen3.6-35B-A3B | UD-Q5_K_M | 10-13 t/s warm | `--rep-penalty 1.3` | 200+ tok (hits the `-n` budget; graceful tail degrade) |
| Qwen3.6-35B-A3B | UD-Q3_K_S | 14 t/s warm | shorter `-n` | ~100 tok |

**Status**: the 117-token repetition cliff on Qwen3.6-35B is a
distributed, multi-layer DeltaNet-state accumulation phenomenon (see
`.claude/state.md` R16-R19); no single-line fix applies. The
`--rep-penalty 1.3` mitigation is the best user-facing option today.

**Recommended for**: short narrative continuations, summarization of
moderately sized documents, and technical Q&A. Not for >200-token
open-ended generation without guards.

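The `--rep-penalty 1.3` mitigation follows the standard repetition-penalty
transform (as popularized by CTRL and llama.cpp-style samplers): logits of
tokens already present in the context are scaled so those tokens become less
likely. A minimal Python sketch of the idea (illustrative only, not this
project's sampler code):

```python
def apply_rep_penalty(logits, seen_token_ids, penalty=1.3):
    """Scale down the logits of tokens that already appeared.

    Positive logits are divided by `penalty` and negative logits are
    multiplied by it, so a seen token always becomes less likely.
    """
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

# A repeated token at logit 2.6 drops below an unseen token at 2.1,
# breaking the greedy tie that would otherwise loop.
print(apply_rep_penalty([2.6, 2.1, -1.0], seen_token_ids=[0, 2]))
# → [2.0, 2.1, -1.3]
```

Stronger penalties push the drift boundary further out but increasingly
distort common function words, which is why values near 1.3 are a typical
compromise.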
## Not Yet Tested / Not Supported

- Qwen2.5 family — likely works (the BPE fix applies) but not in the
  regression suite
- Mistral / Mixtral — no GGUF loader path exercised
- Gemma-4 26B variants — known to fail on the Metal path (auto-CPU
  fallback works but is slow on 16 GB)

## BPE UTF-8 (v0.27.0, all tiers)

Until v0.27.0, international text (accents, CJK, Cyrillic, byte-fallback
emoji) was silently double-encoded on both the encode and decode paths for
every Llama-3 and Qwen3-family model. If you were running a pre-v0.27.0
build on non-English prompts, please upgrade — prior outputs were drawn
from a different token distribution than the one the model was trained on.

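For intuition, this failure mode is ordinary UTF-8 double-encoding: bytes
that are already UTF-8 get treated as single-byte characters and encoded
again. A minimal Python illustration of the symptom (not the project's
tokenizer code; Latin-1 stands in for the byte-to-character confusion):

```python
text = "café"                                   # 'é' is 0xC3 0xA9 in UTF-8
correct = text.encode("utf-8")

# Re-encode the UTF-8 bytes as if each byte were its own character:
doubled = correct.decode("latin-1").encode("utf-8")

print(correct)                  # b'caf\xc3\xa9'
print(doubled)                  # b'caf\xc3\x83\xc2\xa9' (extra byte per accent)
print(doubled.decode("utf-8"))  # cafÃ© -- the classic mojibake signature
```

Every non-ASCII character gains bytes on each round trip, so the model sees
token sequences it was never trained on, which is the distribution shift
the upgrade note above warns about.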
See `bench/results/2026-04-21_bpe_utf8_fix_proof.md` for the end-to-end
proof and the 11/11 HF parity measurements.

## Verification command

Reproducible regression run:

```bash
bash scripts/test_models.sh   # 15 coherence-tier + 11 tokenizer UTF-8 checks
# → PASS: 15 / 11, FAIL: 0 / 0
```

All numbers above were measured on a 16 GB M1 Pro, CPU-only, warm cache.
Cold-start TTFT can be 3-10× higher on the first call.
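The table metrics follow the usual definitions: TTFT is the time to the
first emitted token, and decode rate counts the remaining tokens over the
remaining wall time. A sketch of that measurement (a hypothetical harness
that accepts any token iterator, not this project's bench code):

```python
import time

def measure(token_iter):
    """Return (ttft_seconds, decode_tokens_per_second) for a token stream.

    The decode rate excludes the first token, matching warm-cache
    reporting: the first token pays the prefill cost, the rest
    measure pure decode throughput.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()
    end = time.perf_counter()
    if count == 0:
        return 0.0, 0.0
    ttft = first - start
    decode = (count - 1) / max(end - first, 1e-9)
    return ttft, decode
```

Cold-start numbers differ because the first call also pays model load and
cache warmup, all of which lands in TTFT under this definition.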