Authoritative assignments for enum ggml_type and enum llama_ftype in
this fork. Every cherry-pick from a contributing fork MUST renumber to
match this table before landing on a branch.
This document is normative. If reality disagrees with this document, fix the code, not the document — unless the contract itself is being revised, in which case bump the version at the top and call out the change.
Five contributing forks have independently extended ggml_type in mutually
incompatible ways. Concrete known collisions:
| Slot | Mainline | TheTom (TQ-KV HEAD) | TheTom (alpha-scaling, stale) | Buun (master) | Carlosfundora (1-bit-turbo) | Turbo-tan (main) | ik_llama (main) |
|---|---|---|---|---|---|---|---|
| 41 | Q1_0 | Q1_0 (mainline-aligned) | TURBO3_0 | Q1_0 | (gap) | Q1_0 | Q1_0_G128 |
| 42 | — | TURBO2_0 | TURBO4_0 | TURBO3_0 | Q1_0 | — | — |
| 43 | — | TURBO3_0 | TURBO2_0 | TURBO4_0 | Q1_0_g128 (removed; slot returned to mainline reserve) | — | — |
| 44 | — | TURBO4_0 | TQ3_1S | TURBO2_0 | PLANAR3_0 | TQ3_1S (different layout from TheTom's) | — |
| 45 | — | TQ3_1S | TQ4_1S | TURBO3_TCQ | PLANAR4_0 | — | — |
| 46 | — | TQ4_1S | — | TURBO2_TCQ | ISO3_0 | TQ3_4S | — |
| 47 | — | — | — | — | ISO4_0 | — | — |
Note (2026-05-12, recon/06): TheTom HEAD branch is feature/turboquant-kv-cache; alpha-scaling is now stale and superseded. Between alpha-scaling and TQ-KV, the TURBO_*_0 trio was reordered (TURBO2/3/4 = 42/43/44 instead of 43/41/42) and the TQ_1S types shifted up by one slot. This affects the cherry-pick recipe's "FROM" mapping but not this fork's own slot assignments (60–95 zone unchanged).
GGUF files produced by any one fork are silently misread by any other. Cherry-picking without renumbering would propagate this hazard into this fork.
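To make the hazard concrete, the following minimal Python sketch copies slots 41 to 44 of the collision table into a lookup and reports how each fork would interpret the same on-disk ID. The fork keys and the helper names (FORK_TABLES, interpretations, is_collision) are illustrative only and do not exist in the codebase; removed and gap entries are left out.

```python
# Slot -> type name per fork, copied from the collision table (slots 41-44 only).
# Removed entries and gaps are omitted. All names here are illustrative.
FORK_TABLES = {
    "mainline":      {41: "Q1_0"},
    "thetom-tq-kv":  {41: "Q1_0", 42: "TURBO2_0", 43: "TURBO3_0", 44: "TURBO4_0"},
    "thetom-alpha":  {41: "TURBO3_0", 42: "TURBO4_0", 43: "TURBO2_0", 44: "TQ3_1S"},
    "buun":          {41: "Q1_0", 42: "TURBO3_0", 43: "TURBO4_0", 44: "TURBO2_0"},
    "carlosfundora": {42: "Q1_0", 44: "PLANAR3_0"},
    "turbo-tan":     {41: "Q1_0", 44: "TQ3_1S"},
    "ik_llama":      {41: "Q1_0_G128"},
}


def interpretations(slot):
    """Return {fork: type name} for every fork that assigns this slot."""
    return {fork: table[slot] for fork, table in FORK_TABLES.items() if slot in table}


def is_collision(slot):
    """A slot collides when at least two forks give it different type names."""
    return len(set(interpretations(slot).values())) > 1


if __name__ == "__main__":
    for slot in (41, 42, 43, 44):
        print(slot, is_collision(slot), interpretations(slot))
```

A name-based check like this understates the problem: turbo-tan's TQ3_1S and TheTom's TQ3_1S share a name but not a layout, so two forks can agree on the name of a slot and still misread each other's files.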
ggml_type is a uint32_t-valued enum. We partition the address space into
fixed-purpose zones:
| Range | Purpose | Owner |
|---|---|---|
| 0–41 | Mainline core types | upstream ggml-org/llama.cpp |
| 42–59 | Mainline growth reserve — DO NOT USE | upstream (future) |
| 60–95 | Fork extensions — new types from contributing forks | this project |
| 96–199 | ik_llama compatibility zone | preserve ik_llama assignments |
| 200–255 | Row-interleaved / packed variants | preserve ik_llama R-suffix layout |
Why a mainline growth reserve. Mainline added Q1_0 at slot 41 after
several forks had already placed their own types at 41. Mainline will keep
adding types in this range. This fork refuses to play tug-of-war for these
slots. We accept whatever mainline assigns; we never compete.
Why a high-number zone (60–95). Same reason ik_llama did it: with slots 42–59 held in reserve, collisions with mainline's next 18 additions become impossible.
Why preserve ik_llama's 96+ assignments. Pragmatic: ik_llama GGUFs are the most numerous fork-quantized files in the wild. Renumbering them would require either (a) a loader compatibility shim or (b) forcing users to re-quantize. Preserving the IDs lets us read existing ik_llama GGUFs without modification.
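The partition above can be expressed as a small lookup. This is a sketch in Python for illustration, not code from the fork (the real enum lives in C in ggml/include/ggml.h); the names ZONES, zone_of, and may_allocate are hypothetical.

```python
# Zone boundaries copied from the partition table (inclusive ranges).
ZONES = [
    (0, 41, "mainline core"),
    (42, 59, "mainline growth reserve"),
    (60, 95, "fork extensions"),
    (96, 199, "ik_llama compatibility"),
    (200, 255, "row-interleaved / packed variants"),
]


def zone_of(slot):
    """Return the zone name for a ggml_type value, or None if outside 0-255."""
    for lo, hi, name in ZONES:
        if lo <= slot <= hi:
            return name
    return None


def may_allocate(slot):
    """Only the 60-95 zone is open to new types defined by this fork."""
    return zone_of(slot) == "fork extensions"
```

A check of this shape could run in CI against the enum so that a cherry-pick landing a type in 42–59 fails before review.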
Source-fork canonical branch confirmed by recon/06-thetom-branches.md (2026-05-12). Earlier drafts of this document named feature/alpha-scaling as the source; alpha-scaling has since been superseded by TQ-KV (1 substantive unique commit, the optional TURBO_ALPHA env var knob). All "TheTom name (renamed)" slot numbers below reflect TQ-KV HEAD as of 5aeb2fdbe.
| Slot | Type name | TheTom name (renamed) | Description |
|---|---|---|---|
| 60 | GGML_TYPE_TURBOQ2_0 | TURBO2_0 (42) | 2-bit PolarQuant, no QJL |
| 61 | GGML_TYPE_TURBOQ3_0 | TURBO3_0 (43) | 2-bit PolarQuant + 1-bit QJL |
| 62 | GGML_TYPE_TURBOQ4_0 | TURBO4_0 (44) | 4-bit PolarQuant (default TURBOQ4_USE_4BIT=1; legacy 3-bit+QJL mode available via TURBOQ4_USE_4BIT=0) |
| 63 | GGML_TYPE_TURBOQ8_0 | buun TURBO8_0 | 8-bit KV: FWHT + uniform 256-level grid (centroid[i]=(i-127.5)/127.5) + per-block absmax, no QJL, no PolarQuant codebook. CLI string turboq8; block block_turboq8_0 = 130 bytes (fp16 absmax + 128×uint8), 8.125 bpw. CPU + CUDA/HIP fattn-vec; no Vulkan kernel yet. |
| 64 | GGML_TYPE_TURBOQ5_0 | ygg (TODO 250) | 5-bit KV: FWHT + uniform 32-level grid (centroid[i]=(i-15.5)/15.5) + per-block absmax, no QJL, no PolarQuant codebook. Extends turboq8 design; q5_0-style index split (low nibble in qs, high 1 bit in qh). CLI string turboq5; block block_turboq5_0 = 82 bytes, 5.125 bpw. CPU + CUDA/HIP fattn-vec; no Vulkan yet. See features/turboquant-hibit-kv.md. |
| 65 | GGML_TYPE_TURBOQ6_0 | ygg (TODO 250) | 6-bit KV: FWHT + uniform 64-level grid (centroid[i]=(i-31.5)/31.5) + per-block absmax, no QJL, no PolarQuant codebook. Extends turboq8 design; q6_K-style index split (low nibble in qs, high 2 bits in qh). CLI string turboq6; block block_turboq6_0 = 98 bytes, 6.125 bpw. CPU + CUDA/HIP fattn-vec; no Vulkan yet. See features/turboquant-hibit-kv.md. |
Slot-65 reassignment (2026-06-22, TODO 250). Slot 65 was previously a doc-only reservation for GGML_TYPE_TURBOQ3_NATIVE (turbo-tan TQ3_0, 200). That type was never landed in ggml/include/ggml.h (zero code references fork-wide), so slot 65 was free in the enum and is now assigned to GGML_TYPE_TURBOQ6_0. If turbo-tan TQ3_0 is ported later it must take a fresh free slot, not 65.
Symbol prefix: turboq_ (kernels), TURBOQ_ (constants). The Q suffix
disambiguates from the TURBO*_0 collisions in contributing forks.
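The block sizes and bpw figures quoted for turboq5, turboq6, and turboq8 follow from simple arithmetic: one fp16 absmax plus a fixed number of index bits per element. The Python sketch below reproduces them. The 128-element block is stated in the table only for turboq8; applying it to turboq5 and turboq6 is an assumption, chosen because it matches the quoted byte counts. The helper names are hypothetical.

```python
BLOCK_ELEMS = 128  # stated for turboq8; assumed here for turboq5 and turboq6
SCALE_BYTES = 2    # one fp16 absmax per block


def block_bytes(bits):
    """Bytes per block: the fp16 absmax plus `bits` index bits per element."""
    return SCALE_BYTES + BLOCK_ELEMS * bits // 8


def bpw(bits):
    """Bits per weight including the per-block scale overhead."""
    return block_bytes(bits) * 8 / BLOCK_ELEMS


if __name__ == "__main__":
    for bits in (5, 6, 8):
        print(f"turboq{bits}: {block_bytes(bits)} bytes, {bpw(bits)} bpw")
```

The constant 0.125 bpw overhead in all three types is the 16-bit scale amortized over 128 elements.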
| Slot | Type name | Source name (renamed) | Description | Block size | Vulkan |
|---|---|---|---|---|---|
| 66 | GGML_TYPE_TURBOQ2_TCQ | TURBO2_TCQ (buun 46) | TCQ k=2, L=8, 256 states | 36 bytes | CPU fallback |
| 67 | GGML_TYPE_TURBOQ3_TCQ | TURBO3_TCQ (buun 45) | TCQ k=3, Viterbi-decoded | 52 bytes | CPU fallback |
| 68 | GGML_TYPE_TURBOQ2_INNERQ | TURBO2_INNERQ (ft2 67) | 2-bit + InnerQ per-channel equalization; block_turboq2_0 | 34 bytes | CPU fallback (no .comp shaders in ft2) |
| 69 | GGML_TYPE_TURBOQ3_INNERQ | TURBO3_INNERQ (ft2 68) | 3-bit + InnerQ per-channel equalization; block_turboq3_0 | 50 bytes | CPU fallback |
| 70 | (retired/reserved) | — | Was GGML_TYPE_TURBOQ4_INNERQ: 4-bit InnerQ alias of TURBOQ4_0; InnerQ equalization regresses quality at 4-bit (PPL 9.08 vs 7.47, ft2 ccfe39d675). Slot permanently retired — do not reuse. | — | — |
| 71 | GGML_TYPE_KV_OSCAR_INT2 | new — OScaR Phase 1 | FHT + per-block min-max uniform INT2 (arXiv:2605.19660); Phase 1 CUDA-only; Phase 2 adds F16 residual window (R=128) + hybrid-memory-chain constructor-chain propagation | 36 bytes | CPU fallback |
Note: InnerQ is K-cache runtime quantization only (not weight quantization); calibration state is per-session, no GGUF persistence.
Symbol prefix: turboq_tcq_ (TCQ), turboq_innerq_ (InnerQ). TCQ extends the TurboQuant family
conceptually but uses Viterbi-coded trellises instead of scalar codebooks. InnerQ applies
per-channel K-cache equalization before WHT rotation; wire format identical to the corresponding
TURBOQ_0 block structs.
Slots 72–75 previously held the RotorQuant KV family (RQ_PLANAR3_0, RQ_PLANAR4_0,
RQ_ISO3_0, RQ_ISO4_0) ported from carlosfundora 1-bit-turbo. The family was
removed (55bb0d418): ISO3_0 was strictly dominated (+23.5% PPL vs
comparable TurboQ types), and all four types were zero-rotation scalar duplicates with
no recoverable advantage at identical or lower bpw.
| Slot | Status |
|---|---|
| 72 | reserved (formerly GGML_TYPE_RQ_PLANAR3_0; removed) |
| 73 | reserved (formerly GGML_TYPE_RQ_PLANAR4_0; removed) |
| 74 | reserved (formerly GGML_TYPE_RQ_ISO3_0; removed) |
| 75 | reserved (formerly GGML_TYPE_RQ_ISO4_0; removed) |
| 76–79 | reserved |
Originally drafted against pr/tq4-weight-compression; that branch is fully subsumed by feature/turboquant-kv-cache (zero unique commits by subject — see recon/06-thetom-branches.md). All slot numbers below reflect TQ-KV HEAD as of 5aeb2fdbe.
| Slot | Type name | TheTom name (renamed) | Description |
|---|---|---|---|
| 80 | GGML_TYPE_WHT3_0 | TQ3_1S (45) | WHT-rotated 8-level Lloyd-Max, block_size=32 |
| 81 | GGML_TYPE_WHT4_0 | TQ4_1S (46) | WHT-rotated 16-level Lloyd-Max, block_size=32 |
| 82 | GGML_TYPE_WHT5_0 | — (yggdrasil extension) | WHT-rotated 32-level Lloyd-Max, block_size=32 (6.0 bpw) |
| 83 | GGML_TYPE_WHT6_0 | — (yggdrasil extension) | WHT-rotated 64-level Lloyd-Max, block_size=32 (7.0 bpw) |
| 84 | GGML_TYPE_WHT8_0 | — (yggdrasil extension) | WHT-rotated 256-level Lloyd-Max, block_size=32 (9.0 bpw) |
| 85 | reserved | — | future WHT variant |
Symbol prefix: wht_. The TQ prefix in TheTom's naming collided with
turbo-tan's RaBitQ TQ3 family; renaming to WHT reflects the actual
transform (Walsh-Hadamard) and breaks the collision.
WHT5_0/WHT6_0/WHT8_0 (slots 82/83/84, 2026-06-22): yggdrasil extensions of
TheTom's WHT lineage to wider Lloyd-Max codebooks (no upstream TheTom counterpart
— the rotation + dual-half-scale block design and quantizer are TheTom's, the
5/6/8-bit codebooks and index packings are new). FTYPEs MOSTLY_WHT5_0=59,
MOSTLY_WHT6_0=60, MOSTLY_WHT8_0=61. Status: functional (CPU + CUDA/HIP
dequant→cuBLAS path); fused mmvq + Vulkan deferred. Credit: TheTom (WHT method).
| Slot | Type name | Turbo-tan name (renamed) | Description |
|---|---|---|---|
| 86 | GGML_TYPE_RBQ3_1S | TQ3_1S (44) | RaBitQ 3-bit, two half-block scales |
| 87 | GGML_TYPE_RBQ3_4S | TQ3_4S (46) | RaBitQ 3-bit, four u8 per-8 scales (4.0 bpw) |
| 88–91 | reserved | — | future RaBitQ variants |
Symbol prefix: rbq_. Disambiguates from TheTom's WHT family.
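Collecting the "(renamed)" columns of the tables above gives the FROM/TO mapping a cherry-pick has to apply. The Python sketch below records only the pairs that appear in this document with an explicit source slot; the fork keys and the renumber helper are hypothetical and not part of any existing tooling.

```python
# Source slot -> slot in this fork, taken from the "(renamed)" table columns.
# Only pairs with an explicit source slot number are listed.
RENUMBER = {
    "thetom-tq-kv": {42: 60, 43: 61, 44: 62, 45: 80, 46: 81},
    "buun":         {46: 66, 45: 67},
    "turbo-tan":    {44: 86, 46: 87},
}


def renumber(fork, source_slot):
    """Map a contributing fork's slot to this fork's slot; raise if unmapped."""
    try:
        return RENUMBER[fork][source_slot]
    except KeyError:
        raise ValueError(f"no renumbering rule for {fork} slot {source_slot}") from None


if __name__ == "__main__":
    # Source slot 46 resolves to a different target for each contributing fork.
    for fork in RENUMBER:
        print(fork, 46, "->", renumber(fork, 46))
```

Raising on an unmapped slot, instead of passing the ID through unchanged, keeps an un-renumbered type from landing silently.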
| Slot | Name | Source | Notes |
|---|---|---|---|
| 92 | GGML_TYPE_WQ3_TCQ | buun feat/tcq-wq3-ffn-fusion | TurboQuant 3-bit weight quant: TCQ (k=3, L=9, 512 states) + FWHT rotation. Re-slotted from buun's upstream 46 to avoid a mid-enum renumber of our relocated KV types. GPU-only dequant; reuses the 52-byte block_turboq3_tcq layout (128 elems, 3.25 bpv). CUDA-first (Ph1); CPU/HIP/Vulkan + quantizer in Ph2–4. See docs/features/wq3-tcq.md. |
| 93–95 | reserved | — | future weight quant extensions |
WQ3_TCQ landed here (not the 80–85 WHT zone) because it is neither a WHT nor a RaBitQ variant: it is a trellis-coded (TCQ) weight quant, the first of its kind, so it takes the dedicated unanticipated-weight reserve. This also keeps the WHT5/6/8 slots (82–84) free.
ik_llama's chosen IDs are preserved verbatim. Renumbering would break existing ik_llama-quantized GGUFs. The full list of preserved assignments:
| Slot | Name | Notes |
|---|---|---|
| 97 | Q8_0_X4 | interleaved 8-bit, ×4 packing |
| 98 | Q8_1_X4 | interleaved 8-bit (signed-bias), ×4 packing |
| 99 | Q8_2_X4 | interleaved 8-bit (variant), ×4 packing |
| 133 | Q6_0 | revived legacy format |
| 134 | IQ1_BN | BitNet 1-bit |
| 135 | IQ2_BN | BitNet 2-bit |
| 136 | Q8_K64 | K-quant with 64-element blocks |
| 137 | IQ2_K | IQK 2-bit imatrix-aware weight quant (2.375 bpw) — Phase 5b-1a; ygg canonical |
| 138 | IQ3_K | IQK 3-bit imatrix-aware weight quant (3.44 bpw) — Phase 5b-1a; ygg canonical |
| 139 | IQ4_K | IQK 4-bit imatrix-aware weight quant (4.50 bpw) — Phase 5b-1a; ygg canonical |
| 140 | IQ5_K | IQK 5-bit imatrix-aware weight quant — Phase 5b-2 S1 f7a489de5; ygg canonical |
| 141 | IQ6_K | IQK 6-bit imatrix-aware weight quant — Phase 5b-2 S1 f7a489de5; ygg canonical |
| 144 | IQ4_KS | IK-quant small; row_meta=4 bytes (float row-scale) — Phase 5b-1b; ygg canonical |
| 145 | IQ2_KS | |
| 146 | IQ4_KSS | IK-quant small-small; row_meta=4 bytes — Phase 5b-1b; ygg canonical |
| 147–151 | Q8_K16, Q8_K32, Q8_KR8, Q8_K128, Q8_KV | Q8 K-block variants |
| 152 | IQ5_KS | |
| 153–154 | IQ2_KT, IQ3_KT | trellis weight quants (dormant; preserve IDs) |
| 155 | IQ4_KT | IK trellis 4-bit weight quant; row_meta=4 bytes — Phase 5b-1b; ygg canonical (differs from buun TCQ: IQ4_KT is a weight quant, TCQ is a KV-cache quant) |
| 156 | IQ3_KS | IK-quant small 3-bit; row_meta=2 bytes (uint16_t half-row-scale) — Phase 5b-1b; ygg canonical |
| 157 | IQ2_KL | IQK 2-bit low-bpw (2.6875 bpw) imatrix-aware weight quant — Phase 5b-1c S1 e404274b9; ygg canonical |
| 158 | IQ1_KT | trellis 1-bit (dormant) |
Preserve ik_llama's R-suffix layout verbatim:
| Slot | Name |
|---|---|
| 202 | Q4_0_R8 |
| 206 | Q5_0_R4 |
| 208 | Q8_0_R8 |
| 210–214 | Q2_K_R4, Q3_K_R4, Q4_K_R4, Q5_K_R4, Q6_K_R4 |
| 216–223 | IQ2_XXS_R4, IQ2_XS_R4, IQ3_XXS_R4, IQ1_S_R4, IQ4_NL_R4, IQ3_S_R4, IQ2_S_R4, IQ4_XS_R8 |
| 229 | IQ1_M_R4 |
| 230 | BF16_R16 |
Slots 200–201, 203–205, 207, 209, 215, 224–228, 231–255 are reserved for future packed-variant additions.
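The reserved list is meant to be exactly the complement of the assigned slots within 200–255. A short Python sketch of that consistency check, using the ranges as written in this section (the helper name expand is hypothetical):

```python
def expand(ranges):
    """Expand a list of inclusive (lo, hi) ranges into a set of slot numbers."""
    return {n for lo, hi in ranges for n in range(lo, hi + 1)}


# Assigned slots, from the R-suffix table.
ASSIGNED = expand([(202, 202), (206, 206), (208, 208),
                   (210, 214), (216, 223), (229, 230)])
# Reserved slots, from the sentence following the table.
RESERVED = expand([(200, 201), (203, 205), (207, 207), (209, 209),
                   (215, 215), (224, 228), (231, 255)])
ZONE = set(range(200, 256))


def zone_is_consistent():
    """True when assigned and reserved slots partition 200-255 exactly."""
    return ASSIGNED.isdisjoint(RESERVED) and (ASSIGNED | RESERVED) == ZONE
```

Re-running a check like this whenever a packed variant is added guards against the table and the reserved list drifting apart.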
Turbo-tan's TQ3_0 = 200 (KV-cache only) was tentatively earmarked for
the TurboQuant KV zone, but the earlier draft assignment GGML_TYPE_TURBOQ3_NATIVE = 65
was never landed in ggml/include/ggml.h (doc-only reservation, zero code references).
Slot 65 has since been assigned to GGML_TYPE_TURBOQ6_0 (2026-06-22, TODO 250). If turbo-tan's
TQ3_0 is ported later it must take a fresh free slot, not 65 — and revisit whether it is a
TurboQuant variant at all.
enum llama_ftype values for the MOSTLY_* variants are derived from
ggml_type via:

    LLAMA_FTYPE_MOSTLY_<NAME> = <next-available-slot>

Assignment order follows ggml_type numeric order, starting at the first
unused mainline ftype slot (currently 41 after mainline's Q1_0=40).
llama_ftype assignments are mechanical; this document does not enumerate
them. They are settled at the moment each ggml_type lands. The
implementation that adds a new ggml_type MUST also add the
corresponding LLAMA_FTYPE_MOSTLY_* in the same commit.
We will NOT silently re-interpret legacy fork-specific GGUFs. If a user
brings a GGUF quantized with (e.g.) buun's TURBO3_0=42, this fork's
loader will fail with an explicit error:

    unrecognized ggml_type 42 in <file.gguf>. This appears to be a buun-fork
    GGUF; this fork places TurboQuant3 at type 61. Re-quantize with
    `llama-quantize <model> turboq3_0`.

A future optional loader flag (--legacy-fork-ids=<fork-name>) MAY
implement on-the-fly remapping. This is not part of the v1 contract.
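The fail-loudly policy can be sketched as follows. This is Python pseudologic for a check that would live in the C/C++ loader; the hint table, the exception class, and the function name are hypothetical, and only the single example from the text above is filled in.

```python
# Hypothetical hint table: legacy slot -> (fork, family label, slot here, quantize target).
# Only the example given in this document is listed.
LEGACY_HINTS = {42: ("buun", "TurboQuant3", 61, "turboq3_0")}


class UnrecognizedTypeError(ValueError):
    """Raised when a GGUF tensor uses a ggml_type this fork does not define."""


def check_type(type_id, path, known_types):
    """Accept IDs this fork defines; reject everything else with an explicit error."""
    if type_id in known_types:
        return type_id
    msg = f"unrecognized ggml_type {type_id} in {path}."
    hint = LEGACY_HINTS.get(type_id)
    if hint is not None:
        fork, label, here, target = hint
        msg += (f" This appears to be a {fork}-fork GGUF; this fork places {label}"
                f" at type {here}. Re-quantize with `llama-quantize <model> {target}`.")
    raise UnrecognizedTypeError(msg)
```

The hint is advisory only. Slot 42 means something different in each contributing fork, so the loader cannot know the file's origin, which is why v1 rejects the file instead of remapping it.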
When this fork grows a new quant type (whether ported from a fork or invented):
- Allocate the lowest-numbered available slot in the appropriate family zone (60–95). If the family zone is full, expand into the 93–95 reserve before considering 96+ (which is owned by ik_llama compat).
- The naming MUST follow the family's symbol prefix.
- The ggml_type, llama_ftype, type-traits row, and CPU vecdot must land in the same commit. Partial landings are rejected.
- Update this document in the same PR.
- The PPL regression harness must include the new type before the PR can merge. No exceptions.
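The first rule (lowest-numbered free slot within the family zone) can be sketched in Python as below. The family sub-ranges are the ones implied by the tables in this document (WHT 80–85, RaBitQ 86–91, weight quants 92–95), and the used set reflects the assignments listed there; the names FAMILY_ZONES, USED, and next_slot are hypothetical.

```python
# Family sub-zones as implied by the tables in this document (inclusive).
FAMILY_ZONES = {"wht": (80, 85), "rbq": (86, 91), "weight": (92, 95)}
# Slots already assigned within those sub-zones.
USED = {80, 81, 82, 83, 84, 86, 87, 92}


def next_slot(family, used=USED):
    """Return the lowest free slot in the family's zone, or None if it is full."""
    lo, hi = FAMILY_ZONES[family]
    for slot in range(lo, hi + 1):
        if slot not in used:
            return slot
    return None


if __name__ == "__main__":
    for family in FAMILY_ZONES:
        print(family, next_slot(family))
```

Returning None for a full zone leaves the fallback decision to a human, since spilling into a reserve is a contract change that belongs in this document.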
- TheTom's TURBO3_0 shipped existing GGUFs at slot 41. Production TheTom-quantized models exist with this ID. The v1 reader rejects them. A --legacy-fork-ids=thetom flag is a likely v2 addition.
- Mainline may at some point claim slots 42–59 with types whose names collide with our renames. E.g., mainline could add a future GGML_TYPE_PLANAR3_0 unrelated to the (now-removed) RotorQuant family. Our policy is: rename ours, never theirs. The TURBOQ prefix is already there to absorb this. (The RQ_ prefix was used by RotorQuant, which was removed; those slots 72–75 are now reserved.)
- GGUF metadata format (the part outside the type-ID enum) may also need fork-specific keys (e.g., turbo-tan's WHT rotation tables, buun's TCQ codebook indices). Out of scope for this document; to be addressed in a separate GGUF_METADATA_KEYS.md.
- v1 (2026-05-12) — initial contract. Authored before any cherry-picks land. Authoritative for Phase 0+.
- v2 (2026-05-22 to 2026-05-24) — Phase 5b-1a landed: IQ2_K=137, IQ3_K=138, IQ4_K=139 annotated with ygg canonical + Phase 5b-1a tag. Phase 5b-1b landed: IQ4_KS=144, IQ4_KSS=146, IQ4_KT=155, IQ3_KS=156 annotated with Phase 5b-1b tag + row_meta byte sizes. IQ4_KT separated from dormant IQ2/IQ3_KT in table entry. IQ5_K/IQ6_K noted as Phase 5b-2 recon in-flight.
- v3 (2026-05-24) — Phase 5b-2 S1 landed: IQ5_K=140, IQ6_K=141 annotated with ygg canonical + Phase 5b-2 S1 tag. Phase 5b-1c S1 landed: IQ2_KL=157 annotated with ygg canonical + Phase 5b-1c S1 tag.