GGUF type-ID contract — v2

Authoritative assignments for enum ggml_type and enum llama_ftype in this fork. Every cherry-pick from a contributing fork MUST renumber to match this table before landing on a branch.

This document is normative. If reality disagrees with this document, fix the code, not the document — unless the contract itself is being revised, in which case bump the version at the top and call out the change.

Why this exists

Five contributing forks have independently extended ggml_type in mutually incompatible ways. Concrete known collisions:

| Slot | Mainline | TheTom (TQ-KV HEAD) | TheTom (alpha-scaling, stale) | Buun (master) | Carlosfundora (1-bit-turbo) | Turbo-tan (main) | ik_llama (main) |
|---|---|---|---|---|---|---|---|
| 41 | Q1_0 | Q1_0 (mainline-aligned) | TURBO3_0 | Q1_0 | (gap) | Q1_0 | Q1_0_G128 |
| 42 | | TURBO2_0 | TURBO4_0 | TURBO3_0 | Q1_0 | | |
| 43 | | TURBO3_0 | TURBO2_0 | TURBO4_0 | Q1_0_g128 (removed; slot returned to mainline reserve) | | |
| 44 | | TURBO4_0 | TQ3_1S | TURBO2_0 | PLANAR3_0 | TQ3_1S (different layout from TheTom's) | |
| 45 | | TQ3_1S | TQ4_1S | TURBO3_TCQ | PLANAR4_0 | | |
| 46 | | TQ4_1S | | TURBO2_TCQ | ISO3_0 | TQ3_4S | |
| 47 | | | | | ISO4_0 | | |

Note (2026-05-12, recon/06): TheTom's HEAD branch is feature/turboquant-kv-cache; alpha-scaling is stale and superseded. Between alpha-scaling and TQ-KV, the TURBO*_0 trio was reordered (TURBO2/3/4 = 42/43/44 instead of 43/41/42) and the TQ3_1S/TQ4_1S types shifted up by one slot (from 44/45 to 45/46). This affects the cherry-pick recipe's "FROM" mapping but not this fork's own slot assignments (60–95 zone unchanged).

GGUF files produced by any one fork are silently misread by any other. Cherry-picking without renumbering would propagate this hazard into this fork.
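As an illustration of the renumbering step, here is a minimal sketch of a FROM/TO lookup for TheTom's feature/turboquant-kv-cache branch. The slot pairs are copied from the assignment tables later in this document; the struct and function names are hypothetical, and the pass-through for slots 0–41 assumes that branch is mainline-aligned below 42, which the table above confirms only for slot 41.

```c
#include <stddef.h>
#include <stdint.h>

/* One FROM -> TO pair for a cherry-pick renumber pass (hypothetical type). */
struct slot_remap {
    uint32_t from; /* slot used by the contributing fork */
    uint32_t to;   /* canonical slot in this fork        */
};

/* TheTom TQ-KV HEAD slots and their canonical slots per this contract. */
static const struct slot_remap thetom_tqkv_remap[] = {
    { 42, 60 }, /* TURBO2_0 -> GGML_TYPE_TURBOQ2_0 */
    { 43, 61 }, /* TURBO3_0 -> GGML_TYPE_TURBOQ3_0 */
    { 44, 62 }, /* TURBO4_0 -> GGML_TYPE_TURBOQ4_0 */
    { 45, 80 }, /* TQ3_1S   -> GGML_TYPE_WHT3_0    */
    { 46, 81 }, /* TQ4_1S   -> GGML_TYPE_WHT4_0    */
};

/* Returns the canonical slot for a TheTom TQ-KV slot.
 * Assumption: slots 0..41 pass through unchanged.
 * Returns -1 for anything else so the caller must handle it explicitly. */
static int64_t remap_thetom_tqkv(uint32_t from) {
    const size_t n = sizeof thetom_tqkv_remap / sizeof thetom_tqkv_remap[0];
    for (size_t i = 0; i < n; i++) {
        if (thetom_tqkv_remap[i].from == from) {
            return (int64_t) thetom_tqkv_remap[i].to;
        }
    }
    return from <= 41 ? (int64_t) from : -1;
}
```

Returning -1 rather than guessing keeps an unmapped slot from being carried over silently, which is the hazard this section describes.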

Partitioning policy

ggml_type is a uint32_t-valued enum. We partition the address space into fixed-purpose zones:

| Range | Purpose | Owner |
|---|---|---|
| 0–41 | Mainline core types | upstream ggml-org/llama.cpp |
| 42–59 | Mainline growth reserve (DO NOT USE) | upstream (future) |
| 60–95 | Fork extensions: new types from contributing forks | this project |
| 96–199 | ik_llama compatibility zone | preserve ik_llama assignments |
| 200–255 | Row-interleaved / packed variants | preserve ik_llama R-suffix layout |

Why a mainline growth reserve. Mainline added Q1_0 at slot 41 after several forks had already placed their own types at 41. Mainline will keep adding types in this range. This fork refuses to play tug-of-war for these slots. We accept whatever mainline assigns; we never compete.

Why a high-number zone (60–95). Same reason ik_llama did it: mainline can fill the entire 42–59 reserve (its next 18 additions) without ever colliding with a fork type.

Why preserve ik_llama's 96+ assignments. Pragmatic: ik_llama GGUFs are the most numerous fork-quantized files in the wild. Renumbering them would require either (a) a loader compatibility shim or (b) forcing users to re-quantize. Preserving the IDs lets us read existing ik_llama GGUFs without modification.
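The partitioning can be expressed as a small range check. This is a sketch only: the enum and function names are hypothetical and do not come from the fork's code; the boundaries are those of the partition table in this section.

```c
#include <stdint.h>

/* Zones of the ggml_type address space as partitioned by this contract. */
enum type_zone {
    ZONE_MAINLINE_CORE,    /*   0..41  */
    ZONE_MAINLINE_RESERVE, /*  42..59  */
    ZONE_FORK_EXT,         /*  60..95  */
    ZONE_IK_COMPAT,        /*  96..199 */
    ZONE_PACKED,           /* 200..255 */
    ZONE_UNASSIGNED        /* 256 and above: not covered by the contract */
};

/* Maps a raw type ID to the zone that owns it. */
static enum type_zone classify_slot(uint32_t slot) {
    if (slot <= 41)  return ZONE_MAINLINE_CORE;
    if (slot <= 59)  return ZONE_MAINLINE_RESERVE;
    if (slot <= 95)  return ZONE_FORK_EXT;
    if (slot <= 199) return ZONE_IK_COMPAT;
    if (slot <= 255) return ZONE_PACKED;
    return ZONE_UNASSIGNED;
}
```

A check like this could back a review-time lint that rejects any new fork type whose ID falls in ZONE_MAINLINE_RESERVE.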

Fork extension zone (60–95) — canonical assignments

60–65: TurboQuant KV family (source: TheTom feature/turboquant-kv-cache)

Source-fork canonical branch confirmed by recon recon/06-thetom-branches.md (2026-05-12). Earlier drafts of this document named feature/alpha-scaling as the source; alpha-scaling has since been superseded by TQ-KV (alpha-scaling retains only 1 substantive unique commit, the optional TURBO_ALPHA env var knob). All "TheTom name (renamed)" slot numbers below reflect TQ-KV HEAD as of 5aeb2fdbe.

| Slot | Type name | TheTom name (renamed) | Description |
|---|---|---|---|
| 60 | GGML_TYPE_TURBOQ2_0 | TURBO2_0 (42) | 2-bit PolarQuant, no QJL |
| 61 | GGML_TYPE_TURBOQ3_0 | TURBO3_0 (43) | 2-bit PolarQuant + 1-bit QJL |
| 62 | GGML_TYPE_TURBOQ4_0 | TURBO4_0 (44) | 4-bit PolarQuant (default TURBOQ4_USE_4BIT=1; legacy 3-bit+QJL mode available via TURBOQ4_USE_4BIT=0) |
| 63 | GGML_TYPE_TURBOQ8_0 | buun TURBO8_0 | 8-bit KV: FWHT + uniform 256-level grid (centroid[i]=(i-127.5)/127.5) + per-block absmax, no QJL, no PolarQuant codebook. CLI string turboq8; block block_turboq8_0 = 130 bytes (fp16 absmax + 128×uint8), 8.125 bpw. CPU + CUDA/HIP fattn-vec; no Vulkan kernel yet. |
| 64 | GGML_TYPE_TURBOQ5_0 | ygg (TODO 250) | 5-bit KV: FWHT + uniform 32-level grid (centroid[i]=(i-15.5)/15.5) + per-block absmax, no QJL, no PolarQuant codebook. Extends turboq8 design; q5_0-style index split (low nibble in qs, high 1 bit in qh). CLI string turboq5; block block_turboq5_0 = 82 bytes, 5.125 bpw. CPU + CUDA/HIP fattn-vec; no Vulkan yet. See features/turboquant-hibit-kv.md. |
| 65 | GGML_TYPE_TURBOQ6_0 | ygg (TODO 250) | 6-bit KV: FWHT + uniform 64-level grid (centroid[i]=(i-31.5)/31.5) + per-block absmax, no QJL, no PolarQuant codebook. Extends turboq8 design; q6_K-style index split (low nibble in qs, high 2 bits in qh). CLI string turboq6; block block_turboq6_0 = 98 bytes, 6.125 bpw. CPU + CUDA/HIP fattn-vec; no Vulkan yet. See features/turboquant-hibit-kv.md. |

Slot-65 reassignment (2026-06-22, TODO 250). Slot 65 was previously a doc-only reservation for GGML_TYPE_TURBOQ3_NATIVE (turbo-tan TQ3_0, 200). That type was never landed in ggml/include/ggml.h (zero code references fork-wide), so slot 65 was free in the enum and is now assigned to GGML_TYPE_TURBOQ6_0. If turbo-tan TQ3_0 is ported later it must take a fresh free slot, not 65.

Symbol prefix: turboq_ (kernels), TURBOQ_ (constants). The Q suffix disambiguates from the TURBO*_0 collisions in contributing forks.
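The block sizes and bpw figures quoted for slots 63–65 follow from one formula: a 2-byte fp16 absmax plus the packed indices for 128 elements. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the 128-element block with an fp16 absmax is stated in this document only for turboq8; applying it to turboq5 and turboq6 is an assumption that matches their quoted 82 and 98 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define TURBOQ_BLOCK_ELEMS 128u /* elements per block (stated for turboq8) */

/* Bytes in one block: 2-byte fp16 absmax + `bits` bits per element. */
static uint32_t turboq_block_bytes(uint32_t bits) {
    assert(bits >= 1 && bits <= 8);
    return 2u + TURBOQ_BLOCK_ELEMS * bits / 8u;
}

/* Bits per weight in thousandths, so the math stays in integers. */
static uint32_t turboq_milli_bpw(uint32_t bits) {
    return turboq_block_bytes(bits) * 8u * 1000u / TURBOQ_BLOCK_ELEMS;
}
```

For 5, 6 and 8 bits this gives 82, 98 and 130 bytes, that is 5.125, 6.125 and 8.125 bpw. The fixed 0.125 bpw overhead is the 16-bit absmax spread over 128 elements.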

66–71: TCQ + InnerQ KV family (source: buun master (TCQ) / TheTom (InnerQ))

| Slot | Type name | Source name (renamed) | Description | Block size | Vulkan |
|---|---|---|---|---|---|
| 66 | GGML_TYPE_TURBOQ2_TCQ | TURBO2_TCQ (buun 46) | TCQ k=2, L=8, 256 states | 36 bytes | CPU fallback |
| 67 | GGML_TYPE_TURBOQ3_TCQ | TURBO3_TCQ (buun 45) | TCQ k=3, Viterbi-decoded | 52 bytes | CPU fallback |
| 68 | GGML_TYPE_TURBOQ2_INNERQ | TURBO2_INNERQ (ft2 67) | 2-bit + InnerQ per-channel equalization; block_turboq2_0 | 34 bytes | CPU fallback (no .comp shaders in ft2) |
| 69 | GGML_TYPE_TURBOQ3_INNERQ | TURBO3_INNERQ (ft2 68) | 3-bit + InnerQ per-channel equalization; block_turboq3_0 | 50 bytes | CPU fallback |
| 70 | (retired/reserved) | | Was GGML_TYPE_TURBOQ4_INNERQ: 4-bit InnerQ alias of TURBOQ4_0; InnerQ equalization regresses quality at 4-bit (PPL 9.08 vs 7.47, ft2 ccfe39d675). Slot permanently retired; do not reuse. | | |
| 71 | GGML_TYPE_KV_OSCAR_INT2 | new (OScaR Phase 1) | FHT + per-block min-max uniform INT2 (arXiv:2605.19660); Phase 1 CUDA-only; Phase 2 adds F16 residual window (R=128) + hybrid-memory-chain constructor-chain propagation | 36 bytes | CPU fallback |

Note: InnerQ is K-cache runtime quantization only (not weight quantization); calibration state is per-session, no GGUF persistence.

Symbol prefix: turboq_tcq_ (TCQ), turboq_innerq_ (InnerQ). TCQ extends the TurboQuant family conceptually but uses Viterbi-coded trellises instead of scalar codebooks. InnerQ applies per-channel K-cache equalization before WHT rotation; its wire format is identical to the corresponding TURBOQ*_0 block structs (block_turboq2_0, block_turboq3_0).

72–79: Reserved (formerly RotorQuant KV family — removed)

Slots 72–75 previously held the RotorQuant KV family (RQ_PLANAR3_0, RQ_PLANAR4_0, RQ_ISO3_0, RQ_ISO4_0) ported from carlosfundora 1-bit-turbo. The family was removed (55bb0d418): ISO3_0 was strictly dominated (+23.5% PPL vs comparable TurboQ types), and all four types were zero-rotation scalar duplicates with no recoverable advantage at identical or lower bpw.

| Slot | Status |
|---|---|
| 72 | reserved (formerly GGML_TYPE_RQ_PLANAR3_0; removed) |
| 73 | reserved (formerly GGML_TYPE_RQ_PLANAR4_0; removed) |
| 74 | reserved (formerly GGML_TYPE_RQ_ISO3_0; removed) |
| 75 | reserved (formerly GGML_TYPE_RQ_ISO4_0; removed) |
| 76–79 | reserved |

80–85: WHT weight family (source: TheTom feature/turboquant-kv-cache)

Originally drafted against pr/tq4-weight-compression; that branch is fully subsumed by feature/turboquant-kv-cache (zero unique commits by subject — see recon/06-thetom-branches.md). All slot numbers below reflect TQ-KV HEAD as of 5aeb2fdbe.

| Slot | Type name | TheTom name (renamed) | Description |
|---|---|---|---|
| 80 | GGML_TYPE_WHT3_0 | TQ3_1S (45) | WHT-rotated 8-level Lloyd-Max, block_size=32 |
| 81 | GGML_TYPE_WHT4_0 | TQ4_1S (46) | WHT-rotated 16-level Lloyd-Max, block_size=32 |
| 82 | GGML_TYPE_WHT5_0 | none (yggdrasil extension) | WHT-rotated 32-level Lloyd-Max, block_size=32 (6.0 bpw) |
| 83 | GGML_TYPE_WHT6_0 | none (yggdrasil extension) | WHT-rotated 64-level Lloyd-Max, block_size=32 (7.0 bpw) |
| 84 | GGML_TYPE_WHT8_0 | none (yggdrasil extension) | WHT-rotated 256-level Lloyd-Max, block_size=32 (9.0 bpw) |
| 85 | reserved | | future WHT variant |

Symbol prefix: wht_. The TQ prefix in TheTom's naming collided with turbo-tan's RaBitQ TQ3 family; renaming to WHT reflects the actual transform (Walsh-Hadamard) and breaks the collision.

WHT5_0/WHT6_0/WHT8_0 (slots 82/83/84, 2026-06-22): yggdrasil extensions of TheTom's WHT lineage to wider Lloyd-Max codebooks (no upstream TheTom counterpart — the rotation + dual-half-scale block design and quantizer are TheTom's, the 5/6/8-bit codebooks and index packings are new). FTYPEs MOSTLY_WHT5_0=59, MOSTLY_WHT6_0=60, MOSTLY_WHT8_0=61. Status: functional (CPU + CUDA/HIP dequant→cuBLAS path); fused mmvq + Vulkan deferred. Credit: TheTom (WHT method).

86–91: RaBitQ weight family (source: turbo-tan main)

| Slot | Type name | Turbo-tan name (renamed) | Description |
|---|---|---|---|
| 86 | GGML_TYPE_RBQ3_1S | TQ3_1S (44) | RaBitQ 3-bit, two half-block scales |
| 87 | GGML_TYPE_RBQ3_4S | TQ3_4S (46) | RaBitQ 3-bit, four u8 per-8 scales (4.0 bpw) |
| 88–91 | reserved | | future RaBitQ variants |

Symbol prefix: rbq_. Disambiguates from TheTom's WHT family.

92–95: unanticipated weight quants

| Slot | Name | Source | Notes |
|---|---|---|---|
| 92 | GGML_TYPE_WQ3_TCQ | buun feat/tcq-wq3-ffn-fusion | TurboQuant 3-bit weight quant: TCQ (k=3, L=9, 512 states) + FWHT rotation. Re-slotted from buun's upstream 46 to avoid a mid-enum renumber of our relocated KV types. GPU-only dequant; reuses the 52-byte block_turboq3_tcq layout (128 elems, 3.25 bpv). CUDA-first (Ph1); CPU/HIP/Vulkan + quantizer in Ph2–4. See docs/features/wq3-tcq.md. |
| 93–95 | reserved | | future weight quant extensions |

WQ3_TCQ landed here (not the 80–85 WHT zone) because it is neither a WHT nor a RaBitQ variant: it is a trellis-coded (TCQ) weight quant, the first of its kind, so it takes the dedicated unanticipated-weight reserve. This also leaves the WHT5/6/8 slots (82–84) and the reserved slot 85 untouched.

ik_llama compatibility zone (96–199) — preserved IDs

ik_llama's chosen IDs are preserved verbatim. Renumbering would break existing ik_llama-quantized GGUFs. The full list of preserved assignments:

| Slot | Name | Notes |
|---|---|---|
| 97 | Q8_0_X4 | interleaved 8-bit, ×4 packing |
| 98 | Q8_1_X4 | interleaved 8-bit (signed-bias), ×4 packing |
| 99 | Q8_2_X4 | interleaved 8-bit (variant), ×4 packing |
| 133 | Q6_0 | revived legacy format |
| 134 | IQ1_BN | BitNet 1-bit |
| 135 | IQ2_BN | BitNet 2-bit |
| 136 | Q8_K64 | K-quant with 64-element blocks |
| 137 | IQ2_K | IQK 2-bit imatrix-aware weight quant (2.375 bpw); Phase 5b-1a; ygg canonical |
| 138 | IQ3_K | IQK 3-bit imatrix-aware weight quant (3.44 bpw); Phase 5b-1a; ygg canonical |
| 139 | IQ4_K | IQK 4-bit imatrix-aware weight quant (4.50 bpw); Phase 5b-1a; ygg canonical |
| 140 | IQ5_K | IQK 5-bit imatrix-aware weight quant; Phase 5b-2 S1 f7a489de5; ygg canonical |
| 141 | IQ6_K | IQK 6-bit imatrix-aware weight quant; Phase 5b-2 S1 f7a489de5; ygg canonical |
| 144 | IQ4_KS | IK-quant small; row_meta=4 bytes (float row-scale); Phase 5b-1b; ygg canonical |
| 145 | IQ2_KS | |
| 146 | IQ4_KSS | IK-quant small-small; row_meta=4 bytes; Phase 5b-1b; ygg canonical |
| 147–151 | Q8_K16, Q8_K32, Q8_KR8, Q8_K128, Q8_KV | Q8 K-block variants |
| 152 | IQ5_KS | |
| 153–154 | IQ2_KT, IQ3_KT | trellis weight quants (dormant; preserve IDs) |
| 155 | IQ4_KT | IK trellis 4-bit weight quant; row_meta=4 bytes; Phase 5b-1b; ygg canonical (differs from buun TCQ: IQ4_KT is a weight quant, TCQ is a KV-cache quant) |
| 156 | IQ3_KS | IK-quant small 3-bit; row_meta=2 bytes (uint16_t half-row-scale); Phase 5b-1b; ygg canonical |
| 157 | IQ2_KL | IQK 2-bit low-bpw (2.6875 bpw) imatrix-aware weight quant; Phase 5b-1c S1 e404274b9; ygg canonical |
| 158 | IQ1_KT | trellis 1-bit (dormant) |

Row-interleaved / packed variants (200–255)

Preserve ik_llama's R-suffix layout verbatim:

| Slot | Name |
|---|---|
| 202 | Q4_0_R8 |
| 206 | Q5_0_R4 |
| 208 | Q8_0_R8 |
| 210–214 | Q2_K_R4, Q3_K_R4, Q4_K_R4, Q5_K_R4, Q6_K_R4 |
| 216–223 | IQ2_XXS_R4, IQ2_XS_R4, IQ3_XXS_R4, IQ1_S_R4, IQ4_NL_R4, IQ3_S_R4, IQ2_S_R4, IQ4_XS_R8 |
| 229 | IQ1_M_R4 |
| 230 | BF16_R16 |

Slots 200–201, 203–205, 207, 209, 215, 224–228, 231–255 are reserved for future packed-variant additions.

Turbo-tan's TQ3_0 = 200 (KV-cache only) was tentatively earmarked for the TurboQuant KV zone, but the earlier draft assignment GGML_TYPE_TURBOQ3_NATIVE = 65 was never landed in ggml/include/ggml.h (doc-only reservation, zero code references). Slot 65 has since been assigned to GGML_TYPE_TURBOQ6_0 (2026-06-22, TODO 250). If turbo-tan's TQ3_0 is ported later it must take a fresh free slot, not 65 — and revisit whether it is a TurboQuant variant at all.

llama_ftype assignments

enum llama_ftype values for the MOSTLY_* variants are derived from ggml_type via:

LLAMA_FTYPE_MOSTLY_<NAME> = <next-available-slot>

Assignment order follows ggml_type numeric order, starting at the first unused mainline ftype slot (currently 41 after mainline's Q1_0=40).

llama_ftype assignments are mechanical; this document does not enumerate them. They are settled at the moment each ggml_type lands. The implementation that adds a new ggml_type MUST also add the corresponding LLAMA_FTYPE_MOSTLY_* in the same commit.

Reader compatibility for legacy fork GGUFs

We will NOT silently re-interpret legacy fork-specific GGUFs. If a user brings a GGUF quantized with (e.g.) buun's TURBO3_0=42, this fork's loader will fail with an explicit error:

unrecognized ggml_type 42 in <file.gguf>. This appears to be a buun-fork
GGUF; this fork places TurboQuant3 at type 61. Re-quantize with
`llama-quantize <model> turboq3_0`.

A future optional loader flag (--legacy-fork-ids=<fork-name>) MAY implement on-the-fly remapping. This is not part of the v1 contract.
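A minimal sketch of how such a rejection message could be assembled is shown here. The struct and function are hypothetical and are not the fork's loader code. How the loader decides which fork a file came from is not specified in this document, so the sketch takes that guess as an optional input.

```c
#include <stdint.h>
#include <stdio.h>

/* Optional guess about where a legacy type ID came from (hypothetical). */
struct legacy_hint {
    const char *fork;      /* e.g. "buun"                          */
    const char *family;    /* e.g. "TurboQuant3"                   */
    uint32_t    canonical; /* slot this fork assigns to the family */
    const char *cli_name;  /* type string for llama-quantize       */
};

/* Writes the rejection message into buf and returns snprintf's result.
 * With hint == NULL only the generic first sentence is produced. */
static int format_unrecognized_type(char *buf, size_t cap, uint32_t type_id,
                                    const char *path,
                                    const struct legacy_hint *hint) {
    if (hint == NULL) {
        return snprintf(buf, cap, "unrecognized ggml_type %u in %s.",
                        (unsigned) type_id, path);
    }
    return snprintf(buf, cap,
                    "unrecognized ggml_type %u in %s. This appears to be a "
                    "%s-fork GGUF; this fork places %s at type %u. "
                    "Re-quantize with `llama-quantize <model> %s`.",
                    (unsigned) type_id, path, hint->fork, hint->family,
                    (unsigned) hint->canonical, hint->cli_name);
}
```

The key property is that the function only reports: it never substitutes the canonical ID for the legacy one, so a mismatched file cannot be loaded by accident.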

Policy for adding new types

When this fork grows a new quant type (whether ported from a fork or invented):

  1. Allocate the lowest-numbered available slot in the appropriate family zone (60–95). If the family zone is full, expand into 92–95 reserves before considering 96+ (which is owned by ik_llama compat).
  2. The naming MUST follow the family's symbol prefix.
  3. The ggml_type, llama_ftype, type-traits row, and CPU vecdot must land in the same commit. Partial landings are rejected.
  4. Update this document in the same PR.
  5. The PPL regression harness must include the new type before the PR can merge. No exceptions.
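Step 1 of the policy can be sketched as a lowest-free-slot search over a family range. The function name and signature are hypothetical. Retired slots such as 70 are assumed to be passed in the taken list alongside assigned ones, so they are never handed out again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the lowest slot in [lo, hi] that is not listed in `taken`,
 * or -1 when the whole range is in use. `taken` must include retired
 * slots as well as assigned ones. */
static int64_t alloc_family_slot(uint32_t lo, uint32_t hi,
                                 const uint32_t *taken, size_t n_taken) {
    assert(lo <= hi && hi <= 255); /* contract covers 0..255 only */
    for (uint32_t slot = lo; slot <= hi; slot++) {
        int in_use = 0;
        for (size_t i = 0; i < n_taken; i++) {
            if (taken[i] == slot) { in_use = 1; break; }
        }
        if (!in_use) return (int64_t) slot;
    }
    return -1; /* family range full: caller falls back to the 92..95 reserve */
}
```

With the assignments listed in this document, the RaBitQ range 86–91 yields 88. The TCQ + InnerQ range 66–71 yields -1 because every slot is assigned or retired, so a new member of that family would fall through to the 92–95 reserve, where 93 is the first free slot.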

Open issues

  • TheTom's TURBO3_0 shipped existing GGUFs at slot 41. Production TheTom-quantized models exist with this ID. The v1 reader rejects them. A --legacy-fork-ids=thetom flag is a likely v2 addition.

  • Mainline may at some point claim slots 42–59 with types whose names collide with our renames. E.g., mainline could add a future GGML_TYPE_PLANAR3_0 unrelated to the (now-removed) RotorQuant family. Our policy is: rename ours, never theirs. The TURBOQ prefix is already there to absorb this. (The RQ_ prefix was used by RotorQuant, which was removed; those slots 72–75 are now reserved.)

  • GGUF metadata format (the part outside the type-ID enum) may also need fork-specific keys (e.g., turbo-tan's WHT rotation tables, buun's TCQ codebook indices). Out of scope for this document — to be addressed in a separate GGUF_METADATA_KEYS.md.

Version log

  • v1 (2026-05-12) — initial contract. Authored before any cherry-picks land. Authoritative for Phase 0+.
  • v2 (2026-05-22 to 2026-05-24) — Phase 5b-1a landed: IQ2_K=137, IQ3_K=138, IQ4_K=139 annotated with ygg canonical + Phase 5b-1a tag. Phase 5b-1b landed: IQ4_KS=144, IQ4_KSS=146, IQ4_KT=155, IQ3_KS=156 annotated with Phase 5b-1b tag + row_meta byte sizes. IQ4_KT separated from dormant IQ2/IQ3_KT in table entry. IQ5_K/IQ6_K noted as Phase 5b-2 recon in-flight.
  • v3 (2026-05-24) — Phase 5b-2 S1 landed: IQ5_K=140, IQ6_K=141 annotated with ygg canonical + Phase 5b-2 S1 tag. Phase 5b-1c S1 landed: IQ2_KL=157 annotated with ygg canonical + Phase 5b-1c S1 tag.