Commit 4e1e852

committed
update readme
1 parent 28ed59f commit 4e1e852

1 file changed

Lines changed: 108 additions & 2 deletions

README.md

@@ -145,7 +145,7 @@ Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepS
145145
## What is GPT-QModel?
146146
GPT-QModel is a production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF Transformers, vLLM, and SGLang.
147147

148-
GPT-QModel currently supports GPTQ, AWQ, QQQ, GPTAQ, EoRa, GAR, with more quantization methods and enhancements planned.
148+
GPT-QModel currently supports GPTQ, AWQ, QQQ, GGUF, FP8, EXL3, GPTAQ, EoRA, and GAR, with more quantization methods and enhancements planned.
149149

150150
## Quantization Support
151151

@@ -155,16 +155,21 @@ GPT-QModel is a modular design supporting multiple quantization methods and feat
155155
|---------------------------|------------|---|---|---|---------------|
156156
| GPTQ ||||||
157157
| AWQ ||||||
158+
| GGUF || x | x | x | x |
159+
| FP8 || x | x | x | x |
160+
| Exllama V3 / EXL3 || x | x | x | x |
158161
| EoRA ||||| x |
159162
| Group Aware Act Reordering ||||||
160163
| QQQ || x | x | x | x |
161164
| Rotation || x | x | x | x |
162165
| GPTAQ ||||||
163166

167+
`GGUF`, `FP8`, and `EXL3` are currently native GPT-QModel quantization/runtime paths. `vLLM` and `SGLang` integration currently targets `GPTQ` and `AWQ`.
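
The split described in this note can be captured as a small lookup, which may help when deciding where a quantized checkpoint can be served. This is only a sketch: `SUPPORTED_FORMATS`, `can_serve`, and `engines_for` are hypothetical names, not part of the GPT-QModel API, and the table encodes only what the note states.

```python
# Hypothetical helper, not part of GPT-QModel: encodes the note above.
SUPPORTED_FORMATS = {
    "gptqmodel": {"gptq", "awq", "gguf", "fp8", "exl3"},  # native paths
    "vllm": {"gptq", "awq"},
    "sglang": {"gptq", "awq"},
}


def can_serve(engine: str, fmt: str) -> bool:
    """Return True if `engine` is listed as supporting quantization format `fmt`."""
    return fmt.lower() in SUPPORTED_FORMATS.get(engine.lower(), set())


def engines_for(fmt: str) -> list[str]:
    """List the engines that can serve `fmt`, in alphabetical order."""
    return sorted(e for e in SUPPORTED_FORMATS if can_serve(e, fmt))


print(engines_for("awq"))
print(engines_for("gguf"))
```

Keeping the mapping in one place makes it easy to update if vLLM or SGLang integration later extends beyond `GPTQ` and `AWQ`.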
168+
164169
## Features
165170
* ✨ Native integration with HF [Transformers](https://github.com/huggingface/transformers), [Optimum](https://github.com/huggingface/optimum), and [Peft](https://github.com/huggingface/peft)
166171
* 🚀 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized models with format = `FORMAT.[GPTQ/AWQ]`
167-
* ✨ GPTQ, AWQ, and QQQ quantization format with hardware-accelerated inference kernels.
172+
* ✨ GPTQ, AWQ, QQQ, GGUF, FP8, and EXL3 quantization support.
168173
* 🚀 Quantize MoE models with ease even with extreme routing activation bias via `Moe.Routing` and/or `FailSafe`.
169174
* 🚀 Data Parallelism for an 80%+ reduction in quantization time with Multi-GPU.
170175
* 🚀 Optimized for Python >= 3.13t (free threading) with lock-free threading.
@@ -177,6 +182,15 @@ GPT-QModel is a modular design supporting multiple quantization methods and feat
177182
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) optimized tile based inference.
178183
* 💯 CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
179184

185+
## Who's Using GPT-QModel?
186+
187+
Selected public references where teams or companies explicitly mention `GPTQModel` in documentation, integration notes, or quantized model usage. This is not an exhaustive customer list.
188+
189+
* <img src="https://cdn.simpleicons.org/huggingface/FFD21E" alt="Hugging Face logo" height="14"> Hugging Face
190+
* <img src="https://cdn.simpleicons.org/intel/0071C5" alt="Intel logo" height="14"> Intel
191+
* <img src="https://cdn.simpleicons.org/nvidia/76B900" alt="NVIDIA logo" height="14"> NVIDIA
192+
* <img src="https://cdn.simpleicons.org/alibabacloud/FF6A00" alt="Alibaba Cloud logo" height="14"> Alibaba Cloud
193+
180194

181195
## Quality: GPTQ 4bit can match native BF16:
182196
🤗 [ModelCloud quantized Vortex models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)
@@ -289,6 +303,76 @@ model.quantize(calibration_dataset, batch_size=1)
289303
model.save(quant_path)
290304
```
291305

306+
#### Other Quantization Formats
307+
308+
`GPTQ`, `AWQ`, and `EXL3` are calibration-based and require a calibration dataset. `GGUF` and `FP8` are data-free weight-only formats: the weights are quantized directly, so they should be quantized with `calibration=None`.
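
To illustrate why the weight-only formats need no calibration data, here is a deliberately simplified sketch of data-free quantization: each block of weights is rounded to signed 4-bit integers using a scale derived from the weights alone. It is not the GGUF K-quant layout and not GPT-QModel's implementation; `quantize_block` and `dequantize_block` are hypothetical helpers.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Symmetric 4-bit quantization of one block using only the weights."""
    amax = max(abs(w) for w in weights)
    # Map the largest magnitude to +/-7; fall back to 1.0 for an all-zero block.
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate weights from the scale and the 4-bit codes."""
    return [scale * c for c in codes]


block = [0.12, -0.70, 0.33, 0.05, -0.21, 0.64, -0.48, 0.00]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, max_error)
```

Calibration-based methods such as GPTQ additionally use activations from sample data to decide how to round, which is why they take a calibration dataset.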
309+
310+
##### GGUF Example: Llama 3.2 1B Instruct
311+
312+
```py
313+
from gptqmodel import BACKEND, GGUFConfig, GPTQModel
314+
315+
model_id = "meta-llama/Llama-3.2-1B-Instruct"
316+
quant_path = "Llama-3.2-1B-Instruct-GGUF-Q4_K_M"
317+
318+
qcfg = GGUFConfig(
319+
    bits=4,
320+
    format="q_k_m",
321+
)
322+
323+
model = GPTQModel.load(model_id, qcfg)
324+
model.quantize(calibration=None, backend=BACKEND.GGUF_TORCH)
325+
model.save(quant_path)
326+
```
327+
328+
##### FP8 Example: Llama 3.2 1B Instruct
329+
330+
```py
331+
from gptqmodel import BACKEND, GPTQModel, QuantizeConfig
332+
333+
model_id = "meta-llama/Llama-3.2-1B-Instruct"
334+
quant_path = "Llama-3.2-1B-Instruct-FP8-E4M3"
335+
336+
qcfg = QuantizeConfig(
337+
    method="fp8",
338+
    format="float8_e4m3fn",  # or "float8_e5m2"
339+
    bits=8,
340+
    weight_scale_method="row",
341+
)
342+
343+
model = GPTQModel.load(model_id, qcfg)
344+
model.quantize(calibration=None, backend=BACKEND.TORCH)
345+
model.save(quant_path)
346+
```
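
As a rough picture of what a row-wise weight scale means, the sketch below computes one scale per output row so that the row's largest magnitude maps to the largest finite value of the target FP8 type (448 for `float8_e4m3fn`, 57344 for `float8_e5m2`). Whether `weight_scale_method="row"` is computed exactly this way inside GPT-QModel is an assumption; `row_scales` is a hypothetical helper, and the final rounding to 8-bit floats is omitted.

```python
# Largest finite values of the two FP8 types named in the example above.
FP8_MAX = {"float8_e4m3fn": 448.0, "float8_e5m2": 57344.0}


def row_scales(weight: list[list[float]], fmt: str = "float8_e4m3fn") -> list[float]:
    """One scale per row: the row's absolute maximum divided by the FP8 maximum."""
    fp8_max = FP8_MAX[fmt]
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        # An all-zero row gets a neutral scale to avoid dividing by zero later.
        scales.append(amax / fp8_max if amax > 0 else 1.0)
    return scales


weight = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
scales = row_scales(weight)
# Dividing by the scale brings every row into the representable FP8 range.
scaled = [[v / s for v in row] for row, s in zip(weight, scales)]
print(scales)
```

Per-row scales let a row of small weights use the full FP8 range even when another row contains much larger values.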
347+
348+
##### Exllama V3 / EXL3 Example: Llama 3.2 1B Instruct
349+
350+
```py
351+
from datasets import load_dataset
352+
from gptqmodel import BACKEND, GPTQModel, QuantizeConfig
353+
354+
model_id = "meta-llama/Llama-3.2-1B-Instruct"
355+
quant_path = "Llama-3.2-1B-Instruct-EXL3"
356+
357+
calibration_dataset = load_dataset(
358+
    "allenai/c4",
359+
    data_files="en/c4-train.00001-of-01024.json.gz",
360+
    split="train",
361+
).select(range(1024))["text"]
362+
363+
qcfg = QuantizeConfig(
364+
    method="exl3",
365+
    format="exl3",
366+
    bits=4.0,  # target average bits-per-weight
367+
    head_bits=6.0,  # optional higher bitrate for attention heads / sensitive tensors
368+
    codebook="mcg",  # one of: mcg, mul1, 3inst
369+
)
370+
371+
model = GPTQModel.load(model_id, qcfg)
372+
model.quantize(calibration_dataset, batch_size=1, backend=BACKEND.EXLLAMA_V3)
373+
model.save(quant_path)
374+
```
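
Because `bits` is a target average bits-per-weight, the size of the quantized weights can be estimated with simple arithmetic. The sketch below uses made-up parameter counts purely for illustration (they are not measurements of Llama 3.2 1B), assumes `head_bits` applies to one separately budgeted group of tensors, and ignores scales, embeddings, and other overhead. The helper names are hypothetical.

```python
def quantized_size_bytes(groups: dict[str, tuple[int, float]]) -> float:
    """Sum parameter_count * bits_per_weight / 8 over all tensor groups."""
    return sum(count * bits / 8 for count, bits in groups.values())


def average_bpw(groups: dict[str, tuple[int, float]]) -> float:
    """Overall bits per weight across all groups, weighted by parameter count."""
    total_bits = sum(count * bits for count, bits in groups.values())
    total_params = sum(count for count, _ in groups.values())
    return total_bits / total_params


# Illustrative numbers only: (parameter count, bits per weight).
groups = {
    "body": (1_000_000_000, 4.0),
    "head": (250_000_000, 6.0),
}
size_mb = quantized_size_bytes(groups) / 1e6
print(f"{size_mb:.1f} MB at {average_bpw(groups):.2f} bits per weight overall")
```

With these assumed counts, giving a quarter-billion parameters 6.0 bits raises the overall average from 4.0 to 4.4 bits per weight.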
375+
292376
#### MoE Quantization
293377

294378
Some MoE (mixture of experts) models route tokens to their `experts` extremely unevenly. As a result, some expert modules receive close to zero activated tokens during calibration, and calibration-based quantization (GPTQ/AWQ) cannot complete for them.
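
The failure mode is easy to reproduce with a toy router. The sketch below simulates heavily biased top-1 routing over a calibration batch and flags experts that receive too few tokens to calibrate. The routing weights, the threshold, and the helper names are hypothetical and unrelated to the internals of `Moe.Routing` or `FailSafe`.

```python
import random
from collections import Counter


def tokens_per_expert(num_tokens: int, routing_weights: list[int], seed: int = 0) -> list[int]:
    """Simulate top-1 routing and count how many tokens each expert receives."""
    rng = random.Random(seed)
    experts = range(len(routing_weights))
    picks = rng.choices(experts, weights=routing_weights, k=num_tokens)
    counts = Counter(picks)
    return [counts.get(e, 0) for e in experts]


def starved_experts(counts: list[int], min_tokens: int) -> list[int]:
    """Return the indices of experts that saw fewer than `min_tokens` tokens."""
    return [e for e, n in enumerate(counts) if n < min_tokens]


# Five experts with a strongly skewed router; the last expert is never selected.
counts = tokens_per_expert(num_tokens=1024, routing_weights=[600, 300, 99, 1, 0])
print(counts, starved_experts(counts, min_tokens=16))
```

An expert flagged here has too few activations for a calibration-based method to estimate its statistics, which is the situation the routing and fallback options are meant to handle.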
@@ -489,6 +573,17 @@ Models quantized by GPT-QModel are inference compatible with HF Transformers (mi
489573
year={2023}
490574
}
491575
576+
# GGUF / llama.cpp
577+
@misc{ggerganov2023gguf,
578+
author = {Georgi Gerganov and ggml-org contributors},
579+
title = {llama.cpp and the GGUF model format},
580+
publisher = {GitHub},
581+
journal = {GitHub repository},
582+
howpublished = {\url{https://github.com/ggml-org/llama.cpp}},
583+
note = {Canonical GGUF implementation and format reference; see also \url{https://github.com/ggml-org/llama.cpp/wiki/dev-notes}},
584+
year = {2023}
585+
}
586+
492587
# EoRA
493588
@article{liu2024eora,
494589
title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
@@ -528,6 +623,17 @@ Models quantized by GPT-QModel are inference compatible with HF Transformers (mi
528623
journal={arXiv preprint arXiv:2406.09904},
529624
year={2024}
530625
}
626+
627+
# ExLlama V3 / EXL3
628+
@misc{turboderp2026exllamav3,
629+
author = {turboderp and exllamav3 contributors},
630+
title = {ExLlamaV3 and the EXL3 quantization format},
631+
publisher = {GitHub},
632+
journal = {GitHub repository},
633+
howpublished = {\url{https://github.com/turboderp-org/exllamav3}},
634+
note = {Project repository and EXL3 format documentation: \url{https://github.com/turboderp-org/exllamav3/blob/master/doc/exl3.md}},
635+
year = {2026}
636+
}
531637
```
532638

533639
## Quick Notes
