README.md (108 additions & 2 deletions)
@@ -145,7 +145,7 @@ Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepS
## What is GPT-QModel?
GPT-QModel is a production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF Transformers, vLLM, and SGLang.
-GPT-QModel currently supports GPTQ, AWQ, QQQ, GPTAQ, EoRa, GAR, with more quantization methods and enhancements planned.
+GPT-QModel currently supports GPTQ, AWQ, QQQ, GGUF, FP8, EXL3, GPTAQ, EoRa, and GAR, with more quantization methods and enhancements planned.
## Quantization Support
@@ -155,16 +155,21 @@ GPT-QModel is a modular design supporting multiple quantization methods and feat
`GGUF`, `FP8`, and `EXL3` are currently native GPT-QModel quantization/runtime paths. `vLLM` and `SGLang` integration currently targets `GPTQ` and `AWQ`.
## Features
* ✨ Native integration with HF [Transformers](https://github.com/huggingface/transformers), [Optimum](https://github.com/huggingface/optimum), and [Peft](https://github.com/huggingface/peft)
* 🚀 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized models with format = `FORMAT.[GPTQ/AWQ]`
-* ✨ GPTQ, AWQ, and QQQ quantization format with hardware-accelerated inference kernels.
* 🚀 Quantize MoE models with ease even with extreme routing activation bias via `Moe.Routing` and/or `FailSafe`.
* 🚀 Data parallelism across multiple GPUs for an 80%+ reduction in quantization time.
* 🚀 Optimized for Python >= 3.13t (free threading) with lock-free threading.
@@ -177,6 +182,15 @@ GPT-QModel is a modular design supporting multiple quantization methods and feat
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) optimized tile based inference.
* 💯 CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
+## Who's Using GPT-QModel?
+Selected public references where teams or companies explicitly mention `GPTQModel` in documentation, integration notes, or quantized model usage. This is not an exhaustive customer list.
+* <img src="https://cdn.simpleicons.org/huggingface/FFD21E" alt="Hugging Face logo" height="14"> Hugging Face
Some MoE (Mixture of Experts) models route tokens to their `experts` very unevenly. As a result, some expert modules receive close to zero activated tokens during calibration and cannot complete calibration-based quantization (GPTQ/AWQ).
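
The routing imbalance described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not call any GPT-QModel API, and `route_tokens`, `starved_experts`, the routing weights, and the `min_tokens` threshold are all hypothetical names and values chosen for illustration.

```python
import random
from collections import Counter


def route_tokens(num_tokens, expert_weights, top_k=2, seed=0):
    """Simulate a biased router (hypothetical helper, not a GPT-QModel API).

    Each token is sent to `top_k` distinct experts, sampled in proportion
    to `expert_weights`. Returns a Counter of tokens seen per expert.
    """
    rng = random.Random(seed)
    experts = list(range(len(expert_weights)))
    # Start every expert at zero so unused experts still appear in the result.
    counts = Counter({e: 0 for e in experts})
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=expert_weights, k=1)[0])
        counts.update(chosen)
    return counts


def starved_experts(counts, min_tokens):
    """Return experts that saw fewer than `min_tokens` calibration tokens."""
    return sorted(e for e, n in counts.items() if n < min_tokens)


if __name__ == "__main__":
    # Assumed, heavily biased routing: experts 4 and 5 are never selected.
    weights = [50, 30, 15, 5, 0, 0]
    counts = route_tokens(num_tokens=1000, expert_weights=weights)
    for expert, seen in sorted(counts.items()):
        print(f"expert {expert}: {seen} tokens")
    print("starved experts:", starved_experts(counts, min_tokens=16))
```

Experts that land in the starved list are the ones a calibration-based method would have little or no activation data for, which is the situation the paragraph above describes.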
@@ -489,6 +573,17 @@ Models quantized by GPT-QModel are inference compatible with HF Transformers (mi
year={2023}
}
+# GGUF / llama.cpp
+@misc{ggerganov2023gguf,
+author = {Georgi Gerganov and ggml-org contributors},