Commit 4eefb8f

Merge pull request #342 from maxwbuckley/add-quantize-flag
Add quantize/compile support for ~1.9x GPU speedup
2 parents 1f4a3cc + e664ca8 commit 4eefb8f

6 files changed

Lines changed: 724 additions & 13 deletions

README.md

Lines changed: 40 additions & 0 deletions
@@ -78,6 +78,46 @@ UEFA Nations League => competitions
European Championship => competitions
```

### Quantization and Compilation

Use `quantize=True` and `compile_torch_model=True` for up to ~1.9x faster GPU inference with no measured loss in F1:

```python
from gliner import GLiNER

model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    map_location="cuda",
    quantize=True,  # or "fp16", "bf16"
    compile_torch_model=True,
)
```

Or apply after loading:

```python
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")
model.quantize()          # fp16 half precision (default)
# model.quantize("bf16")  # alternative: bfloat16, better numerical stability, slightly less speedup
model.compile()           # torch.compile with dynamic shapes
```

Benchmarked on CoNLL-2003 (strict F1, `gliner_medium-v2.1`, RTX 5090):

| Condition | F1 | Speedup |
|-----------|:---:|:---:|
| GPU fp32 (baseline) | 0.8107 | 1.00x |
| + quantize | 0.8107 | 1.35x |
| + compile | 0.8107 | 1.31x |
| **+ quantize + compile** | **0.8107** | **1.94x** |

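One way to read the table: if the two optimizations were independent, their individual speedups would multiply. The check below is only a sketch that reuses the figures from the table; it is not part of the benchmark scripts.

```python
# Speedups copied from the table above (CoNLL-2003, gliner_medium-v2.1, RTX 5090).
quantize_only = 1.35
compile_only = 1.31
measured_combined = 1.94

# If the two effects were independent, the combined speedup would be their product.
expected_if_independent = quantize_only * compile_only

print(f"product of individual speedups: {expected_if_independent:.2f}x")
print(f"measured combined speedup:      {measured_combined:.2f}x")
```

The measured combined figure is higher than the simple product (about 1.77x), which suggests the two optimizations compound rather than merely stack. The table alone does not explain why.
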
**Quantization options:**
- `quantize=True` or `quantize="fp16"`: float16 half precision. Best GPU speedup (~1.35x).
- `quantize="bf16"`: bfloat16. Better numerical stability, slightly less speedup (~1.2x).
- `quantize="int8"`: int8 quantization. On CPU, uses the built-in FBGEMM int8 kernels (~1.6x speedup). On GPU, uses [torchao](https://github.com/pytorch/ao) int8 weight-only quantization (~50% memory reduction, no speed gain). Intended for models fine-tuned with quantization-aware training (QAT); stock DeBERTa-based models lose accuracy with int8.
- On CPU, fp16/bf16 quantization reduces memory usage but does not improve speed.

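The guidance in this list can be condensed into a small selection helper. This is only a sketch: `pick_quantize` and its parameters are hypothetical names, not part of GLiNER. It returns a value that could be passed as `quantize=` under the trade-offs described above.

```python
from typing import Optional


def pick_quantize(device: str, prefer_stability: bool = False,
                  qat_finetuned: bool = False) -> Optional[str]:
    """Hypothetical helper: map the README guidance to a `quantize` value."""
    if qat_finetuned:
        # int8 is intended for QAT-fine-tuned models (CPU: faster, GPU: less memory).
        return "int8"
    if device.startswith("cuda"):
        # bf16 trades some speedup for numerical stability; fp16 is fastest.
        return "bf16" if prefer_stability else "fp16"
    # On CPU, fp16/bf16 only save memory, so default to no quantization.
    return None


print(pick_quantize("cuda"))
print(pick_quantize("cuda", prefer_stability=True))
print(pick_quantize("cpu"))
```

Returning `None` for CPU is a design choice of this sketch; a memory-constrained CPU deployment might still prefer fp16 or bf16.
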
**Compilation notes:**
- `compile_torch_model=True` uses [torch.compile](https://pytorch.org/docs/stable/torch.compiler.html), which JIT-compiles the model into [Triton](https://github.com/triton-lang/triton) kernels. The first inference call is slower because it triggers compilation; subsequent calls reuse the compiled graph. This is only available on **Linux and WSL** (not native Windows or macOS).

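Because compilation is limited to Linux and WSL, a caller may want to gate the flag on the platform. A minimal sketch using only the standard library: `can_compile` is a hypothetical helper, and it assumes WSL reports itself as Linux through `platform.system()`.

```python
import platform
from typing import Optional


def can_compile(system: Optional[str] = None) -> bool:
    """Hypothetical check: True on Linux (including WSL), False elsewhere."""
    system = system or platform.system()
    return system == "Linux"


# The result could be passed as `compile_torch_model=can_compile()`.
print(can_compile("Linux"), can_compile("Windows"), can_compile("Darwin"))
```
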
## 👨‍💻 Model Authors
GLiNER was originally developed by:
* [Urchade Zaratiana](urchade.github.io)

benchmarks/bench_int8.py

Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
"""Benchmark int8 vs fp16 vs fp32 across input lengths on GPU and CPU.

Measures latency and model memory footprint. Interleaves conditions within
the same process to avoid warm-cache bias (per CLAUDE.md benchmarking rules).
"""

import gc
import json
import os
import statistics
import time
from datetime import datetime

import torch
from gliner import GLiNER

MODEL_NAME = "urchade/gliner_small-v2.1"
LABELS = ["person", "organization", "location", "date", "event"]
N_REPS = 40
N_WARMUP = 5

# Inputs of increasing length
INPUTS = {
    "short (~20w)": (
        "Elon Musk founded SpaceX in Hawthorne, California in 2002."
    ),
    "medium (~80w)": (
        "The United Nations General Assembly convened in New York City on "
        "September 15, 2024, where Secretary-General Antonio Guterres "
        "addressed delegates from 193 member states. Key topics included "
        "climate change mitigation, the ongoing conflict in Eastern Europe, "
        "and global economic recovery following the pandemic. Representatives "
        "from the European Union, African Union, and ASEAN presented joint "
        "proposals for sustainable development goals. The World Health "
        "Organization also provided updates on disease surveillance programs "
        "across Sub-Saharan Africa and Southeast Asia."
    ),
    "long (~200w)": (
        "In a landmark announcement on March 15, 2024, the European Space "
        "Agency and NASA jointly revealed plans for the Artemis-Europa "
        "collaborative mission, scheduled for launch from Kennedy Space "
        "Center in late 2028. The mission, overseen by project director "
        "Dr. Maria Chen and deputy director Professor James Okafor from "
        "the University of Cambridge, aims to deploy an autonomous "
        "submarine probe beneath the ice crust of Jupiter's moon Europa. "
        "The probe, named Poseidon, was developed by a consortium including "
        "Lockheed Martin, Airbus Defence, and the Japan Aerospace "
        "Exploration Agency. Testing began at the Jet Propulsion Laboratory "
        "in Pasadena in January 2023 and continued at facilities in "
        "Toulouse, France and Tsukuba, Japan. The European Commission has "
        "allocated 2.3 billion euros to the project through the Horizon "
        "Europe framework. Meanwhile, the National Science Foundation "
        "contributed an additional 800 million dollars. Critics from the "
        "Planetary Society and the International Astronomical Union have "
        "raised concerns about contamination protocols. A review panel "
        "chaired by Dr. Sarah Williams of MIT published findings in Nature "
        "Astronomy suggesting the mission's sterilization procedures exceed "
        "those used in the Viking and Curiosity missions. President Biden "
        "praised the initiative during a ceremony at the White House, "
        "calling it a triumph of international cooperation."
    ),
    "very long (~400w)": (
        "The 2024 Global Technology Summit, hosted by the World Economic "
        "Forum in Davos, Switzerland from January 15 to January 19, brought "
        "together over 2,800 leaders from industry, government, and "
        "academia. Microsoft CEO Satya Nadella delivered the opening keynote, "
        "outlining the company's vision for artificial intelligence "
        "integration across enterprise software. Google DeepMind's CEO "
        "Demis Hassabis presented breakthroughs in protein structure "
        "prediction following their AlphaFold 3 release. Tesla and SpaceX "
        "founder Elon Musk participated in a panel discussion on autonomous "
        "systems with Waymo CEO Tekedra Mawakana and General Motors "
        "president Mark Reuss. The European Commission's Executive "
        "Vice-President Margrethe Vestager announced new regulatory "
        "frameworks for AI governance under the EU AI Act, which had been "
        "formally adopted in December 2023. China's Ministry of Science and "
        "Technology sent a delegation led by Minister Yin Hejun, who "
        "presented China's national AI development roadmap through 2030. "
        "Japan's Prime Minister Fumio Kishida announced a 5 billion dollar "
        "investment in semiconductor manufacturing, with new facilities "
        "planned in Kumamoto and Hokkaido in partnership with Taiwan "
        "Semiconductor Manufacturing Company. Samsung Electronics vice "
        "chairman Jay Y. Lee discussed the company's 230 billion dollar "
        "investment plan for chip fabrication plants in Taylor, Texas and "
        "Pyeongtaek, South Korea. The Bill and Melinda Gates Foundation "
        "unveiled a 500 million dollar initiative for AI-powered healthcare "
        "diagnostics in Sub-Saharan Africa, developed in collaboration with "
        "the World Health Organization and Doctors Without Borders. "
        "Stanford University's Institute for Human-Centered AI released "
        "their annual AI Index Report, compiled by researchers including "
        "Professor Fei-Fei Li and Dr. Erik Brynjolfsson. The report "
        "highlighted that global AI investment reached 189 billion dollars "
        "in 2023, with the United States, China, and the United Kingdom "
        "accounting for 75 percent of total spending. OpenAI CEO Sam "
        "Altman and Anthropic CEO Dario Amodei held a joint session on AI "
        "safety research, discussing alignment techniques and the need for "
        "international cooperation on frontier model evaluation. The summit "
        "concluded with the Davos AI Accord, signed by representatives "
        "from 47 nations, establishing shared principles for responsible "
        "AI development and deployment across borders."
    ),
}


def get_model_size_mb(model):
    """Estimate model parameter memory in MB."""
    total = 0
    for p in model.parameters():
        total += p.nelement() * p.element_size()
    for b in model.buffers():
        total += b.nelement() * b.element_size()
    return total / (1024 * 1024)


def get_torchao_model_size_mb(model):
    """Estimate size including torchao quantized tensors."""
    total = 0
    for name, p in model.named_parameters():
        total += p.nelement() * p.element_size()
    for name, b in model.named_buffers():
        total += b.nelement() * b.element_size()
    # torchao int8 stores weights as module attributes, not always as parameters
    for mod in model.modules():
        if hasattr(mod, "weight") and not isinstance(mod.weight, torch.nn.Parameter):
            w = mod.weight
            if hasattr(w, "nelement"):
                total += w.nelement() * w.element_size()
    return total / (1024 * 1024)


def measure_latency(model, text, labels, n_warmup, n_reps):
    """Measure inference latency with warmup, return list of times in ms."""
    for _ in range(n_warmup):
        model.predict_entities(text, labels)

    if model.device.type == "cuda":
        torch.cuda.synchronize()

    times = []
    for _ in range(n_reps):
        if model.device.type == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model.predict_entities(text, labels)
        if model.device.type == "cuda":
            torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1000)
    return times


def run_benchmark(device: str):
    print(f"\n{'='*70}")
    print(f" DEVICE: {device.upper()}")
    print(f" Model: {MODEL_NAME}")
    print(f" Reps: {N_REPS} (warmup: {N_WARMUP})")
    print(f"{'='*70}")

    results = {}

    # --- Load models ---
    conditions = {}

    # fp32
    print("\nLoading fp32 model...")
    conditions["fp32"] = GLiNER.from_pretrained(MODEL_NAME, map_location=device)

    # fp16
    print("Loading fp16 model...")
    conditions["fp16"] = GLiNER.from_pretrained(
        MODEL_NAME, map_location=device, quantize="fp16"
    )

    # int8
    print("Loading int8 model...")
    conditions["int8"] = GLiNER.from_pretrained(
        MODEL_NAME, map_location=device, quantize="int8"
    )

    # --- Memory ---
    print("\n--- Model Size (parameters + buffers) ---")
    for cond_name, model in conditions.items():
        if cond_name == "int8":
            size = get_torchao_model_size_mb(model.model)
        else:
            size = get_model_size_mb(model.model)
        results.setdefault(cond_name, {})["size_mb"] = round(size, 1)
        print(f" {cond_name:>5}: {size:>8.1f} MB")

    # --- Latency per input length ---
    for input_name, text in INPUTS.items():
        word_count = len(text.split())
        print(f"\n--- {input_name} ({word_count} words) ---")
        header = f" {'cond':>5} {'mean':>8} {'median':>8} {'stdev':>8} {'min':>8} {'max':>8}"
        print(header)

        for cond_name, model in conditions.items():
            times = measure_latency(model, text, LABELS, N_WARMUP, N_REPS)
            mean = statistics.mean(times)
            med = statistics.median(times)
            sd = statistics.stdev(times)
            mn = min(times)
            mx = max(times)

            results.setdefault(cond_name, {})[input_name] = {
                "mean_ms": round(mean, 2),
                "median_ms": round(med, 2),
                "stdev_ms": round(sd, 2),
                "min_ms": round(mn, 2),
                "max_ms": round(mx, 2),
                "n": N_REPS,
                "word_count": word_count,
            }
            print(
                f" {cond_name:>5} {mean:>7.2f}ms {med:>7.2f}ms "
                f"{sd:>7.2f}ms {mn:>7.2f}ms {mx:>7.2f}ms"
            )

    # --- Speedup summary ---
    print("\n--- Speedup vs fp32 (median latency) ---")
    header = f" {'input':>20}"
    for cond_name in conditions:
        header += f" {cond_name:>10}"
    print(header)

    for input_name in INPUTS:
        fp32_med = results["fp32"][input_name]["median_ms"]
        row = f" {input_name:>20}"
        for cond_name in conditions:
            med = results[cond_name][input_name]["median_ms"]
            speedup = fp32_med / med
            row += f" {speedup:>9.2f}x"
        print(row)

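    # Cleanup: `del model` inside a loop only unbinds the loop variable while
    # the dict still holds every model, so clear the dict (and the leftover
    # loop variable) before collecting.
    conditions.clear()
    del model
    gc.collect()
    if device == "cuda":
        torch.cuda.empty_cache()

    return results

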
def main():
    all_results = {"timestamp": datetime.now().isoformat(), "model": MODEL_NAME}

    # GPU benchmark
    if torch.cuda.is_available():
        all_results["gpu"] = run_benchmark("cuda")
        gc.collect()
        torch.cuda.empty_cache()

    # CPU benchmark
    all_results["cpu"] = run_benchmark("cpu")

    # Save results
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    outfile = os.path.join(os.path.dirname(__file__), f"bench_int8_{ts}.json")
    with open(outfile, "w") as f:
        json.dump(all_results, f, indent=2)
    print(f"\nResults saved to {outfile}")


if __name__ == "__main__":
    main()
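
The JSON written by `main()` keeps per-condition median latencies, so the speedup summary can be recomputed offline. The sketch below assumes the per-device layout built in `run_benchmark` (`results[condition][input_name]["median_ms"]`, plus a scalar `size_mb` per condition). `speedups_vs_fp32` is a hypothetical helper, and the sample numbers are made up purely to show the shape of the data.

```python
def speedups_vs_fp32(device_results: dict) -> dict:
    """Return {input_name: {condition: fp32_median / condition_median}}."""
    table = {}
    for input_name, stats in device_results["fp32"].items():
        if not isinstance(stats, dict):  # skip scalar entries such as "size_mb"
            continue
        fp32_median = stats["median_ms"]
        table[input_name] = {
            cond: fp32_median / cond_stats[input_name]["median_ms"]
            for cond, cond_stats in device_results.items()
        }
    return table


# Made-up latencies, only to illustrate the expected structure.
sample = {
    "fp32": {"size_mb": 600.0, "short": {"median_ms": 10.0}},
    "fp16": {"size_mb": 300.0, "short": {"median_ms": 8.0}},
}
print(speedups_vs_fp32(sample))
```

With a real results file, the same function could be applied to the `"gpu"` and `"cpu"` entries after `json.load`.
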
