Quantization is often assumed to universally reduce energy consumption by lowering memory bandwidth requirements. Systematic benchmarking, however, reveals a more nuanced relationship between quantization and energy efficiency. This guide helps you understand when quantization improves energy efficiency and when it does not.
The default LLM.int8() implementation uses a mixed-precision decomposition scheme (llm_int8_threshold=6.0) that routes outlier features through FP16 while quantizing the remaining features to INT8. This design preserves model accuracy but introduces data-movement overhead from repeated INT8↔FP16 type conversions.
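For intuition, here is a minimal sketch of that decomposition idea. It is an illustration only, not bitsandbytes' fused CUDA kernels, and note that the real library special-cases llm_int8_threshold=0.0 to disable the FP16 path entirely:

```python
import torch

def mixed_precision_matmul(x, w, threshold=6.0):
    """Illustrative sketch. x: [tokens, in_features], w: [in_features, out_features]."""
    # Feature columns whose max magnitude exceeds the threshold are treated
    # as outliers and kept in 16-bit (shown in float here for portability).
    outliers = x.abs().amax(dim=0) > threshold
    y_outlier = x[:, outliers] @ w[outliers, :]
    # Remaining features take the INT8 path: absmax-quantize, integer matmul,
    # dequantize. These quantize/dequantize round trips are the data movement
    # that shows up as energy overhead.
    xn, wn = x[:, ~outliers], w[~outliers, :]
    sx = xn.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127  # per token
    sw = wn.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127  # per output col
    xq = (xn / sx).round().clamp(-127, 127)  # would be torch.int8 in practice
    wq = (wn / sw).round().clamp(-127, 127)
    return y_outlier + (xq @ wq) * sx * sw
```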
Measured impact on energy consumption (RTX 4090D, batch size=1):
| Model | FP16 Energy (J/1k tok) | INT8 Default Energy (J/1k tok) | Energy Change |
|---|---|---|---|
| Yi-1.5-6B | 4,716 | 6,258 | +32.7% |
| Mistral-7B | 5,661 | 7,401 | +30.7% |
| Phi-3-mini (3.8B) | 3,003 | 3,940 | +31.2% |
| Qwen2.5-7B | 5,217 | 6,127 | +17.4% |
The energy overhead is the cost of preserving accuracy. Perplexity measurements confirm the default threshold works as intended:
| Configuration | Perplexity (Yi-1.5-6B) | Δ vs FP16 |
|---|---|---|
| FP16 (baseline) | 11.16 | — |
| INT8 Default (threshold=6.0) | 11.20 | +0.33% |
| INT8 Pure (threshold=0.0) | 14.00 | +25.38% |
Setting llm_int8_threshold=0.0 disables mixed-precision decomposition entirely, forcing all columns through INT8 quantization — including outlier activation channels that INT8 cannot represent accurately. While this eliminates the type conversion overhead, it causes significant accuracy degradation (+25% perplexity increase) that outweighs the marginal energy savings (−3%).
```python
# ✅ Recommended: default threshold preserves accuracy
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
# llm_int8_threshold defaults to 6.0

# ❌ Not recommended for quality-sensitive workloads
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=0.0,  # significant accuracy loss
)
```

If your primary concern is accuracy: use default INT8 (threshold=6.0). The +0.33% perplexity increase is negligible for most applications.
If your primary concern is energy efficiency: consider using FP16 instead of INT8 when GPU memory allows. FP16 avoids the mixed-precision decomposition overhead while maintaining full model accuracy.
If your primary concern is memory: INT8 reduces memory usage by approximately 45% compared to FP16 (e.g., 6.7 GB vs 12.1 GB for Yi-1.5-6B), making it valuable when models need to fit within GPU memory constraints.
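To check the footprint for your own checkpoint (assumes transformers with accelerate and bitsandbytes installed; the model id below is just an example), get_memory_footprint reports parameter and buffer memory:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "01-ai/Yi-1.5-6B"  # example checkpoint; substitute your own

fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print(f"FP16: {fp16_model.get_memory_footprint() / 1e9:.1f} GB")

int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
print(f"INT8: {int8_model.get_memory_footprint() / 1e9:.1f} GB")
```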
For models smaller than approximately 5 billion parameters on fast GPUs, NF4 quantization can increase energy consumption despite reducing memory usage. This occurs because the dequantization compute cost outweighs the memory bandwidth savings when the model already fits comfortably in GPU memory.
Measured impact (RTX 5090, batch size=1):
| Model | FP16 Energy (J/1k tok) | NF4 Energy (J/1k tok) | Energy Change |
|---|---|---|---|
| TinyLlama-1.1B | 1,659 | 2,098 | +26.5% |
| Qwen2-1.5B | 2,411 | 3,120 | +29.4% |
| Qwen2.5-3B | 3,383 | 3,780 | +11.7% |
| Qwen2-7B | 5,509 | 4,878 | −11.4% |
Energy savings from NF4 quantization begin at approximately 5 billion parameters, validated across both RTX 5090 (Blackwell) and RTX 4090D (Ada Lovelace) architectures. For models above this threshold, NF4 consistently reduces energy consumption:
RTX 4090D results (models ≥6B):
| Model | NF4 Energy Change vs FP16 |
|---|---|
| Yi-1.5-6B | −30.2% |
| Mistral-7B | −34.5% |
| Qwen2.5-7B | −32.7% |
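A minimal NF4 loading sketch for models in this range (the checkpoint and compute dtype are placeholders; adjust for your setup):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 weights
    bnb_4bit_compute_dtype=torch.float16,  # dtype for dequantized matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example >=5B checkpoint
    quantization_config=nf4_config,
    device_map="auto",
)
```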
Energy efficiency improves dramatically with larger batch sizes. Single-request inference (batch size=1) wastes significant GPU capacity:
A800 + Mistral-7B + Pure INT8 (threshold=0.0):
| Batch Size | Energy per Request (J) | Δ vs BS=1 | GPU Utilization |
|---|---|---|---|
| 1 | 1,768 | — | 45% |
| 8 | 284 | −84% | 50% |
| 16 | 205 | −88% | 77% |
| 64 | 76 | −96% | 91% |
For production deployments, using batch size ≥8 provides the most significant energy reduction regardless of quantization configuration.
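A sketch of what batched inference looks like with transformers (the prompts, checkpoint, and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # enable padding for batching
tokenizer.padding_side = "left"            # left-pad for decoder-only models

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Serve requests in batches of >=8 to amortize static GPU power draw.
prompts = ["Summarize the following article: ..."] * 8
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```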
Memory-constrained (model doesn't fit in FP16):
- Use NF4 for ≥5B parameter models
- Use INT8 when NF4 is not available or when you need higher accuracy than NF4
Accuracy-first (most production workloads):
- Use default INT8 (threshold=6.0); only +0.33% PPL increase
- Or use FP16 if memory allows
Energy-first (cost-sensitive batch processing):
- Use FP16 when memory allows (avoids INT8 mixed-precision overhead)
- Use NF4 for models ≥5B parameters (best energy efficiency)
- Maximize batch size (BS≥8 gives 84%+ energy reduction vs BS=1)
| Model Size | Recommended for Energy Efficiency |
|---|---|
| < 3B parameters | FP16 (quantization adds overhead on fast GPUs) |
| 3B–5B parameters | FP16 or NF4 (test on your hardware) |
| ≥ 5B parameters | NF4 (consistent energy savings of 30–35%) |
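The table can be encoded as a small helper; pick_quant_config is a hypothetical name, and the thresholds simply restate the findings above, so test on your own hardware:

```python
import torch
from transformers import BitsAndBytesConfig

def pick_quant_config(params_billion: float, fits_in_fp16: bool):
    """Hypothetical helper encoding the table above. Returns None for FP16."""
    if params_billion >= 5:
        # NF4 showed consistent 30-35% energy savings at this scale.
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    if fits_in_fp16:
        # Quantization adds overhead for small models on fast GPUs.
        return None
    # Memory-constrained small model: default INT8 preserves accuracy.
    return BitsAndBytesConfig(load_in_8bit=True)
```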
All measurements were collected using NVML-based power monitoring at 10 Hz sampling rate, with n=10 repetitions per configuration and coefficient of variation < 3%. Hardware platforms: RTX 5090 (Blackwell), RTX 4090D (Ada Lovelace), A800 (Ampere). Perplexity was measured on WikiText-2 (test split).
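The power-sampling approach can be sketched with pynvml (a simplified version; the actual benchmark scripts may differ):

```python
import threading
import time

import pynvml

def measure_energy_joules(fn, device_index=0, hz=10):
    """Run fn() while sampling GPU power at `hz`; returns energy in joules."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            # nvmlDeviceGetPowerUsage reports milliwatts
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(1.0 / hz)

    thread = threading.Thread(target=sampler)
    thread.start()
    start = time.time()
    fn()
    elapsed = time.time() - start
    stop.set()
    thread.join()
    pynvml.nvmlShutdown()
    # Energy is approximated as mean power * elapsed wall time
    return (sum(samples) / max(len(samples), 1)) * elapsed
```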
Full benchmark data, scripts, and interactive dashboard are available at: