Mapping the Efficiency Landscape of Small Language Models
Fabian Reichwald, Lukas Schiesser, Christiane Plociennik, Leonhard Kunz, Simon Pukrop, Martin Ruskowski, Oliver Thomas
Large language models (LLMs) dominate both everyday and specialized applications, but their high computational demand, energy consumption, and privacy risks are increasingly critiqued. Small language models (SLMs) mitigate these drawbacks and are gaining momentum in scenarios where full LLM capabilities are not required, such as agents, industrial systems, or edge devices. Nevertheless, a systematic comparison of model capabilities, energy usage, and scaling behavior has not been conducted yet. We evaluate 70+ SLMs from 2023–2025 on five task-specific benchmarks and compare them with two popular LLMs, revealing key trade-offs between energy, performance, and model selection. Our findings challenge common assumptions: First, smaller models are not automatically more efficient, and energy increases do not guarantee performance gains. Second, newer SLMs show clear improvements in performance–energy trade-offs, though the progress begins to plateau. Last, the efficiency landscape forms a clear Pareto frontier: initial energy increases yield substantial gains, but the last percentage points of performance need orders of magnitude more energy. These results highlight diminishing returns of scaling and emphasize the need for informed, task-aware model selection rather than size-driven choices.
International Joint Conference on Artificial Intelligence (IJCAI) 2026
.
├── requirements.txt
└── scripts/
    ├── 1_load_benchmark_data.py
    ├── 2_1_inference_SLM.py
    ├── 2_2_inference_vLLM.py
    ├── 3_1_evaluation.py
    ├── 3_2_evaluation_summary.py
    └── utils/
requirements.txt lists the Python dependencies.
scripts/1_load_benchmark_data.py downloads and samples the benchmark datasets, then writes them into a unified CSV format.
scripts/2_1_inference_SLM.py runs Hugging Face model inference and records outputs, runtime metadata, and CodeCarbon energy measurements.
scripts/2_2_inference_vLLM.py runs inference through a vLLM server and records outputs, runtime metadata, and CodeCarbon energy measurements.
scripts/3_1_evaluation.py scores model outputs against the benchmark references.
scripts/3_2_evaluation_summary.py aggregates evaluated result files into compact metric summaries.
scripts/utils/ contains shared helpers, the evaluated SLM list, model release dates, and benchmark-specific scoring logic for XSum, MMLU-Redux, GSM8K, Berkeley Function Calling, BigCodeBench, and CoNLL-2003.
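
The Pareto frontier described in the abstract can be illustrated with a short sketch. This is not taken from the repository's scripts: the `Result` type, the `pareto_frontier` function, the model names, and all numbers below are hypothetical and chosen only to show the idea. A model is on the frontier if no other model reaches at least the same score with at most the same energy.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """Hypothetical summary row: one model on one benchmark."""
    model: str
    energy_kwh: float  # lower is better
    score: float       # higher is better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result beats on both axes.

    Exact ties (same energy and same score) keep only the first entry.
    """
    # Walk the models from cheapest to most expensive; on equal energy,
    # look at the higher-scoring model first.
    ordered = sorted(results, key=lambda r: (r.energy_kwh, -r.score))
    frontier: list[Result] = []
    best_score = float("-inf")
    for r in ordered:
        # A more expensive model only belongs on the frontier if it
        # improves on the best score seen so far.
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


# Made-up values for illustration only, not measurements from the paper.
toy_results = [
    Result("model-a", 0.010, 0.40),
    Result("model-b", 0.020, 0.62),
    Result("model-c", 0.030, 0.55),  # more energy than model-b, lower score
    Result("model-d", 0.200, 0.70),
    Result("model-e", 2.000, 0.73),
    Result("model-f", 0.015, 0.35),  # more energy than model-a, lower score
]

for r in pareto_frontier(toy_results):
    print(f"{r.model}: {r.energy_kwh} kWh, score {r.score}")
```

In this toy data, moving from model-d to model-e costs ten times the energy for three additional points of score, which mirrors the diminishing returns the abstract describes.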