Prerequisites
Feature Description
The draft MTP PR #22673 adds speculative decoding / MTP support to:
llama-cli
llama-server
using flags such as:
--spec-type mtp
--spec-draft-n-max 3
--spec-draft-ngl 99
These flags work correctly in:
llama-cli
llama-server
(I have only tested these two)
but they are unrecognized in llama-bench.
Example:
./build/bin/llama-bench \
  -m model.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3,4,5 \
  --spec-draft-ngl 99,100
returns:
error: invalid parameter for argument: --spec-type
Motivation
Currently, benchmarking MTP requires custom scripts or llama-server setups, since the speculative decoding flags are unavailable in llama-bench.
Adding support directly to llama-bench would make it much easier for users to quickly test and compare speculative decoding configurations such as different n-max values, context depths, throughput scaling, and VRAM tradeoffs using the built-in benchmarking tooling.
Possible Implementation
I did some digging, and it looks like llama-bench currently uses its own argument parser and benchmark loop instead of the shared common argument infrastructure used by llama-cli and llama-server, which is likely why the speculative decoding flags are unavailable there.