
Feature Request: Add MTP / speculative decoding support to llama-bench for PR #22673 #22947

@Ajay9o9

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

The draft MTP PR #22673 adds speculative decoding / MTP support to:

llama-cli
llama-server

using flags such as:
--spec-type mtp
--spec-draft-n-max 3
--spec-draft-ngl 99

These flags work as expected in:

llama-cli
llama-server

(the only two I have tested)

but are unrecognized by llama-bench.
Example:

./build/bin/llama-bench \
  -m model.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3,4,5 \
  --spec-draft-ngl 99,100

returns:

error: invalid parameter for argument: --spec-type

Motivation

Currently, benchmarking MTP requires custom scripts or a llama-server setup, since the speculative decoding flags are unavailable in llama-bench.

Adding support directly to llama-bench would make it much easier for users to quickly test and compare speculative decoding configurations (different n-max values, context depths, throughput scaling, and VRAM tradeoffs) using the built-in benchmarking tooling.

Possible Implementation

I did some digging, and it looks like llama-bench uses its own argument parser and benchmark loop rather than the shared common argument infrastructure used by llama-cli and llama-server, which is likely why the speculative decoding flags are unavailable there. Supporting them would mean adding the flags to llama-bench's parser and threading the values through its benchmark configurations.

Metadata

Labels: enhancement (New feature or request)
