Name and Version
./llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 59386 MiB):
Device 0: Thor, compute capability 10.1, VMM: yes, VRAM: 59386 MiB (50310 MiB free)
version: 0 (unknown)
built with GNU 13.3.0 for Linux aarch64
Operating systems
Other? (Please let us know in description), Linux
GGML backends
CUDA
Hardware
ThorU
Models
No response
Problem description & steps to reproduce
Speculative decoding is not supported when loading this model with llama-server. During slot initialization the server logs "common_speculative_is_compat: the target context does not support partial sequence removal" and then "srv load_model: speculative decoding not supported by this context" (see the relevant log output below).
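A minimal sketch of the kind of invocation that triggers this, assuming a target and a draft model are both passed to llama-server; the model paths are placeholders, and the flags only mirror the settings visible in the log (131072 context, q8_0 KV cache):

./llama-server -m target-model.gguf -md draft-model.gguf -c 131072 --cache-type-k q8_0 --cache-type-v q8_0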
First Bad Commit
No response
Relevant log output
common_init_result: added logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 1632.00 MiB
llama_kv_cache: size = 1632.00 MiB (131072 cells, 12 layers, 1/1 seqs), K (q8_0): 816.00 MiB, V (q8_0): 816.00 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 75.38 MiB
llama_memory_recurrent: size = 75.38 MiB ( 1 cells, 48 layers, 1 seqs), R (f32): 3.38 MiB, S (f32): 72.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: CUDA0 compute buffer size = 420.01 MiB
sched_reserve: CUDA_Host compute buffer size = 264.01 MiB
sched_reserve: graph nodes = 5013
sched_reserve: graph splits = 2
sched_reserve: reserve took 135.03 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 1
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
srv load_model: prompt cache is enabled, size limit: 2048 MiB
srv load_model: use --cache-ram 0 to disable the prompt cache
srv load_model: for more info see #16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>