Name and Version
./llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 59386 MiB):
Device 0: Thor, compute capability 10.1, VMM: yes, VRAM: 59386 MiB (50310 MiB free)
version: 0 (unknown)
built with GNU 13.3.0 for Linux aarch64
Operating systems
Other? (Please let us know in description), Linux
GGML backends
CUDA
Hardware
ThorU
Models
No response
Problem description & steps to reproduce
Speculative decoding is not supported when loading this model with llama-server. During slot initialization the server logs "common_speculative_is_compat: the target context does not support partial sequence removal" and then "srv load_model: speculative decoding not supported by this context" (see the relevant log output below).
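A minimal sketch of the kind of invocation that triggers this, assuming a target and a draft model are both passed to llama-server; the model paths are placeholders, and the flags only mirror the settings visible in the log (131072 context, q8_0 KV cache):

./llama-server -m target-model.gguf -md draft-model.gguf -c 131072 --cache-type-k q8_0 --cache-type-v q8_0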
First Bad Commit
No response
Relevant log output
common_init_result: added logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 1632.00 MiB
llama_kv_cache: size = 1632.00 MiB (131072 cells, 12 layers, 1/1 seqs), K (q8_0): 816.00 MiB, V (q8_0): 816.00 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 75.38 MiB
llama_memory_recurrent: size = 75.38 MiB ( 1 cells, 48 layers, 1 seqs), R (f32): 3.38 MiB, S (f32): 72.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: CUDA0 compute buffer size = 420.01 MiB
sched_reserve: CUDA_Host compute buffer size = 264.01 MiB
sched_reserve: graph nodes = 5013
sched_reserve: graph splits = 2
sched_reserve: reserve took 135.03 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 1
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
srv load_model: prompt cache is enabled, size limit: 2048 MiB
srv load_model: use --cache-ram 0 to disable the prompt cache
srv load_model: for more info see #16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>