Merged
55 commits
7212142
SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocat…
PMZFX May 14, 2026
610058a
vulkan: fix matmul integer pipeline selection (llama/23005)
0cc4m May 14, 2026
6c50207
ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llam…
alex-spacemit May 14, 2026
f4f2d70
logs : reduce (llama/23021)
ggerganov May 14, 2026
b333d4d
ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040)
ArberSephirotheca May 14, 2026
9b73285
HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880)
JohannesGaessler May 14, 2026
8436fb2
ggml-hexagon: cpy: add contiguous fast-path in reshape copy (llama/23…
pdhinaka May 14, 2026
1e50c6c
llama + spec: MTP Support (llama/22673)
am17an May 16, 2026
130cd40
ggml : bump version to 0.12.0 (ggml/1494)
ggerganov May 16, 2026
2b85f66
ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (…
Dev-X25874 May 21, 2026
bb01f45
ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)
OriPekelman May 21, 2026
fcd6110
vulkan: removed duplicate #include <memory> in headers (llama/23144)
winstonma May 16, 2026
17d6153
vulkan: fuse SSM_CONV + BIAS + SILU (llama/22653)
jeffbolznv May 17, 2026
621cbd8
vulkan: Support unaligned tensors for ROPE (llama/22637)
jeffbolznv May 17, 2026
ac16306
vulkan: add cpy bf16 -> f32 pipelines (llama/22677)
ServeurpersoCom May 17, 2026
7187e1f
ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (llama/22009)
jeeb May 17, 2026
703eda1
CUDA: Continue directly including cuda/iterator (llama/23102)
ORippler May 17, 2026
323bc2d
feat: Support d_conv=15 for ssm-conv.cu (llama/23017)
gabe-l-hart May 17, 2026
6a5a499
sycl: route small f32 matmuls to oneMKL, bypass oneDNN (llama/22150)
aicss-genai May 18, 2026
01bf2af
sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (llama/22156)
aicss-genai May 18, 2026
b94493a
ggml-hexagon: add PAD op HVX kernel (llama/23078)
pdhinaka May 18, 2026
c4631bb
hexagon: add support for TRI op (llama/22822)
pdhinaka May 18, 2026
3d73095
rpc : keep last_graph_uid in the device context (llama/23273)
rgerganov May 19, 2026
092d466
sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (llama/22153)
aicss-genai May 19, 2026
3fbd4a7
ggml-webgpu : extend GDN for K>1 (llama/23299)
reeselevine May 19, 2026
05bf9c4
hexagon: enable support for NORM op (llama/23319)
aparmp-quic May 19, 2026
752744d
hexagon: add MROPE and IMROPE support in HTP rope op (llama/23317)
aparmp-quic May 19, 2026
21612e6
opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (llama/23303)
shaofeiqi May 19, 2026
89f3135
ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (llama/23349)
ravel7524 May 20, 2026
34d3c6b
metal : optimize pad + cpy (llama/23354)
ggerganov May 20, 2026
a34c024
Programmatic Dependent Launch (PDL) for more performance on newer NVI…
aendk May 20, 2026
64bdb60
hexagon: HMX quantized matmul rework (llama/23368)
max-krasnyansky May 20, 2026
e9b7cc8
vulkan: optimize operations in the IM2COL shader (llama/22685)
daniandtheweb May 20, 2026
6ce303b
opencl: refactor backend initilization (llama/23318)
lhez May 20, 2026
3d596af
hexagon: ssm-conv fix for large prompts (llama/23307)
tboinovski1 May 21, 2026
2b98710
ggml : Check the right iface method before using the fallback 2d get …
TheBlueMatt May 21, 2026
10254e3
metal : optimize concat kernel and fix set kernel threads (llama/23411)
ggerganov May 21, 2026
0e74cab
fix(flash-attn): replace f32 with kv_type and q_type (llama/23372)
Constannnnnt May 21, 2026
9c206f7
vulkan: fuse snake activation (mul, sin, sqr, mul, add) (llama/22855)
ServeurpersoCom May 21, 2026
ad494e3
CUDA: fix PDL CC check for JIT compilation (llama/23471)
JohannesGaessler May 21, 2026
f86ab6f
ggml-zendnn : add Q8_0 quantization support (llama/23414)
z-sachin May 22, 2026
e3988d4
SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (lla…
PMZFX May 22, 2026
4e99dde
SYCL : gated_delta_net K>1 (llama/23174)
karavayev May 22, 2026
4dbad75
sycl : Level Zero detection in ggml_sycl_init (llama/23097)
sanmai May 22, 2026
7084cf0
SYCL: improve MoE prefill throughput (llama/23142)
sanmai May 22, 2026
5947753
opencl: generalize Adreno MoE kernels on M (llama/23449)
shawngu-quic May 23, 2026
2282c7f
vulkan: fix windows find_package of SPIRV-Headers (llama/23215)
jeffbolznv May 23, 2026
945fb5f
ggml : Check the right iface method before using the fallback 2d get …
dskwe May 23, 2026
89cb85f
hexagon: apply repl optimization in flash attn softmax as #22993 (lla…
njsyw1997 May 24, 2026
56ee086
opencl: batch profiling to improve speed and prevent memory leaks (ll…
shaofeiqi May 24, 2026
2bb19cb
TP: fix entirely zero-sized slices per device (llama/23525)
JohannesGaessler May 24, 2026
a16e642
ggml : Parallelize quant LUT init (llama/23595)
jeffbolznv May 25, 2026
624bac1
ggml : bump version to 0.12.1 (ggml/1508)
ggerganov May 25, 2026
77ab0a0
sync : ggml
ggerganov May 25, 2026
9ff9972
talk-llama : sync llama.cpp
ggerganov May 25, 2026
27 changes: 19 additions & 8 deletions examples/talk-llama/llama-arch.cpp
@@ -757,14 +757,15 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
     {LLM_TENSOR_INDEXER_PROJ, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_INDEXER_ATTN_K, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_INDEXER_ATTN_Q_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    // NextN/MTP tensors are currently ignored (reserved for future MTP support)
-    // These tensors only exist in the last layer(s) and are treated as output tensors
-    {LLM_TENSOR_NEXTN_EH_PROJ, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_NEXTN_EMBED_TOKENS, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_GET_ROWS}},
-    {LLM_TENSOR_NEXTN_ENORM, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_GET_ROWS}},
-    {LLM_TENSOR_NEXTN_HNORM, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL}},
-    {LLM_TENSOR_NEXTN_SHARED_HEAD_HEAD, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_NEXTN_SHARED_HEAD_NORM, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL}},
+    // NextN/MTP tensors are stored per-block (blk.%d.nextn.*) even though only the
+    // last nextn_predict_layers blocks carry them. Classify as LAYER_REPEATING so
+    // the model loader doesn't fault on the block index.
+    {LLM_TENSOR_NEXTN_EH_PROJ, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
+    {LLM_TENSOR_NEXTN_EMBED_TOKENS, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_GET_ROWS}},
+    {LLM_TENSOR_NEXTN_ENORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
+    {LLM_TENSOR_NEXTN_HNORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
+    {LLM_TENSOR_NEXTN_SHARED_HEAD_HEAD, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
+    {LLM_TENSOR_NEXTN_SHARED_HEAD_NORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
     // Nemotron 3 Super
     {LLM_TENSOR_FFN_LATENT_DOWN, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
     {LLM_TENSOR_FFN_LATENT_UP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
@@ -877,6 +878,16 @@ bool llm_arch_is_diffusion(const llm_arch & arch) {
     }
 }
 
+bool llm_arch_supports_rs_rollback(const llm_arch & arch) {
+    switch (arch) {
+        case LLM_ARCH_QWEN35:
+        case LLM_ARCH_QWEN35MOE:
+            return true;
+        default:
+            return false;
+    }
+}
+
 bool llm_arch_supports_sm_tensor(const llm_arch & arch) {
     switch (arch) {
         case LLM_ARCH_GROK:
1 change: 1 addition & 0 deletions examples/talk-llama/llama-arch.h
@@ -637,3 +637,4 @@ bool llm_arch_is_recurrent (const llm_arch & arch);
 bool llm_arch_is_hybrid (const llm_arch & arch);
 bool llm_arch_is_diffusion (const llm_arch & arch);
 bool llm_arch_supports_sm_tensor(const llm_arch & arch);
+bool llm_arch_supports_rs_rollback(const llm_arch & arch);
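The new predicate has the same opt-in shape as the other `llm_arch_*` helpers declared in this header: a `switch` over the architecture enum that returns `true` for an explicit allow-list and `false` for everything else. The following is a minimal, self-contained sketch of that pattern only. The enum `llm_arch_sketch` and the function `arch_supports_rs_rollback_sketch` are hypothetical stand-ins, not part of llama.cpp, and the sketch makes no claim about what rollback itself does.

```cpp
// Hypothetical, trimmed-down stand-in for llm_arch (the real enum has many more entries).
enum llm_arch_sketch {
    ARCH_SKETCH_LLAMA,
    ARCH_SKETCH_QWEN35,
    ARCH_SKETCH_QWEN35MOE,
};

// Same shape as llm_arch_supports_rs_rollback() in the diff:
// listed architectures opt in, every other value falls through to false.
constexpr bool arch_supports_rs_rollback_sketch(llm_arch_sketch arch) {
    switch (arch) {
        case ARCH_SKETCH_QWEN35:
        case ARCH_SKETCH_QWEN35MOE:
            return true;
        default:
            return false;
    }
}

// Compile-time checks of the allow-list behaviour.
static_assert(arch_supports_rs_rollback_sketch(ARCH_SKETCH_QWEN35), "listed arch opts in");
static_assert(!arch_supports_rs_rollback_sketch(ARCH_SKETCH_LLAMA), "unlisted arch defaults to false");
```

With this shape, a newly added architecture reports no support until it is listed explicitly, so the default is the conservative one.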
8 changes: 4 additions & 4 deletions examples/talk-llama/llama-chat.cpp
@@ -73,7 +73,7 @@ static const std::map<std::string, llm_chat_template> LLM_CHAT_TEMPLATES = {
     { "hunyuan-moe", LLM_CHAT_TEMPLATE_HUNYUAN_MOE },
     { "gpt-oss", LLM_CHAT_TEMPLATE_OPENAI_MOE },
     { "hunyuan-dense", LLM_CHAT_TEMPLATE_HUNYUAN_DENSE },
-    { "hunyuan-ocr", LLM_CHAT_TEMPLATE_HUNYUAN_OCR },
+    { "hunyuan-vl", LLM_CHAT_TEMPLATE_HUNYUAN_VL },
     { "kimi-k2", LLM_CHAT_TEMPLATE_KIMI_K2 },
     { "seed_oss", LLM_CHAT_TEMPLATE_SEED_OSS },
     { "grok-2", LLM_CHAT_TEMPLATE_GROK_2 },
@@ -218,7 +218,7 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
     } else if (tmpl_contains("<|start|>") && tmpl_contains("<|channel|>")) {
         return LLM_CHAT_TEMPLATE_OPENAI_MOE;
     } else if (tmpl_contains("<|hy_Assistant|>") && tmpl_contains("<|hy_begin▁of▁sentence|>")) {
-        return LLM_CHAT_TEMPLATE_HUNYUAN_OCR;
+        return LLM_CHAT_TEMPLATE_HUNYUAN_VL;
     } else if (tmpl_contains("<|hy_Assistant|>") && tmpl_contains("<|hy_place▁holder▁no▁3|>")) {
         return LLM_CHAT_TEMPLATE_HUNYUAN_DENSE;
     } else if (tmpl_contains("<|im_assistant|>assistant<|im_middle|>")) {
@@ -825,8 +825,8 @@ int32_t llm_chat_apply_template(
                 ss << "<|hy_User|>" << chat[i]->content << "<|hy_Assistant|>";
             }
         }
-    } else if (tmpl == LLM_CHAT_TEMPLATE_HUNYUAN_OCR) {
-        // tencent/HunyuanOCR
+    } else if (tmpl == LLM_CHAT_TEMPLATE_HUNYUAN_VL) {
+        // tencent/HunyuanOCR & tencent/HunyuanVL
         ss << "<|hy_begin▁of▁sentence|>";
         for (size_t i = 0; i < chat.size(); i++) {
             std::string role(chat[i]->role);
2 changes: 1 addition & 1 deletion examples/talk-llama/llama-chat.h
@@ -53,7 +53,7 @@ enum llm_chat_template {
     LLM_CHAT_TEMPLATE_HUNYUAN_MOE,
     LLM_CHAT_TEMPLATE_OPENAI_MOE,
     LLM_CHAT_TEMPLATE_HUNYUAN_DENSE,
-    LLM_CHAT_TEMPLATE_HUNYUAN_OCR,
+    LLM_CHAT_TEMPLATE_HUNYUAN_VL,
     LLM_CHAT_TEMPLATE_KIMI_K2,
     LLM_CHAT_TEMPLATE_SEED_OSS,
     LLM_CHAT_TEMPLATE_GROK_2,
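The rename leaves the detection logic in `llm_chat_detect_template` unchanged: the renamed template is still recognised by two marker substrings, and it is checked before the dense Hunyuan template, which shares the `<|hy_Assistant|>` marker. The sketch below illustrates that marker-based, order-dependent detection for just these two templates. The names `chat_template_sketch`, `contains_sketch`, and `detect_template_sketch` are hypothetical, and `std::string_view` is used here (instead of the `std::string` in the diff) only so the checks can run at compile time.

```cpp
#include <string_view>

// Hypothetical stand-in for llm_chat_template, limited to the two Hunyuan cases.
enum chat_template_sketch {
    TEMPLATE_SKETCH_UNKNOWN,
    TEMPLATE_SKETCH_HUNYUAN_VL,
    TEMPLATE_SKETCH_HUNYUAN_DENSE,
};

// True when `needle` occurs anywhere in `haystack`.
constexpr bool contains_sketch(std::string_view haystack, std::string_view needle) {
    return haystack.find(needle) != std::string_view::npos;
}

// Checks run in order, as in the diff: the VL branch is tested before the dense one.
constexpr chat_template_sketch detect_template_sketch(std::string_view tmpl) {
    if (contains_sketch(tmpl, "<|hy_Assistant|>") && contains_sketch(tmpl, "<|hy_begin▁of▁sentence|>")) {
        return TEMPLATE_SKETCH_HUNYUAN_VL;
    }
    if (contains_sketch(tmpl, "<|hy_Assistant|>") && contains_sketch(tmpl, "<|hy_place▁holder▁no▁3|>")) {
        return TEMPLATE_SKETCH_HUNYUAN_DENSE;
    }
    return TEMPLATE_SKETCH_UNKNOWN;
}

// A template carrying both secondary markers resolves to the first matching branch.
static_assert(detect_template_sketch("<|hy_begin▁of▁sentence|><|hy_place▁holder▁no▁3|><|hy_Assistant|>")
              == TEMPLATE_SKETCH_HUNYUAN_VL, "first matching branch wins");
```

Because the branches are evaluated in order, a template that happens to contain both secondary markers is classified by whichever check comes first.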