Gemma4 + MTP drop rate of tg/s #24264

sswtodo · 2026-06-07T13:47:28Z

sswtodo
Jun 7, 2026

Hi,

Thank you @am17an and everyone for your hard work on getting MTP working with Gemma 4 - #23398 . It increases tg/s significantly. However, I’ve found that for agentic workloads the tg/s drops, and the slowdown is 100% reproducible.

To reproduce:
Start llama-server with Gemma 4 + it-assistant-MTP as usual, then paste the file into the chat server (I used the web browser UI).

Test model:

LLM: unsloth/Gemma4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
MTP: google it-assistant converted to Q4

Test file:

https://raw.githubusercontent.com/ggml-org/llama.cpp/refs/heads/master/src/llama-context.cpp

1.1 Without MTP:

0.39.508.761 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  45684, progress = 0.98, t =  22.14 s / 2062.99 tokens per second
0.40.025.134 I slot create_check: id  0 | task 0 | created context checkpoint 2 of 32 (pos_min = 43636, pos_max = 45683, n_tokens = 45684, size = 200.012 MiB)
0.40.077.564 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  46704, progress = 1.00, t =  22.71 s / 2056.24 tokens per second
0.40.804.646 I slot create_check: id  0 | task 0 | created context checkpoint 3 of 32 (pos_min = 44656, pos_max = 46703, n_tokens = 46704, size = 200.012 MiB)
0.41.909.832 I slot print_timing: id  0 | task 0 | n_decoded =    100, tg =  94.26 t/s
0.44.912.065 I slot print_timing: id  0 | task 0 | n_decoded =    383, tg =  94.26 t/s
0.47.913.929 I slot print_timing: id  0 | task 0 | n_decoded =    666, tg =  94.27 t/s
0.50.736.897 I slot print_timing: id  0 | task 0 | prompt eval time =   23484.67 ms / 46708 tokens (    0.50 ms per token,  1988.87 tokens per second)
0.50.736.900 I slot print_timing: id  0 | task 0 |        eval time =    9888.00 ms /   928 tokens (   10.66 ms per token,    93.85 tokens per second)
0.50.736.901 I slot print_timing: id  0 | task 0 |       total time =   33372.68 ms / 47636 tokens
0.50.736.901 I slot print_timing: id  0 | task 0 |    graphs reused =        922
0.50.737.862 I slot      release: id  0 | task 0 | stop processing: n_tokens = 47635, truncated = 0

1.2 With MTP

0.35.491.204 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  45684, progress = 0.98, t =  23.59 s / 1936.31 tokens per second
0.35.606.999 I slot create_check: id  0 | task 0 | created context checkpoint 2 of 32 (pos_min = 43636, pos_max = 45683, n_tokens = 45684, size = 200.012 MiB)
0.36.317.010 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =  46704, progress = 1.00, t =  24.42 s / 1912.60 tokens per second
0.36.430.337 I slot create_check: id  0 | task 0 | created context checkpoint 3 of 32 (pos_min = 44656, pos_max = 46703, n_tokens = 46704, size = 200.012 MiB)
0.37.906.392 I slot print_timing: id  0 | task 0 | n_decoded =    101, tg =  70.55 t/s
0.40.921.544 I slot print_timing: id  0 | task 0 | n_decoded =    313, tg =  70.39 t/s
0.43.927.332 I slot print_timing: id  0 | task 0 | n_decoded =    531, tg =  71.25 t/s
0.46.945.275 I slot print_timing: id  0 | task 0 | n_decoded =    744, tg =  71.06 t/s
0.49.960.693 I slot print_timing: id  0 | task 0 | n_decoded =    954, tg =  70.74 t/s
0.51.083.708 I slot print_timing: id  0 | task 0 | prompt eval time =   24576.86 ms / 46708 tokens (    0.53 ms per token,  1900.49 tokens per second)
0.51.083.711 I slot print_timing: id  0 | task 0 |        eval time =   14608.91 ms /  1037 tokens (   14.09 ms per token,    70.98 tokens per second)
0.51.083.711 I slot print_timing: id  0 | task 0 |       total time =   39185.76 ms / 47745 tokens
0.51.083.711 I slot print_timing: id  0 | task 0 |    graphs reused =        580
0.51.083.712 I slot print_timing: id  0 | task 0 | draft acceptance = 0.77265 (  452 accepted /   585 generated)
0.51.083.733 I statistics        draft-mtp: #calls(b,g,a) =    1    585    585, #gen drafts =    585, #acc drafts =   452, #gen tokens =    585, #acc tokens =   452, dur(b,g,a) = 0.001, 3968.542, 0.265 ms
0.51.084.646 I slot      release: id  0 | task 0 | stop processing: n_tokens = 47745, truncated = 0

It appears that the decoding path diverges from the one used when generating long, high‑entropy narratives.

sswtodo · 2026-06-07T17:08:42Z

sswtodo
Jun 7, 2026
Author

Small fix proposal as preliminary groundwork to improve MTP on Gemma 4

PR#24270

0 replies

sswtodo · 2026-06-07T18:42:36Z

sswtodo
Jun 7, 2026
Author

Here is PR #24277 that addresses and fixes the described issue

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gemma4 + MTP drop rate of tg/s #24264

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Gemma4 + MTP drop rate of tg/s #24264

Uh oh!

Uh oh!

sswtodo Jun 7, 2026

Replies: 2 comments

Uh oh!

sswtodo Jun 7, 2026 Author

Uh oh!

sswtodo Jun 7, 2026 Author

sswtodo
Jun 7, 2026

sswtodo
Jun 7, 2026
Author

sswtodo
Jun 7, 2026
Author