perf: EXL3 performance tuning on GeForce Blackwell by AlpinDale · Pull Request #1652 · dphnAI/aphrodite-engine

AlpinDale · 2026-04-27T18:58:20Z

Shape adjustments and kernel work for sm_120 (GeForce Blackwell). Results:

Trinity-Nano-Preview-4.0bpw:

Context	main tok/s	PR tok/s	upstream EXL3 tok/s	PR vs main	PR vs EXL3
0	124.61	190.48	115.06	+52.9%	+65.5%
256	119.49	180.18	115.15	+50.8%	+56.5%
512	115.56	170.91	114.28	+47.9%	+49.6%
1024	107.55	154.84	114.90	+44.0%	+34.8%
2048	95.24	130.15	111.20	+36.7%	+17.0%
4096	95.01	129.77	110.28	+36.6%	+17.7%
8192	93.96	128.33	108.49	+36.6%	+18.3%
16384	92.83	126.18	107.28	+35.9%	+17.6%
32512	90.24	121.59	103.65	+34.7%	+17.3%

Qwen3-0.6B-4.0bpw:

Context	main tok/s	PR tok/s	upstream EXL3 tok/s	PR vs main	PR vs EXL3
0	341.77	409.24	326.00	+19.7%	+25.5%
256	339.78	407.39	317.59	+19.9%	+28.3%
512	335.70	405.03	315.23	+20.7%	+28.5%
1024	370.11	395.70	287.63	+6.9%	+37.6%
2048	345.01	370.00	278.49	+7.2%	+32.9%
4096	318.14	337.84	274.87	+6.2%	+22.9%
8192	264.08	286.83	233.59	+8.6%	+22.8%
16384	212.45	218.89	162.50	+3.0%	+34.7%
32512	135.60	151.04	125.47	+11.4%	+20.4%
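
The percentage columns in both tables appear to be the relative throughput change, (PR / baseline - 1) × 100, rounded to one decimal place. A minimal sketch of that arithmetic follows; the helper name `speedup_pct` is hypothetical and not part of the PR.

```cpp
#include <cmath>

// Hypothetical helper (not from the PR): relative throughput change in
// percent, rounded to one decimal place, as the tables above seem to report.
inline double speedup_pct(double pr_tok_s, double base_tok_s) {
    const double ratio = pr_tok_s / base_tok_s;          // e.g. 190.48 / 124.61
    return std::round((ratio - 1.0) * 1000.0) / 10.0;    // keep one decimal
}
```

For the first Trinity-Nano row, `speedup_pct(190.48, 124.61)` reproduces the "PR vs main" entry and `speedup_pct(190.48, 115.06)` the "PR vs EXL3" entry.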

Signed-off-by: AlpinDale <alpindale@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a878380265

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-27T19:02:56Z

+  if (K == 4 && size_k == 1024 && size_n == 256) {
+    *shape_idx = 4;
+    *num_sms = 32;


Do not select shape 4 for n=256 Blackwell override

select_blackwell_gemm_override forces shape_idx = 4 when K == 4, size_k == 1024, and size_n == 256. Shape 4 uses a 512-wide N tile (EXL3_GEMM_TILESIZE_N), so exl3_gemm_kernel_inner computes tiles_n = size_n / TILESIZE_N = 256 / 512 = 0 under integer division and exits without producing any output tiles. On Blackwell, this override can therefore return incorrect or empty GEMM results for that projection shape rather than only changing performance.
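
The zero-tile case comes from integer division truncating toward zero. The sketch below illustrates it with two hypothetical helpers, `tiles_n_for` and `shape_fits_n`, which are not functions from the PR. It also assumes that a shape is only usable when size_n is a whole multiple of the tile width, which may be stricter than what the real kernel requires.

```cpp
#include <cassert>

// Hypothetical helper: number of whole N tiles for a given tile width.
// Integer division truncates, so size_n < tile_n yields 0 tiles.
inline int tiles_n_for(int size_n, int tile_n) {
    assert(tile_n > 0);
    return size_n / tile_n;
}

// Hypothetical guard (assumption: the shape must divide evenly into at
// least one tile before an override may select it).
inline bool shape_fits_n(int size_n, int tile_n) {
    return tile_n > 0 && size_n >= tile_n && size_n % tile_n == 0;
}
```

With the values from the comment, `tiles_n_for(256, 512)` is 0 and `shape_fits_n(256, 512)` is false, so a selector that checks the guard first would fall through to its default shape instead of launching a kernel with no tiles.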

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-27T19:02:56Z

+                    get_mgemm_kernel_ptr(K, override_shape_idx, c_fp32, cb),
+                    override_shape_idx, override_num_sms,
+                    exl3_gemm_blockdim[override_shape_idx]};
+      _tuning_cache[key] = tr;


Include size_m when caching Blackwell-specific overrides

This override path is explicitly conditioned on size_m == 1, but the cached entry it writes is keyed only by size_k/size_n/cc/K/c_fp32, so later calls with size_m > 1 can incorrectly reuse single-token tuning results. That bypasses the intended size_m guard and can regress non-decode workloads after the cache is warmed by decode shapes.
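
One way to address this is to make the decode condition part of the cache key. The following is a sketch under stated assumptions: `TuningResult`, `TuningKey`, and `make_key` are hypothetical names, the real cached struct also holds a kernel pointer, and bucketing size_m into "decode" versus "everything else" is an illustrative choice rather than what the PR does.

```cpp
#include <map>
#include <tuple>

// Hypothetical, simplified tuning result (the real one also stores a kernel pointer).
struct TuningResult {
    int shape_idx;
    int num_sms;
};

// Assumed key layout: (is_decode, size_k, size_n, cc, K, c_fp32).
using TuningKey = std::tuple<bool, int, int, int, int, bool>;

// Bucket size_m into decode (size_m == 1) vs. non-decode so that entries
// written by the size_m == 1 override are never reused for larger batches.
inline TuningKey make_key(int size_m, int size_k, int size_n,
                          int cc, int K, bool c_fp32) {
    return TuningKey(size_m == 1, size_k, size_n, cc, K, c_fp32);
}
```

Keying on the boolean rather than on the raw size_m keeps the cache from growing with every distinct batch size while still separating the single-token override from other workloads.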

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af26558dfd

chatgpt-codex-connector · 2026-04-27T19:18:21Z

+  } else if (cc == CC_BLACKWELL && K == 4 && size_m == 1 && size_k == 1024 &&
+             size_n == 256 && bszm_out <= 32) {
+    shape_idx = 4;


Stop forcing shape 4 when it yields zero tiles

For the Blackwell override path (K == 4, size_m == 1, size_k == 1024, size_n == 256), hard-coding shape_idx = 4 selects a kernel with TILESIZE_N = 512, so this shape has zero N tiles. In select_exl3_mgemm_kernel, that drives max_slices to 0 and sets num_sms to 0; exl3_mgemm_gr then computes total_sms / num_sms, which divides by zero. On this decode shape the override is therefore a correctness bug, not a pure performance tweak.
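
A selector can avoid the zero divisor by rejecting any candidate shape that yields no tiles before num_sms is derived. The sketch below is illustrative only: `Selection` and `try_select` are hypothetical, and treating max_slices as simply the N tile count is an assumption, since the real select_exl3_mgemm_kernel logic is not shown in this thread.

```cpp
#include <algorithm>

// Hypothetical selection record, not the PR's actual type.
struct Selection {
    int shape_idx;
    int num_sms;
};

// Fills `out` and returns true only if the candidate shape produces at least
// one N tile; otherwise leaves `out` untouched so the caller can fall back.
// Assumption: max_slices is approximated here by the N tile count.
inline bool try_select(int shape_idx, int tile_n, int size_n,
                       int total_sms, Selection* out) {
    if (tile_n <= 0 || total_sms <= 0) return false;
    const int tiles_n = size_n / tile_n;      // 256 / 512 == 0 for shape 4
    if (tiles_n == 0) return false;           // would make num_sms == 0
    out->shape_idx = shape_idx;
    out->num_sms = std::min(tiles_n, total_sms);
    return true;
}
```

Because a successful call always leaves num_sms at 1 or more, a later total_sms / num_sms is well defined; when the call fails, the caller keeps whatever default selection it already had.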

AlpinDale added 3 commits April 27, 2026 21:47

perf: tune EXL3 GEMM selection for Blackwell

b9a3930

Signed-off-by: AlpinDale <alpindale@gmail.com>

perf: expand EXL3 Blackwell GEMM overrides

3672282

Signed-off-by: AlpinDale <alpindale@gmail.com>

perf: tune EXL3 MoE mgemm on Blackwell

a878380

Signed-off-by: AlpinDale <alpindale@gmail.com>

chatgpt-codex-connector Bot reviewed Apr 27, 2026

perf: scope EXL3 Blackwell tuning to MoE mgemm

af26558

Signed-off-by: AlpinDale <alpindale@gmail.com>

chatgpt-codex-connector Bot reviewed Apr 27, 2026

perf: tune EXL3 dense mgemm on Blackwell

1b29391

Signed-off-by: AlpinDale <alpindale@gmail.com>

AlpinDale changed the title ~~perf: EXL3 MoE performance on GeForce Blackwell~~ perf: EXL3 performance tuning on GeForce Blackwell Apr 27, 2026

AlpinDale added 2 commits April 28, 2026 00:19

perf: reduce EXL3 MoE decode overhead

32897de

Signed-off-by: AlpinDale <alpindale@gmail.com>

perf: fuse EXL3 MoE gate up projection

5b0b2d1

Signed-off-by: AlpinDale <alpindale@gmail.com>

AlpinDale merged commit 3917cd8 into main Apr 27, 2026
1 check passed

AlpinDale deleted the perf/exl3-blackwell-selector branch April 27, 2026 20:18

AlpinDale mentioned this pull request Apr 28, 2026

perf: exl3 decode kernel optimization experiments #1655

Merged

AlpinDale merged 7 commits into main from perf/exl3-blackwell-selector
