
perf: EXL3 performance tuning on GeForce Blackwell #1652

Merged
AlpinDale merged 7 commits into main from perf/exl3-blackwell-selector on Apr 27, 2026
@AlpinDale commented Apr 27, 2026

Collaborator

Shape-selection adjustments and kernel work for sm_120. Results:

Trinity-Nano-Preview-4.0bpw:

| Context | main tok/s | PR tok/s | upstream EXL3 tok/s | PR vs main | PR vs EXL3 |
|---|---|---|---|---|---|
| 0 | 124.61 | 190.48 | 115.06 | +52.9% | +65.5% |
| 256 | 119.49 | 180.18 | 115.15 | +50.8% | +56.5% |
| 512 | 115.56 | 170.91 | 114.28 | +47.9% | +49.6% |
| 1024 | 107.55 | 154.84 | 114.90 | +44.0% | +34.8% |
| 2048 | 95.24 | 130.15 | 111.20 | +36.7% | +17.0% |
| 4096 | 95.01 | 129.77 | 110.28 | +36.6% | +17.7% |
| 8192 | 93.96 | 128.33 | 108.49 | +36.6% | +18.3% |
| 16384 | 92.83 | 126.18 | 107.28 | +35.9% | +17.6% |
| 32512 | 90.24 | 121.59 | 103.65 | +34.7% | +17.3% |

Qwen3-0.6B-4.0bpw:

| Context | main tok/s | PR tok/s | upstream EXL3 tok/s | PR vs main | PR vs EXL3 |
|---|---|---|---|---|---|
| 0 | 341.77 | 409.24 | 326.00 | +19.7% | +25.5% |
| 256 | 339.78 | 407.39 | 317.59 | +19.9% | +28.3% |
| 512 | 335.70 | 405.03 | 315.23 | +20.7% | +28.5% |
| 1024 | 370.11 | 395.70 | 287.63 | +6.9% | +37.6% |
| 2048 | 345.01 | 370.00 | 278.49 | +7.2% | +32.9% |
| 4096 | 318.14 | 337.84 | 274.87 | +6.2% | +22.9% |
| 8192 | 264.08 | 286.83 | 233.59 | +8.6% | +22.8% |
| 16384 | 212.45 | 218.89 | 162.50 | +3.0% | +34.7% |
| 32512 | 135.60 | 151.04 | 125.47 | +11.4% | +20.4% |

Signed-off-by: AlpinDale <alpindale@gmail.com>
Signed-off-by: AlpinDale <alpindale@gmail.com>
Signed-off-by: AlpinDale <alpindale@gmail.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a878380265

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +43 to +45
if (K == 4 && size_k == 1024 && size_n == 256) {
    *shape_idx = 4;
    *num_sms = 32;

P1: Do not select shape 4 for n=256 Blackwell override

`select_blackwell_gemm_override` forces `shape_idx = 4` when `K == 4`, `size_k == 1024`, and `size_n == 256`, but shape 4 uses a 512-wide N tile (`EXL3_GEMM_TILESIZE_N`). As a result, `exl3_gemm_kernel_inner` computes `tiles_n = size_n / TILESIZE_N = 256 / 512 = 0` under integer division and exits without producing any output tiles. On Blackwell, this override can return incorrect or empty GEMM results for that projection shape rather than only changing performance.
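
The arithmetic behind this comment can be checked in isolation. The following is a minimal sketch, not code from the PR: `count_tiles_n` and `shape_covers_n` are hypothetical helper names, and the only value taken from the review is the 512-wide N tile for shape 4.

```cpp
#include <cassert>

// Number of whole N tiles, using the same truncating integer division
// that the review comment describes (size_n / TILESIZE_N).
constexpr int count_tiles_n(int size_n, int tilesize_n) {
    return size_n / tilesize_n;
}

// Hypothetical guard (not from the PR): a shape is usable for a given
// size_n only if it yields at least one tile and divides size_n exactly.
constexpr bool shape_covers_n(int size_n, int tilesize_n) {
    return tilesize_n > 0 && size_n >= tilesize_n && size_n % tilesize_n == 0;
}

// size_n = 256 with a 512-wide tile truncates to zero tiles.
static_assert(count_tiles_n(256, 512) == 0, "256 / 512 truncates to 0");
static_assert(!shape_covers_n(256, 512), "shape 4 cannot cover n = 256");
static_assert(shape_covers_n(512, 512), "one full tile is fine");
```

A selector could call a check like `shape_covers_n` before accepting an override and fall back to its default choice when the check fails. Whether the merged fix took this form is not stated in the conversation.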


get_mgemm_kernel_ptr(K, override_shape_idx, c_fp32, cb),
override_shape_idx, override_num_sms,
exl3_gemm_blockdim[override_shape_idx]};
_tuning_cache[key] = tr;

P2: Include size_m when caching Blackwell-specific overrides

This override path is explicitly conditioned on `size_m == 1`, but the cache entry it writes is keyed only by `size_k`, `size_n`, `cc`, `K`, and `c_fp32`, so later calls with `size_m > 1` can incorrectly reuse single-token tuning results. That bypasses the intended `size_m` guard and can regress non-decode workloads once decode shapes have warmed the cache.
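
One way to picture the suggested fix is a cache key that carries `size_m` alongside the fields the comment lists. This is a sketch under assumptions: `TuningKey`, `TuningResult`, `TuningCache`, and `lookup` are illustrative names, and the real `_tuning_cache` key type is not shown in this conversation.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical key: the fields named in the review comment, plus size_m.
struct TuningKey {
    int size_m, size_k, size_n, cc, K;
    bool c_fp32;

    // Lexicographic ordering over all fields so std::map can use the key.
    bool operator<(const TuningKey& o) const {
        return std::tie(size_m, size_k, size_n, cc, K, c_fp32) <
               std::tie(o.size_m, o.size_k, o.size_n, o.cc, o.K, o.c_fp32);
    }
};

// Hypothetical cached value; the real struct holds more (kernel pointer, block dim).
struct TuningResult {
    int shape_idx;
    int num_sms;
};

using TuningCache = std::map<TuningKey, TuningResult>;

// Returns the cached entry for exactly this key, or nullptr on a miss.
inline const TuningResult* lookup(const TuningCache& cache, const TuningKey& key) {
    auto it = cache.find(key);
    return it == cache.end() ? nullptr : &it->second;
}
```

With `size_m` in the key, an entry written for a decode call (`size_m == 1`) is a miss for a later call with `size_m == 8`, so the second call runs the normal selection path. Keying on a boolean such as `size_m == 1` would be a coarser alternative with the same effect for this override.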


Signed-off-by: AlpinDale <alpindale@gmail.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af26558dfd

Comment on lines +165 to +167
} else if (cc == CC_BLACKWELL && K == 4 && size_m == 1 && size_k == 1024 &&
           size_n == 256 && bszm_out <= 32) {
    shape_idx = 4;

P1: Stop forcing shape 4 when it yields zero tiles

For the Blackwell override path (`K == 4`, `size_m == 1`, `size_k == 1024`, `size_n == 256`), hard-coding `shape_idx = 4` selects a kernel with `TILESIZE_N = 512`, so this shape has zero N tiles. In `select_exl3_mgemm_kernel`, that drives `max_slices` to 0 and sets `num_sms` to 0. `exl3_mgemm_gr` then computes `total_sms / num_sms`, which is a divide-by-zero on this decode shape rather than a pure performance tweak.
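
The failure mode described here is an integer division by a selector output that can be zero. A defensive sketch, assuming only that the caller divides `total_sms` by `num_sms` as the comment states (`safe_sm_quotient` is a hypothetical helper, not a function in the PR):

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper: computes total_sms / num_sms, but reports failure
// instead of dividing when the selector produced a non-positive num_sms.
inline std::optional<int> safe_sm_quotient(int total_sms, int num_sms) {
    if (num_sms <= 0) {
        return std::nullopt;  // caller should reject the shape or fall back
    }
    return total_sms / num_sms;
}
```

A guard of this kind only masks the symptom. The root cause named in the comment is selecting a shape whose N tile is wider than `size_n`, so the selector still needs to avoid that shape.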


Signed-off-by: AlpinDale <alpindale@gmail.com>
@AlpinDale changed the title from "perf: EXL3 MoE performance on GeForce Blackwell" to "perf: EXL3 performance tuning on GeForce Blackwell" on Apr 27, 2026
Signed-off-by: AlpinDale <alpindale@gmail.com>
Signed-off-by: AlpinDale <alpindale@gmail.com>
@AlpinDale merged commit 3917cd8 into main on Apr 27, 2026
1 check passed
@AlpinDale deleted the perf/exl3-blackwell-selector branch on April 27, 2026 20:18