
perf: exl3 decode kernel optimization experiments #1655

Merged
AlpinDale merged 2 commits into main from perf/exl3-decode-kernel-experiments on Apr 28, 2026
@AlpinDale AlpinDale commented Apr 27, 2026

Collaborator

Fixed a regression from #1652

Trinity-Nano-Preview-4.0bpw:

| Context | main tok/s | PR tok/s | upstream EXL3 tok/s | PR vs main | PR vs EXL3 |
|---|---|---|---|---|---|
| 0 | 190.48 | 145.30 | 115.06 | -23.7% | +26.3% |
| 256 | 180.18 | 139.23 | 115.15 | -22.7% | +20.9% |
| 512 | 170.91 | 133.50 | 114.28 | -21.9% | +16.8% |
| 1024 | 154.84 | 123.44 | 114.90 | -20.3% | +7.4% |
| 2048 | 130.15 | 107.99 | 111.20 | -17.0% | -2.9% |
| 4096 | 129.77 | 107.87 | 110.28 | -16.9% | -2.2% |
| 8192 | 128.33 | 106.76 | 108.49 | -16.8% | -1.6% |
| 16384 | 126.18 | 105.15 | 107.28 | -16.7% | -2.0% |
| 32512 | 121.59 | 102.09 | 103.65 | -16.0% | -1.5% |
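
For readers checking the delta columns: the figures are consistent with a plain relative change, `(PR - baseline) / baseline`. The sketch below recomputes the first row (context 0) under that assumption; `pct_delta` is a hypothetical helper name, not something from the PR.

```python
def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


# Throughput values copied from the context-0 row of the table.
main_tps, pr_tps, exl3_tps = 190.48, 145.30, 115.06

# Assumption: the "PR vs main" and "PR vs EXL3" columns are relative changes
# rounded to one decimal place.
pr_vs_main = round(pct_delta(pr_tps, main_tps), 1)
pr_vs_exl3 = round(pct_delta(pr_tps, exl3_tps), 1)

print(f"PR vs main: {pr_vs_main:+.1f}%  PR vs EXL3: {pr_vs_exl3:+.1f}%")
```

The same formula applied to the other rows reproduces their percentages as well, for example -17.0% and -2.9% at context 2048.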

Signed-off-by: AlpinDale <alpindale@gmail.com>
@AlpinDale AlpinDale force-pushed the perf/exl3-decode-kernel-experiments branch from d872c50 to 6fe5c1c Compare April 27, 2026 23:22
@AlpinDale AlpinDale changed the title Perf/exl3 decode kernel experiments perf: exl3 decode kernel optimization experiments Apr 27, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fe5c1c6ea

Comment thread aphrodite/config/model.py
Comment on lines +672 to +676
self.quantization == "exl3"
and isinstance(requested_dtype, str)
and requested_dtype.lower() == "auto"
and self.dtype != torch.float16
and "moe" in self.hf_config.model_type.lower()

[P2] Use structural MoE detection for EXL3 fp16 fallback

The new EXL3 auto-dtype override is gated by `"moe" in self.hf_config.model_type.lower()`, which misses valid MoE models whose `model_type` does not contain that substring (for example, `MixtralForCausalLM` maps to `mixtral` in the model registry). In those cases `quantization=exl3` with `dtype=auto` will keep bf16 and skip this new fp16 default, causing inconsistent decode behavior and performance across MoE architectures. This condition should use a structural check like `self.is_moe` or `get_num_experts()` instead of name matching.
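
A minimal sketch of what such a structural check could look like, not the project's actual implementation. `count_experts` and `wants_exl3_fp16_default` are hypothetical names, the list of expert-count attributes is an assumed and non-exhaustive set of fields found on some Hugging Face MoE configs, and dtypes are plain strings here so the example runs without `torch`.

```python
from types import SimpleNamespace

# Assumption: expert-count field names seen on some Hugging Face MoE configs.
# Illustrative only; a real check would need to cover every supported model.
_EXPERT_COUNT_ATTRS = ("num_local_experts", "num_experts", "n_routed_experts")


def count_experts(hf_config) -> int:
    """Return the first positive expert count found on the config, else 0."""
    for attr in _EXPERT_COUNT_ATTRS:
        value = getattr(hf_config, attr, None)
        if isinstance(value, int) and value > 0:
            return value
    return 0


def wants_exl3_fp16_default(quantization, requested_dtype, resolved_dtype, hf_config) -> bool:
    """Mirror the reviewed condition, with the name match replaced by a structural test."""
    return (
        quantization == "exl3"
        and isinstance(requested_dtype, str)
        and requested_dtype.lower() == "auto"
        and resolved_dtype != "float16"  # stand-in for torch.float16
        and count_experts(hf_config) > 0
    )


# Stand-in configs: a Mixtral-like MoE config and a dense one.
mixtral_like = SimpleNamespace(model_type="mixtral", num_local_experts=8)
dense_like = SimpleNamespace(model_type="llama")

print(wants_exl3_fp16_default("exl3", "auto", "bfloat16", mixtral_like))
print(wants_exl3_fp16_default("exl3", "auto", "bfloat16", dense_like))
```

The Mixtral-like config passes the structural test even though its `model_type` has no "moe" substring, which is the case the name-based condition misses.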

Signed-off-by: AlpinDale <alpindale@gmail.com>
@AlpinDale AlpinDale merged commit 6c59bc7 into main Apr 28, 2026
1 check failed
@AlpinDale AlpinDale deleted the perf/exl3-decode-kernel-experiments branch April 28, 2026 01:15