perf: exl3 decode kernel optimization experiments by AlpinDale · Pull Request #1655 · dphnAI/aphrodite-engine

AlpinDale · 2026-04-27T23:22:16Z

Fixed a regression from #1652

Trinity-Nano-Preview-4.0bpw:

| Context | main tok/s | PR tok/s | upstream EXL3 tok/s | PR vs main | PR vs EXL3 |
|---:|---:|---:|---:|---:|---:|
| 0 | 190.48 | 145.30 | 115.06 | -23.7% | +26.3% |
| 256 | 180.18 | 139.23 | 115.15 | -22.7% | +20.9% |
| 512 | 170.91 | 133.50 | 114.28 | -21.9% | +16.8% |
| 1024 | 154.84 | 123.44 | 114.90 | -20.3% | +7.4% |
| 2048 | 130.15 | 107.99 | 111.20 | -17.0% | -2.9% |
| 4096 | 129.77 | 107.87 | 110.28 | -16.9% | -2.2% |
| 8192 | 128.33 | 106.76 | 108.49 | -16.8% | -1.6% |
| 16384 | 126.18 | 105.15 | 107.28 | -16.7% | -2.0% |
| 32512 | 121.59 | 102.09 | 103.65 | -16.0% | -1.5% |
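
The two percentage columns appear to be the relative change in decode throughput, (PR - baseline) / baseline. A minimal sketch that recomputes them from the tok/s columns for a few rows of the table above; `pct_change` is a hypothetical helper for illustration and is not part of this PR:

```python
def pct_change(new: float, baseline: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (context, main tok/s, PR tok/s, upstream EXL3 tok/s), copied from the table.
rows = [
    (0, 190.48, 145.30, 115.06),
    (2048, 130.15, 107.99, 111.20),
    (32512, 121.59, 102.09, 103.65),
]

for ctx, main, pr, exl3 in rows:
    vs_main = pct_change(pr, main)
    vs_exl3 = pct_change(pr, exl3)
    print(f"{ctx:>6}  PR vs main {vs_main:+.1f}%  PR vs EXL3 {vs_exl3:+.1f}%")
```

Under this reading, a negative value means the PR decodes fewer tokens per second than the baseline it is compared against.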

Signed-off-by: AlpinDale <alpindale@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fe5c1c6ea


chatgpt-codex-connector · 2026-04-27T23:27:13Z

+            self.quantization == "exl3"
+            and isinstance(requested_dtype, str)
+            and requested_dtype.lower() == "auto"
+            and self.dtype != torch.float16
+            and "moe" in self.hf_config.model_type.lower()


Use structural MoE detection for EXL3 fp16 fallback

The new EXL3 auto-dtype override is gated by "moe" in self.hf_config.model_type.lower(), which misses valid MoE models whose model_type does not contain that substring (for example, MixtralForCausalLM maps to mixtral in the model registry). In those cases quantization=exl3 with dtype=auto will keep bf16 and skip this new fp16 default, causing inconsistent decode behavior/performance across MoE architectures. This condition should use a structural check like self.is_moe/get_num_experts() instead of name matching.
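
A minimal sketch of what a structural check could look like, assuming the expert count can be read from the Hugging Face config. The attribute names listed, the helper names, and the use of plain strings for dtypes (to avoid a torch dependency here) are all assumptions for illustration; they are not the engine's actual `is_moe` or `get_num_experts()` implementation:

```python
from types import SimpleNamespace

# Attribute names that some Hugging Face MoE configs use for the expert count.
# Illustrative only, not an exhaustive list.
_EXPERT_COUNT_ATTRS = ("num_local_experts", "num_experts", "n_routed_experts")


def get_num_experts(hf_config) -> int:
    """Return the first positive expert count found on the config, else 0."""
    for name in _EXPERT_COUNT_ATTRS:
        value = getattr(hf_config, name, None)
        if isinstance(value, int) and value > 0:
            return value
    return 0


def is_moe(hf_config) -> bool:
    """Treat a model as MoE when its config declares more than one expert."""
    return get_num_experts(hf_config) > 1


def should_default_to_fp16(quantization, requested_dtype, current_dtype, hf_config) -> bool:
    """Same gate as the reviewed diff, with name matching swapped for is_moe()."""
    return (
        quantization == "exl3"
        and isinstance(requested_dtype, str)
        and requested_dtype.lower() == "auto"
        and current_dtype != "float16"
        and is_moe(hf_config)
    )


# Stand-in configs: a Mixtral-style MoE config and a dense config.
mixtral_cfg = SimpleNamespace(model_type="mixtral", num_local_experts=8)
dense_cfg = SimpleNamespace(model_type="llama")

print("moe" in mixtral_cfg.model_type.lower())  # substring check misses Mixtral
print(should_default_to_fp16("exl3", "auto", "bfloat16", mixtral_cfg))
print(should_default_to_fp16("exl3", "auto", "bfloat16", dense_cfg))
```

Reading the expert count from the config makes the gate independent of how an architecture happens to be named, which is the inconsistency the review points out.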


perf: exl3 decode kernel optimization experiments

6fe5c1c

Signed-off-by: AlpinDale <alpindale@gmail.com>

AlpinDale force-pushed the perf/exl3-decode-kernel-experiments branch from d872c50 to 6fe5c1c on April 27, 2026, 23:22

AlpinDale changed the title ~~Perf/exl3 decode kernel experiments~~ perf: exl3 decode kernel optimization experiments Apr 27, 2026

chatgpt-codex-connector Bot reviewed Apr 27, 2026

fix: remove unsafe EXL3 shape overrides

bcc64e5

Signed-off-by: AlpinDale <alpindale@gmail.com>

AlpinDale merged commit 6c59bc7 into main Apr 28, 2026
1 check failed

AlpinDale deleted the perf/exl3-decode-kernel-experiments branch April 28, 2026 01:15

perf: exl3 decode kernel optimization experiments #1655
AlpinDale merged 2 commits into main from perf/exl3-decode-kernel-experiments
