
Adreno optimization for MoE - MxFP4 #22301

Open

shawngu-quic wants to merge 8 commits into ggml-org:master from qualcomm:sg/moe-clc-upstream-mxfp4

Conversation

@shawngu-quic
Contributor

Overview

This PR contains a redesigned pipeline for MoE, focused on optimizations for the MxFP4 data type. It adds GPU code to reorder the router table, pre-transpose expert weights, and distinguish between prefill and decode MoE kernels. The optimizations primarily target Adreno GPUs; on other vendors' hardware, execution falls back to the original generic implementations.

Requirements

@shawngu-quic shawngu-quic requested review from a team and CISC as code owners April 23, 2026 22:47
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 23, 2026

Hi @shawngu-quic, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions Bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (Issues specific to the OpenCL backend) on Apr 23, 2026
Comment thread on src/llama-model.cpp (outdated)

Labels

ggml — changes relating to the ggml tensor library for machine learning
OpenCL — Issues specific to the OpenCL backend


3 participants