[safetensors] Compute MoE active parameter count #2182

Draft
mishig25 wants to merge 1 commit into main from feat/safetensors-moe

@mishig25 mishig25 commented May 20, 2026

For MoE models, compute active params from tensor headers + config.json:

routed_params = sum(numel(t) for t in tensors if isRoutedExpert(t))
always_active = total_params - routed_params
active_params = always_active + top_k * (routed_params / num_experts)
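
The formula above can be sketched in TypeScript as follows. This is a minimal illustration, not the PR's implementation: `TensorShapes`, `computeActiveParams`, and in particular the `ROUTED_EXPERT_PATTERN` regex are hypothetical names and choices. The regex only covers names that contain an `.experts.<index>.` segment (as in Mixtral-style checkpoints); the real detection logic may differ.

```typescript
// Tensor header information: tensor name -> shape (hypothetical helper type).
type TensorShapes = Record<string, number[]>;

// ASSUMPTION: illustrative pattern only. It matches names such as
// "model.layers.0.block_sparse_moe.experts.3.w1.weight".
// The PR's actual routed-expert detection may use different rules.
const ROUTED_EXPERT_PATTERN = /\.experts\.\d+\./;

function isRoutedExpert(name: string): boolean {
	return ROUTED_EXPERT_PATTERN.test(name);
}

// Number of elements in a tensor: the product of its dimensions.
function numel(shape: number[]): number {
	return shape.reduce((acc, dim) => acc * dim, 1);
}

interface ActiveParamsSketch {
	perExpert: number;
	alwaysActive: number;
	active: number;
}

function computeActiveParams(tensors: TensorShapes, numExperts: number, topK: number): ActiveParamsSketch {
	let total = 0;
	let routed = 0;
	for (const [name, shape] of Object.entries(tensors)) {
		const n = numel(shape);
		total += n;
		if (isRoutedExpert(name)) {
			routed += n;
		}
	}
	const perExpert = routed / numExperts; // average size of one routed expert
	const alwaysActive = total - routed; // everything that runs on every token
	return { perExpert, alwaysActive, active: alwaysActive + topK * perExpert };
}

// Toy model: an embedding, a router, and 4 routed experts of 100 parameters each.
const toyTensors: TensorShapes = {
	"model.embed_tokens.weight": [50, 10],
	"model.layers.0.block_sparse_moe.gate.weight": [4, 10],
	"model.layers.0.block_sparse_moe.experts.0.w1.weight": [10, 10],
	"model.layers.0.block_sparse_moe.experts.1.w1.weight": [10, 10],
	"model.layers.0.block_sparse_moe.experts.2.w1.weight": [10, 10],
	"model.layers.0.block_sparse_moe.experts.3.w1.weight": [10, 10],
};

const toyResult = computeActiveParams(toyTensors, 4, 2);
console.log(toyResult);
```

With 940 parameters in total, of which 400 are routed, the toy model has 540 always-active parameters and 540 + 2 × 100 = 740 active parameters per token.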

`parseSafetensorsMetadata` exposes this breakdown on its return value as an optional field, `moe?: MoeInfo`:

export interface MoeInfo {
	numExperts: number;
	topK: number;
	/** Average parameter count per routed expert (= sum-of-routed / numExperts). */
	perExpert: number;
	/** Everything that runs on every token: embeddings, attention, norms, lm_head, router, shared experts, … */
	alwaysActive: number;
	/** alwaysActive + topK * perExpert */
	active: number;
	/** True when the model has a dense shared-expert MLP alongside routed experts (Deepseek, Qwen-MoE, Command-A, …). */
	hasSharedExpert: boolean;
}

parseSafetensorsMetadata already downloads config.json to read the quantization config (added in #1673), so reading num_experts_per_tok / num_local_experts from the same payload adds no extra network request.
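
Because different model families spell these config keys differently, the lookup needs a few fallbacks. The sketch below is a hedged illustration using the field names listed in this PR's `MoeConfigFields`. The function name `resolveMoeFields`, the `*Sketch` types, the precedence order between alternative keys, and the fallback from the top level to `text_config` are all assumptions, not the PR's code.

```typescript
// Subset of config.json keys relevant to MoE detection (names taken from the PR).
interface MoeConfigFieldsSketch {
	num_experts_per_tok?: number;
	num_experts_per_token?: number;
	num_local_experts?: number;
	num_experts?: number;
	n_routed_experts?: number;
}

interface ModelConfigSketch extends MoeConfigFieldsSketch {
	// Multi-modal configs may nest the language-model settings here.
	text_config?: MoeConfigFieldsSketch;
}

interface ResolvedMoeFields {
	numExperts: number;
	topK: number;
}

// ASSUMPTION: the precedence order below is illustrative only.
function resolveMoeFields(config: ModelConfigSketch): ResolvedMoeFields | undefined {
	for (const candidate of [config, config.text_config]) {
		if (!candidate) {
			continue;
		}
		const topK = candidate.num_experts_per_tok ?? candidate.num_experts_per_token;
		const numExperts = candidate.num_local_experts ?? candidate.num_experts ?? candidate.n_routed_experts;
		if (topK !== undefined && numExperts !== undefined && numExperts > 0) {
			return { numExperts, topK };
		}
	}
	// Dense models (e.g. BERT) expose none of these keys.
	return undefined;
}

const mixtralLike = resolveMoeFields({ num_experts_per_tok: 2, num_local_experts: 8 });
const deepseekLike = resolveMoeFields({ num_experts_per_tok: 6, n_routed_experts: 64 });
const nested = resolveMoeFields({ text_config: { num_experts_per_tok: 8, num_experts: 128 } });
const dense = resolveMoeFields({});
console.log(mixtralLike, deepseekLike, nested, dense);
```

Returning `undefined` for dense models lets a caller omit the `moe` field entirely rather than report a misleading breakdown.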

Detection of routed-expert tensors relies on transformers naming conventions:

Test

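Added 4 new tests pinned to specific commits (Mixtral / Qwen3-30B-A3B / DeepSeek-V2-Lite / BERT). Validated against the following models: the computed active count matches each advertised "active" number to within bucket rounding, except Qwen3-Next-80B-A3B-Instruct, which comes out slightly high (3.95B against an advertised A3B):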

| Repo | Experts (E) | top_k | Total | Routed | Per-expert | Always-active | Active | Advertised in modelcard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mixtral-8x7B-v0.1 | 8 | 2 | 46.70B | 45.10B | 5.64B | 1.61B | 12.88B | ~12.9B ✅ |
| Qwen1.5-MoE-A2.7B | 60 | 4 | 14.32B | 12.46B | 207.6M | 1.86B | 2.69B | A2.7B ✅ |
| DeepSeek-V2-Lite | 64 | 6 | 15.71B | 14.39B | 224.9M | 1.31B | 2.66B | ~2.7B ✅ |
| OLMoE-1B-7B-0924 | 64 | 8 | 6.92B | 6.44B | 100.7M | 476.7M | 1.28B | 1B-7B ✅ |
| Phi-3.5-MoE-instruct | 16 | 2 | 41.87B | 40.27B | 2.52B | 1.61B | 6.64B | 6.6B ✅ |
| Qwen3-30B-A3B | 128 | 8 | 30.53B | 28.99B | 226.5M | 1.54B | 3.35B | A3B ✅ |
| Qwen3-235B-A22B | 128 | 8 | 235.09B | 227.10B | 1.77B | 8.00B | 22.19B | A22B ✅ |
| Qwen3-Next-80B-A3B-Instruct | 512 | 10 | 81.32B | 78.92B | 154.1M | 2.40B | 3.95B | A3B (≈, a bit over) |
| Qwen3-Coder-30B-A3B-Instruct | 128 | 8 | 30.53B | 28.99B | 226.5M | 1.54B | 3.35B | A3B ✅ |
| command-a-plus-05-2026-bf16 | 128 | 8 | 218.75B | 206.16B | 1.61B | 12.59B | 25.48B | 25B active / 218B total ✅ |
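
As a quick consistency check on the table, the Active column can be recomputed from the other columns with `active = always_active + top_k * (routed / E)`. The snippet below does this for three rows, using values copied from the table (in billions). The `TableRow` type, the helper names, and the 0.02B tolerance (chosen to absorb two-decimal rounding) are assumptions made for illustration.

```typescript
interface TableRow {
	repo: string;
	experts: number;
	topK: number;
	totalB: number; // all values below are in billions of parameters
	routedB: number;
	alwaysActiveB: number;
	activeB: number;
}

// Values copied from the validation table.
const tableRows: TableRow[] = [
	{ repo: "Mixtral-8x7B-v0.1", experts: 8, topK: 2, totalB: 46.7, routedB: 45.1, alwaysActiveB: 1.61, activeB: 12.88 },
	{ repo: "DeepSeek-V2-Lite", experts: 64, topK: 6, totalB: 15.71, routedB: 14.39, alwaysActiveB: 1.31, activeB: 2.66 },
	{ repo: "Qwen3-235B-A22B", experts: 128, topK: 8, totalB: 235.09, routedB: 227.1, alwaysActiveB: 8.0, activeB: 22.19 },
];

// ASSUMPTION: tolerance picked to cover rounding of the published figures.
const TOLERANCE_B = 0.02;

function recomputeActiveB(row: TableRow): number {
	return row.alwaysActiveB + row.topK * (row.routedB / row.experts);
}

function isConsistent(row: TableRow): boolean {
	const activeMatches = Math.abs(recomputeActiveB(row) - row.activeB) <= TOLERANCE_B;
	// Total should equal routed + always-active, up to rounding.
	const totalMatches = Math.abs(row.routedB + row.alwaysActiveB - row.totalB) <= TOLERANCE_B;
	return activeMatches && totalMatches;
}

const inconsistentRows = tableRows.filter((row) => !isConsistent(row));
for (const row of tableRows) {
	console.log(row.repo, "recomputed active (B):", recomputeActiveB(row).toFixed(3));
}
console.log("inconsistent rows:", inconsistentRows.length);
```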

🤖 Generated with Claude Code

@mishig25 mishig25 force-pushed the feat/safetensors-moe branch from bfed6cc to 2dcc55e Compare May 20, 2026 19:44
Extend parseSafetensorsMetadata to return a `moe` breakdown for
Mixture-of-Experts models, computed from tensor headers + config.json
(already fetched by the parser for quantization config).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mishig25 mishig25 force-pushed the feat/safetensors-moe branch from 2dcc55e to f876b86 Compare May 20, 2026 19:51
Comment on lines +387 to 426
interface MoeConfigFields {
	/** Common across Mixtral, Qwen2/3-MoE, Llama4, GPT-OSS, … */
	num_experts_per_tok?: number;
	/** Alternative spelling (some checkpoints) */
	num_experts_per_token?: number;
	num_local_experts?: number;
	num_experts?: number;
	/** DeepSeek family */
	n_routed_experts?: number;
	n_shared_experts?: number;
	/** Multi-modal Ernie 4.5 */
	moe_num_shared_experts?: number;
}

export interface ModelConfig extends MoeConfigFields {
	quantization_config?: QuantizationConfig;
-	text_config?: { quantization_config?: QuantizationConfig };
+	text_config?: { quantization_config?: QuantizationConfig } & MoeConfigFields;
}

/**
 * Active-parameter breakdown for Mixture-of-Experts models.
 *
 * For MoE models, only `topK` of `numExperts` routed experts run per token, so the
 * usable ("active") parameter count is much smaller than the total stored on disk.
 * `active = alwaysActive + topK * perExpert`. Returned by `parseSafetensorsMetadata`
 * when the model's `config.json` exposes MoE fields and tensor names indicate a
 * supported expert layout.
 */
export interface MoeInfo {
	numExperts: number;
	topK: number;
	/** Average parameter count per routed expert (= sum-of-routed / numExperts). */
	perExpert: number;
	/** Everything that runs on every token: embeddings, attention, norms, lm_head, router, shared experts, … */
	alwaysActive: number;
	/** alwaysActive + topK * perExpert */
	active: number;
	/** True when the model has a dense shared-expert MLP alongside routed experts (Deepseek, Qwen-MoE, Command-A, …). */
	hasSharedExpert: boolean;
}

define those earlier in the file (with other types) please

@julien-c

I am not a huge fan of this strong coupling between config.json and safetensors parsing, but if it's an important feature, I don't object to the principle.
