[safetensors] Compute MoE active parameter count by mishig25 · Pull Request #2182 · huggingface/huggingface.js

mishig25 · 2026-05-20T19:38:41Z

For MoE models, compute active params from tensor headers + config.json:

routed_params = sum(numel(t) for t in tensors if isRoutedExpert(t))
always_active = total_params - routed_params
active_params = always_active + top_k * (routed_params / num_experts)
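
As a rough sketch, the three lines above translate to the TypeScript below. `computeActiveParams` and `ActiveParamBreakdown` are hypothetical names used for illustration and are not the PR's actual code. The sample call uses the rounded Mixtral-8x7B-v0.1 totals from the validation table further down.

```typescript
// Hypothetical helper illustrating the formula; not the PR's implementation.
interface ActiveParamBreakdown {
	perExpert: number;
	alwaysActive: number;
	active: number;
}

function computeActiveParams(
	totalParams: number,
	routedParams: number,
	numExperts: number,
	topK: number
): ActiveParamBreakdown {
	if (numExperts <= 0 || topK <= 0 || topK > numExperts) {
		throw new RangeError("expected 0 < topK <= numExperts");
	}
	if (routedParams < 0 || routedParams > totalParams) {
		throw new RangeError("expected 0 <= routedParams <= totalParams");
	}
	// Average size of one routed expert across the whole model.
	const perExpert = routedParams / numExperts;
	// Everything that is not a routed expert runs on every token.
	const alwaysActive = totalParams - routedParams;
	return { perExpert, alwaysActive, active: alwaysActive + topK * perExpert };
}

// Rounded Mixtral-8x7B-v0.1 inputs: 46.70B total, 45.10B routed, E = 8, top_k = 2.
const mixtral = computeActiveParams(46.7e9, 45.1e9, 8, 2);
console.log(`${(mixtral.active / 1e9).toFixed(2)}B active`);
```

Because `perExpert` is an average, the result is only exact when all routed experts have the same size, which is presumably why `MoeInfo` documents it as an average.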

parseSafetensorsMetadata returns this breakdown on its output as an optional field, moe?: MoeInfo:

export interface MoeInfo {
	numExperts: number;
	topK: number;
	/** Average parameter count per routed expert (= sum-of-routed / numExperts). */
	perExpert: number;
	/** Everything that runs on every token: embeddings, attention, norms, lm_head, router, shared experts, … */
	alwaysActive: number;
	/** alwaysActive + topK * perExpert */
	active: number;
	/** True when the model has a dense shared-expert MLP alongside routed experts (Deepseek, Qwen-MoE, Command-A, …). */
	hasSharedExpert: boolean;
}

parseSafetensorsMetadata already downloads config.json to read the quantization config (added in #1673), so reading num_experts_per_tok / num_local_experts from the same payload is free.
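
A minimal sketch of how the expert count and top-k could be resolved from an already-parsed config.json. The field names are taken from the MoeConfigFields interface in this PR's diff. The `readMoeConfig` helper, the lookup order, and the fallback to `text_config` are assumptions for illustration, not the PR's actual logic.

```typescript
// Field names mirror MoeConfigFields from the diff; everything else is illustrative.
interface MoeConfigFields {
	num_experts_per_tok?: number;
	num_experts_per_token?: number;
	num_local_experts?: number;
	num_experts?: number;
	n_routed_experts?: number;
}

interface ModelConfig extends MoeConfigFields {
	text_config?: MoeConfigFields;
}

// Hypothetical helper: returns undefined when the config does not look like an MoE.
function readMoeConfig(config: ModelConfig): { numExperts: number; topK: number } | undefined {
	// Assumption: try the top level first, then the nested text_config.
	for (const source of [config, config.text_config]) {
		if (!source) continue;
		const topK = source.num_experts_per_tok ?? source.num_experts_per_token;
		const numExperts = source.num_local_experts ?? source.num_experts ?? source.n_routed_experts;
		if (topK !== undefined && numExperts !== undefined && numExperts > 0) {
			return { numExperts, topK };
		}
	}
	return undefined;
}

// Mixtral-style field names at the top level of the config.
console.log(readMoeConfig({ num_local_experts: 8, num_experts_per_tok: 2 }));
```

Returning `undefined` for configs without these fields keeps dense models unaffected, which matches `moe` being optional on the parser output.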

Detection of routed-expert tensors relies on transformers naming conventions:

- Modern MoE blocks store experts in stacked 3D tensors named experts.gate_up_proj / experts.down_proj, with leading dim = num_experts.
- Legacy checkpoints use per-expert keys like experts.{j}.w1.weight. transformers' save_pretrained defaults to save_original_format=True, which undoes the in-memory stacking via per-model rules in conversion_mapping.py, so most Hub checkpoints are serialized per-expert.
- Shared experts are dense MLPs named shared_experts.*; they are always active and are excluded by substring match.
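
The naming rules above could be sketched roughly as follows. The regular expressions, the `isRoutedExpert` body, and `sumRoutedParams` are illustrative assumptions, and the PR's real matcher may differ. The exclusion uses the substring `shared_expert` so that it also covers a singular spelling, which is likewise an assumption.

```typescript
// Illustrative patterns only; not the PR's actual matcher.
// Stacked layout: "...experts.gate_up_proj" / "...experts.down_proj".
const STACKED_EXPERTS = /(^|\.)experts\.(gate_up_proj|down_proj)(\.|$)/;
// Per-expert layout: "...experts.{j}.w1.weight" and similar.
const PER_EXPERT = /(^|\.)experts\.\d+\./;

function isRoutedExpert(tensorName: string): boolean {
	// Shared experts run on every token, so they are never counted as routed.
	if (tensorName.includes("shared_expert")) return false;
	return STACKED_EXPERTS.test(tensorName) || PER_EXPERT.test(tensorName);
}

// Sums element counts of routed-expert tensors from header-like entries (name -> shape).
function sumRoutedParams(tensors: Record<string, { shape: number[] }>): number {
	let routed = 0;
	for (const [name, { shape }] of Object.entries(tensors)) {
		if (isRoutedExpert(name)) {
			routed += shape.reduce((acc, dim) => acc * dim, 1);
		}
	}
	return routed;
}

// Made-up tensor names and shapes, chosen only to exercise each rule.
const sampleHeader = {
	"model.layers.0.mlp.experts.gate_up_proj": { shape: [4, 8, 16] },
	"model.layers.0.mlp.experts.3.w1.weight": { shape: [8, 8] },
	"model.layers.0.mlp.shared_experts.gate_proj.weight": { shape: [8, 8] },
	"model.layers.0.self_attn.q_proj.weight": { shape: [8, 8] },
};
console.log(sumRoutedParams(sampleHeader));
```

Only tensor names and shapes are needed, so the sum can be computed from the safetensors headers without reading any weight data.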

Test

Added 4 new tests pinned to specific commits (Mixtral / Qwen3-30B-A3B / DeepSeek-V2-Lite / BERT). Validated against the following models — every advertised "active" number lands within bucket-rounding:

Repo	E	top_k	Total	Routed	Per-expert	Always-active	Active	Advertised in modelcard
Mixtral-8x7B-v0.1	8	2	46.70B	45.10B	5.64B	1.61B	12.88B	~12.9B ✅
Qwen1.5-MoE-A2.7B	60	4	14.32B	12.46B	207.6M	1.86B	2.69B	A2.7B ✅
DeepSeek-V2-Lite	64	6	15.71B	14.39B	224.9M	1.31B	2.66B	~2.7B ✅
OLMoE-1B-7B-0924	64	8	6.92B	6.44B	100.7M	476.7M	1.28B	1B-7B ✅
Phi-3.5-MoE-instruct	16	2	41.87B	40.27B	2.52B	1.61B	6.64B	6.6B ✅
Qwen3-30B-A3B	128	8	30.53B	28.99B	226.5M	1.54B	3.35B	A3B ✅
Qwen3-235B-A22B	128	8	235.09B	227.10B	1.77B	8.00B	22.19B	A22B ✅
Qwen3-Next-80B-A3B-Instruct	512	10	81.32B	78.92B	154.1M	2.40B	3.95B	A3B (≈ — a bit over)
Qwen3-Coder-30B-A3B-Instruct	128	8	30.53B	28.99B	226.5M	1.54B	3.35B	A3B ✅
command-a-plus-05-2026-bf16	128	8	218.75B	206.16B	1.61B	12.59B	25.48B	25B active / 218B total ✅

🤖 Generated with Claude Code

Extend parseSafetensorsMetadata to return a `moe` breakdown for Mixture-of-Experts models, computed from tensor headers + config.json (already fetched by the parser for quantization config).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

julien-c · 2026-05-22T13:15:34Z

+interface MoeConfigFields {
+	/** Common across Mixtral, Qwen2/3-MoE, Llama4, GPT-OSS, … */
+	num_experts_per_tok?: number;
+	/** Alternative spelling (some checkpoints) */
+	num_experts_per_token?: number;
+	num_local_experts?: number;
+	num_experts?: number;
+	/** DeepSeek family */
+	n_routed_experts?: number;
+	n_shared_experts?: number;
+	/** Multi-modal Ernie 4.5 */
+	moe_num_shared_experts?: number;
+}
+
+export interface ModelConfig extends MoeConfigFields {
 	quantization_config?: QuantizationConfig;
-	text_config?: { quantization_config?: QuantizationConfig };
+	text_config?: { quantization_config?: QuantizationConfig } & MoeConfigFields;
+}
+
+/**
+ * Active-parameter breakdown for Mixture-of-Experts models.
+ *
+ * For MoE models, only `topK` of `numExperts` routed experts run per token, so the
+ * usable ("active") parameter count is much smaller than the total stored on disk.
+ * `active = alwaysActive + topK * perExpert`. Returned by `parseSafetensorsMetadata`
+ * when the model's `config.json` exposes MoE fields and tensor names indicate a
+ * supported expert layout.
+ */
+export interface MoeInfo {
+	numExperts: number;
+	topK: number;
+	/** Average parameter count per routed expert (= sum-of-routed / numExperts). */
+	perExpert: number;
+	/** Everything that runs on every token: embeddings, attention, norms, lm_head, router, shared experts, … */
+	alwaysActive: number;
+	/** alwaysActive + topK * perExpert */
+	active: number;
+	/** True when the model has a dense shared-expert MLP alongside routed experts (Deepseek, Qwen-MoE, Command-A, …). */
+	hasSharedExpert: boolean;
 }


define those earlier in the file (with other types) please

julien-c · 2026-05-22T13:31:40Z

I am not a huge fan of this strong coupling between config.json and safetensors parsing, but if it's an important feature, I don't object to the principle

mishig25 force-pushed the feat/safetensors-moe branch from bfed6cc to 2dcc55e on May 20, 2026 19:44

mishig25 force-pushed the feat/safetensors-moe branch from 2dcc55e to f876b86 on May 20, 2026 19:51

julien-c reviewed May 22, 2026

mishig25 wants to merge 1 commit into main from feat/safetensors-moe
