[safetensors] Compute MoE active parameter count#2182
Draft
mishig25 wants to merge 1 commit into
Extend parseSafetensorsMetadata to return a `moe` breakdown for Mixture-of-Experts models, computed from tensor headers + config.json (already fetched by the parser for quantization config). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
julien-c reviewed May 22, 2026
Comment on lines +387 to 426
interface MoeConfigFields {
	/** Common across Mixtral, Qwen2/3-MoE, Llama4, GPT-OSS, … */
	num_experts_per_tok?: number;
	/** Alternative spelling (some checkpoints) */
	num_experts_per_token?: number;
	num_local_experts?: number;
	num_experts?: number;
	/** DeepSeek family */
	n_routed_experts?: number;
	n_shared_experts?: number;
	/** Multi-modal Ernie 4.5 */
	moe_num_shared_experts?: number;
}

export interface ModelConfig extends MoeConfigFields {
	quantization_config?: QuantizationConfig;
-	text_config?: { quantization_config?: QuantizationConfig };
+	text_config?: { quantization_config?: QuantizationConfig } & MoeConfigFields;
}

/**
 * Active-parameter breakdown for Mixture-of-Experts models.
 *
 * For MoE models, only `topK` of `numExperts` routed experts run per token, so the
 * usable ("active") parameter count is much smaller than the total stored on disk.
 * `active = alwaysActive + topK * perExpert`. Returned by `parseSafetensorsMetadata`
 * when the model's `config.json` exposes MoE fields and tensor names indicate a
 * supported expert layout.
 */
export interface MoeInfo {
	numExperts: number;
	topK: number;
	/** Average parameter count per routed expert (= sum-of-routed / numExperts). */
	perExpert: number;
	/** Everything that runs on every token: embeddings, attention, norms, lm_head, router, shared experts, … */
	alwaysActive: number;
	/** alwaysActive + topK * perExpert */
	active: number;
	/** True when the model has a dense shared-expert MLP alongside routed experts (Deepseek, Qwen-MoE, Command-A, …). */
	hasSharedExpert: boolean;
}
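For readers following the thread, here is a minimal sketch of how the fields quoted above could combine into the `active = alwaysActive + topK * perExpert` figure. It is not the PR's implementation: `resolveMoeCounts`, `summarizeMoe`, the `MoeSummary` shape, the fallback order between config keys, and the assumption that `alwaysActive` equals total parameters minus routed-expert parameters are all hypothetical choices made for illustration.

```typescript
// Hypothetical, trimmed-down copy of the config fields quoted in the diff.
interface MoeConfigSketch {
	num_experts_per_tok?: number;
	num_experts_per_token?: number;
	num_local_experts?: number;
	num_experts?: number;
	n_routed_experts?: number;
}

// Hypothetical result shape, modelled on MoeInfo without hasSharedExpert.
interface MoeSummary {
	numExperts: number;
	topK: number;
	perExpert: number;
	alwaysActive: number;
	active: number;
}

// Assumed precedence between alternative spellings; the real lookup order may differ.
function resolveMoeCounts(cfg: MoeConfigSketch): { numExperts: number; topK: number } | undefined {
	const topK = cfg.num_experts_per_tok ?? cfg.num_experts_per_token;
	const numExperts = cfg.num_local_experts ?? cfg.num_experts ?? cfg.n_routed_experts;
	// Missing or zero values mean "not a MoE config" in this sketch.
	if (!topK || !numExperts) return undefined;
	return { numExperts, topK };
}

// totalParams: all parameters in the checkpoint.
// routedParams: parameters belonging to routed experts only.
function summarizeMoe(cfg: MoeConfigSketch, totalParams: number, routedParams: number): MoeSummary | undefined {
	const counts = resolveMoeCounts(cfg);
	if (!counts) return undefined;
	const perExpert = routedParams / counts.numExperts; // average per routed expert
	const alwaysActive = totalParams - routedParams; // assumption: everything that is not a routed expert
	const active = alwaysActive + counts.topK * perExpert;
	return { ...counts, perExpert, alwaysActive, active };
}

// Made-up numbers, chosen only to make the arithmetic easy to follow.
const exampleSummary = summarizeMoe({ num_experts_per_tok: 2, num_local_experts: 8 }, 1000, 800);
console.log(exampleSummary?.active); // 200 + 2 * (800 / 8) = 400
```

Returning `undefined` when either count is missing keeps dense models (no MoE keys in `config.json`) out of the breakdown, which matches the optional `moe?: MoeInfo` output described in the PR.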
Member
define those earlier in the file (with other types) please
Member
I am not a huge fan of this strong coupling between config.json and safetensors parsing, but if it's an important feature, I don't object to the principle.
For MoE models, compute active params from tensor headers + `config.json`.

`parseSafetensorsMetadata` returns this on its output as `moe?: MoeInfo`.

`parseSafetensorsMetadata` already downloads `config.json` to read the quantization config (added in #1673), so reading `num_experts_per_tok` / `num_local_experts` from the same payload is free.

Detection of routed-expert tensors relies on transformers naming conventions:

- Stacked: `experts.gate_up_proj` / `experts.down_proj`, leading dim = `num_experts`.
- Per-expert: `experts.{j}.w1.weight`; transformers' `save_pretrained` defaults to `save_original_format=True`, which inverts the in-memory stacking via per-model rules in `conversion_mapping.py`, so most Hub checkpoints serialize per-expert.
- Shared experts: `shared_experts.*`; they're always-active and excluded by substring match.

Test

Added 4 new tests pinned to specific commits (Mixtral / Qwen3-30B-A3B / DeepSeek-V2-Lite / BERT). Validated against a set of models: every advertised "active" number lands within bucket-rounding.
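As a rough illustration of the naming-convention detection described above, the sketch below classifies tensor names and sums routed-expert parameters from header shapes. It is an assumption-laden sketch rather than the code in this PR: `classifyTensorName`, `sumRoutedParams`, the regular expressions, and the sample tensor names are hypothetical; only the `shape` field of a safetensors header entry is relied on.

```typescript
type ExpertTensorKind = "shared" | "stacked" | "per-expert" | "other";

// Minimal view of a safetensors header entry; only `shape` is needed here.
interface TensorHeaderSketch {
	shape: number[];
}

function classifyTensorName(name: string): ExpertTensorKind {
	// Shared experts are always active, so rule them out first by substring match.
	if (name.includes("shared_experts")) return "shared";
	// Per-expert layout, e.g. "...experts.3.w1.weight".
	if (/\.experts\.\d+\./.test(name)) return "per-expert";
	// Stacked layout, e.g. "...experts.gate_up_proj" with leading dim = num_experts.
	if (/\.experts\.(gate_up_proj|down_proj)/.test(name)) return "stacked";
	return "other";
}

// Parameter count of one tensor = product of its dimensions.
function countParams(shape: number[]): number {
	return shape.reduce((acc, dim) => acc * dim, 1);
}

// Sum parameters of routed-expert tensors only (stacked or per-expert).
function sumRoutedParams(headers: Record<string, TensorHeaderSketch>): number {
	let routed = 0;
	for (const [name, info] of Object.entries(headers)) {
		const kind = classifyTensorName(name);
		if (kind === "stacked" || kind === "per-expert") {
			routed += countParams(info.shape);
		}
	}
	return routed;
}

// Made-up tensor names and shapes for illustration only.
const demoHeaders: Record<string, TensorHeaderSketch> = {
	"model.layers.0.mlp.experts.3.w1.weight": { shape: [4, 8] }, // per-expert: 32
	"model.layers.1.mlp.experts.gate_up_proj": { shape: [2, 4, 8] }, // stacked: 64
	"model.layers.0.mlp.shared_experts.gate_proj.weight": { shape: [4, 8] }, // excluded
	"model.embed_tokens.weight": { shape: [10, 4] }, // excluded
};
console.log(sumRoutedParams(demoHeaders)); // 32 + 64 = 96
```

Checking the shared-expert substring before the routed patterns keeps always-active tensors out of the routed total regardless of which routed layout a checkpoint uses.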
🤖 Generated with Claude Code